System Maintenance
This guide describes the operational checks required to run Excalibur on a Kubernetes cluster deployed with the official Helm chart. The focus is practical day-2 operations — detecting problems early, maintaining stability, and keeping Excalibur services available and observable.
The intended reader is the Excalibur operator — the DevOps, SRE, platform, or system administrator responsible for the Excalibur deployment. This may be a partner managing Excalibur for an end customer, an in-house operations team running Excalibur for their organization, or a managed-service provider. Throughout this guide, operator refers to whichever role owns day-2 operations in your environment.
This guide assumes you operate Excalibur workloads on Kubernetes, but do not necessarily manage the Kubernetes platform itself.
Prerequisites
- Excalibur is deployed on Kubernetes using the official Helm chart.
- You have `kubectl` access to the Excalibur namespace.
- You can sign in to Grafana — see Access Grafana and Prometheus for the supported access methods.
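A quick way to confirm the `kubectl` prerequisite is to query the Excalibur namespace directly. The namespace name `excalibur` below is an assumption; substitute the namespace your Helm release uses.

```bash
# Assumes the release is installed in the "excalibur" namespace; adjust as needed.
kubectl auth can-i list pods -n excalibur   # should print "yes"
kubectl get pods -n excalibur               # all pods should be Running or Completed
```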
High Availability and What "Down" Means
In a production deployment, most Excalibur services run with multiple replicas behind a Kubernetes Service. Kubernetes load-balances requests across healthy pods, restarts failed pods automatically, and reschedules them to other nodes when needed. A single pod restart, eviction, or OOMKilled event is a routine, self-healing condition — not an outage.
A user-visible incident typically requires one of the following:
- All replicas of a stateless service are unhealthy at the same time.
- A stateful component loses quorum or its only replica (for example, the database StatefulSet, the cache, or persistent storage).
- Ingress or TLS is broken — nothing reaches the cluster.
- A shared dependency fails — DNS, identity store, or external network.
When this guide says a service "becomes unavailable," it refers to the service as a whole (every replica unhealthy, or a stateful component down), not a single pod restart. Treat individual pod restarts as signals to investigate trends — frequent restarts indicate an underlying problem long before they cause downtime.
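A minimal sketch for spotting restart trends from the command line, assuming the release runs in an `excalibur` namespace:

```bash
# Sort Excalibur pods by the restart count of their first container.
kubectl get pods -n excalibur \
  --sort-by='.status.containerStatuses[0].restartCount'

# Recent events often explain why a pod restarted (OOMKilled, failed probes, evictions).
kubectl get events -n excalibur --sort-by='.lastTimestamp' | tail -n 20
```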
Replica counts are configurable
Replica counts for each service are set in the Helm values. The chart defaults are designed for evaluation; production deployments should size replicas based on expected load, redundancy targets, and node failure tolerance. See the Installation and implementation guide for sizing guidance.
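To see which replica counts are in effect, compare the values applied to the Helm release with the live Deployments. The release and namespace names below are assumptions, and the exact value keys depend on the chart, so treat this as a sketch:

```bash
# Show the values overriding the chart defaults for the release (names are assumptions).
helm get values excalibur -n excalibur

# Compare against the replica counts Kubernetes is actually running.
kubectl get deployments -n excalibur -o wide
```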
Where to Start

- Excalibur Workloads: Monitor core services, tenant workloads, storage, backups, observability, ingress, and scheduled jobs.
- Cluster Health: Cluster-level signals that affect Excalibur — node health, resource usage, scheduling, log collection, and DNS.
- Access Grafana: Grafana is not exposed externally by default. Reach it through `kubectl port-forward`, maintenance mode, or explicit Helm exposure.
- Grafana Dashboards: Reference for the three pre-built Excalibur dashboards and how to investigate logs in Explore.
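For a quick look at Grafana without exposing it externally, `kubectl port-forward` is usually sufficient. The Service name and port below are assumptions; check the actual name with `kubectl get svc` and see Access Grafana for the supported methods.

```bash
# List Services to find the Grafana Service name created by your release.
kubectl get svc -n excalibur

# Forward local port 3000 to the Grafana Service (name and port are assumptions).
kubectl port-forward -n excalibur svc/excalibur-grafana 3000:80
# Then open http://localhost:3000 in a browser.
```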
What to Monitor
The table below summarizes the operational areas covered by this guide and why each one matters.
| Area | What to monitor | Why it matters |
|---|---|---|
| Core services | Pod health, restarts, readiness, replica count | Sustained replica loss disrupts authentication, sessions, or API access |
| Tenant workloads | Per-tenant pods and session services | Failures affect specific customers or sessions |
| Storage | Persistent volume usage | Exhaustion causes service failures or data loss |
| Backups | Backup jobs and repository capacity | Backups are required for recovery after data loss |
| Observability | Prometheus, Grafana, Loki, fluent-bit | Monitoring must work to detect problems |
| Ingress and TLS | External connectivity and certificates | The platform must remain accessible to users |
| Cluster health | Node status, resource pressure | Cluster instability often surfaces as application failures |
| Pod scheduling | Pending pods, scheduling errors | Failed scheduling prevents new or restarted pods from starting |
| Log collection | Fluent-bit health, Loki streams | Log gaps create blind spots during incidents |
| DNS and connectivity | In-cluster name resolution | DNS failures often masquerade as application failures |
| Scheduled jobs | CronJob execution, service memory | Failed jobs and memory accumulation cause gradual degradation |
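Dashboards and alerts should be the primary view of these areas, but most of them can also be spot-checked from the command line. A few illustrative checks, assuming the `excalibur` namespace:

```bash
# Core services and tenant workloads: pod health and readiness.
kubectl get pods -n excalibur -o wide

# Storage: bound PVCs and their requested capacity (actual usage is reported via kubelet metrics).
kubectl get pvc -n excalibur

# Cluster health and scheduling: node status and pods stuck in Pending.
kubectl get nodes
kubectl get pods -n excalibur --field-selector=status.phase=Pending

# Ingress: confirm ingress resources exist and have addresses assigned.
kubectl get ingress -n excalibur
```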
Scope and Responsibility
The Excalibur operator is responsible for application-level operations:
- Monitoring Excalibur application health.
- Monitoring storage usage and backups.
- Verifying observability and alerting systems.
- Confirming TLS and ingress connectivity.
- Monitoring tenant workloads.
- Reviewing logs and metrics for anomalies.
Infrastructure tasks are typically handled by the Kubernetes platform team — node lifecycle, Kubernetes version upgrades, and cluster-wide infrastructure. In smaller deployments the same person or team may own both roles. See Out of scope for the full list. Responsibility boundaries vary by environment — confirm them with your stakeholders before relying on this split.
Maintenance Cadence
Operational monitoring should rely primarily on automated alerting. Periodic manual reviews supplement alerting and catch slow-moving trends.
Continuous (alert-driven)
The default Helm deployment ships a baseline Prometheus alert rule (InstanceDown) that fires when any scrape target stops responding. Firing alerts appear in the Alerts panel on both built-in Grafana dashboards. Treat the default rule as a starting point — add your own rules and tune thresholds to match your SLOs.
Operators should define additional Prometheus rules for comprehensive coverage. Recommended categories, with a rules-file sketch after the list:
- Pod health and restart spikes.
- Persistent volume utilization.
- Backup job failures.
- Observability stack health.
- Application error spikes.
- TLS certificate expiration.
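The sketch below shows what rules in two of these categories might look like in standard Prometheus rules-file format. The group name, thresholds, and durations are illustrative assumptions; the metrics used (`kube_pod_container_status_restarts_total`, `kubelet_volume_stats_*`) are only available if kube-state-metrics and kubelet volume metrics are being scraped, and how the file is loaded (plain rule file or a PrometheusRule resource) depends on how Prometheus is deployed in your stack.

```bash
# Sketch only: two example alerting rules in standard Prometheus rules-file format.
# Thresholds, durations, and the file name are assumptions; tune them to your SLOs
# and load the file the way your Prometheus deployment expects.
cat <<'EOF' > excalibur-extra-alerts.yaml
groups:
  - name: excalibur-operator-alerts
    rules:
      - alert: PodRestartSpike
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.namespace }}/{{ $labels.pod }} is restarting repeatedly"
      - alert: PersistentVolumeAlmostFull
        expr: kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes < 0.10
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "PVC {{ $labels.persistentvolumeclaim }} has less than 10% free space"
EOF
```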
Monthly
- Review Helm release versions for available updates (example commands follow this list).
- Review resource utilization trends (CPU, memory, storage).
- Remove access for any users who have left.
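One way to review release versions, assuming the release is named `excalibur` and the chart comes from a configured Helm repository (the repository and chart names below are assumptions):

```bash
# Show the currently deployed chart and app versions for the release.
helm list -n excalibur

# Refresh repository metadata and look for newer chart versions
# (repository and chart names are assumptions).
helm repo update
helm search repo excalibur --versions | head
```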
Quarterly
- Verify backup job history and confirm snapshots are being created (a spot-check sketch follows this list).
- Review storage capacity and adjust PVC sizes if needed.
- Coordinate with the platform team on planned infrastructure changes.
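If backups run as Kubernetes CronJobs in the Excalibur namespace (an assumption; your backup mechanism may differ), recent history and storage capacity can be spot-checked directly:

```bash
# Last scheduled time for each CronJob, and the outcome of recent Jobs.
kubectl get cronjobs -n excalibur
kubectl get jobs -n excalibur --sort-by='.status.startTime'

# Review PVC capacity before deciding whether to resize; usage trends are in the Grafana dashboards.
kubectl get pvc -n excalibur
```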
Sharing Diagnostics with Support
When an issue exceeds what dashboards and kubectl can resolve, export application logs and share them with the Excalibur support team using Excalibur Chronicler. Chronicler queries Loki through the running Excalibur stack, packages the results into a portable archive, and optionally encrypts the output so that only Excalibur support can read it.
See Collect diagnostic data for full instructions.
Out of Scope
The following areas are typically managed by the cluster platform team and fall outside this guide:
- Node OS patching.
- Kubernetes control plane upgrades.
- Control plane monitoring.
- Cluster autoscaler configuration.
- CNI plugin management.
- Ingress controller lifecycle.
- CSI storage driver maintenance.
- DNS infrastructure management.
- Cluster-level disaster recovery.