System Maintenance
This guide describes the operational checks required to run Excalibur on a Kubernetes cluster deployed with the official Helm chart. The focus is practical day-2 operations — detecting problems early, maintaining stability, and keeping Excalibur services available and observable.
The intended reader is the Excalibur operator — the DevOps, SRE, platform, or system administrator responsible for the Excalibur deployment. This may be a partner managing Excalibur for an end customer, an in-house operations team running Excalibur for their organization, or a managed-service provider. Throughout this guide, operator refers to whichever role owns day-2 operations in your environment.
This guide assumes you operate Excalibur workloads on Kubernetes, but do not necessarily manage the Kubernetes platform itself.
Prerequisites
- Excalibur is deployed on Kubernetes using the official Helm chart.
- You have `kubectl` access to the Excalibur namespace.
- You can sign in to Grafana — see Access Grafana and Prometheus for the supported access methods.
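A quick way to confirm the `kubectl` prerequisite is to query the Excalibur namespace directly. The namespace name `excalibur` below is an assumption; substitute the namespace your Helm release uses.

```bash
# Assumes the release is installed in the "excalibur" namespace; adjust as needed.
kubectl auth can-i list pods -n excalibur   # should print "yes"
kubectl get pods -n excalibur               # all pods should be Running or Completed
```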
High Availability and What "Down" Means
In a production deployment, most Excalibur services run with multiple replicas behind a Kubernetes Service. Kubernetes load-balances requests across healthy pods, restarts failed pods automatically, and reschedules them to other nodes when needed. A single pod restart, eviction, or OOMKilled event is a routine, self-healing condition — not an outage.
A user-visible incident typically requires one of the following:
- All replicas of a stateless service are unhealthy at the same time.
- A stateful component loses quorum or its only replica (for example, the database StatefulSet, the cache, or persistent storage).
- Ingress or TLS is broken — nothing reaches the cluster.
- A shared dependency fails — DNS, identity store, or external network.
When this guide says a service "becomes unavailable," it refers to the service as a whole (every replica unhealthy, or a stateful component down), not a single pod restart. Treat individual pod restarts as signals to investigate trends — frequent restarts indicate an underlying problem long before they cause downtime.
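A minimal sketch for spotting restart trends from the command line, assuming the release runs in an `excalibur` namespace:

```bash
# Sort Excalibur pods by the restart count of their first container.
kubectl get pods -n excalibur \
  --sort-by='.status.containerStatuses[0].restartCount'

# Recent events often explain why a pod restarted (OOMKilled, failed probes, evictions).
kubectl get events -n excalibur --sort-by='.lastTimestamp' | tail -n 20
```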
Replica counts are configurable
Replica counts for each service are set in the Helm values. The chart defaults are designed for evaluation; production deployments should size replicas based on expected load, redundancy targets, and node failure tolerance. See the Installation and implementation guide for sizing guidance.
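To see which replica counts are in effect, compare the values applied to the Helm release with the live Deployments. The release and namespace names below are assumptions, and the exact value keys depend on the chart, so treat this as a sketch:

```bash
# Show the values overriding the chart defaults for the release (names are assumptions).
helm get values excalibur -n excalibur

# Compare against the replica counts Kubernetes is actually running.
kubectl get deployments -n excalibur -o wide
```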
Where to Start

- Excalibur Workloads: Monitor core services, tenant workloads, storage, backups, observability, ingress, and scheduled jobs.
- Cluster Health: Cluster-level signals that affect Excalibur — node health, resource usage, scheduling, log collection, and DNS.
- Access Grafana: Grafana is not exposed externally by default. Reach it through `kubectl port-forward`, maintenance mode, or explicit Helm exposure.
- Grafana Dashboards: Reference for the three pre-built Excalibur dashboards and how to investigate logs in Explore.
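For a quick look at Grafana without exposing it externally, `kubectl port-forward` is usually sufficient. The Service name and port below are assumptions; check the actual name with `kubectl get svc` and see Access Grafana for the supported methods.

```bash
# List Services to find the Grafana Service name created by your release.
kubectl get svc -n excalibur

# Forward local port 3000 to the Grafana Service (name and port are assumptions).
kubectl port-forward -n excalibur svc/excalibur-grafana 3000:80
# Then open http://localhost:3000 in a browser.
```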
What to Monitor
The table below summarizes the operational areas covered by this guide and why each one matters.
| Area | What to monitor | Why it matters |
|---|---|---|
| Core services | Pod health, restarts, readiness, replica count | Sustained replica loss disrupts authentication, sessions, or API access |
| Tenant workloads | Per-tenant pods and session services | Failures affect specific customers or sessions |
| Storage | Persistent volume usage | Exhaustion causes service failures or data loss |
| Backups | Backup jobs and repository capacity | Backups are required for recovery after data loss |
| Observability | Prometheus, Grafana, Loki, fluent-bit | Monitoring must work to detect problems |
| Ingress and TLS | External connectivity and certificates | The platform must remain accessible to users |
| Cluster health | Node status, resource pressure | Cluster instability often surfaces as application failures |
| Pod scheduling | Pending pods, scheduling errors | Failed scheduling prevents new or restarted pods from starting |
| Log collection | Fluent-bit health, Loki streams | Log gaps create blind spots during incidents |
| DNS and connectivity | In-cluster name resolution | DNS failures often masquerade as application failures |
| Scheduled jobs | CronJob execution, service memory | Failed jobs and memory accumulation cause gradual degradation |
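Dashboards and alerts should be the primary view of these areas, but most of them can also be spot-checked from the command line. A few illustrative checks, assuming the `excalibur` namespace:

```bash
# Core services and tenant workloads: pod health and readiness.
kubectl get pods -n excalibur -o wide

# Storage: bound PVCs and their requested capacity (actual usage is reported via kubelet metrics).
kubectl get pvc -n excalibur

# Cluster health and scheduling: node status and pods stuck in Pending.
kubectl get nodes
kubectl get pods -n excalibur --field-selector=status.phase=Pending

# Ingress: confirm ingress resources exist and have addresses assigned.
kubectl get ingress -n excalibur
```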
Scope and Responsibility
The Excalibur operator is responsible for application-level operations:
- Monitoring Excalibur application health.
- Monitoring storage usage and backups.
- Verifying observability and alerting systems.
- Confirming TLS and ingress connectivity.
- Monitoring tenant workloads.
- Reviewing logs and metrics for anomalies.
Infrastructure tasks are typically handled by the Kubernetes platform team — node lifecycle, Kubernetes version upgrades, and cluster-wide infrastructure. In smaller deployments the same person or team may own both roles. See Out of scope for the full list. Responsibility boundaries vary by environment — confirm them with your stakeholders before relying on this split.
Maintenance Cadence
Operational monitoring should rely primarily on automated alerting. Periodic manual reviews supplement alerting and catch slow-moving trends.
Continuous (alert-driven)
The default Helm deployment ships a baseline Prometheus alert rule (InstanceDown) that fires when any scrape target stops responding. Firing alerts appear in the Alerts panel on both built-in Grafana dashboards. Treat the default rule as a starting point — add your own rules and tune thresholds to match your SLOs.
Operators should define additional Prometheus rules for comprehensive coverage. Recommended categories, with a rules-file sketch after the list:
- Pod health and restart spikes.
- Persistent volume utilization.
- Backup job failures.
- Observability stack health.
- Application error spikes.
- TLS certificate expiration.
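The sketch below shows what rules in two of these categories might look like in standard Prometheus rules-file format. The group name, thresholds, and durations are illustrative assumptions; the metrics used (`kube_pod_container_status_restarts_total`, `kubelet_volume_stats_*`) are only available if kube-state-metrics and kubelet volume metrics are being scraped, and how the file is loaded (plain rule file or a PrometheusRule resource) depends on how Prometheus is deployed in your stack.

```bash
# Sketch only: two example alerting rules in standard Prometheus rules-file format.
# Thresholds, durations, and the file name are assumptions; tune them to your SLOs
# and load the file the way your Prometheus deployment expects.
cat <<'EOF' > excalibur-extra-alerts.yaml
groups:
  - name: excalibur-operator-alerts
    rules:
      - alert: PodRestartSpike
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.namespace }}/{{ $labels.pod }} is restarting repeatedly"
      - alert: PersistentVolumeAlmostFull
        expr: kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes < 0.10
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "PVC {{ $labels.persistentvolumeclaim }} has less than 10% free space"
EOF
```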
Monthly
- Review Helm release versions for available updates (example commands follow this list).
- Review resource utilization trends (CPU, memory, storage).
- Remove access for any users who have left.
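One way to review release versions, assuming the release is named `excalibur` and the chart comes from a configured Helm repository (the repository and chart names below are assumptions):

```bash
# Show the currently deployed chart and app versions for the release.
helm list -n excalibur

# Refresh repository metadata and look for newer chart versions
# (repository and chart names are assumptions).
helm repo update
helm search repo excalibur --versions | head
```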
Quarterly
- Verify backup job history and confirm snapshots are being created (a spot-check sketch follows this list).
- Review storage capacity and adjust PVC sizes if needed.
- Coordinate with the platform team on planned infrastructure changes.
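If backups run as Kubernetes CronJobs in the Excalibur namespace (an assumption; your backup mechanism may differ), recent history and storage capacity can be spot-checked directly:

```bash
# Last scheduled time for each CronJob, and the outcome of recent Jobs.
kubectl get cronjobs -n excalibur
kubectl get jobs -n excalibur --sort-by='.status.startTime'

# Review PVC capacity before deciding whether to resize; usage trends are in the Grafana dashboards.
kubectl get pvc -n excalibur
```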
Sharing Diagnostics with Support
When an issue exceeds what dashboards and kubectl can resolve, export application logs and share them with the Excalibur support team using Excalibur Chronicler. Chronicler queries Loki through the running Excalibur stack, packages the results into a portable archive, and optionally encrypts the output so that only Excalibur support can read it.
See Collect diagnostic data for full instructions.
Out of Scope
The following areas are typically managed by the cluster platform team and fall outside this guide:
- Node OS patching.
- Kubernetes control plane upgrades.
- Control plane monitoring.
- Cluster autoscaler configuration.
- CNI plugin management.
- Ingress controller lifecycle.
- CSI storage driver maintenance.
- DNS infrastructure management.
- Cluster-level disaster recovery.