Operational Checks for Excalibur Workloads¶
This page describes the application-level checks the Excalibur operator should run regularly. Each section follows the same pattern: what to check, why it matters, warning signs, and where to look.
Prerequisites
All commands assume kubectl access to the Excalibur namespace. Replace <namespace> with your deployment's namespace.
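To avoid repeating the `-n` flag in every command, you can optionally make the namespace the default for your current kubectl context (standard kubectl behaviour, shown here only as a convenience):

```
kubectl config set-context --current --namespace=<namespace>
```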
Core Platform Services¶
What to check¶
Verify that the core Excalibur services are running and ready. Typical components:
| Component | Role |
|---|---|
| `api` | REST API and WebSocket gateway |
| `core` | Core orchestration service |
| `repository` | Database integration service |
| `token` | Token and session management |
| `identity-store`, `saml` | Authentication and identity integration |
| `ca` | Certificate authority — issuance, renewal, lifecycle |
| `proxy` | NGINX reverse proxy — web interface and PAM session recordings |
| `cache` | Redis session cache |
| `pam-orchestrator` | Manages the lifecycle of PAM sessions |
| `database` | MariaDB — single pod or 3-node Galera StatefulSet (`database-0`…`database-2`) |
| `mailer` | Email notification service |
| `rdp-proxy`, `ssh-proxy` | Dedicated protocol proxies for RDP and SSH sessions |
| `hsm` | HSM integration service |
| `dashboard`, `pam-client` | PVC initializer pods — populate static-file volumes on deployment |
Why it matters¶
These services form the platform core. Each runs with multiple replicas in a production deployment, so a single failing pod is normally absorbed transparently by Kubernetes. The risk is sustained or correlated failures — multiple replicas of the same service unhealthy at once, or a stateful component (database, cache) losing quorum. Either pattern can disrupt authentication, sessions, or web interface access.
Frequent restarts on individual pods are still worth investigating — they usually point to a deeper issue (resource pressure, configuration drift, dependency failure) before they cause user-visible downtime.
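One quick way to surface such pods is to sort the pod list by restart count using a standard kubectl field sort. This is a minimal sketch; it sorts by the first container in each pod, which is sufficient for the single-container pods listed below:

```
# Pods with the highest restart counts appear last in the output.
kubectl get pods -n <namespace> --sort-by='.status.containerStatuses[0].restartCount'
```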
Warning signs¶
- Multiple replicas of the same service in `CrashLoopBackOff` or `OOMKilled`.
- A service whose ready replica count drops below the desired count for an extended period.
- A single pod restarting repeatedly even when other replicas remain healthy (degraded redundancy).
- Readiness probes failing across replicas.
- Sudden drops in API request rate.
Where to look¶
- Grafana — Excalibur Kubernetes Metrics dashboard, Alerts panel.
- Grafana — Excalibur Application Logs dashboard, Application logs rate and Log level rate panels.
- Pod status and restart counts:

  ```
  kubectl get pods -n <namespace>
  ```

  Expected output:

  ```
  NAME                          READY   STATUS    RESTARTS   AGE
  api-5884bcbf58-2r6kk          1/1     Running   0          4d
  backup-8d5fbcc56-47chb        1/1     Running   0          4d
  ca-54bdd7c6f7-8pltv           1/1     Running   0          4d
  cache-56bff8d97b-p9fvl        1/1     Running   0          4d
  core-7b9fdc7888-st8xb         1/1     Running   0          4d
  dashboard-5849b58576-g7ms7    1/1     Running   0          4d
  database-0                    1/1     Running   0          4d
  database-1                    1/1     Running   0          4d
  database-2                    1/1     Running   0          4d
  fluent-bit-2ks9z              1/1     Running   0          4d
  fluent-bit-9rwvv              1/1     Running   0          4d
  fluent-bit-nvfqb              1/1     Running   0          4d
  grafana-57d875d8f8-vnlfr      1/1     Running   0          4d
  hsm-<hash>                    1/1     Running   0          4d
  identity-store-<hash>         1/1     Running   0          4d
  loki-<hash>                   1/1     Running   0          4d
  mailer-<hash>                 1/1     Running   0          4d
  pam-client-<hash>             1/1     Running   0          4d
  pam-orchestrator-<hash>       1/1     Running   0          4d
  prometheus-7cbcfb4576-mdx7n   1/1     Running   0          4d
  proxy-<hash>                  1/1     Running   0          4d
  rdp-proxy-<hash>              1/1     Running   0          4d
  repository-77dd849d8d-bfw7f   1/1     Running   0          4d
  saml-<hash>                   1/1     Running   0          4d
  ssh-proxy-<hash>              1/1     Running   0          4d
  squid-<hash>                  1/1     Running   0          4d
  token-<hash>                  1/1     Running   0          4d
  ...
  ```

- Confirm `READY` matches the desired replica count for each Deployment and StatefulSet:

  ```
  kubectl get deploy,statefulset -n <namespace>
  ```

  Expected output:

  ```
  NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
  deployment.apps/api                1/1     1            1           139d
  deployment.apps/backup             1/1     1            1           139d
  deployment.apps/ca                 1/1     1            1           139d
  deployment.apps/cache              1/1     1            1           139d
  deployment.apps/core               1/1     1            1           139d
  deployment.apps/dashboard          1/1     1            1           139d
  deployment.apps/grafana            1/1     1            1           139d
  deployment.apps/hsm                1/1     1            1           139d
  deployment.apps/identity-store     1/1     1            1           139d
  deployment.apps/loki               1/1     1            1           139d
  deployment.apps/mailer             1/1     1            1           139d
  deployment.apps/pam-client         1/1     1            1           139d
  deployment.apps/pam-orchestrator   1/1     1            1           139d
  deployment.apps/prometheus         1/1     1            1           139d
  deployment.apps/proxy              1/1     1            1           139d
  deployment.apps/rdp-proxy          1/1     1            1           139d
  deployment.apps/repository         1/1     1            1           139d
  deployment.apps/saml               1/1     1            1           139d
  deployment.apps/squid              1/1     1            1           139d
  deployment.apps/ssh-proxy          1/1     1            1           139d
  deployment.apps/token              1/1     1            1           139d
  ...

  NAME                        READY   AGE
  statefulset.apps/database   3/3     139d
  ```

- Primary investigation tool for any pod problem — check Events at the bottom and Last State under each container:

  ```
  kubectl describe pod <pod-name> -n <namespace>
  ```

  Expected output (relevant sections):

  ```
  Containers:
    api:
      State:          Running
        Started:      Mon, 21 Apr 2026 09:15:42 +0000
      Last State:     Terminated
        Reason:       OOMKilled
        Exit Code:    137
        Started:      Sun, 20 Apr 2026 12:00:03 +0000
        Finished:     Mon, 21 Apr 2026 09:15:30 +0000
      Ready:          True
      Restart Count:  1
  ...
  Events:
    Type    Reason     Age   From               Message
    ----    ------     ----  ----               -------
    Normal  Scheduled  4d    default-scheduler  Successfully assigned <namespace>/api-<hash> to <node>
    Normal  Pulled     4d    kubelet            Container image "ghcr.io/excalibur-enterprise/api:<version>" already present on machine
    Normal  Created    4d    kubelet            Created container api
    Normal  Started    4d    kubelet            Started container api
  ```

- `stdout`/`stderr` from the previous container instance (useful when the pod has already restarted):

  ```
  kubectl logs <pod-name> -n <namespace> --previous
  ```

  Expected output:

  ```
  {"level":"info","timestamp":"2026-04-20T12:00:01.234Z","message":"Starting API server on port 8080"}
  {"level":"info","timestamp":"2026-04-20T12:00:02.567Z","message":"Connected to database"}
  {"level":"error","timestamp":"2026-04-20T14:32:18.901Z","message":"Out of memory: killed process"}
  ```
Tenant Workloads¶
Excalibur deploys tenant-specific services when tenants are onboarded. They live in the same namespace as the core services. Pod names follow these patterns:
| Pattern | Purpose |
|---|---|
| `pam-tenant-{N}` | PAM service instance for tenant N |
| `pam-tenant-{N}-envoy` | Envoy session protocol proxy for tenant N |
| `guacd-tenant-{N}` | Apache Guacamole daemon — HTML5 RDP, SSH, VNC sessions |
| `virtual-browser-tenant-{N}` | Virtual browser instances for tenant N |
| `tunnel-{N}-{M}` | Tunnel session pods |
What to check¶
Ensure tenant workloads are running and ready after onboarding.
Why it matters¶
Tenant workloads are isolated. A failure typically affects a single tenant — but repeated failures across tenants indicate infrastructure issues.
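A quick way to spot either pattern from the command line is to filter the tenant pod list down to pods that are not running cleanly. This is a minimal shell sketch that assumes the default `kubectl get pods` column layout:

```
# Print tenant pods that are not Running or have restarted at least once.
kubectl get pods -n <namespace> --no-headers | grep tenant | awk '$3 != "Running" || $4 != "0"'
```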
Warning signs¶
- Tenant pods stuck in `Pending`.
- Repeated pod restarts.
- Unexpected drops in active session counts.
Where to look¶
- Grafana — Excalibur Application Metrics dashboard, Active PAM sessions, Active tunnels, and Active sessions panels.
- List all tenant workload pods and their status:

  ```
  kubectl get pods -n <namespace> | grep tenant
  ```

  Expected output:

  ```
  NAME                                         READY   STATUS    RESTARTS   AGE
  guacd-tenant-0-deployment-865ff6fd97-qrb9d   1/1     Running   0          4d
  guacd-tenant-2-deployment-7c549769d7-h5m2p   1/1     Running   0          4d
  guacd-tenant-3-deployment-7655d9d748-qlmwl   1/1     Running   0          4d
  pam-tenant-0-deployment-<hash>               1/1     Running   0          4d
  pam-tenant-0-envoy-<hash>                    1/1     Running   0          4d
  pam-tenant-2-deployment-<hash>               1/1     Running   0          4d
  pam-tenant-2-envoy-<hash>                    1/1     Running   0          4d
  virtual-browser-tenant-0-deployment-<hash>   1/1     Running   0          4d
  ...
  ```

- Recent scheduling or container failure events for the namespace:

  ```
  kubectl get events -n <namespace> --sort-by=.lastTimestamp
  ```

  Expected output:

  ```
  LAST SEEN   TYPE      REASON             OBJECT                                           MESSAGE
  5m          Normal    Scheduled          pod/pam-tenant-0-deployment-<hash>               Successfully assigned <namespace>/pam-tenant-0-deployment-<hash> to <node>
  5m          Normal    Pulled             pod/pam-tenant-0-deployment-<hash>               Container image already present on machine
  5m          Normal    Created            pod/pam-tenant-0-deployment-<hash>               Created container pam
  5m          Normal    Started            pod/pam-tenant-0-deployment-<hash>               Started container pam
  3m          Warning   FailedScheduling   pod/virtual-browser-tenant-1-deployment-<hash>   0/6 nodes are available: insufficient resources. preemption: 0/6 nodes are available: No preemption victims found for incoming pod.
  ```
Persistent Storage¶
Excalibur relies on persistent storage for several critical components. The following volumes are present in all standard deployments:
| Volume | Purpose | Default size |
|---|---|---|
| `database-data` | Primary database storage | 10 Gi |
| `excalibur-data` | PAM session recordings and shared application data — typically the largest and fastest-growing volume | 10 Gi |
| `backup-repository` | Backup storage | 10 Gi |
| `certificates`, `keystore` | Certificate and key material | 10 Mi each |
| `dashboard-static-files`, `pam-client-static-files` | Static frontend assets | 100 Mi each |
| `squid-spool` | Proxy cache | 100 Mi |
| `prometheus-data`, `loki-data` | Observability metrics and log storage | 1 Gi each |
| `grafana-data` | Grafana configuration and state | 100 Mi |
All sizes are configurable in the Helm values. Production deployments should review and adjust them based on expected load — especially excalibur-data, database-data, and backup-repository.
What to check¶
Ensure persistent volumes are bound and have sufficient free capacity.
Why it matters¶
When a volume fills up, the service writing to it stops accepting new writes. Persistent storage is shared across replicas of stateful services — a full volume affects the entire service, not just one pod.
Warning signs¶
- PVC usage approaching capacity.
- Services reporting write errors.
- Storage growing continuously without cleanup.
Where to look¶
- Grafana — Excalibur Kubernetes Metrics dashboard, Persistent Volume Usage panels.
- PVC bound status and capacity:

  ```
  kubectl get pvc -n <namespace>
  ```

  Expected output:

  ```
  NAME                       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
  backup-repository          Bound    pvc-756e41a4-c0ef-42cf-8884-ff7371499981   2Gi        RWX            azurefile      139d
  certificates               Bound    pvc-d88fad7b-40dd-429c-bf32-43b7865d5ea1   10Mi       RWX            azurefile      139d
  dashboard-static-files     Bound    pvc-e14b3b2c-0e14-40fe-b133-66113c990f12   100Mi      RWX            azurefile      139d
  database-data-database-0   Bound    pvc-fef572eb-c90c-4846-b269-e096186d10e0   1Gi        RWO            managed        139d
  database-data-database-1   Bound    pvc-a7e03a0a-77f7-431b-98ae-74fe3e261782   1Gi        RWO            managed        139d
  database-data-database-2   Bound    pvc-f0db59e2-3aff-4b51-8c1b-9b45d4ce495b   1Gi        RWO            managed        139d
  excalibur-data             Bound    pvc-d114890f-3068-418d-9712-5706435dd947   10Gi       RWX            azurefile      71d
  grafana-data               Bound    pvc-87bcbcdb-0027-45ac-8cbe-2ef414ef4995   1Gi        RWO            managed        139d
  keystore                   Bound    pvc-80804302-82e8-4fcd-aca9-097bc09855ee   10Mi       RWX            azurefile      139d
  loki-data                  Bound    pvc-9ea26b34-02a9-48b0-b1d5-def6d637fc79   1Gi        RWO            managed        139d
  pam-client-static-files    Bound    pvc-b8cddc68-ba96-4dc3-8860-f3e45cf17085   100Mi      RWX            azurefile      139d
  prometheus-data            Bound    pvc-a59830b3-b776-4358-939d-823928617a2e   5Gi        RWO            managed        139d
  squid-spool                Bound    pvc-3342cbc8-ce4c-4eb4-98ee-73cc2c55006a   100Mi      RWO            azurefile      139d
  ```

- Volume binding events for a specific PVC:

  ```
  kubectl describe pvc <name> -n <namespace>
  ```

  Expected output (relevant sections):

  ```
  Status:        Bound
  Capacity:      10Gi
  Access Modes:  RWX
  VolumeMode:    Filesystem
  Events:        <none>
  ```
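The kubelet-volumes metrics in Grafana are the primary source for PVC usage. As a complementary spot check, you can read filesystem usage from inside a pod that mounts the volume. This is a sketch only; it assumes the backup pod mounts the application volumes under `/volumes` (see the backup log excerpt in the next section) and that its image includes `df`; adjust the workload and path to your deployment:

```
# Show used and available space on the mounted application volumes.
kubectl exec -n <namespace> deploy/backup -- df -h /volumes
```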
Backup and Data Protection¶
The backup pod runs a scheduled, automated backup of Excalibur application data into the backup-repository PVC. The schedule and retention policy are configurable through Helm values; the chart ships with conservative defaults suitable for evaluation. Review and tune both for your production recovery requirements.
What each backup run includes¶
- A consistent dump of the application database (when a database backup target is configured).
- A snapshot of Excalibur's persistent application data — PAM session recordings, certificates, keystore, and supporting files.
Snapshots are deduplicated and compressed inside the backup-repository PVC. The repository enforces a rolling retention policy across multiple tiers (most recent, hourly, daily, weekly); periodic maintenance prunes expired snapshots automatically.
Tune the defaults
The default backup schedule, retention policy, and backup-repository PVC size are intentionally modest. For production, set both the retention window and the PVC size to match your Recovery Point Objective (RPO) and the volume of session recordings you generate. Consider replicating the backup-repository contents to off-cluster storage for disaster recovery.
What to check¶
Confirm that backup runs are completing successfully and that the backup-repository volume has sufficient free capacity.
Why it matters¶
Backups are the primary recovery mechanism. A missed or failed backup run may go unnoticed until a restore is needed.
Warning signs¶
- Backup pod logs reporting database dump or snapshot creation failures.
- No new snapshots appearing within the configured backup interval.
- `backup-repository` PVC usage growing continuously without cleanup.
- The backup repository is unreadable or snapshot listing returns errors.
Where to look¶
- Grafana — Explore with `product = excalibur-v4`, `appName = backup`.
- Grafana — Excalibur Kubernetes Metrics dashboard, Persistent Volume Usage panels for `backup-repository`.
- Backup pod status and recent log lines:

  ```
  kubectl get pods -n <namespace> -l app=backup
  ```

  Expected output:

  ```
  NAME                     READY   STATUS    RESTARTS   AGE
  backup-8d5fbcc56-47chb   1/1     Running   0          4d
  ```

  Then tail recent logs:

  ```
  kubectl logs -n <namespace> -l app=backup --tail=50
  ```

  Expected output:

  ```
  Dumping database to /volumes/database/dump.sql ...
  Database dump completed successfully.
  Creating snapshot of /volumes ...
  Snapshot created successfully.
  ```
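To confirm that a run completed within the configured backup interval, one option is to filter recent backup logs for the dump and snapshot messages shown above. The 24-hour window below is an assumption; adjust `--since` to match your schedule:

```
kubectl logs -n <namespace> -l app=backup --since=24h | grep -iE 'snapshot|dump'
```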
Observability Stack¶
Excalibur deployments include a built-in observability stack deployed as part of the Helm chart:
| Component | Role |
|---|---|
| `prometheus` | Metrics collection and alerting |
| `grafana` | Dashboards; displays Prometheus alerts |
| `loki` | Log aggregation |
| `fluent-bit` | DaemonSet (one pod per node) — forwards container stdout/stderr to Loki |
Prometheus scrape jobs¶
| Job | Source |
|---|---|
| `prometheus` | Prometheus self-metrics |
| `kubelet-cadvisor` | Container CPU and memory metrics from cAdvisor |
| `kubelet-volumes` | PVC usage metrics from kubelet |
| `kubernetes-pods` | Infrastructure metrics for Grafana, Loki, and Prometheus pods |
| `excalibur` | Application metrics from Excalibur services exposing a `/metrics` endpoint |
Retention¶
Metrics and logs are retained for a short, bounded window by default — sufficient for live operational use, but not sufficient for long-term forensic or compliance needs.
Tune retention for production
The default Prometheus and Loki retention windows are intentionally short to keep the observability PVCs small. Increase them — and the corresponding prometheus-data and loki-data PVC sizes — to match your incident-response and audit requirements. For long-term retention or compliance archiving, forward metrics and logs to an external system (for example, a remote Prometheus, Loki, or SIEM).
What to check¶
Ensure monitoring components are running and collecting data.
Why it matters¶
Monitoring failures create blind spots — issues become invisible until users report them.
Warning signs¶
- Grafana dashboards showing no data.
- Prometheus targets failing.
- Logs missing for active services.
Where to look¶
- Grafana — Excalibur Kubernetes Metrics dashboard, Metrics samples panel. All five scrape jobs (`prometheus`, `kubelet-cadvisor`, `kubelet-volumes`, `kubernetes-pods`, `excalibur`) should appear as `UP`.
- Inspect Prometheus scrape targets directly — see Port-forward Prometheus (a minimal sketch follows this list).
- Verify one fluent-bit pod per node:

  ```
  kubectl get pods -n <namespace> -l app=fluent-bit
  ```

  Expected output:

  ```
  NAME               READY   STATUS    RESTARTS   AGE
  fluent-bit-2ks9z   1/1     Running   0          4d
  fluent-bit-9rwvv   1/1     Running   0          4d
  fluent-bit-nvfqb   1/1     Running   0          4d
  fluent-bit-pjb2c   1/1     Running   0          4d
  fluent-bit-v79gn   1/1     Running   0          4d
  ```

- Grafana Explore — `product = excalibur-v4` in Loki — confirm log streams exist for all active services.
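A minimal port-forward sketch for inspecting the scrape targets locally, assuming Prometheus listens on its default port 9090; see Port-forward Prometheus for the documented procedure:

```
kubectl port-forward -n <namespace> deploy/prometheus 9090:9090
# Then open http://localhost:9090/targets to see the state of each scrape job.
```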
Sharing logs with Excalibur support
To export application logs from Loki and share them with the Excalibur support team, use Excalibur Chronicler. See Collect diagnostic data for full instructions, including encrypted exports.
TLS and Ingress¶
Excalibur is exposed through a Kubernetes Ingress named proxy, which routes traffic to the proxy service on port 8000. TLS is managed by cert-manager using a Certificate resource named <hostname>-tls (for example excalibur.xclbr.com-tls), stored in a Secret of the same name.
The default ClusterIssuer is letsencrypt-production. Let's Encrypt issues a 90-day certificate using the HTTP-01 ACME challenge, and cert-manager renews it automatically about 30 days before expiry.
What to check¶
Confirm the ingress endpoint is reachable and TLS certificates are valid.
Why it matters¶
Ingress and TLS are the entry point for every user. Even with multiple replicas inside the cluster, an ingress misconfiguration or expired certificate makes the platform unreachable from outside. The ingress controller itself should also be highly available — confirm it with your platform team.
Warning signs¶
- TLS certificates nearing expiration.
- HTTP `502` or `503` responses.
- Browser security warnings.
Where to look¶
- Verify the ingress has an external address assigned:

  ```
  kubectl get ingress proxy -n <namespace>
  ```

  Expected output:

  ```
  NAME    CLASS                                HOSTS                  ADDRESS        PORTS     AGE
  proxy   webapprouting.kubernetes.azure.com   <excalibur-hostname>   <ingress-ip>   80, 443   139d
  ```

- Verify `Ready: True` under Conditions and check `Not After` under Status:

  ```
  kubectl describe certificate <hostname>-tls -n <namespace>
  ```

  Expected output (relevant sections):

  ```
  Status:
    Conditions:
      Type:     Ready
      Status:   True
      Message:  Certificate is up to date and has not expired
    Not After:     2026-07-20T10:15:00Z
    Not Before:    2026-04-21T09:15:00Z
    Renewal Time:  2026-06-19T10:15:00Z
  ```

- Confirm the TLS `Secret` exists and is populated:

  ```
  kubectl get secret <hostname>-tls -n <namespace>
  ```

  Expected output:

  ```
  NAME                       TYPE                DATA   AGE
  <excalibur-hostname>-tls   kubernetes.io/tls   2      139d
  ```

- External connectivity check:

  ```
  curl -I https://<excalibur-hostname>/healthz
  ```

  Expected output:

  ```
  HTTP/2 200
  date: Mon, 21 Apr 2026 12:26:43 GMT
  content-type: text/html
  content-length: 2277
  ```
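As an additional external spot check of the certificate actually served at the edge, you can read its validity window directly, assuming `openssl` is available on your workstation:

```
echo | openssl s_client -connect <excalibur-hostname>:443 -servername <excalibur-hostname> 2>/dev/null \
  | openssl x509 -noout -subject -dates
```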
Scheduled Jobs and Service Restarts¶
The Excalibur Helm chart does not install Kubernetes CronJob resources. The only automated scheduled process is the backup cron running inside the backup pod (see Backup and data protection). Any other CronJob resources in the namespace were added by the operator.
Rolling restart to reclaim memory¶
If pods accumulate memory over time, a rolling restart returns usage to baseline without a service outage:
```
kubectl rollout restart deployment/<service-name> -n <namespace>
```
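To follow the restart and confirm all replicas return to ready before moving on, you can watch the rollout with standard kubectl:

```
kubectl rollout status deployment/<service-name> -n <namespace>
```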
What to check¶
If operator-defined CronJob resources are present, confirm they execute on schedule and complete successfully.
Why it matters¶
Long-running services such as pam or core can accumulate memory. A rolling restart reclaims memory without requiring a full outage.
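To see which pods are currently holding the most memory, a quick sketch using `kubectl top` (this relies on the Kubernetes metrics-server being available in the cluster):

```
kubectl top pod -n <namespace> --sort-by=memory
```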
Warning signs¶
- Failed `CronJob` executions.
- Jobs not running on schedule.
- Memory usage rising steadily across specific pods.
Where to look¶
- List scheduled jobs in the namespace:

  ```
  kubectl get cronjob -n <namespace>
  ```

  Expected output:

  ```
  No resources found in <namespace> namespace.
  ```

  If no `CronJob` resources exist, this is the default state — the chart does not install any.

- Recent job run status:

  ```
  kubectl get jobs -n <namespace>
  ```

  Expected output:

  ```
  No resources found in <namespace> namespace.
  ```

  If no `Job` resources exist, no CronJobs have fired recently.

- Grafana — Excalibur Kubernetes Metrics dashboard, Memory by pod panel — confirm memory drops after each scheduled restart.