Operational Checks for Excalibur Workloads¶
This page describes the application-level checks the Excalibur operator should run regularly. Each section follows the same pattern: what to check, why it matters, warning signs, and where to look.
Prerequisites
All commands assume kubectl access to the Excalibur namespace. Replace <namespace> with your deployment's namespace.
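To avoid repeating the `-n` flag in every command, you can optionally make the namespace the default for your current kubectl context (standard kubectl behaviour, shown here only as a convenience):

```
kubectl config set-context --current --namespace=<namespace>
```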
Core Platform Services¶
What to check¶
Verify that the core Excalibur services are running and ready. Typical components:
| Component | Role |
|---|---|
| `api` | REST API and WebSocket gateway |
| `core` | Core orchestration service |
| `repository` | Database integration service |
| `token` | Token and session management |
| `identity-store`, `saml` | Authentication and identity integration |
| `ca` | Certificate authority — issuance, renewal, lifecycle |
| `proxy` | NGINX reverse proxy — web interface and PAM session recordings |
| `cache` | Redis session cache |
| `pam-orchestrator` | Manages the lifecycle of PAM sessions |
| `database` | MariaDB — single pod or 3-node Galera StatefulSet (`database-0`…`database-2`) |
| `mailer` | Email notification service |
| `rdp-proxy`, `ssh-proxy` | Dedicated protocol proxies for RDP and SSH sessions |
| `hsm` | HSM integration service |
| `dashboard`, `pam-client` | PVC initializer pods — populate static-file volumes on deployment |
Why it matters¶
These services form the platform core. Each runs with multiple replicas in a production deployment, so a single failing pod is normally absorbed transparently by Kubernetes. The risk is sustained or correlated failures — multiple replicas of the same service unhealthy at once, or a stateful component (database, cache) losing quorum. Either pattern can disrupt authentication, sessions, or web interface access.
Frequent restarts on individual pods are still worth investigating — they usually point to a deeper issue (resource pressure, configuration drift, dependency failure) before they cause user-visible downtime.
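One quick way to surface such pods is to sort the pod list by restart count using a standard kubectl field sort. This is a minimal sketch; it sorts by the first container in each pod, which is sufficient for the single-container pods listed below:

```
# Pods with the highest restart counts appear last in the output.
kubectl get pods -n <namespace> --sort-by='.status.containerStatuses[0].restartCount'
```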
Warning signs¶
- Multiple replicas of the same service in `CrashLoopBackOff` or `OOMKilled`.
- A service whose ready replica count drops below the desired count for an extended period.
- A single pod restarting repeatedly even when other replicas remain healthy (degraded redundancy).
- Readiness probes failing across replicas.
- Sudden drops in API request rate.
Where to look¶
- Grafana — Excalibur Kubernetes Metrics dashboard, Alerts panel.
- Grafana — Excalibur Application Logs dashboard, Application logs rate and Log level rate panels.
- Pod status and restart counts:

  ```
  kubectl get pods -n <namespace>
  ```

  Expected output:

  ```
  NAME                          READY   STATUS    RESTARTS   AGE
  api-5884bcbf58-2r6kk          1/1     Running   0          4d
  backup-8d5fbcc56-47chb        1/1     Running   0          4d
  ca-54bdd7c6f7-8pltv           1/1     Running   0          4d
  cache-56bff8d97b-p9fvl        1/1     Running   0          4d
  core-7b9fdc7888-st8xb         1/1     Running   0          4d
  dashboard-5849b58576-g7ms7    1/1     Running   0          4d
  database-0                    1/1     Running   0          4d
  database-1                    1/1     Running   0          4d
  database-2                    1/1     Running   0          4d
  fluent-bit-2ks9z              1/1     Running   0          4d
  fluent-bit-9rwvv              1/1     Running   0          4d
  fluent-bit-nvfqb              1/1     Running   0          4d
  grafana-57d875d8f8-vnlfr      1/1     Running   0          4d
  hsm-<hash>                    1/1     Running   0          4d
  identity-store-<hash>         1/1     Running   0          4d
  loki-<hash>                   1/1     Running   0          4d
  mailer-<hash>                 1/1     Running   0          4d
  pam-client-<hash>             1/1     Running   0          4d
  pam-orchestrator-<hash>       1/1     Running   0          4d
  prometheus-7cbcfb4576-mdx7n   1/1     Running   0          4d
  proxy-<hash>                  1/1     Running   0          4d
  rdp-proxy-<hash>              1/1     Running   0          4d
  repository-77dd849d8d-bfw7f   1/1     Running   0          4d
  saml-<hash>                   1/1     Running   0          4d
  ssh-proxy-<hash>              1/1     Running   0          4d
  squid-<hash>                  1/1     Running   0          4d
  token-<hash>                  1/1     Running   0          4d
  ...
  ```

- Confirm `READY` matches the desired replica count for each Deployment and StatefulSet:

  ```
  kubectl get deploy,statefulset -n <namespace>
  ```

  Expected output:

  ```
  NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
  deployment.apps/api                1/1     1            1           139d
  deployment.apps/backup             1/1     1            1           139d
  deployment.apps/ca                 1/1     1            1           139d
  deployment.apps/cache              1/1     1            1           139d
  deployment.apps/core               1/1     1            1           139d
  deployment.apps/dashboard          1/1     1            1           139d
  deployment.apps/grafana            1/1     1            1           139d
  deployment.apps/hsm                1/1     1            1           139d
  deployment.apps/identity-store     1/1     1            1           139d
  deployment.apps/loki               1/1     1            1           139d
  deployment.apps/mailer             1/1     1            1           139d
  deployment.apps/pam-client         1/1     1            1           139d
  deployment.apps/pam-orchestrator   1/1     1            1           139d
  deployment.apps/prometheus         1/1     1            1           139d
  deployment.apps/proxy              1/1     1            1           139d
  deployment.apps/rdp-proxy          1/1     1            1           139d
  deployment.apps/repository         1/1     1            1           139d
  deployment.apps/saml               1/1     1            1           139d
  deployment.apps/squid              1/1     1            1           139d
  deployment.apps/ssh-proxy          1/1     1            1           139d
  deployment.apps/token              1/1     1            1           139d
  ...

  NAME                        READY   AGE
  statefulset.apps/database   3/3     139d
  ```

- Primary investigation tool for any pod problem — check Events at the bottom and Last State under each container:

  ```
  kubectl describe pod <pod-name> -n <namespace>
  ```

  Expected output (relevant sections):

  ```
  Containers:
    api:
      State:          Running
        Started:      Mon, 21 Apr 2026 09:15:42 +0000
      Last State:     Terminated
        Reason:       OOMKilled
        Exit Code:    137
        Started:      Sun, 20 Apr 2026 12:00:03 +0000
        Finished:     Mon, 21 Apr 2026 09:15:30 +0000
      Ready:          True
      Restart Count:  1
  ...
  Events:
    Type    Reason     Age   From               Message
    ----    ------     ----  ----               -------
    Normal  Scheduled  4d    default-scheduler  Successfully assigned <namespace>/api-<hash> to <node>
    Normal  Pulled     4d    kubelet            Container image "ghcr.io/excalibur-enterprise/api:<version>" already present on machine
    Normal  Created    4d    kubelet            Created container api
    Normal  Started    4d    kubelet            Started container api
  ```

- `stdout`/`stderr` from the previous container instance (useful when the pod has already restarted):

  ```
  kubectl logs <pod-name> -n <namespace> --previous
  ```

  Expected output:

  ```
  {"level":"info","timestamp":"2026-04-20T12:00:01.234Z","message":"Starting API server on port 8080"}
  {"level":"info","timestamp":"2026-04-20T12:00:02.567Z","message":"Connected to database"}
  {"level":"error","timestamp":"2026-04-20T14:32:18.901Z","message":"Out of memory: killed process"}
  ```
Tenant Workloads¶
Excalibur deploys tenant-specific services when tenants are onboarded. They live in the same namespace as the core services. Pod names follow these patterns:
| Pattern | Purpose |
|---|---|
| `pam-tenant-{N}` | PAM service instance for tenant N |
| `pam-tenant-{N}-envoy` | Envoy session protocol proxy for tenant N |
| `guacd-tenant-{N}` | Apache Guacamole daemon — HTML5 RDP, SSH, VNC sessions |
| `virtual-browser-tenant-{N}` | Virtual browser instances for tenant N |
| `tunnel-{N}-{M}` | Tunnel session pods |
What to check¶
Ensure tenant workloads are running and ready after onboarding.
Why it matters¶
Tenant workloads are isolated. A failure typically affects a single tenant — but repeated failures across tenants indicate infrastructure issues.
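A quick way to spot either pattern from the command line is to filter the tenant pod list down to pods that are not running cleanly. This is a minimal shell sketch that assumes the default `kubectl get pods` column layout:

```
# Print tenant pods that are not Running or have restarted at least once.
kubectl get pods -n <namespace> --no-headers | grep tenant | awk '$3 != "Running" || $4 != "0"'
```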
Warning signs¶
- Tenant pods stuck in `Pending`.
- Repeated pod restarts.
- Unexpected drops in active session counts.
Where to look¶
- Grafana — Excalibur Application Metrics dashboard, Active PAM sessions, Active tunnels, and Active sessions panels.
- List all tenant workload pods and their status:

  ```
  kubectl get pods -n <namespace> | grep tenant
  ```

  Expected output:

  ```
  NAME                                         READY   STATUS    RESTARTS   AGE
  guacd-tenant-0-deployment-865ff6fd97-qrb9d   1/1     Running   0          4d
  guacd-tenant-2-deployment-7c549769d7-h5m2p   1/1     Running   0          4d
  guacd-tenant-3-deployment-7655d9d748-qlmwl   1/1     Running   0          4d
  pam-tenant-0-deployment-<hash>               1/1     Running   0          4d
  pam-tenant-0-envoy-<hash>                    1/1     Running   0          4d
  pam-tenant-2-deployment-<hash>               1/1     Running   0          4d
  pam-tenant-2-envoy-<hash>                    1/1     Running   0          4d
  virtual-browser-tenant-0-deployment-<hash>   1/1     Running   0          4d
  ...
  ```

- Recent scheduling or container failure events for the namespace:

  ```
  kubectl get events -n <namespace> --sort-by=.lastTimestamp
  ```

  Expected output:

  ```
  LAST SEEN   TYPE      REASON             OBJECT                                           MESSAGE
  5m          Normal    Scheduled          pod/pam-tenant-0-deployment-<hash>               Successfully assigned <namespace>/pam-tenant-0-deployment-<hash> to <node>
  5m          Normal    Pulled             pod/pam-tenant-0-deployment-<hash>               Container image already present on machine
  5m          Normal    Created            pod/pam-tenant-0-deployment-<hash>               Created container pam
  5m          Normal    Started            pod/pam-tenant-0-deployment-<hash>               Started container pam
  3m          Warning   FailedScheduling   pod/virtual-browser-tenant-1-deployment-<hash>   0/6 nodes are available: insufficient resources. preemption: 0/6 nodes are available: No preemption victims found for incoming pod.
  ```
Persistent Storage¶
Excalibur relies on persistent storage for several critical components. The following volumes are present in all standard deployments:
| Volume | Purpose | Default size |
|---|---|---|
| `database-data` | Primary database storage | 10 Gi |
| `excalibur-data` | PAM session recordings and shared application data — typically the largest and fastest-growing volume | 10 Gi |
| `backup-repository` | Backup storage | 10 Gi |
| `certificates`, `keystore` | Certificate and key material | 10 Mi each |
| `dashboard-static-files`, `pam-client-static-files` | Static frontend assets | 100 Mi each |
| `squid-spool` | Proxy cache | 100 Mi |
| `prometheus-data`, `loki-data` | Observability metrics and log storage | 1 Gi each |
| `grafana-data` | Grafana configuration and state | 100 Mi |
All sizes are configurable in the Helm values. Production deployments should review and adjust them based on expected load — especially excalibur-data, database-data, and backup-repository.
What to check¶
Ensure persistent volumes are bound and have sufficient free capacity.
Why it matters¶
When a volume fills up, the service writing to it stops accepting new writes. Persistent storage is shared across replicas of stateful services — a full volume affects the entire service, not just one pod.
Warning signs¶
- PVC usage approaching capacity.
- Services reporting write errors.
- Storage growing continuously without cleanup.
Where to look¶
- Grafana — Excalibur Kubernetes Metrics dashboard, Persistent Volume Usage panels.
- PVC bound status and capacity:

  ```
  kubectl get pvc -n <namespace>
  ```

  Expected output:

  ```
  NAME                       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
  backup-repository          Bound    pvc-756e41a4-c0ef-42cf-8884-ff7371499981   2Gi        RWX            azurefile      139d
  certificates               Bound    pvc-d88fad7b-40dd-429c-bf32-43b7865d5ea1   10Mi       RWX            azurefile      139d
  dashboard-static-files     Bound    pvc-e14b3b2c-0e14-40fe-b133-66113c990f12   100Mi      RWX            azurefile      139d
  database-data-database-0   Bound    pvc-fef572eb-c90c-4846-b269-e096186d10e0   1Gi        RWO            managed        139d
  database-data-database-1   Bound    pvc-a7e03a0a-77f7-431b-98ae-74fe3e261782   1Gi        RWO            managed        139d
  database-data-database-2   Bound    pvc-f0db59e2-3aff-4b51-8c1b-9b45d4ce495b   1Gi        RWO            managed        139d
  excalibur-data             Bound    pvc-d114890f-3068-418d-9712-5706435dd947   10Gi       RWX            azurefile      71d
  grafana-data               Bound    pvc-87bcbcdb-0027-45ac-8cbe-2ef414ef4995   1Gi        RWO            managed        139d
  keystore                   Bound    pvc-80804302-82e8-4fcd-aca9-097bc09855ee   10Mi       RWX            azurefile      139d
  loki-data                  Bound    pvc-9ea26b34-02a9-48b0-b1d5-def6d637fc79   1Gi        RWO            managed        139d
  pam-client-static-files    Bound    pvc-b8cddc68-ba96-4dc3-8860-f3e45cf17085   100Mi      RWX            azurefile      139d
  prometheus-data            Bound    pvc-a59830b3-b776-4358-939d-823928617a2e   5Gi        RWO            managed        139d
  squid-spool                Bound    pvc-3342cbc8-ce4c-4eb4-98ee-73cc2c55006a   100Mi      RWO            azurefile      139d
  ```

- Volume binding events for a specific PVC:

  ```
  kubectl describe pvc <name> -n <namespace>
  ```

  Expected output (relevant sections):

  ```
  Status:        Bound
  Capacity:      10Gi
  Access Modes:  RWX
  VolumeMode:    Filesystem
  Events:        <none>
  ```
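The kubelet-volumes metrics in Grafana are the primary source for PVC usage. As a complementary spot check, you can read filesystem usage from inside a pod that mounts the volume. This is a sketch only; it assumes the backup pod mounts the application volumes under `/volumes` (see the backup log excerpt in the next section) and that its image includes `df`; adjust the workload and path to your deployment:

```
# Show used and available space on the mounted application volumes.
kubectl exec -n <namespace> deploy/backup -- df -h /volumes
```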
Backup and Data Protection¶
The backup pod runs a scheduled, automated backup of Excalibur application data into the backup-repository PVC. The schedule and retention policy are configurable through Helm values; the chart ships with conservative defaults suitable for evaluation. Review and tune both for your production recovery requirements.
What each backup run includes¶
- A consistent dump of the application database (when a database backup target is configured).
- A snapshot of Excalibur's persistent application data — PAM session recordings, certificates, keystore, and supporting files.
Snapshots are deduplicated and compressed inside the backup-repository PVC. The repository enforces a rolling retention policy across multiple tiers (most recent, hourly, daily, weekly); periodic maintenance prunes expired snapshots automatically.
Tune the defaults
The default backup schedule, retention policy, and backup-repository PVC size are intentionally modest. For production, set both the retention window and the PVC size to match your Recovery Point Objective (RPO) and the volume of session recordings you generate. Consider replicating the backup-repository contents to off-cluster storage for disaster recovery.
What to check¶
Confirm that backup runs are completing successfully and that the backup-repository volume has sufficient free capacity.
Why it matters¶
Backups are the primary recovery mechanism. A missed or failed backup run may go unnoticed until a restore is needed.
Warning signs¶
- Backup pod logs reporting database dump or snapshot creation failures.
- No new snapshots appearing within the configured backup interval.
- `backup-repository` PVC usage growing continuously without cleanup.
- The backup repository is unreadable or snapshot listing returns errors.
Where to look¶
- Grafana — Explore with `product = excalibur-v4`, `appName = backup`.
- Grafana — Excalibur Kubernetes Metrics dashboard, Persistent Volume Usage panels for `backup-repository`.
- Backup pod status and recent log lines:

  ```
  kubectl get pods -n <namespace> -l app=backup
  ```

  Expected output:

  ```
  NAME                     READY   STATUS    RESTARTS   AGE
  backup-8d5fbcc56-47chb   1/1     Running   0          4d
  ```

  Then tail recent logs:

  ```
  kubectl logs -n <namespace> -l app=backup --tail=50
  ```

  Expected output:

  ```
  Dumping database to /volumes/database/dump.sql ...
  Database dump completed successfully.
  Creating snapshot of /volumes ...
  Snapshot created successfully.
  ```
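To confirm that a run completed within the configured backup interval, one option is to filter recent backup logs for the dump and snapshot messages shown above. The 24-hour window below is an assumption; adjust `--since` to match your schedule:

```
kubectl logs -n <namespace> -l app=backup --since=24h | grep -iE 'snapshot|dump'
```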
Observability Stack¶
Excalibur deployments include a built-in observability stack deployed as part of the Helm chart:
| Component | Role |
|---|---|
| `prometheus` | Metrics collection and alerting |
| `grafana` | Dashboards; displays Prometheus alerts |
| `loki` | Log aggregation |
| `fluent-bit` | DaemonSet (one pod per node) — forwards container stdout/stderr to Loki |
Prometheus scrape jobs¶
| Job | Source |
|---|---|
| `prometheus` | Prometheus self-metrics |
| `kubelet-cadvisor` | Container CPU and memory metrics from cAdvisor |
| `kubelet-volumes` | PVC usage metrics from kubelet |
| `kubernetes-pods` | Infrastructure metrics for Grafana, Loki, and Prometheus pods |
| `excalibur` | Application metrics from Excalibur services exposing a `/metrics` endpoint |
Retention¶
Metrics and logs are retained for a short, bounded window by default — sufficient for live operational use, but not sufficient for long-term forensic or compliance needs.
Tune retention for production
The default Prometheus and Loki retention windows are intentionally short to keep the observability PVCs small. Increase them — and the corresponding prometheus-data and loki-data PVC sizes — to match your incident-response and audit requirements. For long-term retention or compliance archiving, forward metrics and logs to an external system (for example, a remote Prometheus, Loki, or SIEM).
What to check¶
Ensure monitoring components are running and collecting data.
Why it matters¶
Monitoring failures create blind spots — issues become invisible until users report them.
Warning signs¶
- Grafana dashboards showing no data.
- Prometheus targets failing.
- Logs missing for active services.
Where to look¶
- Grafana — Excalibur Kubernetes Metrics dashboard, Metrics samples panel. All five scrape jobs (`prometheus`, `kubelet-cadvisor`, `kubelet-volumes`, `kubernetes-pods`, `excalibur`) should appear as `UP`.
- Inspect Prometheus scrape targets directly — see Port-forward Prometheus (a minimal sketch follows this list).
- Verify one fluent-bit pod per node:

  ```
  kubectl get pods -n <namespace> -l app=fluent-bit
  ```

  Expected output:

  ```
  NAME               READY   STATUS    RESTARTS   AGE
  fluent-bit-2ks9z   1/1     Running   0          4d
  fluent-bit-9rwvv   1/1     Running   0          4d
  fluent-bit-nvfqb   1/1     Running   0          4d
  fluent-bit-pjb2c   1/1     Running   0          4d
  fluent-bit-v79gn   1/1     Running   0          4d
  ```

- Grafana Explore — `product = excalibur-v4` in Loki — confirm log streams exist for all active services.
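A minimal port-forward sketch for inspecting the scrape targets locally, assuming Prometheus listens on its default port 9090; see Port-forward Prometheus for the documented procedure:

```
kubectl port-forward -n <namespace> deploy/prometheus 9090:9090
# Then open http://localhost:9090/targets to see the state of each scrape job.
```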
Sharing logs with Excalibur support
To export application logs from Loki and share them with the Excalibur support team, use Excalibur Chronicler. See Collect diagnostic data for full instructions, including encrypted exports.
TLS and Ingress¶
Excalibur is exposed through a Kubernetes Ingress named proxy, which routes traffic to the proxy service on port 8000. TLS is managed by cert-manager using a Certificate resource named <hostname>-tls (for example excalibur.xclbr.com-tls), stored in a Secret of the same name.
The default ClusterIssuer is letsencrypt-production. Let's Encrypt issues a 90-day certificate using the HTTP-01 ACME challenge, and cert-manager renews it automatically about 30 days before expiry.
What to check¶
Confirm the ingress endpoint is reachable and TLS certificates are valid.
Why it matters¶
Ingress and TLS are the entry point for every user. Even with multiple replicas inside the cluster, an ingress misconfiguration or expired certificate makes the platform unreachable from outside. The ingress controller itself should also be highly available — confirm it with your platform team.
Warning signs¶
- TLS certificates nearing expiration.
- HTTP `502` or `503` responses.
- Browser security warnings.
Where to look¶
- Verify the ingress has an external address assigned:

  ```
  kubectl get ingress proxy -n <namespace>
  ```

  Expected output:

  ```
  NAME    CLASS                                HOSTS                  ADDRESS        PORTS     AGE
  proxy   webapprouting.kubernetes.azure.com   <excalibur-hostname>   <ingress-ip>   80, 443   139d
  ```

- Verify `Ready: True` under Conditions and check `Not After` under Status:

  ```
  kubectl describe certificate <hostname>-tls -n <namespace>
  ```

  Expected output (relevant sections):

  ```
  Status:
    Conditions:
      Type:     Ready
      Status:   True
      Message:  Certificate is up to date and has not expired
    Not After:     2026-07-20T10:15:00Z
    Not Before:    2026-04-21T09:15:00Z
    Renewal Time:  2026-06-19T10:15:00Z
  ```

- Confirm the TLS `Secret` exists and is populated:

  ```
  kubectl get secret <hostname>-tls -n <namespace>
  ```

  Expected output:

  ```
  NAME                       TYPE                DATA   AGE
  <excalibur-hostname>-tls   kubernetes.io/tls   2      139d
  ```

- External connectivity check:

  ```
  curl -I https://<excalibur-hostname>/healthz
  ```

  Expected output:

  ```
  HTTP/2 200
  date: Mon, 21 Apr 2026 12:26:43 GMT
  content-type: text/html
  content-length: 2277
  ```
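As an additional external spot check of the certificate actually served at the edge, you can read its validity window directly, assuming `openssl` is available on your workstation:

```
echo | openssl s_client -connect <excalibur-hostname>:443 -servername <excalibur-hostname> 2>/dev/null \
  | openssl x509 -noout -subject -dates
```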
Scheduled Jobs and Service Restarts¶
The Excalibur Helm chart does not install Kubernetes CronJob resources. The only automated scheduled process is the backup cron running inside the backup pod (see Backup and data protection). Any other CronJob resources in the namespace were added by the operator.
Rolling restart to reclaim memory¶
If pods accumulate memory over time, a rolling restart returns usage to baseline without a service outage:
```
kubectl rollout restart deployment/<service-name> -n <namespace>
```
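To follow the restart and confirm all replicas return to ready before moving on, you can watch the rollout with standard kubectl:

```
kubectl rollout status deployment/<service-name> -n <namespace>
```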
What to check¶
If operator-defined CronJob resources are present, confirm they execute on schedule and complete successfully.
Why it matters¶
Long-running services such as pam or core can accumulate memory. A rolling restart reclaims memory without requiring a full outage.
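To see which pods are currently holding the most memory, a quick sketch using `kubectl top` (this relies on the Kubernetes metrics-server being available in the cluster):

```
kubectl top pod -n <namespace> --sort-by=memory
```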
Warning signs¶
- Failed `CronJob` executions.
- Jobs not running on schedule.
- Memory usage rising steadily across specific pods.
Where to look¶
- List scheduled jobs in the namespace:

  ```
  kubectl get cronjob -n <namespace>
  ```

  Expected output:

  ```
  No resources found in <namespace> namespace.
  ```

  If no `CronJob` resources exist, this is the default state — the chart does not install any.

- Recent job run status:

  ```
  kubectl get jobs -n <namespace>
  ```

  Expected output:

  ```
  No resources found in <namespace> namespace.
  ```

  If no `Job` resources exist, no CronJobs have fired recently.

- Grafana — Excalibur Kubernetes Metrics dashboard, Memory by pod panel — confirm memory drops after each scheduled restart.