Operational Checks for Excalibur Workloads

This page describes the application-level checks the Excalibur operator should run regularly. Each section follows the same pattern: what to check, why it matters, warning signs, and where to look.

Prerequisites

All commands assume kubectl access to the Excalibur namespace. Replace <namespace> with your deployment's namespace.

Core Platform Services

What to check

Verify that the core Excalibur services are running and ready. Typical components:

Component              Role
api                    REST API and WebSocket gateway
core                   Core orchestration service
repository             Database integration service
token                  Token and session management
identity-store, saml   Authentication and identity integration
ca                     Certificate authority — issuance, renewal, lifecycle
proxy                  NGINX reverse proxy — web interface and PAM session recordings
cache                  Redis session cache
pam-orchestrator       Manages the lifecycle of PAM sessions
database               MariaDB — single pod or 3-node Galera StatefulSet (database-0 through database-2)
mailer                 Email notification service
rdp-proxy, ssh-proxy   Dedicated protocol proxies for RDP and SSH sessions
hsm                    HSM integration service
dashboard, pam-client  PVC initializer pods — populate static-file volumes on deployment

Why it matters

These services form the platform core. Each runs with multiple replicas in a production deployment, so a single failing pod is normally absorbed transparently by Kubernetes. The risk is sustained or correlated failures — multiple replicas of the same service unhealthy at once, or a stateful component (database, cache) losing quorum. Either pattern can disrupt authentication, sessions, or web interface access.

Frequent restarts on individual pods are still worth investigating — they usually point to a deeper issue (resource pressure, configuration drift, dependency failure) before they cause user-visible downtime.
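
A quick way to spot restart-prone pods is to sort the listing by restart count. The sketch below uses standard kubectl JSONPath sorting; it assumes the first container's status is representative for each pod:

    # Pods with the highest restart counts are listed last
    kubectl get pods -n <namespace> --sort-by='.status.containerStatuses[0].restartCount'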

Warning signs

  • Multiple replicas of the same service in CrashLoopBackOff or OOMKilled.
  • A service whose ready replica count drops below the desired count for an extended period.
  • A single pod restarting repeatedly even when other replicas remain healthy (degraded redundancy).
  • Readiness probes failing across replicas.
  • Sudden drops in API request rate.
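
As a quick complement to the full listing below, a simple grep surfaces anything unhealthy (a sketch; Completed is excluded because finished init and job pods are expected):

    # Show only pods whose STATUS column is not Running or Completed
    kubectl get pods -n <namespace> --no-headers | grep -vE 'Running|Completed'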

Where to look

  • Grafana — Excalibur Kubernetes Metrics dashboard, Alerts panel.
  • Grafana — Excalibur Application Logs dashboard, Application logs rate and Log level rate panels.
  • Pod status and restart counts:

    kubectl get pods -n <namespace>
    

    Expected output:

    NAME                                              READY   STATUS    RESTARTS   AGE
    api-5884bcbf58-2r6kk                              1/1     Running   0          4d
    backup-8d5fbcc56-47chb                            1/1     Running   0          4d
    ca-54bdd7c6f7-8pltv                               1/1     Running   0          4d
    cache-56bff8d97b-p9fvl                            1/1     Running   0          4d
    core-7b9fdc7888-st8xb                             1/1     Running   0          4d
    dashboard-5849b58576-g7ms7                        1/1     Running   0          4d
    database-0                                        1/1     Running   0          4d
    database-1                                        1/1     Running   0          4d
    database-2                                        1/1     Running   0          4d
    fluent-bit-2ks9z                                  1/1     Running   0          4d
    fluent-bit-9rwvv                                  1/1     Running   0          4d
    fluent-bit-nvfqb                                  1/1     Running   0          4d
    grafana-57d875d8f8-vnlfr                          1/1     Running   0          4d
    hsm-<hash>                                        1/1     Running   0          4d
    identity-store-<hash>                             1/1     Running   0          4d
    loki-<hash>                                       1/1     Running   0          4d
    mailer-<hash>                                     1/1     Running   0          4d
    pam-client-<hash>                                 1/1     Running   0          4d
    pam-orchestrator-<hash>                           1/1     Running   0          4d
    prometheus-7cbcfb4576-mdx7n                       1/1     Running   0          4d
    proxy-<hash>                                      1/1     Running   0          4d
    rdp-proxy-<hash>                                  1/1     Running   0          4d
    repository-77dd849d8d-bfw7f                       1/1     Running   0          4d
    saml-<hash>                                       1/1     Running   0          4d
    ssh-proxy-<hash>                                  1/1     Running   0          4d
    squid-<hash>                                      1/1     Running   0          4d
    token-<hash>                                      1/1     Running   0          4d
    ...
    
  • Confirm READY matches the desired replica count for each Deployment and StatefulSet:

    kubectl get deploy,statefulset -n <namespace>
    

    Expected output:

    NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
    deployment.apps/api                1/1     1            1           139d
    deployment.apps/backup             1/1     1            1           139d
    deployment.apps/ca                 1/1     1            1           139d
    deployment.apps/cache              1/1     1            1           139d
    deployment.apps/core               1/1     1            1           139d
    deployment.apps/dashboard          1/1     1            1           139d
    deployment.apps/grafana            1/1     1            1           139d
    deployment.apps/hsm                1/1     1            1           139d
    deployment.apps/identity-store     1/1     1            1           139d
    deployment.apps/loki               1/1     1            1           139d
    deployment.apps/mailer             1/1     1            1           139d
    deployment.apps/pam-client         1/1     1            1           139d
    deployment.apps/pam-orchestrator   1/1     1            1           139d
    deployment.apps/prometheus         1/1     1            1           139d
    deployment.apps/proxy              1/1     1            1           139d
    deployment.apps/rdp-proxy          1/1     1            1           139d
    deployment.apps/repository         1/1     1            1           139d
    deployment.apps/saml               1/1     1            1           139d
    deployment.apps/squid              1/1     1            1           139d
    deployment.apps/ssh-proxy          1/1     1            1           139d
    deployment.apps/token              1/1     1            1           139d
    ...
    
    NAME                       READY   AGE
    statefulset.apps/database  3/3     139d
    
  • Primary investigation tool for any pod problem — check Events at the bottom and Last State under each container:

    kubectl describe pod <pod-name> -n <namespace>
    

    Expected output (relevant sections):

    Containers:
      api:
        State:          Running
          Started:      Mon, 21 Apr 2026 09:15:42 +0000
        Last State:     Terminated
          Reason:       OOMKilled
          Exit Code:    137
          Started:      Sun, 20 Apr 2026 12:00:03 +0000
          Finished:     Mon, 21 Apr 2026 09:15:30 +0000
        Ready:          True
        Restart Count:  1
    ...
    Events:
      Type    Reason     Age   From               Message
      ----    ------     ----  ----               -------
      Normal  Scheduled  4d    default-scheduler  Successfully assigned <namespace>/api-<hash> to <node>
      Normal  Pulled     4d    kubelet            Container image "ghcr.io/excalibur-enterprise/api:<version>" already present on machine
      Normal  Created    4d    kubelet            Created container api
      Normal  Started    4d    kubelet            Started container api
    
  • stdout/stderr from the previous container instance (useful when the pod has already restarted):

    kubectl logs <pod-name> -n <namespace> --previous
    

    Expected output:

    {"level":"info","timestamp":"2026-04-20T12:00:01.234Z","message":"Starting API server on port 8080"}
    {"level":"info","timestamp":"2026-04-20T12:00:02.567Z","message":"Connected to database"}
    {"level":"error","timestamp":"2026-04-20T14:32:18.901Z","message":"Out of memory: killed process"}
    

Tenant Workloads

Excalibur deploys tenant-specific services when tenants are onboarded. They live in the same namespace as the core services. Pod names follow these patterns:

Pattern                     Purpose
pam-tenant-{N}              PAM service instance for tenant N
pam-tenant-{N}-envoy        Envoy session protocol proxy for tenant N
guacd-tenant-{N}            Apache Guacamole daemon — HTML5 RDP, SSH, VNC sessions
virtual-browser-tenant-{N}  Virtual browser instances for tenant N
tunnel-{N}-{M}              Tunnel session pods

What to check

Ensure tenant workloads are running and ready after onboarding.

Why it matters

Tenant workloads are isolated. A failure typically affects a single tenant — but repeated failures across tenants indicate infrastructure issues.

Warning signs

  • Tenant pods stuck in Pending.
  • Repeated pod restarts.
  • Unexpected drops in active session counts.
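
To scan for these quickly across all tenants, a simple filter keyed off the pod-name patterns above helps (a sketch; -o wide adds the assigned node):

    # Tenant pods that are not Running, with the node each one landed on
    kubectl get pods -n <namespace> -o wide | grep tenant | grep -v Running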

Where to look

  • Grafana — Excalibur Application Metrics dashboard, Active PAM sessions, Active tunnels, and Active sessions panels.
  • List all tenant workload pods and their status:

    kubectl get pods -n <namespace> | grep tenant
    

    Expected output:

    NAME                                              READY   STATUS    RESTARTS   AGE
    guacd-tenant-0-deployment-865ff6fd97-qrb9d        1/1     Running   0          4d
    guacd-tenant-2-deployment-7c549769d7-h5m2p        1/1     Running   0          4d
    guacd-tenant-3-deployment-7655d9d748-qlmwl        1/1     Running   0          4d
    pam-tenant-0-deployment-<hash>                    1/1     Running   0          4d
    pam-tenant-0-envoy-<hash>                         1/1     Running   0          4d
    pam-tenant-2-deployment-<hash>                    1/1     Running   0          4d
    pam-tenant-2-envoy-<hash>                         1/1     Running   0          4d
    virtual-browser-tenant-0-deployment-<hash>        1/1     Running   0          4d
    ...
    
  • Recent scheduling or container failure events for the namespace:

    kubectl get events -n <namespace> --sort-by=.lastTimestamp
    

    Expected output:

    LAST SEEN   TYPE      REASON                  OBJECT                                          MESSAGE
    5m          Normal    Scheduled               pod/pam-tenant-0-deployment-<hash>               Successfully assigned <namespace>/pam-tenant-0-deployment-<hash> to <node>
    5m          Normal    Pulled                  pod/pam-tenant-0-deployment-<hash>               Container image already present on machine
    5m          Normal    Created                 pod/pam-tenant-0-deployment-<hash>               Created container pam
    5m          Normal    Started                 pod/pam-tenant-0-deployment-<hash>               Started container pam
    3m          Warning   FailedScheduling        pod/virtual-browser-tenant-1-deployment-<hash>  0/6 nodes are available: insufficient resources. preemption: 0/6 nodes are available: No preemption victims found for incoming pod.
    

Persistent Storage

Excalibur relies on persistent storage for several critical components. The following volumes are present in all standard deployments:

Volume                        Purpose                                                Default size
database-data                 Primary database storage                               10 Gi
excalibur-data                PAM session recordings and shared application data —   10 Gi
                              typically the largest and fastest-growing volume
backup-repository             Backup storage                                         10 Gi
certificates, keystore        Certificate and key material                           10 Mi each
dashboard-static-files,       Static frontend assets                                 100 Mi each
pam-client-static-files
squid-spool                   Proxy cache                                            100 Mi
prometheus-data, loki-data    Observability metrics and log storage                  1 Gi each
grafana-data                  Grafana configuration and state                        100 Mi

All sizes are configurable in the Helm values. Production deployments should review and adjust them based on expected load — especially excalibur-data, database-data, and backup-repository.
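
To see which of these sizes your deployment overrides, inspect the values supplied to the Helm release (a sketch using the standard Helm CLI; <release> is your release name):

    # User-supplied overrides only; add --all to include chart defaults
    helm get values <release> -n <namespace>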

What to check

Ensure persistent volumes are bound and have sufficient free capacity.

Why it matters

When a volume fills up, writes from the service using it start to fail. Persistent storage is shared across replicas of stateful services — a full volume affects the entire service, not just one pod.
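
Kubelet-reported usage is available in Grafana (see below), but you can also check from inside any pod that mounts the volume. This sketch assumes the container image ships the standard df utility:

    # Filesystem usage for every volume mounted in the pod
    kubectl exec -n <namespace> <pod-name> -- df -h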

Warning signs

  • PVC usage approaching capacity.
  • Services reporting write errors.
  • Storage growing continuously without cleanup.

Where to look

  • Grafana — Excalibur Kubernetes Metrics dashboard, Persistent Volume Usage panels.
  • PVC bound status and capacity:

    kubectl get pvc -n <namespace>
    

    Expected output:

    NAME                       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
    backup-repository          Bound    pvc-756e41a4-c0ef-42cf-8884-ff7371499981   2Gi        RWX            azurefile      139d
    certificates               Bound    pvc-d88fad7b-40dd-429c-bf32-43b7865d5ea1   10Mi       RWX            azurefile      139d
    dashboard-static-files     Bound    pvc-e14b3b2c-0e14-40fe-b133-66113c990f12   100Mi      RWX            azurefile      139d
    database-data-database-0   Bound    pvc-fef572eb-c90c-4846-b269-e096186d10e0   1Gi        RWO            managed        139d
    database-data-database-1   Bound    pvc-a7e03a0a-77f7-431b-98ae-74fe3e261782   1Gi        RWO            managed        139d
    database-data-database-2   Bound    pvc-f0db59e2-3aff-4b51-8c1b-9b45d4ce495b   1Gi        RWO            managed        139d
    excalibur-data             Bound    pvc-d114890f-3068-418d-9712-5706435dd947   10Gi       RWX            azurefile      71d
    grafana-data               Bound    pvc-87bcbcdb-0027-45ac-8cbe-2ef414ef4995   1Gi        RWO            managed        139d
    keystore                   Bound    pvc-80804302-82e8-4fcd-aca9-097bc09855ee   10Mi       RWX            azurefile      139d
    loki-data                  Bound    pvc-9ea26b34-02a9-48b0-b1d5-def6d637fc79   1Gi        RWO            managed        139d
    pam-client-static-files    Bound    pvc-b8cddc68-ba96-4dc3-8860-f3e45cf17085   100Mi      RWX            azurefile      139d
    prometheus-data            Bound    pvc-a59830b3-b776-4358-939d-823928617a2e   5Gi        RWO            managed        139d
    squid-spool                Bound    pvc-3342cbc8-ce4c-4eb4-98ee-73cc2c55006a   100Mi      RWO            azurefile      139d
    
  • Volume binding events for a specific PVC:

    kubectl describe pvc <name> -n <namespace>
    

    Expected output (relevant sections):

    Status:        Bound
    Capacity:      10Gi
    Access Modes:  RWX
    VolumeMode:    Filesystem
    Events:        <none>
    

Backup and Data Protection

The backup pod runs a scheduled, automated backup of Excalibur application data into the backup-repository PVC. The schedule and retention policy are configurable through Helm values; the chart ships with conservative defaults suitable for evaluation. Review and tune both for your production recovery requirements.

What each backup run includes

  • A consistent dump of the application database (when a database backup target is configured).
  • A snapshot of Excalibur's persistent application data — PAM session recordings, certificates, keystore, and supporting files.

Snapshots are deduplicated and compressed inside the backup-repository PVC. The repository enforces a rolling retention policy across multiple tiers (most recent, hourly, daily, weekly); periodic maintenance prunes expired snapshots automatically.
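
The repository's mount point inside the backup pod depends on your deployment. One way to locate it without assuming a path (a jsonpath sketch):

    # List the volume mounts of the backup container
    kubectl get pods -n <namespace> -l app=backup \
      -o jsonpath='{.items[0].spec.containers[0].volumeMounts}'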

Tune the defaults

The default backup schedule, retention policy, and backup-repository PVC size are intentionally modest. For production, set both the retention window and the PVC size to match your Recovery Point Objective (RPO) and the volume of session recordings you generate. Consider replicating the backup-repository contents to off-cluster storage for disaster recovery.

What to check

Confirm that backup runs are completing successfully and that the backup-repository volume has sufficient free capacity.

Why it matters

Backups are the primary recovery mechanism. A missed or failed backup run may go unnoticed until a restore is needed.
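
To make silent failures visible between reviews, scan the last day of backup logs for error lines (a sketch; the exact wording of failure messages is deployment-specific):

    # Error lines from backup runs in the last 24 hours
    kubectl logs -n <namespace> -l app=backup --since=24h --tail=-1 | grep -i error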

Warning signs

  • Backup pod logs reporting database dump or snapshot creation failures.
  • No new snapshots appearing within the configured backup interval.
  • backup-repository PVC usage growing continuously without cleanup.
  • Backup repository unreadable, or snapshot listing returning errors.

Where to look

  • Grafana — Explore with product = excalibur-v4, appName = backup.
  • Grafana — Excalibur Kubernetes Metrics dashboard, Persistent Volume Usage panels for backup-repository.
  • Backup pod status and recent log lines:

    kubectl get pods -n <namespace> -l app=backup
    

    Expected output:

    NAME                     READY   STATUS    RESTARTS   AGE
    backup-8d5fbcc56-47chb   1/1     Running   0          4d
    

    Then tail recent logs:

    kubectl logs -n <namespace> -l app=backup --tail=50
    

    Expected output:

    Dumping database to /volumes/database/dump.sql ...
    Database dump completed successfully.
    Creating snapshot of /volumes ...
    Snapshot created successfully.
    

Observability Stack

Excalibur deployments include a built-in observability stack deployed as part of the Helm chart:

Component   Role
prometheus  Metrics collection and alerting
grafana     Dashboards; displays Prometheus alerts
loki        Log aggregation
fluent-bit  DaemonSet (one pod per node) — forwards container stdout/stderr to Loki
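
Because fluent-bit is a DaemonSet, its pod count should equal the cluster's node count. A quick comparison (a sketch using standard kubectl):

    # These two numbers should match
    kubectl get nodes --no-headers | wc -l
    kubectl get pods -n <namespace> -l app=fluent-bit --no-headers | wc -l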

Prometheus scrape jobs

Job               Source
prometheus        Prometheus self-metrics
kubelet-cadvisor  Container CPU and memory metrics from cAdvisor
kubelet-volumes   PVC usage metrics from kubelet
kubernetes-pods   Infrastructure metrics for Grafana, Loki, and Prometheus pods
excalibur         Application metrics from Excalibur services exposing a /metrics endpoint

Retention

Metrics and logs are retained for a short, bounded window by default — sufficient for live operational use, but not sufficient for long-term forensic or compliance needs.

Tune retention for production

The default Prometheus and Loki retention windows are intentionally short to keep the observability PVCs small. Increase them — and the corresponding prometheus-data and loki-data PVC sizes — to match your incident-response and audit requirements. For long-term retention or compliance archiving, forward metrics and logs to an external system (for example, a remote Prometheus, Loki, or SIEM).

What to check

Ensure monitoring components are running and collecting data.

Why it matters

Monitoring failures create blind spots — issues become invisible until users report them.

Warning signs

  • Grafana dashboards showing no data.
  • Prometheus targets failing.
  • Logs missing for active services.

Where to look

  • Grafana — Excalibur Kubernetes Metrics dashboard, Metrics samples panel. All five scrape jobs (prometheus, kubelet-cadvisor, kubelet-volumes, kubernetes-pods, excalibur) should appear as UP.
  • Inspect Prometheus scrape targets directly — see Port-forward Prometheus; a minimal sketch follows this list.
  • Verify one fluent-bit pod per node:

    kubectl get pods -n <namespace> -l app=fluent-bit
    

    Expected output:

    NAME               READY   STATUS    RESTARTS   AGE
    fluent-bit-2ks9z   1/1     Running   0          4d
    fluent-bit-9rwvv   1/1     Running   0          4d
    fluent-bit-nvfqb   1/1     Running   0          4d
    fluent-bit-pjb2c   1/1     Running   0          4d
    fluent-bit-v79gn   1/1     Running   0          4d
    
  • Grafana Explore with product = excalibur-v4 in Loki — confirm log streams exist for all active services.
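
For the port-forward referenced above, a minimal sketch, assuming the Prometheus Service is named prometheus and listens on the default port 9090:

    # Forward the Prometheus API to localhost, then summarize target health
    kubectl port-forward -n <namespace> svc/prometheus 9090:9090 &
    sleep 2
    curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[a-z]*"' | sort | uniq -c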

Sharing logs with Excalibur support

To export application logs from Loki and share them with the Excalibur support team, use Excalibur Chronicler. See Collect diagnostic data for full instructions, including encrypted exports.

TLS and Ingress

Excalibur is exposed through a Kubernetes Ingress named proxy, which routes traffic to the proxy service on port 8000. TLS is managed by cert-manager using a Certificate resource named <hostname>-tls (for example excalibur.xclbr.com-tls), stored in a Secret of the same name.

The default ClusterIssuer is letsencrypt-production. Let's Encrypt issues a 90-day certificate using the HTTP-01 ACME challenge, and cert-manager renews it automatically about 30 days before expiry.
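
Expiry can also be confirmed from outside the cluster, independently of cert-manager. A sketch using openssl:

    # Print the validity window of the certificate actually served at the ingress
    echo | openssl s_client -connect <excalibur-hostname>:443 \
        -servername <excalibur-hostname> 2>/dev/null | openssl x509 -noout -dates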

What to check

Confirm the ingress endpoint is reachable and TLS certificates are valid.

Why it matters

Ingress and TLS are the entry point for every user. Even with multiple replicas inside the cluster, an ingress misconfiguration or expired certificate makes the platform unreachable from outside. The ingress controller itself should also be highly available — confirm it with your platform team.

Warning signs

  • TLS certificates nearing expiration.
  • HTTP 502 or 503 responses.
  • Browser security warnings.

Where to look

  • Verify the ingress has an external address assigned:

    kubectl get ingress proxy -n <namespace>
    

    Expected output:

    NAME    CLASS                                HOSTS               ADDRESS         PORTS     AGE
    proxy   webapprouting.kubernetes.azure.com   <excalibur-hostname> <ingress-ip>    80, 443   139d
    
  • Verify Ready: True under Conditions and check Not After under Status:

    kubectl describe certificate <hostname>-tls -n <namespace>
    

    Expected output (relevant sections):

    Status:
      Conditions:
        Type:                  Ready
        Status:                True
        Message:               Certificate is up to date and has not expired
      Not After:               2026-07-20T10:15:00Z
      Not Before:              2026-04-21T09:15:00Z
      Renewal Time:            2026-06-19T10:15:00Z
    
  • Confirm the TLS Secret exists and is populated:

    kubectl get secret <hostname>-tls -n <namespace>
    

    Expected output:

    NAME                       TYPE                DATA   AGE
    <excalibur-hostname>-tls   kubernetes.io/tls   2      139d
    
  • External connectivity check:

    curl -I https://<excalibur-hostname>/healthz
    

    Expected output:

    HTTP/2 200
    date: Mon, 21 Apr 2026 12:26:43 GMT
    content-type: text/html
    content-length: 2277
    

Scheduled Jobs and Service Restarts

The Excalibur Helm chart does not install Kubernetes CronJob resources. The only automated scheduled process is the backup cron running inside the backup pod (see Backup and data protection). Any other CronJob resources in the namespace were added by the operator.

Rolling restart to reclaim memory

If pods accumulate memory over time, a rolling restart returns usage to baseline without a service outage:

    kubectl rollout restart deployment/<service-name> -n <namespace>
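
The restart replaces pods one at a time, so capacity stays available throughout. To block until it completes (standard kubectl):

    # Waits until the new replicas are rolled out and ready
    kubectl rollout status deployment/<service-name> -n <namespace>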

What to check

If operator-defined CronJob resources are present, confirm they execute on schedule and complete successfully.

Why it matters

Long-running services such as pam or core can accumulate memory. A rolling restart reclaims memory without requiring a full outage.

Warning signs

  • Failed CronJob executions.
  • Jobs not running on schedule.
  • Memory usage rising steadily across specific pods.

Where to look

  • List scheduled jobs in the namespace:

    kubectl get cronjob -n <namespace>
    

    Expected output:

    No resources found in <namespace> namespace.
    

    If no CronJob resources exist, this is the default state — the chart does not install any.

  • Recent job run status:

    kubectl get jobs -n <namespace>
    

    Expected output:

    No resources found in <namespace> namespace.
    

    If no Job resources exist, no CronJobs have fired recently.

  • Grafana — Excalibur Kubernetes Metrics dashboard, Memory by pod panel — confirm memory drops after each scheduled restart.