Cluster Health Signals¶
Even when Excalibur services look healthy, cluster-level issues can degrade the platform. This page describes the cluster signals the Excalibur operator should monitor alongside Excalibur workloads.
Prerequisites
All commands assume kubectl access to the cluster. Replace <namespace> with your Excalibur deployment's namespace.
Node Health¶
What to check¶
Ensure cluster nodes are healthy, available, and have spare capacity to absorb a node failure.
Why it matters¶
Healthy nodes are the foundation for HA — Kubernetes can only reschedule a failed pod onto another node if a healthy node has capacity. A single `NotReady` node in a multi-node cluster is usually absorbed transparently because replicas elsewhere keep serving traffic. Multiple unhealthy nodes, or a single-node cluster losing its only node, causes a hard outage.
Warning signs¶
- Nodes reporting `NotReady`.
- Nodes under `DiskPressure` or `MemoryPressure`.
- Unexpected pod evictions.
Where to look¶
-   Check for `NotReady` status:

    ```bash
    kubectl get nodes
    ```

    Expected output:

    ```
    NAME                                 STATUS   ROLES    AGE   VERSION
    aks-systempool-32246347-vmss000000   Ready    <none>   4d    v1.33.7
    aks-userpool-26910173-vmss000004     Ready    <none>   4d    v1.33.7
    aks-userpool-26910173-vmss000057     Ready    <none>   4d    v1.33.7
    aks-userpool-26910173-vmss00007g     Ready    <none>   4d    v1.33.7
    aks-userpool-26910173-vmss00008e     Ready    <none>   4d    v1.33.7
    aks-userpool-26910173-vmss00008h     Ready    <none>   4d    v1.33.7
    ```

-   Inspect Conditions (`DiskPressure`, `MemoryPressure`, `PIDPressure`) and recent events:

    ```bash
    kubectl describe node <node-name>
    ```

    Expected output (relevant sections):

    ```
    Conditions:
      Type             Status  LastHeartbeatTime                 Reason                      Message
      ----             ------  -----------------                 ------                      -------
      MemoryPressure   False   Tue, 21 Apr 2026 15:16:20 +0000   KubeletHasSufficientMemory  kubelet has sufficient memory available
      DiskPressure     False   Tue, 21 Apr 2026 15:16:20 +0000   KubeletHasNoDiskPressure    kubelet has no disk pressure
      PIDPressure      False   Tue, 21 Apr 2026 15:16:20 +0000   KubeletHasSufficientPID     kubelet has sufficient PID available
      Ready            True    Tue, 21 Apr 2026 15:16:20 +0000   KubeletReady                kubelet is posting ready status
    ```
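On larger clusters it helps to surface only the problem nodes instead of reading each describe by hand. A minimal sketch, assuming `jq` is available on the workstation, that prints every node condition in an unhealthy state; a fully healthy cluster produces no output:

```bash
# Print "node<TAB>condition=status" for every unhealthy node condition:
# Ready that is not True, or any other condition (DiskPressure,
# MemoryPressure, PIDPressure, ...) that is True.
kubectl get nodes -o json | jq -r '
  .items[]
  | .metadata.name as $node
  | .status.conditions[]
  | select((.type == "Ready" and .status != "True")
        or (.type != "Ready" and .status == "True"))
  | "\($node)\t\(.type)=\(.status)"'
```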
Platform responsibility
Node lifecycle management is typically handled by cluster administrators. If the cluster uses a node autoscaler, node count changes are also managed by the platform team — sudden node removal can trigger pod rescheduling and may temporarily affect availability.
Resource Usage¶
What to check¶
Monitor CPU and memory usage trends across workloads.
Why it matters¶
Sustained high resource usage causes pods to be OOM-killed, throttled, or degraded in performance. Usage trends often appear days before a hard failure, which leaves time to act.
Warning signs¶
- Pods terminated due to `OOMKilled`.
- Containers operating near their resource limits (see the sketch after this list).
- Pods stuck in `Pending`.
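The limits comparison is easiest with the configured limits in front of you. A minimal sketch, using standard `kubectl` output formatting, that prints each container's memory limit for comparison against the live numbers from `kubectl top pods`:

```bash
# Show configured memory limits per container; compare these against
# live usage from `kubectl top pods`. Containers without a limit
# print <none>.
kubectl get pods -n <namespace> -o custom-columns=\
'POD:.metadata.name,CONTAINERS:.spec.containers[*].name,MEM_LIMITS:.spec.containers[*].resources.limits.memory'
```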
Where to look¶
- Grafana — Excalibur Kubernetes Metrics dashboard, CPU seconds by container and Memory by pod panels.
-   Live per-pod resource usage:

    ```bash
    kubectl top pods -n <namespace> --sort-by=memory
    ```

    Expected output:

    ```
    NAME                                                   CPU(cores)   MEMORY(bytes)
    virtual-browser-tenant-0-deployment-5bdc5c4dc9-6c2fx   2m           737Mi
    tunnel-52-81-9866b7676-xzs6g                           26m          349Mi
    prometheus-7cbcfb4576-mdx7n                            10m          329Mi
    database-2                                             30m          253Mi
    core-7b9fdc7888-st8xb                                  145m         235Mi
    api-5884bcbf58-2r6kk                                   6m           208Mi
    loki-fb5687d79-62gzm                                   13m          196Mi
    database-1                                             32m          179Mi
    database-0                                             32m          170Mi
    squid-5fbd585d86-7nmhp                                 3m           167Mi
    repository-77dd849d8d-bfw7f                            8m           163Mi
    ...
    ```
-   Inspect the pod; under each container, Last State shows the termination reason and exit code from the previous run. `OOMKilled` here means the container exceeded its memory limit:

    ```bash
    kubectl describe pod <pod-name> -n <namespace>
    ```

    Expected output (relevant sections):

    ```
    Containers:
      api:
        State:          Running
          Started:      Tue, 21 Apr 2026 09:15:42 +0000
        Last State:     Terminated
          Reason:       OOMKilled
          Exit Code:    137
          Started:      Mon, 20 Apr 2026 12:00:03 +0000
          Finished:     Tue, 21 Apr 2026 09:15:30 +0000
        Ready:          True
        Restart Count:  1
    ```
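To find OOM-killed containers across the whole namespace rather than describing pods one at a time, a sketch along these lines works, assuming `jq` is available; empty output is the healthy steady state:

```bash
# List containers whose previous run was OOMKilled, with restart counts.
kubectl get pods -n <namespace> -o json | jq -r '
  .items[]
  | .metadata.name as $pod
  | .status.containerStatuses[]?
  | select(.lastState.terminated.reason == "OOMKilled")
  | "\($pod)\t\(.name)\trestarts=\(.restartCount)"'
```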
Pod Scheduling¶
What to check¶
Ensure new pods start successfully.
Warning signs¶
- Pods stuck in `Pending`.
- Scheduling errors in pod events.
Common causes:
- Insufficient node capacity.
- Image pull failures.
- Volume binding issues.
Where to look¶
-   List stuck pods:

    ```bash
    kubectl get pods -n <namespace> --field-selector=status.phase=Pending
    ```

    Expected output:

    ```
    No resources found in <namespace> namespace.
    ```

    The expected steady-state result is `No resources found`. Any pod listed here requires investigation.

-   Inspect the Events section for scheduling failure messages:

    ```bash
    kubectl describe pod <pod-name> -n <namespace>
    ```

    Expected output (relevant sections):

    ```
    Events:
      Type     Reason            Age   From               Message
      ----     ------            ----  ----               -------
      Warning  FailedScheduling  5m    default-scheduler  0/3 nodes are available: 2 Too many pods. preemption: 0/3 nodes are available: 2 No preemption victims found for incoming pod.
    ```
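The same failures can also be surfaced namespace-wide from the event stream, without describing pods one by one:

```bash
# List only FailedScheduling warnings, most recent last.
kubectl get events -n <namespace> \
  --field-selector type=Warning,reason=FailedScheduling \
  --sort-by=.lastTimestamp
```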
Log Collection¶
Excalibur services push structured logs directly to Loki over HTTP (default: `http://loki:3100`). Console logging to stdout is disabled by default. This makes Loki, viewed through Grafana, the primary log store — it holds the configured window of structured log history for all services.
In addition, fluent-bit runs as a DaemonSet on every cluster node and forwards container stdout and stderr to Loki with the `level = console` label. This captures anything written outside the structured logger — startup output and fatal crashes before the logger initialized.
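To spot-check that console logs are actually arriving, you can query Loki's HTTP API directly. A minimal sketch from a workstation with kubectl access, assuming the in-cluster Loki Service matches the default URL above:

```bash
# Forward the Loki port locally, then list the streams carrying the
# console label that fluent-bit attaches.
kubectl port-forward -n <namespace> svc/loki 3100:3100 &
sleep 2   # give the port-forward a moment to establish
curl -G -s "http://localhost:3100/loki/api/v1/series" \
  --data-urlencode 'match[]={level="console"}'
```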
Tune Loki retention for your needs
The default Loki retention is short and designed for live operational use, not long-term forensics or compliance. Increase retention — and the loki-data PVC size — to match your incident-response window. For long-term archiving, forward logs to an external system such as a SIEM.
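When resizing, check the volume's provisioned capacity first; the PVC name below comes from the note above:

```bash
# Provisioned capacity of the Loki data volume; a longer retention
# window needs proportionally more disk.
kubectl get pvc loki-data -n <namespace>
```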
`kubectl logs` shows only the stdout/stderr buffer Kubernetes keeps on the node. Because console logging is off by default, this buffer is mostly empty during normal operation. It is still useful for crash output before the logger started, or as a fallback when Loki is unreachable.
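For crash investigation, the most useful variant fetches the previous container run:

```bash
# stdout/stderr from the container's previous run, which is where
# crash output emitted before the structured logger initialized ends up.
kubectl logs <pod-name> -n <namespace> --previous
```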
What to check¶
Ensure fluent-bit is running on every node and structured logs are arriving in Loki.
Warning signs¶
- Grafana Explore returns no results for `product = excalibur-v4` while pods are running.
- Log volume in Loki drops to zero for services that should be active.
- Fluent-bit pods missing, not `Running`, or restarting repeatedly.
Where to look¶
-   Verify one fluent-bit pod per node is `Running`:

    ```bash
    kubectl get pods -n <namespace> -l app=fluent-bit
    ```

    Expected output:

    ```
    NAME               READY   STATUS    RESTARTS   AGE
    fluent-bit-2ks9z   1/1     Running   0          4d
    fluent-bit-9rwvv   1/1     Running   0          4d
    fluent-bit-nvfqb   1/1     Running   0          4d
    fluent-bit-pjb2c   1/1     Running   0          4d
    fluent-bit-v79gn   1/1     Running   0          4d
    ```

-   Check for log shipping errors from fluent-bit itself:

    ```bash
    kubectl logs -n <namespace> -l app=fluent-bit --tail=30
    ```

    Expected output:

    ```
    [2026/04/21 12:00:01] [ info] [output:loki:loki.0] loki.0, entity=http://loki:3100/loki/api/v1/push, http_status=204
    [2026/04/21 12:00:06] [ info] [output:loki:loki.0] loki.0, entity=http://loki:3100/loki/api/v1/push, http_status=204
    ```

-   Grafana Explore — `product = excalibur-v4` in Loki — confirm log streams are present for all active services.
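The Grafana Explore check can also be approximated from the CLI with Loki's instant-query API, reusing the port-forward from the earlier sketch. The `service` grouping label is an assumption; substitute whatever label your streams actually carry:

```bash
# Log volume over the last 5 minutes, grouped by service. Services
# that should be active but report zero (or are absent) need a look.
# NOTE: the "service" label is assumed; adjust to your stream labels.
curl -G -s "http://localhost:3100/loki/api/v1/query" \
  --data-urlencode 'query=sum by (service) (count_over_time({product="excalibur-v4"}[5m]))'
```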
DNS and Service Connectivity¶
What to check¶
Ensure Excalibur services can resolve and reach internal Kubernetes service names.
Why it matters¶
Excalibur services rely on Kubernetes DNS for service discovery. DNS issues often appear as application failures even when the pods themselves are healthy.
Warning signs¶
- Multiple services reporting connection errors or failing at the same time.
- Errors indicating failed name resolution in application logs.
- Services unable to reach internal dependencies.
Where to look¶
-   Test in-cluster DNS resolution from within the Excalibur namespace:

    ```bash
    kubectl run dns-test --image=busybox:1.36 --rm -it --restart=Never -n <namespace> -- nslookup core
    ```

    Expected output:

    ```
    Server:    <kube-dns-ip>
    Address:   <kube-dns-ip>:53

    Name:      core.<namespace>.svc.cluster.local
    Address:   <cluster-ip>
    ```

-   Grafana Explore — check for correlated connection errors across multiple Excalibur services in Loki.

-   Look for repeated connection or resolution failures:

    ```bash
    kubectl get events -n <namespace> --sort-by=.lastTimestamp
    ```

    Expected output:

    ```
    LAST SEEN   TYPE     REASON      OBJECT                               MESSAGE
    5m          Normal   Scheduled   pod/pam-tenant-0-deployment-<hash>   Successfully assigned <namespace>/pam-tenant-0-deployment-<hash> to <node>
    5m          Normal   Pulled      pod/pam-tenant-0-deployment-<hash>   Container image already present on machine
    3m          Normal   Created     pod/pam-tenant-0-deployment-<hash>   Created container pam
    ```
Escalate cluster-wide DNS failures
Cluster DNS components (such as CoreDNS) are typically managed by the Kubernetes platform team. If multiple services experience name resolution failures simultaneously, escalate to cluster administrators.
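Before escalating, it is worth confirming the symptom from the cluster side. A quick sketch; the `k8s-app=kube-dns` label covers CoreDNS on most distributions, including AKS:

```bash
# Are the cluster DNS pods themselves healthy, and are they logging
# errors? CoreDNS typically carries the k8s-app=kube-dns label.
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=20
```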