Cluster Health Signals¶
Even when Excalibur services look healthy, cluster-level issues can degrade the platform. This page describes the cluster signals the Excalibur operator should monitor alongside Excalibur workloads.
Prerequisites
All commands assume kubectl access to the cluster. Replace <namespace> with your Excalibur deployment's namespace.
Node Health¶
What to check¶
Ensure cluster nodes are healthy, available, and have spare capacity to absorb a node failure.
Why it matters¶
Healthy nodes are the foundation for HA — Kubernetes can only reschedule a failed pod onto another node if a healthy node has capacity. A single `NotReady` node in a multi-node cluster is usually absorbed transparently because replicas elsewhere keep serving traffic. Multiple unhealthy nodes, or a single-node cluster losing its only node, causes a hard outage.
Warning signs¶
- Nodes reporting `NotReady`.
- Nodes under `DiskPressure` or `MemoryPressure`.
- Unexpected pod evictions.
Where to look¶
-   Check for `NotReady` status:

    ```bash
    kubectl get nodes
    ```

    Expected output:

    ```
    NAME                                 STATUS   ROLES    AGE   VERSION
    aks-systempool-32246347-vmss000000   Ready    <none>   4d    v1.33.7
    aks-userpool-26910173-vmss000004     Ready    <none>   4d    v1.33.7
    aks-userpool-26910173-vmss000057     Ready    <none>   4d    v1.33.7
    aks-userpool-26910173-vmss00007g     Ready    <none>   4d    v1.33.7
    aks-userpool-26910173-vmss00008e     Ready    <none>   4d    v1.33.7
    aks-userpool-26910173-vmss00008h     Ready    <none>   4d    v1.33.7
    ```

-   Inspect Conditions (`DiskPressure`, `MemoryPressure`, `PIDPressure`) and recent events:

    ```bash
    kubectl describe node <node-name>
    ```

    Expected output (relevant sections):

    ```
    Conditions:
      Type             Status  LastHeartbeatTime                 Reason                      Message
      ----             ------  -----------------                 ------                      -------
      MemoryPressure   False   Tue, 21 Apr 2026 15:16:20 +0000   KubeletHasSufficientMemory  kubelet has sufficient memory available
      DiskPressure     False   Tue, 21 Apr 2026 15:16:20 +0000   KubeletHasNoDiskPressure    kubelet has no disk pressure
      PIDPressure      False   Tue, 21 Apr 2026 15:16:20 +0000   KubeletHasSufficientPID     kubelet has sufficient PID available
      Ready            True    Tue, 21 Apr 2026 15:16:20 +0000   KubeletReady                kubelet is posting ready status
    ```
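On larger clusters it helps to surface only the problem nodes instead of reading each describe by hand. A minimal sketch, assuming `jq` is available on the workstation, that prints every node condition in an unhealthy state; a fully healthy cluster produces no output:

```bash
# Print "node<TAB>condition=status" for every unhealthy node condition:
# Ready that is not True, or any other condition (DiskPressure,
# MemoryPressure, PIDPressure, ...) that is True.
kubectl get nodes -o json | jq -r '
  .items[]
  | .metadata.name as $node
  | .status.conditions[]
  | select((.type == "Ready" and .status != "True")
        or (.type != "Ready" and .status == "True"))
  | "\($node)\t\(.type)=\(.status)"'
```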
Platform responsibility
Node lifecycle management is typically handled by cluster administrators. If the cluster uses a node autoscaler, node count changes are also managed by the platform team — sudden node removal can trigger pod rescheduling and may temporarily affect availability.
Resource Usage¶
What to check¶
Monitor CPU and memory usage trends across workloads.
Why it matters¶
Sustained high resource usage causes pods to be OOM-killed, throttled, or degraded in performance. Usage trends often appear days before a hard failure, which leaves time to act.
Warning signs¶
- Pods terminated due to `OOMKilled`.
- Containers operating near their resource limits (see the sketch after this list).
- Pods stuck in `Pending`.
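The limits comparison is easiest with the configured limits in front of you. A minimal sketch, using standard `kubectl` output formatting, that prints each container's memory limit for comparison against the live numbers from `kubectl top pods`:

```bash
# Show configured memory limits per container; compare these against
# live usage from `kubectl top pods`. Containers without a limit
# print <none>.
kubectl get pods -n <namespace> -o custom-columns=\
'POD:.metadata.name,CONTAINERS:.spec.containers[*].name,MEM_LIMITS:.spec.containers[*].resources.limits.memory'
```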
Where to look¶
- Grafana — Excalibur Kubernetes Metrics dashboard, CPU seconds by container and Memory by pod panels.
-   Live per-pod resource usage:

    ```bash
    kubectl top pods -n <namespace> --sort-by=memory
    ```

    Expected output:

    ```
    NAME                                                   CPU(cores)   MEMORY(bytes)
    virtual-browser-tenant-0-deployment-5bdc5c4dc9-6c2fx   2m           737Mi
    tunnel-52-81-9866b7676-xzs6g                           26m          349Mi
    prometheus-7cbcfb4576-mdx7n                            10m          329Mi
    database-2                                             30m          253Mi
    core-7b9fdc7888-st8xb                                  145m         235Mi
    api-5884bcbf58-2r6kk                                   6m           208Mi
    loki-fb5687d79-62gzm                                   13m          196Mi
    database-1                                             32m          179Mi
    database-0                                             32m          170Mi
    squid-5fbd585d86-7nmhp                                 3m           167Mi
    repository-77dd849d8d-bfw7f                            8m           163Mi
    ...
    ```
-   Inspect the pod; under each container, Last State shows the termination reason and exit code from the previous run. `OOMKilled` here means the container exceeded its memory limit:

    ```bash
    kubectl describe pod <pod-name> -n <namespace>
    ```

    Expected output (relevant sections):

    ```
    Containers:
      api:
        State:          Running
          Started:      Tue, 21 Apr 2026 09:15:42 +0000
        Last State:     Terminated
          Reason:       OOMKilled
          Exit Code:    137
          Started:      Mon, 20 Apr 2026 12:00:03 +0000
          Finished:     Tue, 21 Apr 2026 09:15:30 +0000
        Ready:          True
        Restart Count:  1
    ```
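To find OOM-killed containers across the whole namespace rather than describing pods one at a time, a sketch along these lines works, assuming `jq` is available; empty output is the healthy steady state:

```bash
# List containers whose previous run was OOMKilled, with restart counts.
kubectl get pods -n <namespace> -o json | jq -r '
  .items[]
  | .metadata.name as $pod
  | .status.containerStatuses[]?
  | select(.lastState.terminated.reason == "OOMKilled")
  | "\($pod)\t\(.name)\trestarts=\(.restartCount)"'
```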
Pod Scheduling¶
What to check¶
Ensure new pods start successfully.
Warning signs¶
- Pods stuck in `Pending`.
- Scheduling errors in pod events.
Common causes:
- Insufficient node capacity.
- Image pull failures.
- Volume binding issues.
Where to look¶
-   List stuck pods:

    ```bash
    kubectl get pods -n <namespace> --field-selector=status.phase=Pending
    ```

    Expected output:

    ```
    No resources found in <namespace> namespace.
    ```

    The expected steady-state result is `No resources found`. Any pod listed here requires investigation.

-   Inspect the Events section for scheduling failure messages:

    ```bash
    kubectl describe pod <pod-name> -n <namespace>
    ```

    Expected output (relevant sections):

    ```
    Events:
      Type     Reason            Age   From               Message
      ----     ------            ----  ----               -------
      Warning  FailedScheduling  5m    default-scheduler  0/3 nodes are available: 2 Too many pods. preemption: 0/3 nodes are available: 2 No preemption victims found for incoming pod.
    ```
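The same failures can also be surfaced namespace-wide from the event stream, without describing pods one by one:

```bash
# List only FailedScheduling warnings, most recent last.
kubectl get events -n <namespace> \
  --field-selector type=Warning,reason=FailedScheduling \
  --sort-by=.lastTimestamp
```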
Log Collection¶
Excalibur services push structured logs directly to Loki over HTTP (default: `http://loki:3100`). Console logging to stdout is disabled by default. This makes Loki, viewed through Grafana, the primary log store — it holds the configured window of structured log history for all services.
In addition, fluent-bit runs as a DaemonSet on every cluster node and forwards container stdout and stderr to Loki with the `level = console` label. This captures anything written outside the structured logger — startup output and fatal crashes before the logger initialized.
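To spot-check that console logs are actually arriving, you can query Loki's HTTP API directly. A minimal sketch from a workstation with kubectl access, assuming the in-cluster Loki Service matches the default URL above:

```bash
# Forward the Loki port locally, then list the streams carrying the
# console label that fluent-bit attaches.
kubectl port-forward -n <namespace> svc/loki 3100:3100 &
sleep 2   # give the port-forward a moment to establish
curl -G -s "http://localhost:3100/loki/api/v1/series" \
  --data-urlencode 'match[]={level="console"}'
```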
Tune Loki retention for your needs
The default Loki retention is short and designed for live operational use, not long-term forensics or compliance. Increase retention — and the loki-data PVC size — to match your incident-response window. For long-term archiving, forward logs to an external system such as a SIEM.
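When resizing, check the volume's provisioned capacity first; the PVC name below comes from the note above:

```bash
# Provisioned capacity of the Loki data volume; a longer retention
# window needs proportionally more disk.
kubectl get pvc loki-data -n <namespace>
```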
`kubectl logs` shows only the stdout/stderr buffer Kubernetes keeps on the node. Because console logging is off by default, this buffer is mostly empty during normal operation. It is still useful for crash output before the logger started, or as a fallback when Loki is unreachable.
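For crash investigation, the most useful variant fetches the previous container run:

```bash
# stdout/stderr from the container's previous run, which is where
# crash output emitted before the structured logger initialized ends up.
kubectl logs <pod-name> -n <namespace> --previous
```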
What to check¶
Ensure fluent-bit is running on every node and structured logs are arriving in Loki.
Warning signs¶
- Grafana Explore returns no results for `product = excalibur-v4` while pods are running.
- Log volume in Loki drops to zero for services that should be active.
- Fluent-bit pods missing, not `Running`, or restarting repeatedly.
Where to look¶
-   Verify one fluent-bit pod per node is `Running`:

    ```bash
    kubectl get pods -n <namespace> -l app=fluent-bit
    ```

    Expected output:

    ```
    NAME               READY   STATUS    RESTARTS   AGE
    fluent-bit-2ks9z   1/1     Running   0          4d
    fluent-bit-9rwvv   1/1     Running   0          4d
    fluent-bit-nvfqb   1/1     Running   0          4d
    fluent-bit-pjb2c   1/1     Running   0          4d
    fluent-bit-v79gn   1/1     Running   0          4d
    ```

-   Check for log shipping errors from fluent-bit itself:

    ```bash
    kubectl logs -n <namespace> -l app=fluent-bit --tail=30
    ```

    Expected output:

    ```
    [2026/04/21 12:00:01] [ info] [output:loki:loki.0] loki.0, entity=http://loki:3100/loki/api/v1/push, http_status=204
    [2026/04/21 12:00:06] [ info] [output:loki:loki.0] loki.0, entity=http://loki:3100/loki/api/v1/push, http_status=204
    ```

-   Grafana Explore — `product = excalibur-v4` in Loki — confirm log streams are present for all active services.
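The Grafana Explore check can also be approximated from the CLI with Loki's instant-query API, reusing the port-forward from the earlier sketch. The `service` grouping label is an assumption; substitute whatever label your streams actually carry:

```bash
# Log volume over the last 5 minutes, grouped by service. Services
# that should be active but report zero (or are absent) need a look.
# NOTE: the "service" label is assumed; adjust to your stream labels.
curl -G -s "http://localhost:3100/loki/api/v1/query" \
  --data-urlencode 'query=sum by (service) (count_over_time({product="excalibur-v4"}[5m]))'
```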
DNS and Service Connectivity¶
What to check¶
Ensure Excalibur services can resolve and reach internal Kubernetes service names.
Why it matters¶
Excalibur services rely on Kubernetes DNS for service discovery. DNS issues often appear as application failures even when the pods themselves are healthy.
Warning signs¶
- Multiple services reporting connection errors or failing at the same time.
- Errors indicating failed name resolution in application logs.
- Services unable to reach internal dependencies.
Where to look¶
-   Test in-cluster DNS resolution from within the Excalibur namespace:

    ```bash
    kubectl run dns-test --image=busybox:1.36 --rm -it --restart=Never -n <namespace> -- nslookup core
    ```

    Expected output:

    ```
    Server:    <kube-dns-ip>
    Address:   <kube-dns-ip>:53

    Name:      core.<namespace>.svc.cluster.local
    Address:   <cluster-ip>
    ```

-   Grafana Explore — check for correlated connection errors across multiple Excalibur services in Loki.

-   Look for repeated connection or resolution failures:

    ```bash
    kubectl get events -n <namespace> --sort-by=.lastTimestamp
    ```

    Expected output:

    ```
    LAST SEEN   TYPE     REASON      OBJECT                               MESSAGE
    5m          Normal   Scheduled   pod/pam-tenant-0-deployment-<hash>   Successfully assigned <namespace>/pam-tenant-0-deployment-<hash> to <node>
    5m          Normal   Pulled      pod/pam-tenant-0-deployment-<hash>   Container image already present on machine
    3m          Normal   Created     pod/pam-tenant-0-deployment-<hash>   Created container pam
    ```
Escalate cluster-wide DNS failures
Cluster DNS components (such as CoreDNS) are typically managed by the Kubernetes platform team. If multiple services experience name resolution failures simultaneously, escalate to cluster administrators.
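Before escalating, it is worth confirming the symptom from the cluster side. A quick sketch; the `k8s-app=kube-dns` label covers CoreDNS on most distributions, including AKS:

```bash
# Are the cluster DNS pods themselves healthy, and are they logging
# errors? CoreDNS typically carries the k8s-app=kube-dns label.
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=20
```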