Cluster Health Signals

Even when Excalibur services look healthy, cluster-level issues can degrade the platform. This page describes the cluster signals the Excalibur operator should monitor alongside Excalibur workloads.

Prerequisites

All commands assume kubectl access to the cluster. Replace <namespace> with your Excalibur deployment's namespace.

Node Health

What to check

Ensure cluster nodes are healthy, available, and have spare capacity to absorb a node failure.

Why it matters

Healthy nodes are the foundation for HA — Kubernetes can only reschedule a failed pod onto another node if a healthy node has capacity. A single NotReady node in a multi-node cluster is usually absorbed transparently because replicas elsewhere keep serving traffic. Multiple unhealthy nodes, or a single-node cluster losing its only node, causes a hard outage.

Warning signs

  • Nodes reporting NotReady.
  • Nodes under DiskPressure or MemoryPressure.
  • Unexpected pod evictions.

Where to look

  • Check for NotReady status:

    kubectl get nodes
    

    Expected output:

    NAME                                 STATUS   ROLES    AGE     VERSION
    aks-systempool-32246347-vmss000000   Ready    <none>   4d      v1.33.7
    aks-userpool-26910173-vmss000004     Ready    <none>   4d      v1.33.7
    aks-userpool-26910173-vmss000057     Ready    <none>   4d      v1.33.7
    aks-userpool-26910173-vmss00007g     Ready    <none>   4d      v1.33.7
    aks-userpool-26910173-vmss00008e     Ready    <none>   4d      v1.33.7
    aks-userpool-26910173-vmss00008h     Ready    <none>   4d      v1.33.7
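
  • Check per-node usage to confirm spare capacity (this relies on the metrics-server, which most managed distributions include):

    kubectl top nodes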
    
  • Inspect Conditions (DiskPressure, MemoryPressure, PIDPressure) and recent events:

    kubectl describe node <node-name>
    

    Expected output (relevant sections):

    Conditions:
      Type                 Status  LastHeartbeatTime                 Reason                          Message
      ----                 ------  -----------------                 ------                          -------
      MemoryPressure       False   Tue, 21 Apr 2026 15:16:20 +0000   KubeletHasSufficientMemory      kubelet has sufficient memory available
      DiskPressure         False   Tue, 21 Apr 2026 15:16:20 +0000   KubeletHasNoDiskPressure        kubelet has no disk pressure
      PIDPressure          False   Tue, 21 Apr 2026 15:16:20 +0000   KubeletHasSufficientPID         kubelet has sufficient PID available
      Ready                True    Tue, 21 Apr 2026 15:16:20 +0000   KubeletReady                    kubelet is posting ready status
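
  • Surface unexpected evictions cluster-wide. Events age out (one hour by default), so this only covers recent history:

    kubectl get events -A --field-selector reason=Evicted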
    

Platform responsibility

Node lifecycle management is typically handled by cluster administrators. If the cluster uses a node autoscaler, node count changes are also managed by the platform team — sudden node removal can trigger pod rescheduling and may temporarily affect availability.


Resource Usage

What to check

Monitor CPU and memory usage trends across workloads.

Why it matters

Sustained high resource usage leads to OOM kills, restarts, and degraded performance. Usage trends often become visible days before a hard failure, leaving time to act.

Warning signs

  • Pods terminated due to OOMKilled.
  • Containers operating near their resource limits.
  • Pods stuck in Pending.

Where to look

  • Grafana — Excalibur Kubernetes Metrics dashboard, CPU seconds by container and Memory by pod panels.
  • Live per-pod resource usage:

    kubectl top pods -n <namespace> --sort-by=memory
    

    Expected output:

    NAME                                                    CPU(cores)   MEMORY(bytes)
    virtual-browser-tenant-0-deployment-5bdc5c4dc9-6c2fx    2m           737Mi
    tunnel-52-81-9866b7676-xzs6g                            26m          349Mi
    prometheus-7cbcfb4576-mdx7n                             10m          329Mi
    database-2                                              30m          253Mi
    core-7b9fdc7888-st8xb                                   145m         235Mi
    api-5884bcbf58-2r6kk                                    6m           208Mi
    loki-fb5687d79-62gzm                                    13m          196Mi
    database-1                                              32m          179Mi
    database-0                                              32m          170Mi
    squid-5fbd585d86-7nmhp                                  3m           167Mi
    repository-77dd849d8d-bfw7f                             8m           163Mi
    ...
    
  • Inspect Last State under each container, which records the previous run's termination reason and exit code. OOMKilled there means the container exceeded its memory limit:

    kubectl describe pod <pod-name> -n <namespace>
    

    Expected output (relevant sections):

    Containers:
      api:
        State:          Running
          Started:      Mon, 21 Apr 2026 09:15:42 +0000
        Last State:     Terminated
          Reason:       OOMKilled
          Exit Code:    137
          Started:      Sun, 20 Apr 2026 12:00:03 +0000
          Finished:     Mon, 21 Apr 2026 09:15:30 +0000
        Ready:          True
        Restart Count:  1
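
  • Compare live usage against configured limits; a minimal sketch using custom columns (adjust the path if your pods run multiple containers):

    kubectl get pods -n <namespace> -o custom-columns='NAME:.metadata.name,MEM_LIMIT:.spec.containers[*].resources.limits.memory'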
    

Pod Scheduling

What to check

Ensure new pods start successfully.

Warning signs

  • Pods stuck in Pending.
  • Scheduling errors in pod events.

Common causes:

  • Insufficient node capacity.
  • Image pull failures.
  • Volume binding issues.

Where to look

  • List stuck pods:

    kubectl get pods -n <namespace> --field-selector=status.phase=Pending
    

    Expected output:

    No resources found in <namespace> namespace.
    

    The expected steady-state result is No resources found. Any pod listed here requires investigation.
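
    A slightly wider variant of the same filter (optional, not part of the steady-state check above) also catches pods that started and later failed:

    kubectl get pods -n <namespace> --field-selector=status.phase!=Running,status.phase!=Succeeded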

  • Inspect the Events section for scheduling failure messages:

    kubectl describe pod <pod-name> -n <namespace>
    

    Expected output (relevant sections):

    Events:
      Type     Reason             Age    From               Message
      ----     ------             ----   ----               -------
      Warning  FailedScheduling   5m     default-scheduler  0/3 nodes are available: 2 Too many pods. preemption: 0/3 nodes are available: 2 No preemption victims found for incoming pod.
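
  • Check PersistentVolumeClaims when events point at volume binding; a PVC stuck in Pending blocks every pod that mounts it:

    kubectl get pvc -n <namespace>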
    

Log Collection

Excalibur services push structured logs directly to Loki over HTTP (default: http://loki:3100). Console logging to stdout is disabled by default. This makes Loki, queried through Grafana, the primary log store: it holds the configured window of structured log history for all services.

In addition, fluent-bit runs as a DaemonSet on every cluster node and forwards container stdout and stderr to Loki with the level = console label. This captures anything written outside the structured logger — startup output and fatal crashes before the logger initialized.
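
In Grafana Explore, a query along these lines (assuming the level = console label described above) surfaces that raw console output, for example when hunting for crash messages:

    {level="console"} |= "panic"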

Tune Loki retention for your needs

The default Loki retention is short and designed for live operational use, not long-term forensics or compliance. Increase retention — and the loki-data PVC size — to match your incident-response window. For long-term archiving, forward logs to an external system such as a SIEM.
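
As an illustration only, the relevant Loki (2.x-style) configuration keys look like this; the exact file location and values depend on how your deployment manages Loki:

    limits_config:
      retention_period: 720h   # about 30 days; size the loki-data PVC to match
    compactor:
      retention_enabled: true  # retention is enforced by the compactor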

kubectl logs shows only the stdout/stderr buffer Kubernetes keeps on the node. Because console logging is off by default, this buffer is mostly empty during normal operation. It is still useful for crash output before the logger started, or as a fallback when Loki is unreachable.
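
For example, to pull the crash output of a container's previous run from that node-side buffer:

    kubectl logs <pod-name> -n <namespace> --previous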

What to check

Ensure fluent-bit is running on every node and structured logs are arriving in Loki.

Warning signs

  • Grafana Explore returns no results for product = excalibur-v4 while pods are running.
  • Log volume in Loki drops to zero for services that should be active.
  • Fluent-bit pods not running or restarting.

Where to look

  • Verify one fluent-bit pod per node is Running:

    kubectl get pods -n <namespace> -l app=fluent-bit
    

    Expected output:

    NAME               READY   STATUS    RESTARTS   AGE
    fluent-bit-2ks9z   1/1     Running   0          4d
    fluent-bit-9rwvv   1/1     Running   0          4d
    fluent-bit-nvfqb   1/1     Running   0          4d
    fluent-bit-pjb2c   1/1     Running   0          4d
    fluent-bit-v79gn   1/1     Running   0          4d
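
  • Alternatively, compare desired versus ready pods at the DaemonSet level (this assumes the DaemonSet object is named fluent-bit):

    kubectl get daemonset fluent-bit -n <namespace>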
    
  • Check for log shipping errors from fluent-bit itself:

    kubectl logs -n <namespace> -l app=fluent-bit --tail=30
    

    Expected output:

    [2026/04/21 12:00:01] [ info] [output:loki:loki.0] loki:3100, HTTP status=204
    [2026/04/21 12:00:06] [ info] [output:loki:loki.0] loki:3100, HTTP status=204
    
  • Grafana Explore: query product = excalibur-v4 in Loki and confirm log streams are present for all active services.


DNS and Service Connectivity

What to check

Ensure Excalibur services can resolve and reach internal Kubernetes service names.

Why it matters

Excalibur services rely on Kubernetes DNS for service discovery. DNS issues often appear as application failures even when the pods themselves are healthy.

Warning signs

  • Multiple services reporting connection errors at the same time.
  • Errors indicating failed name resolution in application logs.
  • Services unable to reach internal dependencies.
  • Sudden failures across several components simultaneously.

Where to look

  • Test in-cluster DNS resolution from within the Excalibur namespace:

    kubectl run dns-test --image=busybox:1.36 --rm -it --restart=Never -n <namespace> -- nslookup core
    

    Expected output:

    Server:    <kube-dns-ip>
    Address:   <kube-dns-ip>:53
    
    Name:      core.<namespace>.svc.cluster.local
    Address:   <cluster-ip>
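
  • If names resolve but connections still fail, test raw TCP reachability with a throwaway curl pod (a sketch; substitute the service's port for <port>). A successful connection prints Connected before the command times out:

    kubectl run net-test --image=curlimages/curl:8.7.1 --rm -it --restart=Never -n <namespace> -- curl -v --max-time 5 telnet://core:<port>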
    
  • Grafana Explore — check for correlated connection errors across multiple Excalibur services in Loki.

  • Look for repeated connection or resolution failures in namespace events; a healthy namespace shows only Normal events:

    kubectl get events -n <namespace> --sort-by=.lastTimestamp
    

    Expected output:

    LAST SEEN   TYPE      REASON              OBJECT                                          MESSAGE
    5m          Normal    Scheduled           pod/pam-tenant-0-deployment-<hash>               Successfully assigned <namespace>/pam-tenant-0-deployment-<hash> to <node>
    5m          Normal    Pulled              pod/pam-tenant-0-deployment-<hash>               Container image already present on machine
    3m          Normal    Created             pod/pam-tenant-0-deployment-<hash>               Created container pam
    

Escalate cluster-wide DNS failures

Cluster DNS components (such as CoreDNS) are typically managed by the Kubernetes platform team. If multiple services experience name resolution failures simultaneously, escalate to cluster administrators.