
Monitoring

PDB Operator ships with built-in observability: Prometheus metrics, OpenTelemetry tracing, and structured logging.

Prometheus Metrics

All metrics are exposed on the metrics endpoint (default :8443 over HTTPS) and can be scraped by Prometheus via the ServiceMonitor.

| Metric | Type | Description |
| --- | --- | --- |
| pdb_operator_reconciliation_duration_seconds | Histogram | Reconciliation duration |
| pdb_operator_reconciliation_errors_total | Counter | Reconciliation errors |
| pdb_operator_pdbs_created_total | Counter | PDBs created |
| pdb_operator_pdbs_updated_total | Counter | PDBs updated |
| pdb_operator_pdbs_deleted_total | Counter | PDBs deleted |
| pdb_operator_deployments_managed | Gauge | Managed deployments per namespace/class |
| pdb_operator_policies_active | Gauge | Active policies per namespace |
| pdb_operator_compliance_status | Gauge | Deployment compliance status |
| pdb_operator_maintenance_window_active | Gauge | Maintenance window active |
| pdb_operator_enforcement_decisions_total | Counter | Enforcement decisions |
| pdb_operator_override_attempts_total | Counter | Override attempts |

See the Metrics Reference for full label details and sample queries.
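As a starting point before consulting the reference, the sketch below sends two PromQL expressions to the Prometheus HTTP API with curl. The Prometheus address (`PROM_URL`), the helper name `prom_query`, and the variable names are assumptions for illustration. The expressions use only metric names from the table above, plus the `_bucket` suffix that Prometheus adds to histogram series; no operator-specific labels are assumed.

```shell
# Assumption: Prometheus is reachable at PROM_URL, for example through
# `kubectl port-forward`. Adjust this to your environment.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# prom_query EXPR: run an instant query through the Prometheus HTTP API.
prom_query() {
  curl -sS --get "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Reconciliation errors per second, averaged over the last 5 minutes.
ERROR_RATE='sum(rate(pdb_operator_reconciliation_errors_total[5m]))'

# 99th percentile reconciliation duration over the last 5 minutes.
P99_DURATION='histogram_quantile(0.99, sum by (le) (rate(pdb_operator_reconciliation_duration_seconds_bucket[5m])))'

# Usage:
#   prom_query "$ERROR_RATE"
#   prom_query "$P99_DURATION"
```

The response is the JSON document returned by the `/api/v1/query` endpoint; pipe it through a JSON formatter if you want to read it by eye.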

Enable ServiceMonitor

With Helm:

helm upgrade pdb-operator oci://ghcr.io/pdb-operator/charts/pdb-operator \
--namespace pdb-operator-system \
--set serviceMonitor.enabled=true

Enable Prometheus Alerting Rules

The Helm chart includes 12 alert groups covering operator health, performance, circuit breaker, compliance, workqueue depth, resources, and more:

helm upgrade pdb-operator oci://ghcr.io/pdb-operator/charts/pdb-operator \
--namespace pdb-operator-system \
--set prometheusRule.enabled=true

OpenTelemetry Tracing

Tracing is enabled by default. Configure the OTLP collector endpoint:

helm upgrade pdb-operator oci://ghcr.io/pdb-operator/charts/pdb-operator \
--namespace pdb-operator-system \
--set tracing.endpoint=otel-collector.observability:4317

Or set the OTLP_ENDPOINT environment variable directly:

extraEnv:
  - name: OTLP_ENDPOINT
    value: "otel-collector.observability:4317"

Traces are exported over OTLP/gRPC and include spans for:

  • Policy resolution and evaluation
  • PDB creation, update, and deletion
  • Reconciliation loops with correlation IDs
  • Maintenance window checks

Structured Logging

The operator outputs JSON-formatted structured logs with:

  • Audit trails for policy and PDB changes
  • Correlation IDs and reconcile IDs for request tracing
  • Trace context propagation (W3C Trace Context)
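Because each log line carries a correlation ID, one way to follow a single request is to filter the log stream for that ID. The helper below is a minimal sketch: `follow_id` is a hypothetical name, and it matches the ID as a plain string anywhere in the line, so it does not depend on the exact JSON field name the operator uses.

```shell
# follow_id ID: read log lines on stdin and keep only those that mention ID.
# The match is a fixed-string match, so no JSON field name is assumed.
follow_id() {
  grep -F -e "$1"
}

# Usage (the deployment name is a placeholder, not taken from the chart):
#   kubectl logs -n pdb-operator-system deploy/<operator-deployment> \
#     | follow_id "<correlation-id>"
```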

Configure the log level:

helm upgrade pdb-operator oci://ghcr.io/pdb-operator/charts/pdb-operator \
--namespace pdb-operator-system \
--set controller.logLevel=debug
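The Helm commands on this page each pass a single `--set`. When `helm upgrade` receives new values without `--reuse-values`, Helm generally falls back to the chart defaults for everything else, so running the commands one after another can undo an earlier setting. A sketch that applies all of the settings from this page in one upgrade follows; the wrapper name `pdb_operator_upgrade` is hypothetical, and the values are the ones shown above.

```shell
# pdb_operator_upgrade [EXTRA_HELM_ARGS...]: apply the observability settings
# from this page in a single `helm upgrade`, so one --set cannot reset another.
pdb_operator_upgrade() {
  helm upgrade pdb-operator oci://ghcr.io/pdb-operator/charts/pdb-operator \
    --namespace pdb-operator-system \
    --set serviceMonitor.enabled=true \
    --set prometheusRule.enabled=true \
    --set tracing.endpoint=otel-collector.observability:4317 \
    --set controller.logLevel=debug \
    "$@"
}

# Usage:
#   pdb_operator_upgrade --dry-run   # render the release without applying it
#   pdb_operator_upgrade             # apply
```

For a long-lived installation, keeping the same settings in a values file passed with `-f` is usually easier to review than a list of `--set` flags.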

Kubernetes Events

The operator records events on both PDBPolicy and Deployment resources:

| Event | Description |
| --- | --- |
| PolicyApplied | Policy successfully applied to workloads |
| PolicyUpdated | Policy configuration updated |
| PolicyRemoved | Policy removed from workloads |
| PolicyConflict | Multiple policies match a deployment |
| PolicyEnforced | Enforcement mode blocked an override |
| PDBCreated | New PDB created for a deployment |
| PDBUpdated | Existing PDB updated |
| PDBDeleted | PDB removed |
| DeploymentManaged | Deployment is now managed by the operator |
| DeploymentSkipped | Deployment skipped (single replica or no match) |
| AnnotationAccepted | Annotation override accepted |
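To look for one of these events in a cluster, you can filter on the event reason with a kubectl field selector. The helper below is a sketch that assumes the names in the table are used as the event `reason`; `pdb_events` is a hypothetical name.

```shell
# pdb_events REASON [NAMESPACE]: list events whose reason equals REASON.
# Without NAMESPACE, all namespaces are searched.
pdb_events() {
  if [ -n "${2:-}" ]; then
    kubectl get events --namespace "$2" --field-selector "reason=$1"
  else
    kubectl get events --all-namespaces --field-selector "reason=$1"
  fi
}

# Usage:
#   pdb_events PolicyConflict
#   pdb_events PDBCreated my-namespace
```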

Health Endpoints

| Endpoint | Purpose |
| --- | --- |
| /healthz | Liveness probe, confirms the operator is running |
| /readyz | Readiness probe, confirms caches are synced and ready to serve |
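Both endpoints can be checked by hand through a port-forward. The sketch below prints the HTTP status code returned by each path. The probe port (8081 in the usage comment) and the deployment name are assumptions that this page does not confirm, so check the operator's Deployment manifest for the actual values; `check_health` is a hypothetical helper name.

```shell
# check_health BASE_URL: request /healthz and /readyz under BASE_URL and
# print each path followed by the HTTP status code it returned.
check_health() {
  for path in /healthz /readyz; do
    code="$(curl -s -o /dev/null -w '%{http_code}' "$1$path")"
    printf '%s %s\n' "$path" "$code"
  done
}

# Usage (port and deployment name are assumptions):
#   kubectl port-forward -n pdb-operator-system deploy/<operator-deployment> 8081:8081 &
#   check_health http://localhost:8081
```

A status of 200 on `/readyz` indicates the operator reports itself ready; any other code means the probe is failing or the port-forward is not pointing at the probe port.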

Verify Metrics

# Check the metrics service
kubectl get svc -n pdb-operator-system | grep metrics

# Verify ServiceMonitor is picked up by Prometheus
kubectl get servicemonitor -n pdb-operator-system

# Check PrometheusRule alerts
kubectl get prometheusrule -n pdb-operator-system
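To confirm that the operator's own metrics are actually being served, you can read the metrics endpoint directly and list the metric names it exposes. The filter below is a sketch: `pdb_metric_names` is a hypothetical helper, and the usage comment assumes the conventional `/metrics` path and bearer-token authentication on the HTTPS endpoint, neither of which is confirmed on this page.

```shell
# pdb_metric_names: read Prometheus text exposition format on stdin and print
# the distinct pdb_operator_* series names, one per line.
# Histograms appear as separate _bucket, _sum, and _count series.
pdb_metric_names() {
  grep -E '^pdb_operator_' | sed -E 's/[{ ].*$//' | sort -u
}

# Usage (service name, path, and token handling are assumptions):
#   kubectl port-forward -n pdb-operator-system svc/<metrics-service> 8443:8443 &
#   curl -sk -H "Authorization: Bearer $TOKEN" https://localhost:8443/metrics \
#     | pdb_metric_names
```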