Circuit Breaker Open

⚠️ Severity: Medium-High | Alert Threshold: Circuit breaker opened for any service

Overview

This alert fires when a circuit breaker opens after repeated failed calls to an external dependency. The circuit breaker pattern prevents cascading failures by blocking requests to an unhealthy dependency until it recovers.

What Circuit Breakers Protect

| Service | Purpose | Impact When Open |
|---|---|---|
| OpenAI | LLM analysis | Explainability degraded |
| Auth0 | Authentication | New logins fail |
| Neon DB | Data storage | Core functionality impacted |
| Pinecone | Vector search | Semantic search unavailable |

Diagnostic Steps

Identify Which Circuit Opened

# Check logs for circuit breaker events
gcloud logging read 'textPayload=~"circuit" OR textPayload=~"breaker"' \
  --limit 30 \
  --freshness=30m
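
To see at a glance which dependency is tripping, the matching entries can be tallied by dependency name. The following is a minimal sketch, assuming the service's log messages mention the dependency by name; the keyword list and both function names are hypothetical and should be adjusted to whatever the service actually logs. It runs the same query as above with `--format=json` added.

```python
import json
import subprocess
from collections import Counter

# Hypothetical keywords; adjust to the names the service actually logs.
DEPENDENCY_KEYWORDS = {
    "OpenAI": ("openai",),
    "Auth0": ("auth0",),
    "Neon DB": ("neon",),
    "Pinecone": ("pinecone",),
}


def count_dependency_mentions(entries):
    """Count log entries whose textPayload mentions each dependency."""
    counts = Counter()
    for entry in entries:
        text = str(entry.get("textPayload", "")).lower()
        for name, keywords in DEPENDENCY_KEYWORDS.items():
            if any(keyword in text for keyword in keywords):
                counts[name] += 1
    return counts


def read_circuit_entries(freshness="30m", limit=30):
    """Run the circuit-breaker log query and return the parsed JSON entries."""
    result = subprocess.run(
        [
            "gcloud", "logging", "read",
            'textPayload=~"circuit" OR textPayload=~"breaker"',
            f"--limit={limit}",
            f"--freshness={freshness}",
            "--format=json",
        ],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout or "[]")


# Usage (requires an authenticated gcloud CLI):
#   for name, n in count_dependency_mentions(read_circuit_entries()).most_common():
#       print(f"{name}: {n}")
```

A dependency that dominates the tally is the likely owner of the open circuit; an empty tally suggests the service logs circuit events under different wording.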

Check Dependency Status

| Service | Status Page |
|---|---|
| OpenAI | https://status.openai.com |
| Auth0 | https://status.auth0.com |
| Neon | https://neonstatus.com |
| Pinecone | https://status.pinecone.io |

Review Failure Patterns

# Look for the failures that triggered the circuit
gcloud logging read 'severity>=ERROR' --limit 50 --freshness=30m

Check Network Issues

# Check Cloud Run logs for timeout or connection-refused patterns
# (or deploy a test container to probe connectivity directly)
gcloud logging read 'textPayload=~"ETIMEDOUT" OR textPayload=~"ECONNREFUSED"' --limit 20 --freshness=30m
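
If a test container is deployed, a direct TCP probe separates "cannot reach the host" from "the host is reachable but the API is failing". This is a minimal sketch; `check_tcp` is a hypothetical helper, and apart from `api.openai.com` the hostnames to probe are tenant-specific and must be taken from the service configuration.

```python
import socket


def check_tcp(host, port=443, timeout=5.0):
    """Try to open a TCP connection and return (ok, detail)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except socket.timeout:
        return False, "timeout"      # comparable to ETIMEDOUT in the logs
    except ConnectionRefusedError:
        return False, "refused"      # comparable to ECONNREFUSED in the logs
    except OSError as exc:
        return False, f"error: {exc}"  # DNS failure, unreachable network, etc.


# Usage: add the Auth0, Neon, and Pinecone hosts from your own configuration.
#   for host in ["api.openai.com"]:
#       print(host, check_tcp(host))
```

A "timeout" result points at egress or routing problems, while "connected" shifts suspicion to credentials, rate limits, or the provider itself.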

Remediation by Service

OpenAI

Symptoms: Explainability features returning errors

Actions:

  1. Check https://status.openai.com
  2. Verify that the API key is valid
  3. Check whether rate limits are being hit
  4. Consider falling back to cached responses
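
Steps 2 and 3 can be checked with one cheap authenticated request. The sketch below calls OpenAI's model-list endpoint (`GET /v1/models`) and maps the HTTP status to a likely cause; the function names and category labels are hypothetical, and the mapping is a rough triage aid, not a full error taxonomy.

```python
import urllib.error
import urllib.request


def classify_status(status):
    """Map an HTTP status from the API to a likely cause."""
    if 200 <= status < 300:
        return "ok"
    if status == 401:
        return "invalid_api_key"
    if status == 429:
        return "rate_limited_or_quota"
    if status >= 500:
        return "provider_error"
    return "other_client_error"


def probe_openai(api_key, timeout=10.0):
    """Send one authenticated request and classify the outcome."""
    request = urllib.request.Request(
        "https://api.openai.com/v1/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return classify_status(response.status)
    except urllib.error.HTTPError as exc:
        # Non-2xx responses arrive as HTTPError; the status is in exc.code.
        return classify_status(exc.code)
    except OSError:
        # URLError and socket timeouts are both OSError subclasses.
        return "network_error"


# Usage: print(probe_openai(os.environ["OPENAI_API_KEY"]))
```

Run it with the same key the service uses; a result of "ok" means the key and connectivity are fine and the circuit can be reset once failures stop.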

Manual Reset:

# Deploy a new revision (restarts instances) to reset the circuit
gcloud run services update thinkhive-demo \
  --region us-central1 \
  --update-env-vars CIRCUIT_RESET=$(date +%s)

Circuit Breaker States

CLOSED (normal)
    ↓ failures exceed threshold
OPEN (blocking requests)
    ↓ timeout period expires
HALF-OPEN (testing)
    ↓ success → CLOSED
    ↓ failure → OPEN
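
The transitions above can be sketched as a small state machine. This is an illustrative sketch, not the service's actual implementation: the class name, the default threshold of 5 failures, and the 30-second reset timeout are assumptions, and the circuit opens when the failure count reaches the threshold.

```python
import time


class CircuitBreaker:
    """Minimal CLOSED / OPEN / HALF-OPEN state machine (illustrative only)."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed default
        self.reset_timeout = reset_timeout          # seconds; assumed default
        self._clock = clock                         # injectable for testing
        self.state = "CLOSED"
        self._failures = 0
        self._opened_at = 0.0

    def allow_request(self):
        """Return True if a call to the dependency may proceed."""
        if self.state == "OPEN":
            if self._clock() - self._opened_at >= self.reset_timeout:
                self.state = "HALF-OPEN"  # timeout expired: allow a trial call
                return True
            return False
        return True

    def record_success(self):
        """A successful call closes the circuit and clears the failure count."""
        self._failures = 0
        self.state = "CLOSED"

    def record_failure(self):
        """A failed trial call, or too many failures, opens the circuit."""
        self._failures += 1
        if self.state == "HALF-OPEN" or self._failures >= self.failure_threshold:
            self.state = "OPEN"
            self._opened_at = self._clock()
```

Callers check `allow_request()` before each call and report the outcome with `record_success()` or `record_failure()`. Production libraries usually limit HALF-OPEN to a single trial request, which this sketch does not do.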

Manual Circuit Reset

If the dependency is confirmed healthy but the circuit remains open:

# Force restart to reset all circuits
gcloud run services update thinkhive-demo \
  --region us-central1 \
  --update-env-vars FORCE_RESTART=$(date +%s)

Graceful Degradation

When circuits open, the service should:

  • Return cached data if available
  • Show user-friendly error messages
  • Log detailed failure information
  • Contain the failure so it does not cascade to other services
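
A minimal sketch of these four behaviors in one wrapper follows. The function name, the plain-dict cache, and the wording of the error message are all hypothetical; a real service would likely use its own cache layer and error format.

```python
import logging

logger = logging.getLogger("degradation")


def fetch_with_fallback(key, fetch, cache):
    """Return (value, degraded); fall back to cached data if fetch fails."""
    try:
        value = fetch(key)
    except Exception as exc:
        # Log the detailed failure for operators, including the traceback.
        logger.warning("dependency call failed for %r: %s", key, exc, exc_info=True)
        if key in cache:
            return cache[key], True  # serve cached data, flagged as degraded
        # No cached data: return a user-friendly message, not a stack trace.
        message = "This feature is temporarily unavailable. Please try again shortly."
        return {"error": message}, True
    cache[key] = value  # refresh the cache on every successful call
    return value, False
```

Because the exception is caught inside the wrapper, a failing dependency degrades one feature and does not propagate errors to callers.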

Prevention

  • Set appropriate timeout values
  • Configure retry with exponential backoff
  • Monitor dependency health proactively
  • Implement fallback responses
  • Use multiple providers where possible
  • Run regular dependency health checks
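
Retry with exponential backoff can be sketched as follows. The function names and default values (4 attempts, 0.5 s base delay, 8 s cap) are assumptions for illustration, and the random "full jitter" is an optional addition that spreads out retries from many clients.

```python
import random
import time


def backoff_delays(max_attempts=4, base_delay=0.5, max_delay=8.0):
    """Delays before each retry: base_delay * 2**n, capped at max_delay."""
    return [min(max_delay, base_delay * 2 ** n) for n in range(max_attempts - 1)]


def retry_with_backoff(fn, max_attempts=4, base_delay=0.5, max_delay=8.0,
                       jitter=True, sleep=time.sleep):
    """Call fn until it succeeds or max_attempts is reached."""
    delays = backoff_delays(max_attempts, base_delay, max_delay)
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = delays[attempt]
            if jitter:
                delay = random.uniform(0, delay)  # full jitter
            sleep(delay)
```

Keep `max_attempts` small when a circuit breaker wraps the same call: every retry counts as another failure toward the threshold, so aggressive retries open the circuit sooner.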