Circuit Breaker Open
⚠️
Severity: Medium-High | Alert Threshold: Circuit breaker opened for any service
Overview
This alert fires when a circuit breaker opens after repeated failed requests to an external dependency. The circuit breaker pattern prevents cascading failures by halting requests to an unhealthy dependency until it recovers.
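The pattern can be pictured as a small state machine. The sketch below is a minimal illustration in Bash, not this service's actual implementation: the function names (`cb_allow`, `cb_record_success`, `cb_record_failure`), the threshold of 3 failures, and the 30-second timeout are all assumptions chosen for the example. Time is passed in explicitly so the transitions are easy to exercise by hand.

```bash
#!/usr/bin/env bash
# Minimal circuit breaker state machine (illustrative sketch only).
# All names and values here are assumptions, not the service's real config.

FAILURE_THRESHOLD=3   # failures before the circuit opens
RESET_TIMEOUT=30      # seconds to stay OPEN before allowing a trial request

cb_state="CLOSED"
cb_failures=0
cb_opened_at=0

# cb_allow NOW: exit 0 if a request may be sent at time NOW (epoch seconds).
cb_allow() {
  local now=$1
  if [ "$cb_state" = "OPEN" ]; then
    if [ $(( now - cb_opened_at )) -ge "$RESET_TIMEOUT" ]; then
      cb_state="HALF-OPEN"   # timeout expired: allow a trial request
      return 0
    fi
    return 1                 # still OPEN: block the request
  fi
  return 0                   # CLOSED or HALF-OPEN
}

# cb_record_success: a successful call closes the circuit and clears the count.
cb_record_success() {
  cb_state="CLOSED"
  cb_failures=0
}

# cb_record_failure NOW: count a failure; open the circuit when the threshold
# is reached, or immediately if the failed call was the HALF-OPEN trial.
cb_record_failure() {
  local now=$1
  cb_failures=$(( cb_failures + 1 ))
  if [ "$cb_state" = "HALF-OPEN" ] || [ "$cb_failures" -ge "$FAILURE_THRESHOLD" ]; then
    cb_state="OPEN"
    cb_opened_at=$now
  fi
}
```

In a real caller, NOW would typically be `$(date +%s)`, and each outbound call would check `cb_allow`, make the request, then record the result. The functions mutate shell globals, so they must be called directly rather than inside `$(...)`.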
What Circuit Breakers Protect
| Service | Purpose | Impact When Open |
|---|---|---|
| OpenAI | LLM analysis | Explainability degraded |
| Auth0 | Authentication | New logins fail |
| Neon DB | Data storage | Core functionality impacted |
| Pinecone | Vector search | Semantic search unavailable |
Diagnostic Steps
Identify Which Circuit Opened
# Check logs for circuit breaker events
gcloud logging read 'textPayload=~"circuit" OR textPayload=~"breaker"' \
  --limit 30 \
  --freshness=30m
Check Dependency Status
| Service | Status Page |
|---|---|
| OpenAI | https://status.openai.com |
| Auth0 | https://status.auth0.com |
| Neon | https://neonstatus.com |
| Pinecone | https://status.pinecone.io |
Review Failure Patterns
# Look for the failures that triggered the circuit
gcloud logging read 'severity>=ERROR' --limit 50 --freshness=30m
Check Network Issues
# Test connectivity from Cloud Run: either deploy a test container,
# or search the Cloud Run logs for timeout and connection errors
gcloud logging read 'textPayload=~"ETIMEDOUT" OR textPayload=~"ECONNREFUSED"' --limit 20
Remediation by Service
OpenAI
Symptoms: Explainability features return errors
Actions:
- Check https://status.openai.com
- Verify API key is valid
- Check rate limits
- Consider fallback to cached responses
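The key and rate-limit checks above can be approximated with a single probe request. The sketch below assumes the key is available in an environment variable named `OPENAI_API_KEY` (a hypothetical name for this service; the key may live elsewhere, such as a secret manager) and uses the public `GET /v1/models` endpoint. The helper names `classify_status` and `probe_openai` are illustrative. A 200 only confirms that the key is accepted; it does not prove that model-specific rate limits are healthy.

```bash
#!/usr/bin/env bash
# Map the HTTP status code of a probe request to a likely cause.
classify_status() {
  case "$1" in
    200) echo "ok: key accepted" ;;
    401) echo "invalid or revoked API key" ;;
    429) echo "rate limit or quota exceeded" ;;
    5??) echo "provider-side error" ;;
    000) echo "no HTTP response (network or DNS failure)" ;;
    *)   echo "unexpected status $1" ;;
  esac
}

# Probe the models endpoint and print only the HTTP status code.
# Assumes OPENAI_API_KEY is exported (hypothetical variable name).
probe_openai() {
  curl -sS -o /dev/null -w '%{http_code}' --max-time 10 \
    -H "Authorization: Bearer ${OPENAI_API_KEY}" \
    https://api.openai.com/v1/models
}

# Usage (performs a real network call, so it is not run here):
#   classify_status "$(probe_openai)"
```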
Manual Reset:
# Roll out a new revision (by changing an env var) to restart the service and reset the circuit
gcloud run services update thinkhive-demo \
  --region us-central1 \
  --update-env-vars CIRCUIT_RESET=$(date +%s)
Circuit Breaker States
CLOSED (normal)
↓ failures exceed threshold
OPEN (blocking requests)
↓ timeout period expires
HALF-OPEN (testing)
↓ success → CLOSED
↓ failure → OPEN
Manual Circuit Reset
If the dependency is confirmed healthy but the circuit remains open:
# Force restart to reset all circuits
gcloud run services update thinkhive-demo \
  --region us-central1 \
  --update-env-vars FORCE_RESTART=$(date +%s)
Graceful Degradation
When circuits open, the service should:
- Return cached data if available
- Show user-friendly error messages
- Log detailed failure information
- Contain the failure so it does not cascade to other services
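The behaviours listed above can be sketched as a small wrapper that serves the last good response when a dependency call fails. This is an assumption-laden illustration, not the service's implementation: `fetch_with_fallback`, the file-based cache, and the wording of the user-facing message are all hypothetical.

```bash
#!/usr/bin/env bash
# fetch_with_fallback CACHE_FILE CMD [ARGS...]
# Runs CMD. On success, prints its output and refreshes CACHE_FILE.
# On failure, logs the failed command to stderr, then serves the cached copy
# if one exists; otherwise prints a user-friendly message and returns non-zero.
fetch_with_fallback() {
  local cache_file=$1; shift
  local output
  if output=$("$@"); then
    printf '%s\n' "$output" > "$cache_file"   # remember the last good response
    printf '%s\n' "$output"
    return 0
  fi
  echo "dependency call failed: $*" >&2        # detailed log for operators
  if [ -f "$cache_file" ]; then
    cat "$cache_file"                          # stale but usable data
    return 0
  fi
  echo "This feature is temporarily unavailable. Please try again later."
  return 1
}
```

The failure is handled inside the wrapper, so callers see either data or a clean error rather than a raw dependency exception. Serving stale data is only appropriate where staleness is acceptable, which is a per-feature decision.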
Prevention
- Set appropriate timeout values
- Configure retry with exponential backoff
- Monitor dependency health proactively
- Implement fallback responses
- Use multiple providers where possible
- Run dependency health checks on a regular schedule
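As one illustration of the retry item above, the sketch below wraps a command in retries with exponential backoff. The function name `retry_with_backoff` and its arguments are hypothetical, and the delay simply doubles after each failure; production implementations commonly add random jitter and a maximum delay, which are omitted here for brevity.

```bash
#!/usr/bin/env bash
# retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS CMD [ARGS...]
# Runs CMD until it succeeds or MAX_ATTEMPTS is reached, doubling the wait
# after each failure (BASE, 2*BASE, 4*BASE, ...).
retry_with_backoff() {
  local max_attempts=$1 delay=$2; shift 2
  local attempt=1
  while true; do
    if "$@"; then
      return 0                                   # success: stop retrying
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$(( delay * 2 ))                       # exponential growth
    attempt=$(( attempt + 1 ))
  done
}
```

For example, `retry_with_backoff 4 1 curl -fsS https://status.pinecone.io` would try up to four times, waiting 1, 2, and 4 seconds between attempts. Keeping the attempt count low matters here: aggressive retries against a struggling dependency add load and can delay the circuit's recovery.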