Service Down

🚫 **Severity**: Critical | **Alert Threshold**: Health check failures for 2+ minutes

Overview

This alert triggers when the service health endpoints fail to respond for more than 2 consecutive minutes.

Immediate Actions

⚠️ **First 5 Minutes**: Focus on restoration, not root cause analysis.

  1. Determine whether the outage is global or regional
  2. Attempt service restart
  3. Communicate to stakeholders

Diagnostic Steps

Verify the Outage

# Check all health endpoints (print only the status code)
curl -s -o /dev/null -w "%{http_code}\n" https://demo.thinkhive.ai/health/live
curl -s -o /dev/null -w "%{http_code}\n" https://demo.thinkhive.ai/health/ready
curl -s -o /dev/null -w "%{http_code}\n" https://app.thinkhive.ai/health/live
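The individual checks above can also be swept in one pass. A minimal sketch (endpoint list taken from the checks above; the `classify` helper is illustrative, not part of any existing tooling):

```shell
#!/usr/bin/env bash
# Sweep all health endpoints and flag anything that is not HTTP 200.
ENDPOINTS=(
  https://demo.thinkhive.ai/health/live
  https://demo.thinkhive.ai/health/ready
  https://app.thinkhive.ai/health/live
)

# classify CODE -> "healthy" for 200, "unhealthy" for anything else
classify() {
  case "$1" in
    200) echo healthy ;;
    *)   echo unhealthy ;;
  esac
}

for url in "${ENDPOINTS[@]}"; do
  code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 "$url")
  printf '%s %s (%s)\n' "$url" "$(classify "${code:-000}")" "${code:-000}"
done
```

`--max-time 5` keeps the sweep fast during an incident; a hung endpoint reports as `000`/unhealthy instead of blocking the loop.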

Check Cloud Run Status

# View service status
gcloud run services describe thinkhive-demo --region us-central1
 
# Check for recent errors
gcloud logging read "resource.type=cloud_run_revision AND severity>=ERROR" \
  --limit 20 --freshness=10m
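If other services share the project, the log query can be narrowed to this service; `resource.labels.service_name` is the standard Cloud Run label for this (a sketch, same `gcloud logging read` call as above):

```shell
# Errors from the thinkhive-demo service only
gcloud logging read \
  'resource.type=cloud_run_revision AND resource.labels.service_name="thinkhive-demo" AND severity>=ERROR' \
  --limit 20 --freshness=10m
```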

Check Serving Revisions

# See which revisions are receiving traffic
gcloud run services describe thinkhive-demo \
  --region us-central1 \
  --format "value(status.traffic)"
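If traffic assignments look normal but nothing responds, listing the revisions shows whether the latest rollout is actually healthy (a sketch; same service and region as above):

```shell
# List revisions and their readiness/serving state
gcloud run revisions list \
  --service thinkhive-demo \
  --region us-central1
```

A brand-new revision stuck in a non-ready state here points at a bad deploy rather than an infrastructure problem.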

Check Dependencies

| Service | Status Page |
| --- | --- |
| Google Cloud | https://status.cloud.google.com |
| Neon Database | https://neonstatus.com |
| Auth0 | https://status.auth0.com |
| OpenAI | https://status.openai.com |
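Some of these pages can be checked from a script. Google Cloud publishes `incidents.json`; the Auth0 and OpenAI pages appear to be Statuspage-hosted, where `/api/v2/status.json` is the usual convention (verify per provider; no format is assumed for Neon):

```shell
# Machine-readable dependency checks (paths are conventions; verify per provider)
curl -s --max-time 5 https://status.cloud.google.com/incidents.json | head -c 300
echo
# Statuspage-hosted pages typically expose /api/v2/status.json
for host in status.auth0.com status.openai.com; do
  curl -s --max-time 5 "https://$host/api/v2/status.json"
  echo
done
```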

Remediation

When: Service is unresponsive but infrastructure is healthy

# Force new revision deployment
gcloud run services update thinkhive-demo \
  --region us-central1 \
  --update-env-vars FORCE_RESTART=$(date +%s)
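If the forced restart does not recover the service and a recent deploy is suspect, rolling traffic back to the last known-good revision is another option (a sketch; `thinkhive-demo-00041-abc` is a placeholder, get real revision names from `gcloud run revisions list`):

```shell
# Route all traffic back to a known-good revision
# (thinkhive-demo-00041-abc is a placeholder revision name)
gcloud run services update-traffic thinkhive-demo \
  --region us-central1 \
  --to-revisions thinkhive-demo-00041-abc=100
```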

Communication Template

**Incident**: ThinkHive Service Disruption
**Status**: Investigating / Identified / Resolved
**Impact**: [Describe user impact]
**Start Time**: [UTC timestamp]
**Updates**:
- [Time]: [Update]

Post-Incident

  1. Document timeline of events
  2. Identify root cause
  3. Create tickets for preventive measures
  4. Update runbook if needed
  5. Schedule post-mortem meeting

Prevention

  • Implement multi-region deployment
  • Set up synthetic monitoring
  • Configure proper health checks
  • Implement graceful degradation
  • Regular disaster recovery drills