Service Down

🚫 **Severity**: Critical | **Alert Threshold**: Health check failures for 2+ minutes

Overview

This alert triggers when the service health endpoints fail to respond for more than 2 consecutive minutes.

Immediate Actions

⚠️ **First 5 Minutes**: Focus on restoration, not root cause analysis.

  1. Determine whether the outage is global or regional
  2. Attempt service restart
  3. Communicate to stakeholders

Diagnostic Steps

Verify the Outage

# Check all health endpoints (print only the status code)
curl -s -o /dev/null -w "%{http_code}\n" https://demo.thinkhive.ai/health/live
curl -s -o /dev/null -w "%{http_code}\n" https://demo.thinkhive.ai/health/ready
curl -s -o /dev/null -w "%{http_code}\n" https://app.thinkhive.ai/health/live
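The individual checks above can also be swept in one pass. A minimal sketch (endpoint list taken from the checks above; the `classify` helper is illustrative, not part of any existing tooling):

```shell
#!/usr/bin/env bash
# Sweep all health endpoints and flag anything that is not HTTP 200.
ENDPOINTS=(
  https://demo.thinkhive.ai/health/live
  https://demo.thinkhive.ai/health/ready
  https://app.thinkhive.ai/health/live
)

# classify CODE -> "healthy" for 200, "unhealthy" for anything else
classify() {
  case "$1" in
    200) echo healthy ;;
    *)   echo unhealthy ;;
  esac
}

for url in "${ENDPOINTS[@]}"; do
  code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 "$url")
  printf '%s %s (%s)\n' "$url" "$(classify "${code:-000}")" "${code:-000}"
done
```

`--max-time 5` keeps the sweep fast during an incident; a hung endpoint reports as `000`/unhealthy instead of blocking the loop.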

Check Cloud Run Status

# View service status
gcloud run services describe thinkhive-demo --region us-central1
 
# Check for recent errors
gcloud logging read "resource.type=cloud_run_revision AND severity>=ERROR" \
  --limit 20 --freshness=10m
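If other services share the project, the log query can be narrowed to this service; `resource.labels.service_name` is the standard Cloud Run label for this (a sketch, same `gcloud logging read` call as above):

```shell
# Errors from the thinkhive-demo service only
gcloud logging read \
  'resource.type=cloud_run_revision AND resource.labels.service_name="thinkhive-demo" AND severity>=ERROR' \
  --limit 20 --freshness=10m
```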

Check Serving Revisions

# See which revisions are receiving traffic
gcloud run services describe thinkhive-demo \
  --region us-central1 \
  --format "value(status.traffic)"
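If traffic assignments look normal but nothing responds, listing the revisions shows whether the latest rollout is actually healthy (a sketch; same service and region as above):

```shell
# List revisions and their readiness/serving state
gcloud run revisions list \
  --service thinkhive-demo \
  --region us-central1
```

A brand-new revision stuck in a non-ready state here points at a bad deploy rather than an infrastructure problem.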

Check Dependencies

| Service | Status Page |
| --- | --- |
| Google Cloud | https://status.cloud.google.com |
| Neon Database | https://neonstatus.com |
| Auth0 | https://status.auth0.com |
| OpenAI | https://status.openai.com |
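Some of these pages can be checked from a script. Google Cloud publishes `incidents.json`; the Auth0 and OpenAI pages appear to be Statuspage-hosted, where `/api/v2/status.json` is the usual convention (verify per provider; no format is assumed for Neon):

```shell
# Machine-readable dependency checks (paths are conventions; verify per provider)
curl -s --max-time 5 https://status.cloud.google.com/incidents.json | head -c 300
echo
# Statuspage-hosted pages typically expose /api/v2/status.json
for host in status.auth0.com status.openai.com; do
  curl -s --max-time 5 "https://$host/api/v2/status.json"
  echo
done
```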

Remediation

When: Service is unresponsive but infrastructure is healthy

# Force new revision deployment
gcloud run services update thinkhive-demo \
  --region us-central1 \
  --update-env-vars FORCE_RESTART=$(date +%s)
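If the forced restart does not recover the service and a recent deploy is suspect, rolling traffic back to the last known-good revision is another option (a sketch; `thinkhive-demo-00041-abc` is a placeholder, get real revision names from `gcloud run revisions list`):

```shell
# Route all traffic back to a known-good revision
# (thinkhive-demo-00041-abc is a placeholder revision name)
gcloud run services update-traffic thinkhive-demo \
  --region us-central1 \
  --to-revisions thinkhive-demo-00041-abc=100
```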

Communication Template

**Incident**: ThinkHive Service Disruption
**Status**: Investigating / Identified / Resolved
**Impact**: [Describe user impact]
**Start Time**: [UTC timestamp]
**Updates**:
- [Time]: [Update]

Post-Incident

  1. Document timeline of events
  2. Identify root cause
  3. Create tickets for preventive measures
  4. Update runbook if needed
  5. Schedule post-mortem meeting

Prevention

  • Implement multi-region deployment
  • Set up synthetic monitoring
  • Configure proper health checks
  • Implement graceful degradation
  • Regular disaster recovery drills