# Service Down
> 🚫 **Severity**: Critical | **Alert Threshold**: Health check failures for 2+ minutes
## Overview
This alert triggers when the service health endpoints fail to respond for more than 2 consecutive minutes.
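The failing probe can be reproduced from a shell. Below is a minimal sketch of the kind of check behind this alert, assuming the `/health/live` endpoint listed in the diagnostics and a 2xx-means-healthy convention:

```bash
#!/usr/bin/env bash
# Minimal health-probe sketch (assumption: /health/live returns 2xx when healthy).
HEALTH_URL="${HEALTH_URL:-https://demo.thinkhive.ai/health/live}"

probe_once() {
  # Print the HTTP status code; "000" means the connection itself failed.
  local code
  code=$(curl -s -o /dev/null -m 5 -w "%{http_code}" "$HEALTH_URL") || true
  echo "${code:-000}"
}

is_healthy() {
  # Treat any 2xx response as healthy.
  case "$1" in
    2??) return 0 ;;
    *)   return 1 ;;
  esac
}
```

Two consecutive failed probes a minute apart correspond to the 2-minute alert threshold.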
## Immediate Actions

> ⚠️ **First 5 Minutes**: Focus on restoration, not root cause analysis.
- Determine whether the outage is global or limited to one service/region
- Attempt a service restart
- Communicate status to stakeholders
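The first question can be answered quickly by comparing the two public hosts from the diagnostic section. A hedged sketch (the up/down classification and the "global vs. partial" labels are assumptions, not an official tool):

```bash
#!/usr/bin/env bash
# Sketch: is the outage affecting both public hosts or just one?
status_of() {
  # Print "up" or "down" for one host's liveness endpoint (5s timeout).
  local code
  code=$(curl -s -o /dev/null -m 5 -w "%{http_code}" "https://$1/health/live") || true
  case "$code" in 2??) echo up ;; *) echo down ;; esac
}

classify_scope() {
  # $1/$2: up|down for demo.thinkhive.ai and app.thinkhive.ai respectively.
  if [ "$1" = down ] && [ "$2" = down ]; then
    echo global     # both failing: shared infrastructure or upstream dependency
  elif [ "$1" = down ] || [ "$2" = down ]; then
    echo partial    # one failing: scoped to a single service
  else
    echo none
  fi
}
```

Usage: `classify_scope "$(status_of demo.thinkhive.ai)" "$(status_of app.thinkhive.ai)"`.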
## Diagnostic Steps

### Verify the Outage
```bash
# Check all health endpoints
curl -w "%{http_code}" https://demo.thinkhive.ai/health/live
curl -w "%{http_code}" https://demo.thinkhive.ai/health/ready
curl -w "%{http_code}" https://app.thinkhive.ai/health/live
```

### Check Cloud Run Status
```bash
# View service status
gcloud run services describe thinkhive-demo --region us-central1

# Check for recent errors
gcloud logging read "resource.type=cloud_run_revision AND severity>=ERROR" \
  --limit 20 --freshness=10m
```

### Check Instance Count
```bash
# See which revisions are serving traffic
gcloud run services describe thinkhive-demo \
  --region us-central1 \
  --format "value(status.traffic)"
```

### Check Dependencies
| Service | Status Page |
|---|---|
| Google Cloud | https://status.cloud.google.com |
| Neon Database | https://neonstatus.com |
| Auth0 | https://status.auth0.com |
| OpenAI | https://status.openai.com |
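Some of these status pages (Auth0 and OpenAI, by assumption; verify before relying on it) are Statuspage-hosted and expose a machine-readable summary at `/api/v2/status.json`, which is faster to poll than the HTML page:

```bash
#!/usr/bin/env bash
# Sketch: poll Statuspage-hosted status pages from the shell.
# Assumption: the host serves the standard Statuspage /api/v2/status.json summary.
statuspage_api() {
  echo "https://$1/api/v2/status.json"
}

check_dependency() {
  # Print the overall "indicator" field: none, minor, major, or critical.
  curl -s -m 5 "$(statuspage_api "$1")" | grep -o '"indicator": *"[^"]*"'
}
```

Usage: `check_dependency status.auth0.com`.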
## Remediation

**When**: Service is unresponsive but infrastructure is healthy.

```bash
# Force new revision deployment
gcloud run services update thinkhive-demo \
  --region us-central1 \
  --update-env-vars FORCE_RESTART=$(date +%s)
```

Updating the `FORCE_RESTART` environment variable forces Cloud Run to roll out a new revision, which replaces all running instances.

## Communication Template
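After forcing a redeploy, confirm that a fresh revision is actually serving (run in an authenticated gcloud environment):

```bash
# Confirm the new revision exists and is receiving traffic
gcloud run revisions list \
  --service thinkhive-demo \
  --region us-central1
```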
```
**Incident**: ThinkHive Service Disruption
**Status**: Investigating / Identified / Resolved
**Impact**: [Describe user impact]
**Start Time**: [UTC timestamp]
**Updates**:
- [Time]: [Update]
```

## Post-Incident
- Document timeline of events
- Identify root cause
- Create tickets for preventive measures
- Update runbook if needed
- Schedule post-mortem meeting
## Prevention
- Implement multi-region deployment
- Set up synthetic monitoring
- Configure proper health checks
- Implement graceful degradation
- Regular disaster recovery drills
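The synthetic-monitoring bullet above can be sketched as a small script run from cron or Cloud Scheduler. The endpoint list is taken from this runbook; the alerting hook is left as an exit code:

```bash
#!/usr/bin/env bash
# Synthetic monitor sketch: probe every public health endpoint, fail loudly.
ENDPOINTS="
https://demo.thinkhive.ai/health/live
https://demo.thinkhive.ai/health/ready
https://app.thinkhive.ai/health/live
"

verdict() {
  # Any 2xx response counts as healthy.
  case "$1" in 2??) echo ok ;; *) echo FAIL ;; esac
}

run_checks() {
  local failed=0 code url
  for url in $ENDPOINTS; do
    code=$(curl -s -o /dev/null -m 10 -w "%{http_code}" "$url") || true
    echo "$(verdict "$code") ${code:-000} $url"
    if [ "$(verdict "$code")" = FAIL ]; then failed=1; fi
  done
  return "$failed"
}
```

Usage: `run_checks || notify-oncall` (where `notify-oncall` is a hypothetical paging hook); a non-zero exit makes the script easy to wire into any alerting system.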