High Error Rate
Severity: High | Alert Threshold: Error rate > 5% over 5 minutes
Overview
This alert triggers when the percentage of HTTP 5xx responses exceeds 5% of total requests over a 5-minute window.
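The arithmetic behind the threshold can be sketched in a few lines of shell. This is only an illustration of the calculation, not how the alert itself is implemented; the function name `error_rate_pct` and the sample counts are hypothetical.

```shell
# error_rate_pct ERRORS TOTAL
# Prints the share of 5xx responses as a percentage with two decimals.
# Hypothetical helper for illustration; prints 0.00 when there is no traffic.
error_rate_pct() {
  awk -v errors="$1" -v total="$2" 'BEGIN {
    if (total + 0 == 0) { print "0.00"; exit }
    printf "%.2f\n", (errors / total) * 100
  }'
}
```

For example, `error_rate_pct 12 200` prints `6.00`, which is above the 5% threshold.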
Symptoms
- Increased 5xx HTTP responses
- User-facing errors in the application
- Failed API calls from SDKs
- Elevated error logs in Cloud Logging
Impact Assessment
| Error Rate | Impact Level | Action |
|---|---|---|
| 5-10% | Medium | Investigate, may self-resolve |
| 10-25% | High | Immediate investigation required |
| > 25% | Critical | All hands, consider rollback |
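The table can be encoded as a small helper for use in scripts. This is a minimal sketch: `impact_level` is a hypothetical name, and because the table's ranges meet at 10% and 25%, it assumes that exactly 10% and exactly 25% both count as High.

```shell
# impact_level RATE_PCT
# Maps an error-rate percentage to the impact level from the table.
# Assumptions (not stated in the runbook): exactly 10% and exactly 25% are High,
# and anything at or below the 5% alert threshold is "Below threshold".
impact_level() {
  awk -v rate="$1" 'BEGIN {
    rate = rate + 0                      # force numeric comparison
    if (rate > 25)       print "Critical"
    else if (rate >= 10) print "High"
    else if (rate > 5)   print "Medium"
    else                 print "Below threshold"
  }'
}
```

For example, `impact_level 7.5` prints `Medium` and `impact_level 30` prints `Critical`.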
Diagnostic Steps
Check Recent Deployments
# List recent Cloud Run revisions
gcloud run revisions list --service thinkhive-demo --region us-central1 --limit 5
If a recent deployment correlates with the error spike, consider rolling back (see Rollback Procedure).
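One way to make "correlates" concrete is to check whether a revision was created shortly before the alert fired. The sketch below is an assumption-laden illustration: `deployed_before_alert` is a hypothetical helper, the 30-minute default window is arbitrary, and it relies on GNU `date -d` to parse timestamps.

```shell
# deployed_before_alert CREATED_TS ALERT_TS [WINDOW_MIN]
# Succeeds (exit 0) if the revision was created at most WINDOW_MIN minutes
# before the alert time. Timestamps are RFC 3339 strings such as
# 2024-05-01T12:00:00Z. Hypothetical helper; requires GNU date.
deployed_before_alert() {
  local created="$1" alert="$2" window_min="${3:-30}"
  local created_s alert_s delta
  created_s="$(date -u -d "$created" +%s)" || return 2
  alert_s="$(date -u -d "$alert" +%s)" || return 2
  delta=$(( alert_s - created_s ))
  # Deployment must precede the alert and fall inside the window.
  [ "$delta" -ge 0 ] && [ "$delta" -le $(( window_min * 60 )) ]
}
```

A call such as `deployed_before_alert "2024-05-01T12:00:00Z" "2024-05-01T12:10:00Z" 30` succeeds, which would make that revision a rollback candidate.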
Analyze Error Logs
# View recent errors in Cloud Logging
gcloud logging read "resource.type=cloud_run_revision AND severity>=ERROR" \
  --limit 50 \
  --format "table(timestamp, textPayload)"
Look for patterns:
- Specific endpoints failing
- Database connection errors
- External service failures
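Spotting these patterns by eye is easier once similar messages are grouped. The following is a minimal sketch, not part of any existing tooling: `top_errors` is a hypothetical helper that reads log lines from stdin, collapses digits so that messages differing only in IDs or timings are counted together, and prints the most frequent ones.

```shell
# top_errors [N]
# Reads error messages from stdin (one per line) and prints the N most
# frequent, most common first. Digit runs are replaced with "N" so that
# "timeout after 30ms" and "timeout after 45ms" are grouped together.
top_errors() {
  grep -v '^[[:space:]]*$' \
    | sed -E 's/[0-9]+/N/g' \
    | sort \
    | uniq -c \
    | sort -rn \
    | head -n "${1:-10}"
}
```

It could be fed from the log query, for example by using `--format "value(textPayload)"` on `gcloud logging read` and piping the result into `top_errors 5`. Multi-line payloads such as stack traces would be counted line by line, so treat the output as a rough guide.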
Check Service Health
# Health check endpoints
curl https://demo.thinkhive.ai/health/live
curl https://demo.thinkhive.ai/health/ready
Review Metrics Dashboard
- Open Cloud Monitoring
- Check whether request latency rises together with the error rate
- Check memory and CPU usage for saturation
- Check database connection pool usage
Common Causes & Remediation
Database Connection Issues
Symptoms: Connection timeout errors, query failures
Fix:
# Check database status
# Visit Neon console: https://console.neon.tech
# Restart the service to reset the connection pool
# (changing an env var creates a new revision with fresh instances)
gcloud run services update thinkhive-demo \
  --region us-central1 \
  --update-env-vars RESTART_TRIGGER=$(date +%s)
Rollback Procedure
If errors are caused by a recent deployment:
# List revisions to find stable version
gcloud run revisions list --service thinkhive-demo --region us-central1
# Route 100% traffic to stable revision
gcloud run services update-traffic thinkhive-demo \
  --region us-central1 \
  --to-revisions thinkhive-demo-STABLE_REVISION=100
Prevention
- Implement comprehensive error handling
- Add circuit breakers for external dependencies
- Set up canary deployments
- Increase test coverage for edge cases
- Monitor error budgets
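Monitoring an error budget comes down to comparing the failures seen so far with the failures an SLO tolerates. This runbook does not state an SLO, so the sketch below takes it as a parameter; `error_budget_remaining` and the sample numbers are hypothetical.

```shell
# error_budget_remaining SLO_PCT TOTAL_REQUESTS FAILED_REQUESTS
# Prints the percentage of the error budget still unspent for the period.
# A negative value means the budget is overspent. Hypothetical helper;
# the SLO target is an input because the runbook does not define one.
error_budget_remaining() {
  awk -v slo="$1" -v total="$2" -v failed="$3" 'BEGIN {
    allowed = total * (100 - slo) / 100   # failures the SLO tolerates
    if (allowed == 0) { print "n/a"; exit }
    printf "%.1f\n", (allowed - failed) / allowed * 100
  }'
}
```

For example, with a 99% target, 1000 requests, and 4 failures, `error_budget_remaining 99 1000 4` prints `60.0`: 10 failures are allowed and 4 have been used.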