High Error Rate

Severity: High | Alert Threshold: Error rate > 5% over 5 minutes

Overview

This alert triggers when the percentage of HTTP 5xx responses exceeds 5% of total requests over a 5-minute window.
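The threshold arithmetic can be sketched as a pair of shell helpers. This is a minimal illustration, not the alerting policy itself: the function names and the idea of feeding in raw request counts for the window are assumptions, and the real alert is evaluated by the monitoring system.

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating the alert arithmetic.

# error_rate_pct TOTAL ERRORS
# Prints the 5xx share of all requests as a percentage with two decimals.
error_rate_pct() {
  local total="$1" errors="$2"
  if [ "$total" -le 0 ]; then
    echo "0.00"   # no traffic in the window: report 0% instead of dividing by zero
    return 0
  fi
  awk -v t="$total" -v e="$errors" 'BEGIN { printf "%.2f\n", (e / t) * 100 }'
}

# alert_fires TOTAL ERRORS
# Exit status 0 when the error rate is strictly above 5%, 1 otherwise.
# Compares e*100 > 5*t so the check does not depend on rounding.
alert_fires() {
  awk -v t="$1" -v e="$2" 'BEGIN { exit !(t > 0 && e * 100 > 5 * t) }'
}

# Example: 720 errors out of 12000 requests in the 5-minute window
# error_rate_pct 12000 720   -> 6.00
# alert_fires 12000 720 && echo "alert would fire"
```

Exactly 5% does not fire in this sketch, since the stated threshold is "> 5%".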

Symptoms

  • Increased 5xx HTTP responses
  • User-facing errors in the application
  • Failed API calls from SDKs
  • Elevated error logs in Cloud Logging

Impact Assessment

Error Rate   Impact Level   Action
5-10%        Medium         Investigate, may self-resolve
10-25%       High           Immediate investigation required
> 25%        Critical       All hands, consider rollback
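For scripting or chat-ops use, the impact levels can be expressed as a small lookup. This is a sketch under assumptions: the function name is hypothetical, and because the ranges share the 10% boundary, it assumes 10% counts as High and 25% still counts as High (only values above 25% are Critical).

```shell
#!/usr/bin/env bash
# impact_level RATE_PERCENT
# Maps an error rate to the impact level from the table.
# Assumed boundaries: [5, 10) Medium, [10, 25] High, above 25 Critical.
impact_level() {
  awk -v r="$1" 'BEGIN {
    r += 0                       # force numeric comparison
    if (r > 25)       print "Critical"
    else if (r >= 10) print "High"
    else if (r >= 5)  print "Medium"
    else              print "Below threshold"
  }'
}

# Example:
# impact_level 7.5   -> Medium
# impact_level 31    -> Critical
```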

Diagnostic Steps

Check Recent Deployments

# List recent Cloud Run revisions
gcloud run revisions list --service thinkhive-demo --region us-central1 --limit 5

If a revision was deployed shortly before the error spike began, treat it as the likely cause and consider a rollback (see Rollback Procedure).

Analyze Error Logs

# View recent errors for this service in Cloud Logging
gcloud logging read \
  "resource.type=cloud_run_revision AND resource.labels.service_name=thinkhive-demo AND severity>=ERROR" \
  --limit 50 \
  --format "table(timestamp, textPayload)"

Look for patterns:

  • Specific endpoints failing
  • Database connection errors
  • External service failures
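A rough way to spot these patterns is to tally log lines by keyword. The sketch below reads plain log text on stdin; the function name and the keyword lists are assumptions chosen for illustration and should be adjusted to whatever the application actually logs.

```shell
#!/usr/bin/env bash
# tally_error_patterns
# Reads log lines on stdin and counts them in three rough buckets.
# The keywords are illustrative guesses, not the application's real messages.
tally_error_patterns() {
  awk '
    {
      line = tolower($0)
      if (line ~ /connection|pool|query/)               db++
      else if (line ~ /upstream|timeout|econnrefused/)  ext++
      else                                              other++
    }
    END { printf "database=%d external=%d other=%d\n", db, ext, other }
  '
}

# Example (requires gcloud access to the project):
# gcloud logging read "resource.type=cloud_run_revision AND severity>=ERROR" \
#   --limit 50 --format "value(textPayload)" | tally_error_patterns
```

A line is counted in the first bucket it matches, so "connection timeout" lands under database rather than external.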

Check Service Health

# Health check endpoints
curl https://demo.thinkhive.ai/health/live
curl https://demo.thinkhive.ai/health/ready
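To make the result easier to read at a glance, the two checks can be wrapped so that only the HTTP status code is printed. This is a minimal sketch: the function name is hypothetical, and it assumes a healthy endpoint answers with HTTP 200, which the runbook does not state.

```shell
#!/usr/bin/env bash
# check_health BASE_URL
# Prints "<path> <http_code>" for each health endpoint.
# Returns 0 only if both respond with 200 (assumed to mean healthy).
check_health() {
  local base="$1" path code status=0
  for path in /health/live /health/ready; do
    # -w prints only the status code; curl reports 000 when the request fails
    code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "${base}${path}")" || code="000"
    echo "${path} ${code}"
    [ "$code" = "200" ] || status=1
  done
  return "$status"
}

# Example:
# check_health https://demo.thinkhive.ai || echo "service unhealthy"
```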

Review Metrics Dashboard

  1. Open Cloud Monitoring
  2. Check whether request latency rose along with the error rate
  3. Check memory and CPU usage
  4. Check database connection pool usage

Common Causes & Remediation

Database Connection Issues

Symptoms: Connection timeout errors, query failures

Fix:

# Check database status
# Visit Neon console: https://console.neon.tech

# Restart the service to reset the connection pool.
# Changing an env var makes Cloud Run deploy a new revision with fresh instances.
gcloud run services update thinkhive-demo \
  --region us-central1 \
  --update-env-vars RESTART_TRIGGER="$(date +%s)"

Rollback Procedure

If errors are caused by a recent deployment:

# List revisions to find the last stable version
gcloud run revisions list --service thinkhive-demo --region us-central1

# Route 100% of traffic to the stable revision.
# Replace thinkhive-demo-STABLE_REVISION with a revision name from the list above.
gcloud run services update-traffic thinkhive-demo \
  --region us-central1 \
  --to-revisions thinkhive-demo-STABLE_REVISION=100
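During an incident it is easy to paste the placeholder revision name by mistake. One possible guard is a helper that builds the command and refuses empty or placeholder input. This is a sketch: the function name and the example revision name are hypothetical, and it only prints the command so it can be reviewed before running.

```shell
#!/usr/bin/env bash
# rollback_command REVISION
# Prints (does not run) the update-traffic command for the given revision.
# Rejects an empty argument or the STABLE_REVISION placeholder.
rollback_command() {
  local revision="$1"
  case "$revision" in
    ""|*STABLE_REVISION*)
      echo "error: pass a real revision name from 'gcloud run revisions list'" >&2
      return 1
      ;;
  esac
  printf 'gcloud run services update-traffic %s --region %s --to-revisions %s=100\n' \
    "thinkhive-demo" "us-central1" "$revision"
}

# Example (revision name is made up):
# rollback_command thinkhive-demo-00042-abc
```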

Prevention

  • Implement comprehensive error handling
  • Add circuit breakers for external dependencies
  • Set up canary deployments
  • Increase test coverage for edge cases
  • Monitor error budgets
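As an illustration of the last point, an error budget can be derived from an availability target and request counts. This is a sketch only: the function name is hypothetical and the 99% target in the example is an assumption, since this runbook does not define an SLO.

```shell
#!/usr/bin/env bash
# error_budget_remaining SLO_PERCENT TOTAL_REQUESTS FAILED_REQUESTS
# Prints the share of the error budget still unspent, as a percentage.
# A negative result means the budget for the period is already exhausted.
error_budget_remaining() {
  awk -v slo="$1" -v total="$2" -v failed="$3" 'BEGIN {
    allowed = total * (100 - slo) / 100   # failures the SLO tolerates
    if (allowed <= 0) { print "0.0"; exit }
    printf "%.1f\n", (allowed - failed) / allowed * 100
  }'
}

# Example with an assumed 99% target over 1,000,000 requests:
# the budget is 10,000 failed requests, so 2,500 failures leave 75% of it.
# error_budget_remaining 99 1000000 2500   -> 75.0
```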