High Error Rate


Severity: High | Alert Threshold: Error rate > 5% over 5 minutes

Overview

This alert triggers when the percentage of HTTP 5xx responses exceeds 5% of total requests over a 5-minute window.
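As a quick illustration of the threshold arithmetic, the sketch below computes a 5xx percentage from two request counts and checks it against the 5% threshold. The function names `error_rate_pct` and `alert_fires` are hypothetical helpers, not part of any existing tooling, and the counts are assumed to come from your own metrics source for the 5-minute window.

```shell
# error_rate_pct FIVE_XX_COUNT TOTAL_COUNT
# Prints the share of 5xx responses as a percentage with two decimals.
# Prints 0.00 when there was no traffic, to avoid dividing by zero.
error_rate_pct() {
  awk -v errors="$1" -v total="$2" 'BEGIN {
    if (total == 0) { print "0.00"; exit }
    printf "%.2f\n", (errors / total) * 100
  }'
}

# alert_fires FIVE_XX_COUNT TOTAL_COUNT
# Exits 0 when the error rate is strictly above 5%, non-zero otherwise.
# Uses integer cross-multiplication so the comparison is exact.
alert_fires() {
  awk -v errors="$1" -v total="$2" 'BEGIN {
    exit !(total > 0 && errors * 100 > total * 5)
  }'
}
```

For example, `error_rate_pct 30 500` prints `6.00`, and `alert_fires 30 500` succeeds because 6% is above the threshold, while `alert_fires 25 500` (exactly 5%) does not.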

Symptoms

Increased 5xx HTTP responses
User-facing errors in the application
Failed API calls from SDKs
Elevated error logs in Cloud Logging

Impact Assessment

Error Rate	Impact Level	Action
5-10%	Medium	Investigate, may self-resolve
10-25%	High	Immediate investigation required
> 25%	Critical	All hands, consider rollback
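The table above can be read as a simple lookup from error rate to impact level. The sketch below is one way to encode it. `impact_level` is a hypothetical helper, and the boundary handling is an assumption, since the table lists 10% in two rows: here exactly 10% counts as High, and Critical applies only to rates strictly above 25%.

```shell
# impact_level RATE_PERCENT
# Maps an error rate (for example 12.5) to the impact level from the table.
impact_level() {
  awk -v rate="$1" 'BEGIN {
    if (rate > 25)       print "Critical"
    else if (rate >= 10) print "High"
    else if (rate >= 5)  print "Medium"
    else                 print "Below threshold"
  }'
}
```

For example, `impact_level 12.5` prints `High`.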

Diagnostic Steps

Check Recent Deployments

# List recent Cloud Run revisions
gcloud run revisions list --service thinkhive-demo --region us-central1 --limit 5

If a recent deployment correlates with the error spike, consider rolling back (see Rollback Procedure).
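To make the correlation easier to judge, it can help to print each revision's creation time next to its name and compare it with the start of the error spike. The wrapper below is a sketch: `recent_revisions` is a hypothetical helper name, and the defaults simply reuse the service and region from the command above.

```shell
# recent_revisions [SERVICE] [REGION]
# Lists the five newest revisions, newest first, with creation timestamps.
recent_revisions() {
  gcloud run revisions list \
    --service "${1:-thinkhive-demo}" \
    --region "${2:-us-central1}" \
    --sort-by "~metadata.creationTimestamp" \
    --limit 5 \
    --format "table(metadata.name, metadata.creationTimestamp)"
}
```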

Analyze Error Logs

# View recent errors in Cloud Logging
gcloud logging read "resource.type=cloud_run_revision AND severity>=ERROR" \
  --limit 50 \
  --format "table(timestamp, textPayload)"

Look for patterns:

Specific endpoints failing
Database connection errors
External service failures
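One way to surface such patterns is to group similar log messages and count them. The filter below is a minimal sketch: `top_errors` is a hypothetical helper that reads messages on standard input, replaces runs of digits with `N` so that messages differing only in IDs or durations collapse together, and prints the most frequent ones. It assumes the messages arrive as plain text, one per line.

```shell
# top_errors [COUNT]
# Reads log messages on stdin and prints the COUNT most frequent ones
# (default 10), each prefixed with its number of occurrences.
top_errors() {
  awk 'NF' |                 # drop blank lines
    sed -E 's/[0-9]+/N/g' |  # collapse numbers so similar messages group
    sort | uniq -c | sort -rn |
    head -n "${1:-10}"
}

# Possible usage with the query above (textPayload only):
#   gcloud logging read "resource.type=cloud_run_revision AND severity>=ERROR" \
#     --limit 500 --format "value(textPayload)" | top_errors 5
```

Entries logged as structured JSON have no `textPayload`, so they would show up as blank lines and be skipped by this sketch.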

Check Service Health

# Health check endpoints (-i prints the HTTP status line and headers)
curl -i https://demo.thinkhive.ai/health/live
curl -i https://demo.thinkhive.ai/health/ready
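When only the status codes matter, the two checks can be wrapped in a small function. This is a sketch under the assumption that both endpoints live under the same base URL. `health_report` is a hypothetical helper, and the 10-second timeout is an arbitrary choice.

```shell
# health_report BASE_URL
# Prints one line per health endpoint: the path and the HTTP status code,
# or "unreachable" when curl itself fails (DNS error, timeout, refused).
health_report() {
  base_url="${1%/}"   # strip a trailing slash, if any
  for path in /health/live /health/ready; do
    code=$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' \
      "$base_url$path" 2>/dev/null) || code="unreachable"
    printf '%s %s\n' "$path" "$code"
  done
}

# Possible usage:
#   health_report https://demo.thinkhive.ai
```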

Review Metrics Dashboard

Open Cloud Monitoring
Check whether request latency rose together with the error rate
Check memory and CPU usage
Check database connection pool usage

Common Causes & Remediation

Database Connection Issues

Symptoms: Connection timeout errors, query failures

Fix:

# Check database status
# Visit Neon console: https://console.neon.tech
 
# Restart the service to reset the connection pool
# (changing an env var creates a new revision with fresh instances)
gcloud run services update thinkhive-demo \
  --region us-central1 \
  --update-env-vars "RESTART_TRIGGER=$(date +%s)"

Memory Exhaustion

Symptoms: OOM errors, slow responses before failure

Fix:

# Increase memory allocation
gcloud run services update thinkhive-demo \
  --region us-central1 \
  --memory 1Gi
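Before raising the limit, it may be worth confirming what the service is currently configured with, so the new value is actually an increase. The sketch below reads the memory limit of the first container from the service description. `current_memory_limit` is a hypothetical helper, and it assumes the service runs a single container.

```shell
# current_memory_limit [SERVICE] [REGION]
# Prints the configured memory limit of the first container (for example 512Mi).
current_memory_limit() {
  gcloud run services describe "${1:-thinkhive-demo}" \
    --region "${2:-us-central1}" \
    --format "value(spec.template.spec.containers[0].resources.limits.memory)"
}
```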

Bad Deployment

Symptoms: Consistent errors from specific endpoint

Fix:

# Roll back to the previous revision
# (replace PREVIOUS_REVISION with the full revision name)
gcloud run services update-traffic thinkhive-demo \
  --region us-central1 \
  --to-revisions PREVIOUS_REVISION=100
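The revision name to substitute has to be looked up first. A minimal sketch, assuming that "previous" means the second-newest revision by creation time: `previous_revision` is a hypothetical helper, and the revision it prints is not guaranteed to be the one that last served traffic or to be healthy, so check it before routing traffic to it.

```shell
# previous_revision [SERVICE] [REGION]
# Prints the name of the second-newest revision by creation time.
previous_revision() {
  gcloud run revisions list \
    --service "${1:-thinkhive-demo}" \
    --region "${2:-us-central1}" \
    --sort-by "~metadata.creationTimestamp" \
    --limit 2 \
    --format "value(metadata.name)" | sed -n '2p'
}

# Possible usage:
#   gcloud run services update-traffic thinkhive-demo \
#     --region us-central1 \
#     --to-revisions "$(previous_revision)=100"
```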

Rollback Procedure

If errors are caused by a recent deployment:

# List revisions to find stable version
gcloud run revisions list --service thinkhive-demo --region us-central1
 
# Route 100% of traffic to the stable revision
# (replace STABLE_REVISION so the value matches a name from the list above)
gcloud run services update-traffic thinkhive-demo \
  --region us-central1 \
  --to-revisions thinkhive-demo-STABLE_REVISION=100
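After the traffic change, it is worth confirming that the split actually took effect. The sketch below prints the traffic section of the service status, which lists each revision with its percentage. `show_traffic` is a hypothetical helper name.

```shell
# show_traffic [SERVICE] [REGION]
# Prints the current traffic split (revision names and percentages) as YAML.
show_traffic() {
  gcloud run services describe "${1:-thinkhive-demo}" \
    --region "${2:-us-central1}" \
    --format "yaml(status.traffic)"
}
```

If the stable revision shows 100 percent, keep watching the error rate to confirm that it falls back below the alert threshold.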

Prevention

Implement comprehensive error handling
Add circuit breakers for external dependencies
Set up canary deployments
Increase test coverage for edge cases
Monitor error budgets
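As an illustration of the last point, an error budget is the number of failed requests an availability target still allows over a period. The sketch below shows the arithmetic only. `error_budget_remaining` is a hypothetical helper, and the 99.9% target in the example is an assumed value, not one stated in this runbook.

```shell
# error_budget_remaining SLO_PERCENT TOTAL_REQUESTS FAILED_REQUESTS
# Prints how many more failed requests the SLO allows for the period.
# A negative result means the budget is already exhausted.
error_budget_remaining() {
  awk -v slo="$1" -v total="$2" -v failed="$3" 'BEGIN {
    allowed = total * (100 - slo) / 100
    printf "%.0f\n", allowed - failed
  }'
}
```

With an assumed 99.9% target, `error_budget_remaining 99.9 1000000 400` prints `600`: one million requests allow 1000 failures, and 400 have been used.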
