Operational Runbooks

These runbooks provide step-by-step guidance for responding to ThinkHive platform alerts. Each runbook covers detection, diagnosis, and remediation procedures.

Audience: These runbooks are for self-hosted ThinkHive operators and ThinkHive platform engineers. If you use the managed service at app.thinkhive.ai, you do not need them; the ThinkHive team handles infrastructure operations for you.

On-Call Engineers: Bookmark this page for quick access during incidents.

Alert Categories

General Incident Response

1. Acknowledge the Alert

Check the alert in your monitoring dashboard
Acknowledge to prevent duplicate notifications
Note the alert start time for the incident timeline

2. Assess Impact

Severity: Is this affecting users?
Scope: Single service or multiple?
Duration: How long has this been occurring?
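
The three questions above can be captured in a small structure so that every responder records the same facts. The following is a minimal sketch, not part of the ThinkHive platform: the `ImpactAssessment` class, the `P1`/`P2`/`P3` labels, the 30-minute threshold, and the service names in the usage example are all illustrative assumptions that you should replace with your own incident policy.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class ImpactAssessment:
    """Hypothetical record of the Severity / Scope / Duration questions."""

    user_facing: bool             # Severity: is this affecting users?
    affected_services: list[str]  # Scope: single service or multiple?
    started_at: datetime          # Duration: when the problem began (UTC)

    def duration(self, now: Optional[datetime] = None) -> timedelta:
        """Return how long the issue has been occurring."""
        now = now or datetime.now(timezone.utc)
        return now - self.started_at

    def priority(self, now: Optional[datetime] = None) -> str:
        """Map the answers to a priority label (assumed rubric, not official)."""
        multiple_services = len(self.affected_services) > 1
        long_running = self.duration(now) >= timedelta(minutes=30)  # assumed threshold
        if self.user_facing and (multiple_services or long_running):
            return "P1"
        if self.user_facing or multiple_services:
            return "P2"
        return "P3"


if __name__ == "__main__":
    started = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    checked = started + timedelta(minutes=45)
    assessment = ImpactAssessment(True, ["api"], started)  # "api" is a placeholder name
    print(assessment.priority(checked))
```

Passing `now` explicitly keeps the rubric deterministic, which makes it easy to review the thresholds against past incidents.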

3. Communicate

Update incident channel (Slack/Teams)
Notify stakeholders if customer-impacting
Keep status page updated

4. Diagnose & Remediate

Follow the specific runbook for the alert type
Document actions taken
Escalate to the appropriate team (see Escalation Contacts) if needed

5. Post-Incident

Write incident report
Identify root cause
Create follow-up tickets for prevention
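
Steps 1, 4, and 5 all depend on a timestamped record of what happened. One way to keep that record consistent is sketched below, under the assumption that you track incidents in your own tooling; `IncidentLog` and its fields are hypothetical names, not a ThinkHive API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class IncidentLog:
    """Hypothetical timeline that feeds the post-incident report."""

    title: str
    started_at: datetime  # noted when the alert is acknowledged
    entries: list[tuple[datetime, str]] = field(default_factory=list)
    root_cause: Optional[str] = None
    follow_ups: list[str] = field(default_factory=list)

    def record(self, action: str, at: Optional[datetime] = None) -> None:
        """Document an action taken, stamped with the current UTC time by default."""
        self.entries.append((at or datetime.now(timezone.utc), action))

    def report(self) -> str:
        """Render a plain-text report skeleton with the timeline in time order."""
        lines = [
            f"Incident: {self.title}",
            f"Started: {self.started_at.isoformat()}",
            "Timeline:",
        ]
        for ts, action in sorted(self.entries, key=lambda entry: entry[0]):
            lines.append(f"  {ts.isoformat()}  {action}")
        lines.append(f"Root cause: {self.root_cause or 'TBD'}")
        lines.append("Follow-up tickets:")
        lines += [f"  - {item}" for item in self.follow_ups] or ["  (none yet)"]
        return "\n".join(lines)


if __name__ == "__main__":
    log = IncidentLog("High error rate", datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc))
    log.record("Acknowledged alert")
    log.record("Rolled back latest revision")  # example action, not a prescribed fix
    print(log.report())
```

Recording actions as they happen means the report is mostly written by the time the incident closes; only the root cause and follow-up tickets remain to be filled in.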

Quick Links

Resource	Description
Cloud Run Console	Service management
Cloud Logging	Log analysis
Cloud Monitoring	Metrics dashboard
Database Studio	Neon PostgreSQL console
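
During diagnosis, a common first query in Cloud Logging is "recent errors for one Cloud Run service". The sketch below builds such a filter in the Logging query language. The helper name `cloud_run_error_filter` and the service name `thinkhive-api` are assumptions for illustration; substitute the service names from your own deployment.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def cloud_run_error_filter(
    service_name: str,
    lookback: timedelta,
    now: Optional[datetime] = None,
) -> str:
    """Build a Cloud Logging filter for ERROR-and-above entries of one Cloud Run service."""
    now = now or datetime.now(timezone.utc)
    # Normalize to UTC so the trailing "Z" in the timestamp is accurate.
    since = (now - lookback).astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    clauses = [
        'resource.type="cloud_run_revision"',
        f'resource.labels.service_name="{service_name}"',
        "severity>=ERROR",
        f'timestamp>="{since}"',
    ]
    return " AND ".join(clauses)


if __name__ == "__main__":
    # "thinkhive-api" is a placeholder service name.
    print(cloud_run_error_filter("thinkhive-api", timedelta(minutes=30)))
```

The resulting string can be pasted into the Logs Explorer query box or passed as the filter argument to `gcloud logging read`.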

Escalation Contacts

Role	Contact
Platform Team	platform@thinkhive.ai
Database Team	database@thinkhive.ai
Security Team	security@thinkhive.ai

High Error Rate