Operational Runbooks
These runbooks provide step-by-step guidance for responding to ThinkHive platform alerts. Each runbook covers detection, diagnosis, and remediation procedures.
Audience: These runbooks are for self-hosted ThinkHive operators and ThinkHive platform engineers. If you use the managed service at app.thinkhive.ai, you do not need these runbooks; the ThinkHive team handles infrastructure operations for you.
On-Call Engineers: Bookmark this page for quick access during incidents.
Alert Categories
- Performance Alerts
- Resource Alerts
- Availability Alerts
- Traffic Alerts
General Incident Response
1. Acknowledge the Alert
- Check the alert in your monitoring dashboard
- Acknowledge to prevent duplicate notifications
- Note the start time for the incident timeline
2. Assess Impact
- Severity: Is this affecting users?
- Scope: Single service or multiple?
- Duration: How long has this been occurring?
3. Communicate
- Update the incident channel (Slack/Teams)
- Notify stakeholders if the incident is customer-impacting
- Keep the status page updated
4. Diagnose & Remediate
- Follow the specific runbook for the alert type
- Document actions taken
- Escalate to the relevant team listed under Escalation Contacts if needed
5. Post-Incident
- Write the incident report
- Identify root cause
- Create follow-up tickets for prevention
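The steps above ask you to note a start time, assess severity, scope, and duration, and document each action taken. One way to keep that information together is a small incident record, sketched here in Python. This is only an illustration: the `Incident` class, its field names, and the `example-latency-alert` name are hypothetical and are not part of ThinkHive or of any monitoring tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List, Optional, Tuple


@dataclass
class Incident:
    """Hypothetical incident record; all names here are illustrative."""
    alert_name: str
    started_at: datetime                      # step 1: note the start time
    customer_impacting: bool = False          # step 2: severity
    affected_services: List[str] = field(default_factory=list)  # step 2: scope
    timeline: List[Tuple[datetime, str]] = field(default_factory=list)

    def log(self, action: str, at: Optional[datetime] = None) -> None:
        """Append a timestamped entry (step 4: document actions taken)."""
        self.timeline.append((at or datetime.now(timezone.utc), action))

    def duration(self, now: Optional[datetime] = None) -> timedelta:
        """Time elapsed since the incident started (step 2: duration)."""
        return (now or datetime.now(timezone.utc)) - self.started_at

    def needs_stakeholder_notice(self) -> bool:
        """Step 3: notify stakeholders if the incident is customer-impacting."""
        return self.customer_impacting


# Example usage with fixed timestamps so the result is reproducible.
start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
incident = Incident(
    alert_name="example-latency-alert",  # hypothetical alert name
    started_at=start,
    customer_impacting=True,
    affected_services=["api"],
)
incident.log("Acknowledged alert", at=start + timedelta(minutes=2))
print(incident.duration(now=start + timedelta(minutes=30)))  # 0:30:00
```

The recorded timeline can later be copied into the incident report written during the post-incident step.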
Quick Links
| Resource | Description |
|---|---|
| Cloud Run Console | Service management |
| Cloud Logging | Log analysis |
| Cloud Monitoring | Metrics dashboard |
| Database Studio | Neon PostgreSQL console |
Escalation Contacts
| Role | Contact |
|---|---|
| Platform Team | platform@thinkhive.ai |
| Database Team | database@thinkhive.ai |
| Security Team | security@thinkhive.ai |