Operational Runbooks

These runbooks provide step-by-step guidance for responding to ThinkHive platform alerts. Each runbook covers detection, diagnosis, and remediation procedures.

Audience: These runbooks are for self-hosted ThinkHive operators and ThinkHive platform engineers. If you use the managed service at app.thinkhive.ai, you can skip them: the ThinkHive team handles infrastructure operations for you.

On-Call Engineers: Bookmark this page for quick access during incidents.

Alert Categories

  • Performance Alerts
  • Resource Alerts
  • Availability Alerts
  • Traffic Alerts

General Incident Response

1. Acknowledge the Alert

  • Check the alert in your monitoring dashboard
  • Acknowledge to prevent duplicate notifications
  • Note the start time for the incident timeline

2. Assess Impact

  • Severity: Is this affecting users?
  • Scope: Single service or multiple?
  • Duration: How long has this been occurring?
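
The three questions above can be turned into a rough priority label. The following is a minimal sketch only: the `Impact` fields, the `triage` function, the P1/P2/P3 labels, and the 15-minute threshold are all illustrative assumptions, not ThinkHive policy.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Impact:
    user_facing: bool        # Severity: is this affecting users?
    services_affected: int   # Scope: single service or multiple?
    duration: timedelta      # Duration: how long has this been occurring?


def triage(impact: Impact) -> str:
    """Return a hypothetical priority label, where P1 is the most urgent."""
    # Thresholds are assumptions for illustration; tune them to your own policy.
    widespread = impact.services_affected > 1
    prolonged = impact.duration >= timedelta(minutes=15)
    if impact.user_facing and (widespread or prolonged):
        return "P1"
    if impact.user_facing or widespread:
        return "P2"
    return "P3"
```

For example, `triage(Impact(True, 1, timedelta(minutes=5)))` yields `"P2"` under these assumed thresholds: users are affected, but only one service is involved and the issue is recent.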

3. Communicate

  • Update the incident channel (Slack/Teams)
  • Notify stakeholders if the incident is customer-impacting
  • Keep the status page updated
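
Channel updates can be scripted so they stay consistent during an incident. The sketch below assumes a Slack-style incoming webhook that accepts a JSON body with a `text` field. The function names and message format are hypothetical, and a Teams webhook may expect a different payload shape.

```python
import json
import urllib.request


def build_update(summary: str, severity: str, customer_impacting: bool) -> dict:
    """Build a plain-text status update payload (message format is an assumption)."""
    audience = "customer-impacting" if customer_impacting else "internal only"
    return {"text": f"[{severity}] {summary} ({audience})"}


def post_update(webhook_url: str, payload: dict) -> int:
    """POST the payload as JSON to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Keeping payload construction separate from the network call lets you check the message wording without sending anything.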

4. Diagnose & Remediate

  • Follow the specific runbook for the alert type
  • Document actions taken
  • Escalate if needed
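
Recording actions as they happen makes the later incident report much easier to write. This is a minimal sketch of a timestamped action log. The `IncidentLog` class and its fields are hypothetical and not part of any ThinkHive tooling.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass
class IncidentLog:
    alert_name: str
    started_at: datetime  # start time noted when the alert was acknowledged
    entries: List[Tuple[datetime, str]] = field(default_factory=list)

    def record(self, action: str, at: Optional[datetime] = None) -> None:
        """Append an action, timestamped now (UTC) unless a time is given."""
        timestamp = at or datetime.now(timezone.utc)
        self.entries.append((timestamp, action))

    def render(self) -> str:
        """Return the timeline as one line per event, oldest first."""
        lines = [f"{self.started_at.isoformat()} alert fired: {self.alert_name}"]
        ordered = sorted(self.entries, key=lambda entry: entry[0])
        lines += [f"{ts.isoformat()} {action}" for ts, action in ordered]
        return "\n".join(lines)
```

Call `record()` after each remediation step or escalation, then paste the output of `render()` into the incident report.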

5. Post-Incident

  • Write an incident report
  • Identify root cause
  • Create follow-up tickets for prevention

Resource             Description
Cloud Run Console    Service management
Cloud Logging        Log analysis
Cloud Monitoring     Metrics dashboard
Database Studio      Neon PostgreSQL console

Escalation Contacts

Role             Contact
Platform Team    platform@thinkhive.ai
Database Team    database@thinkhive.ai
Security Team    security@thinkhive.ai