
Memory Critical


Severity: Critical | Alert Threshold: Memory usage > 95% OR OOM events

Overview

This alert triggers when memory usage reaches critical levels (>95%) or Out-of-Memory (OOM) events occur. This requires immediate action to prevent service disruption.

Immediate Actions

⚠️ Time-Critical: Service may crash within minutes. Act immediately.

Step 1: Increase Memory (Do This First)

gcloud run services update thinkhive-demo \
  --region us-central1 \
  --memory 2Gi

Step 2: Reduce Concurrency

gcloud run services update thinkhive-demo \
  --region us-central1 \
  --concurrency 20

Step 3: Scale Out

gcloud run services update thinkhive-demo \
  --region us-central1 \
  --min-instances 2 \
  --max-instances 20

Diagnostic Steps

Check for OOM Events

# Search for OOM kills
gcloud logging read 'textPayload=~"OOM" OR textPayload=~"out of memory" OR textPayload=~"heap"' \
  --limit 20 \
  --freshness=1h

Identify Memory Consumer

# Check what's using memory
gcloud logging read 'textPayload=~"memory" AND severity>=WARNING' --limit 30

Check Recent Traffic Spike

Traffic spikes can cause memory exhaustion:

  • Large batch uploads
  • Many concurrent analysis requests
  • Evaluation runs on large datasets
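
Beyond scaling, the spike sources above can be mitigated in-process by capping how many memory-heavy operations run at once. A minimal sketch (the `Semaphore` class and the cap of 4 are illustrative, not part of the service):

```javascript
// Minimal in-process semaphore to cap concurrent memory-heavy operations.
class Semaphore {
  constructor(max) {
    this.max = max;      // maximum concurrent holders
    this.active = 0;     // current holders
    this.queue = [];     // waiters, resolved in FIFO order
  }

  async acquire() {
    // Wait if at capacity, or if others are already queued ahead of us.
    if (this.active >= this.max || this.queue.length > 0) {
      await new Promise((resolve) => this.queue.push(resolve));
    }
    this.active++;
  }

  release() {
    this.active--;
    const next = this.queue.shift();
    if (next) next(); // wake the oldest waiter, if any
  }
}
```

Typical usage wraps the heavy section: `await heavy.acquire(); try { /* analysis */ } finally { heavy.release(); }`, so a burst of requests queues instead of all allocating at once.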

Review Recent Deployments

gcloud run revisions list --service thinkhive-demo --region us-central1 --limit 5

Root Cause Analysis

| Symptom | Likely Cause | Fix |
| --- | --- | --- |
| Gradual increase | Memory leak | Find and fix the leak, then restart |
| Sudden spike | Large request/batch | Add request size limits |
| After deployment | New code issue | Roll back |
| Traffic correlated | Under-provisioned | Increase memory |
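
The "gradual increase" case can be flagged programmatically by checking whether heap samples grow steadily across a window. A rough sketch (the window size and sampling interval are assumptions; tune them for your workload):

```javascript
// Sketch: flag a likely leak when every heap sample in a window exceeds the
// previous one. `samples` holds heapUsed byte counts taken at a fixed interval.
function looksLikeLeak(samples) {
  if (samples.length < 3) return false; // too little data to judge
  return samples.every((v, i) => i === 0 || v > samples[i - 1]);
}
```

Strictly monotonic growth across many GC cycles is a strong leak signal; a sawtooth pattern that returns to baseline usually is not.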

Emergency Rollback

If recent deployment is suspected:

# List revisions
gcloud run revisions list --service thinkhive-demo --region us-central1
 
# Rollback to previous stable version
gcloud run services update-traffic thinkhive-demo \
  --region us-central1 \
  --to-revisions PREVIOUS_REVISION=100  # replace PREVIOUS_REVISION with the stable revision name from the list above

Recovery Checklist

  • Memory increased to safe level
  • Concurrency reduced if needed
  • Service is stable (check health endpoints)
  • Root cause identified
  • Incident documented
  • Prevention measures planned

Prevention

Request Size Limits

// Add to Express middleware
app.use(express.json({ limit: '10mb' }));
app.use(express.urlencoded({ limit: '10mb', extended: true }));

Memory Monitoring

// Add memory monitoring
setInterval(() => {
  const usage = process.memoryUsage();
  const heapPercent = (usage.heapUsed / usage.heapTotal) * 100;
  if (heapPercent > 90) {
    console.warn(`High heap usage: ${heapPercent.toFixed(1)}%`);
  }
}, 30000);
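
The heap-only check above misses resident memory outside the V8 heap (buffers, native modules). A complementary sketch compares RSS against the container limit; note that Cloud Run does not expose its memory limit to the process, so the `MEMORY_LIMIT_MB` env var here is an assumption, a variable you would set yourself to match `--memory`:

```javascript
// Fraction (0..1) of the container memory limit currently resident.
function memoryPressure(rssBytes, limitMb) {
  return rssBytes / (limitMb * 1024 * 1024);
}

const limitMb = Number(process.env.MEMORY_LIMIT_MB || 512);
setInterval(() => {
  const pressure = memoryPressure(process.memoryUsage().rss, limitMb);
  if (pressure > 0.9) {
    console.warn(`High RSS: ${(pressure * 100).toFixed(1)}% of ${limitMb}MiB limit`);
  }
}, 30000).unref(); // unref so the timer never keeps the process alive on its own
```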

Graceful Degradation

  • Implement circuit breakers for memory-heavy operations
  • Queue large batch operations
  • Stream responses instead of buffering