Event Loop Lag

⚠️ Severity: High | Alert Threshold: Event loop lag > 100 ms for 2+ minutes

Overview

This alert fires when Node.js event loop lag stays above 100 ms for at least 2 minutes. While the loop is blocked, every request on the instance queues behind the blocking work and response times degrade.

Understanding Event Loop Lag

Node.js runs JavaScript on a single thread, and the event loop is how it schedules callbacks for asynchronous operations. When a synchronous task holds that thread:

  • All incoming requests queue up
  • Response times increase dramatically
  • Health checks may fail
  • Service appears unresponsive
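The effect is easy to reproduce locally. The following is a minimal sketch, not ThinkHive code: `blockFor` and `measureTimerLag` are hypothetical helper names, and the 200 ms figure is arbitrary. It schedules a timer, blocks the thread with a busy loop, and reports how late the timer fired.

```javascript
const { performance } = require('perf_hooks');

// Busy-wait for roughly `ms` milliseconds to simulate synchronous CPU-bound work.
function blockFor(ms) {
  const end = performance.now() + ms;
  while (performance.now() < end) {
    // spin: nothing else can run on this thread
  }
}

// Resolve with how late (in ms) a timer fired relative to its requested delay.
function measureTimerLag(delayMs = 0) {
  const scheduledAt = performance.now();
  return new Promise((resolve) => {
    setTimeout(() => resolve(performance.now() - scheduledAt - delayMs), delayMs);
  });
}

// Demo: the timer cannot fire until the blocking call returns.
async function demo() {
  const lag = measureTimerLag(0);
  blockFor(200);
  console.log(`Timer fired ${(await lag).toFixed(0)} ms late`);
}
```

Calling `demo()` should report a lag close to the blocking time, which is the same delay every queued request experiences.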

Diagnostic Steps

Check for Blocking Operations

# Look for log entries that mention synchronous or blocking operations
# (the "sync" pattern also matches "async", so expect some noise)
gcloud logging read 'textPayload=~"sync" OR textPayload=~"blocking"' --limit 20

Identify Heavy Computations

Common blockers in ThinkHive:

  • Large JSON parsing (traces with many spans)
  • Synchronous file operations
  • Complex regex evaluation
  • Large array operations
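To confirm which of these is responsible, it can help to time suspect synchronous calls directly. A minimal sketch, assuming a hypothetical `timeSync` helper and an arbitrary 50 ms default threshold (neither is part of ThinkHive):

```javascript
const { performance } = require('perf_hooks');

// Hypothetical helper: run a synchronous function and warn when it holds
// the event loop for longer than `thresholdMs`.
function timeSync(label, fn, thresholdMs = 50, log = console.warn) {
  const start = performance.now();
  try {
    return fn();
  } finally {
    const elapsedMs = performance.now() - start;
    if (elapsedMs > thresholdMs) {
      log(`[event-loop] ${label} blocked for ${elapsedMs.toFixed(1)} ms`);
    }
  }
}
```

For example, `const trace = timeSync('trace-parse', () => JSON.parse(body));` wraps a suspect call without changing its return value, and the warning identifies the operation by label in the logs.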

Check CPU Usage

Blocking work is usually CPU-bound, so sustained high CPU utilization often accompanies event loop lag:

# View CPU metrics
gcloud monitoring metrics list --filter="metric.type=run.googleapis.com/container/cpu/utilization"

Review Recent Changes

  • New evaluation criteria with complex logic?
  • Changes to trace processing?
  • New synchronous operations?

Common Causes & Remediation

Symptoms: Lag when processing large traces

Fix: Stream JSON parsing

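// Instead of parsing the whole payload in one synchronous call
const data = JSON.parse(hugeString);

// Parse incrementally with stream-json, which exports a `parser` factory
const { parser } = require('stream-json');

// `stream` is any Readable carrying the JSON text (for example, the request body)
const pipeline = stream.pipe(parser());
// The parser emits tokens such as { name: 'startObject' } as 'data' events
pipeline.on('data', (token) => { /* handle each token */ });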

Quick Mitigations

Increase CPU Allocation

gcloud run services update thinkhive-demo \
  --region us-central1 \
  --cpu 2

Reduce Concurrency

Lowering per-instance concurrency means fewer requests share one event loop, so each blocking operation delays fewer requests:

gcloud run services update thinkhive-demo \
  --region us-central1 \
  --concurrency 20

Scale Out

gcloud run services update thinkhive-demo \
  --region us-central1 \
  --min-instances 3 \
  --max-instances 20

Monitoring Event Loop

Add monitoring to the application:

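// Built into Node.js (perf_hooks); export the value through prom-client or similar if needed
const { monitorEventLoopDelay } = require('perf_hooks');

// Sample the loop delay every 20 ms
const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

// Log the P99 delay once a minute; percentile() returns nanoseconds
setInterval(() => {
  const p99Ms = histogram.percentile(99) / 1e6;
  console.log(`Event loop P99: ${p99Ms.toFixed(1)}ms`);
  histogram.reset();
}, 60000);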
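To mirror the alert rule in-process, the per-minute P99 can be fed to a small detector that reports when lag has stayed above the threshold for consecutive windows. This is a sketch under assumptions: `createSustainedLagDetector` is a hypothetical helper, and the way the real alert evaluates its 2-minute window may differ.

```javascript
// Hypothetical helper: returns a function that records one P99 sample (in ms)
// per window and reports whether lag has exceeded `thresholdMs` for
// `requiredWindows` consecutive windows.
function createSustainedLagDetector(thresholdMs = 100, requiredWindows = 2) {
  let consecutive = 0;
  return function record(p99Ms) {
    // Any window at or below the threshold resets the streak.
    consecutive = p99Ms > thresholdMs ? consecutive + 1 : 0;
    return consecutive >= requiredWindows;
  };
}
```

With a 60-second sampling interval, two consecutive windows approximate the "2+ minutes" condition: call `record(histogram.percentile(99) / 1e6)` inside the interval callback and log or emit a metric when it returns `true`.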

Prevention

  • Profile code for blocking operations
  • Use worker threads for CPU-intensive tasks
  • Implement request size limits
  • Add timeouts to all operations
  • Run performance tests regularly
  • Monitor event loop metrics
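As an illustration of the worker-thread item, the sketch below moves a CPU-heavy parse off the main thread and returns only a small summary. The `countSpansInWorker` name and the `{ spans: [...] }` payload shape are assumptions made for the example, not ThinkHive's actual trace format.

```javascript
const { Worker } = require('worker_threads');

// Worker body as a string so the example is self-contained; in a real
// service this would normally live in its own file.
const workerSource = `
  const { parentPort, workerData } = require('worker_threads');
  const trace = JSON.parse(workerData);           // heavy parse runs off the main thread
  const spans = Array.isArray(trace.spans) ? trace.spans : [];
  parentPort.postMessage(spans.length);           // send back only a small result
`;

// Hypothetical helper: count spans in a JSON trace without blocking the event loop.
function countSpansInWorker(jsonText) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, { eval: true, workerData: jsonText });
    worker.once('message', resolve);
    worker.once('error', reject);
    worker.once('exit', (code) => {
      if (code !== 0) reject(new Error(`Worker exited with code ${code}`));
    });
  });
}
```

Returning a summary rather than the full parsed object matters: results are copied back to the main thread, so posting a large object would move part of the cost back onto the event loop. Spawning a worker per call also has overhead, so a long-lived worker or pool is likely a better fit for frequent work.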