
Trace Ingestion Issues

⚠️ Severity: High | Alert Threshold: Ingestion rate drop > 50% OR ingestion errors > 5%

Overview

This alert triggers when trace ingestion is degraded, either through reduced throughput or increased error rates. This affects customers’ ability to capture agent interactions.

Ingestion Endpoints

Endpoint             Protocol    Purpose
/api/v1/traces       REST        SDK trace upload
/v1/traces           OTLP/HTTP   OpenTelemetry traces
/api/otlp/v1/traces  OTLP/HTTP   OTLP ingestion

Diagnostic Steps

Check Ingestion Metrics

# View recent ingestion requests
gcloud logging read 'httpRequest.requestUrl=~"/traces"' \
  --limit 100 \
  --format "table(timestamp, httpRequest.status, httpRequest.latency)"

Check for Ingestion Errors

# Find failed ingestion attempts
gcloud logging read 'httpRequest.requestUrl=~"/traces" AND httpRequest.status>=400' \
  --limit 50
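To decide whether the results of the query above actually breach the 5% alert threshold, the error rate can be computed from the returned HTTP statuses. A minimal sketch (the function names are illustrative, not part of any tooling):

```typescript
// Compute the ingestion error rate from a list of HTTP status codes.
// A status >= 400 counts as a failed ingestion attempt.
function errorRate(statuses: number[]): number {
  if (statuses.length === 0) return 0;
  const errors = statuses.filter((s) => s >= 400).length;
  return errors / statuses.length;
}

// The runbook's alert threshold: ingestion errors > 5%.
const breaching = (statuses: number[]): boolean => errorRate(statuses) > 0.05;
```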

Verify Endpoint Health

# Test endpoints
curl -X POST https://demo.thinkhive.ai/api/v1/traces \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"test": true}'

Check Database Write Performance

# Look for slow writes
gcloud logging read 'textPayload=~"trace" AND textPayload=~"insert"' \
  --limit 30

Common Causes & Remediation

Database Write Bottleneck

Symptoms: Slow ingestion, timeout errors

Diagnostic:

-- Check for lock contention on traces table
SELECT * FROM pg_locks
WHERE relation = 'traces'::regclass;

Fix:

  1. Check database performance (see Database Slow)
  2. Consider batching writes
  3. Add write replicas if available
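Write batching (step 2) can be sketched as below. This is illustrative, not the actual pipeline code: `insertTraces` stands in for the real database write, and the batch size of 50 is an assumption.

```typescript
type Trace = { id: string; payload: unknown };

// Buffers incoming traces and writes them in one multi-row insert
// instead of N single-row writes, reducing lock churn on the traces table.
class TraceBatcher {
  private buffer: Trace[] = [];

  constructor(
    private insertTraces: (rows: Trace[]) => Promise<void>,
    private maxBatch = 50,
  ) {}

  async add(trace: Trace): Promise<void> {
    this.buffer.push(trace);
    if (this.buffer.length >= this.maxBatch) {
      await this.flush();
    }
  }

  async flush(): Promise<void> {
    if (this.buffer.length === 0) return;
    const rows = this.buffer;
    this.buffer = [];
    await this.insertTraces(rows);
  }
}
```

Call `flush()` on shutdown or on a timer so a slow trickle of traces is not held in the buffer indefinitely.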

Ingestion Pipeline Health

Client SDK → API Gateway → Validation → Database Write → Async Processing

Check Each Stage

  1. API Gateway: Check Cloud Run service health
  2. Validation: Look for validation errors in logs
  3. Database: Check write latency and connection pool
  4. Async Processing: Check job queue (if using BullMQ)
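The stage-by-stage check can be scripted as a front-to-back walk of the pipeline, stopping at the first failing stage, since a broken upstream stage explains downstream symptoms. A sketch; the probe functions are placeholders you would wire to the checks above:

```typescript
type Probe = () => Promise<boolean>;

// Run one stage probe, logging and swallowing errors so a crash in a
// probe reads as a failed stage rather than aborting the whole check.
async function checkStage(name: string, probe: Probe): Promise<boolean> {
  try {
    const ok = await probe();
    console.log(`${name}: ${ok ? "ok" : "DEGRADED"}`);
    return ok;
  } catch (e) {
    console.log(`${name}: ERROR ${(e as Error).message}`);
    return false;
  }
}

// Walk the pipeline in order; return the first failing stage, or null.
async function checkPipeline(stages: Array<[string, Probe]>): Promise<string | null> {
  for (const [name, probe] of stages) {
    if (!(await checkStage(name, probe))) return name;
  }
  return null;
}
```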

Emergency Actions

Scale Up Ingestion Capacity

gcloud run services update thinkhive-demo \
  --region us-central1 \
  --memory 1Gi \
  --cpu 2 \
  --max-instances 20

Bypass Non-Critical Processing

If async processing is backed up, focus on core ingestion:

# Disable non-critical features temporarily
gcloud run services update thinkhive-demo \
  --region us-central1 \
  --update-env-vars SKIP_ASYNC_PROCESSING=true

Client SDK Best Practices

Share with affected customers:

// Batch traces instead of sending individually
const batch = [];
for (const trace of traces) {
  batch.push(trace);
  if (batch.length >= 10) {
    await client.sendBatch(batch);
    batch.length = 0;
  }
}
// Flush any remaining traces after the loop
if (batch.length > 0) {
  await client.sendBatch(batch);
}
 
// Implement retry with exponential backoff
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function sendWithRetry(trace, maxRetries = 3) {
  let lastError;
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await client.send(trace);
    } catch (e) {
      if (e.status === 429 || e.status >= 500) {
        lastError = e;
        await sleep(Math.pow(2, i) * 1000); // 1s, 2s, 4s
      } else {
        throw e; // non-retriable client error
      }
    }
  }
  throw lastError; // surface the failure after exhausting retries
}

Prevention

  • Monitor ingestion latency and error rates
  • Set up alerts for ingestion drops
  • Implement graceful degradation
  • Regular load testing
  • Document payload limits clearly
  • Provide SDK batching features
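Graceful degradation can mean load-shedding by sampling: under pressure, keep a fraction of traces rather than rejecting all of them. A sketch; the queue-depth signal, 10,000 threshold, and 10% keep rate are all illustrative assumptions:

```typescript
// Decide whether to accept a trace given current backlog.
// Healthy: keep everything. Overloaded: keep a random ~10% sample.
function shouldIngest(queueDepth: number, maxDepth = 10_000, keepRate = 0.1): boolean {
  if (queueDepth < maxDepth) return true;
  return Math.random() < keepRate;
}
```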