
Trace Ingestion Issues

⚠️ Severity: High | Alert Threshold: Ingestion rate drop > 50% OR ingestion errors > 5%

Overview

This alert triggers when trace ingestion is degraded, either through reduced throughput or increased error rates. This affects customers’ ability to capture agent interactions.

Ingestion Endpoints

Endpoint             Protocol    Purpose
/api/v1/traces       REST        SDK trace upload
/v1/traces           OTLP/HTTP   OpenTelemetry traces
/api/otlp/v1/traces  OTLP/HTTP   OTLP ingestion

Diagnostic Steps

Check Ingestion Metrics

# View recent ingestion requests
gcloud logging read 'httpRequest.requestUrl=~"/traces"' \
  --limit 100 \
  --format "table(timestamp, httpRequest.status, httpRequest.latency)"

Check for Ingestion Errors

# Find failed ingestion attempts
gcloud logging read 'httpRequest.requestUrl=~"/traces" AND httpRequest.status>=400' \
  --limit 50
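To decide whether the results of the query above actually breach the 5% alert threshold, the error rate can be computed from the returned HTTP statuses. A minimal sketch (the function names are illustrative, not part of any tooling):

```typescript
// Compute the ingestion error rate from a list of HTTP status codes.
// A status >= 400 counts as a failed ingestion attempt.
function errorRate(statuses: number[]): number {
  if (statuses.length === 0) return 0;
  const errors = statuses.filter((s) => s >= 400).length;
  return errors / statuses.length;
}

// The runbook's alert threshold: ingestion errors > 5%.
const breaching = (statuses: number[]): boolean => errorRate(statuses) > 0.05;
```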

Verify Endpoint Health

# Test endpoints
curl -X POST https://demo.thinkhive.ai/api/v1/traces \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"test": true}'

Check Database Write Performance

# Look for slow writes
gcloud logging read 'textPayload=~"trace" AND textPayload=~"insert"' \
  --limit 30

Common Causes & Remediation

Database Write Bottleneck

Symptoms: Slow ingestion, timeout errors

Diagnostic:

-- Check for lock contention on traces table
SELECT * FROM pg_locks
WHERE relation = 'traces'::regclass;

Fix:

  1. Check database performance (see Database Slow)
  2. Consider batching writes
  3. Add write replicas if available
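Write batching (step 2) can be sketched as below. This is illustrative, not the actual pipeline code: `insertTraces` stands in for the real database write, and the batch size of 50 is an assumption.

```typescript
type Trace = { id: string; payload: unknown };

// Buffers incoming traces and writes them in one multi-row insert
// instead of N single-row writes, reducing lock churn on the traces table.
class TraceBatcher {
  private buffer: Trace[] = [];

  constructor(
    private insertTraces: (rows: Trace[]) => Promise<void>,
    private maxBatch = 50,
  ) {}

  async add(trace: Trace): Promise<void> {
    this.buffer.push(trace);
    if (this.buffer.length >= this.maxBatch) {
      await this.flush();
    }
  }

  async flush(): Promise<void> {
    if (this.buffer.length === 0) return;
    const rows = this.buffer;
    this.buffer = [];
    await this.insertTraces(rows);
  }
}
```

Call `flush()` on shutdown or on a timer so a slow trickle of traces is not held in the buffer indefinitely.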

Ingestion Pipeline Health

Client SDK → API Gateway → Validation → Database Write → Async Processing

Check Each Stage

  1. API Gateway: Check Cloud Run service health
  2. Validation: Look for validation errors in logs
  3. Database: Check write latency and connection pool
  4. Async Processing: Check job queue (if using BullMQ)
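The stage-by-stage check can be scripted as a front-to-back walk of the pipeline, stopping at the first failing stage, since a broken upstream stage explains downstream symptoms. A sketch; the probe functions are placeholders you would wire to the checks above:

```typescript
type Probe = () => Promise<boolean>;

// Run one stage probe, logging and swallowing errors so a crash in a
// probe reads as a failed stage rather than aborting the whole check.
async function checkStage(name: string, probe: Probe): Promise<boolean> {
  try {
    const ok = await probe();
    console.log(`${name}: ${ok ? "ok" : "DEGRADED"}`);
    return ok;
  } catch (e) {
    console.log(`${name}: ERROR ${(e as Error).message}`);
    return false;
  }
}

// Walk the pipeline in order; return the first failing stage, or null.
async function checkPipeline(stages: Array<[string, Probe]>): Promise<string | null> {
  for (const [name, probe] of stages) {
    if (!(await checkStage(name, probe))) return name;
  }
  return null;
}
```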

Emergency Actions

Scale Up Ingestion Capacity

gcloud run services update thinkhive-demo \
  --region us-central1 \
  --memory 1Gi \
  --cpu 2 \
  --max-instances 20

Bypass Non-Critical Processing

If async processing is backed up, focus on core ingestion:

# Disable non-critical features temporarily
gcloud run services update thinkhive-demo \
  --region us-central1 \
  --update-env-vars SKIP_ASYNC_PROCESSING=true

Client SDK Best Practices

Share with affected customers:

// Batch traces instead of sending individually
const batch = [];
for (const trace of traces) {
  batch.push(trace);
  if (batch.length >= 10) {
    await client.sendBatch(batch);
    batch.length = 0;
  }
}
// Flush any remaining traces after the loop
if (batch.length > 0) {
  await client.sendBatch(batch);
}
 
// Implement retry with exponential backoff
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function sendWithRetry(trace, maxRetries = 3) {
  let lastError;
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await client.send(trace);
    } catch (e) {
      if (e.status === 429 || e.status >= 500) {
        lastError = e;
        await sleep(Math.pow(2, i) * 1000); // 1s, 2s, 4s
      } else {
        throw e; // non-retriable client error
      }
    }
  }
  throw lastError; // surface the failure after exhausting retries
}

Prevention

  • Monitor ingestion latency and error rates
  • Set up alerts for ingestion drops
  • Implement graceful degradation
  • Regular load testing
  • Document payload limits clearly
  • Provide SDK batching features
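Graceful degradation can mean load-shedding by sampling: under pressure, keep a fraction of traces rather than rejecting all of them. A sketch; the queue-depth signal, 10,000 threshold, and 10% keep rate are all illustrative assumptions:

```typescript
// Decide whether to accept a trace given current backlog.
// Healthy: keep everything. Overloaded: keep a random ~10% sample.
function shouldIngest(queueDepth: number, maxDepth = 10_000, keepRate = 0.1): boolean {
  if (queueDepth < maxDepth) return true;
  return Math.random() < keepRate;
}
```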