Trace Ingestion Issues
⚠️ Severity: High | Alert Threshold: Ingestion rate drop > 50% OR ingestion errors > 5%
Overview
This alert triggers when trace ingestion is degraded, either through reduced throughput or increased error rates. This affects customers’ ability to capture agent interactions.
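As a rough illustration of the alert condition, the two thresholds can be expressed as a single check over a baseline window and a current window of equal length. This is only a sketch: the function name, the window objects, and their `requests` and `errors` fields are assumptions for illustration, not the monitoring system's actual schema.

```javascript
// Sketch only: window shape and field names are illustrative assumptions.
// baseline.requests and current.requests are request counts over equal-length windows.
function evaluateIngestionAlert(baseline, current) {
  // Fractional drop in throughput relative to the baseline window
  const rateDrop = baseline.requests > 0
    ? (baseline.requests - current.requests) / baseline.requests
    : 0;
  // Fraction of current requests that failed
  const errorRate = current.requests > 0
    ? current.errors / current.requests
    : 0;

  const reasons = [];
  if (rateDrop > 0.5) reasons.push("rate_drop");      // drop > 50%
  if (errorRate > 0.05) reasons.push("error_rate");   // errors > 5%

  return { firing: reasons.length > 0, rateDrop, errorRate, reasons };
}
```

Either condition alone is enough to fire, which matches the OR in the threshold definition.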
Ingestion Endpoints
| Endpoint | Protocol | Purpose |
|---|---|---|
| /api/v1/traces | REST | SDK trace upload |
| /v1/traces | OTLP/HTTP | OpenTelemetry traces |
| /api/otlp/v1/traces | OTLP/HTTP | OTLP ingestion |
Diagnostic Steps
Check Ingestion Metrics
# View recent ingestion requests
gcloud logging read 'httpRequest.requestUrl=~"/traces"' \
--limit 100 \
--format "table(timestamp, httpRequest.status, httpRequest.latency)"
Check for Ingestion Errors
# Find failed ingestion attempts
gcloud logging read 'httpRequest.requestUrl=~"/traces" AND httpRequest.status>=400' \
--limit 50
Verify Endpoint Health
# Test endpoints
curl -X POST https://demo.thinkhive.ai/api/v1/traces \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{"test": true}'
Check Database Write Performance
# Look for slow writes
gcloud logging read 'textPayload=~"trace" AND textPayload=~"insert"' \
--limit 30
Common Causes & Remediation
Symptoms: Slow ingestion, timeout errors
Diagnostic:
-- Check for lock contention on traces table
SELECT * FROM pg_locks
WHERE relation = 'traces'::regclass;
Fix:
- Check database performance (see Database Slow)
- Consider batching writes
- Move read traffic to read replicas, if available, to free capacity on the primary for writes
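To make the "batching writes" suggestion concrete, here is a minimal sketch of a server-side buffer that collects rows and writes them in groups. It assumes a hypothetical `writeRows` function that performs one multi-row insert; the class name and default batch size are also illustrative and do not describe the actual ingestion service.

```javascript
// Sketch only: `writeRows` is a hypothetical async function that performs
// one multi-row insert for the array of rows it receives.
class WriteBuffer {
  constructor(writeRows, maxSize = 100) {
    this.writeRows = writeRows;
    this.maxSize = maxSize;
    this.rows = [];
  }

  // Queue one row and flush once the buffer reaches maxSize.
  async add(row) {
    this.rows.push(row);
    if (this.rows.length >= this.maxSize) {
      await this.flush();
    }
  }

  // Write everything queued so far in a single call.
  async flush() {
    if (this.rows.length === 0) return;
    const pending = this.rows;
    this.rows = [];
    try {
      await this.writeRows(pending);
    } catch (e) {
      // Put the rows back so a later flush can retry them.
      this.rows = pending.concat(this.rows);
      throw e;
    }
  }
}
```

A real deployment would probably also flush on a timer and on shutdown so that a partially filled buffer is not held indefinitely.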
Ingestion Pipeline Health
Client SDK → API Gateway → Validation → Database Write → Async Processing
Check Each Stage
- API Gateway: Check Cloud Run service health
- Validation: Look for validation errors in logs
- Database: Check write latency and connection pool
- Async Processing: Check job queue (if using BullMQ)
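When checking the stages, it usually helps to walk them in pipeline order, because a failure upstream tends to show up as reduced traffic or errors downstream. The sketch below assumes you have already gathered per-stage request and error counts; the metric shape, the 5% default threshold, and the function name are illustrative assumptions.

```javascript
// Server-side stages in pipeline order (the client SDK is excluded).
const STAGES = ["API Gateway", "Validation", "Database Write", "Async Processing"];

// Sketch only: `metrics` maps a stage name to { total, errors }.
// Returns the first stage whose error rate exceeds maxErrorRate, or null.
function firstUnhealthyStage(metrics, maxErrorRate = 0.05) {
  for (const stage of STAGES) {
    const m = metrics[stage];
    if (!m || m.total === 0) continue; // no data for this stage, skip it
    if (m.errors / m.total > maxErrorRate) return stage;
  }
  return null;
}
```

For example, if both Validation and Database Write show elevated errors, the function returns "Validation", which is the stage to investigate first.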
Emergency Actions
Scale Up Ingestion Capacity
gcloud run services update thinkhive-demo \
--region us-central1 \
--memory 1Gi \
--cpu 2 \
--max-instances 20
Bypass Non-Critical Processing
If async processing is backed up, focus on core ingestion:
# Disable non-critical features temporarily
gcloud run services update thinkhive-demo \
--region us-central1 \
--update-env-vars SKIP_ASYNC_PROCESSING=true
Client SDK Best Practices
Share with affected customers:
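// Batch traces instead of sending them individually
const BATCH_SIZE = 10;
const batch = [];
for (const trace of traces) {
  batch.push(trace);
  if (batch.length >= BATCH_SIZE) {
    await client.sendBatch(batch);
    batch.length = 0;
  }
}
// Flush the final partial batch so no traces are left unsent
if (batch.length > 0) {
  await client.sendBatch(batch);
}

// Retry with exponential backoff on 429 and 5xx responses
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function sendWithRetry(trace, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await client.send(trace);
    } catch (e) {
      const retryable = e.status === 429 || e.status >= 500;
      // Rethrow non-retryable errors and the final failed attempt
      if (!retryable || attempt === maxRetries - 1) throw e;
      await sleep(2 ** attempt * 1000); // 1s, 2s, 4s, ...
    }
  }
}
Prevention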
- Monitor ingestion latency and error rates
- Set up alerts for ingestion drops
- Implement graceful degradation
- Regular load testing
- Document payload limits clearly
- Provide SDK batching features
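One possible form of graceful degradation on the client side is a bounded in-memory queue that drops the oldest traces when ingestion is unavailable, so that the host application's memory stays flat. The sketch below is an assumption about how this could look; the class name, default capacity, and drop-oldest policy are illustrative and are not features of the current SDK.

```javascript
// Sketch only: a bounded queue that discards the oldest trace when full.
class BoundedTraceQueue {
  constructor(capacity = 1000) {
    this.capacity = capacity;
    this.items = [];
    this.dropped = 0; // number of traces discarded so far
  }

  // Add a trace, evicting the oldest one if the queue is at capacity.
  push(trace) {
    if (this.items.length >= this.capacity) {
      this.items.shift();
      this.dropped++;
    }
    this.items.push(trace);
  }

  // Remove and return all queued traces, for example to send as one batch.
  drain() {
    const out = this.items;
    this.items = [];
    return out;
  }
}
```

Reporting the `dropped` counter alongside normal telemetry makes data loss visible to the customer.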