High-Volume Log Management
Handle millions of log events per second with LogTide using batching, partitioning, and retention strategies at scale.
When your systems generate millions of log events per second, naive logging approaches fail. Buffers overflow, disks fill up, and your observability pipeline becomes a liability. This guide covers the architecture and configuration patterns for handling high-volume log workloads with LogTide.
The Problem with High-Volume Logging
What Happens at Scale
At 10,000 events/sec, most logging setups work fine. At 100,000+, problems start:
❌ Common failure modes at high volume:
1. Network saturation → SDK buffers fill, logs dropped
2. Disk I/O bottleneck → Write latency spikes, queries slow
3. Memory pressure → OOM kills on log processors
4. Ingestion lag → Minutes of delay between event and visibility
5. Storage explosion → Terabytes per day, costs spiral
Scale Reference Points
| Events/sec | GB/day (avg 500 bytes/event) | Use Case |
|---|---|---|
| 1,000 | ~43 GB | Small SaaS |
| 10,000 | ~430 GB | Mid-size application |
| 100,000 | ~4.3 TB | Large platform |
| 1,000,000 | ~43 TB | High-traffic / IoT |
The LogTide Approach
Architecture for High Volume
```text
┌───────────┐     ┌─────────┐     ┌────────────┐     ┌─────────┐
│ Your Apps │────▶│  Kafka  │────▶│  LogTide   │────▶│ Queries │
│  (SDKs)   │     │ Cluster │     │ Ingesters  │     │ & SIEM  │
└───────────┘     └─────────┘     └────────────┘     └─────────┘
                       │               │
                       ▼               ▼
                  ┌─────────┐    ┌────────────┐
                  │ S3/GCS  │    │ PostgreSQL │
                  │ Archive │    │ TimescaleDB│
                  └─────────┘    └────────────┘
```
Key principles:
- Buffer with Kafka between your apps and LogTide — handle bursts without backpressure
- Horizontal scaling of ingestion workers — add more consumers for more throughput
- Tiered storage — hot data in PostgreSQL/TimescaleDB, cold data in S3
- Sampling and filtering at the edge — not every debug log needs to be stored
Implementation
1. SDK-Level Batching
Configure your application SDKs for high-throughput batching:
```typescript
// Node.js SDK - tuned for high volume
import { LogTideClient } from '@logtide/node';

const client = new LogTideClient({
  dsn: process.env.LOGTIDE_DSN!,
  service: 'high-throughput-api',

  // Batching tuning
  batchSize: 500,       // Larger batches (default: 100)
  flushInterval: 2000,  // Flush every 2s (default: 5s)
  maxQueueSize: 50000,  // Larger in-memory buffer

  // Compression
  compress: true,       // gzip batches before sending

  // Reliability
  maxRetries: 3,
  retryDelay: 1000,
});
```
```python
# Python SDK - tuned for high volume
import os

from logtide import LogTideClient

client = LogTideClient(
    api_url=os.environ["LOGTIDE_API_URL"],
    api_key=os.environ["LOGTIDE_API_KEY"],
    batch_size=500,
    flush_interval=2.0,
    async_mode=True,
    global_metadata={
        "environment": "production",
        "tier": "critical",
    },
)
```
2. Log Level Filtering
At high volume, not every log needs to reach LogTide. Filter at the SDK level:
```typescript
// Only ship warning+ in production, debug+ in staging
const minLevel = process.env.NODE_ENV === 'production' ? 'warning' : 'debug';

const client = new LogTideClient({
  dsn: process.env.LOGTIDE_DSN!,
  service: 'api',
  minLevel,
});

// In hot paths, use conditional logging
if (client.isEnabled('debug')) {
  client.debug('Processing item', { itemId, step: 'validation' });
}
```
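Under the hood, a `minLevel` gate is just an ordinal comparison on severity. A minimal sketch of what such a check does (the level names and their ordering here are assumptions; consult your SDK's documentation for the exact set):

```typescript
// Severity order is an assumption; adjust to your SDK's actual levels.
const LEVELS = ['debug', 'info', 'warning', 'error', 'critical'] as const;
type Level = (typeof LEVELS)[number];

// Returns true when `level` is at or above the configured minimum.
function isEnabled(level: Level, minLevel: Level): boolean {
  return LEVELS.indexOf(level) >= LEVELS.indexOf(minLevel);
}
```

The point of the guard in the snippet above is that the check is a cheap integer comparison, while building the log payload in a hot path may not be.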
3. Sampling for High-Frequency Events
For events that fire thousands of times per second, sample instead of logging everything:
```typescript
// Sample 1% of successful requests, 100% of errors
function shouldLog(statusCode: number): boolean {
  if (statusCode >= 400) return true;  // Always log errors
  if (statusCode >= 300) return true;  // Always log redirects
  return Math.random() < 0.01;         // 1% sample for 2xx
}

app.use((req, res, next) => {
  const start = performance.now();
  res.on('finish', () => {
    if (shouldLog(res.statusCode)) {
      client.info(`${req.method} ${req.path} ${res.statusCode}`, {
        durationMs: Math.round(performance.now() - start),
        sampled: res.statusCode < 300,
        sampleRate: res.statusCode < 300 ? 0.01 : 1.0,
      });
    }
  });
  next();
});
```
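Recording `sampleRate` alongside each event is what keeps sampled data usable later: when you aggregate, scale each observed count by the inverse of its sample rate. A minimal sketch:

```typescript
// Estimate the true event count from a sampled count.
// e.g. 120 events observed at a 1% sample rate ≈ 12,000 real events.
function estimateTrueCount(observedCount: number, sampleRate: number): number {
  if (sampleRate <= 0 || sampleRate > 1) {
    throw new RangeError('sampleRate must be in (0, 1]');
  }
  return Math.round(observedCount / sampleRate);
}
```

This is why events logged at full fidelity carry `sampleRate: 1.0`: the same scaling logic then works uniformly across sampled and unsampled events.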
4. Kafka Buffer Layer
For sustained high volume, add Kafka between your apps and LogTide:
```typescript
// Producer: Your application sends to Kafka
import { Kafka, CompressionTypes } from 'kafkajs';

const kafka = new Kafka({
  brokers: process.env.KAFKA_BROKERS!.split(','),
});

const producer = kafka.producer({
  allowAutoTopicCreation: false,
});

await producer.connect();

// High-throughput producer configuration
// `batch` is the array of log events your app has accumulated
await producer.send({
  topic: 'logs.application',
  compression: CompressionTypes.GZIP,
  messages: batch.map(event => ({
    key: event.service, // Partition by service to preserve per-service ordering
    value: JSON.stringify(event),
  })),
});
```
See the Kafka Integration Guide for full setup.
5. Tiered Retention
Not all logs need the same retention:
```yaml
# LogTide retention configuration
retention:
  # Hot tier: fast queries, expensive storage
  hot:
    duration: 7d
    storage: postgresql   # TimescaleDB hypertable

  # Warm tier: slower queries, cheaper storage
  warm:
    duration: 30d
    storage: compressed   # Compressed chunks

  # Cold tier: archival, cheapest storage
  cold:
    duration: 365d
    storage: s3           # Object storage
    bucket: logtide-archive
```
Cost impact at 100 GB/day:
| Tier | Duration | Storage Cost |
|---|---|---|
| Hot (SSD) | 7 days = 700 GB | ~$70/month |
| Warm (compressed) | 30 days = 1.5 TB (compressed) | ~$30/month |
| Cold (S3) | 365 days = 10 TB (compressed) | ~$23/month |
| Total | | ~$123/month |
vs. keeping everything in hot storage: ~$1,000/month.
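The table's total can be reproduced from the implied unit prices. The per-GB figures below are illustrative assumptions backing out the table, not quoted cloud pricing:

```typescript
// Monthly cost of one tier: stored GB × assumed $/GB/month.
function tierCost(storedGb: number, pricePerGbMonth: number): number {
  return storedGb * pricePerGbMonth;
}

// Assumed unit prices (illustrative only) at 100 GB/day ingest:
const total =
  tierCost(700, 0.10) +      // hot SSD: 7 days × 100 GB/day
  tierCost(1_500, 0.02) +    // warm: 30 days, ~2x compressed
  tierCost(10_000, 0.0023);  // cold: 365 days, heavily compressed
// total ≈ $123/month
```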
6. Horizontal Scaling
Scale LogTide ingestion horizontally:
```yaml
# Kubernetes: Scale ingestion workers
apiVersion: apps/v1
kind: Deployment
metadata:
  name: logtide-ingester
spec:
  replicas: 6  # Scale based on volume
  selector:
    matchLabels:
      app: logtide-ingester
  template:
    metadata:
      labels:
        app: logtide-ingester  # Must match the selector above
    spec:
      containers:
        - name: ingester
          image: logtide/logtide:latest
          env:
            - name: LOGTIDE_MODE
              value: "ingester"
            - name: KAFKA_BROKERS
              value: "kafka:9092"
            - name: KAFKA_GROUP_ID
              value: "logtide-ingest"
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "2000m"
---
# HPA for auto-scaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: logtide-ingester-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: logtide-ingester
  minReplicas: 3
  maxReplicas: 12
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```
Real-World Example: Gaming Platform
A real-time multiplayer platform generating 500,000 events/sec during peak hours.
Requirements:
- Game server events (player actions, match state)
- Anti-cheat detection (suspicious patterns)
- Infrastructure monitoring (server health)
- 7-day hot retention, 90-day archival
Architecture:
```text
Game Servers (200+)
      ↓  (UDP → Kafka)
Kafka (24 partitions, 3 brokers)
      ↓  (Consumer group)
LogTide Ingesters (6 replicas)
      ↓
TimescaleDB (hot: 7 days)
      ↓  (scheduled job)
S3 (cold: 90 days)
```
Configuration:
```typescript
// Game server SDK config
const client = new LogTideClient({
  dsn: process.env.LOGTIDE_DSN!,
  service: 'game-server',
  batchSize: 1000,
  flushInterval: 1000,
  compress: true,
  maxQueueSize: 100000,
});

// Only log actionable events in production
// Debug events sampled at 0.1%
client.info('match_started', {
  matchId,
  players: playerIds.length,
  map: mapName,
  mode: gameMode,
});
```
Results:
- Sustained 500k events/sec during peak
- P99 ingestion latency: 200ms
- Storage: 2.1 TB/day → 350 GB/day compressed
- Monthly infrastructure cost: ~$800
Performance Tuning Checklist
SDK Layer
- Batch size increased to 500+ for high volume
- Flush interval reduced to 1-2 seconds
- Compression enabled for network efficiency
- Log level filtering applied (no debug in production)
- Sampling configured for high-frequency events
- Graceful shutdown with flush on SIGTERM
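The flush-on-SIGTERM item deserves a concrete shape. A minimal sketch, assuming the client exposes an async `flush()` (check your SDK for the actual method name):

```typescript
// Assumed interface: the LogTide client exposes an async flush().
type Flushable = { flush: () => Promise<void> };

// Drain buffered events before the process exits.
async function shutdownGracefully(client: Flushable): Promise<void> {
  await client.flush();
}

// Wire it up so Kubernetes-style terminations don't drop buffered logs:
// process.on('SIGTERM', () => {
//   shutdownGracefully(client).then(() => process.exit(0));
// });
```

In practice you would also bound the flush with a timeout so a dead backend cannot block shutdown past the pod's termination grace period.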
Transport Layer
- Kafka deployed with replication factor 3
- Partitions sized for parallelism (12+ for high throughput)
- LZ4 compression on Kafka topics
- Consumer group with enough consumers to match partitions
- Consumer lag monitoring configured
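Partition count follows from your target throughput and a measured per-partition ceiling. A quick sizing sketch (the ~50k events/sec per partition figure is an assumption; benchmark your own brokers):

```typescript
// Minimum partitions to sustain a target rate, given a measured
// per-partition throughput ceiling.
function partitionsNeeded(
  targetEventsPerSec: number,
  perPartitionEventsPerSec: number
): number {
  return Math.ceil(targetEventsPerSec / perPartitionEventsPerSec);
}

// 500k events/sec at ~50k per partition → 10 partitions minimum;
// round up further for burst headroom and consumer parallelism.
```

Remember that consumers beyond the partition count sit idle, so partitions also cap how far the ingester tier can scale out.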
Storage Layer
- TimescaleDB hypertables with appropriate chunk intervals
- Compression policies for chunks older than 1 day
- Retention policies automated (don’t rely on manual cleanup)
- Disk provisioned with 2x expected peak capacity
- IOPS sufficient for write workload
Monitoring
- Consumer lag alerts (>10,000 = warning, >100,000 = critical)
- Ingestion throughput dashboard
- Disk usage alerts at 70% and 85%
- Query latency monitoring for degradation
Common Pitfalls
1. “We’ll just log everything”
At 100,000 events/sec, “everything” means 4.3 TB/day. Storage costs dominate.
Solution: Define what’s actionable. Sample routine events. Always log errors, alerts, and security events at full fidelity.
2. “Our SDK handles backpressure”
SDKs buffer in memory. If LogTide or Kafka is down for 5 minutes at 100k events/sec, that’s 30 million events in memory — potentially GBs of RAM.
Solution: Set maxQueueSize limits. Accept that during extended outages, some logs may be dropped. Log the drop count itself.
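The arithmetic behind that warning is worth making explicit so you can size `maxQueueSize` deliberately:

```typescript
// Events (and bytes) that pile up in an SDK buffer during an outage.
function bufferedDuringOutage(
  eventsPerSec: number,
  outageSeconds: number,
  bytesPerEvent: number
) {
  const events = eventsPerSec * outageSeconds;
  return { events, gigabytes: (events * bytesPerEvent) / 1e9 };
}

// 100k events/sec for a 5-minute outage at 500 bytes/event
// → 30,000,000 events, ~15 GB of RAM if fully buffered
const backlog = bufferedDuringOutage(100_000, 300, 500);
```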
3. “We’ll tune performance later”
At high volume, defaults fail fast. With a batch size of 100 and sequential sends at roughly 50 ms per round trip, a single client can push at most 20 batches/sec, or about 2,000 events/sec of maximum throughput.
Solution: Tune batch size and flush interval before going to production at scale.
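The per-client ceiling is easy to compute for your own settings (the ~50 ms round trip is an assumed network latency; measure yours):

```typescript
// Max events/sec for a client that sends batches sequentially,
// limited by one HTTP round trip per batch.
function maxEventsPerSec(batchSize: number, roundTripMs: number): number {
  const batchesPerSec = 1000 / roundTripMs;
  return batchSize * batchesPerSec;
}

// batch size 100 at ~50 ms/send → 2,000 events/sec ceiling;
// batch size 500 at the same latency → 10,000 events/sec
```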
4. “Same retention for everything”
Keeping debug logs for a year at $0.10/GB/month is wasteful.
Solution: Tiered retention. 7 days hot for debugging, 30 days warm for analysis, 365 days cold for compliance.
Cost Comparison at Scale
100 GB/day ingestion:
| Solution | Monthly Cost |
|---|---|
| Datadog | $3,000+ |
| Splunk Cloud | $4,500+ |
| Azure Monitor | $5,880+ |
| AWS CloudWatch | $1,500+ |
| LogTide (self-hosted) | $400-800 |
1 TB/day ingestion:
| Solution | Monthly Cost |
|---|---|
| Datadog | $30,000+ |
| Splunk Cloud | $45,000+ |
| Azure Monitor | $58,800+ |
| LogTide (self-hosted) | $2,000-4,000 |
Next Steps
- Apache Kafka Integration - Build the buffer layer
- Kubernetes Integration - Scale LogTide on K8s
- Cost Optimization - Detailed savings analysis
- Real-Time Alerting - Alert at high volume