High-Volume Log Management
Handle millions of log events per second with LogTide using batching, partitioning, and retention strategies at scale.
When your systems generate millions of log events per second, naive logging approaches fail. Buffers overflow, disks fill up, and your observability pipeline becomes a liability. This guide covers the architecture and configuration patterns for handling high-volume log workloads with LogTide.
The Problem with High-Volume Logging
What Happens at Scale
At 10,000 events/sec, most logging setups work fine. At 100,000+, problems start:
❌ Common failure modes at high volume:
1. Network saturation → SDK buffers fill, logs dropped
2. Disk I/O bottleneck → Write latency spikes, queries slow
3. Memory pressure → OOM kills on log processors
4. Ingestion lag → Minutes of delay between event and visibility
5. Storage explosion → Terabytes per day, costs spiral
Scale Reference Points
| Events/sec | GB/day (avg 500 bytes/event) | Use Case |
|---|---|---|
| 1,000 | ~43 GB | Small SaaS |
| 10,000 | ~430 GB | Mid-size application |
| 100,000 | ~4.3 TB | Large platform |
| 1,000,000 | ~43 TB | High-traffic / IoT |
The LogTide Approach
Architecture for High Volume
```text
┌───────────┐     ┌─────────┐     ┌────────────┐     ┌─────────┐
│ Your Apps │────▶│  Kafka  │────▶│  LogTide   │────▶│ Queries │
│  (SDKs)   │     │ Cluster │     │ Ingesters  │     │ & SIEM  │
└───────────┘     └─────────┘     └────────────┘     └─────────┘
                       │               │
                       ▼               ▼
                  ┌─────────┐    ┌────────────┐
                  │ S3/GCS  │    │ PostgreSQL │
                  │ Archive │    │ TimescaleDB│
                  └─────────┘    └────────────┘
```
Key principles:
- Buffer with Kafka between your apps and LogTide — handle bursts without backpressure
- Horizontal scaling of ingestion workers — add more consumers for more throughput
- Tiered storage — hot data in PostgreSQL/TimescaleDB, cold data in S3
- Sampling and filtering at the edge — not every debug log needs to be stored
Implementation
1. SDK-Level Batching
Configure your application SDKs for high-throughput batching:
```typescript
// Node.js SDK - tuned for high volume
import { LogTideClient } from '@logtide/node';

const client = new LogTideClient({
  dsn: process.env.LOGTIDE_DSN!,
  service: 'high-throughput-api',

  // Batching tuning
  batchSize: 500,       // Larger batches (default: 100)
  flushInterval: 2000,  // Flush every 2s (default: 5s)
  maxQueueSize: 50000,  // Larger in-memory buffer

  // Compression
  compress: true,       // gzip batches before sending

  // Reliability
  maxRetries: 3,
  retryDelay: 1000,
});
```
```python
# Python SDK - tuned for high volume
import os

from logtide import LogTideClient

client = LogTideClient(
    api_url=os.environ["LOGTIDE_API_URL"],
    api_key=os.environ["LOGTIDE_API_KEY"],
    batch_size=500,
    flush_interval=2.0,
    async_mode=True,
    global_metadata={
        "environment": "production",
        "tier": "critical",
    },
)
```
2. Log Level Filtering
At high volume, not every log needs to reach LogTide. Filter at the SDK level:
```typescript
// Only ship warning+ in production, debug+ in staging
const minLevel = process.env.NODE_ENV === 'production' ? 'warning' : 'debug';

const client = new LogTideClient({
  dsn: process.env.LOGTIDE_DSN!,
  service: 'api',
  minLevel,
});

// In hot paths, use conditional logging
if (client.isEnabled('debug')) {
  client.debug('Processing item', { itemId, step: 'validation' });
}
```
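Under the hood, a `minLevel` gate is just an ordinal comparison on severity. A minimal sketch of what such a check does (the level names and their ordering here are assumptions; consult your SDK's documentation for the exact set):

```typescript
// Severity order is an assumption; adjust to your SDK's actual levels.
const LEVELS = ['debug', 'info', 'warning', 'error', 'critical'] as const;
type Level = (typeof LEVELS)[number];

// Returns true when `level` is at or above the configured minimum.
function isEnabled(level: Level, minLevel: Level): boolean {
  return LEVELS.indexOf(level) >= LEVELS.indexOf(minLevel);
}
```

The point of the guard in the snippet above is that the check is a cheap integer comparison, while building the log payload in a hot path may not be.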
3. Sampling for High-Frequency Events
For events that fire thousands of times per second, sample instead of logging everything:
```typescript
// Sample 1% of successful requests, 100% of errors
function shouldLog(statusCode: number): boolean {
  if (statusCode >= 400) return true;  // Always log errors
  if (statusCode >= 300) return true;  // Always log redirects
  return Math.random() < 0.01;         // 1% sample for 2xx
}

app.use((req, res, next) => {
  const start = performance.now();
  res.on('finish', () => {
    if (shouldLog(res.statusCode)) {
      client.info(`${req.method} ${req.path} ${res.statusCode}`, {
        durationMs: Math.round(performance.now() - start),
        sampled: res.statusCode < 300,
        sampleRate: res.statusCode < 300 ? 0.01 : 1.0,
      });
    }
  });
  next();
});
```
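Recording `sampleRate` alongside each event is what keeps sampled data usable later: when you aggregate, scale each observed count by the inverse of its sample rate. A minimal sketch:

```typescript
// Estimate the true event count from a sampled count.
// e.g. 120 events observed at a 1% sample rate ≈ 12,000 real events.
function estimateTrueCount(observedCount: number, sampleRate: number): number {
  if (sampleRate <= 0 || sampleRate > 1) {
    throw new RangeError('sampleRate must be in (0, 1]');
  }
  return Math.round(observedCount / sampleRate);
}
```

This is why events logged at full fidelity carry `sampleRate: 1.0`: the same scaling logic then works uniformly across sampled and unsampled events.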
4. Kafka Buffer Layer
For sustained high volume, add Kafka between your apps and LogTide:
```typescript
// Producer: Your application sends to Kafka
import { Kafka, CompressionTypes } from 'kafkajs';

const kafka = new Kafka({
  brokers: process.env.KAFKA_BROKERS!.split(','),
});

const producer = kafka.producer({
  allowAutoTopicCreation: false,
});

await producer.connect();

// High-throughput producer configuration
// `batch` is the array of log events your app has accumulated
await producer.send({
  topic: 'logs.application',
  compression: CompressionTypes.GZIP,
  messages: batch.map(event => ({
    key: event.service, // Partition by service to preserve per-service ordering
    value: JSON.stringify(event),
  })),
});
```
See the Kafka Integration Guide for full setup.
5. Tiered Retention
Not all logs need the same retention:
```yaml
# LogTide retention configuration
retention:
  # Hot tier: fast queries, expensive storage
  hot:
    duration: 7d
    storage: postgresql   # TimescaleDB hypertable

  # Warm tier: slower queries, cheaper storage
  warm:
    duration: 30d
    storage: compressed   # Compressed chunks

  # Cold tier: archival, cheapest storage
  cold:
    duration: 365d
    storage: s3           # Object storage
    bucket: logtide-archive
```
Cost impact at 100 GB/day:
| Tier | Duration | Storage Cost |
|---|---|---|
| Hot (SSD) | 7 days = 700 GB | ~$70/month |
| Warm (compressed) | 30 days = 1.5 TB (compressed) | ~$30/month |
| Cold (S3) | 365 days = 10 TB (compressed) | ~$23/month |
| Total | | ~$123/month |
vs. keeping everything in hot storage: ~$1,000/month.
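The table's total can be reproduced from the implied unit prices. The per-GB figures below are illustrative assumptions backing out the table, not quoted cloud pricing:

```typescript
// Monthly cost of one tier: stored GB × assumed $/GB/month.
function tierCost(storedGb: number, pricePerGbMonth: number): number {
  return storedGb * pricePerGbMonth;
}

// Assumed unit prices (illustrative only) at 100 GB/day ingest:
const total =
  tierCost(700, 0.10) +      // hot SSD: 7 days × 100 GB/day
  tierCost(1_500, 0.02) +    // warm: 30 days, ~2x compressed
  tierCost(10_000, 0.0023);  // cold: 365 days, heavily compressed
// total ≈ $123/month
```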
6. Horizontal Scaling
Scale LogTide ingestion horizontally:
```yaml
# Kubernetes: Scale ingestion workers
apiVersion: apps/v1
kind: Deployment
metadata:
  name: logtide-ingester
spec:
  replicas: 6  # Scale based on volume
  selector:
    matchLabels:
      app: logtide-ingester
  template:
    metadata:
      labels:
        app: logtide-ingester  # Must match the selector above
    spec:
      containers:
        - name: ingester
          image: logtide/logtide:latest
          env:
            - name: LOGTIDE_MODE
              value: "ingester"
            - name: KAFKA_BROKERS
              value: "kafka:9092"
            - name: KAFKA_GROUP_ID
              value: "logtide-ingest"
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "2000m"
---
# HPA for auto-scaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: logtide-ingester-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: logtide-ingester
  minReplicas: 3
  maxReplicas: 12
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```
Real-World Example: Gaming Platform
A real-time multiplayer platform generating 500,000 events/sec during peak hours.
Requirements:
- Game server events (player actions, match state)
- Anti-cheat detection (suspicious patterns)
- Infrastructure monitoring (server health)
- 7-day hot retention, 90-day archival
Architecture:
```text
Game Servers (200+)
      ↓  (UDP → Kafka)
Kafka (24 partitions, 3 brokers)
      ↓  (Consumer group)
LogTide Ingesters (6 replicas)
      ↓
TimescaleDB (hot: 7 days)
      ↓  (scheduled job)
S3 (cold: 90 days)
```
Configuration:
```typescript
// Game server SDK config
const client = new LogTideClient({
  dsn: process.env.LOGTIDE_DSN!,
  service: 'game-server',
  batchSize: 1000,
  flushInterval: 1000,
  compress: true,
  maxQueueSize: 100000,
});

// Only log actionable events in production
// Debug events sampled at 0.1%
client.info('match_started', {
  matchId,
  players: playerIds.length,
  map: mapName,
  mode: gameMode,
});
```
Results:
- Sustained 500k events/sec during peak
- P99 ingestion latency: 200ms
- Storage: 2.1 TB/day → 350 GB/day compressed
- Monthly infrastructure cost: ~$800
Performance Tuning Checklist
SDK Layer
- Batch size increased to 500+ for high volume
- Flush interval reduced to 1-2 seconds
- Compression enabled for network efficiency
- Log level filtering applied (no debug in production)
- Sampling configured for high-frequency events
- Graceful shutdown with flush on SIGTERM
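The flush-on-SIGTERM item deserves a concrete shape. A minimal sketch, assuming the client exposes an async `flush()` (check your SDK for the actual method name):

```typescript
// Assumed interface: the LogTide client exposes an async flush().
type Flushable = { flush: () => Promise<void> };

// Drain buffered events before the process exits.
async function shutdownGracefully(client: Flushable): Promise<void> {
  await client.flush();
}

// Wire it up so Kubernetes-style terminations don't drop buffered logs:
// process.on('SIGTERM', () => {
//   shutdownGracefully(client).then(() => process.exit(0));
// });
```

In practice you would also bound the flush with a timeout so a dead backend cannot block shutdown past the pod's termination grace period.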
Transport Layer
- Kafka deployed with replication factor 3
- Partitions sized for parallelism (12+ for high throughput)
- LZ4 compression on Kafka topics
- Consumer group with enough consumers to match partitions
- Consumer lag monitoring configured
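Partition count follows from your target throughput and a measured per-partition ceiling. A quick sizing sketch (the ~50k events/sec per partition figure is an assumption; benchmark your own brokers):

```typescript
// Minimum partitions to sustain a target rate, given a measured
// per-partition throughput ceiling.
function partitionsNeeded(
  targetEventsPerSec: number,
  perPartitionEventsPerSec: number
): number {
  return Math.ceil(targetEventsPerSec / perPartitionEventsPerSec);
}

// 500k events/sec at ~50k per partition → 10 partitions minimum;
// round up further for burst headroom and consumer parallelism.
```

Remember that consumers beyond the partition count sit idle, so partitions also cap how far the ingester tier can scale out.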
Storage Layer
- TimescaleDB hypertables with appropriate chunk intervals
- Compression policies for chunks older than 1 day
- Retention policies automated (don’t rely on manual cleanup)
- Disk provisioned with 2x expected peak capacity
- IOPS sufficient for write workload
Monitoring
- Consumer lag alerts (>10,000 = warning, >100,000 = critical)
- Ingestion throughput dashboard
- Disk usage alerts at 70% and 85%
- Query latency monitoring for degradation
Common Pitfalls
1. “We’ll just log everything”
At 100,000 events/sec, “everything” means 4.3 TB/day. Storage costs dominate.
Solution: Define what’s actionable. Sample routine events. Always log errors, alerts, and security events at full fidelity.
2. “Our SDK handles backpressure”
SDKs buffer in memory. If LogTide or Kafka is down for 5 minutes at 100k events/sec, that’s 30 million events in memory — potentially GBs of RAM.
Solution: Set maxQueueSize limits. Accept that during extended outages, some logs may be dropped. Log the drop count itself.
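The arithmetic behind that warning is worth making explicit so you can size `maxQueueSize` deliberately:

```typescript
// Events (and bytes) that pile up in an SDK buffer during an outage.
function bufferedDuringOutage(
  eventsPerSec: number,
  outageSeconds: number,
  bytesPerEvent: number
) {
  const events = eventsPerSec * outageSeconds;
  return { events, gigabytes: (events * bytesPerEvent) / 1e9 };
}

// 100k events/sec for a 5-minute outage at 500 bytes/event
// → 30,000,000 events, ~15 GB of RAM if fully buffered
const backlog = bufferedDuringOutage(100_000, 300, 500);
```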
3. “We’ll tune performance later”
At high volume, defaults fail fast. With a batch size of 100 and sequential sends at roughly 50 ms per round trip, a single client can push at most 20 batches/sec, or about 2,000 events/sec of maximum throughput.
Solution: Tune batch size and flush interval before going to production at scale.
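The per-client ceiling is easy to compute for your own settings (the ~50 ms round trip is an assumed network latency; measure yours):

```typescript
// Max events/sec for a client that sends batches sequentially,
// limited by one HTTP round trip per batch.
function maxEventsPerSec(batchSize: number, roundTripMs: number): number {
  const batchesPerSec = 1000 / roundTripMs;
  return batchSize * batchesPerSec;
}

// batch size 100 at ~50 ms/send → 2,000 events/sec ceiling;
// batch size 500 at the same latency → 10,000 events/sec
```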
4. “Same retention for everything”
Keeping debug logs for a year at $0.10/GB/month is wasteful.
Solution: Tiered retention. 7 days hot for debugging, 30 days warm for analysis, 365 days cold for compliance.
Cost Comparison at Scale
100 GB/day ingestion:
| Solution | Monthly Cost |
|---|---|
| Datadog | $3,000+ |
| Splunk Cloud | $4,500+ |
| Azure Monitor | $5,880+ |
| AWS CloudWatch | $1,500+ |
| LogTide (self-hosted) | $400-800 |
1 TB/day ingestion:
| Solution | Monthly Cost |
|---|---|
| Datadog | $30,000+ |
| Splunk Cloud | $45,000+ |
| Azure Monitor | $58,800+ |
| LogTide (self-hosted) | $2,000-4,000 |
Next Steps
- Apache Kafka Integration - Build the buffer layer
- Kubernetes Integration - Scale LogTide on K8s
- Cost Optimization - Detailed savings analysis
- Real-Time Alerting - Alert at high volume