High-Volume Log Management

Handle millions of log events per second with LogTide using batching, partitioning, and retention strategies at scale.

1M+ events/sec ingestion · Intelligent batching · Tiered retention · Backpressure handling

When your systems generate millions of log events per second, naive logging approaches fail. Buffers overflow, disks fill up, and your observability pipeline becomes a liability. This guide covers the architecture and configuration patterns for handling high-volume log workloads with LogTide.

The Problem with High-Volume Logging

What Happens at Scale

At 10,000 events/sec, most logging setups work fine. At 100,000+, problems start:

❌ Common failure modes at high volume:

1. Network saturation     → SDK buffers fill, logs dropped
2. Disk I/O bottleneck    → Write latency spikes, queries slow
3. Memory pressure        → OOM kills on log processors
4. Ingestion lag          → Minutes of delay between event and visibility
5. Storage explosion      → Terabytes per day, costs spiral

Scale Reference Points

Events/sec    Volume/day (avg 500 bytes/event)    Use Case
1,000         ~43 GB                              Small SaaS
10,000        ~430 GB                             Mid-size application
100,000       ~4.3 TB                             Large platform
1,000,000     ~43 TB                              High-traffic / IoT
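
The figures in this table follow from simple arithmetic: events per second, times average event size, times 86,400 seconds per day. A minimal sketch of that calculation, assuming decimal units (1 GB = 10^9 bytes); the function name `dailyVolumeGB` is illustrative and not part of any LogTide API:

```typescript
// Estimate raw (uncompressed) daily log volume from a sustained event rate.
// Uses decimal units (1 GB = 1e9 bytes), which matches the table's rounding.
const SECONDS_PER_DAY = 86_400;

function dailyVolumeGB(eventsPerSec: number, avgBytesPerEvent = 500): number {
  return (eventsPerSec * avgBytesPerEvent * SECONDS_PER_DAY) / 1e9;
}

for (const rate of [1_000, 10_000, 100_000, 1_000_000]) {
  console.log(`${rate} events/sec ≈ ${dailyVolumeGB(rate).toFixed(1)} GB/day`);
}
```

Plugging in your own average event size is worthwhile, since structured events with large metadata payloads can be several times larger than 500 bytes.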

The LogTide Approach

Architecture for High Volume

┌───────────┐     ┌─────────┐     ┌────────────┐     ┌─────────┐
│ Your Apps │────▶│  Kafka  │────▶│  LogTide   │────▶│ Queries │
│ (SDKs)    │     │ Cluster │     │  Ingesters │     │  & SIEM │
└───────────┘     └─────────┘     └────────────┘     └─────────┘
                       │                │
                       ▼                ▼
                  ┌─────────┐    ┌────────────┐
                  │  S3/GCS │    │ PostgreSQL │
                  │ Archive │    │ TimescaleDB│
                  └─────────┘    └────────────┘

Key principles:

  1. Buffer with Kafka between your apps and LogTide — handle bursts without backpressure
  2. Horizontal scaling of ingestion workers — add more consumers for more throughput
  3. Tiered storage — hot data in PostgreSQL/TimescaleDB, cold data in S3
  4. Sampling and filtering at the edge — not every debug log needs to be stored

Implementation

1. SDK-Level Batching

Configure your application SDKs for high-throughput batching:

// Node.js SDK - tuned for high volume
import { LogTideClient } from '@logtide/sdk-node';

const client = new LogTideClient({
  apiUrl: process.env.LOGTIDE_API_URL!,
  apiKey: process.env.LOGTIDE_API_KEY!,

  // Batching tuning
  batchSize: 500,           // Larger batches (default: 100)
  flushInterval: 2000,      // Flush every 2s (default: 5s)
  maxBufferSize: 50000,     // Larger in-memory buffer

  // Reliability
  maxRetries: 3,
  retryDelayMs: 1000,
});

# Python SDK - tuned for high volume
import os

from logtide_sdk import LogTideClient, ClientOptions

client = LogTideClient(ClientOptions(
    api_url=os.environ["LOGTIDE_API_URL"],
    api_key=os.environ["LOGTIDE_API_KEY"],

    batch_size=500,
    flush_interval=2.0,

    global_metadata={
        "environment": "production",
        "tier": "critical",
    },
))
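
To make the interaction between `batchSize` and `flushInterval` concrete, here is a generic batcher sketch. It is not the LogTide SDK's implementation; the `Batcher` class and its `send` callback are hypothetical names used only to illustrate the two flush triggers (batch full, or timer fired).

```typescript
// Illustrative only: a generic batcher, NOT the LogTide SDK's internals.
type SendFn<T> = (batch: T[]) => void;

class Batcher<T> {
  private buffer: T[] = [];
  private timer: ReturnType<typeof setInterval> | undefined;
  private readonly send: SendFn<T>;
  private readonly batchSize: number;
  private readonly flushIntervalMs: number;

  constructor(send: SendFn<T>, batchSize: number, flushIntervalMs: number) {
    this.send = send;
    this.batchSize = batchSize;
    this.flushIntervalMs = flushIntervalMs;
  }

  // Time trigger: a quiet service still ships partial batches periodically.
  start(): void {
    this.timer = setInterval(() => this.flush(), this.flushIntervalMs);
  }

  // Stop the timer and ship whatever is left (e.g. on SIGTERM).
  stop(): void {
    if (this.timer !== undefined) clearInterval(this.timer);
    this.timer = undefined;
    this.flush();
  }

  add(event: T): void {
    this.buffer.push(event);
    // Size trigger: under load, batches go out as soon as they are full.
    if (this.buffer.length >= this.batchSize) this.flush();
  }

  flush(): void {
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    this.send(batch);
  }
}

// Usage: 500-event batches, with a 2-second timer once start() is called.
const batcher = new Batcher<string>(
  (batch) => console.log(`sending ${batch.length} events`),
  500,
  2000,
);
batcher.add('example event');
batcher.flush();
```

Under heavy load the size trigger dominates and the timer rarely fires; under light load the timer bounds how stale a buffered event can get.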

2. Log Level Filtering

At high volume, not every log needs to reach LogTide. Filter at the SDK level:

const client = new LogTideClient({
  apiUrl: process.env.LOGTIDE_API_URL!,
  apiKey: process.env.LOGTIDE_API_KEY!,
});

// Only ship warn+ in production, debug+ in staging
const LEVELS = ['debug', 'info', 'warn', 'error', 'critical'];
const minLevel = process.env.NODE_ENV === 'production' ? 'warn' : 'debug';
const enabled = (level: string) => LEVELS.indexOf(level) >= LEVELS.indexOf(minLevel);

// In hot paths, gate logging before the call
if (enabled('debug')) {
  client.debug('api', 'Processing item', { itemId, step: 'validation' });
}

3. Sampling for High-Frequency Events

For events that fire thousands of times per second, sample instead of logging everything:

// Sample 1% of successful requests; keep 100% of redirects and errors
function shouldLog(statusCode: number): boolean {
  if (statusCode >= 400) return true;            // Always log errors
  if (statusCode >= 300) return true;            // Always log redirects
  return Math.random() < 0.01;                    // 1% sample for 2xx
}

app.use((req, res, next) => {
  const start = performance.now();

  res.on('finish', () => {
    if (shouldLog(res.statusCode)) {
      client.info('api', `${req.method} ${req.path} ${res.statusCode}`, {
        durationMs: Math.round(performance.now() - start),
        sampled: res.statusCode < 300,
        sampleRate: res.statusCode < 300 ? 0.01 : 1.0,
      });
    }
  });

  next();
});
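
Recording `sampleRate` on each event is what makes sampled data usable later: every stored event can be weighted by `1 / sampleRate` to estimate the true request count. A minimal sketch of that re-weighting, assuming events have already been fetched into memory; the `SampledLog` shape and `estimateTotalRequests` function are illustrative, not a LogTide query API:

```typescript
interface SampledLog {
  statusCode: number;
  sampleRate: number; // 1.0 = every event kept, 0.01 = 1 in 100 kept
}

// Each stored event stands in for (1 / sampleRate) real events.
function estimateTotalRequests(logs: SampledLog[]): number {
  const weighted = logs.reduce((sum, log) => sum + 1 / log.sampleRate, 0);
  return Math.round(weighted);
}

const stored: SampledLog[] = [
  { statusCode: 200, sampleRate: 0.01 },
  { statusCode: 200, sampleRate: 0.01 },
  { statusCode: 500, sampleRate: 1.0 },
];
console.log(`estimated requests: ${estimateTotalRequests(stored)}`);
```

The estimate is statistical: it is accurate for large counts and noisy for small ones, which is one reason errors are kept at full fidelity.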

4. Kafka Buffer Layer

For sustained high volume, add Kafka between your apps and LogTide:

// Producer: Your application sends to Kafka
import { Kafka, CompressionTypes } from 'kafkajs';

const kafka = new Kafka({
  brokers: process.env.KAFKA_BROKERS!.split(','),
});

const producer = kafka.producer({
  allowAutoTopicCreation: false,
});

await producer.connect();

// High-throughput send: one compressed request per batch of events
await producer.send({
  topic: 'logs.application',
  compression: CompressionTypes.GZIP, // GZIP is built into kafkajs
  messages: batch.map(event => ({
    key: event.service,
    value: JSON.stringify(event),
  })),
});

See the Kafka Integration Guide for full setup.

5. Tiered Retention

Not all logs need the same retention:

# LogTide retention configuration
retention:
  # Hot tier: fast queries, expensive storage
  hot:
    duration: 7d
    storage: postgresql     # TimescaleDB hypertable

  # Warm tier: slower queries, cheaper storage
  warm:
    duration: 30d
    storage: compressed     # Compressed chunks

  # Cold tier: archival, cheapest storage
  cold:
    duration: 365d
    storage: s3             # Object storage
    bucket: logtide-archive

Cost impact at 100 GB/day:

Tier                 Duration                          Storage Cost
Hot (SSD)            7 days = 700 GB                   ~$70/month
Warm (compressed)    30 days = 1.5 TB (compressed)     ~$30/month
Cold (S3)            365 days = 10 TB (compressed)     ~$23/month
Total                                                  ~$123/month

vs. keeping everything in hot storage: ~$1,000/month.
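
The tiered total is a sum of stored volume times price per GB-month for each tier. The sketch below reproduces the table's total; note that the per-GB prices are simply back-calculated from the table's own figures (cost divided by stored volume) and are assumptions, not quoted provider prices. Substitute your own rates.

```typescript
interface Tier {
  name: string;
  storedGB: number;        // data resident in the tier, after compression
  pricePerGBMonth: number; // assumption: implied by the table, not a quoted price
}

function monthlyCost(tiers: Tier[]): number {
  const total = tiers.reduce((sum, t) => sum + t.storedGB * t.pricePerGBMonth, 0);
  return Math.round(total * 100) / 100; // round to cents
}

const tiers: Tier[] = [
  { name: 'hot', storedGB: 700, pricePerGBMonth: 0.10 },
  { name: 'warm', storedGB: 1_500, pricePerGBMonth: 0.02 },
  { name: 'cold', storedGB: 10_000, pricePerGBMonth: 0.0023 },
];
console.log(`~$${monthlyCost(tiers)}/month`);
```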

6. Horizontal Scaling

Scale LogTide ingestion horizontally:

# Kubernetes: Scale ingestion workers
apiVersion: apps/v1
kind: Deployment
metadata:
  name: logtide-ingester
spec:
  replicas: 6        # Scale based on volume
  selector:
    matchLabels:
      app: logtide-ingester
  template:
    spec:
      containers:
        - name: ingester
          image: logtide/logtide:latest
          env:
            - name: LOGTIDE_MODE
              value: "ingester"
            - name: KAFKA_BROKERS
              value: "kafka:9092"
            - name: KAFKA_GROUP_ID
              value: "logtide-ingest"
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "2000m"
---
# HPA for auto-scaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: logtide-ingester-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: logtide-ingester
  minReplicas: 3
  maxReplicas: 12
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
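
When choosing `replicas` and `maxReplicas`, a rough starting point is peak rate divided by what one ingester can sustain, plus headroom. The sketch below is a back-of-the-envelope helper, not a LogTide tool: `perReplicaEventsPerSec` is a hypothetical figure you would measure with your own load test, and the code assumes one Kafka consumer per replica. The partition cap reflects how Kafka consumer groups work: each partition is read by at most one consumer in the group.

```typescript
interface SizingInput {
  peakEventsPerSec: number;
  perReplicaEventsPerSec: number; // assumption: measured in your own load test
  headroom: number;               // e.g. 1.5 = 50% spare capacity
  partitions: number;             // Kafka partitions on the topic
}

function ingesterReplicas(input: SizingInput): number {
  const needed = Math.ceil(
    (input.peakEventsPerSec * input.headroom) / input.perReplicaEventsPerSec,
  );
  // Replicas beyond the partition count would sit idle in the consumer group.
  return Math.min(needed, input.partitions);
}

console.log(
  ingesterReplicas({
    peakEventsPerSec: 500_000,
    perReplicaEventsPerSec: 100_000,
    headroom: 1.5,
    partitions: 24,
  }),
);
```

If the result is capped by the partition count, add partitions before raising `maxReplicas`.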

Real-World Example: Gaming Platform

A real-time multiplayer platform generating 500,000 events/sec during peak hours.

Requirements:

  • Game server events (player actions, match state)
  • Anti-cheat detection (suspicious patterns)
  • Infrastructure monitoring (server health)
  • 7-day hot retention, 90-day archival

Architecture:

Game Servers (200+)
    ↓ (UDP → Kafka)
Kafka (24 partitions, 3 brokers)
    ↓ (Consumer group)
LogTide Ingesters (6 replicas)
    ↓
TimescaleDB (hot: 7 days)
    ↓ (scheduled job)
S3 (cold: 90 days)

Configuration:

// Game server SDK config
const client = new LogTideClient({
  apiUrl: process.env.LOGTIDE_API_URL!,
  apiKey: process.env.LOGTIDE_API_KEY!,
  batchSize: 1000,
  flushInterval: 1000,
  maxBufferSize: 100000,
});

// Only log actionable events in production
// Debug events sampled at 0.1%
client.info('game-server', 'match_started', {
  matchId,
  players: playerIds.length,
  map: mapName,
  mode: gameMode,
});

Results:

  • Sustained 500k events/sec during peak
  • P99 ingestion latency: 200ms
  • Storage: 2.1 TB/day → 350 GB/day compressed
  • Monthly infrastructure cost: ~$800

Performance Tuning Checklist

SDK Layer

  • Batch size increased to 500+ for high volume
  • Flush interval reduced to 1-2 seconds
  • Compression enabled for network efficiency
  • Log level filtering applied (no debug in production)
  • Sampling configured for high-frequency events
  • Graceful shutdown with flush on SIGTERM

Transport Layer

  • Kafka deployed with replication factor 3
  • Partitions sized for parallelism (12+ for high throughput)
  • LZ4 compression on Kafka topics
  • Consumer group with enough consumers to match partitions
  • Consumer lag monitoring configured

Storage Layer

  • TimescaleDB hypertables with appropriate chunk intervals
  • Compression policies for chunks older than 1 day
  • Retention policies automated (don’t rely on manual cleanup)
  • Disk provisioned with 2x expected peak capacity
  • IOPS sufficient for write workload

Monitoring

  • Consumer lag alerts (>10,000 = warning, >100,000 = critical)
  • Ingestion throughput dashboard
  • Disk usage alerts at 70% and 85%
  • Query latency monitoring for degradation
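
The consumer lag thresholds in the list above map directly onto an alert rule. A minimal sketch of that classification, assuming lag is measured in messages across the consumer group; how you obtain the lag value (Kafka admin API, exporter, or metrics system) is left out, and the function name is illustrative:

```typescript
type Severity = 'ok' | 'warning' | 'critical';

// Thresholds taken from the checklist: >10,000 = warning, >100,000 = critical.
function lagSeverity(lagMessages: number): Severity {
  if (lagMessages > 100_000) return 'critical';
  if (lagMessages > 10_000) return 'warning';
  return 'ok';
}

console.log(lagSeverity(25_000));
```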

Common Pitfalls

1. “We’ll just log everything”

At 100,000 events/sec, “everything” means 4.3 TB/day. Storage costs dominate.

Solution: Define what’s actionable. Sample routine events. Always log errors, alerts, and security events at full fidelity.

2. “Our SDK handles backpressure”

SDKs buffer in memory. If LogTide or Kafka is down for 5 minutes at 100k events/sec, that’s 30 million events in memory — potentially GBs of RAM.

Solution: Set maxBufferSize limits. Accept that during extended outages, some logs may be dropped. Log the drop count itself.
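
A bounded buffer that counts what it discards could look like the sketch below. This is an illustration of the idea, not the LogTide SDK's drop policy: the `BoundedBuffer` class is hypothetical, and drop-oldest is just one possible choice (drop-newest is equally valid if older events matter more).

```typescript
// Illustrative bounded buffer: NOT the LogTide SDK's implementation.
class BoundedBuffer<T> {
  private items: T[] = [];
  private dropped = 0;
  private readonly maxSize: number;

  constructor(maxSize: number) {
    this.maxSize = maxSize;
  }

  push(item: T): void {
    if (this.items.length >= this.maxSize) {
      this.items.shift(); // drop-oldest policy: keep the most recent events
      this.dropped++;
    }
    this.items.push(item);
  }

  // Return everything buffered plus the number of events lost since the last drain.
  drain(): { items: T[]; dropped: number } {
    const result = { items: this.items, dropped: this.dropped };
    this.items = [];
    this.dropped = 0;
    return result;
  }
}

// Usage: after an outage, report the loss as a log event of its own.
const buffer = new BoundedBuffer<string>(2);
['a', 'b', 'c'].forEach((e) => buffer.push(e));
const { items, dropped } = buffer.drain();
console.log(`flushing ${items.length} events, ${dropped} dropped during outage`);
```

`Array.prototype.shift` is O(n), so for buffers in the tens of thousands a ring buffer would be the better structure; the array keeps the sketch short.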

3. “We’ll tune performance later”

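At high volume, defaults fail fast. If batches go out only on the timer, a 5-second flush interval with batch size 100 ships one batch of 100 events every 5 seconds: just 20 events/sec per client. Even when full batches flush early, 100-event batches mean 1,000 requests/sec at 100,000 events/sec.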

Solution: Tune batch size and flush interval before going to production at scale.

4. “Same retention for everything”

Keeping debug logs for a year at $0.10/GB/month is wasteful.

Solution: Tiered retention. 7 days hot for debugging, 30 days warm for analysis, 365 days cold for compliance.

Cost Comparison at Scale

100 GB/day ingestion:

Solution                 Monthly Cost
Datadog                  $3,000+
Splunk Cloud             $4,500+
Azure Monitor            $5,880+
AWS CloudWatch           $1,500+
LogTide (self-hosted)    $400-800

1 TB/day ingestion:

Solution                 Monthly Cost
Datadog                  $30,000+
Splunk Cloud             $45,000+
Azure Monitor            $58,800+
LogTide (self-hosted)    $2,000-4,000

Frequently Asked Questions

Can LogTide handle high log volume at scale?

Yes. LogTide supports sustained ingestion of 1 million or more events per second using a Kafka buffer layer, horizontally scaled ingestion workers, and TimescaleDB hypertables for storage. A real-world gaming platform example in the documentation demonstrates 500,000 events per second with a p99 ingestion latency of 200ms.

How does LogTide prevent log loss during traffic bursts?

LogTide recommends placing Apache Kafka between your application SDKs and the LogTide ingesters. Kafka absorbs burst traffic so that temporary slowdowns in ingestion do not cause backpressure or dropped events in your application. The SDK also provides a configurable in-memory buffer with a maxBufferSize limit and drop policy to protect application memory.

How do I reduce storage costs at high log volume with LogTide?

LogTide supports tiered retention: recent data (hot tier) stays in TimescaleDB on fast SSD, older data moves to compressed chunks (warm tier), and archival data is offloaded to object storage such as S3 (cold tier). At 100 GB/day this tiered approach costs roughly $123 per month in storage versus approximately $1,000 per month for keeping everything in hot storage.

What SDK settings should I tune for high-throughput logging?

For high-volume workloads, increase batchSize to 500 or more, reduce flushInterval to 1-2 seconds, and raise maxBufferSize to accommodate burst traffic. Filtering out debug logs in production and sampling high-frequency successful requests at a low rate (such as 1%) are also strongly recommended to control storage growth.