LogTide

Async Buffer

The async buffer is an optional decorator that sits between the ingestion API and the storage engine. When enabled, logs are accepted into a short-lived queue and flushed to storage in the background by a consumer pool, so a slow storage write does not block the producer. It is disabled by default and is not appropriate for every storage engine.

Read this before enabling
The buffer is only a win on TimescaleDB. On ClickHouse and MongoDB our load tests measured a regression under saturation (the buffer fills faster than the consumer pool can drain, the breaker trips, and you pay overhead for nothing). If you are running ClickHouse or MongoDB, leave the buffer off and tune storage-side batching instead.

Overview

Every ingest call normally runs:

POST /api/v1/ingest -> validate -> reservoir.ingest(logs) -> storage.write(logs) -> 200 OK

With the buffer enabled the chain becomes:

POST /api/v1/ingest -> validate -> buffer.enqueue(logs) -> 200 OK
                                   |
                                   v
                         flush consumer pool (async)
                                   |
                                   v
                           storage.write(logs)

The producer returns as soon as the records are in the buffer. The consumer pool runs in the same backend process and drains shards concurrently. The buffer itself is pluggable: either a pure in-memory queue (single instance, not crash-safe) or Redis Streams (multi-instance, crash-safe, durable).

What the buffer gives you

Lower producer latency under bursty load: the producer pays only the enqueue cost, provided the consumer pool drains fast enough that the buffer does not fill.

Smoothed write pressure on the storage engine: the consumer pool batches records per shard and flushes on size or age.

Backpressure via circuit breaker: when the buffer fills beyond a configured threshold, new ingests bypass the buffer and go straight to storage (sync).

Dead-letter queue on Redis transport: batches that fail after max retry attempts are parked on a side stream for manual inspection.
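The retry-then-DLQ behavior above can be sketched as follows. This is an illustrative sketch, not LogTide's actual code: `flushBatch` and `sendToDlq` are hypothetical stand-ins for the flush and DLQ calls, and the jitter strategy is an assumption (the docs only say "jittered exponential backoff").

```typescript
// Sketch: retry a flush with jittered exponential backoff, then park the
// batch on the DLQ after the final attempt. Names are illustrative.
type Batch = { shard: number; records: unknown[] };

const MAX_RETRY_ATTEMPTS = 3; // RESERVOIR_BUFFER_MAX_RETRY_ATTEMPTS
const BASE_DELAY_MS = 100;    // RESERVOIR_BUFFER_BASE_DELAY_MS

// Jittered exponential backoff: base * 2^attempt with "equal jitter".
function backoffMs(attempt: number): number {
  const exp = BASE_DELAY_MS * 2 ** attempt;
  return exp / 2 + Math.random() * (exp / 2);
}

async function flushWithRetry(
  batch: Batch,
  flushBatch: (b: Batch) => Promise<void>,
  sendToDlq: (b: Batch, err: unknown) => Promise<void>,
): Promise<void> {
  for (let attempt = 0; attempt < MAX_RETRY_ATTEMPTS; attempt++) {
    try {
      await flushBatch(batch);
      return; // success: batch can be acked
    } catch (err) {
      if (attempt === MAX_RETRY_ATTEMPTS - 1) {
        await sendToDlq(batch, err); // parked for manual inspection
        return;
      }
      await new Promise((r) => setTimeout(r, backoffMs(attempt)));
    }
  }
}
```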

What the buffer does NOT give you

It does not remove the storage bottleneck. If your storage engine cannot keep up with the ingest rate, the buffer only delays the inevitable. You will see the breaker trip and latency spike regardless.

It does not improve query performance. Queries always go to the underlying storage directly.

It does not provide exactly-once delivery. When a flush fails and is retried, records can reach storage more than once (at-least-once semantics). Observability workloads tolerate this; strict accounting workloads do not.

When to Enable

Good fit
  • You are running TimescaleDB and you see producer latency spikes during bursts.
  • You need crash-safe queuing (Redis transport) because producers and consumers may run on different processes or hosts.
  • You want to decouple the ingest API from storage maintenance windows (small windows only, not sustained downtime).
  • You want a circuit breaker that falls back to sync ingestion automatically when the buffer is overloaded.

When NOT to Enable

Bad fit
  • You are running ClickHouse or MongoDB. Both showed worse p95 with the buffer enabled under sustained 100 req/s. The bottleneck is the flush side (HTTP or mongo driver round-trip), which the buffer cannot hide.
  • Your ingestion rate is low enough that the sync path already meets your latency target.
  • You require strict exactly-once semantics. The buffer delivers at-least-once: records can be written more than once when a flush is retried.
  • You cannot tolerate losing in-flight records if the backend is killed without graceful shutdown (in-memory transport only; Redis transport is crash-safe).

Transports

In-Memory Transport

A per-process queue partitioned by shard. Producers push into a shard-specific list and signal waiting consumers; consumers claim entries into an inflight set until they are acked or reclaim-expired. There is no persistence: on crash or kill the contents are lost.
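The claim/ack/reclaim cycle of a single shard can be sketched as a small class. This is a minimal, dependency-free sketch under assumed names (`ShardQueue`, `claim`, `reclaim`); the real transport adds consumer signaling and reclaim timers that are omitted here.

```typescript
// Sketch of one in-memory shard: producers enqueue, consumers claim
// entries into an inflight set, and un-acked entries can be reclaimed.
type Entry<T> = { id: number; record: T };

class ShardQueue<T> {
  private queue: Entry<T>[] = [];
  private inflight = new Map<number, Entry<T>>();
  private nextId = 0;

  enqueue(record: T): void {
    this.queue.push({ id: this.nextId++, record });
  }

  // Claim up to `max` entries; they stay inflight until acked.
  claim(max: number): Entry<T>[] {
    const batch = this.queue.splice(0, max);
    for (const e of batch) this.inflight.set(e.id, e);
    return batch;
  }

  ack(ids: number[]): void {
    for (const id of ids) this.inflight.delete(id);
  }

  // Put un-acked entries back at the head of the queue, as the real
  // transport does when a claim's reclaim timer expires.
  reclaim(ids: number[]): void {
    const back: Entry<T>[] = [];
    for (const id of ids) {
      const e = this.inflight.get(id);
      if (e) { back.push(e); this.inflight.delete(id); }
    }
    this.queue.unshift(...back);
  }

  get pending(): number { return this.queue.length + this.inflight.size; }
}
```

Nothing here survives a process crash: both `queue` and `inflight` live on the heap, which is exactly the durability gap the Redis transport closes.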

Use it when: you run a single backend instance, you can tolerate losing in-flight records on crash, and you want the lowest possible enqueue latency.

Avoid it when: you run more than one backend replica, or you cannot afford to lose the in-memory queue contents.

Redis Streams Transport

Each shard is a Redis Stream (XADD / XREADGROUP). Consumer groups track pending entries, reclaim stale deliveries via XAUTOCLAIM, and send permanently-failed batches to a DLQ stream. Atomic nack uses MULTI / EXEC so DLQ writes and acks either both land or neither does.

Use it when: you run more than one backend instance, you want crash-safety, or you want to be able to replay the DLQ later.

Avoid it when: adding a Redis dependency is not an option. The rest of LogTide can run without Redis (BullMQ falls back), but this transport cannot.

Configuration

All settings are environment variables. The buffer is off by default; setting the enable flag to true is the only change needed to activate it.

| Variable | Default | Notes |
|---|---|---|
| RESERVOIR_BUFFER_ENABLED | false | Master switch. Set to true to enable. |
| RESERVOIR_BUFFER_TRANSPORT | memory | memory, redis, or passthrough (tests only). |
| RESERVOIR_BUFFER_SHARDS | 8 | Number of shards and flush-consumer tasks. |
| RESERVOIR_BUFFER_MAX_BATCH_SIZE | 500 | Max records a consumer claims per dequeue. |
| RESERVOIR_BUFFER_MAX_BATCH_AGE_MS | 1000 | Longest a consumer waits for records before returning empty. |
| RESERVOIR_BUFFER_GRACEFUL_SHUTDOWN_MS | 10000 | Drain timeout on SIGTERM / SIGINT. |
| RESERVOIR_BUFFER_PENDING_THRESHOLD | 10000 | Opens the circuit breaker when the pending record count crosses this. |
| RESERVOIR_BUFFER_FAILURE_THRESHOLD | 20 | Flush failures in the rolling window that open the breaker. |
| RESERVOIR_BUFFER_COOLDOWN_MS | 30000 | Time the breaker stays open before probing half-open. |
| RESERVOIR_BUFFER_MAX_RETRY_ATTEMPTS | 3 | Attempts before a batch is sent to the DLQ. |
| RESERVOIR_BUFFER_BASE_DELAY_MS | 100 | Base for jittered exponential backoff between retries. |
| RESERVOIR_BUFFER_STREAM_PREFIX | logtide:buffer | Redis Streams key prefix (one stream per shard). |
| REDIS_URL | inherited from app | Required when RESERVOIR_BUFFER_TRANSPORT=redis. Missing value falls back to memory with a warning. |
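A typical environment fragment for a multi-instance deployment might look like the following. All values besides the enable flag, the transport, and REDIS_URL are the defaults from the table and could be omitted; the REDIS_URL value is a placeholder for your own Redis endpoint.

```shell
# Enable the buffer with the crash-safe Redis Streams transport.
RESERVOIR_BUFFER_ENABLED=true
RESERVOIR_BUFFER_TRANSPORT=redis
RESERVOIR_BUFFER_SHARDS=8
RESERVOIR_BUFFER_MAX_BATCH_SIZE=500
RESERVOIR_BUFFER_PENDING_THRESHOLD=10000
# Required for the redis transport; missing value falls back to memory.
REDIS_URL=redis://localhost:6379
```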

How It Works

The buffer is implemented as a decorator that wraps the standard Reservoir instance and implements the same public interface (IReservoir). Read operations pass through to the underlying reservoir unchanged. Write operations (ingest, ingestSpans, ingestMetrics) enqueue into the transport and return immediately.

Records are hashed by projectId into one of N shards (default 8) using murmur3. Each shard has a dedicated flush consumer task. Consumers dequeue a batch of records, group them by kind (log / span / metric), and call the three ingest methods on the underlying storage in parallel. Per-kind success is tracked separately so a partial failure only marks the failed kinds for retry.
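Shard routing and per-kind grouping can be sketched as below. The real implementation hashes projectId with murmur3; a simple FNV-1a hash stands in here so the example stays dependency-free, and the `Kind`/`shardFor`/`groupByKind` names are illustrative.

```typescript
// Sketch: route a record to one of N shards, and group a dequeued batch
// by kind so each kind flushes (and fails) independently.
type Kind = "log" | "span" | "metric";
type BufferedRecord = { projectId: string; kind: Kind };

const SHARDS = 8; // RESERVOIR_BUFFER_SHARDS

// FNV-1a stand-in for the murmur3 hash used in the real implementation.
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

function shardFor(projectId: string): number {
  return fnv1a(projectId) % SHARDS;
}

// Per-kind grouping: a partial failure only marks the failed kinds
// for retry, leaving successfully flushed kinds acked.
function groupByKind(batch: BufferedRecord[]): Map<Kind, BufferedRecord[]> {
  const groups = new Map<Kind, BufferedRecord[]>();
  for (const r of batch) {
    const g = groups.get(r.kind) ?? [];
    g.push(r);
    groups.set(r.kind, g);
  }
  return groups;
}
```

Hashing by projectId means all records of one project land on one shard, so per-project ordering is preserved within a shard while different projects flush in parallel.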

Circuit Breaker

Two conditions open the breaker:

  • Pending record count crosses RESERVOIR_BUFFER_PENDING_THRESHOLD.
  • Flush failures within the rolling window cross RESERVOIR_BUFFER_FAILURE_THRESHOLD.

While open, new ingest() calls bypass the buffer and go straight to storage (synchronous path). This trades lower aggregate throughput for keeping the ingest API responsive and giving the consumer pool time to drain the backlog. After RESERVOIR_BUFFER_COOLDOWN_MS the breaker half-opens and the next successful flush closes it.
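The breaker's lifecycle (closed, open on either threshold, half-open after the cooldown, closed again on a successful flush) can be sketched as a small state machine. The class and method names are illustrative, and the half-open probing detail is an assumption based on the description above.

```typescript
// Sketch of the buffer's circuit breaker state machine.
type BreakerState = "closed" | "open" | "half-open";

class BufferBreaker {
  private state: BreakerState = "closed";
  private openedAt = 0;
  private failures = 0;

  constructor(
    private pendingThreshold = 10_000, // RESERVOIR_BUFFER_PENDING_THRESHOLD
    private failureThreshold = 20,     // RESERVOIR_BUFFER_FAILURE_THRESHOLD
    private cooldownMs = 30_000,       // RESERVOIR_BUFFER_COOLDOWN_MS
  ) {}

  // Called on each ingest: true means "bypass the buffer, go sync".
  shouldBypass(pending: number, now: number): boolean {
    if (this.state === "open" && now - this.openedAt >= this.cooldownMs) {
      this.state = "half-open"; // let the next flush probe storage
    }
    if (this.state === "closed" && pending > this.pendingThreshold) {
      this.trip(now);
    }
    return this.state === "open";
  }

  onFlushFailure(now: number): void {
    this.failures++;
    if (this.failures >= this.failureThreshold || this.state === "half-open") {
      this.trip(now); // a failed probe re-opens immediately
    }
  }

  onFlushSuccess(): void {
    this.failures = 0;
    if (this.state === "half-open") this.state = "closed";
  }

  private trip(now: number): void {
    this.state = "open";
    this.openedAt = now;
  }
}
```

Passing `now` in explicitly keeps the state machine deterministic and easy to test with a fake clock.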

Graceful Shutdown

On SIGTERM / SIGINT the backend runs shutdownReservoir() after Fastify has stopped accepting new requests. The call stops the consumer pool and waits up to RESERVOIR_BUFFER_GRACEFUL_SHUTDOWN_MS for pending records to drain. Any live Redis client is closed with an explicit quit() so the connection shuts down cleanly instead of being torn down by the process exit.

Records that cannot drain within the deadline remain in the transport. For the Redis transport they are durable and will be picked up by the next backend instance that joins the consumer group. For the in-memory transport they are lost, which is the main reason to prefer Redis for any deployment that values durability.
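The drain-until-deadline behavior can be sketched as below. `pendingCount` is a hypothetical stand-in for the transport's pending gauge; the real shutdown path also stops consumers from claiming new work, which is omitted here.

```typescript
// Sketch: wait for the buffer to drain, giving up at the deadline.
async function drainWithDeadline(
  pendingCount: () => number,
  deadlineMs: number, // RESERVOIR_BUFFER_GRACEFUL_SHUTDOWN_MS
  pollMs = 50,
): Promise<boolean> {
  const start = Date.now();
  while (pendingCount() > 0) {
    if (Date.now() - start >= deadlineMs) {
      return false; // records remain in the transport (durable on Redis)
    }
    await new Promise((r) => setTimeout(r, pollMs));
  }
  return true; // fully drained before the deadline
}
```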

Benchmark Results

Methodology: k6, 100 req/s constant-arrival-rate for 3 minutes, 10 logs per batch, one backend container, storage container on the same host. Zero errors in all 9 runs.

| Engine | Buffer OFF | Buffer ON (memory) | Buffer ON (redis) | Recommendation |
|---|---|---|---|---|
| TimescaleDB | p95 2471 ms | p95 24 ms | p95 623 ms | memory (single-instance) or redis (multi-instance) |
| ClickHouse | p95 3305 ms | p95 4129 ms | p95 6078 ms | keep buffer OFF |
| MongoDB | p95 675 ms | p95 844 ms | p95 756 ms | keep buffer OFF |

The TimescaleDB numbers assume the in-process pg connection pool is shared between the backend and the buffer flush consumers. ClickHouse and MongoDB both require out-of-process round-trips on the flush side, which is the reason the buffer cannot hide their latency under saturation.

Metrics

When the buffer is enabled, the backend exposes the following Prometheus-compatible metrics. All metrics are labelled by record kind (log / span / metric) and shard where relevant.

| Metric | Meaning |
|---|---|
| reservoir_buffer_enqueued_total | Records accepted into the buffer. |
| reservoir_buffer_bypass_total | Records that bypassed the buffer (breaker open). |
| reservoir_buffer_flush_success_total | Records successfully flushed to storage. |
| reservoir_buffer_flush_failure_total | Flush attempts that raised an error. |
| reservoir_buffer_flush_duration_ms | Histogram of flush durations. |
| reservoir_buffer_dlq_total | Records moved to the DLQ after max retry attempts. |
| reservoir_buffer_breaker_state | 0 = closed, 1 = open, 2 = half-open. |