# Async Buffer
The async buffer is an optional decorator that sits between the ingestion API and the storage engine. When enabled, logs are accepted into a short-lived queue and flushed to storage in the background by a consumer pool, so a slow storage write does not block the producer. It is disabled by default and not appropriate for every storage engine.
## Overview

Every ingest call normally runs:

```
POST /api/v1/ingest -> validate -> reservoir.ingest(logs) -> storage.write(logs) -> 200 OK
```

With the buffer enabled the chain becomes:

```
POST /api/v1/ingest -> validate -> buffer.enqueue(logs) -> 200 OK
                                         |
                                         v
                              flush consumer pool (async)
                                         |
                                         v
                                 storage.write(logs)
```

The producer returns as soon as the records are in the buffer. The consumer pool runs in the same backend process and drains shards concurrently. The buffer itself is pluggable: either a pure in-memory queue (single instance, not crash-safe) or Redis Streams (multi-instance, crash-safe, durable).
Enabling the buffer buys you:

- Lower producer latency under bursty load, provided the flush side keeps pace so enqueue stays fast between batches.
- Smoothed write pressure on the storage engine: the consumer pool batches records per shard and flushes on size or age.
- Backpressure via circuit breaker: when the buffer fills beyond a configured threshold, new ingests bypass the buffer and go straight to storage (sync).
- Dead-letter queue on the Redis transport: batches that fail after max retry attempts are parked on a side stream for manual inspection.
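The size-or-age flush decision mentioned above can be sketched as a single predicate. The default thresholds mirror RESERVOIR_BUFFER_MAX_BATCH_SIZE and RESERVOIR_BUFFER_MAX_BATCH_AGE_MS; the function itself is illustrative, not the actual implementation.

```typescript
// Flush when the batch is full, or when the oldest pending record
// has waited longer than the configured age.
function shouldFlush(
  pending: number,
  oldestEnqueuedAt: number,
  now: number,
  maxBatchSize = 500,
  maxBatchAgeMs = 1000,
): boolean {
  if (pending === 0) return false;
  if (pending >= maxBatchSize) return true;        // size trigger
  return now - oldestEnqueuedAt >= maxBatchAgeMs;  // age trigger
}
```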
What it does not do:

- It does not remove the storage bottleneck. If your storage engine cannot keep up with the ingest rate, the buffer only delays the inevitable. You will see the breaker trip and latency spike regardless.
- It does not improve query performance. Queries always go directly to the underlying storage.
- It does not provide exactly-once delivery. On a partial flush failure, records in the failed kinds are retried, so records that were already written before the error may be delivered again. Observability workloads tolerate this; strict accounting workloads do not.
## When to Enable
- You are running TimescaleDB and you see producer latency spikes during bursts.
- You need crash-safe queuing (Redis transport) because producers and consumers may run on different processes or hosts.
- You want to decouple the ingest API from storage maintenance windows (small windows only, not sustained downtime).
- You want a circuit breaker that falls back to sync ingestion automatically when the buffer is overloaded.
## When NOT to Enable
- You are running ClickHouse or MongoDB. Both showed worse p95 with the buffer enabled under sustained 100 req/s. The bottleneck is the flush side (HTTP or mongo driver round-trip), which the buffer cannot hide.
- Your ingestion rate is low enough that the sync path already meets your latency target.
- You require strict exactly-once semantics. On a partial flush failure the buffer retries whole per-kind batches, so duplicates are possible.
- You cannot tolerate losing in-flight records when the backend is killed without a graceful shutdown and cannot use the Redis transport (only the in-memory transport loses records; Redis is crash-safe).
## Transports

### In-Memory Transport
A per-process queue partitioned by shard. Producers push into a shard-specific list and signal waiting consumers; consumers claim entries into an inflight set until they are acked or reclaim-expired. There is no persistence: on crash or kill the contents are lost.
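The claim/ack/reclaim lifecycle described above can be sketched as a small per-shard queue. The entry ids and the reclaim deadline here are illustrative; only the shape (ready list, inflight set, reclaim on expiry) comes from the text.

```typescript
// One shard of the in-memory transport: entries move
// ready -> inflight (claim) -> gone (ack), or back to ready on expiry.
type Entry = { id: number; payload: string; claimedAt?: number };

class ShardQueue {
  private nextId = 1;
  private ready: Entry[] = [];
  private inflight = new Map<number, Entry>();

  push(payload: string): void {
    this.ready.push({ id: this.nextId++, payload });
  }

  // Consumer claims up to `max` entries; they stay inflight until acked.
  claim(max: number, now: number): Entry[] {
    const batch = this.ready.splice(0, max);
    for (const e of batch) {
      e.claimedAt = now;
      this.inflight.set(e.id, e);
    }
    return batch;
  }

  ack(ids: number[]): void {
    for (const id of ids) this.inflight.delete(id);
  }

  // Entries claimed longer than `timeoutMs` ago go back to ready,
  // so a crashed consumer's batch is eventually re-delivered.
  reclaimExpired(now: number, timeoutMs: number): number {
    let reclaimed = 0;
    for (const [id, e] of this.inflight) {
      if (now - (e.claimedAt ?? now) >= timeoutMs) {
        this.inflight.delete(id);
        this.ready.push(e);
        reclaimed++;
      }
    }
    return reclaimed;
  }
}
```

Note that everything lives in process memory, which is exactly why a crash loses both the ready list and the inflight set.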
Use it when: you run a single backend instance, you can tolerate losing in-flight records on crash, and you want the lowest possible enqueue latency.
Avoid it when: you run more than one backend replica, or you cannot afford to lose the in-memory queue contents.
### Redis Streams Transport
Each shard is a Redis Stream (XADD / XREADGROUP). Consumer groups track pending entries, reclaim stale deliveries via XAUTOCLAIM, and send permanently-failed batches to a DLQ stream. Atomic nack uses MULTI / EXEC so DLQ writes and acks either both land or neither does.
Use it when: you run more than one backend instance, you want crash-safety, or you want to be able to replay the DLQ later.
Avoid it when: adding a Redis dependency is not an option. The rest of LogTide can run without Redis (BullMQ falls back), but this transport cannot.
## Configuration

All settings are environment variables. The buffer is off by default; setting RESERVOIR_BUFFER_ENABLED=true is the only change needed to activate it.
| Variable | Default | Notes |
|---|---|---|
| RESERVOIR_BUFFER_ENABLED | false | Master switch. Set to true to enable. |
| RESERVOIR_BUFFER_TRANSPORT | memory | memory, redis, or passthrough (tests only). |
| RESERVOIR_BUFFER_SHARDS | 8 | Number of shards and flush-consumer tasks. |
| RESERVOIR_BUFFER_MAX_BATCH_SIZE | 500 | Max records a consumer claims per dequeue. |
| RESERVOIR_BUFFER_MAX_BATCH_AGE_MS | 1000 | Longest a consumer waits for records before returning empty. |
| RESERVOIR_BUFFER_GRACEFUL_SHUTDOWN_MS | 10000 | Drain timeout on SIGTERM / SIGINT. |
| RESERVOIR_BUFFER_PENDING_THRESHOLD | 10000 | Opens the circuit breaker when the pending record count crosses this. |
| RESERVOIR_BUFFER_FAILURE_THRESHOLD | 20 | Flush failures in the rolling window that open the breaker. |
| RESERVOIR_BUFFER_COOLDOWN_MS | 30000 | Time the breaker stays open before probing half-open. |
| RESERVOIR_BUFFER_MAX_RETRY_ATTEMPTS | 3 | Attempts before a batch is sent to the DLQ. |
| RESERVOIR_BUFFER_BASE_DELAY_MS | 100 | Base for jittered exponential backoff between retries. |
| RESERVOIR_BUFFER_STREAM_PREFIX | logtide:buffer | Redis Streams key prefix (one stream per shard). |
| REDIS_URL | inherited from app | Required when RESERVOIR_BUFFER_TRANSPORT=redis. A missing value falls back to memory with a warning. |
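A typical multi-instance setup then looks like the fragment below. The variable names are from the table above; the values (beyond the defaults) are illustrative.

```shell
# Enable the buffer with the crash-safe Redis transport.
export RESERVOIR_BUFFER_ENABLED=true
export RESERVOIR_BUFFER_TRANSPORT=redis
export REDIS_URL=redis://localhost:6379

# Optional tuning; these match the defaults and can be omitted.
export RESERVOIR_BUFFER_SHARDS=8
export RESERVOIR_BUFFER_MAX_BATCH_SIZE=500
export RESERVOIR_BUFFER_MAX_BATCH_AGE_MS=1000
```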
## How It Works

The buffer is implemented as a decorator that wraps the standard Reservoir instance and implements the same public interface (IReservoir). Read operations pass through to the underlying reservoir unchanged. Write operations (ingest, ingestSpans, ingestMetrics) enqueue into the transport and return immediately.

Records are hashed by projectId into one of N shards (default 8) using murmur3. Each shard has a dedicated flush consumer task. Consumers dequeue a batch of records, group them by kind (log / span / metric), and call the three ingest methods on the underlying storage in parallel. Per-kind success is tracked separately, so a partial failure marks only the failed kinds for retry.
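The routing and grouping steps can be sketched as two pure functions. The real code hashes with murmur3; a simple FNV-1a stand-in is used here so the sketch stays self-contained, and the record shape is illustrative.

```typescript
type Kind = "log" | "span" | "metric";
interface BufRecord { projectId: string; kind: Kind }

// Hash projectId into one of N shards. FNV-1a stand-in for murmur3:
// the only property that matters here is deterministic distribution.
function shardFor(projectId: string, shards = 8): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < projectId.length; i++) {
    h ^= projectId.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h % shards;
}

// Split a dequeued batch by kind so each kind's ingest method
// can be called (and retried) independently.
function groupByKind(batch: BufRecord[]): Map<Kind, BufRecord[]> {
  const groups = new Map<Kind, BufRecord[]>();
  for (const r of batch) {
    const group = groups.get(r.kind) ?? [];
    group.push(r);
    groups.set(r.kind, group);
  }
  return groups;
}
```

Hashing by projectId keeps all of a project's records on one shard, which preserves per-project ordering through the flush path.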
### Circuit Breaker

Two conditions open the breaker:

- The pending record count crosses RESERVOIR_BUFFER_PENDING_THRESHOLD.
- Consecutive flush failures within the rolling window cross RESERVOIR_BUFFER_FAILURE_THRESHOLD.

While open, new ingest() calls bypass the buffer and go straight to storage (the synchronous path). This trades lower aggregate throughput for keeping the ingest API responsive and giving the consumer pool time to drain the backlog. After RESERVOIR_BUFFER_COOLDOWN_MS the breaker half-opens, and the next successful flush closes it.
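The state machine can be sketched as below. The closed/open/half-open transitions follow the text; thresholds mirror the RESERVOIR_BUFFER_* settings, and the method names are illustrative. Timestamps are passed in explicitly so the logic is easy to follow and test.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class Breaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private pendingThreshold = 10000,  // RESERVOIR_BUFFER_PENDING_THRESHOLD
    private failureThreshold = 20,     // RESERVOIR_BUFFER_FAILURE_THRESHOLD
    private cooldownMs = 30000,        // RESERVOIR_BUFFER_COOLDOWN_MS
  ) {}

  // Called on the ingest path: true means bypass the buffer (sync write).
  shouldBypass(pending: number, now: number): boolean {
    if (this.state === "open" && now - this.openedAt >= this.cooldownMs) {
      this.state = "half-open";  // cooldown elapsed: probe with the next flush
    }
    if (this.state === "closed" && pending >= this.pendingThreshold) {
      this.trip(now);
    }
    return this.state === "open";
  }

  // Called after every flush attempt by the consumer pool.
  recordFlush(ok: boolean, now: number): void {
    if (ok) {
      if (this.state === "half-open") this.state = "closed";
      this.failures = 0;
      return;
    }
    if (++this.failures >= this.failureThreshold) this.trip(now);
  }

  private trip(now: number): void {
    this.state = "open";
    this.openedAt = now;
    this.failures = 0;
  }

  current(): BreakerState { return this.state; }
}
```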
### Graceful Shutdown

On SIGTERM / SIGINT the backend runs shutdownReservoir() after Fastify has stopped accepting new requests. The call stops the consumer pool and waits up to RESERVOIR_BUFFER_GRACEFUL_SHUTDOWN_MS for pending records to drain. Any in-flight Redis client is explicitly quit()-ed so the connection closes cleanly instead of being torn down by the process exit.

Records that cannot drain within the deadline remain in the transport. With the Redis transport they are durable and will be picked up by the next backend instance that joins the consumer group. With the in-memory transport they are lost, which is the main reason to prefer Redis for any deployment that values durability.
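The drain-with-deadline step can be sketched as a bounded poll. The Transport interface and pendingCount method are illustrative; only the deadline semantics come from the text.

```typescript
interface Transport { pendingCount(): number }

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Poll until the transport is empty or the deadline passes. Returns true
// when everything drained within the window; false means records remain
// in the transport (durable on Redis, lost on in-memory).
async function drainWithDeadline(t: Transport, deadlineMs: number): Promise<boolean> {
  const start = Date.now();
  while (t.pendingCount() > 0) {
    if (Date.now() - start >= deadlineMs) return false;
    await sleep(10);
  }
  return true;
}
```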
## Benchmark Results
Methodology: k6, 100 req/s constant-arrival-rate for 3 minutes, 10 logs per batch, one backend container, storage container on the same host. Zero errors in all 9 runs.
| Engine | Buffer OFF | Buffer ON (memory) | Buffer ON (redis) | Recommendation |
|---|---|---|---|---|
| TimescaleDB | p95 2471 ms | p95 24 ms | p95 623 ms | memory (single-instance) or redis (multi-instance) |
| ClickHouse | p95 3305 ms | p95 4129 ms | p95 6078 ms | keep buffer OFF |
| MongoDB | p95 675 ms | p95 844 ms | p95 756 ms | keep buffer OFF |
The TimescaleDB numbers reflect a setup where the in-process pg connection pool is shared between the backend and the buffer flush consumers. ClickHouse and MongoDB both require out-of-process round-trips on the flush side, which is why the buffer cannot hide their latency under saturation.
## Metrics
When the buffer is enabled, the backend exposes the following Prometheus-compatible metrics. All metrics
are labelled by record kind (log / span / metric) and shard where relevant.
| Metric | Meaning |
|---|---|
| reservoir_buffer_enqueued_total | Records accepted into the buffer. |
| reservoir_buffer_bypass_total | Records that bypassed the buffer (breaker open). |
| reservoir_buffer_flush_success_total | Records successfully flushed to storage. |
| reservoir_buffer_flush_failure_total | Flush attempts that raised an error. |
| reservoir_buffer_flush_duration_ms | Histogram of flush durations. |
| reservoir_buffer_dlq_total | Records moved to the DLQ after max retry attempts. |
| reservoir_buffer_breaker_state | 0 = closed, 1 = open, 2 = half-open. |
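In Prometheus text exposition format, a scrape of a healthy instance might look like the following (the label values and sample numbers are illustrative):

```
reservoir_buffer_enqueued_total{kind="log",shard="3"} 18234
reservoir_buffer_flush_success_total{kind="log",shard="3"} 18100
reservoir_buffer_breaker_state 0
```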