Building a Sub-Second Payment Pipeline with Node.js and Redis

January 18, 2026 · 14 min read

Engineering · By BetFlow Engineering Team

When a bettor taps “Deposit” seconds before kickoff, they expect the funds to appear instantly. Behind that simple interaction lies a pipeline that must validate the request, check compliance rules, select an acquirer, tokenize payment credentials, submit the authorization, process the response, update balances, and notify the client, all in under one second. This is the story of how we built that pipeline using Node.js, Redis, and a relentless focus on eliminating every unnecessary millisecond.

Our deposit processing system handles over 10,000 concurrent transactions during peak events like the Super Bowl or Champions League finals, with a p99 latency of 187ms. Getting here required rethinking every layer of our stack, from how we manage database connections to how we structure our event loop. This article walks through the architecture, the optimizations, and the hard lessons we learned along the way.

187ms p99 end-to-end latency · 10K+ concurrent transactions · 99.97% pipeline uptime

Architecture Overview

The deposit pipeline is structured as a series of discrete stages connected by an in-process event bus. We evaluated message brokers like RabbitMQ and Kafka for inter-stage communication, but the overhead of network serialization added 15-30ms per hop, which was unacceptable for our latency budget. Instead, each transaction flows through the pipeline within a single Node.js process, with Redis serving as the shared state layer for cross-process coordination.

The pipeline consists of seven stages: request validation, compliance checking, acquirer selection, payment tokenization, authorization submission, response processing, and balance settlement. Each stage is implemented as an independent module with a consistent interface, which allows us to add, remove, or reorder stages without modifying the pipeline framework itself.

// Pipeline stage interface
interface PipelineStage<TContext> {
  name: string;
  execute(ctx: TContext): Promise<TContext>;
  rollback?(ctx: TContext): Promise<void>;
}

// Pipeline executor with timing and error handling
class PaymentPipeline {
  constructor(
    private readonly stages: PipelineStage<TransactionContext>[],
    private readonly metrics: MetricsCollector
  ) {}

  async process(tx: TransactionContext): Promise<TransactionResult> {
    const completedStages: PipelineStage<TransactionContext>[] = [];

    for (const stage of this.stages) {
      const start = process.hrtime.bigint();
      try {
        tx = await stage.execute(tx);
        completedStages.push(stage);

        const elapsed = Number(process.hrtime.bigint() - start) / 1e6;
        this.metrics.recordStageLatency(stage.name, elapsed);
      } catch (error) {
        // Run compensating rollbacks in reverse order of completion
        for (const completed of completedStages.reverse()) {
          if (completed.rollback) {
            await completed.rollback(tx);
          }
        }
        throw error;
      }
    }

    return tx.result;
  }
}

Each stage has a strict time budget. Request validation gets 10ms. Compliance checking, which hits Redis for cached rules, gets 15ms. Acquirer selection, including the ML model inference, gets 50ms. Authorization submission, which involves a network call to the acquirer, gets the largest budget at 500ms with a circuit breaker that fails fast at 800ms. The remaining stages share the balance. If any stage exceeds its budget, we log a warning and investigate, but we do not kill the transaction since a slow success is better than a fast failure from the bettor's perspective.
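
Conceptually, the budget check sits right next to the stage timing in the executor. Here is a minimal sketch; the stage names, the budget table, and the `checkStageBudget` helper are illustrative rather than our production configuration.

// Hypothetical per-stage budget check; the table mirrors the budgets described above
const STAGE_BUDGETS_MS: Record<string, number> = {
  validation: 10,
  compliance: 15,
  acquirer_selection: 50,
  authorization: 500,
};

function checkStageBudget(stageName: string, elapsedMs: number): void {
  const budget = STAGE_BUDGETS_MS[stageName];
  if (budget !== undefined && elapsedMs > budget) {
    // Warn and investigate later; never abort the transaction for a budget overrun
    console.warn(`stage ${stageName} exceeded budget: ${elapsedMs.toFixed(1)}ms > ${budget}ms`);
  }
}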

Event-Driven Design and the Node.js Event Loop

Node.js gets a bad reputation in high-throughput payment systems, mostly from teams that fight the event loop rather than working with it. Our approach leans fully into Node's asynchronous nature, treating every I/O operation as an opportunity to process other transactions concurrently. The key insight is that a payment pipeline is almost entirely I/O-bound: the CPU-intensive work (validation, hashing, serialization) takes microseconds, while the I/O operations (database queries, Redis lookups, acquirer API calls) take milliseconds.

We use a worker thread pool for the small amount of CPU-intensive work that does exist, primarily cryptographic operations for PCI compliance. Payment card data is encrypted and tokenized in a dedicated worker thread, ensuring that the main event loop is never blocked by crypto operations. This was one of our most impactful optimizations: before moving crypto to worker threads, we saw event loop lag spikes of 50-80ms during peak load. After, the lag stays consistently under 3ms.

// Worker thread pool for CPU-intensive crypto operations
import { Worker } from 'worker_threads';
import { cpus } from 'os';

class CryptoWorkerPool {
  private workers: Worker[] = [];
  private queue: Array<{
    task: CryptoTask;
    resolve: (result: string) => void;
    reject: (error: Error) => void;
  }> = [];
  private available: Worker[] = [];
  // Map each busy worker to the caller waiting on it so results reach the right promise
  private inFlight = new Map<
    Worker,
    { resolve: (result: string) => void; reject: (error: Error) => void }
  >();

  constructor(poolSize = Math.max(cpus().length - 2, 2)) {
    for (let i = 0; i < poolSize; i++) {
      const worker = new Worker('./crypto-worker.js');
      worker.on('message', (result: string) => {
        // Resolve the waiting caller, then return the worker to the pool
        this.inFlight.get(worker)?.resolve(result);
        this.inFlight.delete(worker);
        this.available.push(worker);
        this.processQueue();
      });
      worker.on('error', (error: Error) => {
        this.inFlight.get(worker)?.reject(error);
        this.inFlight.delete(worker);
      });
      this.workers.push(worker);
      this.available.push(worker);
    }
  }

  async tokenize(cardData: CardData): Promise<string> {
    return new Promise((resolve, reject) => {
      this.queue.push({
        task: { operation: 'tokenize', data: cardData },
        resolve,
        reject,
      });
      this.processQueue();
    });
  }

  private processQueue() {
    while (this.queue.length > 0 && this.available.length > 0) {
      const worker = this.available.pop()!;
      const job = this.queue.shift()!;
      this.inFlight.set(worker, { resolve: job.resolve, reject: job.reject });
      worker.postMessage(job.task);
    }
  }
}

We also had to be disciplined about avoiding synchronous operations in the hot path. Early in development, we discovered that JSON serialization of large transaction objects was causing 3-5ms event loop blocks. We solved this by switching to a streaming JSON serializer for outbound acquirer requests and by pre-serializing frequently used objects (like compliance rule sets) during application startup rather than at request time.
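
Here is a sketch of the pre-serialization idea, assuming a hypothetical rule-set loader; the real compliance objects and cache layout are more involved, but the principle is the same: pay the JSON.stringify cost once at startup instead of on every request.

// Illustrative sketch: serialize hot objects once at startup, not on the hot path
const preSerializedRuleSets = new Map<string, string>();

async function warmSerializationCache(
  loadRuleSets: () => Promise<Record<string, object>>
): Promise<void> {
  const ruleSets = await loadRuleSets();
  for (const [name, rules] of Object.entries(ruleSets)) {
    preSerializedRuleSets.set(name, JSON.stringify(rules));
  }
}

function getSerializedRuleSet(name: string): string | undefined {
  return preSerializedRuleSets.get(name);
}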

Redis: Beyond Simple Caching

Redis is the backbone of our pipeline's state management, but we use it for far more than simple key-value caching. Our Redis deployment serves five distinct roles: session state for in-flight transactions, a distributed rate limiter, a real-time feature store for ML model inputs, a pub/sub channel for cross-process coordination, and a sorted set-based priority queue for retry scheduling.

For in-flight transaction state, we use Redis hashes with TTLs that match our maximum transaction lifetime (90 seconds). Each transaction gets a hash containing its current stage, accumulated context, and timing data. If a process crashes mid-transaction, any other process can pick up the state from Redis and either complete or roll back the transaction. This gives us fault tolerance without the complexity of distributed sagas.

// Redis-backed transaction state management
class TransactionStateManager {
  private readonly redis: RedisClient;
  private readonly ttl = 90; // seconds

  async initTransaction(txId: string, context: TransactionContext) {
    const pipeline = this.redis.multi();

    pipeline.hSet(`tx:${txId}`, {
      stage: 'validation',
      createdAt: Date.now().toString(),
      context: JSON.stringify(context),
      status: 'in_progress',
    });
    pipeline.expire(`tx:${txId}`, this.ttl);

    // Add to active transactions sorted set (scored by creation time)
    pipeline.zAdd('active_transactions', {
      score: Date.now(),
      value: txId,
    });

    await pipeline.exec();
  }

  async advanceStage(txId: string, stage: string, context: TransactionContext) {
    await this.redis.hSet(`tx:${txId}`, {
      stage,
      context: JSON.stringify(context),
      [`stage_${stage}_at`]: Date.now().toString(),
    });
  }

  async completeTransaction(txId: string, result: TransactionResult) {
    const pipeline = this.redis.multi();
    pipeline.hSet(`tx:${txId}`, { status: 'completed', result: JSON.stringify(result) });
    pipeline.zRem('active_transactions', txId);
    pipeline.expire(`tx:${txId}`, 300); // Keep for 5 min for debugging
    await pipeline.exec();
  }
}

The rate limiter uses Redis's atomic increment operations with sliding windows. We rate limit at multiple levels: per-player (to prevent rapid-fire deposits that might indicate compromised credentials), per-operator (to enforce contractual volume limits), and per-acquirer (to respect each acquirer's throughput preferences). The sliding window algorithm gives us much smoother rate limiting compared to fixed windows, avoiding the burst problem at window boundaries.
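
The exact limiter is tuned per level, but the core of a sliding-window counter looks roughly like the sketch below; the key scheme, the limits, and the `allowRequest` helper are illustrative, and this is one common approximation rather than our exact implementation.

// A minimal sliding-window-counter sketch (key scheme, limits, and windows are illustrative)
import { createClient } from 'redis';

type RedisClient = ReturnType<typeof createClient>;

async function allowRequest(
  redis: RedisClient,
  subject: string,   // e.g. a player, operator, or acquirer identifier
  limit: number,
  windowMs: number
): Promise<boolean> {
  const now = Date.now();
  const currentWindow = Math.floor(now / windowMs);
  const currentKey = `rl:${subject}:${currentWindow}`;
  const previousKey = `rl:${subject}:${currentWindow - 1}`;

  // One pipelined round-trip: bump the current window, refresh its TTL, read the previous one
  const replies = await redis
    .multi()
    .incr(currentKey)
    .expire(currentKey, Math.ceil((windowMs * 2) / 1000))
    .get(previousKey)
    .exec();

  const currentCount = Number(replies[0]);
  const previousCount = Number(replies[2] ?? 0);

  // Weight the previous window by how much of it still overlaps the sliding window
  const elapsedFraction = (now % windowMs) / windowMs;
  const weighted = previousCount * (1 - elapsedFraction) + currentCount;
  return weighted <= limit;
}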

Performance Tip: We reduced our Redis round-trips by 60% by using MULTI/EXEC pipelines aggressively. Every transaction that needs to read or write multiple keys does so in a single pipeline. The difference between 6 individual Redis calls at 0.3ms each and a single pipelined call at 0.5ms total adds up fast at 10,000 transactions per second.
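
The retry-scheduling queue mentioned earlier leans on the same primitives: a sorted set whose score is the time at which a retry becomes due. A minimal sketch, assuming a single polling process and an illustrative key name:

// Sketch of the sorted-set retry queue (key name and polling model are illustrative)
import { createClient } from 'redis';

const retryRedis = createClient({ url: process.env.REDIS_URL });
await retryRedis.connect();

async function scheduleRetry(txId: string, delayMs: number): Promise<void> {
  // The score is the absolute time at which the retry becomes due
  await retryRedis.zAdd('retry_queue', { score: Date.now() + delayMs, value: txId });
}

async function popDueRetries(): Promise<string[]> {
  // Everything scored at or before "now" is ready to be retried
  const due = await retryRedis.zRangeByScore('retry_queue', 0, Date.now());
  if (due.length > 0) {
    await retryRedis.zRem('retry_queue', due);
  }
  return due;
}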

Connection Pooling and Resource Management

Connection management is where many Node.js payment systems fall apart at scale. Every external dependency, including MongoDB, Redis, acquirer APIs, and internal microservices, requires a pool of persistent connections, and getting the pool sizes right is both critical and counterintuitive.

For MongoDB, we found that the optimal pool size per process was surprisingly small: 20 connections for a process handling 2,500 transactions per second. Larger pools actually degraded performance because MongoDB's WiredTiger storage engine has its own internal concurrency controls, and too many concurrent operations cause lock contention. We run four Node.js processes per container, so each container maintains 80 MongoDB connections total.
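
In driver terms, that configuration looks roughly like the following; the `maxPoolSize` value matches the figure above, while the other options are illustrative defaults rather than our exact settings.

// Sketch of the MongoDB driver options behind those numbers
import { MongoClient } from 'mongodb';

const mongo = new MongoClient(process.env.MONGO_URI!, {
  maxPoolSize: 20,                  // small per-process pool; WiredTiger manages its own concurrency
  minPoolSize: 5,                   // keep a few connections warm between bursts
  maxIdleTimeMS: 60_000,            // recycle connections that sit idle
  serverSelectionTimeoutMS: 2_000,  // fail fast rather than queue when the cluster is unreachable
});

await mongo.connect();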

Redis connection pooling follows a different pattern. We maintain separate connection pools for different use cases: one pool for transactional state (high throughput, small payloads), one for the feature store (read-heavy, larger payloads), and one for pub/sub (persistent connections). This separation prevents a burst of feature store reads from starving transactional state operations.
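
A sketch of that separation using node-redis; the client names, URL, and channel are illustrative. The pub/sub client is isolated partly because a subscribed connection cannot issue regular commands.

// Sketch: one client per workload so a burst in one cannot starve the others
import { createClient } from 'redis';

const txStateRedis = createClient({ url: process.env.REDIS_URL }); // transactional state
const featureRedis = createClient({ url: process.env.REDIS_URL }); // feature store reads
const pubSubRedis = createClient({ url: process.env.REDIS_URL });  // pub/sub only

await Promise.all([txStateRedis.connect(), featureRedis.connect(), pubSubRedis.connect()]);

await pubSubRedis.subscribe('pipeline_events', (message) => {
  // cross-process coordination messages arrive here
});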

For acquirer API connections, we use HTTP/2 multiplexing wherever the acquirer supports it. This allows us to send multiple authorization requests over a single TCP connection, eliminating the connection setup overhead that plagued our earlier HTTP/1.1 implementation. For acquirers still on HTTP/1.1, we maintain a keep-alive pool with health checking to avoid the latency of TCP and TLS handshakes on every transaction.
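
For the HTTP/1.1 case, the keep-alive pool is essentially a tuned agent; the socket counts and timeouts below are illustrative, not our production values.

// Sketch of an HTTP/1.1 keep-alive pool so TCP/TLS handshakes are amortized
import https from 'https';

const acquirerAgent = new https.Agent({
  keepAlive: true,
  keepAliveMsecs: 30_000,
  maxSockets: 64,      // per-host cap on concurrent sockets
  maxFreeSockets: 16,  // idle sockets kept warm between bursts
});

// Passed as the `agent` option to https.request for acquirers still on HTTP/1.1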

Error Handling, Retry Logic, and Circuit Breakers

In a payment system, error handling is not an afterthought; it is the core of the product. We categorize errors into three buckets: hard declines (the issuer explicitly declined the transaction), soft declines (temporary failures that may succeed on retry), and system errors (infrastructure failures in our pipeline or at the acquirer).

Hard declines are returned immediately to the client with appropriate response codes. Soft declines trigger an automatic retry through an alternative acquirer, if one is available, within the same request cycle. The bettor never sees the soft decline; they only see the final result. This cascading retry strategy recovers approximately 8% of transactions that would otherwise be lost, which translates to millions of dollars in monthly deposit volume for our operators.
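
A simplified sketch of that cascade; the `Acquirer` interface, the `declineType` field, and the ordering of `rankedAcquirers` (which would come from the acquirer selection stage) are assumptions for illustration.

// Sketch of the cascading retry for soft declines (interfaces and fields are assumed)
interface AuthResult {
  approved: boolean;
  declineType?: 'hard' | 'soft' | 'system';
  code: string;
}

interface Acquirer {
  id: string;
  authorize(tx: TransactionContext): Promise<AuthResult>;
}

async function authorizeWithFallback(
  tx: TransactionContext,
  rankedAcquirers: Acquirer[]
): Promise<AuthResult> {
  let lastResult: AuthResult | undefined;
  for (const acquirer of rankedAcquirers) {
    const result = await acquirer.authorize(tx);
    if (result.approved) return result;
    if (result.declineType === 'hard') return result; // issuer said no: surface it immediately
    lastResult = result; // soft decline or system error: try the next acquirer
  }
  return lastResult!;
}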

// Circuit breaker implementation for acquirer connections
class AcquirerCircuitBreaker {
  private state: 'closed' | 'open' | 'half-open' = 'closed';
  private failureCount = 0;
  private lastFailureTime = 0;
  private readonly threshold = 5;        // failures before opening
  private readonly resetTimeout = 30000; // ms before half-open

  constructor(
    private readonly acquirerId: string,
    private readonly metrics: MetricsCollector
  ) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      if (Date.now() - this.lastFailureTime > this.resetTimeout) {
        this.state = 'half-open';
      } else {
        throw new CircuitOpenError(this.acquirerId);
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess() {
    this.failureCount = 0;
    this.state = 'closed';
  }

  private onFailure() {
    this.failureCount++;
    this.lastFailureTime = Date.now();
    if (this.failureCount >= this.threshold) {
      this.state = 'open';
      this.metrics.increment('circuit_breaker.opened', {
        acquirer: this.acquirerId,
      });
    }
  }
}

Circuit breakers protect the pipeline from cascading failures when an acquirer experiences downtime. When an acquirer's failure rate exceeds the threshold, the circuit opens and all subsequent transactions are immediately routed to alternative acquirers without attempting the failing one. The circuit enters a half-open state after a cooldown period, allowing a small number of test transactions through to determine if the acquirer has recovered.

Monitoring and Observability

You cannot optimize what you cannot measure, and in a payment pipeline, the metrics that matter go far beyond simple request latency. We instrument every stage of the pipeline with high-resolution timing data, recording not just how long each stage takes but how long each external call within each stage takes. This gives us the ability to pinpoint degradation at the individual Redis command or database query level.

Our monitoring stack uses Prometheus for metrics collection, Grafana for dashboards, and PagerDuty for alerting. We maintain separate dashboards for pipeline health (latency percentiles, throughput, error rates), acquirer health (per-acquirer approval rates, latency, circuit breaker states), and business metrics (deposit volume, approval rates by operator, decline reason distribution). Alerts fire on both absolute thresholds and anomaly detection, such as when a specific acquirer's latency increases by more than 2x its 24-hour average.
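
As a sketch, a per-stage latency histogram declared with prom-client might look like this; the metric name and bucket boundaries are illustrative.

// Sketch of a per-stage latency histogram with prom-client
import client from 'prom-client';

const stageLatency = new client.Histogram({
  name: 'pipeline_stage_duration_ms',
  help: 'Per-stage processing time in milliseconds',
  labelNames: ['stage'],
  buckets: [1, 2, 5, 10, 25, 50, 100, 250, 500, 1000],
});

// Called by the pipeline executor's MetricsCollector after each stage completes
function recordStageLatency(stage: string, elapsedMs: number): void {
  stageLatency.observe({ stage }, elapsedMs);
}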

Distributed tracing via OpenTelemetry gives us end-to-end visibility into every transaction. Each deposit request generates a trace that follows it through every pipeline stage, Redis operation, database query, and acquirer API call. When a bettor reports a slow or failed deposit, our support team can pull up the exact trace and see precisely where the bottleneck or failure occurred. This has reduced our mean time to diagnosis from 45 minutes to under 3 minutes.
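
A sketch of how per-stage spans can be wrapped with the OpenTelemetry API; the tracer name and span naming convention are assumptions.

// Sketch of per-stage spans using the OpenTelemetry API
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('payment-pipeline');

async function traceStage<T>(name: string, fn: () => Promise<T>): Promise<T> {
  return tracer.startActiveSpan(`stage.${name}`, async (span) => {
    try {
      return await fn();
    } catch (err) {
      // Mark the span as failed so the trace highlights where the transaction broke
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}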

Building a sub-second payment pipeline is not a one-time achievement; it is an ongoing discipline. Every dependency upgrade, every new compliance rule, every acquirer integration has the potential to add latency. We run continuous performance tests in a staging environment that mirrors production load patterns, and any change that increases p99 latency by more than 5ms requires explicit review and approval. This vigilance is what keeps us at 187ms, and we intend to push that number even lower.