The authorization flow does not wait for your fraud model to finish thinking. From the moment a transaction hits a payment processor to the moment an authorization response goes back, the entire window — including network round-trips, issuer processing, and any third-party services in the chain — is on the order of 1-2 seconds for a standard card-present transaction. For fraud scoring that sits inline in that flow, the practical latency budget is often 50-80ms, and some processors will give you less.
We target a median scoring latency of 47ms, with a p99 below 120ms. That is not a marketing number — it is the constraint that our integration architecture has to satisfy to be a viable inline fraud score rather than an async enrichment service that gets consulted after the fact. Here is what it actually takes to operate at that latency while running 140+ signals per transaction.
The Core Latency Problem
The naive approach to multi-signal fraud scoring fetches signals from multiple data stores, runs them through a model, and returns a score. Each data fetch adds latency. If you pull from five different stores one after another at an average of 10ms per fetch, you have spent 50ms, more than our 47ms median target, before the model even runs. Sequential fetches are not viable within this budget.
The first architectural principle is aggressive parallelism in signal retrieval. Every signal fetch that does not depend on another fetch must be launched simultaneously. This sounds obvious, but it requires careful dependency mapping across all 140+ signals to identify which can be parallelized and which have genuine ordering requirements. In practice, the vast majority of signals can be fetched in parallel, and the serial dependencies tend to be few and predictable.
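The difference between sequential and parallel retrieval can be sketched in a few lines. This is a minimal illustration, not our production code: `fetch_signal`, the store names, and the use of `asyncio.sleep` as a stand-in for a store round-trip are all assumptions made for the example. It models five independent stores at 10ms each.

```python
import asyncio
import time

# Hypothetical fetcher: asyncio.sleep stands in for a round-trip to a signal store.
async def fetch_signal(name: str, delay_s: float) -> tuple[str, float]:
    await asyncio.sleep(delay_s)
    return name, delay_s

async def fetch_sequential(specs: dict[str, float]) -> dict[str, float]:
    # Each fetch waits for the previous one, so latencies add up.
    results = {}
    for name, delay_s in specs.items():
        key, value = await fetch_signal(name, delay_s)
        results[key] = value
    return results

async def fetch_parallel(specs: dict[str, float]) -> dict[str, float]:
    # All independent fetches are launched at once; total latency is
    # roughly that of the slowest single fetch.
    pairs = await asyncio.gather(
        *(fetch_signal(name, delay_s) for name, delay_s in specs.items())
    )
    return dict(pairs)

def timed(fetch_fn, specs: dict[str, float]) -> tuple[dict[str, float], float]:
    # Runs one fetch strategy and returns (results, elapsed milliseconds).
    start = time.perf_counter()
    result = asyncio.run(fetch_fn(specs))
    return result, (time.perf_counter() - start) * 1000.0

# Five independent stores at 10 ms each (illustrative values only).
SPECS = {f"store_{i}": 0.010 for i in range(5)}

if __name__ == "__main__":
    _, seq_ms = timed(fetch_sequential, SPECS)
    _, par_ms = timed(fetch_parallel, SPECS)
    print(f"sequential: {seq_ms:.1f} ms, parallel: {par_ms:.1f} ms")
```

Signals with genuine ordering requirements would run as a second `gather` stage after the first completes, which is why keeping the serial dependencies few matters so much for the total.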
The second principle is pre-computation. Many of the signals we compute are aggregations over historical data — rolling velocity counts, historical percentile distributions, account-level behavioral baselines. Computing these on-demand at score time would be prohibitively expensive. Instead, we maintain pre-computed signal state that is updated asynchronously as transactions arrive, so that at score time we are reading a pre-computed value rather than executing an aggregation query. The freshness of these pre-computed values is a deliberate design tradeoff: we accept signal staleness measured in seconds in exchange for the ability to serve them at sub-millisecond read latency.
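To make the pre-computation idea concrete, here is a small sketch of a rolling velocity counter whose value is maintained on the write path and simply looked up on the read path. The class name, method names, and window size are hypothetical, and a real system would persist this state rather than hold it in a Python dict.

```python
from collections import deque

class RollingVelocityCounter:
    """Per-key event count over a trailing time window, maintained on write."""

    def __init__(self, window_s: float) -> None:
        self.window_s = window_s
        self._events: dict[str, deque] = {}   # raw timestamps per key
        self._counts: dict[str, int] = {}     # pre-computed value served at score time

    def record(self, key: str, ts: float) -> None:
        # Write path: runs asynchronously as transactions arrive.
        events = self._events.setdefault(key, deque())
        events.append(ts)
        # Drop timestamps that have fallen out of the window.
        while events and events[0] <= ts - self.window_s:
            events.popleft()
        self._counts[key] = len(events)

    def read(self, key: str) -> int:
        # Read path: a constant-time lookup with no aggregation at score time.
        # The value is only as fresh as the last record() call for this key,
        # which is the staleness tradeoff described above.
        return self._counts.get(key, 0)
```

The read path never touches the raw event history, so its cost does not grow with transaction volume. The price is that the count reflects the state as of the most recent write.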
Signal Storage Architecture
The storage layer for real-time fraud scoring is not a single database. Different signal types have different latency requirements and access patterns, and trying to serve all of them from one store produces either unacceptable latency or unacceptable cost.
We operate with three storage tiers:
In-process memory cache. The highest-frequency signals (recent transaction count, last-seen device fingerprint, account tenure, pre-computed velocity counts) live in a hot in-process cache on the scoring service. Read latency is under 0.5ms. The tradeoff is that this cache must be kept small (we budget a few gigabytes per scoring instance), requires careful invalidation logic, and is not shared across scoring instances, which means an instance can serve a stale value until its local copy is invalidated or refreshed. For the signals in this tier, the cost of occasional stale reads is acceptable because the signal is used at a coarse granularity anyway.
Near-process key-value store. Signals that are too large to fit in-process but still need sub-5ms read latency live in a co-located key-value store. These include device history records, account-level behavioral baselines, and network graph lookups. This tier handles the majority of signal fetches by count.
Async enrichment. A small number of signals cannot be served within the scoring latency budget at all: third-party data lookups, complex graph traversals, bureau checks. These are never fetched inline at score time. Instead, they feed an asynchronous enrichment layer that updates account risk profiles between transactions. By the time a transaction arrives, this enrichment has already been computed and stored in one of the faster tiers.
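The read path across the first two tiers can be sketched as a read-through lookup: check the in-process cache, fall back to the key-value store on a miss or an expired entry, and populate the local cache on the way back. Everything here is illustrative. `TieredSignalStore`, the `kv_get` callable, and the TTL-based expiry are assumptions for the example, not a description of our invalidation logic.

```python
import time
from typing import Callable, Optional

class TieredSignalStore:
    """Read-through lookup: in-process cache first, key-value store on a miss."""

    def __init__(
        self,
        kv_get: Callable[[str], Optional[object]],  # hypothetical KV store client call
        ttl_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._kv_get = kv_get
        self._ttl_s = ttl_s
        self._clock = clock
        self._local: dict[str, tuple[float, object]] = {}  # key -> (expires_at, value)

    def get(self, key: str) -> Optional[object]:
        now = self._clock()
        entry = self._local.get(key)
        if entry is not None and entry[0] > now:
            # Tier 1 hit: may be stale relative to the KV store until the TTL expires.
            return entry[1]
        # Tier 1 miss or expired entry: fall through to the slower tier.
        value = self._kv_get(key)
        if value is not None:
            self._local[key] = (now + self._ttl_s, value)
        return value
```

In this sketch the async enrichment layer never appears on the read path at all. It would write its results into the key-value store between transactions, and the scoring service would pick them up through the same `get` call.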
The Model Inference Problem
Running a gradient-boosted model over 140+ features takes time. On a modern server CPU, a single inference call on a well-optimized XGBoost model takes roughly 1-3ms depending on tree depth and feature count. That is acceptable as a fraction of a 47ms budget, but it assumes the model is loaded in memory and the inference library is compiled for the target architecture.
We are not suggesting deep learning or transformer-based architectures for inline fraud scoring. The latency characteristics of neural network inference — especially for the kind of feature-rich tabular data that fraud scoring involves — are worse than gradient-boosted trees for equivalent detection performance. We have evaluated this tradeoff, and the inference latency penalty of deep learning does not justify the marginal detection improvement on fraud scoring tasks where the feature engineering is the main source of signal anyway.
Model serving is done through a compiled inference runtime, not through a Python-based model serving layer. The difference in inference latency between a compiled C++ model server and a Python process with a scikit-learn or XGBoost wrapper is typically 3-8x. At our scale, that difference is the margin between hitting the latency target and missing it.
Tail Latency Is the Real Constraint
Median latency of 47ms is meaningful, but the metric that actually matters for payment integrations is p99 latency: what happens in the slowest 1% of requests. An inline fraud score that occasionally takes 500ms is worse than one that reliably takes 60ms, because the long-tail latency spikes create authorization timeouts that are hard to distinguish from fraud model failures.
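A quick way to see why the median hides the problem is to compute p50 and p99 over a synthetic sample. The sketch below uses the nearest-rank percentile definition, and the latency values are invented for illustration: 98 requests at 45ms and 2 requests at 500ms.

```python
import math
from typing import Sequence

def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("percentile() requires at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

# Synthetic latencies in milliseconds (illustrative values, not measurements).
LATENCIES_MS = [45.0] * 98 + [500.0] * 2

if __name__ == "__main__":
    print("p50:", percentile(LATENCIES_MS, 50))
    print("p99:", percentile(LATENCIES_MS, 99))
    print("mean:", sum(LATENCIES_MS) / len(LATENCIES_MS))
```

On this sample the median and the mean both look healthy, while the p99 lands on the 500ms spike. That is the case an authorization timeout actually experiences.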
The sources of tail latency in our system are predictable: cache misses that force reads to slower storage tiers, garbage collection pauses in the JVM-based components, and network jitter to external data sources. Each of these can be addressed architecturally:
- Cache miss impact is mitigated by pre-warming caches for accounts with recent activity and setting conservative cache TTLs that accept slightly higher memory usage in exchange for lower miss rates.
- GC pause impact is addressed by using off-heap storage for the hot cache and tuning GC settings to prefer lower pause frequency over lower average latency.
- Network jitter to external sources is mitigated by timeouts with fallback behavior: if an external lookup does not return within budget, the signal is treated as absent rather than waited on. The model is trained to handle missing signal values gracefully, which means a network blip degrades score quality slightly rather than failing the scoring request entirely.
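The timeout-with-fallback pattern from the last item can be sketched with `asyncio.wait_for`. The lookup function, signal names, delays, and budget below are hypothetical. The sketch assumes the downstream model accepts a missing value (represented here as `None`) for any signal that does not arrive in time.

```python
import asyncio
from typing import Awaitable, Optional

async def fetch_with_budget(lookup: Awaitable[float], budget_s: float) -> Optional[float]:
    # Wait for the lookup only as long as the budget allows. On timeout the
    # lookup is cancelled and the signal is reported as absent.
    try:
        return await asyncio.wait_for(lookup, timeout=budget_s)
    except asyncio.TimeoutError:
        return None

# Hypothetical lookup: asyncio.sleep stands in for network latency.
async def slow_lookup(value: float, delay_s: float) -> float:
    await asyncio.sleep(delay_s)
    return value

async def gather_signals(budget_s: float) -> dict[str, Optional[float]]:
    # Both lookups run concurrently; each is bounded by the same budget.
    device_risk, network_risk = await asyncio.gather(
        fetch_with_budget(slow_lookup(0.7, 0.0), budget_s),   # returns in time
        fetch_with_budget(slow_lookup(0.2, 2.0), budget_s),   # exceeds the budget
    )
    return {"device_risk": device_risk, "network_risk": network_risk}

if __name__ == "__main__":
    print(asyncio.run(gather_signals(budget_s=0.1)))
```

Because the slow lookup is cancelled at the deadline, the scoring request completes in roughly the budget rather than the lookup's full duration. Before inference, the `None` would be mapped to whatever missing-value encoding the model was trained with.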
What We Sacrificed to Get Here
We want to be honest about the tradeoffs. Real-time scoring at 47ms median latency is achievable, but it is not free. The pre-computation approach means some signals have seconds of staleness — in a fast-moving attack, that window can matter. The in-process cache means different scoring instances may see slightly different signal values during periods of rapid state change. The fallback behavior for slow external lookups means some signals are occasionally absent from scores during infrastructure hiccups.
None of these tradeoffs are deal-breakers for inline fraud scoring. The stale-signal window is small enough that it does not meaningfully advantage attackers, and the consistency tradeoffs are well within acceptable bounds for the statistical nature of fraud scoring. But any fraud scoring system that claims to run 140+ signals at 50ms latency without some form of pre-computation and signal staleness is not being honest about its architecture. The physics of the problem do not allow synchronous computation of all signals within that window from a cold start.
The right framing is not "how do we eliminate these tradeoffs" but "are these tradeoffs acceptable for the use case?" For inline payment authorization fraud scoring, they are. For high-stakes decisions that can afford more latency — like suspicious account opening reviews or large-value wire approvals — you would make different architectural choices, use more signals, tolerate higher latency, and get better precision in exchange.