A card gets swiped in Lisbon at 2:14 a.m. The cardholder lives in Denver. Somewhere a model has about 50 milliseconds to decide: approve, decline, or step up to a one-time code. Get it wrong by declining a real customer and you've insulted someone at a restaurant. Get it wrong by approving a thief and the bank eats the chargeback.
That decision has been made by machine learning for fifteen years. Gradient-boosted trees, mostly. Boring, fast, well-understood.
So here's the question everyone in regulated software is quietly asking right now: with LLMs reading contracts and writing code and passing the bar exam, why is that fraud decision still made by a 2015-era boosting algorithm? Why didn't the big language models eat fraud detection the way they ate everything else?
I find this interesting because fintech ran the experiment a couple of years ago, and regulated science is about to run it now. And the answer fintech landed on is not "AI won" or "AI was hype." It's something more useful, and it maps almost exactly onto the question we deal with at Regfo: how much of a high-stakes, auditable decision can you actually hand to a probabilistic model?
What "AI" and "ML" even mean here
Quick cleanup, because the words got mushy.
Traditional ML in fraud means a model trained on structured, tabular data. Amount, merchant category, time of day, distance from last transaction, velocity, device fingerprint. It outputs a score. XGBoost, LightGBM, CatBoost do most of the heavy lifting in production, and they're genuinely good at finding non-linear interactions in that kind of data.
"AI" in the current sense means large language models, and increasingly agents built on top of them. Reasoning over text, documents, narratives. The thing that can read an email and tell you the tone is off.
These are not two versions of the same tool. They eat different food. ML thrives on numbers in columns; LLMs thrive on messy language. That distinction is the whole story, and most "AI is replacing ML" takes miss it completely.
The three walls LLMs hit
When fintech teams actually tried to push LLMs into the live fraud decision, the same three walls came up every time.
Latency. You have tens of milliseconds at the point of sale. Small, purpose-built models serve a prediction inside that window. A cloud-hosted LLM call typically runs at least an order of magnitude slower, and that's before you retry a timeout. Real-time scoring is exactly what tabular ML was built for, which is why SEON and others still treat it as the workhorse of live fraud decisions. For a batch report you don't care. For a card swipe, that gap is the entire game.
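To make the budget concrete, here is a minimal sketch of a decision call with a hard deadline. Everything in it is an assumption for illustration: the function names, the 50 ms default, and the choice to fall back to a step-up when the scorer is late are mine, not anything a particular payment stack prescribes.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Shared pool so thread startup is not paid on every transaction.
_POOL = ThreadPoolExecutor(max_workers=4)


def decide_within_budget(score_fn, txn, budget_s=0.05, fallback="step_up"):
    """Return score_fn(txn) if it finishes inside the budget, else the fallback."""
    future = _POOL.submit(score_fn, txn)
    try:
        return future.result(timeout=budget_s)
    except FutureTimeout:
        # The late call keeps running in the background; its answer is ignored.
        return fallback


def fast_tabular_model(txn):
    # Hypothetical stand-in for a small in-process model.
    return "approve" if txn["amount"] < 100 else "step_up"


def slow_remote_model(txn):
    # Hypothetical stand-in for a network round trip that overruns the budget.
    time.sleep(0.5)
    return "approve"


if __name__ == "__main__":
    txn = {"amount": 4.0}
    print(decide_within_budget(fast_tabular_model, txn))
    print(decide_within_budget(slow_remote_model, txn))
```

The slow model might well have the better answer. It just never gets to give it, because the deadline fires first and the fallback is what the customer sees.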
Cost. A payment network scores billions of transactions. Run every one through a frontier LLM and the inference bill goes from rounding error to line item nobody will sign off on. Nobody is paying GPT-class prices to score a $4 coffee.
Explainability. This is the one I care about most, and it's where fintech and FDA work rub up against the same problem. When a fraud system declines a transaction, somebody eventually has to say why. A chargeback dispute, a regulator, an adverse-action notice: all of them want a reason, not a vibe. White-box and tree-based ML can hand you feature attributions and rule-like explanations. A systematic review of explainable AI for fraud detection spells out how much of the field is built around exactly that need: scores analysts and regulators can actually read. LLMs are still, structurally, harder to pin down. They'll give you a fluent explanation. Whether that explanation is the real reason for the decision is a different question.
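Here is roughly what "a reason, not a vibe" looks like in code. This is a toy, not a production model: a hand-written ensemble of one-split trees (stumps), with thresholds and leaf values I made up. Because each stump looks at a single feature, the score is a plain sum and every feature's contribution can be read off exactly. Real boosted models with deeper trees need an attribution method on top, which this sketch does not attempt.

```python
# Each stump: (feature, threshold, value_if_at_or_below, value_if_above).
# All numbers are invented for illustration, not fitted to any data.
STUMPS = [
    ("amount", 500.0, -0.4, 0.9),
    ("km_from_last_txn", 1000.0, -0.3, 1.2),
    ("txns_last_hour", 3.0, -0.2, 0.8),
    ("amount", 2000.0, 0.0, 0.7),
]


def score_with_reasons(txn, top_n=2):
    """Return (raw_score, reasons): the features that raised the score most."""
    contributions = {}
    for feature, threshold, low, high in STUMPS:
        leaf = low if txn[feature] <= threshold else high
        contributions[feature] = contributions.get(feature, 0.0) + leaf

    raw_score = sum(contributions.values())

    # Only features that pushed the score up count as decline reasons.
    raised = [f for f, c in contributions.items() if c > 0]
    raised.sort(key=lambda f: contributions[f], reverse=True)
    return raw_score, raised[:top_n]


if __name__ == "__main__":
    txn = {"amount": 2500.0, "km_from_last_txn": 7000.0, "txns_last_hour": 1}
    print(score_with_reasons(txn))
```

The reasons here are the actual arithmetic that produced the score, not a description written afterwards. That is the property an LLM's fluent explanation does not guarantee.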
So the LLM didn't lose on intelligence. It lost on milliseconds, dollars, and the ability to show its work. Three things that happen to matter enormously in any regulated decision.
Where AI actually won
Here's the part the "ML won, go home" crowd gets wrong. LLMs didn't fail at fraud. They got moved to the part of the job ML was always bad at.
Tabular ML sees a transaction as a row of numbers. It has no idea what the payment is about. Take an example Taktile uses: a customer sends a "tax payment" on a Sunday evening to a personal account in another country. Every number might look fine. The amount is normal, the account isn't flagged, the velocity is ordinary. The fraud is in the story. Tax offices don't take Sunday-night transfers to someone's personal foreign account. An LLM catches the narrative inconsistency a numeric model is structurally blind to.
That's the pattern. LLMs went to the edges:
- Reading the unstructured stuff (emails, support transcripts, KYC documents, merchant descriptions) that never fit in a feature column.
- Acting as an investigative partner. Instead of an analyst opening twelve tabs to assemble a case, the model drafts the narrative and surfaces the connected accounts.
- Cutting false-positive noise. Elastic reports agentic setups driving roughly a 40% drop in false positives in some deployments, mostly by adding context to alerts a rules-and-scores system flagged.
None of that replaces the scoring model. It wraps around it.
The shape that actually shipped
So the thing that won wasn't ML or AI. It was a division of labor, and once you see it you can't unsee it.
The deterministic, fast, explainable core makes the real-time call. The probabilistic, context-hungry layer handles the messy edges: investigation, unstructured signals, the judgment calls where being slower and fuzzier is acceptable because a human is in the loop anyway. IBM's writeup on AI fraud detection in banking lands on the same layered shape, with ML scoring the transaction and generative AI adding context around the alert. Not a replacement. A two-speed system.
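A sketch of that two-speed shape, under my own assumptions: the thresholds, the `Alert` type, and the `summarize` callback (a placeholder for whatever LLM call sits behind the investigation tooling) are all hypothetical. The point is the wiring. The fast path decides and queues an alert; the slow path can only add context to it later.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Alert:
    txn_id: str
    decision: str
    score: float
    context: list = field(default_factory=list)


# Alerts waiting for the slow, context-gathering layer.
review_queue = deque()


def core_decision(txn_id, score, decline_at=0.9, step_up_at=0.6):
    """Fast path: thresholds on a precomputed score. No LLM call happens here."""
    if score >= decline_at:
        decision = "decline"
    elif score >= step_up_at:
        decision = "step_up"
    else:
        decision = "approve"
    if decision != "approve":
        review_queue.append(Alert(txn_id, decision, score))
    return decision


def enrich_next_alert(summarize):
    """Slow path: runs later, off the payment path, and never edits the decision."""
    if not review_queue:
        return None
    alert = review_queue.popleft()
    alert.context.append(summarize(alert))
    return alert


if __name__ == "__main__":
    print(core_decision("t1", 0.95))
    print(enrich_next_alert(lambda a: f"draft case notes for {a.txn_id}"))
```

Nothing in `enrich_next_alert` writes to `decision`. If the context layer is wrong, an analyst gets a bad draft, not a customer a bad decline.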
Adoption backs it up rather than contradicting it. Industry surveys report that most financial institutions now use AI for fraud, and that a large share deploy agents specifically for investigative tasks. But dig into where those agents sit and it's the edges, not the core swipe-time decision. The boring boosting model is still holding the line.
Why this is a Regfo post and not a fintech post
Because we're about to run the exact same experiment in FDA-regulated science, and I'd rather not relearn fintech's lesson the expensive way.
The pitch for AI in regulatory work sounds identical to the early fraud pitch: throw an LLM at the whole problem, let it read the protocol and the guidelines and just tell you if you're compliant. It's a seductive demo. It's also the part where milliseconds, cost, and "show your work" come back to bite you. Except in our world the third one isn't a chargeback dispute. It's a clinical hold and an FDA reviewer asking which specific requirement you missed.
A regulatory finding without a citation is worthless. If a tool tells you your safety pharmacology package has a gap but can't point to ICH S7A, you've got a vibe, not a finding. If it flags your genotoxicity battery as incomplete, you need the exact S2(R1) requirement that's unmet, not a confident paragraph. Same problem fintech had: the fluent explanation is not the same as the auditable reason.
So we built Regfo the way fintech ended up building fraud systems, mostly by accident, before I'd ever read a word about any of this. The deterministic core is a rules engine: 373 structured FDA/ICH requirements that don't hallucinate and always cite the exact guideline section. That's the explainable, "show your work" layer. The LLM sits at the edge, reading the actual protocol text and catching the narrative problems a rule check can't see, the way the LLM in the Sunday tax-payment example does. Context from the AI, citation from the rules. When you read a generated clinical overview gap analysis, the judgment is the model's and the receipts are the engine's.
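To show the idea rather than the product, here is a deliberately tiny illustration of "context from the AI, citation from the rules." It is not Regfo's engine. The two rules, the `sections` field, and the `annotate` callback (standing in for an LLM reading the protocol text) are hypothetical, and the checks are far cruder than a real requirement. The guideline names come from the examples above; section-level references are left out because I'm not going to invent them.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    rule_id: str
    citation: str      # guideline the rule traces back to
    requirement: str
    is_met: Callable[[dict], bool]


@dataclass(frozen=True)
class Finding:
    rule_id: str
    citation: str      # always copied from the rule, never from the model
    requirement: str
    context: str = ""  # optional narrative note from the model


# Hypothetical, oversimplified rules for illustration only.
RULES = [
    Rule("safety-pharm", "ICH S7A", "Safety pharmacology package is included",
         lambda p: "safety_pharmacology" in p["sections"]),
    Rule("genotox", "ICH S2(R1)", "Genotoxicity battery is included",
         lambda p: "genotoxicity" in p["sections"]),
]


def gap_analysis(protocol, annotate=None):
    """Return one Finding per unmet rule; the model may only add context."""
    findings = []
    for rule in RULES:
        if rule.is_met(protocol):
            continue
        note = annotate(rule, protocol) if annotate else ""
        findings.append(Finding(rule.rule_id, rule.citation, rule.requirement, note))
    return findings


if __name__ == "__main__":
    protocol = {"sections": {"genotoxicity"}}
    for finding in gap_analysis(protocol):
        print(finding)
```

The design choice that matters is that a `Finding` can only be created from a `Rule`, so there is no code path where the model produces a gap without a citation attached.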
We didn't do it that way because we're clever. We did it because the pure-LLM version fails the same audit fintech's pure-LLM version failed. I've written before about whether you can trust AI in regulatory work. Short version: you can trust it exactly as far as it shows its sources.
The lesson, minus the hype
If you're putting AI into any decision someone can be forced to justify later, fintech already mapped the terrain.
Don't ask "can the LLM do the whole job." Ask which part of the job needs to be fast, cheap, and auditable, and which part needs to read messy language and reason about context. Those are different parts. Give them to different tools. The deterministic core earns the trust; the model earns its keep at the edges.
The teams that "abandoned AI" mostly didn't. They just stopped pretending one model does everything, and went back to letting boring, explainable ML own the decision while AI owns the context around it. In a regulated domain that's not a retreat. It's the only version that survives an audit.
If you want to see what citation-first looks like in practice, paste a protocol into Regfo and watch where the findings come from. Every gap points at a specific guideline. That's the part that has to be a rule, not a guess.
Related reading:
- Can You Trust AI in Regulatory Work? — how far AI judgment goes before it needs a citation
- Prompt Injection Attacks on AI Regulatory Compliance Tools — the other reason you don't hand the whole decision to an LLM
- FDA's Draft Guidance on AI-Discovered Drugs — where regulators are drawing the line on AI in submissions