The question I get most from regulatory affairs leads at small biotechs is some version of: "If I let AI touch our IND prep, am I going to find myself in front of FDA explaining why a hallucination ended up in Module 2.7?"
It's the wrong question. The right question isn't "is the AI accurate." It's "can I show, in writing, why every output looks the way it does." Those are not the same thing, and once you understand the difference, the trust problem becomes solvable.
What "trust" means to FDA
When FDA reviewers look at any system that produced your regulatory data, whether it's yours, your CRO's, or an AI's, they aren't asking whether it got the right answer. They're asking five things, in order:
- Where did this number come from? (Source attribution)
- Can I reproduce it? (Determinism, version pinning)
- Who reviewed it? (Human accountability)
- What was the system doing when it produced this? (Audit trail)
- Was the system fit for purpose? (Validation)
"Is the answer correct" isn't on that list. It's implied (the answer has to be defensible), but FDA doesn't validate your math; they validate your process for producing math. This is why 21 CFR Part 11 exists, and why GxP environments lean on computer system validation rather than output spot-checks.
If your AI system can answer those five questions, FDA can work with it. If it can't, the model can be 99.9% accurate and still get rejected because the reviewer can't see how it got there.
The five questions, applied to AI
Source attribution. When a regulatory AI flags "your repeat-dose toxicology study is missing recovery animals per ICH M3(R2)," it has to cite the exact section of ICH M3(R2) it's reading from, plus the section of your protocol that triggered the flag. No citation = no trust. This is the easiest test to apply, and most general-purpose AI tools fail it immediately.
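To make "full citation" concrete, here is a minimal sketch of what a finding with proper source attribution could carry. The field names are invented for illustration; they aren't Regfo's schema or any specific vendor's.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    """One gap-analysis finding. Every field name here is illustrative only."""
    rule_id: str             # e.g. "M3R2-recovery-animals" (hypothetical identifier)
    severity: str            # "major" / "minor" / "info"
    guideline_citation: str  # the exact guideline section and paragraph the rule reads from
    guideline_quote: str     # the verbatim guideline text behind the flag
    protocol_location: str   # the section of *your* protocol that triggered the flag
    rationale: str           # why the cited text and the protocol text don't line up

# A finding that arrives without guideline_citation and protocol_location populated
# is exactly the "no citation = no trust" failure described above.
```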
Reproducibility. If you run the same protocol through the same system tomorrow, do you get the same finding? LLMs are non-deterministic by default. Production-grade regulatory AI pins model versions, prompt versions, and rule-engine versions. When FDA asks "what version of your tool produced this Module 2.7," there has to be an answer that doesn't change.
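One way to make that pin concrete is a small run manifest attached to every report: a minimal sketch, assuming nothing about any particular product's API (the function and field names below are made up for illustration).

```python
import hashlib
from datetime import datetime, timezone

def run_manifest(model_version: str, prompt_version: str, rules_version: str,
                 protocol_text: str) -> dict:
    """Record exactly what produced a report, so the same question gets the same answer tomorrow."""
    return {
        "model_version": model_version,        # pinned release, never "latest"
        "prompt_version": prompt_version,
        "rules_engine_version": rules_version,
        "input_sha256": hashlib.sha256(protocol_text.encode("utf-8")).hexdigest(),
        "generated_at": datetime.now(timezone.utc).isoformat(),
    }
```

Two reports with identical manifests (timestamp aside) should contain identical findings; any difference is a change you have to be able to explain.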
Human accountability. No AI signs an IND. A human regulatory affairs lead does. Any AI tool that frames its output as "the answer" rather than "a finding for your review" doesn't match how RA actually signs off. Look for tools that surface findings with severity ratings, source citations, and explicit human sign-off, not tools that auto-submit anything.
Audit trail. Every interaction logged. Every input, every output, every override. If you tell the AI "ignore this finding because we have a waiver," that override has to land in a queryable log with the reason and the user. Consumer-tier chatbots (the free or personal-plan versions of ChatGPT, Claude, Gemini) are out of bounds for IND-bound work: no Part 11 e-signatures, no validated environment, no DPA at the level pharma legal will sign. Enterprise tiers from those vendors get closer, but you still need to validate the deployment for your specific use case.
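Here's a minimal sketch of what "lands in a queryable log" can mean in practice, with the override case spelled out. The event fields and the JSON-lines format are assumptions for illustration, not any vendor's actual audit schema.

```python
import json
from datetime import datetime, timezone

def log_event(log_path: str, user_id: str, action: str, detail: dict) -> None:
    """Append one event (input, output, or override) to an append-only JSON-lines log."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "action": action,   # e.g. "finding_overridden"
        "detail": detail,
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")

# "Ignore this finding, we have a waiver" becomes a record you can query later,
# not a chat turn that disappears:
log_event("audit_trail.jsonl", "ra_lead_01", "finding_overridden",
          {"finding_id": "TOX-014", "reason": "Recovery-arm waiver agreed at pre-IND meeting"})
```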
Validation. GAMP 5 is the framework most pharma companies use for validating computer systems. AI complicates GAMP 5 because the "system" includes a model whose behavior you don't fully control. FDA's 2025 AI in Drug Development discussion paper lands on a risk-based answer: validate the system for the specific use case, document the validation, and re-validate when the model changes.
What's safe to deploy right now
- Gap analysis against ICH/FDA guidelines. AI reads your draft protocol and flags missing elements with citations. Output is a finding, not a conclusion. RA reviews each one. Safe.
- CTD section drafting from structured data. AI populates Module 2.6 tabulated summaries from your tox study tables. Numbers come from the source. RA verifies and signs off. Safe with traceability.
- Adverse event narrative drafts. AI produces a first draft of an ICH E3 Section 14.3 narrative from structured AE data. Medical writer edits and signs. Safe.
- Semantic search across your data. AI finds "all Grade 3+ ALT elevations not reported within 7 days" across your captured data. Returns the rows, not a summary (there's a sketch of what that query can look like after this list). Safe.
- Citation lookups. AI surfaces the exact ICH section relevant to your question, with the quote. In our internal estimate this replaces something like 200 hours per IND of Ctrl+F through PDFs. Safe.
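For the semantic-search item above, here's a rough sketch of the structured filter a question like "all Grade 3+ ALT elevations not reported within 7 days" might compile down to. The column names are invented for illustration; the point is that what comes back is the underlying rows, not a paraphrase of them.

```python
import pandas as pd

def late_grade3_alt_elevations(ae: pd.DataFrame) -> pd.DataFrame:
    """Return the raw AE rows (not a summary) for Grade 3+ ALT elevations reported >7 days after onset."""
    days_to_report = (ae["reported_date"] - ae["onset_date"]).dt.days
    mask = (
        ae["ae_term"].str.contains("ALT", case=False)   # hypothetical column names throughout
        & (ae["ctcae_grade"] >= 3)
        & (days_to_report > 7)
    )
    return ae.loc[mask]
```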
What isn't safe yet
The pattern across each of these: the AI gets close enough to demo well, but not close enough that the output survives a review.
- Auto-generating CRFs from a protocol. A vendor pitched a Series A team I work with on exactly this last month. The generated CRFs missed eligibility branching, AE coding wasn't tied to MedDRA hierarchies, and the protocol-version-at-enrollment field wasn't there at all. Two weeks fixing what was supposed to save them weeks. Auto-CRF is the most over-promised feature in the category right now.
- Auto-coding adverse events. MedDRA coding requires medical judgment. Anyone selling auto-AE-coding without human review is selling you a future audit finding.
- Auto-submitting anything to FDA. No AI should send a 1571 or amendment without explicit human approval. If a vendor pitches this, end the meeting.
- Replacing pre-IND meeting strategy. AI can help you prepare. It can't anticipate what your reviewer will care about based on division-level patterns, recent FDA letters, and political wind. That's a human RA lead's job.
- Anything in clinical operations beyond suggestions. Patient-facing decisions, dose adjustments, eligibility determinations: not AI territory. Tools that blur this line are taking on liability you shouldn't accept.
How to evaluate a regulatory AI vendor
If you're being pitched, here's the due-diligence checklist I'd run before any pilot:
- Show me a finding with full citation. Section, paragraph, page. If it's vague ("ICH guidance suggests…"), walk away.
- Run the same input twice; show me the outputs are identical, or that any difference is logged with a reason. You can script this check yourself; see the sketch after this list.
- Show me your audit log for a real user session. Inputs, outputs, overrides, timestamps, user IDs.
- Walk me through your validation package. GAMP 5 risk assessment, IQ/OQ/PQ documentation, change-control SOP. If they look confused, they're not enterprise-ready.
- Show me your data residency and Part 11 controls. Where does protocol text live? Who has access? How are signatures handled?
- Ask what they explicitly DON'T do. Vendors with mature thinking have a list. Vendors selling magic don't.
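The run-it-twice item doesn't have to be taken on faith; you can script it during the pilot. A minimal sketch, assuming nothing about the vendor's product beyond a hypothetical analyze() callable that returns findings:

```python
import json

def reproducibility_check(analyze, protocol_text: str) -> bool:
    """Run the same input twice and report whether the findings come back byte-identical."""
    first = json.dumps(analyze(protocol_text), sort_keys=True)
    second = json.dumps(analyze(protocol_text), sort_keys=True)
    if first != second:
        print("Outputs differ: ask the vendor to show where that difference is logged, and why.")
    return first == second
```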
If a vendor passes all six, they're worth a pilot. Most don't pass three.
How Regfo handles this
We're not going to claim we check every box for every customer. We're a Series A-stage tool with 25+ FDA/ICH study types analyzed and a rules engine encoding 373 structured requirements. Specifically:
- Every finding cites the exact ICH/CFR section, paragraph, and the source location in your protocol.
- Rules engine is versioned. Same input + same version produces same finding. We log version with each report.
- Output is always a finding for human review, never a conclusion. RA signs off. We don't auto-submit anything.
- Audit log captures every interaction; we'll show you the schema before pilot.
- We don't do data capture. We don't replace your EDC. We sit on top of your captured data and your draft documents.
Where we're explicit about not being there yet: we're not a fully validated system in the GAMP 5 / Part 11 sense, and we don't pretend to be. We tell pilot customers this upfront. If you need a fully validated platform for late-phase work, you're not our customer yet; Veeva is. We're built for the pre-IND, pre-Phase-1 prep where speed matters and the human regulatory lead is making every final call.
The honest answer
So, can you let AI touch your IND prep? Yes, with the same discipline you apply to every other tool that produces data your team will sign for: citations, audit trails, version pinning, human review. The question stops being about AI specifically and starts being about boring regulatory discipline. That's a problem the industry already knows how to solve.
The first checklist item is "show me a finding with full citation." Paste a draft protocol into Regfo and you'll get exactly that in 20 seconds: the ICH section, the paragraph, and the line in your protocol it came from. Run the same six-question test on us that you'd run on any other vendor.
Related reading:
- You've Outgrown REDCap. Here's How to Tell. — when academic tools hit industry workflow
- From REDCap Data to IND Submission Without the Copy-Paste Marathon — the bridge piece
- Common IND Deficiencies and How to Avoid Them — what FDA flags, with or without AI
- How to Check Preclinical Studies Against ICH Guidelines — the gap-analysis playbook