Will Your AI Compliance Vendor Train on Your IND? 7 Questions to Ask
In March 2023, Samsung semiconductor engineers pasted proprietary information into ChatGPT in three separate incidents within 20 days: source code from a semiconductor measurement database, code used for identifying defective chips, and a transcript of an internal meeting about unreleased process technology. Samsung had lifted its ChatGPT ban for the semiconductor division in early March. By April 1, internal security had flagged the leaks. On May 1, 2023, Samsung banned all generative AI on corporate devices and networks (CIO Dive, April 2023).
That was the consumer-grade ChatGPT web interface. Not an API. Not an enterprise agreement. A browser tab.
Every pharma QA lead I talk to right now is doing some version of the Samsung thing and trying not to panic about it. Someone on the team pasted a protocol snippet into ChatGPT. Someone else uploaded a designated contract market (DCM) application draft to a free AI "reviewer" they found on Product Hunt. Now the company is evaluating an actual AI compliance vendor, the kind that reads your IND or your CFTC Part 38 filing and tells you what's missing. And the first question on everyone's mind is the one nobody wants to say out loud:
Are you going to train on my data? Will your next customer be able to see what's in my submission?
I built one of those tools. I've sat across from biotech CISOs and finance compliance officers who asked that question, some politely and some not. The honest answer: most AI compliance vendors don't deserve a straight yes, because the real answer has half a dozen layers, each of which a marketing page will quietly conflate into one reassuring sentence.
Disclosure up front: I built Regfo, so I'm biased. This post explains the due diligence I'd do if I were buying an AI compliance tool instead of building one. At the end I explain exactly how Regfo handles this stuff, including what we don't have (SOC 2) and which third-party APIs our data touches (spoiler: it touches them).
What "We Don't Train on Your Data" Actually Means
This is the sentence every AI vendor uses. It's technically true most of the time. It's also incomplete in ways that matter.
"Training" is one specific thing: taking your inputs and using them to update model weights so the next version of the model can produce similar outputs for someone else. When OpenAI says they don't train on API data, they mean they don't feed your prompts into the next foundation model's gradient descent. That policy has been in effect since March 1, 2023 for API traffic (OpenAI business data page).
Here's what that sentence does not cover by default:
- Retention. Prompts and completions sit in the LLM provider's logs. OpenAI's default is 30 days for abuse monitoring. Anthropic governs retention under its commercial data retention policy and offers Zero Data Retention for eligible APIs by arrangement (Anthropic API and data retention docs). Short. Not zero.
- Human review. "Abuse monitoring" means humans can read flagged content. Trip a classifier (say, a toxicology study that mentions pediatric dosing) and a contractor somewhere might look at it.
- Fine-tuning. Different from foundation training. Some vendors offer "your custom model" features that fine-tune on your data. Read the checkbox.
- Eval sets. Anthropic, OpenAI, and Google all maintain internal evaluation suites. Some include real customer examples the customer didn't realize they'd agreed to contribute.
- Subprocessors. Your AI vendor probably runs on AWS or GCP. Your BAA needs to cover them too.
- Court orders. Between May and September 2025, OpenAI was under a federal preservation order in New York Times v. OpenAI requiring it to preserve ChatGPT output logs that would otherwise have been deleted, including deleted and temporary chats. Only ZDR API customers and ChatGPT Enterprise/Edu were excluded from the scope. The preservation requirement was terminated in late September 2025, but the precedent stands (Wald.ai analysis, Justia docket).
Read that last one twice. Your vendor's privacy promise is only as strong as the subpoenas their legal team receives. If a plaintiff in a case you're not even involved in gets a broad preservation order, your data stays. The next order hasn't been written yet. "We don't train on your data" is the tip of a much longer iceberg.
The 7 Questions to Ask Every AI Compliance Vendor
I phrase these the way I'd actually ask them in a procurement call. Don't accept "yes" as an answer to any of them. Ask for the specific policy URL, contract clause, or technical control.
1. Where does my document actually go, byte by byte?
Draw the path. Your SharePoint → vendor's upload endpoint → their storage (S3? Postgres?) → LLM API call → LLM provider's servers → response → vendor's DB → your screen. Every hop is a trust boundary with its own retention policy. Most vendors can't draw this diagram without three follow-up emails. Red flag by itself.
What you want: named storage systems, named LLM providers, named regions, retention in days at each stage. If they say "encrypted cloud storage," make them say which cloud.
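To make that concrete, here's the inventory I'd make a vendor fill in, sketched as a Python structure. Every system name, region, and number below is illustrative, not any real vendor's answer:

```python
# Hypothetical data-path inventory: one entry per hop your document touches.
# An "unknown" anywhere means the vendor hasn't drawn their own diagram yet.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Hop:
    system: str                    # a named system, not "encrypted cloud storage"
    operator: str                  # you, the vendor, or a named subprocessor
    region: str                    # e.g. "us-east-1"
    retention_days: Optional[int]  # None = kept until you delete it
    binding_doc: str               # the clause or policy URL that governs this hop

document_path = [
    Hop("SharePoint",             "you",          "your tenant", None, "internal policy"),
    Hop("Vendor upload endpoint", "vendor",       "us-east-1",   0,    "DPA deletion clause"),
    Hop("Vendor document store",  "vendor",       "us-east-1",   None, "DPA deletion clause"),
    Hop("LLM provider API",       "subprocessor", "us-central1", 30,   "provider retention policy"),
    Hop("Vendor results DB",      "vendor",       "us-east-1",   None, "DPA deletion clause"),
]
```

A vendor who can fill in every field without follow-up emails has done the work. One who can't is guessing about their own architecture.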
2. Which LLM provider are you calling, and what's their retention policy?
Almost every AI compliance vendor is a wrapper around OpenAI, Anthropic, or Google. That's fine. It means two vendors handle your data, not one, and the LLM provider's terms apply. The three you'll hear named:
| Provider | Default retention (API) | Zero-retention option | HIPAA BAA |
|---|---|---|---|
| OpenAI | 30 days for abuse monitoring, in place since March 1, 2023 (docs) | Yes, by application via sales; excludes content from abuse logs | Yes, covers API, ChatGPT Enterprise, and ChatGPT Edu (help article) |
| Anthropic | Per-feature retention under Anthropic's commercial data retention policy; most features retain the "shortest practical TTL" (API retention docs) | Yes, for eligible APIs (Messages, Token Counting) by approved enterprise arrangement | Yes, covers the Claude API, Claude Enterprise, and Claude Code (ZDR-enabled) (privacy center) |
| Google (Vertex AI / Gemini) | No at-rest storage for published Gemini model requests; prompts and responses not used to train (Vertex AI ZDR docs) | Yes, ZDR configuration available on Vertex AI | Yes, for eligible Google Cloud and Workspace customers under a signed BAA |
If your vendor names a fourth option like "our own self-hosted model" or "a custom fine-tuned Llama," ask where they run inference. A self-hosted model on their AWS doesn't solve the retention problem, it just moves it.
3. Do you have a Zero Data Retention agreement with your LLM provider?
The most important technical control for a compliance tool handling regulated documents. ZDR means the LLM provider doesn't persist your prompts or completions beyond the inference call. No 7-day log. No 30-day abuse window.
Most small AI startups do not have ZDR. Approval takes a sales conversation with OpenAI or Anthropic and usually a volume commitment. If your vendor can't produce the agreement or equivalent proof, assume default retention.
Follow-up: does ZDR apply to every endpoint they use? OpenAI's ZDR is endpoint-specific. A vendor might have ZDR for chat completions but not for the Assistants API or file search.
4. Show me the SOC 2 report. Type II, not Type I. And let me read the exceptions.
Here's where most buyers get fooled. They see a "SOC 2 compliant" badge on a vendor's footer and stop asking.
- SOC 2 Type I is a point-in-time snapshot. "On this day, our controls were designed correctly." Can be earned in weeks (A-LIGN).
- SOC 2 Type II tests whether those controls actually operated correctly over 3–12 months. This is the one that matters.
- SOC 2 is a report, not a certification. Every report has exceptions. Vendors don't put those on the marketing page. You have to read the PDF (OneTrust myths).
Ask for the full Type II report under NDA. If they'll only send a summary letter or a Type I, you don't have the evidence you think you have. If they refuse to share it at all, walk.
One more thing: your vendor's SOC 2 doesn't cover their subprocessors. Each party in the chain needs its own (Cynomi, 2024).
5. Who can see my data internally at your company?
"Employees with a business need" is the default answer. Push back. Ask:
- Do engineers have production database access?
- Are prompts or uploaded documents reviewed by humans for model improvement?
- What's the logging policy for customer support interactions?
- What are the break-glass procedures, and who approves them?
The answer you want: no human at the vendor reads your documents unless you open a support ticket and explicitly share them. Ask them to point to the specific access control in their SOC 2 Type II or in a DPA clause.
6. What happens if I churn? And if you get acquired?
Contracts should spell out:
- Deletion on termination. "We'll delete within 30 days" in writing, plus a certificate of deletion on request.
- Data portability. Can you export documents, embeddings, chat history, and analysis results?
- Assignment. Does your DPA survive acquisition? "Binds successors" needs to be a clause, not a handshake.
- Bankruptcy. If the vendor folds, who owns the data? The estate sells assets, and "customer database" is an asset. Your DPA should prohibit transfer.
7. Do you sign a BAA if I'm pharma? Do you have a Part 11 and GxP validation story?
HIPAA BAAs aren't just for HIPAA. They're a forcing function that makes the vendor commit, in writing, to data handling practices your compliance team can audit. Ask for one even if your use case isn't strictly PHI. Most AI compliance vendors can't sign one because their LLM subprocessor won't. That tells you something.
On Part 11: FDA has not yet issued AI-specific Part 11 guidance. The January 2025 draft guidance on AI in regulatory decision-making uses a 7-step credibility framework (Intuition Labs, 2025). Your vendor should be able to tell you whether their system produces electronic records you'd rely on for a submission, and what their validation story looks like for GLP and GxP contexts. If you get a blank look, they're not selling to regulated industries yet. You'd be their trial run.
Consumer Chatbot vs Enterprise API vs On-Prem: A Side-by-Side
Not all AI is the same risk class. If you're not sure which bucket your vendor fits into, here's the translation:
| | Consumer ChatGPT / Gemini / Claude.ai | Enterprise API (direct) | AI vendor built on enterprise API | Self-hosted / on-prem LLM |
|---|---|---|---|---|
| Default training on inputs | Yes (unless opted out) | No (since March 2023 for OpenAI) | No (inherits API terms) | No |
| Default retention | Until you delete | 7–30 days for abuse monitoring | Vendor's own DB + LLM provider logs | Your infrastructure |
| Zero-retention available | No | Yes by approval | Only if vendor negotiated it | N/A |
| BAA available | No (except ChatGPT Enterprise/Edu) | Yes | Only if vendor signs | You sign with yourself |
| Legal preservation risk | High (NYT order covered 400M users) | Medium (ZDR excluded from NYT order) | Inherits from LLM provider | Low |
| Good for regulated IND/CFTC data | Absolutely not | Sometimes | Sometimes, with homework | Usually yes, at cost |
The Samsung engineers were in column 1. The NYT preservation order hit column 1 and most of column 2. An AI compliance vendor is in column 3, which can be fine if they've done the work, or a liability if they haven't. Column 4 is what Veeva does for its largest pharma customers. It's not cheap.
How Regfo Handles It (Honest Version)
Here's where I tell you what we actually do. No marketing flex. I'll call out the things we don't have.
LLM providers. We use Google Gemini (primarily Gemini 2.5 Pro and Flash) and OpenAI (GPT-4o-mini and GPT-5 series) via their official APIs. Both are covered by the training restrictions above. Our config/models.json has the model chain; every model in it is a commercial API. We do not run our own foundation model. We also don't use the consumer ChatGPT, Claude.ai, or Gemini web apps for any customer data.
Embeddings. We embed document chunks using Google's gemini-embedding-001 (3072 dimensions) and store them in our own Postgres database with pgvector. The embeddings don't leave our infrastructure after creation. The text the embeddings came from does go to Google's API once, during the embedding call.
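For the technically curious, that step looks roughly like this. This is a simplified sketch, not our production code; psycopg2, the chunks table, and the column names are stand-ins:

```python
# Sketch of the embed-and-store step: the chunk text goes to Google's API once,
# and the resulting vector lives in our Postgres (pgvector) from then on.
import psycopg2
from google import genai

client = genai.Client()  # reads the API key from the environment

def embed_and_store(conn, workspace_id: str, chunk_text: str) -> None:
    resp = client.models.embed_content(
        model="gemini-embedding-001",  # 3072-dimensional embeddings
        contents=chunk_text,
    )
    vector = resp.embeddings[0].values
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO chunks (workspace_id, content, embedding) "
            "VALUES (%s, %s, %s::vector)",
            (workspace_id, chunk_text, str(vector)),
        )
    conn.commit()
```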
Workspace isolation. This is the part I wrote myself and can vouch for. Every chunk in our vector store carries a workspace_id. Every query is filtered strictly by that ID before the vector search runs. Not as a post-filter. As a SQL WHERE workspace_id = :workspace_id clause in the query itself. You can read that code; the rag_service.search_by_vector function does exactly this. There is no scenario in our codebase where Customer A's query can surface Customer B's chunks. I'm willing to walk through the code on a call.
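Here's the shape of that query, simplified. The production version is rag_service.search_by_vector; the table name here is a stand-in:

```python
# Sketch of the tenant-isolated vector search: the workspace filter is part of
# the SQL itself, so rows from other workspaces are never even candidates.
def search_by_vector(conn, workspace_id: str, query_vector: list[float], k: int = 10):
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, content, embedding <=> %s::vector AS distance "
            "FROM chunks "
            "WHERE workspace_id = %s "  # isolation happens in-query, not post-hoc
            "ORDER BY distance "
            "LIMIT %s",
            (str(query_vector), workspace_id, k),
        )
        return cur.fetchall()
```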
Auth. Workspace ownership is enforced at the API layer: every workspace endpoint checks Workspace.user_id == current_user.id before returning data. A user can't fetch another user's workspace by guessing the UUID.
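The pattern, sketched in FastAPI terms (the framework, the in-memory store, and the stub auth dependency below are assumptions for illustration, not our literal handler):

```python
from dataclasses import dataclass
from fastapi import Depends, FastAPI, HTTPException

app = FastAPI()

@dataclass
class User:
    id: str

def get_current_user() -> User:
    # Stand-in for the real token-based auth dependency.
    raise HTTPException(status_code=401)

WORKSPACES: dict[str, str] = {}  # workspace_id -> owner user_id (stand-in for the ORM)

@app.get("/workspaces/{workspace_id}")
def get_workspace(workspace_id: str, current_user: User = Depends(get_current_user)):
    owner_id = WORKSPACES.get(workspace_id)
    if owner_id is None or owner_id != current_user.id:
        # 404 in both cases, so a guessed UUID confirms nothing about what exists.
        raise HTTPException(status_code=404)
    return {"workspace_id": workspace_id}
```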
Storage. Uploaded documents currently live on encrypted disk on our application server (the LocalStorage class in services/storage.py). We don't use S3 yet. This is one of the limitations of being a startup. We haven't migrated to object storage with per-tenant keys. If you're Series C with a CISO who mandates per-tenant KMS, we're not ready for you yet.
SOC 2. We do not have SOC 2 Type II. We don't have Type I either. We're an early-stage startup and a full Type II audit takes 6–12 months of operating history to even start. If your procurement requires SOC 2 Type II as table stakes, we can't help you today. Ask us again in 2027.
HIPAA. We don't currently sign BAAs. Our LLM subprocessors (Google, OpenAI) offer BAAs to their direct customers, but we haven't executed the downstream flow. If you're putting PHI into a compliance tool, we're not the right vendor today.
Zero-retention with our LLM providers. Today we operate on default retention with Google Gemini (24-hour in-memory cache, no at-rest storage for published models) and OpenAI's default 30-day abuse-monitoring log. We have not negotiated custom ZDR agreements with either yet. For customers with strict ZDR requirements, this is a real limitation. Tell us during the sales call and we'll be honest about whether we can meet your bar.
What we won't do. We won't fine-tune on your documents. We won't share them with any other customer. We won't use your content for marketing without written consent. We won't sell your data, ever. These go in the DPA, not just on the landing page.
Want the full architecture walkthrough? Email us. I'll get on a call and show you the code paths.
When to Walk Away from an AI Vendor
Some red flags are dealbreakers. If you see any of these in a vendor call, stop the procurement.
- "We use our own proprietary AI model." Unless they've raised $100M+ for inference infrastructure, they're wrapping an API and don't want to say which one. Find out which.
- They won't tell you their LLM subprocessor. This is a no.
- They share a SOC 2 summary letter instead of the full report. The summary is marketing. The report has the exceptions.
- Their DPA doesn't survive acquisition. One Y Combinator acquihire and your data is a line item in someone else's data room.
- They can't explain their retention timeline in days. "We retain data only as long as necessary" is not an answer. Ask for a number.
- Free tier with no DPA option. Fine for testing with synthetic data. Not fine for your real preclinical package or CFTC Part 38 draft.
- They say "HIPAA compliant" without naming a BAA. HIPAA compliance is an operational posture, not a certification. If no BAA, there's no enforceable HIPAA commitment.
- They promise "zero data retention" without specifying what endpoint. ZDR is endpoint-specific at the LLM provider level. "Zero retention" as a blanket marketing claim is either imprecise or untrue.
If you walk through the 7 questions above and the vendor passes all of them, you've done more due diligence than 95% of AI buyers in 2026. If they fail three or more, move on.
The Real Takeaway
The vendors who deserve your IND or your DCM application can answer every one of the seven questions above with specifics, URLs, and a copy of their DPA. The ones who can't are asking you to trust them. Trust without verification in regulated industries is how companies end up in enforcement actions they didn't see coming.
I'd rather lose a sale because you asked us hard questions than win one because you didn't. Run the checklist against us; I'll go line by line. Run it against someone else first? Good. That's what this post is for.
Related reading:
- Best FDA compliance software for biotech in 2026
- CFTC compliance tools for DCM designation
- GLP compliance checklist for preclinical studies
- CTD Module 2.4 nonclinical overview
- Safety pharmacology under ICH S7A and genotoxicity battery under ICH S2
Sources cited in this post: OpenAI business data page, OpenAI data controls docs, OpenAI BAA help article, Anthropic API retention docs, Anthropic BAA privacy center, Anthropic ZDR privacy center, Vertex AI zero data retention, Samsung ChatGPT leak, CIO Dive (April 2023), NYT v. OpenAI preservation order analysis, Wald.ai (2025), SOC 2 Type I vs Type II, A-LIGN, SOC 2 myths, OneTrust, SOC 2 MSP misconceptions, Cynomi, FDA AI guidance January 2025 via Intuition Labs. Last updated April 2026.