I've watched a regulatory affairs manager spend nine working days converting a clean REDCap export into a Module 2.7 Clinical Summary draft. Nine days. Of a smart, ICH-fluent person. Copy-pasting numbers into Word, reformatting tables, looking up the original adverse event narrative, cross-checking the protocol version, retyping the safety listings.
The data was fine. The CRFs were locked. The DSMB had signed off. The work that took nine days wasn't analysis. It was translation between two formats: one captures data efficiently; the other has to read the way a regulator expects.
This post is about that bridge. What ICH M4 actually requires. Why REDCap (and most academic EDCs) leave you with a translation problem. And what the bridge looks like when you stop doing it by hand.
The format mismatch nobody warned you about
REDCap stores data as a relational schema with one row per observation. The IND needs your data as narrative summaries, structured tables, and individual study reports per the ICH M4 Common Technical Document format. These are different shapes.
To make an IND from a REDCap-based study, you need to produce, at minimum:
- Module 2.5 Clinical Overview: narrative, roughly 30 pages, your story of the program
- Module 2.7 Clinical Summary: comprehensive factual summary with structured tables for biopharmaceutic, pharmacology, efficacy, safety
- Module 2.7.4 Summary of Clinical Safety: the one that drinks the most data
- Module 5.3 Clinical Study Reports: full ICH E3-formatted CSRs
And if your IND draws on preclinical data (most do), you also need Module 2.6 Nonclinical Written and Tabulated Summaries and Module 4 study reports, but those typically come from your CRO, not REDCap.
REDCap can dump you a CSV. It cannot dump you a Module 2.7.4. The gap between those two artifacts is where the nine days lives.
What's actually manual today
A typical handoff looks like this.
The RA pulls a locked REDCap export. CSV plus syntax file, variables coded as ae_001_term, ae_001_grade, ae_001_action, hundreds of columns. They open the syntax file in R or SAS, run it, get a labeled dataset. If a field type changed since the last export, the syntax breaks here. Cue 90 minutes of debugging.
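One way to blunt the syntax-file fragility is to validate the export against the data dictionary yourself before anything downstream runs. A minimal sketch, assuming hypothetical field names and a toy in-memory export (a real REDCap data dictionary is its own CSV download; this is not REDCap's actual schema):

```python
import csv
import io

# Hypothetical data dictionary: field name -> human-readable label.
data_dictionary = {
    "ae_001_term": "Adverse Event 1: Term",
    "ae_001_grade": "Adverse Event 1: CTCAE Grade",
    "ae_001_action": "Adverse Event 1: Action Taken",
}

# Toy stand-in for the locked CSV export.
raw_export = io.StringIO(
    "record_id,ae_001_term,ae_001_grade,ae_001_action\n"
    "001,Nausea,2,Dose reduced\n"
    "002,ALT elevation,3,Drug withdrawn\n"
)

def load_labeled(fh, dictionary):
    """Read a raw export and report fields missing from the dictionary,
    so a changed field type surfaces as a warning list instead of a
    crashed syntax file."""
    rows = list(csv.DictReader(fh))
    unknown = [c for c in rows[0] if c != "record_id" and c not in dictionary]
    return rows, unknown

rows, unknown_fields = load_labeled(raw_export, data_dictionary)
```

The point of the `unknown` list is the 90 minutes of debugging: you find out which fields drifted before you run anything, not after.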
Now the tables. AE incidence by SOC, treatment-emergent AEs, SAEs, deaths, discontinuations. ICH E3 Section 12 lists exactly which tables you need; REDCap doesn't know that, so the RA builds them in R or Excel.
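The incidence tables themselves are mechanical once the data is flat. A sketch of the core computation, with illustrative (not REDCap-native) record structures; note it counts subjects, not events, which is the mistake hand-built Excel versions most often make:

```python
# Hypothetical flattened AE records: subject, MedDRA SOC, CTCAE grade.
adverse_events = [
    {"subject": "001", "soc": "Gastrointestinal disorders", "grade": 2},
    {"subject": "001", "soc": "Hepatobiliary disorders", "grade": 3},
    {"subject": "002", "soc": "Gastrointestinal disorders", "grade": 1},
    {"subject": "003", "soc": "Gastrointestinal disorders", "grade": 2},
]

def incidence_by_soc(events, n_enrolled):
    """Count subjects (not events) with >=1 AE per SOC and express
    incidence as a percentage of the enrolled population."""
    subjects_per_soc = {}
    for ev in events:
        subjects_per_soc.setdefault(ev["soc"], set()).add(ev["subject"])
    return {
        soc: (len(subs), round(100 * len(subs) / n_enrolled, 1))
        for soc, subs in subjects_per_soc.items()
    }

table = incidence_by_soc(adverse_events, n_enrolled=20)
# Gastrointestinal disorders: 3 subjects, 15.0%
```

The same shape extends to treatment-emergent filtering and grade stratification; the hard part is not the arithmetic but knowing which tables E3 Section 12 expects.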
The narratives are heavier. ICH E3 Section 14.3 requires a narrative summary for each death, each SAE, each discontinuation due to AE — pulled from the AE form, the medical history, the concomitant meds, the lab values, the protocol deviations. Five sources, one paragraph. Repeat 40 times if you had 40 SAEs.
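The five-sources-one-paragraph join can at least be pre-assembled per subject, leaving the writer to edit rather than hunt. A sketch with entirely hypothetical source structures and subject data:

```python
# Hypothetical per-subject fragments from the five sources an E3
# Section 14.3 narrative pulls from. Not real study data.
sources = {
    "adverse_events": {"007": "Grade 3 ALT elevation, onset day 14"},
    "medical_history": {"007": "No prior hepatic disease"},
    "con_meds": {"007": "Acetaminophen PRN"},
    "labs": {"007": "ALT 412 U/L (day 14), 88 U/L (day 28)"},
    "deviations": {"007": "None"},
}

def narrative_skeleton(subject, sources):
    """Assemble one subject's fragments from all five sources into a
    fill-in skeleton, flagging anything missing instead of silently
    omitting it."""
    parts = [
        f"{name.replace('_', ' ').title()}: {data.get(subject, '[NOT FOUND]')}"
        for name, data in sources.items()
    ]
    return f"Subject {subject}. " + " ".join(parts)

draft = narrative_skeleton("007", sources)
```

Forty SAEs means forty calls to this, not forty rounds of opening five exports side by side.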
The worst part is the protocol cross-reference. Was the inclusion/exclusion described in 2.7.3 the version actually used? Was there an amendment? Which subjects were enrolled under v1 vs v2? REDCap will tell you the date a subject was enrolled. It will not tell you which protocol version was in effect on that date — unless someone on your team logged it as a separate field. Most teams didn't.
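If the version wasn't logged per subject, it can be reconstructed from the amendment history, which lives in the regulatory binder rather than the EDC. A sketch of the lookup, with a hypothetical amendment timeline:

```python
from datetime import date

# Hypothetical amendment history: (effective_date, version), from the
# regulatory binder. REDCap does not store this.
amendments = [
    (date(2023, 1, 10), "v1.0"),
    (date(2023, 6, 2), "v2.0"),
    (date(2024, 1, 15), "v3.0"),
]

def version_in_effect(enrollment_date, history):
    """Return the protocol version whose effective date is the latest
    one on or before the subject's enrollment date."""
    in_effect = None
    for effective, version in sorted(history):
        if effective <= enrollment_date:
            in_effect = version
    return in_effect

version_in_effect(date(2023, 8, 1), amendments)  # "v2.0"
```

This only works if the amendment effective dates are themselves reliable, which is exactly why logging the version at enrollment (see below) is the cheaper fix.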
Medical writing takes whatever the RA hands them and turns it into prose. Two more weeks.
End to end: 4 to 12 weeks for a single Phase 1 study. For a Series A biotech burning $300-500K/month, that's $300K to $1.5M of runway burned on reformatting.
Where the cost actually compounds
The copy-paste isn't even the worst part. The worst part is what happens after, when FDA comes back with a question.
A reviewer asks: "In Table 2.7.4-3, the incidence of Grade 3 ALT elevation is 8.5%. In CSR Section 12.4, the same dataset shows 9.1%. Please reconcile."
Now someone has to find the source data, figure out which export each summary was based on, identify the data lock that happened in between, and re-do the table. If your Module 2 was assembled by hand from one CSV and your CSR was assembled by hand from a different CSV, this question eats a week. Sometimes two.
This is also the moment you discover that nobody documented which export fed which document. The lineage is in someone's head. Or worse, in someone's email thread from six months ago.
What an orchestration layer actually does
Three concrete pieces, in order of how often they save your week.
The first is a persistent mapping from source fields to CTD locations. The system knows that REDCap field ae_001_grade maps to Module 2.7.4 Section 2.1.3 ("Adverse Events by Severity"). When the export changes, the mapping doesn't. When you re-pull data after a database lock, the same numbers land in the same places.
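Conceptually the mapping is just a persistent table from field name to CTD destination. A minimal sketch; the section numbers below are illustrative placements, not a checklist lifted from M4:

```python
# Hypothetical mapping: REDCap field -> (module, section, section title).
CTD_MAP = {
    "ae_001_grade": ("2.7.4", "2.1.3", "Adverse Events by Severity"),
    "ae_001_term": ("2.7.4", "2.1.2", "Adverse Events by SOC/PT"),
    "lab_alt": ("2.7.4", "2.1.4", "Laboratory Findings"),
}

def route_export(columns, mapping):
    """Split an export's columns into (routed, unmapped), so a re-pull
    after database lock lands the same fields in the same sections and
    new fields surface explicitly instead of vanishing."""
    routed = {col: mapping[col] for col in columns if col in mapping}
    unmapped = [col for col in columns if col not in mapping]
    return routed, unmapped

routed, unmapped = route_export(
    ["ae_001_grade", "ae_001_term", "vs_weight"], CTD_MAP
)
```

The `unmapped` list is the useful part: it is the diff between what your EDC captured and what your dossier consumes.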
The second is a rules engine that knows what each section needs. For Module 2.7.4 specifically, ICH M4E(R2) lists 14 required subsections. We've encoded them with the source data each one requires: incidence tables, time-to-onset, exposure, demographic stratification. Run your data through it and we tell you which subsections are complete, which are partial, and which can't be populated from what you've captured.
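The completeness check reduces to set arithmetic over declared requirements. A sketch with a hypothetical (and deliberately tiny) requirements table; the authoritative subsection list is ICH M4E(R2), not this dict:

```python
# Hypothetical requirements: subsection -> datasets it needs.
REQUIREMENTS = {
    "2.1 Exposure": {"exposure", "demographics"},
    "2.2 Demographics": {"demographics"},
    "2.1.3 AEs by Severity": {"adverse_events", "exposure"},
}

def completeness_report(available, requirements):
    """Classify each subsection as complete / partial / missing given
    the datasets actually captured in the EDC."""
    report = {}
    for section, needed in requirements.items():
        have = needed & available
        if have == needed:
            report[section] = "complete"
        elif have:
            report[section] = "partial"
        else:
            report[section] = "missing"
    return report

report = completeness_report({"demographics", "adverse_events"}, REQUIREMENTS)
```

Run this early enough and "we never captured exposure duration" is a study-conduct fix, not a submission-week crisis.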
The third is citation-grade traceability. Every number in every summary points back to the source row in the export. When FDA asks "where does 8.5% come from," you click the cell and you see the subjects, the visits, the values. No reconciliation week. No email archaeology.
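The mechanism is simply refusing to compute a number without carrying its source rows along. A sketch, with illustrative row IDs and field names:

```python
# Hypothetical source rows, each with a stable row ID from the export.
rows = [
    {"row_id": 101, "subject": "001", "alt_grade": 3},
    {"row_id": 102, "subject": "002", "alt_grade": 1},
    {"row_id": 103, "subject": "003", "alt_grade": 3},
]

def traced_incidence(rows, predicate, n_enrolled):
    """Return (percentage, contributing row IDs) so the number in the
    summary can be traced back to its source rows on inspection."""
    hits = [r for r in rows if predicate(r)]
    pct = round(100 * len(hits) / n_enrolled, 1)
    return pct, [r["row_id"] for r in hits]

pct, provenance = traced_incidence(
    rows, lambda r: r["alt_grade"] >= 3, n_enrolled=20
)
# pct == 10.0, provenance == [101, 103]
```

When the reviewer asks where 10.0% came from, the answer is the provenance list, not a week of export archaeology.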
We don't do data capture. We don't replace REDCap or Medidata or Veeva. We sit on top of the data you already have and turn it into a dossier FDA can read. In the design partner work we've run so far, a 4-12 week translation job collapses to 5-10 days of medical writing on a draft most of the way there.
What this looks like in practice
If you're 6-12 months out from an IND filing and the study runs on REDCap, here are the moves I'd make today, in order of pain they'll save you.
Lock down protocol version per subject. Add a protocol_version_at_enrollment field. You will need it for Module 2.7.3 Summary of Clinical Efficacy. Adding it now costs nothing. Adding it during submission prep costs three days of chart review and a sleepless project manager.
Build the AE coding dictionary up-front. MedDRA SOC + PT codes embedded in REDCap, not assigned later in Excel. Saves the medical coder a week and removes one entire category of late-stage error.
Pick a single source of truth for each table. Either the CSR is the master and Module 2.7 derives from it, or the orchestration layer maintains both from the same source export. Pick one and write it down. The "two parallel manuscripts" approach is exactly what produces the Section 12.4 vs 2.7.4 reconciliation problem described above.
Set up the CTD structure before you have anything to put in it. Most teams wait until they have data to start thinking about Module 2 layout. This is backwards. The structure is fixed by ICH M4: folders, section headers, and placeholder tables can all be set up weeks before lock. When data arrives, it lands in shape, not in a 4-week reformat.
If you're at this stage and want to see what the bridge looks like on your data, start a workspace — paste a protocol draft and Regfo runs the gap check in 20 seconds. We're also taking a small number of design partners on the IND-prep side; the workspace is the fastest way to see whether we'd be useful before any call.
Related reading:
- You've Outgrown REDCap — 8 Signs It's Time to Move On — symptoms checklist
- CTD Module 2.4 Nonclinical Overview Guide — the section RA spends the most time on
- Common IND Deficiencies and How to Avoid Them — what FDA flags most
- Clinical Holds: Why FDA Stops Trials — what's at stake if the gaps slip through