Insights & News

AI Device Study Design: FDA Data Requirements Guide

March 3, 2026
MedTech
SaMD & Digital Health
AI/ML
Compliance & Regulatory: MedTech
Understand FDA's expectations for clinical validation data — including sample size justification, dataset diversity, and subgroup analyses.

Clinical Validation: The New Norm

If you have a Class II device with a suitable predicate, you may not need clinical data to go to market. Instead, nonclinical testing may be enough to demonstrate substantial equivalence between your device and the cleared predicate device. However, this rule of thumb rarely applies to Artificial Intelligence/Machine Learning (AI/ML)-enabled Software as a Medical Device (SaMD) technologies designed for processing and interpreting data. Even if the intended use and technological characteristics (e.g., user interface) are similar between a predicate and subject AI/ML SaMD, the algorithms have different origins: they're built independently and trained on different data. They may serve the same purpose, but they likely do so very differently. Proving that your software is safe through unit and functional testing alone is not sufficient; FDA needs to see that it works on real patients.

Don’t panic – these studies are usually not as burdensome as those required for a De Novo or PMA submission. They can often leverage retrospective data and typically don’t require an IDE submission or significant IRB review, given that patients or patient-identifiable data are usually not directly involved. However, FDA still has specific requirements for clinical validation data, and it’s important to understand what the agency is looking for before you engage in a study.

Sample Size  

Determining an appropriate sample size for your study can be difficult. It requires balancing the need for enough data to demonstrate safety and effectiveness against the money and time a study consumes. You may think it wise to run your study with a sample size identical to that used for the predicate: if 300 subjects were enough for that device, surely they’re enough for this one. Unfortunately, this strategy almost never works with FDA. Instead, it’s best practice to develop a statistical justification for your sample size based on your patient population and primary endpoints. If you go to FDA with a reasoned argument for the number you have in mind, you stand on much more solid ground. It may be necessary to consult an external statistician for this, unless you have an internal resource who’s up to the task.
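As a rough illustration of what such a statistical justification can look like, the sketch below computes the number of positive cases needed so that the confidence interval around an expected sensitivity stays within a target half-width, using the standard normal approximation. All numbers are hypothetical and your endpoints may call for a different method entirely; this is a starting point, not a substitute for a statistician.

```python
import math

def sample_size_for_proportion(expected_prop: float, half_width: float,
                               z: float = 1.96) -> int:
    """Subjects needed so that a 95% CI around an expected proportion
    (e.g., sensitivity) is no wider than +/- half_width.
    Normal approximation: n = z^2 * p * (1 - p) / d^2."""
    n = (z ** 2) * expected_prop * (1 - expected_prop) / half_width ** 2
    return math.ceil(n)

# Hypothetical example: expected sensitivity 0.90, target CI of +/- 0.05
n_positives = sample_size_for_proportion(0.90, 0.05)
print(n_positives)  # 139
```

A number derived this way, tied to your primary endpoint and expected performance, is far easier to defend in a submission than "the predicate used 300."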

Data Gathering and Generalizability

Once you know how much data you’ll need, it’s important to consider its composition. Data are often gathered retrospectively for AI/ML validation, which presents sponsors with an opportunity to “cherry-pick” the data they want or need. That said, FDA will notice if the dataset is biased toward any one site, geographic region, race, gender, age, pathology, or socioeconomic status, and it stresses the importance of unbiased, representative data in a draft guidance document: Artificial Intelligence-Enabled Device Software Functions: Lifecycle Management and Marketing Submission Recommendations. The document even recommends including a description of the model’s development data and performance characteristics in the user labeling. In short, it’s not permissible to simply select data that you think your software will perform well on. The dataset must be generalizable to the product’s patient population.

To that end, it may be helpful to pre-specify inclusion and exclusion criteria in a way that allows for broad gathering of data reflective of the intended patient population. For example, if CT scans are needed for a study, the inclusion criteria should specify the use of all scans available from the indicated population (e.g., non-contrast abdominal scans from adults) that were performed during a particular time period, preventing the exclusion of more difficult cases.
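One way to keep case selection objective is to encode the pre-specified criteria as a filter applied uniformly to every available record, so difficult cases cannot be quietly dropped. A minimal sketch, with hypothetical field names and criteria:

```python
from datetime import date

# Hypothetical record structure for available CT scans
scans = [
    {"id": "A1", "modality": "CT", "contrast": False, "region": "abdomen",
     "patient_age": 54, "acquired": date(2024, 3, 10)},
    {"id": "A2", "modality": "CT", "contrast": True, "region": "abdomen",
     "patient_age": 61, "acquired": date(2024, 5, 2)},
    {"id": "A3", "modality": "CT", "contrast": False, "region": "abdomen",
     "patient_age": 17, "acquired": date(2024, 6, 20)},
]

def meets_criteria(scan: dict) -> bool:
    """Pre-specified inclusion criteria: non-contrast abdominal CT from
    adults, acquired within the study window. Applied to every scan."""
    return (scan["modality"] == "CT"
            and not scan["contrast"]
            and scan["region"] == "abdomen"
            and scan["patient_age"] >= 18
            and date(2024, 1, 1) <= scan["acquired"] <= date(2024, 12, 31))

included = [s["id"] for s in scans if meets_criteria(s)]
print(included)  # ['A1']
```

Because the filter is written down before data gathering begins, the included set is reproducible and auditable, which is exactly what a reviewer looking for cherry-picking wants to see.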

Site Generalizability  

To further ensure generalizability, FDA review teams request that data used in these efforts be gathered from at least three US sites in significantly different geographic regions, in order to capture various racial and ethnic subpopulations, hospital settings (e.g., academic hospital, community hospital), clinical practices, and associated technologies (e.g., electronic health record (EHR) systems, image acquisition methodologies).

Additionally, it is important to ensure that data are gathered from sites distinct from those used for algorithm training, to demonstrate that the device is not overfitted to a particular site. Overfitting occurs when a model learns to perform particularly well on one type of data but cannot achieve equivalent performance on new data. FDA wants proof that the algorithm is not overfitted and, therefore, is generalizable to the entire intended patient population. For FDA review, this proof is expected to take the form of evidence from multiple new (non-training) sites.
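A simple programmatic safeguard is to verify, before any results are analyzed, that no validation site also contributed training data. The sketch below uses hypothetical site identifiers:

```python
def assert_site_disjoint(training_sites, validation_sites):
    """Raise if any validation site also contributed training data,
    since overlap would undermine the claim of site generalizability."""
    overlap = set(training_sites) & set(validation_sites)
    if overlap:
        raise ValueError(
            f"Validation sites also used in training: {sorted(overlap)}")

# Hypothetical site lists
train = {"site_boston", "site_houston"}
validate = {"site_seattle", "site_atlanta", "site_miami"}
assert_site_disjoint(train, validate)  # passes: no overlap
```

Running a check like this as part of dataset assembly makes the train/validation separation an enforced property of the pipeline rather than a claim that must be verified by hand.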

A note on gathering data: depending on the device type, a validation study may include either prospective or retrospective data. If you’re gathering retrospective electronic data (e.g., medical imagery, electronic health records), it may be helpful to review this recently updated FDA guidance document: Use of Real-World Evidence to Support Regulatory Decision-Making for Medical Devices - Guidance for Industry and Food and Drug Administration Staff. The document specifies that FDA will now accept evidence even when the data come from de-identified, aggregated sources (e.g., national disease registries, insurance claims databases, electronic health record networks), expanding the options available to sponsors in validating their devices.

Subgroup Analyses  

It’s not enough to claim that your dataset is diverse and that your algorithm performed well for every input – you have to prove it. FDA typically requests subgroup analyses of results for the clinical validation of AI/ML software: per site, race, gender, age, and any confounding variables relevant to your claims. These analyses may reveal an insufficient sample size for a key subgroup, for example, low representation of a race compared to the expected racial distribution within the indication. This issue may be addressed by adding enriched datasets that provide additional data on the algorithm’s performance for that subgroup. That may mean adding data associated with a particular subgroup or gathered from a geographic area that was underrepresented in your initial dataset. In some cases, it may also be necessary to add datasets that include particular pathologies or disease states that are relevant to your patient population and were not appropriately explored initially. Any such approach should be discussed with FDA prior to implementation, given the risk of bias associated with post-hoc introduction of data.
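Conceptually, a subgroup analysis just recomputes the primary endpoint within each stratum and reports the stratum size alongside it, so thin subgroups are visible. A minimal sketch with hypothetical case-level results, computing per-subgroup sensitivity:

```python
from collections import defaultdict

# Hypothetical case-level results: true condition vs. algorithm output
cases = [
    {"site": "S1", "age_band": "18-44", "truth": 1, "pred": 1},
    {"site": "S1", "age_band": "65+",   "truth": 1, "pred": 0},
    {"site": "S2", "age_band": "18-44", "truth": 1, "pred": 1},
    {"site": "S2", "age_band": "65+",   "truth": 1, "pred": 1},
]

def sensitivity_by(cases, key):
    """Sensitivity (true-positive rate) within each subgroup, paired
    with the subgroup's positive-case count."""
    tally = defaultdict(lambda: [0, 0])  # subgroup -> [true pos, positives]
    for c in cases:
        if c["truth"] == 1:
            tally[c[key]][1] += 1
            tally[c[key]][0] += c["pred"]
    return {group: (tp / n, n) for group, (tp, n) in tally.items()}

print(sensitivity_by(cases, "site"))      # {'S1': (0.5, 2), 'S2': (1.0, 2)}
print(sensitivity_by(cases, "age_band"))  # {'18-44': (1.0, 2), '65+': (0.5, 2)}
```

In practice the same grouping would be run for every pre-specified stratification (site, race, gender, age, relevant confounders), and any subgroup with a small positive-case count flags where enrichment may be needed.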

For information directly from the source, it may be helpful to read this FDA guidance document: Evaluation and Reporting of Age-, Race-, and Ethnicity-Specific Data in Medical Device Clinical Studies.

Conclusion  

Planning a clinical study can be stressful, as it requires considerable energy, time, and money. To reduce the burden, sponsors should focus on providing FDA with the minimum required data without cutting corners. This means having an appropriately sized dataset that is diverse with respect to geography, race, sex, age, and any other variable relevant to the software’s indication. If there is any doubt regarding a particular aspect of the clinical validation effort, sponsors should consider engaging in Pre-Submissions with FDA to confirm details before initiating the study. Otherwise, a sponsor may find themselves repeating a study to address FDA’s concerns.

Proxima has provided support across multiple disciplines in protocol development and the drafting and submission of Pre-Sub packages. Learn more about our experience and get in touch with our regulatory, quality, and clinical experts.