So, you want to do GI clinical research but aren’t sure how to get started? Although randomized clinical trials provide the highest level of evidence to inform clinical practice, guidelines, and policy, they may not be possible for a fellow or junior investigator to initiate and, even in the best-case scenario, may not be published for many years. Database-oriented research can overcome many of the daunting impediments that junior researchers face. Furthermore, it can jump-start a clinical research career as a vehicle to develop knowledge and experience in epidemiologic analyses and publication. It offers a way to test clinically meaningful hypotheses and, most importantly, to provide salient data that could ultimately impact clinical decision making.

What are the next steps to engaging in database research? Finding a mentor with experience in clinical and database research is critical. If necessary, mentorship can be split between a gastrointestinal (GI) clinical researcher who can help identify clinical questions and another mentor with experience in epidemiologic analyses (eg, a health services researcher, epidemiologist, or biostatistician). The next step is to identify the clinical area within GI that is of most interest to you and possibly your future clinical focus (eg, cancer screening and prevention, motility, hepatology, inflammatory bowel disease [IBD]). Although ideally you would have a hypothesis to be tested, in database-oriented research it can be advantageous to explore the specific database to see what hypotheses can be investigated, using an iterative approach to home in on a meaningful analysis. Because of this low barrier to entry, however, a thorough literature review is important: many of the analyses that you might consider may already have been done. Thus, a willingness to be flexible is critical to success.

In this piece, we provide a synoptic review of numerous databases suited to GI-focused analyses (Table 1).
The goal is not to be comprehensive but to be focused and to provide illustrative examples, with the ultimate aim of pointing an eager GI fellow or junior investigator in the right direction.

Table 1. Review of Databases

| Database name | Access | Cost | Type of database | Data included | Why you would consider using this |
| --- | --- | --- | --- | --- | --- |
| NHANES (https://www.cdc.gov/nchs/nhanes/index.htm) | Publicly available | Free | National survey | Interview-obtained medical information, physical exam, and laboratory data | Investigate diseases and associations with medications, physical exam, and laboratory data |
| National Health Interview Survey (https://www.cdc.gov/nchs/nhis/index.htm) | Publicly available | Free | National survey | Interview-obtained medical information | Investigate association between diseases and health care access and barriers |
| SEER / SEER Research Plus (https://seer.cancer.gov/) | Publicly available; for certain restricted data an application is required | Free | Registry | Patient demographics and cancer-specific data (such as primary tumor site, morphology, and stage) | Assess cancer incidence and mortality, survival, and limited-duration prevalence |
| GI Quality Improvement Consortium (GIQuIC) (https://giquic.gi.org/) | Application process | Fee | Registry | Endoscopy center and provider characteristics, patient demographics, endoscopy reports, and quality metrics | Describe endoscopy and colonoscopy measures |
| Veterans Affairs (https://www.va.gov/vetdata/) | Affiliation required | Free | Electronic health records | Entire medical charts | Evaluate longitudinal care of a large cohort of patients |
| MarketScan (https://www.ibm.com/products/marketscan-research-databases) | Publicly available | Fee | Claims data | Insurance claims, patient demographics. Supplements include laboratory data, disability data, weather data, and others | Examine treatment patterns, patient adherence, and natural history of disease across both inpatient and outpatient care |
| SEER-Medicare (https://healthcaredelivery.cancer.gov/seermedicare/) | Application process | Fee | Combined registry and claims data | SEER registry data along with inpatient, outpatient, and medication claims data | Perform epidemiological and health services research in patients with cancer |
| Nationwide Inpatient Sample (https://www.hcup-us.ahrq.gov/nisoverview.jsp) | Application process | Fee (discounts available for students) | Combined registry and claims data | Claims data, patient and provider characteristics | Evaluate inpatient care |
| All of Us (https://allofus.nih.gov/) | Two access types: anonymized aggregate data publicly available and more comprehensive individual data for registered researchers | Free | Prospectively enrolling registry, including survey data, electronic health records, wearable device data, and biosamples; data are both prospective and retrospective | Patient health information from survey and electronic health records, biosamples | Evaluate the health of a large, diverse cohort of patients, including access to biosamples |

There are many survey databases obtained through participant interview that are readily available to researchers. Two examples are the National Health Interview Survey (NHIS) and the National Health and Nutrition Examination Survey (NHANES), which are conducted by the National Center for Health Statistics (NCHS) [1. Centers for Disease Control and Prevention, National Center for Health Statistics. National Health Interview Survey. https://www.cdc.gov/nchs/nhis/about_nhis.htm (accessed March 13, 2022)] [2. Centers for Disease Control and Prevention, National Center for Health Statistics. About the National Health and Nutrition Examination Survey. http://www.cdc.gov/nchs/nhanes/about_nhanes.htm (accessed March 13, 2022)]. Both are large national surveys, with NHANES including 5000 individual people per year and NHIS including 30,000 households per year. While both surveys collect sociodemographic, health, and disease information, NHIS is solely interview based and is focused on health care utilization and access. A recent study used NHIS to evaluate food insecurity, social support, and financial toxicity in patients with IBD [3. Nguyen NH, Khera R, Ohno-Machado L, Sandborn WJ, et al. Prevalence and effects of food insecurity and social support on financial toxicity in and healthcare use by patients with inflammatory bowel diseases. Clin Gastroenterol Hepatol. 2021; 19: 1377-1386.e1375]. In addition, every 5 years, NHIS includes cancer-related questions [4. National Cancer Institute, Division of Cancer Control & Population Sciences. National Health Interview Survey (NHIS) Cancer Control Supplement (CCS). https://healthcaredelivery.cancer.gov/nhis/ (accessed March 13, 2022)], which can be used to study cancer screening and risk factors, as in a recent study assessing adherence to colorectal cancer screening guidelines in African Americans [5. Millien VO, Levine P, Suarez MG. Colorectal cancer screening in African Americans: are we following the guidelines? Cancer Causes Control. 2021; 32: 943-951]. NHANES, on the other hand, also includes data from physical examination and laboratory testing. Kim et al [6. Kim D, Vazquez-Montesino LM, Escober JA, et al. Low thyroid function in nonalcoholic fatty liver disease is an independent predictor of all-cause and cardiovascular mortality. Am J Gastroenterol. 2020; 115: 1496-1504] used NHANES to show that reduced thyroid function predicted mortality in patients with nonalcoholic fatty liver disease.

Despite the advantages, these surveys may be difficult to navigate without a knowledge of statistics. For the sample population’s answers to survey questions to be representative of a larger population, the samples need to be weighted, or corrected, by demographic characteristics, which improves the accuracy of survey estimates.

There are also databases of national registries. A registry collects detailed information about a set of patients, such as their age, race and ethnicity, sex, diagnosis, and treatments. One example is the Surveillance, Epidemiology, and End Results (SEER) program, which coalesces data from cancer registries and is funded by the National Cancer Institute (NCI) [7. National Cancer Institute, National Institutes of Health. Overview of the SEER Program. NCI’s Division of Cancer Control and Population Sciences. http://seer.cancer.gov/about/overview.html (accessed March 13, 2022)]. SEER includes patient demographics as well as cancer incidence, mortality, tumor stage, and morphology. For example, Hur et al [8. Hur C, Miller M, Kong CY, et al. Trends in esophageal adenocarcinoma incidence and mortality. Cancer. 2013; 119: 1149-1158] found an increase in esophageal adenocarcinoma incidence and mortality rates from 1975 to 2009. To perform analyses, researchers must use the free software SEER*Stat. While this may be a barrier, there are free tutorials available on the SEER website and a robust and responsive helpline. There are also software applications that use SEER*Stat output and expand the type of statistical tests that can be done, such as Joinpoint for trend analysis.
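As a rough illustration of the kind of trend analysis applied to registry output, the sketch below estimates an annual percent change (APC) by fitting a straight line to the logarithm of yearly rates. This is only the single-segment case: Joinpoint itself fits several connected segments and chooses where the trend changes, which this sketch does not attempt. The function name `annual_percent_change` and the rates are hypothetical and are not SEER data.

```python
import math


def annual_percent_change(years, rates):
    """Return the APC from a least-squares fit of ln(rate) = a + b * year."""
    if len(years) != len(rates) or len(years) < 2:
        raise ValueError("need at least two (year, rate) pairs of equal length")
    if any(r <= 0 for r in rates):
        raise ValueError("rates must be positive to take logarithms")
    n = len(years)
    logs = [math.log(r) for r in rates]
    mean_year = sum(years) / n
    mean_log = sum(logs) / n
    # Ordinary least-squares slope of ln(rate) on calendar year.
    sxx = sum((y - mean_year) ** 2 for y in years)
    sxy = sum((y - mean_year) * (v - mean_log) for y, v in zip(years, logs))
    slope = sxy / sxx
    # A slope b on the log scale corresponds to a yearly change of e^b - 1.
    return (math.exp(slope) - 1.0) * 100.0


# Hypothetical incidence rates per 100,000, constructed to rise 5% per year.
years = list(range(2000, 2010))
rates = [2.0 * 1.05 ** (year - 2000) for year in years]
apc = annual_percent_change(years, rates)
print(f"APC: {apc:.2f}% per year")
```

Because the example rates were built with a constant 5% yearly increase, the fitted APC should come out at about 5%. Real registry rates would also need age adjustment and confidence intervals, which dedicated software handles.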
While SEER is publicly available, certain demographic and cancer data are restricted and are available only via SEER Research Plus, which requires an application [7].

Another registry is the GI Quality Improvement Consortium (GIQuIC), which is intended as a repository for quality improvement measures [9. GIQuIC Research Overview. https://giquic.gi.org/research.asp (accessed March 13, 2022)]. This registry includes patient demographics, provider characteristics, American Society of Anesthesiologists status, anticoagulation use, and endoscopic procedure measures beginning in 2010 from provider practices, ambulatory surgical centers, endoscopy suites, and hospitals. A recent study used this database to describe polyps and neoplasia in patients aged 45 to 49 years, which, in light of recently changed colon cancer screening guidelines, is important for informing adenoma detection rates in this age group [10. Bilal M, Holub J, Greenwald D, et al. Adenoma detection rates in 45-49 year old persons undergoing screening colonoscopy: analysis from the GIQuIC Registry. Am J Gastroenterol. 2022; 117: 806-808] [11. Trivedi PD, Mohapatra A, Morris MK, et al. Prevalence and predictors of young-onset colorectal neoplasia: insights from a nationally representative colonoscopy registry. 2022; 162(4): 1136-1146.e5].

Electronic health records can serve as a research tool that has the benefit of providing a more complete representation of patient care. Records can be obtained from local hospital systems or from large health systems such as Kaiser or Geisinger. The benefit of local data is that it may be accessible from your own institution, and the patient population will be familiar to you.
The downsides of electronic health record data, however, are that the results may not be generalizable outside of a particular geographic area, and the data may be hard to access and may not be immediately user-friendly. Additionally, health records are by definition retrospective and only allow for observational data collection.

One notable example is the Veterans Affairs (VA) records, a large database of electronic health records that can be accessed in collaboration with VA staff. The VA record has been in place for several decades, providing a unique longitudinal data set. The records include diagnostic and procedural codes, laboratory data, vital signs, imaging, and pathology. One example is a study published in Hepatology in 2021 that described clinical characteristics and outcomes of veterans with and without cirrhosis who tested positive for severe acute respiratory syndrome coronavirus 2 [12. Ioannou GN, Liang PS, Locke E, et al. Cirrhosis and severe acute respiratory syndrome coronavirus 2 infection in US Veterans: risk of infection, hospitalization, ventilation, and mortality. Hepatology. 2021; 74: 322-335]. Because of the large population, the authors were able to quickly publish a study at the start of the pandemic that included 3306 patients with cirrhosis among 88,747 tested for coronavirus disease. However, because the veterans are often English-speaking and male, findings from these studies may not be generalizable [13. Kumar S, Metz DC, Kaplan DE, et al. Seroprevalence of Helicobacter pylori infection in a national cohort of veterans with noncardia gastric adenocarcinoma. Clin Gastroenterol Hepatol. 2020; 18: 1235-1237.e1231], and information may be missing if care was provided outside of the VA system.
Data from insurance claims are a robust means of conducting research and can provide information about how care is provided “in the real world.” Data often include patient and provider information, diagnostic and procedural codes, and costs of care. These data sets often lack clinically important information such as laboratory data or social and family history. Another limitation is that claims data depend on provider reporting, which may not be entirely accurate, although it tends to be accurate for procedures (eg, endoscopy and colonoscopy) [14. Cooper GS, Virnig B, Klabunde CN, et al. Use of SEER-Medicare data for measuring cancer surgery. Med Care. 2002; 40: IV43-IV48] [15. Warren JL, Harlan LC, Fahey A, et al. Utility of the SEER-Medicare Data to Identify Chemotherapy Use. Med Care. 2002; 40: IV55-IV61].

One example is MarketScan, which is a family of databases that include insurance claims from participating providers for their employees and employee dependents. Specifically, the claims data come from employer-sponsored insurance, employer-sponsored Medicare supplement, and Medicaid in 11 states for inpatient, outpatient, and prescription drug claims, as well as expenditure data. Supplemental data include workplace and disability measures, weather patterns, benefit plan design, and inpatient drug use. Kulaylat et al [16. Kulaylat AS, Kulaylat AN, Schaefer EW, et al. Association of preoperative anti-tumor necrosis factor therapy with adverse postoperative outcomes in patients undergoing abdominal surgery for ulcerative colitis. JAMA Surg. 2017; 152: e171538] used the MarketScan database to evaluate postoperative complications of patients with ulcerative colitis who received preoperative anti-tumor necrosis factor therapy. Some limitations of this particular data set are incomplete follow-up if patients change employers and the inclusion of only employed patients who are of working age.

Another notable database is the SEER-Medicare linkage, which combines cancer data from SEER with Medicare claims [17. National Cancer Institute. SEER-Medicare Linked Database. https://healthcaredelivery.cancer.gov/seermedicare/ (accessed March 13, 2022)]. This linkage allows for cancer research that includes measures of comorbidities, receipt of screening and evaluation tests, and detailed treatment data. For example, Rustgi et al [18. Rustgi SD, Zylberberg HM, Amin S, et al. Use of endoscopic ultrasound for pancreatic cancer from 2000 to 2016. Endosc Int Open. 2021; 09: E1-E11] found a rise in use of endoscopic ultrasound to diagnose patients with pancreatic cancer during 2000 through 2015 as well as a survival benefit. SEER-Medicare also provides a random 5% sample of patients without cancer to serve as controls so that researchers can conduct population-based analyses within the SEER registry areas [17]. Specific limitations include inclusion of only older patients, cost of data purchase, and an extensive application and data use agreement that needs to be approved by NCI. Unless there is a researcher who already has the database, the time from application to data in hand can be many months.
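Claims analyses such as those described above usually begin by flagging patients through their diagnostic and procedural codes. The following is a minimal sketch of that first step, not the method of any cited study. The `Claim` structure, the `patients_with` helper, and the four example claims are hypothetical; the code values are standard ones (ICD-10-CM category K51 for ulcerative colitis, CPT 45378, 45380, and 45385 for colonoscopy), but a real study would use a validated code list and the file layout of the specific database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """Hypothetical, simplified claim record; real claims files differ."""
    patient_id: str
    dx_codes: tuple    # ICD-10-CM diagnosis codes on the claim
    proc_codes: tuple  # CPT procedure codes on the claim


UC_DX_PREFIX = "K51"                              # ulcerative colitis category
COLONOSCOPY_CPT = {"45378", "45380", "45385"}     # illustrative, not exhaustive


def patients_with(claims, dx_prefix, proc_codes):
    """IDs of patients with a matching diagnosis and a matching procedure.

    The diagnosis and the procedure may appear on different claims.
    """
    has_dx, has_proc = set(), set()
    for claim in claims:
        if any(code.replace(".", "").startswith(dx_prefix) for code in claim.dx_codes):
            has_dx.add(claim.patient_id)
        if proc_codes.intersection(claim.proc_codes):
            has_proc.add(claim.patient_id)
    return has_dx & has_proc


# Hypothetical claims for four patients.
claims = [
    Claim("A", ("K51.90",), ("45378",)),   # UC and colonoscopy on one claim
    Claim("B", ("K50.90",), ("45380",)),   # Crohn disease, not UC
    Claim("C", ("K51.00",), ()),           # UC diagnosis only on this claim
    Claim("C", ("Z12.11",), ("45385",)),   # colonoscopy on a later claim
    Claim("D", ("K51.90",), ()),           # UC but no colonoscopy
]
cohort = patients_with(claims, UC_DX_PREFIX, COLONOSCOPY_CPT)
print(sorted(cohort))
```

Matching on codes alone inherits the reporting limitations noted above, so published work typically adds requirements such as multiple claims on separate dates before treating a diagnosis as confirmed.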
Finally, the National Inpatient Sample is a comprehensive database consisting of national and state-specific data on inpatient stays, ambulatory surgery, and readmissions [19. Agency for Healthcare Research and Quality. Overview of the National (Nationwide) Inpatient Sample. https://www.hcup-us.ahrq.gov/nisoverview.jsp (accessed March 13, 2022)]. It is the largest publicly available all-payer inpatient database designed to produce estimates of inpatient use, access, cost, quality, and outcomes, and it represents 7 million inpatient stays per year, or 35 million per year with weighting. It includes patient demographics and hospital characteristics, International Classification of Diseases codes, charges, discharge status, length of stay, and severity and comorbidity measures. This database was used by Joo et al [20. Joo MK, Yoo JW, Mojtahedi Z, et al. Ten-year trends of utilizing palliative care and palliative procedures in patients with gastric Cancer in the United States from 2009 to 2018 - a nationwide database study. BMC Health Serv Res. 2022; 22: 20] to describe hospital costs and use of palliative care consults and procedures for patients with gastric cancer.

The All of Us Registry is a unique database that is a rich source of data for researchers [21. National Institutes of Health. All of Us Research Program. https://allofus.nih.gov/ (accessed March 13, 2022)]. This prospective registry includes survey data at the time of entry and biosamples for future genomic studies and is also linked to health records for a subset of patients. The goal of this registry is to promote health equity, improve wellness and health outcomes, and inform earlier diagnosis of disease in a diverse population of patients. A published study by Renedo et al [22. Renedo D, Acosta JN, Sujijantarat N, et al. Carotid artery disease among broadly defined underrepresented groups: the All of Us Research Program. Stroke. 2022; 53: e88-e89] used novel definitions of underrepresented groups, besides race and ethnicity, to demonstrate disparities in revascularization after stroke. This registry may be subject to bias because all participants volunteer to participate.

Our review of the many types of databases that can be used for GI-focused research will hopefully provide the reader with some ideas and help “jump-start” a clinical research career. Although some of these databases can cost as much as several thousand dollars, others are free. Collaboration with other fellows and junior faculty can offset the cost and lead to more fruitful research endeavors.
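Several of the databases reviewed here depend on weights. NHANES and NHIS responses must be weighted by demographic characteristics, and the National Inpatient Sample figures quoted above (7 million sampled stays standing for 35 million) imply an average weight of about 5 per record. The sketch below shows one simple form of weighting, post-stratification by a single demographic variable, with hypothetical strata and counts. The function names are illustrative. The surveys themselves ship precomputed weights that also reflect their complex sampling designs, so in practice you would use those supplied weights and survey-aware software rather than deriving your own.

```python
def poststratification_weights(sample_counts, population_counts):
    """Weight per respondent in each stratum: population size / sample size."""
    return {
        stratum: population_counts[stratum] / sample_counts[stratum]
        for stratum in sample_counts
    }


def weighted_proportion(records, weights):
    """Weighted share of records with the condition.

    `records` is a list of (stratum, has_condition) pairs.
    """
    total = sum(weights[stratum] for stratum, _ in records)
    positive = sum(weights[stratum] for stratum, has in records if has)
    return positive / total


# Hypothetical survey: older adults are over-sampled relative to the population.
sample_counts = {"18-49": 60, "50+": 40}
population_counts = {"18-49": 700, "50+": 300}  # hypothetical, in thousands

# Hypothetical responses: 6 of 60 younger and 12 of 40 older report the condition.
records = (
    [("18-49", True)] * 6 + [("18-49", False)] * 54
    + [("50+", True)] * 12 + [("50+", False)] * 28
)

weights = poststratification_weights(sample_counts, population_counts)
unweighted = sum(1 for _, has in records if has) / len(records)
estimate = weighted_proportion(records, weights)
print(f"unweighted: {unweighted:.3f}, weighted: {estimate:.3f}")
```

In this constructed example the unweighted prevalence is 18%, while the weighted estimate is 16% (0.7 × 10% + 0.3 × 30%), because the over-sampled older group, which has the higher prevalence, is scaled back to its true share of the population.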