To weight or not to weight? The effect of selection bias in 3 large electronic health record-linked biobanks and recommendations for practice

Link to the publication
Salvatore M, Kundu R, Shi X, Friese CR, Lee S, Fritsche LG, Mondul AM, Hanauer D, Pearce CL, Mukherjee B. To weight or not to weight? The effect of selection bias in 3 large electronic health record-linked biobanks and recommendations for practice. J Am Med Inform Assoc. 2024 Jun 20;31(7):1479-1492. doi: 10.1093/jamia/ocae098. PMID: 38742457; PMCID: PMC11187425.

Abstract

Objectives: To develop recommendations regarding the use of weights to reduce selection bias for commonly performed analyses using electronic health record (EHR)-linked biobank data. Materials and methods: We mapped diagnosis (ICD code) data to standardized phecodes from 3 EHR-linked biobanks with varying recruitment strategies: All of Us (AOU; n = 244 071), Michigan Genomics Initiative (MGI; n = 81 243), and UK Biobank (UKB; n = 401 167). Using 2019 National Health Interview Survey data, we constructed selection weights for AOU and MGI to make them more representative of the US adult population. We used weights previously developed for UKB to represent the UKB-eligible population. We conducted 4 common analyses comparing unweighted and weighted results. Results: For AOU and MGI, estimated phecode prevalences decreased after weighting (weighted-unweighted median phecode prevalence ratio [MPR]: 0.82 and 0.61), while UKB estimates increased (MPR: 1.06). Weighting minimally impacted latent phenome dimensionality estimation. Comparing weighted versus unweighted phenome-wide association studies for colorectal cancer, the strongest associations remained unaltered, with considerable overlap in significant hits. Weighting shifted the estimated log-odds ratio for sex and colorectal cancer to align more closely with national registry-based estimates. Discussion: Weighting had a limited impact on dimensionality estimation and large-scale hypothesis testing but impacted prevalence and association estimation. When the goal is to estimate effect sizes, specific signals from untargeted association analyses should be followed up by weighted analysis. Conclusion: EHR-linked biobanks should report recruitment and selection mechanisms and provide selection weights with defined target populations. Researchers should consider their intended estimands, specify source and target populations, and weight EHR-linked biobank analyses accordingly.

Keywords: ICD codes; biobank; electronic health records; phenome; selection bias.
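The weighted-to-unweighted median phecode prevalence ratio (MPR) reported in the abstract above can be illustrated with a minimal sketch. The function names, the toy phecodes, and the toy weights below are all hypothetical and are not taken from the paper; the sketch only assumes that each participant carries one selection weight and a 0/1 indicator per phecode.

```python
from statistics import median


def prevalence(indicators, weights=None):
    """Proportion of participants with the phecode, optionally weighted."""
    if weights is None:
        weights = [1.0] * len(indicators)  # unweighted case
    return sum(w * y for w, y in zip(weights, indicators)) / sum(weights)


def median_prevalence_ratio(phenome, weights):
    """Median over phecodes of (weighted prevalence / unweighted prevalence)."""
    ratios = []
    for indicators in phenome.values():
        unweighted = prevalence(indicators)
        if unweighted > 0:  # skip phecodes nobody has, to avoid dividing by zero
            ratios.append(prevalence(indicators, weights) / unweighted)
    return median(ratios)


# Toy data (hypothetical): 4 participants, 2 phecodes.
phenome = {
    "phecode_A": [1, 1, 0, 0],
    "phecode_B": [1, 0, 0, 0],
}
# Participants with diagnoses are over-represented, so they get smaller weights.
weights = [0.5, 0.5, 1.5, 1.5]

mpr = median_prevalence_ratio(phenome, weights)
print(f"MPR = {mpr:.2f}")
```

An MPR below 1, as in this toy case, means that weighting lowers the typical prevalence estimate, which is the direction reported for AOU and MGI.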

Harmonized US National Health and Nutrition Examination Survey 1988-2018 for high throughput exposome-health discovery

Link to the publication
Nguyen VK, Middleton LYM, Huang L, Zhao N, Verly E Jr, Kvasnicka J, Sagers L, Patel CJ, Colacino J, Jolliet O. Harmonized US National Health and Nutrition Examination Survey 1988-2018 for high throughput exposome-health discovery. medRxiv [Preprint]. 2023 Feb 8:2023.02.06.23284573. doi: 10.1101/2023.02.06.23284573. PMID: 36798185; PMCID: PMC9934713.

Abstract

The National Health and Nutrition Examination Survey (NHANES) provides data on the health and environmental exposure of the non-institutionalized US population. Such data have considerable potential to improve understanding of how the environment and behaviors impact human health. These data are also currently leveraged to answer public health questions such as the prevalence of disease. However, these data must first be processed before new insights can be derived through large-scale analyses. NHANES data are stored across hundreds of files with multiple inconsistencies. Correcting such inconsistencies takes systematic cross-examination and considerable effort but is required for accurately and reproducibly characterizing the associations between the exposome and diseases. Thus, we developed a set of curated and unified datasets and accompanying code by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous (1999-2018), totaling 134,310 participants and 4,740 variables. The variables convey 1) demographic information, 2) dietary consumption, 3) physical examination results, 4) occupation, 5) questionnaire items (e.g., physical activity, general health status, medical conditions), 6) medications, 7) mortality status linked from the National Death Index, 8) survey weights, 9) environmental exposure biomarker measurements, and 10) chemical comments that indicate which measurements are below or above the lower limit of detection. We also provide a data dictionary listing the variables and their descriptions to help researchers browse the data, along with R markdown files that show example code for calculating summary statistics and running regression models to help accelerate high-throughput analysis and the study of secular trends of the exposome.

A nested case-control study of untargeted plasma metabolomics and lung cancer among never-smoking women within the prospective Shanghai Women’s Health Study

Link to the publication
Rahman ML, Shu XO, Jones DP, Hu W, Ji BT, Blechter B, Wong JYY, Cai Q, Yang G, Gao YT, Zheng W, Rothman N, Walker D, Lan Q. A nested case-control study of untargeted plasma metabolomics and lung cancer among never-smoking women within the prospective Shanghai Women’s Health Study. Int J Cancer. 2024 Apr 23. doi: 10.1002/ijc.34929. Epub ahead of print. PMID: 38651675.

Abstract

The etiology of lung cancer in never-smokers remains elusive, despite 15% of lung cancer cases in men and 53% in women worldwide being unrelated to smoking. Here, we aimed to enhance our understanding of lung cancer pathogenesis among never-smokers using untargeted metabolomics. This nested case-control study included 395 never-smoking women who developed lung cancer and 395 matched never-smoking cancer-free women from the prospective Shanghai Women’s Health Study, with 15,353 metabolic features quantified in pre-diagnostic plasma using liquid chromatography high-resolution mass spectrometry. Recognizing that metabolites often correlate and seldom act independently in biological processes, we utilized a weighted correlation network analysis to agnostically construct 28 network modules of correlated metabolites. Using conditional logistic regression models, we assessed the associations for both metabolic network modules and individual metabolic features with lung cancer, accounting for multiple testing using a false discovery rate (FDR) < 0.20. We identified a network module of 121 features inversely associated with all lung cancer (p = .001, FDR = 0.028) and lung adenocarcinoma (p = .002, FDR = 0.056), where lyso-glycerophospholipids played a key role driving these associations. Another module of 440 features was inversely associated with lung adenocarcinoma (p = .014, FDR = 0.196). Individual metabolites within these network modules were enriched in biological pathways linked to oxidative stress and energy metabolism. These pathways have been implicated in previous metabolomics studies involving populations exposed to known lung cancer risk factors such as traffic-related air pollution and polycyclic aromatic hydrocarbons. Our results suggest that untargeted plasma metabolomics could provide novel insights into the etiology and risk factors of lung cancer among never-smokers.

Keywords: lung cancer; metabolomics; network analysis; never-smokers; oxidative stress.
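The abstract above controls multiple testing with an FDR threshold but does not name the procedure used. As an illustration only, the sketch below assumes the common Benjamini-Hochberg step-up adjustment; the function name and the toy p-values are hypothetical and unrelated to the study's results.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Toy p-values (hypothetical), e.g. one per network module.
toy_p = [0.01, 0.04, 0.03, 0.5]
toy_q = benjamini_hochberg(toy_p)

# Flag tests that pass an FDR threshold of 0.20.
significant = [q < 0.20 for q in toy_q]
```

Under this assumed procedure, a test is declared significant when its adjusted value falls below the chosen threshold (0.20 in the abstract).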

© 2024 UICC. This article has been contributed to by U.S. Government employees and their work is in the public domain in the USA.

Comparative impact assessment of COVID-19 policy interventions in five South Asian countries using reported and estimated unreported death counts during 2020-2021

Link to the publication
Kundu R, Datta J, Ray D, Mishra S, Bhattacharyya R, Zimmermann L, Mukherjee B. Comparative impact assessment of COVID-19 policy interventions in five South Asian countries using reported and estimated unreported death counts during 2020-2021. PLOS Glob Public Health. 2023 Dec 27;3(12):e0002063. doi: 10.1371/journal.pgph.0002063. PMID: 38150465; PMCID: PMC10752546.

Abstract

There has been intense discussion and debate around the quality of COVID-19 death data in South Asia. According to WHO, of the 5.5 million reported COVID-19 deaths from 2020-2021, 0.57 million (10%) were contributed by five low- and middle-income countries (LMICs) in the Global South: India, Pakistan, Bangladesh, Sri Lanka and Nepal. However, a number of excess death estimates show that the actual death toll from COVID-19 is significantly higher than the reported number of deaths. For example, the IHME and WHO both project around 14.9 million total deaths, of which 4.5-5.5 million were attributed to these five countries in 2020-2021. We focus on the COVID-19 performance of these five countries, home to 23.5% of the world population, during 2020 and 2021. Through a counterfactual lens, we ask to what extent the mortality of one LMIC would have been affected had it adopted the pandemic policies of another, similar country. We use a Bayesian semi-mechanistic model developed by Mishra et al. (2021) to compare both the reported and estimated total death tolls by permuting the time-varying reproduction number (Rt) across these countries over a similar time period. Our analysis shows that, in the first half of 2021, mortality in India in terms of reported deaths could have been reduced to 96 and 102 deaths per million, compared to the actual 170 reported deaths per million, had it adopted the policies of Nepal and Pakistan, respectively. In terms of total deaths, India could have averted 481 and 466 deaths per million had it adopted the policies of Bangladesh and Pakistan. On the other hand, India had a lower number of reported COVID-19 deaths per million (48 deaths per million) and a lower estimated total deaths per million (80 deaths per million) in the second half of 2021, and LMICs other than Pakistan would have had lower reported mortality had they followed India’s strategy. The gap between the reported and estimated total deaths highlights the varying level and extent of under-reporting of deaths across the subcontinent, and that model estimates are contingent on the accuracy of the death data. Our analysis shows the importance of timely public health intervention and vaccines for lowering mortality and the need for better coverage infrastructure for the death registration system in LMICs.

Methods for mediation analysis with high-dimensional DNA methylation data: Possible choices and comparisons

Link to the publication
Clark-Boucher D, Zhou X, Du J, Liu Y, Needham BL, Smith JA, Mukherjee B. Methods for mediation analysis with high-dimensional DNA methylation data: Possible choices and comparisons. PLoS Genet. 2023 Nov 7;19(11):e1011022. doi: 10.1371/journal.pgen.1011022. PMID: 37934796; PMCID: PMC10655967.

Abstract

Epigenetic researchers often evaluate DNA methylation as a potential mediator of the effect of social/environmental exposures on a health outcome. Modern statistical methods for jointly evaluating many mediators have not been widely adopted. We compare seven methods for high-dimensional mediation analysis with continuous outcomes through both diverse simulations and analysis of DNAm data from a large multi-ethnic cohort in the United States, while providing an R package for their seamless implementation and adoption. Among the considered choices, the best-performing methods for detecting active mediators in simulations are the Bayesian sparse linear mixed model (BSLMM) and high-dimensional mediation analysis (HDMA), while the preferred methods for estimating the global mediation effect are high-dimensional linear mediation analysis (HILMA) and principal component mediation analysis (PCMA). We provide guidelines for epigenetic researchers on choosing the best method in practice and offer suggestions for future methodological development.

Copyright: © 2023 Clark-Boucher et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Accuracy and Reliability of Chatbot Responses to Physician Questions

Link to the publication
Goodman RS, Patrinely JR, Stone CA Jr, Zimmerman E, Donald RR, Chang SS, Berkowitz ST, Finn AP, Jahangir E, Scoville EA, Reese TS, Friedman DL, Bastarache JA, van der Heijden YF, Wright JJ, Ye F, Carter N, Alexander MR, Choe JH, Chastain CA, Zic JA, Horst SN, Turker I, Agarwal R, Osmundson E, Idrees K, Kiernan CM, Padmanabhan C, Bailey CE, Schlegel CE, Chambless LB, Gibson MK, Osterman TJ, Wheless LE, Johnson DB. Accuracy and Reliability of Chatbot Responses to Physician Questions. JAMA Netw Open. 2023 Oct 2;6(10):e2336483. doi: 10.1001/jamanetworkopen.2023.36483. PMID: 37782499.

Abstract

Importance: Natural language processing tools, such as ChatGPT (generative pretrained transformer, hereafter referred to as chatbot), have the potential to radically enhance the accessibility of medical information for health professionals and patients. Assessing the safety and efficacy of these tools in answering physician-generated questions is critical to determining their suitability in clinical settings, facilitating complex decision-making, and optimizing health care efficiency. Objective: To assess the accuracy and comprehensiveness of chatbot-generated responses to physician-developed medical queries, highlighting the reliability and limitations of artificial intelligence-generated medical information. Design, setting, and participants: Thirty-three physicians across 17 specialties generated 284 medical questions that they subjectively classified as easy, medium, or hard with either binary (yes or no) or descriptive answers. The physicians then graded the chatbot-generated answers to these questions for accuracy (6-point Likert scale, with 1 being completely incorrect and 6 being completely correct) and completeness (3-point Likert scale, with 1 being incomplete and 3 being complete plus additional context). Scores were summarized with descriptive statistics and compared using the Mann-Whitney U test or the Kruskal-Wallis test. The study (including data analysis) was conducted from January to May 2023. Main outcomes and measures: Accuracy, completeness, and consistency over time and between 2 different versions (GPT-3.5 and GPT-4) of chatbot-generated medical responses. Results: Across all questions (n = 284) generated by 33 physicians (31 faculty members and 2 recent graduates from residency or fellowship programs) across 17 specialties, the median accuracy score was 5.5 (IQR, 4.0-6.0) (between almost completely and completely correct) with a mean (SD) score of 4.8 (1.6) (between mostly and almost completely correct). The median completeness score was 3.0 (IQR, 2.0-3.0) (complete and comprehensive) with a mean (SD) score of 2.5 (0.7). For questions rated easy, medium, and hard, the median accuracy scores were 6.0 (IQR, 5.0-6.0), 5.5 (IQR, 5.0-6.0), and 5.0 (IQR, 4.0-6.0), respectively (mean [SD] scores were 5.0 [1.5], 4.7 [1.7], and 4.6 [1.6], respectively; P = .05). Accuracy scores for binary and descriptive questions were similar (median score, 6.0 [IQR, 4.0-6.0] vs 5.0 [IQR, 3.4-6.0]; mean [SD] score, 4.9 [1.6] vs 4.7 [1.6]; P = .07). Of 36 questions with scores of 1.0 to 2.0, 34 were requeried or regraded 8 to 17 days later with substantial improvement (median score 2.0 [IQR, 1.0-3.0] vs 4.0 [IQR, 2.0-5.3]; P < .01). A subset of questions, regardless of initial scores (version 3.5), were regenerated and rescored using version 4 with improvement (mean accuracy [SD] score, 5.2 [1.5] vs 5.7 [0.8]; median score, 6.0 [IQR, 5.0-6.0] for original and 6.0 [IQR, 6.0-6.0] for rescored; P = .002). Conclusions and relevance: In this cross-sectional study, the chatbot generated largely accurate information in response to diverse medical queries, as judged by academic physician specialists, with improvement over time, although it had important limitations. Further research and model development are needed to correct inaccuracies and for validation.

Exploratory profiles of phenols, parabens, and per- and poly-fluoroalkyl substances among NHANES study participants in association with previous cancer diagnoses

Link to the publication
Cathey AL, Nguyen VK, Colacino JA, Woodruff TJ, Reynolds P, Aung MT. Exploratory profiles of phenols, parabens, and per- and poly-fluoroalkyl substances among NHANES study participants in association with previous cancer diagnoses. J Expo Sci Environ Epidemiol. 2023 Sep;33(5):687-698. doi: 10.1038/s41370-023-00601-6. Epub 2023 Sep 18. PMID: 37718377; PMCID: PMC10541322.

Abstract

Background: Some hormonally active cancers have low survival rates, but a large proportion of their incidence remains unexplained. Endocrine disrupting chemicals may affect hormone pathways in the pathology of these cancers. Objective: To evaluate cross-sectional associations between per- and polyfluoroalkyl substances (PFAS), phenols, and parabens and self-reported previous cancer diagnoses in the National Health and Nutrition Examination Survey (NHANES). Methods: We extracted concentrations of 7 PFAS and 12 phenols/parabens and self-reported diagnoses of melanoma and cancers of the thyroid, breast, ovary, uterus, and prostate in men and women (≥20 years). Associations between previous cancer diagnoses and an interquartile range increase in exposure biomarkers were evaluated using logistic regression models adjusted for key covariates. We conceptualized race as a social construct proxy of structural social factors and examined associations in non-Hispanic Black, Mexican American, and other Hispanic participants separately compared to White participants. Results: Previous melanoma in women was associated with higher PFDE (OR: 2.07, 95% CI: 1.25, 3.43), PFNA (OR: 1.72, 95% CI: 1.09, 2.73), PFUA (OR: 1.76, 95% CI: 1.07, 2.89), BP3 (OR: 1.81, 95% CI: 1.10, 2.96), DCP25 (OR: 2.41, 95% CI: 1.22, 4.76), and DCP24 (OR: 1.85, 95% CI: 1.05, 3.26). Previous ovarian cancer was associated with higher DCP25 (OR: 2.80, 95% CI: 1.08, 7.27), BPA (OR: 1.93, 95% CI: 1.11, 3.35) and BP3 (OR: 1.76, 95% CI: 1.00, 3.09). Previous uterine cancer was associated with increased PFNA (OR: 1.55, 95% CI: 1.03, 2.34), while higher ethyl paraben was inversely associated (OR: 0.31, 95% CI: 0.12, 0.85). Various PFAS were associated with previous ovarian and uterine cancers in White women, while MPAH or BPF was associated with previous breast cancer among non-White women.
Impact statement: Biomarkers across all exposure categories (phenols, parabens, and per- and poly-fluoroalkyl substances) were cross-sectionally associated with increased odds of previous melanoma diagnoses in women, and increased odds of previous ovarian cancer were associated with several phenols and parabens. Some associations differed by racial group, which is particularly impactful given the established racial disparities in distributions of exposure to these chemicals. This is the first epidemiological study to investigate exposure to phenols in relation to previous cancer diagnoses, and the first NHANES study to explore racial/ethnic disparities in associations between environmental phenol, paraben, and PFAS exposures and historical cancer diagnosis.
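The odds ratios above are expressed per interquartile range (IQR) increase in an exposure biomarker. A minimal sketch of that rescaling follows, assuming a logistic regression coefficient estimated per one unit of the (possibly log-transformed) biomarker; the function name, the coefficient, and the toy concentrations are hypothetical and not drawn from the study.

```python
import math
from statistics import quantiles


def odds_ratio_per_iqr(beta_per_unit, exposure_values):
    """Rescale a per-unit log-odds coefficient to an odds ratio per IQR increase."""
    # Quartile cut points; statistics.quantiles uses the "exclusive" method by default.
    q1, _, q3 = quantiles(exposure_values, n=4)
    iqr = q3 - q1
    # A change of one IQR multiplies the odds by exp(beta * IQR).
    return math.exp(beta_per_unit * iqr), iqr


# Toy biomarker concentrations (hypothetical units).
toy_exposure = [1, 2, 3, 4, 5, 6, 7]
# Hypothetical per-unit coefficient chosen so the odds double over one IQR.
toy_beta = math.log(2) / 4

toy_or, toy_iqr = odds_ratio_per_iqr(toy_beta, toy_exposure)
print(f"IQR = {toy_iqr}, OR per IQR = {toy_or:.2f}")
```

Reporting per IQR puts biomarkers with very different concentration ranges on a comparable scale, which is presumably why the abstract uses it.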

A synthetic data integration framework to leverage external summary-level information from heterogeneous populations

Link to the publication
Gu T, Taylor JMG, Mukherjee B. A synthetic data integration framework to leverage external summary-level information from heterogeneous populations. Biometrics. 2023 Dec;79(4):3831-3845. doi: 10.1111/biom.13852. Epub 2023 Apr 4. PMID: 36876883; PMCID: PMC10480346.

Abstract

There is a growing need for flexible general frameworks that integrate individual-level data with external summary information for improved statistical inference. External information relevant for a risk prediction model may come in multiple forms, through regression coefficient estimates or predicted values of the outcome variable. Different external models may use different sets of predictors, and the algorithm they used to predict the outcome Y given these predictors may or may not be known. The underlying populations corresponding to each external model may be different from each other and from the internal study population. Motivated by a prostate cancer risk prediction problem where novel biomarkers are measured only in the internal study, this paper proposes an imputation-based methodology, where the goal is to fit a target regression model with all available predictors in the internal study while utilizing summary information from external models that may have used only a subset of the predictors. The method allows for heterogeneity of covariate effects across the external populations. The proposed approach generates synthetic outcome data in each external population and uses stacked multiple imputation to create a long dataset with complete covariate information. The final analysis of the stacked imputed data is conducted by weighted regression. This flexible and unified approach can improve statistical efficiency of the estimated coefficients in the internal study, improve predictions by utilizing even partial information available from models that use a subset of the full set of covariates used in the internal study, and provide statistical inference for the external population with potentially different covariate effects from the internal population.

Keywords: data integration; prediction models; stacked multiple imputation; synthetic data.

Methods for large-scale single mediator hypothesis testing: Possible choices and comparisons

Link to the publication

Du J, Zhou X, Clark-Boucher D, Hao W, Liu Y, Smith JA, Mukherjee B. Methods for large-scale single mediator hypothesis testing: Possible choices and comparisons. Genet Epidemiol. 2023 Mar;47(2):167-184. doi: 10.1002/gepi.22510. Epub 2022 Dec 8. PMID: 36465006.

Abstract

Mediation hypothesis testing for a large number of mediators is challenging due to the composite structure of the null hypothesis, $H_0: \alpha\beta = 0$ ($\alpha$: effect of the exposure on the mediator after adjusting for confounders; $\beta$: effect of the mediator on the outcome after adjusting for exposure and confounders). In this paper, we reviewed three classes of methods for large-scale one-at-a-time mediation hypothesis testing. These methods are commonly used for continuous outcomes and continuous mediators assuming there is no exposure-mediator interaction so that the product $\alpha\beta$ has a causal interpretation as the indirect effect. The first class of methods ignores the impact of different structures under the composite null hypothesis, namely, (1) $\alpha = 0, \beta \neq 0$; (2) $\alpha \neq 0, \beta = 0$; and (3) $\alpha = \beta = 0$. The second class of methods weights the reference distribution under each case of the null to form a mixture reference distribution. The third class constructs a composite test statistic using the three p values obtained under each case of the null so that the reference distribution of the composite statistic is approximately $U(0,1)$. In addition to these existing methods, we developed the Sobel-comp method belonging to the second class, which uses a corrected mixture reference distribution for Sobel’s test statistic. We performed extensive simulation studies to compare all six methods belonging to these three classes in terms of the false positive rates (FPRs) under the null hypothesis and the true positive rates under the alternative hypothesis. We found that the second class of methods, which uses a mixture reference distribution, could best maintain the FPRs at the nominal level under the null hypothesis and had the greatest true positive rates under the alternative hypothesis. We applied all methods to study the mediation mechanism of DNA methylation sites in the pathway from adult socioeconomic status to glycated hemoglobin level using data from the Multi-Ethnic Study of Atherosclerosis (MESA). We provide guidelines for choosing the optimal mediation hypothesis testing method in practice and develop an R package medScan, available on CRAN, for implementing all six methods.

Keywords: agnostic mediation analysis; composite null hypothesis; indirect effect; mediation effect; multiple hypothesis testing.

Operationalizing the Exposome Using Passive Silicone Samplers

Link to the publication
Fuentes ZC, Schwartz YL, Robuck AR, Walker DI. Operationalizing the Exposome Using Passive Silicone Samplers. Curr Pollut Rep. 2022;8(1):1-29. doi: 10.1007/s40726-021-00211-6. Epub 2022 Jan 4. PMID: 35004129; PMCID: PMC8724229.

Abstract

The exposome, which is defined as the cumulative effect of environmental exposures and corresponding biological responses, aims to provide a comprehensive measure for evaluating non-genetic causes of disease. Operationalization of the exposome for environmental health and precision medicine has been limited by the lack of a universal approach for characterizing complex exposures, particularly as they vary temporally and geographically. To overcome these challenges, passive sampling devices (PSDs) provide a key measurement strategy for deep exposome phenotyping, which aims to provide comprehensive chemical assessment using untargeted high-resolution mass spectrometry for exposome-wide association studies. To highlight the advantages of silicone PSDs, we review their use in population studies and evaluate the broad range of applications and chemical classes characterized using these samplers. We assess key aspects of incorporating PSDs within observational studies, including the need to preclean samplers prior to use to remove impurities that interfere with compound detection, analytical considerations, and cost. We close with strategies on how to incorporate measures of the external exposome using PSDs, and their advantages for reducing variability in exposure measures and providing a more thorough accounting of the exposome. Continued development and application of silicone PSDs will facilitate greater understanding of how environmental exposures drive disease risk, while providing a feasible strategy for incorporating untargeted, high-resolution characterization of the external exposome in human studies.

Keywords: Exposome; Exposure assessment; High-resolution mass spectrometry; Precision medicine; Silicone wristband samplers.