Methods for mediation analysis with high-dimensional DNA methylation data: Possible choices and comparisons

Link to the publication
Clark-Boucher D, Zhou X, Du J, Liu Y, Needham BL, Smith JA, Mukherjee B. Methods for mediation analysis with high-dimensional DNA methylation data: Possible choices and comparisons. PLoS Genet. 2023 Nov 7;19(11):e1011022. doi: 10.1371/journal.pgen.1011022. PMID: 37934796; PMCID: PMC10655967.

Abstract

Epigenetic researchers often evaluate DNA methylation as a potential mediator of the effect of social/environmental exposures on a health outcome. Modern statistical methods for jointly evaluating many mediators have not been widely adopted. We compare seven methods for high-dimensional mediation analysis with continuous outcomes through both diverse simulations and analysis of DNAm data from a large multi-ethnic cohort in the United States, while providing an R package for their seamless implementation and adoption. Among the considered choices, the best-performing methods for detecting active mediators in simulations are the Bayesian sparse linear mixed model (BSLMM) and high-dimensional mediation analysis (HDMA), while the preferred methods for estimating the global mediation effect are high-dimensional linear mediation analysis (HILMA) and principal component mediation analysis (PCMA). We provide guidelines for epigenetic researchers on choosing the best method in practice and offer suggestions for future methodological development.
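
As a rough illustration of what "global mediation effect" means, the sketch below uses plain linear models with a handful of mediators. It is not one of the seven methods compared in the paper and would not be appropriate for truly high-dimensional data; the function name `global_mediation_effect`, the simulated data, and the effect sizes are all assumptions made for this example. Under ordinary least squares, the total effect of the exposure equals the direct effect plus the sum of the products alpha_j * beta_j, and that sum is the global indirect effect in this toy setting.

```python
import numpy as np

def global_mediation_effect(a, M, y):
    """Return (total, direct, indirect) effects from OLS fits.

    a: exposure, shape (n,); M: mediators, shape (n, p); y: outcome, shape (n,).
    Hypothetical helper for illustration only (no confounders, low dimension).
    """
    ones = np.ones(len(a))
    # Total effect: regress the outcome on the exposure alone.
    X_exposure = np.column_stack([ones, a])
    total = np.linalg.lstsq(X_exposure, y, rcond=None)[0][1]
    # Direct effect and mediator-outcome effects (beta):
    # regress the outcome on the exposure and all mediators jointly.
    X_full = np.column_stack([ones, a, M])
    coef = np.linalg.lstsq(X_full, y, rcond=None)[0]
    direct, beta = coef[1], coef[2:]
    # Exposure-mediator effects (alpha): regress each mediator on the exposure.
    alpha = np.linalg.lstsq(X_exposure, M, rcond=None)[0][1]
    indirect = float(alpha @ beta)
    return float(total), float(direct), indirect

# Simulated toy data (assumed values, not from the paper).
rng = np.random.default_rng(0)
n, p = 500, 5
a = rng.normal(size=n)
alpha_true = np.array([0.5, 0.3, 0.0, 0.0, 0.0])
beta_true = np.array([0.4, 0.0, 0.6, 0.0, 0.0])
M = np.outer(a, alpha_true) + rng.normal(size=(n, p))
y = 0.2 * a + M @ beta_true + rng.normal(size=n)

total, direct, indirect = global_mediation_effect(a, M, y)
print(f"total={total:.3f} direct={direct:.3f} indirect={indirect:.3f}")
```

In this simulation only the first mediator has both a nonzero alpha and a nonzero beta, so it is the only active mediator. The methods compared in the paper address the harder case where the number of mediators is large relative to the sample size.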

Accuracy and Reliability of Chatbot Responses to Physician Questions

Link to the publication
Goodman RS, Patrinely JR, Stone CA Jr, Zimmerman E, Donald RR, Chang SS, Berkowitz ST, Finn AP, Jahangir E, Scoville EA, Reese TS, Friedman DL, Bastarache JA, van der Heijden YF, Wright JJ, Ye F, Carter N, Alexander MR, Choe JH, Chastain CA, Zic JA, Horst SN, Turker I, Agarwal R, Osmundson E, Idrees K, Kiernan CM, Padmanabhan C, Bailey CE, Schlegel CE, Chambless LB, Gibson MK, Osterman TJ, Wheless LE, Johnson DB. Accuracy and Reliability of Chatbot Responses to Physician Questions. JAMA Netw Open. 2023 Oct 2;6(10):e2336483. doi: 10.1001/jamanetworkopen.2023.36483. PMID: 37782499.

Abstract

Importance: Natural language processing tools, such as ChatGPT (generative pretrained transformer, hereafter referred to as chatbot), have the potential to radically enhance the accessibility of medical information for health professionals and patients. Assessing the safety and efficacy of these tools in answering physician-generated questions is critical to determining their suitability in clinical settings, facilitating complex decision-making, and optimizing health care efficiency. Objective: To assess the accuracy and comprehensiveness of chatbot-generated responses to physician-developed medical queries, highlighting the reliability and limitations of artificial intelligence-generated medical information. Design, setting, and participants: Thirty-three physicians across 17 specialties generated 284 medical questions that they subjectively classified as easy, medium, or hard with either binary (yes or no) or descriptive answers. The physicians then graded the chatbot-generated answers to these questions for accuracy (6-point Likert scale with 1 being completely incorrect and 6 being completely correct) and completeness (3-point Likert scale, with 1 being incomplete and 3 being complete plus additional context). Scores were summarized with descriptive statistics and compared using the Mann-Whitney U test or the Kruskal-Wallis test. The study (including data analysis) was conducted from January to May 2023. Main outcomes and measures: Accuracy, completeness, and consistency over time and between 2 different versions (GPT-3.5 and GPT-4) of chatbot-generated medical responses. Results: Across all questions (n = 284) generated by 33 physicians (31 faculty members and 2 recent graduates from residency or fellowship programs) across 17 specialties, the median accuracy score was 5.5 (IQR, 4.0-6.0) (between almost completely and completely correct) with a mean (SD) score of 4.8 (1.6) (between mostly and almost completely correct). The median completeness score was 3.0 (IQR, 2.0-3.0) (complete and comprehensive) with a mean (SD) score of 2.5 (0.7). For questions rated easy, medium, and hard, the median accuracy scores were 6.0 (IQR, 5.0-6.0), 5.5 (IQR, 5.0-6.0), and 5.0 (IQR, 4.0-6.0), respectively (mean [SD] scores were 5.0 [1.5], 4.7 [1.7], and 4.6 [1.6], respectively; P = .05). Accuracy scores for binary and descriptive questions were similar (median score, 6.0 [IQR, 4.0-6.0] vs 5.0 [IQR, 3.4-6.0]; mean [SD] score, 4.9 [1.6] vs 4.7 [1.6]; P = .07). Of 36 questions with scores of 1.0 to 2.0, 34 were requeried or regraded 8 to 17 days later with substantial improvement (median score 2.0 [IQR, 1.0-3.0] vs 4.0 [IQR, 2.0-5.3]; P < .01). A subset of questions, regardless of initial scores (version 3.5), were regenerated and rescored using version 4 with improvement (mean accuracy [SD] score, 5.2 [1.5] vs 5.7 [0.8]; median score, 6.0 [IQR, 5.0-6.0] for original and 6.0 [IQR, 6.0-6.0] for rescored; P = .002). Conclusions and relevance: In this cross-sectional study, chatbot generated largely accurate information to diverse medical queries as judged by academic physician specialists with improvement over time, although it had important limitations. Further research and model development are needed to correct inaccuracies and for validation.

Exploratory profiles of phenols, parabens, and per- and poly-fluoroalkyl substances among NHANES study participants in association with previous cancer diagnoses

Link to the publication
Cathey AL, Nguyen VK, Colacino JA, Woodruff TJ, Reynolds P, Aung MT. Exploratory profiles of phenols, parabens, and per- and poly-fluoroalkyl substances among NHANES study participants in association with previous cancer diagnoses. J Expo Sci Environ Epidemiol. 2023 Sep;33(5):687-698. doi: 10.1038/s41370-023-00601-6. Epub 2023 Sep 18. PMID: 37718377; PMCID: PMC10541322.

Abstract

Background: Some hormonally active cancers have low survival rates, but a large proportion of their incidence remains unexplained. Endocrine disrupting chemicals may affect hormone pathways in the pathology of these cancers. Objective: To evaluate cross-sectional associations between per- and polyfluoroalkyl substances (PFAS), phenols, and parabens and self-reported previous cancer diagnoses in the National Health and Nutrition Examination Survey (NHANES). Methods: We extracted concentrations of 7 PFAS and 12 phenols/parabens and self-reported diagnoses of melanoma and cancers of the thyroid, breast, ovary, uterus, and prostate in men and women (≥20 years). Associations between previous cancer diagnoses and an interquartile range increase in exposure biomarkers were evaluated using logistic regression models adjusted for key covariates. We conceptualized race as a social construct proxy of structural social factors and examined associations in non-Hispanic Black, Mexican American, and other Hispanic participants separately compared to White participants. Results: Previous melanoma in women was associated with higher PFDE (OR: 2.07, 95% CI: 1.25, 3.43), PFNA (OR: 1.72, 95% CI: 1.09, 2.73), PFUA (OR: 1.76, 95% CI: 1.07, 2.89), BP3 (OR: 1.81, 95% CI: 1.10, 2.96), DCP25 (OR: 2.41, 95% CI: 1.22, 4.76), and DCP24 (OR: 1.85, 95% CI: 1.05, 3.26). Previous ovarian cancer was associated with higher DCP25 (OR: 2.80, 95% CI: 1.08, 7.27), BPA (OR: 1.93, 95% CI: 1.11, 3.35) and BP3 (OR: 1.76, 95% CI: 1.00, 3.09). Previous uterine cancer was associated with increased PFNA (OR: 1.55, 95% CI: 1.03, 2.34), while higher ethyl paraben was inversely associated (OR: 0.31, 95% CI: 0.12, 0.85). Various PFAS were associated with previous ovarian and uterine cancers in White women, while MPAH or BPF was associated with previous breast cancer among non-White women.
Impact statement: Biomarkers across all exposure categories (phenols, parabens, and per- and poly-fluoroalkyl substances) were cross-sectionally associated with increased odds of previous melanoma diagnoses in women, and increased odds of previous ovarian cancer were associated with several phenols and parabens. Some associations differed by racial group, which is particularly impactful given the established racial disparities in distributions of exposure to these chemicals. This is the first epidemiological study to investigate exposure to phenols in relation to previous cancer diagnoses, and the first NHANES study to explore racial/ethnic disparities in associations between environmental phenol, paraben, and PFAS exposures and historical cancer diagnosis.
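
The odds ratios above are reported per interquartile range (IQR) increase in a biomarker. As a minimal sketch of how such a rescaling is commonly done, assuming a logistic regression that yields a per-unit log-odds coefficient and its standard error, the per-unit estimate is multiplied by the IQR before exponentiating. The function name `or_per_iqr` and the example inputs are hypothetical, and whether the biomarkers were transformed before modeling is not stated in this abstract.

```python
import math

def or_per_iqr(beta, se, iqr, z=1.96):
    """Odds ratio and approximate 95% CI for an IQR increase in exposure.

    beta: per-unit log-odds coefficient from a logistic regression (assumed input)
    se:   standard error of beta
    iqr:  interquartile range of the exposure (positive), on the modeled scale
    z:    normal quantile for the interval (1.96 for roughly 95%)
    """
    point = math.exp(beta * iqr)
    lower = math.exp((beta - z * se) * iqr)
    upper = math.exp((beta + z * se) * iqr)
    return point, lower, upper

# Hypothetical inputs, not taken from the study.
estimate, low, high = or_per_iqr(beta=0.25, se=0.10, iqr=2.0)
print(f"OR per IQR: {estimate:.2f} (95% CI: {low:.2f}, {high:.2f})")
```

Scaling by the IQR puts biomarkers measured on very different concentration ranges on a comparable footing, which is one reason this presentation is common in exposure studies.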

A synthetic data integration framework to leverage external summary-level information from heterogeneous populations

Link to the publication

Gu T, Taylor JMG, Mukherjee B. A synthetic data integration framework to leverage external summary-level information from heterogeneous populations. Biometrics. 2023 Mar 6. doi: 10.1111/biom.13852. Epub ahead of print. PMID: 36876883.

Abstract

There is a growing need for flexible general frameworks that integrate individual-level data with external summary information for improved statistical inference. External information relevant for a risk prediction model may come in multiple forms, through regression coefficient estimates or predicted values of the outcome variable. Different external models may use different sets of predictors, and the algorithm they used to predict the outcome Y given these predictors may or may not be known. The underlying populations corresponding to each external model may be different from each other and from the internal study population. Motivated by a prostate cancer risk prediction problem where novel biomarkers are measured only in the internal study, this paper proposes an imputation-based methodology, where the goal is to fit a target regression model with all available predictors in the internal study while utilizing summary information from external models that may have used only a subset of the predictors. The method allows for heterogeneity of covariate effects across the external populations. The proposed approach generates synthetic outcome data in each external population and uses stacked multiple imputation to create a long dataset with complete covariate information. The final analysis of the stacked imputed data is conducted by weighted regression. This flexible and unified approach can improve statistical efficiency of the estimated coefficients in the internal study, improve predictions by utilizing even partial information available from models that use a subset of the full set of covariates used in the internal study, and provide statistical inference for the external population with potentially different covariate effects from the internal population.

Keywords: data integration; prediction models; stacked multiple imputation; synthetic data.

Methods for large-scale single mediator hypothesis testing: Possible choices and comparisons

Link to the publication

Du J, Zhou X, Clark-Boucher D, Hao W, Liu Y, Smith JA, Mukherjee B. Methods for large-scale single mediator hypothesis testing: Possible choices and comparisons. Genet Epidemiol. 2023 Mar;47(2):167-184. doi: 10.1002/gepi.22510. Epub 2022 Dec 8. PMID: 36465006.

Abstract

Mediation hypothesis testing for a large number of mediators is challenging due to the composite structure of the null hypothesis, $H_0: \alpha\beta = 0$ ($\alpha$: effect of the exposure on the mediator after adjusting for confounders; $\beta$: effect of the mediator on the outcome after adjusting for exposure and confounders). In this paper, we reviewed three classes of methods for large-scale one-at-a-time mediation hypothesis testing. These methods are commonly used for continuous outcomes and continuous mediators assuming there is no exposure-mediator interaction so that the product $\alpha\beta$ has a causal interpretation as the indirect effect. The first class of methods ignores the impact of different structures under the composite null hypothesis, namely, (1) $\alpha = 0, \beta \neq 0$; (2) $\alpha \neq 0, \beta = 0$; and (3) $\alpha = \beta = 0$. The second class of methods weights the reference distribution under each case of the null to form a mixture reference distribution. The third class constructs a composite test statistic using the three p values obtained under each case of the null so that the reference distribution of the composite statistic is approximately $U(0,1)$. In addition to these existing methods, we developed the Sobel-comp method belonging to the second class, which uses a corrected mixture reference distribution for Sobel's test statistic. We performed extensive simulation studies to compare all six methods belonging to these three classes in terms of the false positive rates (FPRs) under the null hypothesis and the true positive rates under the alternative hypothesis. We found that the second class of methods, which uses a mixture reference distribution, could best maintain the FPRs at the nominal level under the null hypothesis and had the greatest true positive rates under the alternative hypothesis. We applied all methods to study the mediation mechanism of DNA methylation sites in the pathway from adult socioeconomic status to glycated hemoglobin level using data from the Multi-Ethnic Study of Atherosclerosis (MESA). We provide guidelines for choosing the optimal mediation hypothesis testing method in practice and develop an R package, medScan, available on CRAN for implementing all six methods.

Keywords: agnostic mediation analysis; composite null hypothesis; indirect effect; mediation effect; multiple hypothesis testing.

Operationalizing the Exposome Using Passive Silicone Samplers

Link to the publication

Fuentes ZC, Schwartz YL, Robuck AR, Walker DI. Operationalizing the Exposome Using Passive Silicone Samplers. Curr Pollut Rep. 2022;8(1):1-29. doi: 10.1007/s40726-021-00211-6. Epub 2022 Jan 4. PMID: 35004129; PMCID: PMC8724229.

Abstract

The exposome, which is defined as the cumulative effect of environmental exposures and corresponding biological responses, aims to provide a comprehensive measure for evaluating non-genetic causes of disease. Operationalization of the exposome for environmental health and precision medicine has been limited by the lack of a universal approach for characterizing complex exposures, particularly as they vary temporally and geographically. To overcome these challenges, passive sampling devices (PSDs) provide a key measurement strategy for deep exposome phenotyping, which aims to provide comprehensive chemical assessment using untargeted high-resolution mass spectrometry for exposome-wide association studies. To highlight the advantages of silicone PSDs, we review their use in population studies and evaluate the broad range of applications and chemical classes characterized using these samplers. We assess key aspects of incorporating PSDs within observational studies, including the need to preclean samplers prior to use to remove impurities that interfere with compound detection, analytical considerations, and cost. We close with strategies on how to incorporate measures of the external exposome using PSDs, and their advantages for reducing variability in exposure measures and providing a more thorough accounting of the exposome. Continued development and application of silicone PSDs will facilitate greater understanding of how environmental exposures drive disease risk, while providing a feasible strategy for incorporating untargeted, high-resolution characterization of the external exposome in human studies.

Keywords: Exposome; Exposure assessment; High-resolution mass spectrometry; Precision medicine; Silicone wristband samplers.