Completion of the DrugMatrix Toxicogenomics Database using ToxiCompl
Authors:
Guojing Cong1, Robert Patton1, Frank Chao3, Daniel L. Svoboda4, Warren M. Casey3, Charles P. Schmitt3, Charles Murphy2, Jeremy N. Erickson3, Parker A. Combs3, Scott S. Auerbach3
Corresponding Author:
Scott Auerbach
1Data and AI Section, Oak Ridge National Laboratory, Oak Ridge, TN
2Carnegie Mellon University, Pittsburgh, PA
3Division of Translational Toxicology, National Institute of Environmental Health Sciences, RTP, NC
4Sciome LLC, RTP, NC
DOI:
https://doi.org/10.22427/NTP-DATA-500-010-001-000-7
Publication
Abstract
The DrugMatrix Database contains systematically generated toxicogenomics data from short-term in vivo studies for over 600 chemicals. However, most of the potential endpoints in the database are missing due to a lack of experimental measurements. We present our study on leveraging matrix factorization and machine learning methods to predict the missing values in the DrugMatrix, which includes gene expression across eight tissues on two expression platforms along with paired clinical chemistry, hematology, and histopathology measurements. One major challenge we encounter is the skewed distribution of the available measured data, in terms of both tissue sources and values. We propose a method, ToxiCompl, that applies systematic hybrid sampling guided by Bayesian optimization in conjunction with low-rank matrix factorization to recover the missing values. ToxiCompl achieves good training and validation performance from a machine learning perspective.
We further conduct an in-depth validation of the predicted data from biological and toxicological perspectives with a series of analyses. These include examining the connectivity pattern of predicted gene expression responses, characterizing molecular pathway-level responses from sets of differentially expressed genes, evaluating known transcriptional biomarkers of tissue toxicity, and characterizing predicted apical endpoints. Our analysis shows that the predicted differential gene expression, broadly speaking, aligns with what would be anticipated. For example, in most instances, our predicted differentially expressed gene lists offer a connectivity level comparable to that of measured data in connectivity analysis. Using Havcr1, a known transcriptional biomarker of kidney injury, we identify treatments that, based on the predicted expression data, manifest kidney toxicity in a manner that is mechanistically plausible and supported by the literature. Characterization of the predicted clinical chemistry data suggests that strong effects are relatively reliably predicted, while more subtle effects pose a greater challenge. In the case of histopathological prediction, we find a significant overprediction due to positivity bias in the measured data. Developing methods to deal with this bias is one of the areas we plan to target for future improvement. The main advantage of the ToxiCompl approach is that, in the absence of additional experimental data, it drastically extends the toxicogenomic landscape into a number of data-poor tissues, thereby allowing researchers to formulate mechanistic hypotheses about effects in tissues that have been underrepresented in the literature.
All measured and predicted DrugMatrix data (i.e., gene expression, clinical chemistry, hematology, and histopathology) are available to the public through an intuitive graphical interface that allows for data retrieval, gene set analysis, and high-dimensional visualization of gene expression similarity (https://rstudio.niehs.nih.gov/complete_drugmatrix/).
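To illustrate the general idea of low-rank matrix factorization for filling in missing entries, the following is a minimal sketch in plain Python. It is not the ToxiCompl implementation: it omits the hybrid sampling and Bayesian optimization described in the abstract, and the function names (factorize, predict), the hyperparameters, and the toy 3 x 3 matrix are all assumptions made for illustration only.

```python
import random


def factorize(observed, n_rows, n_cols, rank=1, lr=0.02, reg=0.01,
              epochs=1000, seed=0):
    """Fit U (n_rows x rank) and V (n_cols x rank) so that the dot product
    of U[i] and V[j] approximates every observed (i, j, value) triple.
    Uses plain stochastic gradient descent with L2 regularization.
    Hypothetical helper for illustration; not the ToxiCompl code."""
    rng = random.Random(seed)
    U = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_rows)]
    V = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_cols)]
    triples = list(observed)
    for _ in range(epochs):
        rng.shuffle(triples)
        for i, j, y in triples:
            pred = sum(u * v for u, v in zip(U[i], V[j]))
            err = y - pred
            for k in range(rank):
                u_ik, v_jk = U[i][k], V[j][k]
                U[i][k] += lr * (err * v_jk - reg * u_ik)
                V[j][k] += lr * (err * u_ik - reg * v_jk)
    return U, V


def predict(U, V, i, j):
    """Reconstruct one matrix entry from the learned factors."""
    return sum(u * v for u, v in zip(U[i], V[j]))


# Toy rank-1 matrix: entry (i, j) = row_factor[i] * col_factor[j].
row_factor = [1.0, 2.0, 3.0]
col_factor = [1.0, 0.5, 2.0]

# Treat entry (2, 2) as "not measured" and keep the other eight entries.
observed = [(i, j, row_factor[i] * col_factor[j])
            for i in range(3) for j in range(3) if (i, j) != (2, 2)]

U, V = factorize(observed, n_rows=3, n_cols=3, rank=1)
estimate = predict(U, V, 2, 2)   # true held-out value is 3.0 * 2.0
print(f"estimated missing entry: {estimate:.2f}")
```

In this toy case the held-out entry is fully determined by the observed ones because the matrix is exactly rank 1. Real toxicogenomics data are noisy and unevenly sampled, which is the difficulty the abstract's sampling strategy is meant to address.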
Input Data
Input Dataset to the Matrix Completion Algorithm
The data file listed below is the input dataset to the matrix completion algorithm created by ORNL.
Output Data
Matrix Completion Algorithm Outputs
The following data files are the outputs from the matrix completion algorithm created by ORNL. The files listed below are predictions for the missing hematology, clinical chemistry, and histopathology.
- https://cebs-ext.niehs.nih.gov/cahs/file/download/ornl/drugmatrix-coo-C00.csv.gz
- https://cebs-ext.niehs.nih.gov/cahs/file/download/ornl/drugmatrix-coo-H00.csv.gz
- https://cebs-ext.niehs.nih.gov/cahs/file/download/ornl/drugmatrix-coo-M00.csv.gz
- https://cebs-ext.niehs.nih.gov/cahs/file/download/ornl/C_H_M annotation.txt
Predictions for the Missing Gene Expression Data
The files listed below are the predictions for the missing gene expression data from both the Codelink and Affymetrix platforms. The output data are organized by tissue type.
- https://cebs-ext.niehs.nih.gov/cahs/file/download/ornl/drugmatrix-coo-BM_.csv.gz
- https://cebs-ext.niehs.nih.gov/cahs/file/download/ornl/drugmatrix-coo-BR_.csv.gz
- https://cebs-ext.niehs.nih.gov/cahs/file/download/ornl/drugmatrix-coo-HE_.csv.gz
- https://cebs-ext.niehs.nih.gov/cahs/file/download/ornl/drugmatrix-coo-INTESTINE_.csv.gz
- https://cebs-ext.niehs.nih.gov/cahs/file/download/ornl/drugmatrix-coo-KI_.csv.gz
- https://cebs-ext.niehs.nih.gov/cahs/file/download/ornl/drugmatrix-coo-LI_.csv.gz
- https://cebs-ext.niehs.nih.gov/cahs/file/download/ornl/drugmatrix-coo-SP_.csv.gz
- https://cebs-ext.niehs.nih.gov/cahs/file/download/ornl/drugmatrix-coo-TM_.csv.gz
- https://cebs-ext.niehs.nih.gov/cahs/file/download/ornl/ID_Lookup_DS_final_SA.txt
- toxcompl_probe_mapping.csv
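The "coo" in the file names above suggests a coordinate (row, column, value) layout. Assuming that is the case, a gzip-compressed CSV of this kind could be read along the lines of the sketch below. The three-column layout, the presence of a header row, and the identifiers in the sample are assumptions for illustration, not the documented schema of these files; check the actual column order against the annotation and ID lookup files before relying on it.

```python
import csv
import gzip
import io


def read_coo(source, has_header=True):
    """Read (row_id, col_id, value) records from a gzip-compressed CSV
    (file path or binary file object) into a nested dict so that
    matrix[row_id][col_id] == value.
    ASSUMPTION: exactly three columns per record; the real files may differ."""
    matrix = {}
    with gzip.open(source, mode="rt", newline="") as text:
        reader = csv.reader(text)
        if has_header:
            next(reader, None)  # skip the assumed header row
        for row_id, col_id, value in reader:
            matrix.setdefault(row_id, {})[col_id] = float(value)
    return matrix


# Hypothetical in-memory sample standing in for a downloaded .csv.gz file.
sample = "row,col,value\nT1,GeneA,0.5\nT1,GeneB,-1.25\nT2,GeneA,2.0\n"
buffer = io.BytesIO(gzip.compress(sample.encode("utf-8")))

matrix = read_coo(buffer)
print(matrix["T1"]["GeneB"])
```

A nested dict keeps the sketch dependency-free. For the full tissue-level files, a sparse-matrix or data-frame library would likely be a better fit.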
Export of a PostgreSQL Database
The file below is an export of a PostgreSQL database. The database was necessary for linking datasets together and is the backend for the application. The application can be used for general exploration of both the previously generated and newly predicted data. It was used to evaluate the performance of the prediction algorithm.
Links
Link to Code
The code for the application, written in R and Shiny, is provided below:
https://github.com/combspk/Complete-DrugMatrix