Histopathological Images and Multi-Omics Integration Predict Molecular Characteristics and Survival in Lung Adenocarcinoma

Author(s):  
Linyan Chen ◽  
Hao Zeng ◽  
Yu Xiang ◽  
Yeqian Huang ◽  
Yuling Luo ◽  
...  

Histopathological images and omics profiles play important roles in the prognosis of cancer patients. Here, we extracted quantitative features from histopathological images to predict molecular characteristics and prognosis, and integrated image features with mutations, transcriptomics, and proteomics data for prognosis prediction in lung adenocarcinoma (LUAD). Patients obtained from The Cancer Genome Atlas (TCGA) were divided into a training set (n = 235) and a test set (n = 235). We developed machine learning models in the training set and estimated their predictive performance in the test set. In the test set, the machine learning models could predict genetic aberrations, namely ALK (AUC = 0.879), BRAF (AUC = 0.847), EGFR (AUC = 0.855), and ROS1 (AUC = 0.848), and transcriptional subtypes, namely proximal-inflammatory (AUC = 0.897), proximal-proliferative (AUC = 0.861), and terminal respiratory unit (AUC = 0.894), from histopathological images. Moreover, we obtained tissue microarrays from 316 LUAD patients, including four external validation sets. The prognostic model using image features was predictive of overall survival in the test set and the four validation sets, with 5-year AUCs from 0.717 to 0.825. High-risk and low-risk groups stratified by the model showed different survival in the test set (HR = 4.94, p < 0.0001) and in three validation sets (HR = 1.64–2.20, p < 0.05). Combining image features with a single omics type gave greater prognostic power in the test set, as in the histopathology + transcriptomics model (5-year AUC = 0.840; HR = 7.34, p < 0.0001). Finally, the model integrating image features with multi-omics achieved the best performance (5-year AUC = 0.908; HR = 19.98, p < 0.0001). Our results indicated that machine learning models based on histopathological image features could predict genetic aberrations, transcriptional subtypes, and survival outcomes of LUAD patients. The integration of histopathological images and multi-omics may provide better survival prediction for LUAD.

2021 ◽  
Author(s):  
Siddharth Ghule ◽  
Sayan Bagchi ◽  
Kumar Vanka

Electricity generation is a major contributing factor for greenhouse gas emissions. Energy storage systems available today have a combined capacity to store less than 1% of the electricity being consumed worldwide. Redox Flow Batteries (RFBs) are promising candidates for green and efficient energy storage systems. RFBs are being used in renewable energy systems, but their widespread adoption is limited due to high production costs and toxicity associated with the transition-metal-based redox-active species. Therefore, cheaper and greener alternative organic redox-active species are being investigated. Recent reports have shown that organic molecules based on phenazine are promising candidates for redox-active species in RFBs. However, the large number of available organic compounds makes the conventional experimental and DFT methods impractical for screening thousands of molecules in a reasonable amount of time. In contrast, machine-learning models have low development time, short prediction time, and high accuracy; thus, they are being heavily investigated for virtual screening applications. In this work, we developed machine-learning models to predict the redox potential of phenazine derivatives in DME solvent using a small dataset of 185 molecules. 2D, 3D, and Molecular Fingerprint features were computed using readily available and easy-to-use Python libraries, making our approach easily adaptable to similar work. Twenty linear and non-linear machine-learning models were investigated in this work. These models achieved excellent performance on the unseen data (i.e., R² > 0.98, MSE < 0.008 V² and MAE < 0.07 V). Model performance was assessed in a consistent manner using the training and evaluation pipeline developed in this work. We showed that 2D molecular features are the most informative and achieve the best prediction accuracy among the four feature sets. We also showed that the often less preferred but relatively faster linear models could perform better than non-linear models when the feature set contains different types of features (i.e., 2D, 3D, and Molecular Fingerprints). Further investigations revealed that it is possible to reduce the training and inference time without sacrificing prediction accuracy by using a small subset of features. Moreover, the models were able to predict the previously reported promising redox-active compounds with high accuracy. Also, significantly low prediction errors were observed for the functional groups. Although some functional groups had only one compound in the training set, the best-performing models could achieve errors (MAPE) of less than 10%. The major source of error was a lack of data near zero and in the positive region. Therefore, this work shows that it is possible to develop accurate machine-learning models that could potentially screen millions of compounds in a short amount of time with a small training set and a limited number of easy-to-compute features. Thus, the results obtained in this report would help in the adoption of green energy by accelerating the field of materials discovery for energy storage applications.
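
The error metrics quoted in this abstract (R², MSE, MAE, MAPE) can be illustrated with a minimal sketch. It uses the standard textbook definitions, which the abstract does not spell out; the function name `regression_metrics` and the example potentials are hypothetical and are not taken from the paper.

```python
def regression_metrics(y_true, y_pred):
    """Return R^2, MSE, MAE and MAPE (in %) for paired true/predicted values.

    Standard definitions are assumed; MAPE is undefined if any true value is 0.
    """
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n                      # mean squared error
    mae = sum(abs(e) for e in errors) / n                     # mean absolute error
    mape = 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)        # total sum of squares
    r2 = 1.0 - sum(e * e for e in errors) / ss_tot            # coefficient of determination
    return {"r2": r2, "mse": mse, "mae": mae, "mape": mape}


# Hypothetical redox potentials in volts (illustrative values only).
metrics = regression_metrics([-1.0, -2.0, -4.0], [-1.0, -2.0, -3.0])
print(metrics)
```

Because MAPE divides by the true value, it grows without bound as the true value approaches zero, which is worth keeping in mind when reading percentage errors for potentials close to 0 V.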


2020 ◽  
Vol 42 (3) ◽  
pp. 135-147 ◽  
Author(s):  
Michael Behr ◽  
Saba Saiel ◽  
Valerie Evans ◽  
Dinesh Kumbhare

Fibromyalgia (FM) diagnosis remains a challenge for clinicians due to a lack of objective diagnostic tools. One proposed solution is the use of quantitative ultrasound (US) techniques, such as image texture analysis, which has demonstrated discriminatory capabilities with other chronic pain conditions. Building on this, we propose the use of image texture variables to construct and compare two machine learning models (support vector machine [SVM] and logistic regression) for differentiating between the trapezius muscle in healthy and FM patients. US videos of the right and left trapezius muscle were acquired from healthy participants (n = 51) and those with FM (n = 57). The videos were converted into 64,800 skeletal muscle regions of interest (ROIs) using MATLAB. The ROIs were filtered by an algorithm using the complex wavelet structural similarity index (CW-SSIM), which removed ROIs that were similar. Thirty-one texture variables were extracted from the ROIs and then used in nested cross-validation to construct SVM and elastic net regularized logistic regression models. The generalized performance accuracy of both models was estimated and confirmed with a final validation on a holdout test set. The predicted generalized performance accuracy of the SVM and logistic regression models was computed to be 83.9 ± 2.6% and 65.8 ± 1.7%, respectively. The models achieved accuracies of 84.1% and 66.0%, respectively, on the final holdout test set, validating the performance estimates. Although both machine learning models differentiate between healthy trapezius muscle and that of patients with FM, only the SVM model demonstrated clinically relevant performance levels.


Author(s):  
Yu. S. Fedorenko

A statistical testing technique is considered for comparing the metric values of machine learning models on a test set. Since metric values depend not only on the models but also on the data, different models may turn out to be the best on different test sets. For this reason, the traditional approach of directly comparing metric values on a single test set is often not enough. Sometimes the results obtained through cross-validation are compared statistically, but in this case the independence of the measurements cannot be guaranteed, which rules out the Student's t-test. There are tests that do not require independent measurements, but they have less power. For additive metrics, this paper proposes a technique in which the test sample is divided into N parts and the metric values are calculated on each part. Since the value on each part is obtained as a sum of independent random variables, by the central limit theorem the metric values on the N parts are realizations of an approximately normally distributed random variable. To estimate the required sample size, it is proposed to use normality tests and quantile-quantile plots. A modification of the Student's t-test can then be used to compare the mean values of the metrics. A simplified approach is also considered, in which confidence intervals are built for the base model; a model whose metric values fall outside this interval is considered to behave differently from the base model. This approach reduces the amount of computation needed; however, an experimental analysis of the binary cross-entropy metric for CTR (click-through rate) prediction models showed that it is coarser than the first one.
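
The procedure described in this abstract (split the test sample into N parts, compute an additive metric on each part, then compare the per-part means with a t-test) can be sketched roughly as follows. The abstract does not say which modification of the Student's t-test is used, so Welch's unequal-variance version is assumed here purely for illustration. The synthetic click data, the two toy models, and all function names are hypothetical.

```python
import math
import random
from statistics import mean, variance


def log_loss(y, p, eps=1e-12):
    """Binary cross-entropy of one predicted probability p for label y (0 or 1)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def per_part_means(losses, n_parts):
    """Mean loss on each of n_parts equal-sized, non-overlapping chunks."""
    size = len(losses) // n_parts  # any remainder at the end is dropped
    return [mean(losses[k * size:(k + 1) * size]) for k in range(n_parts)]


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom (assumed variant)."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Synthetic CTR-like data (hypothetical): true click probabilities in [0.01, 0.2].
rng = random.Random(0)
probs = [rng.uniform(0.01, 0.2) for _ in range(20000)]
labels = [1 if rng.random() < p else 0 for p in probs]

loss_a = [log_loss(y, p) for y, p in zip(labels, probs)]  # toy model A: predicts the true probability
loss_b = [log_loss(y, 0.1) for y in labels]               # toy model B: constant prediction

parts_a = per_part_means(loss_a, 20)
parts_b = per_part_means(loss_b, 20)
t_stat, dof = welch_t(parts_a, parts_b)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

A negative t statistic here means model A has the lower mean cross-entropy. Turning the statistic into a p-value requires the t distribution with `dof` degrees of freedom, which the standard library does not provide. Since both models are scored on the same parts, a paired test may also be a reasonable choice; the sketch does not claim to reproduce the paper's exact test.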


2020 ◽  
Author(s):  
Kenneth W. Chapman ◽  
Troy E. Gilmore ◽  
Christian D. Chapman ◽  
Mehrube Mehrubeoglu ◽  
Aaron R. Mittelstet

Abstract. Time-lapse imagery of streams and rivers provides new qualitative insights into hydrologic conditions at stream gauges, especially when site visits are biased toward baseflow or fair-weather conditions. Imagery from fixed, ground-based cameras is also rich in quantitative information that can improve streamflow monitoring. For instance, time-lapse imagery may be valuable for filling data gaps when sensors fail and/or during lapses in funding for monitoring programs. In this study, we automated the analysis of time-lapse imagery from a single camera at a single location, then built and tested machine learning models using programmatically calculable scalar image features to fill data gaps in stream gauge records. Time-lapse images were taken with a fixed, ground-based camera that is part of a documentary watershed imaging project (https://plattebasintimelapse.com/). Features were extracted from 40,000+ daylight images taken at one-hour intervals from 2012 to 2019. The algorithms removed dawn and dusk images that were too dark for feature extraction. The image features were merged with United States Geological Survey (USGS) stage and discharge data (i.e., response variables) from the site based on image capture times and USGS timestamps. We then developed a workflow to identify a suitable feature set for building machine learning models, using a randomly selected training set of 30 % of the images and the remaining 70 % as a test set. Predictions were generated from Multi-layer Perceptron (MLP), Random Forest Regression (RFR), and Support Vector Regression (SVR) models. A Kalman filter was applied to the predictions to remove noise. Error metrics were calculated, including Nash-Sutcliffe Efficiency (NSE), Prediction Bias (PBIAS), RMSE-Standard Deviation Ratio (RSR), and an alternative metric that accounted for seasonal runoff. After suitable features were identified, the dataset was divided into test sets of simulated data gaps for 2015, 2016, and 2017. The training sets for each gap were features from contiguous images and sensor readings before and after the gaps. NSE for the year-long gap predictions ranged from 0.63 to 0.90 for discharge and 0.47 to 0.90 for stage. The predictions for 2015 and 2017 displayed lower prediction errors than those for 2016. The 2016 discharge was significantly higher than that in the training data, which could explain the poorer performance. First and second half-year test sets were created for 2016, along with MLP models trained on before/after training sets for each of the gaps that held discharge measurements similar to those in the gaps. The half-year gap models' predictions improved NSE, PBIAS and RSR. The results show it is possible to extract features from images taken with the downstream-facing camera to build machine learning models that produce accurate stage and discharge predictions. The methods employed should be transferable to other sites with ground-based cameras.
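
The three named error metrics can be written out in a few lines. This is a minimal sketch using the definitions common in hydrology, which the abstract does not state explicitly. In particular, the sign convention for PBIAS (positive when the model underestimates) is an assumption, and the example values are made up.

```python
import math


def _sums(obs, sim):
    """Sum of squared errors and sum of squared deviations of obs from its mean."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - s) ** 2 for o, s in zip(obs, sim))
    sst = sum((o - mean_obs) ** 2 for o in obs)
    return sse, sst


def nse(obs, sim):
    """Nash-Sutcliffe Efficiency: 1 is a perfect fit, 0 matches predicting the mean."""
    sse, sst = _sums(obs, sim)
    return 1.0 - sse / sst


def pbias(obs, sim):
    """Bias in percent; positive means underestimation (assumed convention)."""
    return 100.0 * sum(o - s for o, s in zip(obs, sim)) / sum(obs)


def rsr(obs, sim):
    """RMSE divided by the standard deviation of the observations."""
    sse, sst = _sums(obs, sim)
    return math.sqrt(sse) / math.sqrt(sst)


# Made-up discharge values (illustrative only).
observed = [2.0, 4.0, 6.0, 8.0]
simulated = [1.0, 3.0, 5.0, 7.0]
print(nse(observed, simulated), pbias(observed, simulated), rsr(observed, simulated))
```

With these definitions, RSR equals the square root of (1 − NSE), so the two metrics carry the same information on different scales, while PBIAS captures systematic over- or underestimation separately.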


2020 ◽  
Vol 38 (15_suppl) ◽  
pp. e14071-e14071
Author(s):  
Romain Goussault ◽  
Cécile Frénard ◽  
Eve Maubec ◽  
Philippe Muller ◽  
Ludovic Martin ◽  
...  

e14071 Background: Machine learning methods are new artificial intelligence tools with promising applications in healthcare. We developed and validated 4 machine learning models to predict the response to immunotherapy and targeted therapy in stage IIIc or IV melanoma patients. Methods: This work was conducted on data from 10 centers participating in the French network for Research and Clinical Investigation on Melanoma (RIC-Mel), launched in 2012. In total, 935 patients, corresponding to 1978 systemic treatments, were extracted from the RIC-Mel database. The following data were considered: age, sex, Breslow, melanoma type, ulceration, spontaneous regression, mitotic index, number of invaded lymph nodes, extracapsular extension, mutational status, melanoma stage, number of metastasis sites, lines of treatments, and time between first melanoma excision and metastatic relapse. Treatment response was defined as class 1 (complete response, partial response, or stable disease) or class 2 (progressive disease). We split this cohort into a training set (80%) and a test set (20%). Algorithm performance was evaluated on the test set as the percentage of treatments correctly classified in class 1 or 2. Four machine learning algorithms (linear model, random forest, XGBoost and LightGBM) were compared in terms of performance and interpretation for both types of treatments. Results: The accuracies of the best models for immunotherapy (LightGBM) and targeted therapy (random forest) were 66% and 65%, respectively. The most significant variables for building the models were: stage (IIIc or IV), response to previous treatment lines, age, number of metastasis sites and time between first melanoma excision and metastatic relapse. Conclusions: We present here the first machine learning models to predict the response to immunotherapy and targeted therapy in stage IIIc or IV melanoma patients. The most predictive variables are consistent with the literature. Future development will include data from 18FDG-PET/CT imaging and other recently identified predictive markers, such as circulating DNA, to improve the models' performance.


CJEM ◽  
2020 ◽  
Vol 22 (S1) ◽  
pp. S18-S19
Author(s):  
L. Grant ◽  
X. Xue ◽  
Z. Vajihi ◽  
A. Azuelos ◽  
S. Rosenthal ◽  
...  

Introduction: Emergency department (ED) crowding is a major problem across Canada. We studied the ability of artificial intelligence methods to improve patient flow through the ED by predicting patient disposition using information available at triage and shortly after patients’ arrival in the ED. Methods: This retrospective study included all visits to an urban, academic, adult ED between May 2012 and June 2019. For each visit, 489 variables were extracted including triage data that had been collected for use in the Canadian Triage Assessment Scale (CTAS) and information regarding laboratory tests, radiological tests, consultations and admissions. A training set consisting of all visits from April 2012 up to December 2018 was used to train 5 classes of machine learning models to predict admission to the hospital from the ED. The models were trained to predict admission at the time of the patient's arrival in the ED and every 30 minutes after arrival until 6 hours into their ED stay. The performance of models was compared using the area under the ROC curve (AUC) on a test set consisting of all visits from January 2019 to June 2019. Results: The study included 536,332 visits and the admission rate was 15.0%. Gradient boosting models generally outperformed other machine learning models. A gradient boosting model using all available data at 2 hours after patient arrival in the ED yielded a test set AUC 0.92 [95% CI 0.91-0.93], while a model using only data available at triage yielded an AUC 0.90 [95% CI 0.89-0.91]. The quality of predictions generally improved as predictions were made later in the patient's ED stay leading to an AUC 0.95 [95% CI 0.93-0.96] at 6 hours after arrival. A gradient boosting model with 20 variables available at 2 hours after patient arrival in the ED yielded an AUC 0.91 [95% CI 0.89-0.93]. 
A gradient boosting model that makes predictions at 2 hours after arrival in the ED using only variables that are available at all EDs in the province of Quebec yielded an AUC 0.91 [95% CI 0.89-0.92]. Conclusion: Machine learning can predict admission to a hospital from the ED using variables that are collected as part of routine ED care. Machine learning tools may potentially be used to help ED physicians make faster and more appropriate disposition decisions, decrease unnecessary testing, and alleviate ED crowding.
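
For readers unfamiliar with the AUC values reported in this abstract, the metric can be computed directly from its pairwise definition: the probability that a randomly chosen admitted visit receives a higher score than a randomly chosen non-admitted visit, with ties counted as one half. The following is a minimal sketch of that definition only. The function name and the toy scores are hypothetical, and the quadratic loop is meant for illustration rather than for datasets of the size used in the study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    # A positive scored above a negative counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 1 = admitted, 0 = discharged (made-up scores).
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

In the toy example, three of the four positive-negative pairs are ordered correctly, giving an AUC of 0.75.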


2020 ◽  
Author(s):  
Adrián Mosquera Orgueira ◽  
José Ángel Díaz Arias ◽  
Miguel Cid López ◽  
Andres Peleteiro Raindo ◽  
Beatriz Antelo Rodriguez ◽  
...  

Abstract. Background: 30-40% of patients with Diffuse Large B-cell Lymphoma (DLBCL) have an adverse clinical evolution. The increased understanding of DLBCL biology has shed light on the clinical evolution of this pathology, leading to the discovery of prognostic factors based on gene expression data, genomic rearrangements and mutational subgroups. Nevertheless, additional efforts are needed in order to enable survival predictions at the patient level. This study investigated new machine learning models of survival based on transcriptomic and clinical data. Methods: Gene expression profiling (GEP) data from 2 different publicly available retrospective cohorts were analyzed. Cox regression and unsupervised clustering were performed in order to identify probes associated with overall survival in the largest cohort. Random forests were created to model survival using combinations of GEP data, COO classification and clinical information. Cross-validation was used to compare model results in the training set, and Harrell's concordance index (c-index) was used to assess the models' predictive ability. Results were validated in an independent test set. Results: 233 and 64 patients were included in the training and test set, respectively. Initially we derived and validated a 4-gene expression clusterization that was independently associated with lower survival in 20% of patients. These genes were TNFRSF9, BIRC3, BCL2L1 and G3BP2. Thereafter, we applied machine-learning models to predict survival. A set of 102 genes was highly predictive of disease outcome, outperforming available clinical information and COO classification. The final best model integrated clinical information, COO classification, 4-gene-based clusterization and 50 gene expression data (training set c-index, 0.8404; test set c-index, 0.7942).
Conclusion: This study indicates that modelling DLBCL survival with transcriptomic-based machine learning algorithms can largely outperform other important prognostic variables such as disease stage and COO.
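
Harrell's concordance index, used in this abstract to score the survival models, can be illustrated with a short sketch. A pair of patients is comparable when the one with the shorter follow-up time actually had the event, and the pair is concordant when that patient also has the higher predicted risk. The handling of ties (tied risks count one half, tied times are skipped) is a simplifying assumption, and the function name and toy data are hypothetical.

```python
def concordance_index(times, events, risks):
    """Harrell's c-index for right-censored data.

    times  : follow-up time per patient
    events : 1 if the event (death) was observed, 0 if censored
    risks  : predicted risk score, higher means shorter expected survival
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier failure in a pair
        for j in range(n):
            if times[i] < times[j]:  # patient i failed before patient j's last follow-up
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort (made-up values): the second patient is censored at time 4.
print(concordance_index([2, 4, 6, 8], [1, 0, 1, 1], [0.3, 0.9, 0.8, 0.1]))
```

In the toy cohort there are four comparable pairs, two of which are ordered correctly, so the index is 0.5, the value expected from random risk scores. A value of 1.0 would mean every comparable pair is ranked correctly.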

