DNA Methylation Markers for Pan-Cancer Prediction by Deep Learning

Genes ◽  
2019 ◽  
Vol 10 (10) ◽  
pp. 778 ◽  
Author(s):  
Liu ◽  
Liu ◽  
Pan ◽  
Li ◽  
Yang ◽  
...  

For cancer diagnosis, many DNA methylation markers have been identified. However, few studies have tried to identify DNA methylation markers to diagnose diverse cancer types simultaneously, i.e., pan-cancers. In this study, we tried to identify DNA methylation markers to differentiate cancer samples from the respective normal samples in pan-cancers. We collected whole genome methylation data of 27 cancer types containing 10,140 cancer samples and 3,386 normal samples, and divided all samples into five data sets: one training data set, one validation data set and three test data sets. We applied machine learning to identify DNA methylation markers, and specifically, we constructed diagnostic prediction models by deep learning. We identified two categories of markers: 12 CpG markers and 13 promoter markers. Three of the 12 CpG markers and four of the 13 promoter markers are located in cancer-related genes. With the CpG markers, our model achieved an average sensitivity and specificity on the test data sets of 92.8% and 90.1%, respectively. For promoter markers, the average sensitivity and specificity on the test data sets were 89.8% and 81.1%, respectively. Furthermore, in cell-free DNA methylation data of 163 prostate cancer samples, the CpG markers achieved a sensitivity of 100%, and the promoter markers achieved 92%. For both marker types, the specificity on normal whole blood was 100%. To conclude, we identified methylation markers to diagnose pan-cancers, which might be applied to liquid biopsy of cancers.

Heart ◽  
2018 ◽  
Vol 104 (23) ◽  
pp. 1921-1928 ◽  
Author(s):  
Ming-Zher Poh ◽  
Yukkee Cheung Poh ◽  
Pak-Hei Chan ◽  
Chun-Ka Wong ◽  
Louise Pun ◽  
...  

Objective: To evaluate the diagnostic performance of a deep learning system for automated detection of atrial fibrillation (AF) in photoplethysmographic (PPG) pulse waveforms. Methods: We trained a deep convolutional neural network (DCNN) to detect AF in 17 s PPG waveforms using a training data set of 149 048 PPG waveforms constructed from several publicly available PPG databases. The DCNN was validated using an independent test data set of 3039 smartphone-acquired PPG waveforms from adults at high risk of AF at a general outpatient clinic against ECG tracings reviewed by two cardiologists. Six established AF detectors based on handcrafted features were evaluated on the same test data set for performance comparison. Results: In the validation data set (3039 PPG waveforms) consisting of three sequential PPG waveforms from 1013 participants (mean (SD) age, 68.4 (12.2) years; 46.8% men), the prevalence of AF was 2.8%. The area under the receiver operating characteristic curve (AUC) of the DCNN for AF detection was 0.997 (95% CI 0.996 to 0.999) and was significantly higher than all the other AF detectors (AUC range: 0.924–0.985). The sensitivity of the DCNN was 95.2% (95% CI 88.3% to 98.7%), specificity was 99.0% (95% CI 98.6% to 99.3%), positive predictive value (PPV) was 72.7% (95% CI 65.1% to 79.3%) and negative predictive value (NPV) was 99.9% (95% CI 99.7% to 100%) using a single 17 s PPG waveform. Using the three sequential PPG waveforms in combination (<1 min in total), the sensitivity was 100.0% (95% CI 87.7% to 100%), specificity was 99.6% (95% CI 99.0% to 99.9%), PPV was 87.5% (95% CI 72.5% to 94.9%) and NPV was 100% (95% CI 99.4% to 100%). Conclusions: In this evaluation of PPG waveforms from adults screened for AF in a real-world primary care setting, the DCNN had high sensitivity, specificity, PPV and NPV for detecting AF, outperforming other state-of-the-art methods based on handcrafted features.
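The PPV and NPV quoted above depend on the 2.8% prevalence of AF as well as on sensitivity and specificity. As a rough consistency check, not taken from the study itself, the following sketch applies Bayes' rule to the single-waveform figures; the function name `predictive_values` is illustrative only.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) from sensitivity, specificity and prevalence via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Single 17 s waveform figures quoted in the abstract (rounded values).
ppv, npv = predictive_values(sensitivity=0.952, specificity=0.990, prevalence=0.028)
print(f"PPV = {ppv:.3f}, NPV = {npv:.3f}")
```

Because the inputs are rounded, the result can only be expected to land near the reported 72.7% and 99.9%, not to match them exactly. The check mainly illustrates why PPV stays well below specificity when prevalence is low.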


2014 ◽  
Author(s):  
Adrin Jalali ◽  
Nico Pfeifer

Motivation: Molecular measurements from cancer patients such as gene expression and DNA methylation are usually very noisy. Furthermore, cancer types can be very heterogeneous. Therefore, one of the main assumptions for machine learning, that the underlying unknown distribution is the same for all samples, might not be completely fulfilled. We introduce a method that can estimate this bias on a per-feature level and incorporate calculated feature confidences into a weighted combination of classifiers with disjoint feature sets. Results: The new method achieves state-of-the-art performance on many different cancer data sets with measured DNA methylation or gene expression. Moreover, we show how to visualize the learned classifiers to find interesting associations with the target label. Applied to a leukemia data set, we find several ribosomal proteins associated with leukemia's risk group that might be interesting targets for follow-up studies and support the hypothesis that the ribosomes are a new frontier in gene regulation. Availability: The method is available under GPLv3+ License at https://github.com/adrinjalali/Network-Classifier.


2006 ◽  
Vol 29 (1) ◽  
pp. 153-162
Author(s):  
Pratul Kumar Saraswati ◽  
Sanjeev V Sabnis

Paleontologists use statistical methods for prediction and classification of taxa. Over the years, the statistical analyses of morphometric data have been carried out under the assumption of multivariate normality. In an earlier study, three closely resembling species of a biostratigraphically important genus, Nummulites, were discriminated by multi-group discrimination. Two discriminant functions that used diameter and thickness of the tests and height and length of chambers in the final whorl accounted for nearly 100% discrimination. In this paper, Classification and Regression Tree (CART), a non-parametric method, is used for classification and prediction of the same data set. In all, 111 iterations of the CART methodology are performed by splitting the data set of 55 observations into training, validation and test data sets in varying proportions. In the validation data sets, 40% of the iterations are correctly classified, and only one case of misclassification is noted in 49% of the iterations. As regards test data sets, nearly 70% contain no misclassification cases, whereas in about 25% of the test data sets only one case of misclassification is found. The results suggest that the method is highly successful in assigning an individual to a particular species. The key variables on the basis of which tree models are built are combinations of thickness of the test (T), height of the chambers in the final whorl (HL) and diameter of the test (D). Both discriminant analysis and CART thus appear to be comparable in discriminating the three species. However, CART reduces the number of requisite variables without increasing the misclassification error. The method is very useful for professional geologists for quick identification of species.


2018 ◽  
Vol 2018 ◽  
pp. 1-6 ◽  
Author(s):  
Gen-Min Lin ◽  
Mei-Juan Chen ◽  
Chia-Hung Yeh ◽  
Yu-Yang Lin ◽  
Heng-Yu Kuo ◽  
...  

Entropy images, representing the complexity of original fundus photographs, may strengthen the contrast between diabetic retinopathy (DR) lesions and unaffected areas. The aim of this study is to compare the detection performance for severe DR between original fundus photographs and entropy images by deep learning. A sample of 21,123 interpretable fundus photographs obtained from a publicly available data set was expanded to 33,000 images by rotating and flipping. All photographs were transformed into entropy images using block size 9 and downsized to a standard resolution of 100 × 100 pixels. The stages of DR are classified into 5 grades based on the International Clinical Diabetic Retinopathy Disease Severity Scale: Grade 0 (no DR), Grade 1 (mild nonproliferative DR), Grade 2 (moderate nonproliferative DR), Grade 3 (severe nonproliferative DR), and Grade 4 (proliferative DR). Of these 33,000 photographs, 30,000 images were randomly selected as the training set, and the remaining 3,000 images were used as the testing set. Both the original fundus photographs and the entropy images were used as the inputs of a convolutional neural network (CNN), and the results of detecting referable DR (Grades 2–4) as the outputs from the two data sets were compared. The detection accuracy, sensitivity, and specificity using the original fundus photographs data set were 81.80%, 68.36%, and 89.87%, respectively; for the entropy images data set, the figures significantly increased to 86.10%, 73.24%, and 93.81%, respectively (all p values <0.001). The entropy image quantifies the amount of information in the fundus photograph and efficiently accelerates the generation of feature maps in the CNN. The results lead to the conclusion that transformed entropy imaging of fundus photographs can increase the machine detection accuracy, sensitivity, and specificity of referable DR for the deep learning-based system.
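The abstract does not spell out how the entropy images are computed. One plausible reading, offered here only as an assumption, is a local Shannon entropy of the grey-level histogram in a 9 × 9 block around each pixel. The sketch below follows that reading; the function name `entropy_image` and the clipping of blocks at the image border are choices of this sketch, not of the study.

```python
import math
from collections import Counter


def entropy_image(image, block_size=9):
    """Map each pixel to the Shannon entropy (bits) of its local block.

    `image` is a list of rows of integer grey levels. Blocks are clipped
    at the image border, which is an assumption of this sketch.
    """
    height, width = len(image), len(image[0])
    half = block_size // 2
    result = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            # Histogram of grey levels inside the (clipped) block.
            counts = Counter(
                image[rr][cc]
                for rr in range(max(0, r - half), min(height, r + half + 1))
                for cc in range(max(0, c - half), min(width, c + half + 1))
            )
            total = sum(counts.values())
            result[r][c] = -sum(
                (n / total) * math.log2(n / total) for n in counts.values()
            )
    return result


# Toy inputs: a uniform image and an image with a vertical edge.
flat = [[128] * 12 for _ in range(12)]
edge = [[0] * 6 + [255] * 6 for _ in range(12)]
flat_entropy = entropy_image(flat)
edge_entropy = entropy_image(edge)
```

On the uniform image every block has zero entropy, while blocks that straddle the edge approach one bit. This is the sense in which such a transform could raise the contrast between textured lesions and smooth background.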


2021 ◽  
pp. 543-551
Author(s):  
Daiju Ueda ◽  
Akira Yamamoto ◽  
Tsutomu Takashima ◽  
Naoyoshi Onoda ◽  
Satoru Noda ◽  
...  

PURPOSE The molecular subtype of breast cancer is an important component of establishing the appropriate treatment strategy. In clinical practice, molecular subtypes are determined by receptor expressions. In this study, we developed a model using deep learning to determine receptor expressions from mammograms. METHODS A developing data set and a test data set were generated from mammograms from the affected side of patients who were pathologically diagnosed with breast cancer from January 2006 through December 2016 and from January 2017 through December 2017, respectively. The developing data sets were used to train and validate the DL-based model with five-fold cross-validation for classifying expression of estrogen receptor (ER), progesterone receptor (PgR), and human epidermal growth factor receptor 2-neu (HER2). The area under the curves (AUCs) for each receptor were evaluated with the independent test data set. RESULTS The developing data set and the test data set included 1,448 images (997 ER-positive and 386 ER-negative, 641 PgR-positive and 695 PgR-negative, and 220 HER2-enriched and 1,109 non–HER2-enriched) and 225 images (176 ER-positive and 40 ER-negative, 101 PgR-positive and 117 PgR-negative, and 53 HER2-enriched and 165 non–HER2-enriched), respectively. The AUC of ER-positive or -negative in the test data set was 0.67 (0.58-0.76), the AUC of PgR-positive or -negative was 0.61 (0.53-0.68), and the AUC of HER2-enriched or non–HER2-enriched was 0.75 (0.68-0.82). CONCLUSION The DL-based model effectively classified the receptor expressions from the mammograms. Applying the DL-based model to predict breast cancer classification with a noninvasive approach would have additive value to patients.


2021 ◽  
Author(s):  
Hye-Won Hwang ◽  
Jun-Ho Moon ◽  
Min-Gyu Kim ◽  
Richard E. Donatelli ◽  
Shin-Jae Lee

ABSTRACT Objectives To compare an automated cephalometric analysis based on the latest deep learning method of automatically identifying cephalometric landmarks (AI) with previously published AI according to the test style of the worldwide AI challenges at the International Symposium on Biomedical Imaging conferences held by the Institute of Electrical and Electronics Engineers (IEEE ISBI). Materials and Methods This latest AI was developed by using a total of 1983 cephalograms as training data. In the training procedures, a modification of a contemporary deep learning method, YOLO version 3 algorithm, was applied. Test data consisted of 200 cephalograms. To follow the same test style of the AI challenges at IEEE ISBI, a human examiner manually identified the IEEE ISBI-designated 19 cephalometric landmarks, both in training and test data sets, which were used as references for comparison. Then, the latest AI and another human examiner independently detected the same landmarks in the test data set. The test results were compared by the measures that appeared at IEEE ISBI: the success detection rate (SDR) and the success classification rates (SCR). Results SDR of the latest AI in the 2-mm range was 75.5% and SCR was 81.5%. These were greater than any other previous AIs. Compared to the human examiners, AI showed a superior success classification rate in some cephalometric analysis measures. Conclusions This latest AI seems to have superior performance compared to previous AI methods. It also seems to demonstrate cephalometric analysis comparable to human examiners.
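The success detection rate is commonly understood as the share of landmarks that fall within a given distance of the reference position. A minimal sketch under that assumption follows; the coordinates are hypothetical and the function name `success_detection_rate` is illustrative, not taken from the study or from the IEEE ISBI challenge code.

```python
import math


def success_detection_rate(predicted, reference, threshold_mm=2.0):
    """Fraction of landmarks whose predicted position lies within threshold_mm of the reference."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    hits = sum(
        1 for p, q in zip(predicted, reference)
        if math.dist(p, q) <= threshold_mm
    )
    return hits / len(reference)


# Hypothetical landmark coordinates in millimetres (not data from the study).
reference = [(10.0, 10.0), (25.0, 40.0), (60.0, 15.0), (80.0, 55.0)]
predicted = [(10.5, 10.5), (25.0, 43.0), (61.0, 15.0), (80.0, 56.9)]
sdr = success_detection_rate(predicted, reference)
```

In this toy case three of the four landmarks lie within 2 mm of the reference. Changing `threshold_mm` gives the rate for other precision ranges.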


Author(s):  
Kyungkoo Jun

Background & Objective: This paper proposes a Fourier transform-inspired method to classify human activities from time series sensor data. Methods: Our method begins by decomposing a 1D input signal into 2D patterns, which is motivated by the Fourier conversion. The decomposition is aided by Long Short-Term Memory (LSTM), which captures the temporal dependency from the signal and then produces encoded sequences. The sequences, once arranged into a 2D array, can represent the fingerprints of the signals. The benefit of such a transformation is that we can exploit recent advances in deep learning models for image classification, such as the Convolutional Neural Network (CNN). Results: The proposed model, as a result, is a combination of LSTM and CNN. We evaluate the model on two data sets. For the first data set, which is more standardized than the other, our model outperforms or at least equals previous works. In the case of the second data set, we devise schemes to generate training and testing data by changing the parameters of the window size, the sliding size, and the labeling scheme. Conclusion: The evaluation results show that the accuracy is over 95% in some cases. We also analyze the effect of the parameters on the performance.
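The window size, sliding size and labeling scheme mentioned for the second data set can be illustrated with a short sketch. The majority-vote labeling used here is an assumption, since the abstract does not say which schemes were tried. The data and the function name `sliding_windows` are hypothetical.

```python
from collections import Counter


def sliding_windows(signal, labels, window_size, step):
    """Cut a 1D signal into fixed-length windows and label each by majority vote."""
    samples = []
    for start in range(0, len(signal) - window_size + 1, step):
        end = start + window_size
        majority_label = Counter(labels[start:end]).most_common(1)[0][0]
        samples.append((signal[start:end], majority_label))
    return samples


# Hypothetical sensor readings and per-sample activity labels.
signal = [0.1, 0.2, 0.1, 0.9, 1.1, 1.0, 0.8, 0.2]
labels = ["sit", "sit", "sit", "walk", "walk", "walk", "walk", "sit"]
windows = sliding_windows(signal, labels, window_size=4, step=2)
```

A smaller `step` produces more, heavily overlapping windows, while a larger `window_size` gives each sample more context but makes mixed-label windows more likely.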


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Yahya Albalawi ◽  
Jim Buckley ◽  
Nikola S. Nikolov

Abstract: This paper presents a comprehensive evaluation of data pre-processing and word embedding techniques in the context of Arabic document classification in the domain of health-related communication on social media. We evaluate 26 text pre-processings applied to Arabic tweets within the process of training a classifier to identify health-related tweets. For this task we use the (traditional) machine learning classifiers KNN, SVM, Multinomial NB and Logistic Regression. Furthermore, we report experimental results with the deep learning architectures BLSTM and CNN for the same text classification problem. Since word embeddings are more typically used as the input layer in deep networks, in the deep learning experiments we evaluate several state-of-the-art pre-trained word embeddings with the same text pre-processing applied. To achieve these goals, we use two data sets: one for both training and testing, and another for testing the generality of our models only. Our results point to the conclusion that only four out of the 26 pre-processings improve the classification accuracy significantly. For the first data set of Arabic tweets, we found that Mazajak CBOW pre-trained word embeddings as the input to a BLSTM deep network led to the most accurate classifier with F1 score of 89.7%. For the second data set, Mazajak Skip-Gram pre-trained word embeddings as the input to BLSTM led to the most accurate model with F1 score of 75.2% and accuracy of 90.7% compared to F1 score of 90.8% achieved by Mazajak CBOW for the same architecture but with lower accuracy of 70.89%. Our results also show that the performance of the best traditional classifier we trained is comparable to the deep learning methods on the first data set, but significantly worse on the second data set.


2021 ◽  
Author(s):  
David Cotton ◽  

Introduction

HYDROCOASTAL is a two-year project funded by ESA, with the objective of maximising the exploitation of SAR and SARin altimeter measurements in the coastal zone and inland waters by evaluating and implementing new approaches to process SAR and SARin data from CryoSat-2 and SAR altimeter data from Sentinel-3A and Sentinel-3B. Optical data from the Sentinel-2 MSI and Sentinel-3 OLCI instruments will also be used in generating River Discharge products.

New SAR and SARin processing algorithms for the coastal zone and inland waters will be developed, implemented and evaluated through an initial Test Data Set for selected regions. From the results of this evaluation, a processing scheme will be implemented to generate global coastal zone and river discharge data sets.

A series of case studies will assess these products in terms of their scientific impacts.

All the produced data sets will be available on request to external researchers, and full descriptions of the processing algorithms will be provided.

Objectives

The scientific objectives of HYDROCOASTAL are to enhance our understanding of interactions between the inland water and coastal zone, between the coastal zone and the open ocean, and of the small-scale processes that govern these interactions. The project also aims to improve our capability to characterize the variation at different time scales of inland water storage, exchanges with the ocean and the impact on regional sea-level changes.

The technical objectives are to develop and evaluate new SAR and SARin altimetry processing techniques in support of the scientific objectives, including stack processing, filtering, and retracking. An improved Wet Troposphere Correction will also be developed and evaluated.

Project Outline

There are four tasks in the project:

- Scientific Review and Requirements Consolidation: Review the current state of the art in SAR and SARin altimeter data processing as applied to the coastal zone and to inland waters.
- Implementation and Validation: New processing algorithms will be implemented to generate Test Data Sets, which will be validated against models, in-situ data, and other satellite data sets. Selected algorithms will then be used to generate global coastal zone and river discharge data sets.
- Impacts Assessment: The impact of these global products will be assessed in a series of Case Studies.
- Outreach and Roadmap: Outreach material will be prepared and distributed to engage with the wider scientific community and provide recommendations for development of future missions and future research.

Presentation

The presentation will provide an overview of the project and present the different SAR altimeter processing algorithms that are being evaluated in the first phase of the project, along with early results from the evaluation of the initial test data set.


2021 ◽  
Author(s):  
Louise Bloch ◽  
Christoph M. Friedrich

Abstract Background: The prediction of whether subjects with Mild Cognitive Impairment (MCI) will prospectively develop Alzheimer's Disease (AD) is important for the recruitment and monitoring of subjects for therapy studies. Machine Learning (ML) is suitable to improve early AD prediction. The etiology of AD is heterogeneous, which leads to noisy data sets. Additional noise is introduced by multicentric study designs and varying acquisition protocols. This article examines whether an automatic and fair data valuation method based on Shapley values can identify subjects with noisy data. Methods: An ML workflow was developed and trained for a subset of the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort. The validation was executed for an independent ADNI test data set and for the Australian Imaging, Biomarker and Lifestyle Flagship Study of Ageing (AIBL) cohort. The workflow included volumetric Magnetic Resonance Imaging (MRI) feature extraction, subject sample selection using data Shapley, Random Forest (RF) and eXtreme Gradient Boosting (XGBoost) for model training and Kernel SHapley Additive exPlanations (SHAP) values for model interpretation. This model interpretation enables clinically relevant explanation of individual predictions. Results: The XGBoost models which excluded 116 of the 467 subjects from the training data set based on their Logistic Regression (LR) data Shapley values outperformed the models trained on the entire training data set, which reached a mean classification accuracy of 58.54 %, by 14.13 % (8.27 percentage points) on the independent ADNI test data set. The XGBoost models trained on the entire training data set reached a mean accuracy of 60.35 % for the AIBL data set. An improvement of 24.86 % (15.00 percentage points) could be reached for the XGBoost models if those 72 subjects with the smallest RF data Shapley values were excluded from the training data set. Conclusion: The data Shapley method was able to improve the classification accuracies for the test data sets. Noisy data was associated with the number of ApoE ε4 alleles and volumetric MRI measurements. Kernel SHAP showed that the black-box models learned biologically plausible associations.
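The results quote each improvement twice, once as a relative change and once in percentage points. The short sketch below, which is only an arithmetic check and not part of the study, shows how the two relate using the figures from the abstract. The function name `relative_improvement` is illustrative.

```python
def relative_improvement(baseline_pct, gain_points):
    """Relative improvement (in %) corresponding to an absolute gain in percentage points."""
    return 100.0 * gain_points / baseline_pct


# Figures quoted in the abstract: baseline accuracy and gain in percentage points.
adni = relative_improvement(baseline_pct=58.54, gain_points=8.27)
aibl = relative_improvement(baseline_pct=60.35, gain_points=15.00)
print(f"ADNI: {adni:.2f} %, AIBL: {aibl:.2f} %")
```

Both values agree with the quoted 14.13 % and 24.86 % up to rounding, which suggests the relative figures are computed against the accuracy of the models trained on the entire training data set.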

