From Organized High-Throughput Data to Phenomenological Theory using Machine Learning: The Example of Dielectric Breakdown

Abstract

Background: Hepatocellular carcinoma (HCC) is one of the most common cancers. The discovery of specific genes serving as biomarkers is of paramount significance for cancer diagnosis and prognosis. The high-throughput omics data generated by The Cancer Genome Atlas (TCGA) consortium provide a valuable resource for the discovery of HCC biomarker genes. Numerous methods have been proposed to select cancer biomarkers. However, these methods have not investigated how robust the identification is across different feature selection techniques.

Methods: We use six different recursive feature elimination methods to select the gene signatures of HCC from TCGA liver cancer data. The genes shared by the six selected subsets are proposed as robust biomarkers. The Akaike information criterion (AIC) is employed to explain the optimization process of feature selection, which provides a statistical interpretation of feature selection in machine learning methods. We use several methods to validate the screened biomarkers.

Results: We propose a robust method for discovering biomarker genes for HCC from gene expression data. Specifically, we implement recursive feature elimination with cross-validation (RFE-CV) based on six different classification algorithms. The overlaps among the gene sets discovered by the different methods are referred to as the identified biomarkers. We interpret the machine-learning feature selection process using the AIC from statistics. Furthermore, the features selected by backward stepwise logistic regression under the minimum-AIC criterion are completely contained in the identified biomarkers. The classification results verify the superiority of the interpretable, robust biomarker discovery method.

Conclusions: The overlaps among the gene subsets selected by RFE-CV with the six classifiers contain different numbers of features. The AIC values in the model selection provide a theoretical foundation for the feature selection process of biomarker discovery via machine learning. Moreover, genes contained in more of the optimally selected subsets make better biological sense. The quality of feature selection is improved by intersecting the biomarkers selected by different classifiers. This is a general method suitable for screening biomarkers of complex diseases from high-throughput data.
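The overlap step and the AIC described in the abstract above can be illustrated with a minimal sketch. It assumes the six RFE-CV runs have already produced their gene subsets; the gene names (`GENE_A`, `GENE_B`, ...) and the function names are hypothetical placeholders, not taken from the paper. The AIC uses the standard definition, AIC = 2k - 2 ln(L), where k is the number of fitted parameters and ln(L) is the maximized log-likelihood.

```python
from collections import Counter


def robust_biomarkers(subsets):
    """Return the genes present in every selected subset (the overlap)."""
    sets = [set(s) for s in subsets]
    return set.intersection(*sets) if sets else set()


def selection_counts(subsets):
    """Count in how many subsets each gene was selected."""
    return Counter(gene for s in subsets for gene in set(s))


def aic(log_likelihood, n_params):
    """Akaike information criterion: AIC = 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical output of six feature-selection runs (placeholder gene names).
selected_subsets = [
    ["GENE_A", "GENE_B", "GENE_C"],
    ["GENE_A", "GENE_B", "GENE_D"],
    ["GENE_A", "GENE_B"],
    ["GENE_A", "GENE_B", "GENE_C", "GENE_E"],
    ["GENE_B", "GENE_A", "GENE_F"],
    ["GENE_A", "GENE_B", "GENE_C"],
]

shared = robust_biomarkers(selected_subsets)
counts = selection_counts(selected_subsets)
```

Here `shared` holds the genes chosen by all six runs, and `counts` ranks the remaining genes by how many runs selected them, which mirrors the abstract's observation that genes contained in more subsets are the more credible candidates.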


Understanding protein dispensability through machine-learning analysis of high-throughput data

Bioinformatics ◽

10.1093/bioinformatics/bti058 ◽

2004 ◽

Vol 21 (5) ◽

pp. 575-581 ◽

Cited By ~ 59

Author(s):

Y. Chen ◽

D. Xu

Keyword(s):

Machine Learning ◽

High Throughput ◽

High Throughput Data ◽

Learning Analysis


Identifying models of dielectric breakdown strength from high-throughput data via genetic programming

Scientific Reports ◽

10.1038/s41598-017-17535-3 ◽

2017 ◽

Vol 7 (1) ◽

Cited By ~ 7

Author(s):

Fenglin Yuan ◽

Tim Mueller

Keyword(s):

Genetic Programming ◽

High Throughput ◽

Dielectric Breakdown ◽

Breakdown Strength ◽

Dielectric Breakdown Strength ◽

High Throughput Data


Multidimensional Classification of Catalysts in Oxidative Coupling of Methane through Machine Learning and High-Throughput Data

The Journal of Physical Chemistry Letters ◽

10.1021/acs.jpclett.0c01926 ◽

2020 ◽

Vol 11 (16) ◽

pp. 6819-6826

Author(s):

Keisuke Takahashi ◽

Lauren Takahashi ◽

Thanh Nhat Nguyen ◽

Ashutosh Thakur ◽

Toshiaki Taniike

Keyword(s):

Machine Learning ◽

High Throughput ◽

Oxidative Coupling ◽

Oxidative Coupling Of Methane ◽

High Throughput Data


Faculty Opinions recommendation of Finding disease genes: a fast and flexible approach for analyzing high-throughput data.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.13277014.14635129 ◽

2011 ◽

Author(s):

Alejandro Schaffer

Keyword(s):

High Throughput ◽

Disease Genes ◽

Flexible Approach ◽

High Throughput Data


High Throughput Ultrasonic Multi-implant Readout Using a Machine-Learning Assisted CDMA Receiver

2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC) ◽

10.1109/embc44109.2020.9176480 ◽

2020 ◽

Author(s):

Sina Faraji Alamouti ◽

Mohammad Meraj Ghanbari ◽

Nathan Tessema Ersumo ◽

Rikky Muller

Keyword(s):

Machine Learning ◽

High Throughput ◽

Cdma Receiver


Robust and Efficient Parametric Spectral Density Estimation for High-Throughput Data

Technometrics ◽

10.1080/00401706.2021.1884134 ◽

2021 ◽

pp. 1-22

Author(s):

Martin Lysy ◽

Feiyu Zhu ◽

Bryan Yates ◽

Aleksander Labuda

Keyword(s):

Spectral Density ◽

Density Estimation ◽

High Throughput ◽

Spectral Density Estimation ◽

High Throughput Data


Accelerating organic solar cell material's discovery: high-throughput screening and big data

Energy & Environmental Science ◽

10.1039/d1ee00559f ◽

2021 ◽

Author(s):

Xabier Rodríguez-Martínez ◽

Enrique Pascual-San-José ◽

Mariano Campoy-Quiles

Keyword(s):

Machine Learning ◽

Big Data ◽

High Throughput ◽

Organic Solar Cells ◽

High Throughput Screening ◽

Organic Solar Cell ◽

State Of The Art ◽

Review Article ◽

Machine Learning Algorithms ◽

Device Optimization

This review article presents the state-of-the-art in high-throughput computational and experimental screening routines with application in organic solar cells, including materials discovery, device optimization and machine-learning algorithms.


FRI0585 HIGH-THROUGHPUT METHODOLOGY FOR EMR-BASED IDENTIFICATION OF CLINICAL SUB-PHENOTYPES IN COMPLEX PATIENT POPULATIONS

Annals of the Rheumatic Diseases ◽

10.1136/annrheumdis-2020-eular.3489 ◽

2020 ◽

Vol 79 (Suppl 1) ◽

pp. 897.2-897

Author(s):

M. Maurits ◽

T. Huizinga ◽

M. Reinders ◽

S. Raychaudhuri ◽

E. Karlson ◽

...

Keyword(s):

Machine Learning ◽

Risk Factors ◽

Dimensionality Reduction ◽

High Throughput ◽

Brain Cancer ◽

Machine Learning Techniques ◽

Summary Statistics ◽

Medical Problems ◽

Learning Techniques ◽

Icd Codes

Background: Heterogeneity in disease populations complicates discovery of risk factors. To identify risk factors for subpopulations of diseases, we need analytical methods that can deal with unidentified disease subgroups.

Objectives: Inspired by successful approaches from the Big Data field, we developed a high-throughput approach to identify subpopulations within patients with heterogeneous, complex diseases using the wealth of information available in Electronic Medical Records (EMRs).

Methods: We extracted longitudinal healthcare-interaction records coded by 1,853 PheCodes [1] of the 64,819 patients from Boston's Partners Biobank. Through dimensionality reduction using t-SNE [2] we created a 2D embedding of 32,424 of these patients (set A). We then identified distinct clusters post-t-SNE using DBSCAN [3] and visualized the relative importance of individual PheCodes within them using specialized spectrographs. We replicated this procedure in the remaining 32,395 records (set B).

Results: Summary statistics of both sets were comparable (Table 1).

Table 1. Summary statistics of the total Partners Biobank dataset and the 2 partitions.

                     Set A         Set B         Total
Entries              12,200,311    12,177,131    24,377,442
Patients             32,424        32,395        64,819
Patient-years        369,546.33    368,597.92    738,144.2
Unique ICD codes     25,056        24,953        26,305
Unique PheCodes      1,851         1,853         1,853

We found 284 clusters in set A and 295 in set B, of which 63.4% from set A could be mapped to a cluster in set B with a median (range) correlation of 0.24 (0.03 – 0.58). Clusters represented similar yet distinct clinical phenotypes; e.g. patients diagnosed with "other headache syndrome" were separated into four distinct clusters characterized by migraines, neurofibromatosis, epilepsy or brain cancer, all resulting in patients presenting with headaches (Fig. 1 & 2). Though EMR databases tend to be noisy, our method was also able to differentiate misclassification from true cases; SLE patients with RA codes clustered separately from true RA cases.

Figure 1. Two-dimensional representation of Set A generated using dimensionality reduction (t-SNE) and clustering (DBSCAN).

Figure 2. Phenotype Spectrographs (PheSpecs) of four clusters characterized by "Other headache syndromes", driven by codes relating to migraine, epilepsy, neurofibromatosis or brain cancer.

Conclusion: We have shown that EMR data can be used to identify and visualize latent structure in patient categorizations, using an approach based on dimension reduction and clustering machine learning techniques. Our method can identify misclassified patients as well as separate patients with similar problems into subsets with different associated medical problems. Our approach adds a new and powerful tool to aid in the discovery of novel risk factors in complex, heterogeneous diseases.

References:
[1] Denny, J.C. et al. Bioinformatics (2010)
[2] van der Maaten et al. Journal of Machine Learning Research (2008)
[3] Ester, M. et al. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (1996)

Disclosure of Interests: Marc Maurits: None declared, Thomas Huizinga Grant/research support from: Ablynx, Bristol-Myers Squibb, Roche, Sanofi, Consultant of: Ablynx, Bristol-Myers Squibb, Roche, Sanofi, Marcel Reinders: None declared, Soumya Raychaudhuri: None declared, Elizabeth Karlson: None declared, Erik van den Akker: None declared, Rachel Knevel: None declared
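The clustering step named in the Methods of this abstract (DBSCAN applied to a 2D embedding) can be sketched as follows. This is a small, pure-Python illustration of the density-based clustering idea, not the authors' pipeline: it assumes the 2D embedding has already been computed (the abstract uses t-SNE for that), and the toy coordinates, `eps`, and `min_samples` values are made up for the example. For real data a library implementation would normally be preferable.

```python
from math import dist

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Indices of all points within distance eps of point i (including i)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned or marked as noise
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster  # i is a core point: start a new cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is also core: keep expanding
        cluster += 1
    return labels


# Toy 2D "embedding" (made-up coordinates): two dense groups and one outlier.
embedding = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(embedding, eps=1.5, min_samples=3)
```

With these toy inputs the two dense groups receive separate cluster ids and the isolated point is labelled as noise, which is the behaviour that lets a density-based method leave unusual records unclustered rather than forcing them into a group.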
