Review of Statistical Learning Methods in Integrated Omics Studies (An Integrated Information Science)

2018 ◽  
Vol 12 ◽  
pp. 117793221875929 ◽  
Author(s):  
Irene Sui Lan Zeng ◽  
Thomas Lumley

Integrated omics is becoming a new channel for investigating complex molecular systems in modern biological science and lays a foundation for systematic learning in precision medicine. The statistical/machine learning methods that have emerged in the past decade for integrated omics are not only innovative but also multidisciplinary, integrating knowledge from biology, medicine, statistics, machine learning, and artificial intelligence. Here, we review the nontrivial classes of learning methods from a statistical perspective and organize them within the statistical learning framework. Two intriguing findings from the review are that these methods generalize to other disciplines with complex systematic structure, and that integrated omics is part of an integrated information science that collates and integrates different types of information for inference and decision making. We review the statistical learning methods of exploratory and supervised learning from 42 publications. We also discuss the strengths and limitations of extended principal component analysis, cluster analysis, network analysis, and regression methods. Statistical techniques such as penalization to induce sparsity when there are fewer observations than features, and Bayesian approaches when prior knowledge is to be integrated, are also included in the commentary. For completeness, a table of currently available software and packages for omics, drawn from 23 publications, is summarized in the appendix.
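
As a hedged illustration of the penalization technique mentioned in the review, the following sketch fits an L1-penalized (lasso) regression when features far outnumber observations; the synthetic data, feature counts, and scikit-learn usage are assumptions for illustration, not taken from the reviewed studies.

```python
# A minimal sketch of sparsity-inducing penalization for p >> n omics-style data,
# using scikit-learn's cross-validated lasso; all numbers are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n_samples, n_features = 60, 5000          # far more features than observations
X = rng.standard_normal((n_samples, n_features))
beta = np.zeros(n_features)
beta[:10] = 2.0                            # only 10 features truly carry signal
y = X @ beta + rng.standard_normal(n_samples)

# The cross-validated L1 penalty shrinks most coefficients exactly to zero.
model = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(model.coef_)
print(f"alpha={model.alpha_:.3f}, selected {selected.size} features")
```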

Sensors ◽  
2020 ◽  
Vol 20 (8) ◽  
pp. 2344 ◽  
Author(s):  
Federico Pittino ◽  
Michael Puggl ◽  
Thomas Moldaschl ◽  
Christina Hirschl

Anomaly detection is becoming increasingly important for enhancing reliability and resiliency in the Industry 4.0 framework. In this work, we investigate different methods for anomaly detection on in-production manufacturing machines, taking into account their variability in both operating and wear conditions. We demonstrate that the nature of the available data, in particular whether or not it contains anomalies, matters for the algorithmic choice, discussing both statistical machine learning methods and control charts. We finally develop methods for automatic anomaly detection that achieve a recall close to one on our data. These methods are designed not to rely on continuous recalibration or hand-tuning by the machine user, allowing them to be deployed robustly and efficiently in an in-production environment.
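
As an illustrative sketch of the control-chart style of anomaly detection discussed above (not the authors' calibrated method), a simple 3-sigma Shewhart-type rule on a single sensor signal might look as follows; the simulated data and thresholds are assumptions.

```python
# A minimal sketch of a Shewhart-style control chart for flagging anomalies in a
# single sensor stream; the simulated baseline, drift, and limits are illustrative only.
import numpy as np

rng = np.random.default_rng(1)
baseline = rng.normal(loc=10.0, scale=0.5, size=500)     # in-control reference run
stream = np.concatenate([rng.normal(10.0, 0.5, 200),
                         rng.normal(12.0, 0.5, 20)])     # drift appears at the end

center = baseline.mean()
sigma = baseline.std(ddof=1)
upper, lower = center + 3 * sigma, center - 3 * sigma    # classic 3-sigma control limits

anomalies = np.flatnonzero((stream > upper) | (stream < lower))
print(f"flagged {anomalies.size} points at indices {anomalies}")
```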


Author(s):  
Osval Antonio Montesinos López ◽  
Abelardo Montesinos López ◽  
Jose Crossa

Abstract Nowadays, huge quantities of data are collected and analyzed to deliver deep insights into biological processes and human behavior. This chapter assesses the use of big data for prediction and estimation through statistical machine learning and its applications in agriculture and genetics in general and, specifically, in genome-based prediction and selection. First, we point out the importance of data and how its use is reshaping our way of living. We also provide the key elements of genomic selection and its potential for plant improvement. In addition, we analyze elements of modeling with machine learning methods applied to genomic selection and stress their importance as a predictive methodology. Two cultures of model building are analyzed and discussed: prediction and inference; by understanding model building, researchers will be able to select the best model/method for each circumstance. Within this context, we explain the differences between nonparametric models (predictors are constructed from information derived from the data) and parametric models (all predictors take predetermined forms with respect to the response), as well as their types of effects: fixed, random, and mixed. Basic elements of linear algebra are provided to facilitate understanding of the contents of the book. This chapter also contains examples of the different types of data using supervised, unsupervised, and semi-supervised learning methods.
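
To make the parametric-versus-nonparametric contrast concrete, the following hedged sketch compares a penalized linear model with a random forest on simulated marker data; the genotype coding, effect sizes, and scikit-learn models are illustrative assumptions rather than the chapter's own examples.

```python
# A minimal sketch contrasting a parametric penalized linear model with a
# nonparametric learner for genome-based prediction; all data are simulated stand-ins.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n_lines, n_markers = 200, 1000
X = rng.integers(0, 3, size=(n_lines, n_markers)).astype(float)  # 0/1/2 genotype codes
effects = rng.normal(0, 0.1, n_markers)
y = X @ effects + rng.normal(0, 1.0, n_lines)                    # phenotype = signal + noise

for name, model in [("ridge (parametric)", Ridge(alpha=10.0)),
                    ("random forest (nonparametric)",
                     RandomForestRegressor(n_estimators=200, random_state=0))]:
    score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean CV R^2 = {score:.2f}")
```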


2020 ◽  
Vol 36 (17) ◽  
pp. 4590-4598
Author(s):  
Robert Page ◽  
Ruriko Yoshida ◽  
Leon Zhang

Abstract

Motivation: Due to new technologies for efficiently generating genome data, machine learning methods are urgently needed to analyze large sets of gene trees over the space of phylogenetic trees. However, the space of phylogenetic trees is not Euclidean, so ordinary machine learning methods cannot be applied directly. In 2019, Yoshida et al. introduced the notion of tropical principal component analysis (PCA), a statistical method for visualization and dimensionality reduction using a tropical polytope with a fixed number of vertices that minimizes the sum of tropical distances between each data point and its tropical projection. However, their work focused on the tropical projective space rather than the space of phylogenetic trees. We focus here on tropical PCA for dimension reduction and visualization over the space of phylogenetic trees.

Results: Our main results are twofold: (i) theoretical interpretations of the tropical principal components over the space of phylogenetic trees, namely, the existence of a tropical cell decomposition into regions of fixed tree topology; and (ii) the development of a stochastic optimization method to estimate tropical PCs over the space of phylogenetic trees using a Markov chain Monte Carlo approach. This method performs well in simulation studies, and it is applied to three empirical datasets: Apicomplexa and African coelacanth genomes as well as sequences of hemagglutinin for influenza from New York.

Availability and implementation: Dataset: http://polytopes.net/Data.tar.gz. Code: http://polytopes.net/tropica_MCMC_codes.tar.gz.

Supplementary information: Supplementary data are available at Bioinformatics online.
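
For orientation only, the following sketch computes the tropical metric and the projection of a point onto a tropical polytope with fixed vertices under the max-plus convention; the example vectors are arbitrary and are not drawn from the paper's tree datasets or its MCMC estimation procedure.

```python
# A minimal sketch of the tropical metric and tropical projection onto a polytope
# spanned by fixed vertices (max-plus convention); example points are illustrative.
import numpy as np

def tropical_distance(x, y):
    """Tropical (projective) metric: max_i(x_i - y_i) - min_i(x_i - y_i)."""
    diff = x - y
    return diff.max() - diff.min()

def tropical_projection(x, vertices):
    """Project x onto the tropical polytope spanned by the rows of `vertices`."""
    lam = (x - vertices).min(axis=1)            # best tropical scalar per vertex
    return (vertices + lam[:, None]).max(axis=0)

vertices = np.array([[0.0, 0.0, 0.0],
                     [0.0, 2.0, 5.0],
                     [0.0, 3.0, 1.0]])
x = np.array([0.0, 4.0, 4.0])
proj = tropical_projection(x, vertices)
print(proj, tropical_distance(x, proj))
```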


2018 ◽  
Author(s):  
Simon Michel ◽  
Didier Swingedouw ◽  
Marie Chavent ◽  
Pablo Ortega ◽  
Juliette Mignot ◽  
...  

Abstract. Modes of climate variability strongly impact our climate and thus human society. Nevertheless, their statistical properties remain poorly known owing to the short time frame of instrumental measurements. Reconstructing these modes further back in time using statistical learning methods applied to proxy records is a useful way to improve our understanding of their behaviours and meteorological impacts. Several statistical reconstruction methods exist for this purpose, among which principal component regression is one of the most widely used. Additional predictive, and hence reconstructive, statistical methods have been developed recently, following the advent of big data. Here, we provide the climate community with a multi-method statistical toolbox, based on four statistical learning methods and cross-validation algorithms, that enables systematic reconstruction of any climate mode of variability as long as proxy records overlap in time with the observed variations of the considered mode. The efficiency of the methods can vary depending on the statistical properties of the mode and of the learning set, thereby allowing the sensitivity to the reconstruction technique to be assessed. The toolbox is modular in the sense that it accepts different inputs, such as the proxy database or the chosen mode of variability. As an example, the toolbox is applied here to the reconstruction of the North Atlantic Oscillation (NAO) using the PAGES 2k database. In order to identify the most reliable reconstruction among those given by the different methods, we also investigate the sensitivity of the methodological setup to other choices, such as the number and nature of the proxy records used as predictors or the targeted reconstruction period. The best NAO reconstruction we thus obtain shows significant correlation with former reconstructions and exhibits better validation scores.
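
As a hedged sketch of the principal component regression baseline named above, the following code cross-validates a PCA-plus-linear-regression pipeline on a synthetic proxy matrix; the data, component count, and scoring are illustrative assumptions, not the toolbox's implementation.

```python
# A minimal sketch of principal component regression for mode reconstruction; the
# proxy matrix and target index are synthetic stand-ins, not PAGES 2k or the observed NAO.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
n_years, n_proxies = 150, 40
proxies = rng.standard_normal((n_years, n_proxies))                 # proxies over the overlap period
target = proxies[:, :5].sum(axis=1) + rng.normal(0, 1.0, n_years)   # NAO-like mode index

pcr = make_pipeline(PCA(n_components=10), LinearRegression())
skill = cross_val_score(pcr, proxies, target, cv=5, scoring="r2")
print(f"cross-validated R^2: {skill.mean():.2f} +/- {skill.std():.2f}")

# Fit on the instrumental overlap, then apply to earlier proxy values to reconstruct.
pcr.fit(proxies, target)
reconstruction = pcr.predict(rng.standard_normal((300, n_proxies)))
```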


Polymers ◽  
2021 ◽  
Vol 13 (5) ◽  
pp. 825
Author(s):  
Kaixin Liu ◽  
Zhengyang Ma ◽  
Yi Liu ◽  
Jianguo Yang ◽  
Yuan Yao

An increasing number of machine learning methods are being applied to infrared non-destructive testing for the assessment of internal defects in composite materials. However, most of them extract only linear features, which does not accord with the nonlinear characteristics of infrared data. Moreover, the limited number of infrared images tends to restrict the data analysis capabilities of machine learning methods. In this work, a novel generative kernel principal component thermography (GKPCT) method is proposed for defect detection in carbon fiber reinforced polymer (CFRP) composites. Specifically, a spectral normalization generative adversarial network is used to augment the thermograms for model construction. Subsequently, the KPCT method maps all thermogram data into a feature space via kernel principal component analysis, which allows defects and background to be differentiated in the dimensionality-reduced data. Additionally, a defect-background separation metric is designed to aid the performance evaluation of data analysis methods. Experimental results on CFRP demonstrate the feasibility and advantages of the proposed GKPCT method.
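
A minimal, hedged sketch of the kernel-PCA step underlying (G)KPCT is given below: thermogram frames are flattened so that each pixel is described by its time profile before kernel principal component analysis; the array shapes and kernel settings are illustrative assumptions, not the authors' configuration.

```python
# A minimal sketch of kernel principal component thermography's projection step;
# random arrays stand in for the (augmented) infrared sequence.
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(3)
n_frames, height, width = 50, 32, 32
thermograms = rng.standard_normal((n_frames, height, width))   # stand-in IR frame sequence

# Reshape so that each pixel is an observation and each frame a feature.
pixel_profiles = thermograms.reshape(n_frames, -1).T           # shape (n_pixels, n_frames)

kpca = KernelPCA(n_components=3, kernel="rbf", gamma=0.01)
components = kpca.fit_transform(pixel_profiles)                # shape (n_pixels, 3)

# Each component can be reshaped back into an image in which defects and
# background are expected to separate after dimensionality reduction.
component_images = components.T.reshape(3, height, width)
print(component_images.shape)
```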


Complexity ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Xiao-Yan Gao ◽  
Abdelmegeid Amin Ali ◽  
Hassan Shaban Hassan ◽  
Eman M. Anwar

Heart disease is one of the deadliest diseases and a leading cause of death worldwide. Machine learning is playing an essential role in medicine. In this paper, ensemble learning methods are used to enhance the performance of heart disease prediction. Two feature extraction methods, linear discriminant analysis (LDA) and principal component analysis (PCA), are used to select essential features from the dataset. Machine learning algorithms and ensemble learning methods are then compared on the selected features. Models are evaluated with several metrics: accuracy, recall, precision, F-measure, and ROC. The results show that the bagging ensemble learning method with a decision tree achieves the best performance.
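
As a hedged sketch of the best-performing configuration reported above, the following pipeline combines PCA feature extraction with a bagging ensemble of decision trees; the synthetic dataset and hyperparameters are illustrative assumptions, not the paper's data or tuned settings.

```python
# A minimal sketch of PCA feature extraction followed by a bagging ensemble of
# decision trees, evaluated with the metrics listed above; data are synthetic.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=300, n_features=13, n_informative=8, random_state=0)

model = make_pipeline(
    PCA(n_components=8),
    BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0),
)
scores = cross_validate(model, X, y, cv=5,
                        scoring=["accuracy", "recall", "precision", "f1", "roc_auc"])
print({k: round(v.mean(), 3) for k, v in scores.items() if k.startswith("test_")})
```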


Author(s):  
Pu Tian

Factorization reduces computational complexity and is therefore an important tool in statistical machine learning of high-dimensional systems. Conventional molecular modeling, including molecular dynamics and Monte Carlo simulations of molecular systems, is a large research field based on approximate factorization of molecular interactions. Recently, the local distribution theory was proposed to factorize the global joint distribution of a given molecular system into trainable local distributions. Belief propagation algorithms are a family of exact factorization algorithms for trees and have been extended to approximate loopy belief propagation algorithms for graphs with loops. Although factorization of probability distributions is their common foundation, computational research on molecular systems and machine learning studies utilizing belief propagation algorithms have been carried out independently, each with its own track of algorithm development. The connections and differences among these factorization algorithms are briefly presented in this perspective, with the hope of stimulating further development of factorization algorithms for physical modeling of complex molecular systems.
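
For a concrete, hedged example of exact factorization on a tree, the following sketch runs sum-product belief propagation on a three-variable chain and checks the marginal against brute-force summation; the potentials are arbitrary illustrative numbers, not any model from the perspective itself.

```python
# A minimal sketch of exact sum-product belief propagation on a three-node chain of
# binary variables; the brute-force check confirms the factorized result is exact on a tree.
import itertools
import numpy as np

psi_12 = np.array([[1.0, 0.5], [0.5, 2.0]])    # pairwise potential between x1 and x2
psi_23 = np.array([[1.5, 1.0], [1.0, 0.8]])    # pairwise potential between x2 and x3

# Messages along the chain x1 -> x2 and x3 -> x2.
m_1to2 = psi_12.sum(axis=0)                    # sum over x1, a function of x2
m_3to2 = psi_23.sum(axis=1)                    # sum over x3, a function of x2
belief_2 = m_1to2 * m_3to2
belief_2 /= belief_2.sum()                     # marginal P(x2) from message products

# Brute-force marginal for comparison.
brute = np.zeros(2)
for x1, x2, x3 in itertools.product([0, 1], repeat=3):
    brute[x2] += psi_12[x1, x2] * psi_23[x2, x3]
brute /= brute.sum()

print(belief_2, brute)                          # identical on a tree-structured graph
```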

