Inverse Potts model improves accuracy of phylogenetic profiling

2021 ◽  
Author(s):  
Tsukasa Fukunaga ◽  
Wataru Iwasaki

Motivation: Phylogenetic profiling is a powerful computational method for revealing the functions of function-unknown genes. Although conventional similarity evaluation measures in phylogenetic profiling showed high prediction accuracy, they have two estimation biases: an evolutionary bias and a spurious correlation bias. Existing studies have focused on the evolutionary bias, but the spurious correlation bias has not been analyzed. Results: To eliminate the spurious correlation bias, we applied an evaluation measure based on the inverse Potts model (IPM) to phylogenetic profiling. We also proposed an evaluation measure to remove both the evolutionary and spurious correlation biases using the IPM. In an empirical dataset analysis, we demonstrated that these IPM-based evaluation measures improved the prediction performance of phylogenetic profiling. In addition, we found that the integration of several evaluation measures, including the IPM-based evaluation measures, had superior performance to a single evaluation measure.
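To make the idea of a conventional similarity measure in phylogenetic profiling concrete, the following minimal sketch compares binary presence/absence profiles of genes across genomes. The Jaccard index and Pearson correlation are assumed here as typical conventional measures, and the gene names and profiles are invented for illustration. This is not the IPM-based measure proposed in the abstract.

```python
from math import sqrt


def jaccard(a, b):
    """Jaccard index between two binary presence/absence profiles."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


def pearson(a, b):
    """Pearson correlation between two equal-length numeric profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant profile carries no co-occurrence signal
    return cov / sqrt(var_a * var_b)


# Hypothetical profiles: 1 = gene present in that genome, 0 = absent.
profiles = {
    "geneA": [1, 1, 0, 0, 1, 0],
    "geneB": [1, 1, 0, 0, 1, 1],
    "geneC": [0, 0, 1, 1, 0, 1],
}

for g1, g2 in [("geneA", "geneB"), ("geneA", "geneC")]:
    print(g1, g2, jaccard(profiles[g1], profiles[g2]),
          pearson(profiles[g1], profiles[g2]))
```

Gene pairs with similar profiles (here geneA and geneB) score highly and are predicted to be functionally related. Pairwise measures like these are the ones the abstract describes as subject to evolutionary and spurious correlation biases.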

2021 ◽  
Vol 54 (1) ◽  
pp. 1-38
Author(s):  
Víctor Adrián Sosa Hernández ◽  
Raúl Monroy ◽  
Miguel Angel Medina-Pérez ◽  
Octavio Loyola-González ◽  
Francisco Herrera

Experts from different domains have resorted to machine learning techniques to produce explainable models that support decision-making. Among existing techniques, decision trees have been useful in many application domains for classification. Decision trees can make decisions in a language that is closer to that of the experts. Many researchers have attempted to create better decision tree models by improving the components of the induction algorithm. One of the main components that have been studied and improved is the evaluation measure for candidate splits. In this article, we introduce a tutorial that explains decision tree induction. Then, we present an experimental framework to assess the performance of 21 evaluation measures that produce different C4.5 variants considering 110 databases, two performance measures, and 10 × 10-fold cross-validation. Furthermore, we compare and rank the evaluation measures by using a Bayesian statistical analysis. From our experimental results, we present the first two performance rankings in the literature of C4.5 variants. Moreover, we organize the evaluation measures into two groups according to their performance. Finally, we introduce meta-models that automatically determine the group of evaluation measures to produce a C4.5 variant for a new database and some further opportunities for decision tree models.
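As an illustration of what an evaluation measure for candidate splits computes, the sketch below implements information gain and the gain ratio, the criterion used by standard C4.5. The function names and the toy labels are assumptions made for this example. The 21 measures compared in the article are not reproduced here.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(labels, partition):
    """Entropy reduction obtained by splitting `labels` into `partition`."""
    n = len(labels)
    remainder = sum(len(p) / n * entropy(p) for p in partition if p)
    return entropy(labels) - remainder


def gain_ratio(labels, partition):
    """Information gain normalised by the entropy of the split sizes."""
    n = len(labels)
    split_info = -sum(
        (len(p) / n) * log2(len(p) / n) for p in partition if p
    )
    if split_info == 0:
        return 0.0  # all examples fall in one branch, so the split is useless
    return information_gain(labels, partition) / split_info


# Toy data: a split that separates the classes and one that does not.
labels = ["yes", "yes", "no", "no"]
pure_split = [["yes", "yes"], ["no", "no"]]
mixed_split = [["yes", "no"], ["yes", "no"]]

print(gain_ratio(labels, pure_split), gain_ratio(labels, mixed_split))
```

During induction, the candidate split with the highest score is chosen at each node. Swapping in a different scoring function at that step is what yields a different C4.5 variant.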


Author(s):  
J. Behmann ◽  
P. Schmitter ◽  
J. Steinrücken ◽  
L. Plümer

Detection of crop stress from hyperspectral images is of high importance for breeding and precision crop protection. However, the continuous monitoring of stress in phenotyping facilities by hyperspectral imagers produces huge amounts of uninterpreted data. In order to derive a stress description from the images, interpreting algorithms with high prediction performance are required. Based on a static model, the local stress state of each pixel has to be predicted. Due to their low computational complexity, linear models are preferable.

In this paper, we focus on drought-induced stress, which is represented by discrete stages of ordinal order. We present and compare five methods which are able to derive stress levels from hyperspectral images: one-vs.-one Support Vector Machine (SVM), one-vs.-all SVM, Support Vector Regression (SVR), Support Vector Ordinal Regression (SVORIM) and Linear Ordinal SVM classification. The methods are applied to two data sets: a real-world set of drought stress in single barley plants and a simulated data set. It is shown that Linear Ordinal SVM is a powerful tool for applications which require high prediction performance under limited resources. It is significantly more efficient than the one-vs.-one SVM and even more efficient than the less accurate one-vs.-all SVM. Compared to the very compact SVORIM model, it represents the senescence process much more accurately.


Author(s):  
Qiu Xiao ◽  
Ning Zhang ◽  
Jiawei Luo ◽  
Jianhua Dai ◽  
Xiwei Tang

Accumulating evidence has shown that microRNAs (miRNAs) play crucial roles in different biological processes, and their mutations and dysregulations have been proved to contribute to tumorigenesis. In silico identification of disease-associated miRNAs is a cost-effective strategy to discover the most promising biomarkers for disease diagnosis and treatment. The increasingly available omics data sources provide unprecedented opportunities to decipher the underlying relationships between miRNAs and diseases by computational models. However, most existing methods are biased towards a single representation of miRNAs or diseases and are also not capable of discovering unobserved associations for new miRNAs or diseases without association information. In this study, we present a novel computational method with adaptive multi-source multi-view latent feature learning (M2LFL) to infer potential disease-associated miRNAs. First, we adopt multiple data sources to obtain similarity profiles and capture different latent features according to the geometric characteristics of the miRNA and disease spaces. Then, the multi-modal latent features are projected to a common subspace to discover unobserved miRNA-disease associations in both miRNA and disease views, and an adaptive joint graph regularization term is developed to preserve the intrinsic manifold structures of multiple similarity profiles. Meanwhile, the Lp,q-norms are imposed on the projection matrices to ensure sparsity and improve interpretability. The experimental results confirm the superior performance of our proposed method in screening reliable candidate disease miRNAs, which suggests that M2LFL could be an efficient tool to discover diagnostic biomarkers for guiding laborious clinical trials.


Complexity ◽  
2017 ◽  
Vol 2017 ◽  
pp. 1-9 ◽  
Author(s):  
Zhen Shen ◽  
You-Hua Zhang ◽  
Kyungsook Han ◽  
Asoke K. Nandi ◽  
Barry Honig ◽  
...  

As one of the factors in the noncoding RNA family, microRNAs (miRNAs) are involved in the development and progression of various complex diseases. Experimental identification of miRNA-disease associations is expensive and time-consuming. Therefore, it is necessary to design efficient algorithms to identify novel miRNA-disease associations. In this paper, we developed a computational method, Collaborative Matrix Factorization for miRNA-Disease Association prediction (CMFMDA), to identify potential miRNA-disease associations by integrating miRNA functional similarity, disease semantic similarity, and experimentally verified miRNA-disease associations. Experiments verified that CMFMDA achieves its intended purpose and has practical value, with a short running time and high prediction accuracy. In addition, we used CMFMDA on Esophageal Neoplasms and Kidney Neoplasms to reveal their potentially related miRNAs. As a result, 84% and 82% of the top 50 predicted miRNA-disease pairs for these two diseases were confirmed by experiment. Moreover, CMFMDA can be applied to new diseases and new miRNAs without any known associations, which overcomes a limitation of many previous computational methods.


2020 ◽  
Vol 11 ◽  
Author(s):  
Haiyong Zhao ◽  
Shuang Wang ◽  
Xiguo Yuan

Next-generation sequencing (NGS) technologies have provided great opportunities to analyze pathogenic microbes with high-resolution data. The main goal is to accurately detect microbial composition and abundances in a sample. However, high similarity among sequences from different species and the existence of sequencing errors pose various challenges. Numerous methods have been developed for quantifying microbial composition and abundance, but they are not versatile enough for the analysis of samples with mixtures of noise. In this paper, we propose a new computational method, PGMicroD, for the detection of pathogenic microbial composition in a sample using NGS data. The method first filters out potentially mismapped reads and extracts multiple species-related features from the sequencing reads of 16S rRNA. Then it trains a Support Vector Machine classifier to predict the microbial composition. Finally, it groups all multiple-mapped sequencing reads into the references of the predicted species to estimate the abundance of each species. The performance of PGMicroD is evaluated on both simulated and real sequencing data and is compared with several existing methods. The results demonstrate that our proposed method achieves superior performance. The software package of PGMicroD is available at https://github.com/BDanalysis/PGMicroD.


Complexity ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Zhiming Hu ◽  
Chong Liu

Grey prediction models have been widely used in various fields of society due to their high prediction accuracy. The vast majority of existing grey models target equidistant sequences, whereas limited research has focused on nonequidistant sequences. The development of nonequidistant grey prediction models has been very slow due to their complex modeling mechanism. In order to further expand grey system theory, a new nonequidistant grey prediction model is established in this paper. To further improve the prediction accuracy of the NEGM (1, 1, t2) model, the background values of the improved nonequidistant grey model, abbreviated as INEGM (1, 1, t2), are optimized based on the Simpson formula. Meanwhile, to verify the validity of the proposed model, it is applied to two real-world cases in comparison with three other benchmark models, and the modeling results are evaluated through several commonly used indicators. The results of the two cases show that the INEGM (1, 1, t2) model has the best prediction performance among these competing models.


2019 ◽  
Author(s):  
David Moi ◽  
Laurent Kilchoer ◽  
Pablo S. Aguilar ◽  
Christophe Dessimoz

Phylogenetic profiling is a computational method to predict genes involved in the same biological process by identifying protein families which tend to be jointly lost or retained across the tree of life. Phylogenetic profiling has customarily been more widely used with prokaryotes than eukaryotes, because the method is thought to require many diverse genomes. There are now many eukaryotic genomes available, but these are considerably larger, and typical phylogenetic profiling methods require quadratic time or worse in the number of genes. We introduce a fast, scalable phylogenetic profiling approach entitled HogProf, which leverages hierarchical orthologous groups for the construction of large profiles and locality-sensitive hashing for efficient retrieval of similar profiles. We show that the approach outperforms Enhanced Phylogenetic Tree, a phylogeny-based method, and use the tool to reconstruct networks and query for interactors of the kinetochore complex as well as conserved proteins involved in sexual reproduction: Hap2, Spo11 and Gex1. HogProf enables large-scale phylogenetic profiling across the three domains of life, and will be useful to predict biological pathways among the hundreds of thousands of eukaryotic species that will become available in the coming few years. HogProf is available at https://github.com/DessimozLab/HogProf.


2018 ◽  
Vol 32 (29) ◽  
pp. 1850348
Author(s):  
Xu-Hua Yang ◽  
Xuhua Yang ◽  
Fei Ling ◽  
Hai-Feng Zhang ◽  
Duan Zhang ◽  
...  

Link prediction can estimate the probability of the existence of an unknown or future edge between two arbitrary disconnected nodes (two seed nodes) in complex networks on the basis of information regarding network nodes, edges and topology. With its important practical value in many fields such as social networks, electronic commerce, data mining and biological networks, link prediction is attracting considerable attention from scientists in various fields. In this paper, we find that the degree distribution and strength of two- and three-step local paths between two seed nodes can reveal effective similarity information between the two nodes. An index called local major path degree (LMPD) is proposed to estimate the probability of generating a link between two seed nodes. To indicate the efficiency of this algorithm, we compare it with nine well-known similarity indices based on local information in 12 real networks. Results show that the LMPD algorithm can achieve high prediction performance.
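The local information referred to above can be pictured with a small sketch. It computes the common-neighbours index, a widely used local similarity index assumed here as a baseline, and counts the two- and three-step simple paths between two disconnected seed nodes. The LMPD formula is not given in the abstract and is not reproduced. The graph and helper names are hypothetical.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbours)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def common_neighbors(adj, u, v):
    """Common-neighbours index: number of nodes adjacent to both seeds."""
    return len(adj[u] & adj[v])


def count_paths(adj, u, v, length):
    """Count simple paths with exactly `length` edges from u to v."""
    def walk(node, remaining, visited):
        if remaining == 0:
            return 1 if node == v else 0
        total = 0
        for nxt in adj[node]:
            if nxt not in visited:
                total += walk(nxt, remaining - 1, visited | {nxt})
        return total

    return walk(u, length, {u})


# Toy network in which the seed nodes "a" and "d" are not directly linked.
edges = [("a", "b"), ("b", "d"), ("a", "c"), ("c", "d"), ("c", "e"), ("e", "d")]
adj = build_adjacency(edges)

print(common_neighbors(adj, "a", "d"))  # shared neighbours b and c
print(count_paths(adj, "a", "d", 2))    # a-b-d and a-c-d
print(count_paths(adj, "a", "d", 3))    # a-c-e-d
```

A local index turns counts like these into a score for each unlinked node pair. According to the abstract, LMPD additionally takes into account the degree distribution and strength of such paths.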


2006 ◽  
Vol 04 (02) ◽  
pp. 275-298 ◽  
Author(s):  
SUN-YUAN KUNG ◽  
MAN-WAI MAK ◽  
ILIAS TAGKOPOULOS

Machine learning techniques offer a viable approach to cluster discovery from microarray data, which involves identifying and classifying biologically relevant groups in genes and conditions. It has been recognized that genes (whether or not they belong to the same gene group) may be co-expressed via a variety of pathways. Therefore, they can be adequately described by a diversity of coherence models. In fact, it is known that a gene may participate in multiple pathways that may or may not be co-active under all conditions. It is therefore biologically meaningful to simultaneously divide genes into functional groups and conditions into co-active categories, leading to the so-called biclustering analysis. For this, we have proposed a comprehensive set of coherence models to cope with various plausible regulation processes. Furthermore, a multivariate biclustering analysis based on fusion of different coherence models appears to be promising because the expression level of genes from the same group may follow more than one coherence model. The simulation studies further confirm that the proposed framework enjoys the advantage of high prediction performance.


2011 ◽  
Vol 2 (1) ◽  
Author(s):  
Yusuf Durachman

There are many alternatives in designing an IR system. How do we know which of these techniques are effective in which applications? Should we use stop lists? Should we stem? Should we use inverse document frequency weighting? Information retrieval has developed as a highly empirical discipline, requiring careful and thorough evaluation to demonstrate the superior performance of novel techniques on representative document collections. This research presents the many common evaluation measures that are widely used to assess the effectiveness of IR systems, and the test collections that are most often used for this purpose. It then presents the straightforward notion of relevant and nonrelevant documents and the formal evaluation methodology that has been developed for evaluating unranked retrieval results. This includes explaining the kinds of evaluation measures that are standardly used for document retrieval and related tasks like text classification, and why they are appropriate. This research can be valuable for those who want to do research in the field of IR. Keywords: Information Retrieval, evaluation & measurement, Precision & Recall
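For unranked retrieval results, the two standard measures named in the keywords can be computed from the set of retrieved documents and the set of relevant documents. The sketch below is a minimal illustration. The document identifiers are invented, and the F1 score is included as a commonly used combination of the two.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for an unranked result set."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical query: four documents returned, three judged relevant.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d5"]

p, r = precision_recall(retrieved, relevant)
print(p, r, f1_score(p, r))
```

Precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. Both require relevance judgments, which is why test collections matter for this kind of evaluation.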

