A Latent Feature Model Approach to Biclustering

Author(s):  
José Caldas ◽  
Samuel Kaski

Biclustering is the unsupervised learning task of mining a data matrix for useful submatrices, for instance groups of genes that are co-expressed under particular biological conditions. As these submatrices are expected to partly overlap, a significant challenge in biclustering is to develop methods that can detect overlapping biclusters. The authors propose a probabilistic mixture modelling framework for biclustering biological data that accommodates various data types and allows biclusters to overlap. Their framework is akin to the latent feature and mixture-of-experts model families, with inference and parameter estimation performed via a variational expectation-maximization algorithm. The model compares favorably with competing approaches on both a binary DNA copy number variation data set and a miRNA expression data set, indicating that it may serve as a general problem-solving tool in biclustering.
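
A minimal sketch (not the authors' model) of the latent feature view of overlapping biclusters: a data matrix is approximated by binary row- and column-membership matrices combined with per-bicluster weights, so a row or column may belong to several biclusters at once. All dimensions and parameter values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols, n_biclusters = 50, 30, 3

# Binary latent feature (membership) matrices: a row or column may join several biclusters.
U = (rng.random((n_rows, n_biclusters)) < 0.3).astype(float)   # row memberships
V = (rng.random((n_cols, n_biclusters)) < 0.3).astype(float)   # column memberships
W = np.diag(rng.normal(3.0, 1.0, n_biclusters))                # per-bicluster effect sizes

# Overlapping biclusters compose additively; Gaussian noise models the residual variation.
X = U @ W @ V.T + rng.normal(0.0, 0.5, (n_rows, n_cols))

# Cells covered by more than one bicluster (overlaps simply add up here).
coverage = U @ V.T
print("cells in >1 bicluster:", int((coverage > 1).sum()))
```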

2018 ◽  
Vol 20 (4) ◽  
pp. 1450-1465 ◽  
Author(s):  
Juan Xie ◽  
Anjun Ma ◽  
Anne Fennell ◽  
Qin Ma ◽  
Jing Zhao

Abstract Biclustering is a powerful data mining technique that clusters the rows and columns of a matrix-format data set simultaneously. It was first applied to gene expression data in 2000, with the aim of identifying genes that are co-expressed under a subset of all conditions/samples. Over the past 17 years, dozens of biclustering algorithms and tools have been developed to enhance our ability to make sense of the large data sets generated in the wake of high-throughput omics technologies. These algorithms and tools have been applied to a wide variety of data types, including, but not limited to, genomes, transcriptomes, exomes, epigenomes, phenomes and pharmacogenomes. However, a considerable gap remains between biclustering methodology development and comprehensive data interpretation, mainly because of the lack of guidance on selecting appropriate biclustering tools and supporting computational techniques for specific studies. Here, we first give a brief introduction to the existing biclustering algorithms and tools in the public domain, and then systematically summarize the basic applications of biclustering to biological data and the more advanced applications to biomedical data. This review will help researchers analyze their big data effectively and generate valuable biological knowledge and novel insights with higher efficiency.
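
As one concrete example of the kind of tool surveyed in such reviews, the sketch below runs scikit-learn's SpectralCoclustering on a synthetic expression-like matrix; it stands in for any of the many biclustering tools discussed and is not tied to a specific method from this review.

```python
import numpy as np
from sklearn.datasets import make_biclusters
from sklearn.cluster import SpectralCoclustering

# Synthetic matrix with 4 planted biclusters (e.g. genes x conditions).
data, rows, cols = make_biclusters(shape=(100, 40), n_clusters=4,
                                   noise=5, random_state=0)

model = SpectralCoclustering(n_clusters=4, random_state=0)
model.fit(data)

# Rows and columns assigned to the first bicluster.
gene_idx = np.where(model.rows_[0])[0]
cond_idx = np.where(model.columns_[0])[0]
print(f"bicluster 0: {len(gene_idx)} rows x {len(cond_idx)} columns")
```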


Geophysics ◽  
2014 ◽  
Vol 79 (1) ◽  
pp. IM1-IM9 ◽  
Author(s):  
Nathan Leon Foks ◽  
Richard Krahenbuhl ◽  
Yaoguo Li

Compressive inversion uses computational algorithms that decrease the time and storage needs of a traditional inverse problem. Most compression approaches focus on the model domain, and very few, other than traditional downsampling, focus on the data domain for potential-field applications. To further the compression in the data domain, a direct and practical approach to the adaptive downsampling of potential-field data for large inversion problems has been developed. The approach is formulated to significantly reduce the quantity of data in relatively smooth or quiet regions of the data set, while preserving the signal anomalies that contain the relevant target information. Two major benefits arise from this form of compressive inversion. First, because the approach compresses the problem in the data domain, it can be applied immediately without the addition of, or modification to, existing inversion software. Second, as most industry software uses some form of model or sensitivity compression, the addition of this adaptive data sampling creates a complete compressive inversion methodology whereby the reduction of computational cost is achieved simultaneously in the model and data domains. We applied the method to a synthetic magnetic data set and two large field magnetic data sets; however, the method is also applicable to other data types. Our results showed that the relevant model information is maintained after inversion despite using only 1%–5% of the data.
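
A hedged sketch of the general idea (not the authors' algorithm): keep a dense sampling where the local data gradient indicates an anomaly and decimate the smooth, quiet regions. The threshold and decimation factor are illustrative placeholders.

```python
import numpy as np

def adaptive_downsample(values, threshold=0.5, quiet_stride=10):
    """Keep every sample where the local gradient is large (anomalous),
    but only every `quiet_stride`-th sample in smooth regions."""
    grad = np.abs(np.gradient(values))
    keep = grad > threshold * grad.max()
    keep[::quiet_stride] = True          # sparse coverage of quiet regions
    return np.where(keep)[0]

# Synthetic 1-D magnetic profile: smooth background plus a localized anomaly.
x = np.linspace(0, 10, 2000)
profile = 0.1 * x + 5.0 * np.exp(-((x - 6.0) ** 2) / 0.05)
idx = adaptive_downsample(profile)
print(f"kept {len(idx)} of {len(profile)} samples "
      f"({100 * len(idx) / len(profile):.1f}%)")
```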


1987 ◽  
Vol 65 (3) ◽  
pp. 691-707 ◽  
Author(s):  
A. F. L. Nemec ◽  
R. O. Brinkhurst

A data matrix of 23 generic or subgeneric taxa versus 24 characters, and a shorter matrix of 15 characters, were analyzed by means of ordination, cluster analyses, parsimony, and compatibility methods (the last two being phylogenetic tree reconstruction methods), and the results were compared with one another and with those of traditional methods. Various measures of fit were employed to evaluate the parsimony methods. There were few compatible characters in the data set, and much homoplasy, but most analyses separated a group based on Stylaria from the rest of the family, which could then be separated into four groups, recognized here for the first time as tribes (Naidini, Derini, Pristinini, and Chaetogastrini). There was less consistency of results within these groups. Modern methods produced results that do not conflict with traditional groupings. The Jaccard coefficient minimizes the significance of symplesiomorphy, and complete linkage avoids chaining effects and corresponds to actual similarities, unlike single and average linkage methods, respectively. Ordination complements cluster analysis. The Wagner parsimony method was superior to the less flexible Camin–Sokal approach and produced better measure-of-fit statistics. All of the aforementioned methods contain areas susceptible to subjective decisions but, nevertheless, they lead to a complete disclosure of both the methods used and the assumptions made, and they facilitate objective hypothesis testing rather than the presentation of conflicting phylogenies based on the different, undisclosed premises of manual approaches.
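
For readers unfamiliar with the clustering side of the analysis, the sketch below shows how a binary character matrix can be clustered with Jaccard distances and complete linkage using SciPy; the data are random placeholders, not the naidid matrix analyzed in the paper.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Placeholder binary matrix: 23 taxa scored for 24 presence/absence characters.
characters = rng.integers(0, 2, size=(23, 24)).astype(bool)

# Jaccard distance ignores shared absences (downweighting symplesiomorphy);
# complete linkage avoids the chaining behaviour of single linkage.
dist = pdist(characters, metric="jaccard")
tree = linkage(dist, method="complete")
groups = fcluster(tree, t=5, criterion="maxclust")
print("cluster assignments:", groups)
```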


2021 ◽  
Vol 29 ◽  
pp. 287-295
Author(s):  
Zhiming Zhou ◽  
Haihui Huang ◽  
Yong Liang

BACKGROUND: In genome research, it is particularly important to identify molecular biomarkers or signaling pathways related to phenotypes. The logistic regression model is a powerful discrimination method that offers a clear statistical interpretation and yields classification probabilities for the class labels. However, it cannot by itself perform biomarker selection. OBJECTIVE: The aim of this paper is to equip the model with efficient gene selection capability. METHODS: In this paper, we propose a new logistic regression model with a penalized log-sum, network-based regularization for gene selection and cancer classification. RESULTS: Experimental results on simulated data sets show that our method is effective in the analysis of high-dimensional data. For a large data set, the proposed method achieved AUCs of 89.66% (training) and 90.02% (testing), which are, on average, 5.17% (training) and 4.49% (testing) higher than those of mainstream methods. CONCLUSIONS: The proposed method can be considered a promising tool for gene selection and cancer classification of high-dimensional biological data.
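
A simplified, illustrative version of a log-sum penalized logistic regression objective (ignoring the network-based structure of the actual penalty), fitted with scipy.optimize.minimize; the lambda and epsilon values and the synthetic data are placeholders.

```python
import numpy as np
from scipy.optimize import minimize

def logsum_logistic_objective(w, X, y, lam=0.1, eps=1e-4):
    """Negative log-likelihood plus a log-sum penalty, which approximates
    the L0 norm more closely than L1 and drives small weights toward zero."""
    z = X @ w
    nll = np.sum(np.logaddexp(0.0, -y * z))   # labels y in {-1, +1}
    penalty = lam * np.sum(np.log(1.0 + np.abs(w) / eps))
    return nll + penalty

rng = np.random.default_rng(0)
n, p = 100, 50                      # many features, few informative ones
X = rng.normal(size=(n, p))
true_w = np.zeros(p)
true_w[:5] = 2.0
y = np.sign(X @ true_w + rng.normal(scale=0.5, size=n))

res = minimize(logsum_logistic_objective, x0=np.zeros(p), args=(X, y),
               method="L-BFGS-B")
selected = np.where(np.abs(res.x) > 0.1)[0]
print("selected features:", selected)
```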


2020 ◽  
Vol 8 ◽  
Author(s):  
Devasis Bassu ◽  
Peter W. Jones ◽  
Linda Ness ◽  
David Shallcross

Abstract In this paper, we present a theoretical foundation for representing a data set as a measure in a very large, hierarchically parametrized family of positive measures whose parameters can be computed explicitly (rather than estimated by optimization), and we illustrate its applicability to a wide range of data types. The preprocessing step then consists of representing data sets as simple measures. The theoretical foundation consists of a dyadic product formula representation lemma and a visualization theorem. We also define an additive multiscale noise model that can be used to sample from dyadic measures, and a more general multiplicative multiscale noise model that can be used to perturb continuous functions, Borel measures, and dyadic measures. The first two results are based on theorems in [15, 3, 1]. The representation uses the very simple concept of a dyadic tree and hence is widely applicable, easily understood, and easily computed. Since the data sample is represented as a measure, subsequent analysis can exploit statistical and measure-theoretic concepts and theories. Because the representation rests on the simple concept of a dyadic tree defined on the universe of a data set, and its parameters are explicitly computable, easily interpretable, and visualizable, we hope that this approach will be broadly useful to mathematicians, statisticians, and computer scientists who are intrigued by or involved in data science, including its mathematical foundations.
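
To make the representation concrete, here is a hedged sketch (for 1-D data on [0, 1), not the general construction in the paper) of how dyadic-tree parameters can be computed explicitly, here as the fraction of sample mass falling in the left half of each dyadic interval.

```python
import numpy as np

def dyadic_parameters(samples, depth=4):
    """Explicitly computed parameters of a dyadic measure on [0, 1):
    for each dyadic interval, the share of its sample points lying in its left half."""
    samples = np.asarray(samples)
    params = {}
    for level in range(depth):
        width = 1.0 / (2 ** level)
        for k in range(2 ** level):
            lo, mid, hi = k * width, (k + 0.5) * width, (k + 1) * width
            in_interval = samples[(samples >= lo) & (samples < hi)]
            if len(in_interval):
                params[(level, k)] = np.mean(in_interval < mid)
    return params

rng = np.random.default_rng(0)
data = rng.beta(2, 5, size=1000)          # any sample supported on [0, 1)
print(dyadic_parameters(data, depth=3))
```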


Author(s):  
Stylianos Somarakis ◽  
Athanassios Machias

Data from bottom trawl surveys conducted each summer, winter and spring on the Cretan shelf from 1988 to 1991 were used to study the age, growth, maturity and bathymetric distribution of red pandora (Pagellus erythrinus). The good agreement of back-calculated and observed lengths-at-age with length frequencies, together with the marginal increment analysis, supported the annual nature of scale marks. A comparison of available growth data from the Mediterranean and the Atlantic revealed higher lengths-at-age for red pandora in the north-western Mediterranean and the Atlantic than in the central and eastern Mediterranean. The auximetric analysis, i.e. the double logarithmic plot of the parameter K of the von Bertalanffy growth function versus asymptotic length (L∞), showed a strong negative relationship for the central and eastern Mediterranean data set, implying a common ‘growth space’ for the populations in these areas. Lengths-at-maturity were lower on the Cretan shelf than in the Atlantic. These differences were attributed to the synergistic combination of trophic and thermal conditions. Depth, temperature and salinity data were combined with biological data on abundance, fish size, age and maturity. In general, mean size increased with bottom depth because smaller individuals tended to be found in shallower and warmer waters. Individuals that had reached first maturity were mainly distributed on the periphery of the algal/angiosperm meadows (60–80 m). All detailed studies of the bathymetric distribution and movements of shelf-dwelling demersal species (Mullus barbatus, Mullus surmuletus, Lepidotrigla cavillone and Pagellus erythrinus) in the Mediterranean show that these species are characterized by a spring–summer spawning season, a high concentration of spawning adults at mid-shelf depths, and nursery grounds located in the vegetated shallows. This multispecies pattern might have an adaptive function with both ecological and management implications.
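
The growth analysis rests on the von Bertalanffy growth function, L(t) = L∞(1 − e^(−K(t − t0))); the sketch below fits it to placeholder length-at-age data with SciPy (the ages and lengths are illustrative, not the Cretan shelf data).

```python
import numpy as np
from scipy.optimize import curve_fit

def von_bertalanffy(t, L_inf, K, t0):
    """von Bertalanffy growth function: length at age t."""
    return L_inf * (1.0 - np.exp(-K * (t - t0)))

# Placeholder mean lengths-at-age (cm) for ages 1-6.
ages = np.array([1, 2, 3, 4, 5, 6], dtype=float)
lengths = np.array([9.5, 13.0, 15.8, 18.0, 19.6, 20.8])

(L_inf, K, t0), _ = curve_fit(von_bertalanffy, ages, lengths,
                              p0=[25.0, 0.3, 0.0])
print(f"L_inf={L_inf:.1f} cm, K={K:.2f}/yr, t0={t0:.2f} yr")
# An auximetric plot then compares log K against log L_inf across populations.
```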


2018 ◽  
Vol 15 (6) ◽  
pp. 172988141881470
Author(s):  
Nezih Ergin Özkucur ◽  
H Levent Akın

Self-localization is one of the fundamental problems in the development of intelligent autonomous robots, and processing raw sensory information into useful features is an integral part of this problem. In a typical scenario, there are several choices for the feature extraction algorithm, and each has its weaknesses and strengths depending on the characteristics of the environment. In this work, we introduce a localization algorithm that is capable of capturing the quality of a feature type based on the local environment and performs soft selection of feature types across different regions. A batch expectation–maximization algorithm is developed for both discrete and Monte Carlo localization models, exploiting the robot's probabilistic pose estimates without requiring ground-truth poses and treating the different observation types as black-box algorithms. We tested our method in simulation, on data collected in an indoor environment with a custom robot platform, and on a public data set. The results are compared with those of the individual feature types as well as with a naive fusion strategy.
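
A hedged, highly simplified sketch of the soft-selection idea (not the authors' full batch EM over localization models): per-region weights over two black-box feature types are updated EM-style from how well each type's observation likelihood explains the data. All quantities below are synthetic placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n_regions, n_feature_types, n_obs = 4, 2, 200

# Placeholder per-observation likelihoods from each black-box feature type,
# plus the region in which each observation was made.
likelihoods = rng.gamma(2.0, 1.0, size=(n_obs, n_feature_types))
regions = rng.integers(0, n_regions, size=n_obs)

weights = np.full((n_regions, n_feature_types), 1.0 / n_feature_types)
for _ in range(20):
    # E-step: responsibility of each feature type for each observation.
    resp = weights[regions] * likelihoods
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: per-region weights become the average responsibilities there.
    for r in range(n_regions):
        weights[r] = resp[regions == r].mean(axis=0)

print("per-region feature-type weights:\n", weights)
```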


2021 ◽  
Author(s):  
Andrew J Kavran ◽  
Aaron Clauset

Abstract Background: Large-scale biological data sets are often contaminated by noise, which can impede accurate inferences about underlying processes. Such measurement noise can arise from endogenous biological factors like cell cycle and life history variation, and from exogenous technical factors like sample preparation and instrument variation. Results: We describe a general method for automatically reducing noise in large-scale biological data sets. This method uses an interaction network to identify groups of correlated or anti-correlated measurements that can be combined or “filtered” to better recover an underlying biological signal. Similar to the process of denoising an image, a single network filter may be applied to an entire system, or the system may first be decomposed into distinct modules and a different filter applied to each. Applied to synthetic data with known network structure and signal, network filters accurately reduce noise across a wide range of noise levels and structures. Applied to a machine learning task of predicting changes in human protein expression in healthy and cancerous tissues, network filtering prior to training increases accuracy by up to 43% compared with using unfiltered data. Conclusions: Network filters are a general way to denoise biological data and can account for both correlation and anti-correlation between different measurements. Furthermore, we find that partitioning a network prior to filtering can significantly reduce errors in networks with heterogeneous data and correlation patterns, and this approach outperforms existing diffusion-based methods. Our results on proteomics data indicate the broad potential utility of network filters for applications in systems biology.
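
A minimal sketch of the filtering idea (a simple neighborhood-averaging filter on an interaction network, not the authors' exact formulation): each node's noisy measurement is blended with its neighbors', with anti-correlated neighbors entering with flipped sign. The blending weight and toy network are illustrative.

```python
import numpy as np

def network_filter(values, adjacency, signs, alpha=0.5):
    """Blend each node's measurement with a signed average of its neighbors.
    `adjacency` is a 0/1 matrix; `signs` is +1 (correlated) or -1 (anti-correlated)."""
    signed_adj = adjacency * signs
    degree = adjacency.sum(axis=1)
    neighbor_avg = np.divide(signed_adj @ values, degree,
                             out=np.copy(values), where=degree > 0)
    return (1.0 - alpha) * values + alpha * neighbor_avg

rng = np.random.default_rng(0)
n = 6
adjacency = np.array([[0, 1, 1, 0, 0, 0],
                      [1, 0, 1, 0, 0, 0],
                      [1, 1, 0, 1, 0, 0],
                      [0, 0, 1, 0, 1, 1],
                      [0, 0, 0, 1, 0, 1],
                      [0, 0, 0, 1, 1, 0]], dtype=float)
signs = np.ones((n, n))                       # all edges treated as correlated here
signal = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
noisy = signal + rng.normal(scale=0.5, size=n)
print("filtered:", network_filter(noisy, adjacency, signs))
```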


2020 ◽  
pp. 865-874
Author(s):  
Enrico Santus ◽  
Tal Schuster ◽  
Amir M. Tahmasebi ◽  
Clara Li ◽  
Adam Yala ◽  
...  

PURPOSE Literature on clinical note mining has highlighted the superiority of machine learning (ML) over hand-crafted rules. Nevertheless, most studies assume the availability of large training sets, which is rarely the case. For this reason, in the clinical setting, rules are still common. We suggest 2 methods to leverage the knowledge encoded in pre-existing rules to inform ML decisions and obtain high performance, even with scarce annotations. METHODS We collected 501 prostate pathology reports from 6 American hospitals. Reports were split into 2,711 core segments, annotated with 20 attributes describing the histology, grade, extension, and location of tumors. The data set was split by institutions to generate a cross-institutional evaluation setting. We assessed 4 systems, namely a rule-based approach, an ML model, and 2 hybrid systems integrating the previous methods: a Rule as Feature model and a Classifier Confidence model. Several ML algorithms were tested, including logistic regression (LR), support vector machine (SVM), and eXtreme gradient boosting (XGB). RESULTS When training on data from a single institution, LR lags behind the rules by 3.5% (F1 score: 92.2% v 95.7%). Hybrid models, instead, obtain competitive results, with Classifier Confidence outperforming the rules by +0.5% (96.2%). When a larger amount of data from multiple institutions is used, LR improves by +1.5% over the rules (97.2%), whereas hybrid systems obtain +2.2% for Rule as Feature (97.7%) and +2.6% for Classifier Confidence (98.3%). Replacing LR with SVM or XGB yielded similar performance gains. CONCLUSION We developed methods to use pre-existing handcrafted rules to inform ML algorithms. These hybrid systems obtain better performance than either rules or ML models alone, even when training data are limited.
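
A hedged sketch of the Rule-as-Feature idea in the spirit of the paper (not the authors' code): the output of a hand-crafted rule is appended as an extra feature so the classifier can learn when to trust it. The rule, features, and labels below are toy placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 300, 20
X = rng.normal(size=(n, p))
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)

# Toy "hand-crafted rule": a noisy thresholded heuristic on one feature.
rule_pred = (X[:, 0] + rng.normal(scale=1.0, size=n) > 0).astype(int)

# Rule-as-Feature: append the rule's decision as an additional input column.
X_hybrid = np.hstack([X, rule_pred[:, None]])

clf = LogisticRegression(max_iter=1000).fit(X_hybrid[:200], y[:200])
print("held-out accuracy:", clf.score(X_hybrid[200:], y[200:]))
```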

