A Fast Lasso-Based Method for Inferring Higher-Order Interactions

2021 ◽  
Author(s):  
Kieran Elmes ◽  
Astra Heywood ◽  
Zhiyi Huang ◽  
Alex Gavryushkin

Large-scale genotype-phenotype screens provide a wealth of data for identifying molecular alterations associated with a phenotype. Epistatic effects play an important role in such association studies. For example, siRNA perturbation screens can be used to identify combinatorial gene-silencing effects. In bacteria, epistasis has practical consequences in determining antimicrobial resistance as the genetic background of a strain plays an important role in determining resistance. Recently developed tools scale to human exome-wide screens for pairwise interactions, but none to date have included the possibility of three-way interactions. Expanding upon recent state-of-the-art methods, we make a number of improvements to the performance on large-scale data, making consideration of three-way interactions possible. We demonstrate our proposed method, Pint, on both simulated and real data sets, including antibiotic resistance testing and siRNA perturbation screens. Pint outperforms known methods in simulated data, and identifies a number of biologically plausible gene effects in both the antibiotic and siRNA models. For example, we have identified a combination of known tumor suppressor genes that is predicted (using Pint) to cause a significant increase in cell proliferation.

2021 ◽  
Author(s):  
Kieran Elmes ◽  
Astra Heywood ◽  
Zhiyi Huang ◽  
Alex Gavryushkin

Abstract Large-scale genotype-phenotype screens provide a wealth of data for identifying molecular alterations associated with a phenotype. Epistatic effects play an important role in such association studies. For example, siRNA perturbation screens can be used to identify pairwise gene-silencing effects. In bacteria, epistasis has practical consequences in determining antimicrobial resistance as the genetic background of a strain plays an important role in determining resistance. Existing computational tools that account for epistasis do not scale to human exome-wide screens and struggle with genetically diverse bacterial species such as Pseudomonas aeruginosa. Combining earlier work in interaction detection with recent advances in integer compression, we present a method for epistatic interaction detection on sparse (human) exome-scale data, and an R implementation in the package Pint. Our method takes advantage of sparsity in the input data and recent progress in integer compression to perform lasso-penalised linear regression on all pairwise combinations of the input, estimating up to 200 million potential effects, including epistatic interactions. Hence the human exome is within the reach of our method, assuming one parameter per gene and one parameter per epistatic effect for every pair of genes. We demonstrate Pint on both simulated and real data sets, including antibiotic resistance testing and siRNA perturbation screens.
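The parameter count behind the "200 million potential effects" figure can be checked with a short calculation. The sketch below is illustrative only and is not taken from the Pint package: the function names are hypothetical, the round figure of 20,000 genes is an assumption (the abstract gives no gene count), and storing each binary column as a set of nonzero row indices is just one simple way to exploit sparsity.

```python
from itertools import combinations


def n_effects(p):
    """One parameter per gene plus one per unordered pair of genes."""
    return p + p * (p - 1) // 2


def pairwise_columns(columns):
    """Build sparse pairwise-interaction columns for binary data.

    `columns` maps a feature name to the set of row indices where that
    feature is 1. The interaction of two binary features is 1 exactly on
    the rows they share, so it is the intersection of the two sets.
    Interactions that are zero everywhere are skipped.
    """
    interactions = {}
    for (a, rows_a), (b, rows_b) in combinations(sorted(columns.items()), 2):
        shared = rows_a & rows_b
        if shared:
            interactions[(a, b)] = shared
    return interactions


# Assumed gene count of 20,000 (a round figure chosen for illustration).
total = n_effects(20_000)  # 20,000 main effects + 199,990,000 pairs

# Toy input: three hypothetical genes observed across four samples.
toy = {"g1": {0, 1, 2}, "g2": {1, 2}, "g3": {3}}
toy_interactions = pairwise_columns(toy)
```

Under that assumption the total is 200,010,000, in line with the order of magnitude the abstract quotes. The toy input yields a single nonempty interaction column, for the pair (g1, g2).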


Complexity ◽  
2018 ◽  
Vol 2018 ◽  
pp. 1-16 ◽  
Author(s):  
Yiwen Zhang ◽  
Yuanyuan Zhou ◽  
Xing Guo ◽  
Jintao Wu ◽  
Qiang He ◽  
...  

The K-means algorithm is one of the ten classic algorithms in the area of data mining and has been studied by researchers in numerous fields for a long time. However, the value of the clustering number k in the K-means algorithm is not always easy to determine, and the selection of the initial centers is vulnerable to outliers. This paper proposes an improved K-means clustering algorithm called the covering K-means algorithm (C-K-means). The C-K-means algorithm can not only acquire efficient and accurate clustering results but also self-adaptively provide a reasonable number of clusters based on the data features. It includes two phases: the initialization of the covering algorithm (CA) and the Lloyd iteration of the K-means. The first phase executes the CA. CA self-organizes and recognizes the number of clusters k based on the similarities in the data, and it requires neither the number of clusters to be prespecified nor the initial centers to be manually selected. Therefore, it has a “blind” feature, that is, k is not preselected. The second phase performs the Lloyd iteration based on the results of the first phase. The C-K-means algorithm combines the advantages of CA and K-means. Experiments are carried out on the Spark platform, and the results verify the good scalability of the C-K-means algorithm. This algorithm can effectively solve the problem of large-scale data clustering. Extensive experiments on real data sets show that the C-K-means algorithm outperforms the existing algorithms in accuracy and efficiency under both sequential and parallel conditions.
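The second phase described above, the Lloyd iteration, can be sketched as follows. This is a minimal illustration in plain Python, not the authors' Spark implementation. The covering algorithm itself is not specified in the abstract, so the initial centers here are simply assumed to have been produced by some earlier phase, and the function names are hypothetical.

```python
def nearest(point, centers):
    """Index of the center closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centers[i])),
    )


def lloyd(points, initial_centers, max_iter=100):
    """Alternate assignment and mean-update steps until the centers stop moving."""
    centers = [tuple(float(x) for x in c) for c in initial_centers]
    for _ in range(max_iter):
        # Assignment step: group points by their nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        # Update step: move each center to the mean of its cluster.
        # An empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(xs) / len(members) for xs in zip(*members)) if members else c
            for c, members in zip(centers, clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    labels = [nearest(p, centers) for p in points]
    return centers, labels


# Initial centers are assumed to come from an earlier initialization phase.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
centers, labels = lloyd(points, initial_centers=[(1, 1), (9, 1)])
```

Because the number of centers is fixed by whatever initialization is supplied, the iteration never has to choose k itself, which is the division of labour the abstract describes.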


2016 ◽  
Vol 2016 ◽  
pp. 1-9
Author(s):  
Size Bi ◽  
Xiaoyu Han ◽  
Jing Tian ◽  
Xiao Liang ◽  
Yang Wang ◽  
...  

This paper investigates a homotopy-based method for embedding hundreds of thousands of data items that yields a parallel algorithm suitable for running on a distributed system. Current eigenvalue-based embedding algorithms attempt to use a sparsification of the distance matrix to approximate a low-dimensional representation when handling large-scale data sets. The main reason for resorting to approximation is that the embedding process is still hindered by the eigendecomposition bottleneck for high-dimensional matrices. In this study, a homotopy continuation algorithm is applied to improve this embedding model by parallelizing the corresponding eigendecomposition. The eigenvalue problem is converted into solving ordinary differential equations with initial values, and all isolated positive eigenvalues and corresponding eigenvectors can be obtained in parallel by following the predicted eigenpaths. Experiments on real data sets show that the homotopy-based approach has the potential to be implemented for millions of data sets.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Camilo Broc ◽  
Therese Truong ◽  
Benoit Liquet

Abstract Background The increasing number of genome-wide association studies (GWAS) has revealed several loci that are associated with multiple distinct phenotypes, suggesting the existence of pleiotropic effects. Highlighting these cross-phenotype genetic associations could help to identify and understand common biological mechanisms underlying some diseases. Common approaches test the association between genetic variants and multiple traits at the SNP level. In this paper, we propose a novel gene- and pathway-level approach for the case where several independent GWAS on independent traits are available. The method is based on a generalization of the sparse group Partial Least Squares (sgPLS) to take into account groups of variables, and a Lasso penalization that links all independent data sets. This method, called joint-sgPLS, is able to convincingly detect signal at the variable level and at the group level. Results Our method has the advantage of proposing a global, readable model while coping with the architecture of the data. It can outperform traditional methods and provides a wider insight in terms of a priori information. We compared the performance of the proposed method to other benchmark methods on simulated data and gave an example of application on real data with the aim of highlighting common susceptibility variants for breast and thyroid cancers. Conclusion The joint-sgPLS shows interesting properties for detecting a signal. As an extension of the PLS, the method is suited to data with a large number of variables. The choice of Lasso penalization copes with architectures of groups of variables and observation sets. Furthermore, although the method has been applied to a genetic study, its formulation can be adapted to any data with a high number of variables and an explicit a priori architecture in other application fields.


Genetics ◽  
2003 ◽  
Vol 165 (4) ◽  
pp. 2269-2282
Author(s):  
D Mester ◽  
Y Ronin ◽  
D Minkov ◽  
E Nevo ◽  
A Korol

Abstract This article is devoted to the problem of ordering in linkage groups with many dozens or even hundreds of markers. The ordering problem belongs to the field of discrete optimization on a set of all possible orders, amounting to n!/2 for n loci; hence it is considered an NP-hard problem. Several authors attempted to employ the methods developed in the well-known traveling salesman problem (TSP) for multilocus ordering, using the assumption that for a set of linked loci the true order will be the one that minimizes the total length of the linkage group. A novel, fast, and reliable algorithm developed for the TSP and based on evolution-strategy discrete optimization was applied in this study for multilocus ordering on the basis of pairwise recombination frequencies. The quality of derived maps under various complications (dominant vs. codominant markers, marker misclassification, negative and positive interference, and missing data) was analyzed using simulated data with ∼50-400 markers. High performance of the employed algorithm allows systematic treatment of the problem of verification of the obtained multilocus orders on the basis of computing-intensive bootstrap and/or jackknife approaches for detecting and removing questionable marker scores, thereby stabilizing the resulting maps. Parallel calculation technology can easily be adopted for further acceleration of the proposed algorithm. Real data analysis (on maize chromosome 1 with 230 markers) is provided to illustrate the proposed methodology.
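The search space and the minimum-length criterion described above can be illustrated on a toy case. The sketch below is not the evolution-strategy algorithm from the article. It swaps in exhaustive enumeration, which is only feasible for a handful of markers, to show why an order and its reverse count once (giving n!/2 candidates) and how the total map length is scored from pairwise distances. The function names and the toy distance matrix are hypothetical.

```python
from itertools import permutations


def map_length(order, dist):
    """Total length of a linkage group: sum of distances between adjacent markers."""
    return sum(dist[a][b] for a, b in zip(order, order[1:]))


def best_order(dist):
    """Exhaustively find the shortest order (toy sizes only).

    An order and its reverse describe the same map, so only permutations
    whose first marker index is smaller than the last are kept. That
    leaves n!/2 distinct candidates for n >= 2 markers.
    """
    n = len(dist)
    candidates = [p for p in permutations(range(n)) if p[0] < p[-1]]
    shortest = min(candidates, key=lambda p: map_length(p, dist))
    return shortest, len(candidates)


# Toy data: four markers at assumed positions 0, 1, 2, 3 on a line,
# with pairwise distance taken as the absolute difference in position.
positions = [0, 1, 2, 3]
dist = [[abs(a - b) for b in positions] for a in positions]
order, n_candidates = best_order(dist)
```

With four markers there are 4!/2 = 12 candidate orders, and the shortest one recovers the markers in positional order. The factorial growth of that count is what makes heuristic search necessary at the scale of hundreds of markers.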


Author(s):  
Jun Huang ◽  
Linchuan Xu ◽  
Jing Wang ◽  
Lei Feng ◽  
Kenji Yamanishi

Existing multi-label learning (MLL) approaches mainly assume all the labels are observed and construct classification models with a fixed set of target labels (known labels). However, in some real applications, multiple latent labels may exist outside this set and hide in the data, especially for large-scale data sets. Discovering and exploring the latent labels hidden in the data may not only find interesting knowledge but also help us to build a more robust learning model. In this paper, a novel approach named DLCL (i.e., Discovering Latent Class Labels for MLL) is proposed which can not only discover the latent labels in the training data but also predict new instances with the latent and known labels simultaneously. Extensive experiments show a competitive performance of DLCL against other state-of-the-art MLL approaches.


2021 ◽  
Vol 27 (7) ◽  
pp. 667-692
Author(s):  
Lamia Berkani ◽  
Lylia Betit ◽  
Louiza Belarif

Clustering-based approaches have been demonstrated to be efficient and scalable to large-scale data sets. However, clustering-based recommender systems suffer from relatively low accuracy and coverage. To address these issues, we propose in this article an optimized multiview clustering approach for the recommendation of items in social networks. First, the selection of the initial medoids is optimized using the Bees Swarm optimization algorithm (BSO) in order to generate better partitions (i.e. refining the quality of medoids according to the objective function). Then, the multiview clustering (MV) is applied, where users are iteratively clustered from the views of both rating patterns and social information (i.e. friendships and trust). Finally, a framework is proposed for testing the different alternatives, namely: (1) the standard recommendation algorithms; (2) the clustering-based and the optimized clustering-based recommendation algorithms using BSO; and (3) the MV and the optimized MV (BSO-MV) algorithms. Experiments conducted on two real-world datasets demonstrate the effectiveness of the proposed BSO-MV algorithm in terms of improving accuracy, as it outperforms the existing related approaches and baselines.


2014 ◽  
Vol 571-572 ◽  
pp. 497-501 ◽  
Author(s):  
Qi Lv ◽  
Wei Xie

Real-time log analysis on large-scale data is important for applications. Specifically, real-time refers to UI latency within 100 ms. Therefore, techniques that efficiently support real-time analysis over large log data sets are desired. MongoDB provides good query performance, aggregation frameworks, and a distributed architecture, which make it suitable for real-time data query and massive log analysis. In this paper, a novel implementation approach for an event-driven file log analyzer is presented, and the performance of query, scan and aggregation operations over MongoDB, HBase and MySQL is compared. Our experimental results show that HBase offers the most balanced performance across all operations, while MongoDB answers queries in less than 10 ms in some operations, which makes it the most suitable for real-time applications.

