Super.Complex: A supervised machine learning pipeline for molecular complex detection in protein-interaction networks

Protein Complexes ◽

Fitness Function ◽

Interaction Network ◽

Interaction Networks ◽

Supervised Machine Learning

Protein complexes can be computationally identified from protein-interaction networks with community detection methods, suggesting new multi-protein assemblies. Most community detection algorithms tend to be un- or semi-supervised and assume that communities are dense network subgraphs, which is not always true, as protein complexes can exhibit diverse network topologies. The few existing supervised machine learning methods are serial and can potentially be improved in terms of accuracy and scalability by using better-suited machine learning models and by using parallel algorithms, respectively. Here, we present Super.Complex, a distributed supervised machine learning pipeline for community detection in networks. Super.Complex learns a community fitness function from known communities using an AutoML method and applies this fitness function to detect new communities. A heuristic local search algorithm finds maximally scoring communities with epsilon-greedy and pseudo-metropolis criteria, and an embarrassingly parallel implementation can be run on a computer cluster for scaling to large networks. In order to evaluate Super.Complex, we propose three new measures for the still outstanding issue of comparing sets of learned and known communities. On a yeast protein-interaction network, Super.Complex outperforms 6 other supervised and 4 unsupervised methods. Application of Super.Complex to a human protein-interaction network with ~8k nodes and ~60k edges yields 1,028 protein complexes, of which 234 are linked to SARS-CoV-2, and places 111 uncharacterized proteins in 103 learned complexes. Super.Complex is generalizable and can be used in different applications of community detection, with the ability to improve results by incorporating domain-specific features. Learned community characteristics can also be transferred from existing applications to detect communities in a new application with no known communities.
Code and interactive visualizations of learned human protein complexes are freely available at: https://sites.google.com/view/supercomplex/super-complex-v3-0 .
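
The local search described in the abstract can be illustrated with a minimal sketch. It is not the Super.Complex implementation: the fitness function below is a hand-written edge-density placeholder (the paper learns its fitness function with AutoML), only the epsilon-greedy criterion is shown (the pseudo-metropolis criterion is omitted), and the names density_fitness and grow_community are hypothetical.

```python
import random


def density_fitness(nodes, adj):
    """Placeholder fitness (an assumption): fraction of possible edges present in `nodes`."""
    n = len(nodes)
    if n < 2:
        return 0.0
    # Each internal edge is seen from both endpoints, hence the division by 2.
    internal = sum(1 for u in nodes for v in adj[u] if v in nodes) / 2
    return internal / (n * (n - 1) / 2)


def grow_community(seed_edge, adj, fitness, epsilon=0.1, max_size=10, rng=None):
    """Grow a candidate community from a seed edge with an epsilon-greedy rule."""
    rng = rng or random.Random(0)
    community = set(seed_edge)
    score = fitness(community, adj)
    while len(community) < max_size:
        neighbors = sorted({v for u in community for v in adj[u]} - community)
        if not neighbors:
            break
        if rng.random() < epsilon:
            # Explore: pick a random neighboring node.
            candidate = rng.choice(neighbors)
        else:
            # Exploit: pick the neighbor that maximizes the fitness.
            candidate = max(neighbors, key=lambda v: fitness(community | {v}, adj))
        new_score = fitness(community | {candidate}, adj)
        if new_score < score:
            break  # stop as soon as the chosen move lowers the score
        community.add(candidate)
        score = new_score
    return community, score


# Toy network (hypothetical): a triangle a-b-c with a tail c-d-e.
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
found, found_score = grow_community(("a", "b"), toy, density_fitness, epsilon=0.0)
```

Because each seed is grown independently, runs such as this one can be distributed across workers without communication, which is the sense in which the abstract calls the implementation embarrassingly parallel.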

Super.Complex: A supervised machine learning pipeline for molecular complex detection in protein-interaction networks

PLoS ONE ◽

10.1371/journal.pone.0262056 ◽

2021 ◽

Vol 16 (12) ◽

pp. e0262056

Author(s):

Meghana Venkata Palukuri ◽

Edward M. Marcotte

Keyword(s):

Machine Learning ◽

Community Detection ◽

Protein Interaction ◽

Parallel Implementation ◽

Protein Complexes ◽

Fitness Function ◽

Interaction Network ◽

Human Protein ◽

Supervised Machine Learning

Characterization of protein complexes, i.e. sets of proteins assembling into a single larger physical entity, is important, as such assemblies play many essential roles in cells such as gene regulation. From networks of protein-protein interactions, potential protein complexes can be identified computationally through the application of community detection methods, which flag groups of entities interacting with each other in certain patterns. Most community detection algorithms tend to be unsupervised and assume that communities are dense network subgraphs, which is not always true, as protein complexes can exhibit diverse network topologies. The few existing supervised machine learning methods are serial and can potentially be improved in terms of accuracy and scalability by using better-suited machine learning models and parallel algorithms. Here, we present Super.Complex, a distributed, supervised AutoML-based pipeline for overlapping community detection in weighted networks. We also propose three new evaluation measures for the outstanding issue of comparing sets of learned and known communities satisfactorily. Super.Complex learns a community fitness function from known communities using an AutoML method and applies this fitness function to detect new communities. A heuristic local search algorithm finds maximally scoring communities, and a parallel implementation can be run on a computer cluster for scaling to large networks. On a yeast protein-interaction network, Super.Complex outperforms 6 other supervised and 4 unsupervised methods. Application of Super.Complex to a human protein-interaction network with ~8k nodes and ~60k edges yields 1,028 protein complexes, of which 234 are linked to SARS-CoV-2, the virus that causes COVID-19, and places 111 uncharacterized proteins in 103 learned complexes. Super.Complex is generalizable with the ability to improve results by incorporating domain-specific features.
Learned community characteristics can also be transferred from existing applications to detect communities in a new application with no known communities. Code and interactive visualizations of learned human protein complexes are freely available at: https://sites.google.com/view/supercomplex/super-complex-v3-0.

Discovering Network Motifs in Protein Interaction Networks

Biological Data Mining in Protein Interaction Networks ◽

10.4018/978-1-60566-398-2.ch008 ◽

2009 ◽

pp. 117-143 ◽

Cited By ~ 1

Author(s):

Raymond Wan ◽

Hiroshi Mamitsuka

Keyword(s):

Protein Interaction ◽

Graph Algorithms ◽

Undirected Graph ◽

Interaction Network ◽

Building Blocks ◽

Software Tools ◽

Interaction Networks ◽

Network Motifs

This chapter examines some of the available techniques for analyzing a protein interaction network (PIN) when depicted as an undirected graph. Within this graph, algorithms have been developed which identify “notable” smaller building blocks called network motifs. The authors examine these algorithms by dividing them into two broad categories based on two definitions of “notable”: (a) statistically-based methods and (b) frequency-based methods. They describe how these two classes of algorithms differ not only in terms of efficiency, but also in terms of the type of results that they report. Some publicly-available programs are demonstrated as part of their comparison. While most of the techniques are generic and were originally proposed for other types of networks, the focus of this chapter is on the application of these methods and software tools to PINs.
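
As a rough illustration of the frequency-based view, the sketch below counts the two connected 3-node subgraphs of an undirected graph: triangles and open triads (paths of length two). It is an assumption-laden toy, not one of the programs discussed in the chapter; the function name count_three_node_motifs is hypothetical, and a statistically-based method would additionally compare these counts against randomized networks, which is not done here.

```python
from itertools import combinations


def count_three_node_motifs(adj):
    """Count triangles and open triads in an undirected graph given as {node: set_of_neighbors}."""
    triangle_corners = 0
    open_triads = 0
    for center, nbrs in adj.items():
        # Every pair of neighbors of `center` forms a 3-node subgraph centered here.
        for u, v in combinations(sorted(nbrs), 2):
            if v in adj[u]:
                triangle_corners += 1  # each triangle is seen once per corner
            else:
                open_triads += 1       # an open triad has a unique center
    return {"triangle": triangle_corners // 3, "open_triad": open_triads}


# Toy graph (hypothetical): triangle a-b-c plus a pendant edge c-d.
toy_pin = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
motif_counts = count_three_node_motifs(toy_pin)
```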

Computational identification of signaling pathways in protein interaction networks

F1000Research ◽

10.12688/f1000research.7591.1 ◽

2015 ◽

Vol 4 ◽

pp. 1522

Author(s):

Angela U. Makolo ◽

Temitayo A. Olagunju

Keyword(s):

Signaling Pathways ◽

High Throughput ◽

Protein Interaction ◽

Interaction Network ◽

Interaction Networks ◽

Protein Interaction Data ◽

Interaction Data ◽

Protein Protein Interaction

The knowledge of signaling pathways is central to understanding the biological mechanisms of organisms since it has been identified that in eukaryotic organisms, the number of signaling pathways determines the number of ways the organism will react to external stimuli. Signaling pathways are studied using protein interaction networks constructed from protein-protein interaction data obtained from high-throughput experiments. However, these high-throughput methods are known to produce very high rates of false positive and negative interactions. To construct a useful protein interaction network from this noisy data, computational methods are applied to validate the protein-protein interactions. In this study, a computational technique to identify signaling pathways from a protein interaction network constructed using validated protein-protein interaction data was designed. A weighted interaction graph of Saccharomyces cerevisiae was constructed. The weights were obtained using a Bayesian probabilistic network to estimate the posterior probability of interaction between two proteins given the gene expression measurement as biological evidence. Only interactions above a threshold were accepted for the network model. We were able to identify some pathway segments, one of which is a segment of the pathway that signals the start of the process of meiosis in S. cerevisiae.
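
The weighting and thresholding step can be sketched with Bayes' rule alone. This is a simplification under stated assumptions, not the study's Bayesian network: the prior, the two likelihoods per protein pair, the threshold, and the names posterior_interaction and build_weighted_network are all hypothetical.

```python
def posterior_interaction(prior, lik_interact, lik_no_interact):
    """P(interaction | evidence) from a prior and the likelihood of the evidence under each hypothesis."""
    numerator = prior * lik_interact
    denominator = numerator + (1.0 - prior) * lik_no_interact
    return numerator / denominator if denominator > 0 else 0.0


def build_weighted_network(candidates, prior, threshold):
    """Keep only candidate interactions whose posterior exceeds `threshold`."""
    network = {}
    for pair, (lik_interact, lik_no_interact) in candidates.items():
        weight = posterior_interaction(prior, lik_interact, lik_no_interact)
        if weight > threshold:
            network[pair] = weight
    return network


# Hypothetical evidence: (P(expression data | interact), P(expression data | no interaction)).
candidate_pairs = {
    ("P1", "P2"): (0.9, 0.1),  # expression data strongly supports an interaction
    ("P1", "P3"): (0.2, 0.8),  # expression data argues against an interaction
}
weighted = build_weighted_network(candidate_pairs, prior=0.5, threshold=0.5)
```

With these made-up numbers the first pair receives a posterior of 0.9 and is kept, while the second receives 0.2 and is dropped.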

PolarMapper: a computational tool for integrated visualization of protein interaction networks and mRNA expression data

Journal of The Royal Society Interface ◽

10.1098/rsif.2008.0407 ◽

2008 ◽

Vol 6 (39) ◽

pp. 881-896 ◽

Cited By ~ 10

Author(s):

Joana P. Gonçalves ◽

Mário Grãos ◽

André X.C.N. Valente

Keyword(s):

Mrna Expression ◽

Protein Interaction ◽

Interaction Network ◽

Interaction Networks ◽

System Level ◽

Heat Shock Gene ◽

Expression Data ◽

Mrna Expression Data

PolarMapper is a computational application for exposing the architecture of protein interaction networks. It facilitates the system-level analysis of mRNA expression data in the context of the underlying protein interaction network. Preliminary analysis of a human protein interaction network and comparison of the yeast oxidative stress and heat shock gene expression responses are addressed as case studies.

Functional and Transcriptional Coherency of Modules in the Human Protein Interaction Network

Journal of Integrative Bioinformatics ◽

10.1515/jib-2007-76 ◽

2007 ◽

Vol 4 (3) ◽

pp. 198-207 ◽

Cited By ~ 2

Author(s):

Matthias E. Futschik ◽

Gautam Chaurasia ◽

Anna Tschaut ◽

Jenny Russ ◽

M. Madan Babu ◽

...

Keyword(s):

Protein Interaction ◽

Protein Complexes ◽

Interaction Network ◽

Human Interaction ◽

Interaction Networks ◽

Human Protein ◽

Modular Structure ◽

Human Protein Interaction ◽

Dynamic Modules

Summary: Modularity is a major design principle in interaction networks. Various studies have shown that protein interaction networks in prokaryotes and eukaryotes display a modular structure. A majority of the studies have been performed for the yeast interaction network, for which data have become abundant. The systematic examination of the human protein interaction network, however, is still in an early phase. To assess whether the human interaction network similarly displays a modular structure, we assembled a large protein network consisting of over 30,000 interactions. More than 670 modules were subsequently identified based on the detection of cliques. Inspection showed that these modules included numerous known protein complexes. The extracted modules were scrutinized for their coherency with respect to function, localization and expression, thereby allowing us to distinguish between stable and dynamic modules. Finally, the examination of the overlap between modules identified key proteins linking distinct molecular processes.
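
Clique-based module detection can be sketched with the Bron-Kerbosch algorithm with pivoting, a standard way to enumerate maximal cliques. The abstract does not say which clique algorithm or size cutoff the authors used, so the choice of algorithm, the min_size parameter, and the function name maximal_cliques are assumptions made for illustration.

```python
def maximal_cliques(adj, min_size=3):
    """Enumerate maximal cliques of an undirected graph {node: set_of_neighbors} (no self-loops)."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: nodes that can still extend it, x: nodes already tried.
        if not p and not x:
            if len(r) >= min_size:
                cliques.append(frozenset(r))
            return
        # Pivot on the node with the most neighbors in p to prune branches.
        pivot = max(p | x, key=lambda n: len(adj[n] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


# Toy network (hypothetical): two triangles joined by the bridge c-d.
toy_net = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e", "f"},
    "e": {"d", "f"},
    "f": {"d", "e"},
}
modules = maximal_cliques(toy_net, min_size=3)
```

Proteins that appear in more than one returned clique would be natural candidates for the linking proteins mentioned in the abstract; in this toy network the two modules do not overlap.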

The PathLinker app: Connect the dots in protein interaction networks

F1000Research ◽

10.12688/f1000research.9909.1 ◽

2017 ◽

Vol 6 ◽

pp. 58 ◽

Cited By ~ 12

Author(s):

Daniel P. Gil ◽

Jeffrey N. Law ◽

T. M. Murali

Keyword(s):

Transcription Factors ◽

Signaling Pathways ◽

Signaling Pathway ◽

Protein Interaction ◽

Interaction Network ◽

Interaction Networks ◽

Graph Theoretic ◽

Manual Curation

PathLinker is a graph-theoretic algorithm for reconstructing the interactions in a signaling pathway of interest. It efficiently computes multiple short paths within a background protein interaction network from the receptors to transcription factors (TFs) in a pathway. We originally developed PathLinker to complement manual curation of signaling pathways, which is slow and painstaking. The method can be used in general to connect any set of sources to any set of targets in an interaction network. The app presented here makes the PathLinker functionality available to Cytoscape users. We present an example where we used PathLinker to compute and analyze the network of interactions connecting proteins that are perturbed by the drug lovastatin.
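
The idea of computing multiple short paths from a set of sources to a set of targets can be illustrated with a best-first enumeration of loop-free paths. This is not PathLinker's algorithm and is far less efficient on large networks (the search can grow exponentially); it assumes non-negative edge costs, stops each path at the first target it reaches, and uses a hypothetical function name, k_shortest_paths.

```python
import heapq


def k_shortest_paths(adj, sources, targets, k):
    """Return up to k lowest-cost loop-free paths from any source to any target.

    adj maps node -> {neighbor: non-negative cost}. Partial paths are expanded in
    order of increasing cost, so completed paths are found cheapest first.
    """
    heap = [(0.0, [s]) for s in sorted(sources)]
    heapq.heapify(heap)
    found = []
    while heap and len(found) < k:
        cost, path = heapq.heappop(heap)
        node = path[-1]
        if node in targets:
            found.append((cost, path))
            continue  # simplification: do not extend a path beyond a target
        for nxt, weight in adj.get(node, {}).items():
            if nxt not in path:  # keep paths loop-free
                heapq.heappush(heap, (cost + weight, path + [nxt]))
    return found


# Toy directed network (hypothetical): receptor R, intermediates A and B, transcription factor T.
toy_pathway = {
    "R": {"A": 1.0, "B": 1.5},
    "A": {"T": 1.0, "B": 1.0},
    "B": {"T": 1.0},
}
top_paths = k_shortest_paths(toy_pathway, sources={"R"}, targets={"T"}, k=3)
```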