Super.Complex: A supervised machine learning pipeline for molecular complex detection in protein-interaction networks

PLoS ONE ◽  
2021 ◽  
Vol 16 (12) ◽  
pp. e0262056
Author(s):  
Meghana Venkata Palukuri ◽  
Edward M. Marcotte

Characterization of protein complexes, i.e. sets of proteins assembling into a single larger physical entity, is important because such assemblies play many essential roles in cells, such as gene regulation. From networks of protein-protein interactions, potential protein complexes can be identified computationally through the application of community detection methods, which flag groups of entities interacting with each other in certain patterns. Most community detection algorithms tend to be unsupervised and assume that communities are dense network subgraphs, which is not always true, as protein complexes can exhibit diverse network topologies. The few existing supervised machine learning methods are serial and can potentially be improved in terms of accuracy and scalability by using better-suited machine learning models and parallel algorithms. Here, we present Super.Complex, a distributed, supervised AutoML-based pipeline for overlapping community detection in weighted networks. We also propose three new evaluation measures for the outstanding issue of comparing sets of learned and known communities satisfactorily. Super.Complex learns a community fitness function from known communities using an AutoML method and applies this fitness function to detect new communities. A heuristic local search algorithm finds maximally scoring communities, and a parallel implementation can be run on a computer cluster for scaling to large networks. On a yeast protein-interaction network, Super.Complex outperforms 6 other supervised and 4 unsupervised methods. Application of Super.Complex to a human protein-interaction network with ~8k nodes and ~60k edges yields 1,028 protein complexes, of which 234 are linked to SARS-CoV-2, the virus that causes COVID-19; 111 uncharacterized proteins are present in 103 learned complexes. Super.Complex is generalizable, with the ability to improve results by incorporating domain-specific features. Learned community characteristics can also be transferred from existing applications to detect communities in a new application with no known communities. Code and interactive visualizations of learned human protein complexes are freely available at: https://sites.google.com/view/supercomplex/super-complex-v3-0.
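To make the idea of a learned community fitness function more concrete, the sketch below computes a few simple topological features for a candidate community in a weighted network; a supervised model could then be trained on such features. The feature set (node count, density, mean edge weight), the function name community_features and the edge-dictionary layout are illustrative assumptions, not the features or code used by Super.Complex.

```python
from itertools import combinations


def community_features(weights, nodes):
    """Return simple topological features of the subgraph induced by `nodes`.

    `weights` maps frozenset({u, v}) -> edge weight (a hypothetical layout).
    """
    nodes = set(nodes)
    n = len(nodes)
    # Collect the weights of all edges whose endpoints are both in `nodes`.
    edge_weights = [
        weights[frozenset(pair)]
        for pair in combinations(sorted(nodes), 2)
        if frozenset(pair) in weights
    ]
    possible = n * (n - 1) / 2  # number of node pairs in the candidate
    density = len(edge_weights) / possible if possible else 0.0
    mean_weight = sum(edge_weights) / len(edge_weights) if edge_weights else 0.0
    return {"n_nodes": n, "density": density, "mean_weight": mean_weight}


# Toy weighted network: a tight triangle A-B-C with a weak link to D.
toy_weights = {
    frozenset({"A", "B"}): 0.9,
    frozenset({"B", "C"}): 0.8,
    frozenset({"A", "C"}): 0.7,
    frozenset({"C", "D"}): 0.2,
}
features = community_features(toy_weights, ["A", "B", "C"])
```

Feature vectors like this one, computed for known complexes and for negative examples, are the kind of input a classifier or AutoML search could use to score unseen candidates.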

2021 ◽  
Author(s):  
Meghana Venkata Palukuri ◽  
Edward M Marcotte

Protein complexes can be computationally identified from protein-interaction networks with community detection methods, suggesting new multi-protein assemblies. Most community detection algorithms tend to be un- or semi-supervised and assume that communities are dense network subgraphs, which is not always true, as protein complexes can exhibit diverse network topologies. The few existing supervised machine learning methods are serial and can potentially be improved in terms of accuracy and scalability by using better-suited machine learning models and by using parallel algorithms, respectively. Here, we present Super.Complex, a distributed supervised machine learning pipeline for community detection in networks. Super.Complex learns a community fitness function from known communities using an AutoML method and applies this fitness function to detect new communities. A heuristic local search algorithm finds maximally scoring communities with epsilon-greedy and pseudo-metropolis criteria, and an embarrassingly parallel implementation can be run on a computer cluster for scaling to large networks. To evaluate Super.Complex, we propose three new measures for the still outstanding issue of comparing sets of learned and known communities. On a yeast protein-interaction network, Super.Complex outperforms 6 other supervised and 4 unsupervised methods. Application of Super.Complex to a human protein-interaction network with ~8k nodes and ~60k edges yields 1,028 protein complexes, of which 234 are linked to SARS-CoV-2; 111 uncharacterized proteins are present in 103 learned complexes. Super.Complex is generalizable and can be used in different applications of community detection, with the ability to improve results by incorporating domain-specific features. Learned community characteristics can also be transferred from existing applications to detect communities in a new application with no known communities. Code and interactive visualizations of learned human protein complexes are freely available at: https://sites.google.com/view/supercomplex/super-complex-v3-0.
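The heuristic local search described above can be illustrated with a minimal sketch. The abstract does not spell out the exact criteria, so the code assumes one plausible reading: a community is grown from a seed edge, the next node is chosen at random with probability epsilon and greedily otherwise, and a non-improving step is still accepted with a small fixed probability (the assumed "pseudo-Metropolis" rule). The density function is only a stand-in for the learned fitness function, and all names (grow_community, neighbors_of, accept_prob) are hypothetical.

```python
import random


def neighbors_of(adj, community):
    """Nodes adjacent to the community but not yet in it."""
    return {v for u in community for v in adj[u]} - community


def density(adj, community):
    """Stand-in fitness: fraction of possible edges that are present.

    Assumption: a learned fitness function would be plugged in here instead.
    """
    n = len(community)
    if n < 2:
        return 0.0
    # Each internal edge is seen from both endpoints, hence the division by 2.
    edges = sum(1 for u in community for v in adj[u] if v in community) / 2
    return edges / (n * (n - 1) / 2)


def grow_community(adj, seed_edge, score, epsilon=0.1, accept_prob=0.05,
                   max_steps=20, rng=None):
    """Grow a community from a seed edge by heuristic local search."""
    rng = rng or random.Random(0)
    community = set(seed_edge)
    current = score(adj, community)
    for _ in range(max_steps):
        candidates = sorted(neighbors_of(adj, community))
        if not candidates:
            break
        if rng.random() < epsilon:
            # Exploration step: pick any neighboring node.
            pick = rng.choice(candidates)
        else:
            # Greedy step: pick the neighbor giving the best score.
            pick = max(candidates, key=lambda v: score(adj, community | {v}))
        new = score(adj, community | {pick})
        # Accept improvements (or ties); otherwise accept with a small
        # fixed probability, and stop growing if the step is rejected.
        if new >= current or rng.random() < accept_prob:
            community.add(pick)
            current = new
        else:
            break
    return community, current


# Toy network: triangle A-B-C with a tail C-D-E.
toy_adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}
found, fitness = grow_community(toy_adj, ("A", "B"), density,
                                epsilon=0.0, accept_prob=0.0)
```

With epsilon and accept_prob both set to 0 the search is purely greedy and stops at the triangle. Because each seed is grown independently, separate seeds could be handed to separate workers, which is one way to read "embarrassingly parallel" here.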


2007 ◽  
Vol 4 (3) ◽  
pp. 198-207 ◽  
Author(s):  
Matthias E. Futschik ◽  
Gautam Chaurasia ◽  
Anna Tschaut ◽  
Jenny Russ ◽  
M. Madan Babu ◽  
...  

Summary: Modularity is a major design principle in interaction networks. Various studies have shown that protein interaction networks in prokaryotes and eukaryotes display a modular structure. A majority of the studies have been performed for the yeast interaction network, for which data have become abundant. The systematic examination of the human protein interaction network, however, is still in an early phase. To assess whether the human interaction network similarly displays a modular structure, we assembled a large protein network consisting of over 30,000 interactions. More than 670 modules were subsequently identified based on the detection of cliques. Inspection showed that these modules included numerous known protein complexes. The extracted modules were scrutinized for their coherency with respect to function, localization and expression, thereby allowing us to distinguish between stable and dynamic modules. Finally, the examination of the overlap between modules identified key proteins linking distinct molecular processes.
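Clique-based module detection of the kind summarized above can be sketched as follows. The abstract does not name the algorithm used, so this example assumes the basic Bron-Kerbosch recursion for enumerating maximal cliques and then lists proteins that fall in more than one clique. The toy network and the names maximal_cliques and shared are hypothetical.

```python
from collections import Counter


def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates to extend it, x: already explored.
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


# Toy interaction network: triangle A-B-C plus a chain C-D-E.
toy_adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}
cliques = maximal_cliques(toy_adj)

# Proteins appearing in more than one clique link otherwise distinct modules.
membership = Counter(node for clique in cliques for node in clique)
shared = {node for node, count in membership.items() if count > 1}
```

In this toy network the cliques are {A, B, C}, {C, D} and {D, E}, so C and D are the proteins shared between modules. A minimum clique size would normally be applied before treating cliques as modules.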


2020 ◽  
Vol 2020 ◽  
pp. 1-7
Author(s):  
YuHang Zhang ◽  
Tao Zeng ◽  
Lei Chen ◽  
ShiJian Ding ◽  
Tao Huang ◽  
...  

Coronaviruses are crown-shaped viruses that were first identified in the 1960s; three typical examples of recent coronavirus disease outbreaks are severe acute respiratory syndrome (SARS), Middle East respiratory syndrome (MERS), and COVID-19. In particular, COVID-19 is currently causing a worldwide pandemic, threatening human health globally. The identification of viral pathogenic mechanisms is important for developing effective drugs and targeted clinical treatments. The delayed revelation of viral infectious mechanisms is currently one of the technical obstacles in the prevention and treatment of infectious diseases. In this study, we proposed a random walk model to identify the potential pathological mechanisms of COVID-19 on a virus–human protein interaction network, and we identified a group of proteins that have already been determined to be potentially important for COVID-19 infection and for similar SARS infections, which may help in further developing drugs and targeted therapeutic methods against COVID-19. Moreover, we constructed a standard computational workflow for predicting the pathological biomarkers and related pharmacological targets of infectious diseases.
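As an illustration of how a random walk can prioritize proteins on a virus-human interaction network, the sketch below uses random walk with restart, a common variant; the abstract does not state which random walk model the authors used, so this choice, the restart probability and the toy network are all assumptions. Nodes that end up with higher stationary probability are "closer" to the viral seed in network terms.

```python
def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Iterate p <- restart * p0 + (1 - restart) * W p until convergence.

    Assumes an undirected network in which every node has a neighbor.
    """
    nodes = sorted(adj)
    # Restart distribution: uniform over the seed nodes.
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            if not adj[u]:
                continue
            # Spread the walking mass of u evenly over its neighbors.
            share = (1.0 - restart) * p[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Toy network: viral protein V binds H1, followed by a chain H1-H2-H3.
toy_adj = {
    "V": {"H1"},
    "H1": {"V", "H2"},
    "H2": {"H1", "H3"},
    "H3": {"H2"},
}
scores = random_walk_with_restart(toy_adj, seeds={"V"})
ranking = sorted((n for n in scores if n != "V"), key=scores.get, reverse=True)
```

On this chain the human proteins are ranked by their distance from the viral seed, with H1 first and H3 last.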


2014 ◽  
Vol 2014 ◽  
pp. 1-11 ◽  
Author(s):  
Ning Zhang ◽  
Min Jiang ◽  
Tao Huang ◽  
Yu-Dong Cai

The recently emerging Influenza A/H7N9 virus is reported to be able to infect humans and cause mortality. However, viral and host factors associated with the infection are poorly understood. The “guilt by association” rule suggests that interacting proteins share the same or similar functions and hence may be involved in the same pathway. In this study, we developed a computational method to identify Influenza A/H7N9 virus infection-related human genes based on this rule from the shortest paths in a virus-human protein interaction network. Finally, we screened out the 20 most significant human genes, which could be potential infection-related genes, providing guidelines for further experimental validation. Analysis of the 20 genes showed that they were enriched in protein binding, saccharide or polysaccharide metabolism-related pathways and oxidative phosphorylation pathways. We also compared the results with those from human rhinovirus (HRV) and respiratory syncytial virus (RSV) obtained by the same method. The comparison indicated that saccharide or polysaccharide metabolism-related pathways might be especially associated with H7N9 infection. These results could shed some light on the virus infection mechanism, providing a basis for future experimental biology studies and for the development of effective strategies for H7N9 clinical therapies.
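A shortest-path screen of the kind described above might look like the following sketch. It assumes, for illustration only, that candidate genes are ranked by how often they appear as interior nodes on shortest paths between known infection-related seed proteins, and that a single breadth-first shortest path per seed pair is enough; the abstract does not give these details. The names shortest_path and rank_intermediates and the toy network are hypothetical.

```python
from collections import Counter, deque
from itertools import combinations


def shortest_path(adj, source, target):
    """Return one shortest path from source to target via BFS, or None."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            # Walk the parent links back to the source.
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in sorted(adj[u]):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None


def rank_intermediates(adj, seeds):
    """Count how often each node lies strictly inside a seed-to-seed path."""
    counts = Counter()
    for s, t in combinations(sorted(seeds), 2):
        path = shortest_path(adj, s, t)
        if path:
            counts.update(path[1:-1])  # drop the two seed endpoints
    return counts


# Toy network: seeds S1, S2, S3 connected through genes G1 and G2.
toy_adj = {
    "S1": {"G1"},
    "S2": {"G1"},
    "S3": {"G2"},
    "G1": {"S1", "S2", "G2"},
    "G2": {"G1", "S3"},
}
counts = rank_intermediates(toy_adj, {"S1", "S2", "S3"})
```

Here G1 lies on all three seed-to-seed paths and G2 on two of them, so G1 would be ranked first. A real screen would also need a significance test, for example against randomized networks, before selecting a top list.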

