SolidBin: improving metagenome binning with semi-supervised normalized cut

Abstract Motivation Metagenomic contig binning is an important computational problem in metagenomic research, which aims to cluster contigs from the same genome into the same group. Unlike classical clustering problem, contig binning can utilize known relationships among some of the contigs or the taxonomic identity of some contigs. However, the current state-of-the-art contig binning methods do not make full use of the additional biological information except the coverage and sequence composition of the contigs. Results We developed a novel contig binning method, Semi-supervised Spectral Normalized Cut for Binning (SolidBin), based on semi-supervised spectral clustering. Using sequence feature similarity and/or additional biological information, such as the reliable taxonomy assignments of some contigs, SolidBin constructs two types of prior information: must-link and cannot-link constraints. Must-link constraints mean that the pair of contigs should be clustered into the same group, while cannot-link constraints mean that the pair of contigs should be clustered in different groups. These constraints are then integrated into a classical spectral clustering approach, normalized cut, for improved contig binning. The performance of SolidBin is compared with five state-of-the-art genome binners, CONCOCT, COCACOLA, MaxBin, MetaBAT and BMC3C on five next-generation sequencing benchmark datasets including simulated multi- and single-sample datasets and real multi-sample datasets. The experimental results show that, SolidBin has achieved the best performance in terms of F-score, Adjusted Rand Index and Normalized Mutual Information, especially while using the real datasets and the single-sample dataset. Availability and implementation https://github.com/sufforest/SolidBin. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

CMF-Impute: an accurate imputation tool for single-cell RNA-seq data

Bioinformatics ◽

10.1093/bioinformatics/btaa109 ◽

2020 ◽

Vol 36 (10) ◽

pp. 3139-3147 ◽

Cited By ~ 5

Author(s):

Junlin Xu ◽

Lijun Cai ◽

Bo Liao ◽

Wen Zhu ◽

JiaLiang Yang

Keyword(s):

Single Cell ◽

Pearson Correlation ◽

Cell Lineage ◽

Supplementary Information ◽

Adjusted Rand Index ◽

Clustering Methods ◽

Normalized Mutual Information ◽

Cell Subpopulations ◽

Matlab Package ◽

Expression Matrix

Abstract Motivation Single-cell RNA-sequencing (scRNA-seq) technology provides a powerful tool for investigating cell heterogeneity and cell subpopulations by allowing the quantification of gene expression at single-cell level. However, scRNA-seq data analysis remains challenging because of various technical noises such as dropout events (i.e. excessive zero counts in the expression matrix). Results By taking consideration of the association among cells and genes, we propose a novel collaborative matrix factorization-based method called CMF-Impute to impute the dropout entries of a given scRNA-seq expression matrix. We test CMF-Impute and compare it with the other five state-of-the-art methods on six popular real scRNA-seq datasets of various sizes and three simulated datasets. For simulated datasets, CMF-Impute outperforms other methods in imputing the closest dropouts to the original expression values as evaluated by both the sum of squared error and Pearson correlation coefficient. For real datasets, CMF-Impute achieves the most accurate cell classification results in spite of the choice of different clustering methods like SC3 or T-SNE followed by K-means as evaluated by both adjusted rand index and normalized mutual information. Finally, we demonstrate that CMF-Impute is powerful in reconstructing cell-to-cell and gene-to-gene correlation, and in inferring cell lineage trajectories. Availability and implementation CMF-Impute is written as a Matlab package which is available at https://github.com/xujunlin123/CMFImpute.git. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Graphlet Laplacians for topology-function and topology-disease relationships

Bioinformatics ◽

10.1093/bioinformatics/btz455 ◽

2019 ◽

Vol 35 (24) ◽

pp. 5226-5234 ◽

Cited By ~ 2

Author(s):

Sam F L Windels ◽

Noël Malod-Dognin ◽

Nataša Pržulj

Keyword(s):

Biological Networks ◽

Spectral Clustering ◽

Biological Information ◽

Supplementary Information ◽

Biological Functions ◽

Topological Information ◽

Laplacian Matrices ◽

And Topology ◽

Spectral Embedding ◽

Pan Cancer

Abstract Motivation Laplacian matrices capture the global structure of networks and are widely used to study biological networks. However, the local structure of the network around a node can also capture biological information. Local wiring patterns are typically quantified by counting how often a node touches different graphlets (small, connected, induced sub-graphs). Currently available graphlet-based methods do not consider whether nodes are in the same network neighbourhood. To combine graphlet-based topological information and membership of nodes to the same network neighbourhood, we generalize the Laplacian to the Graphlet Laplacian, by considering a pair of nodes to be ‘adjacent’ if they simultaneously touch a given graphlet. Results We utilize Graphlet Laplacians to generalize spectral embedding, spectral clustering and network diffusion. Applying Graphlet Laplacian-based spectral embedding, we visually demonstrate that Graphlet Laplacians capture biological functions. This result is quantified by applying Graphlet Laplacian-based spectral clustering, which uncovers clusters enriched in biological functions dependent on the underlying graphlet. We explain the complementarity of biological functions captured by different Graphlet Laplacians by showing that they capture different local topologies. Finally, diffusing pan-cancer gene mutation scores based on different Graphlet Laplacians, we find complementary sets of cancer-related genes. Hence, we demonstrate that Graphlet Laplacians capture topology-function and topology-disease relationships in biological networks. Availability and implementation http://www0.cs.ucl.ac.uk/staff/natasa/graphlet-laplacian/index.html Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Orthogonal and Nonnegative Graph Reconstruction for Large Scale Clustering

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2017/251 ◽

2017 ◽

Cited By ~ 2

Author(s):

Junwei Han ◽

Kai Xiong ◽

Feiping Nie

Keyword(s):

Spectral Clustering ◽

Large Scale ◽

Computational Cost ◽

Post Processing ◽

Normalized Cut ◽

Clustering Problem ◽

Novel Approach ◽

Graph Reconstruction ◽

Final Cluster ◽

High Computational Cost

Spectral clustering has been widely used due to its simplicity for solving graph clustering problem in recent years. However, it suffers from the high computational cost as data grow in scale, and is limited by the performance of post-processing. To address these two problems simultaneously, in this paper, we propose a novel approach denoted by orthogonal and nonnegative graph reconstruction (ONGR) that scales linearly with the data size. For the relaxation of Normalized Cut, we add nonnegative constraint to the objective. Due to the nonnegativity, ONGR offers interpretability that the final cluster labels can be directly obtained without post-processing. Extensive experiments on clustering tasks demonstrate the effectiveness of the proposed method.

Download Full-text

CorGAT: a tool for the functional annotation of SARS-CoV-2 genomes

Bioinformatics ◽

10.1093/bioinformatics/btaa1047 ◽

2020 ◽

Author(s):

Matteo Chiara ◽

Federico Zambelli ◽

Marco Antonio Tangaro ◽

Pietro Mandreoli ◽

David S Horner ◽

...

Keyword(s):

Functional Annotation ◽

Ad Hoc ◽

State Of The Art ◽

Supplementary Information ◽

Genomic Sequences ◽

Supplementary Data ◽

Evolutionary Patterns ◽

Genomic Variants ◽

Art Methods ◽

Available Resources

Abstract Summary While over 200 000 genomic sequences are currently available through dedicated repositories, ad hoc methods for the functional annotation of SARS-CoV-2 genomes do not harness all currently available resources for the annotation of functionally relevant genomic sites. Here, we present CorGAT, a novel tool for the functional annotation of SARS-CoV-2 genomic variants. By comparisons with other state of the art methods we demonstrate that, by providing a more comprehensive and rich annotation, our method can facilitate the identification of evolutionary patterns in the genome of SARS-CoV-2. Availabilityand implementation Galaxy http://corgat.cloud.ba.infn.it/galaxy; software: https://github.com/matteo14c/CorGAT/tree/Revision_V1; docker: https://hub.docker.com/r/laniakeacloud/galaxy_corgat. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Robust partial reference-free cell composition estimation from tissue expression

Bioinformatics ◽

10.1093/bioinformatics/btaa184 ◽

2020 ◽

Vol 36 (11) ◽

pp. 3431-3438

Author(s):

Ziyi Li ◽

Zhenxing Guo ◽

Ying Cheng ◽

Peng Jin ◽

Hao Wu

Keyword(s):

Expression Profiles ◽

Gene Expression Profiles ◽

Real Data ◽

Estimation Procedure ◽

Free Cell ◽

Biological Information ◽

Supplementary Information ◽

Tissue Samples ◽

Cell Composition ◽

Heterogeneous Tissues

Abstract Motivation In the analysis of high-throughput omics data from tissue samples, estimating and accounting for cell composition have been recognized as important steps. High cost, intensive labor requirements and technical limitations hinder the cell composition quantification using cell-sorting or single-cell technologies. Computational methods for cell composition estimation are available, but they are either limited by the availability of a reference panel or suffer from low accuracy. Results We introduce TOols for the Analysis of heterogeneouS Tissues TOAST/-P and TOAST/+P, two partial reference-free algorithms for estimating cell composition of heterogeneous tissues based on their gene expression profiles. TOAST/-P and TOAST/+P incorporate additional biological information, including cell-type-specific markers and prior knowledge of compositions, in the estimation procedure. Extensive simulation studies and real data analyses demonstrate that the proposed methods provide more accurate and robust cell composition estimation than existing methods. Availability and implementation The proposed methods TOAST/-P and TOAST/+P are implemented as part of the R/Bioconductor package TOAST at https://bioconductor.org/packages/TOAST. Contact [email protected] or [email protected] Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Adaptive Structure Concept Factorization for Multiview Clustering

Neural Computation ◽

10.1162/neco_a_01055 ◽

2018 ◽

Vol 30 (4) ◽

pp. 1080-1103 ◽

Cited By ~ 10

Author(s):

Kun Zhan ◽

Jinhui Shi ◽

Jing Wang ◽

Haibo Wang ◽

Yuange Xie

Keyword(s):

Nonnegative Matrix Factorization ◽

State Of The Art ◽

Nonnegative Matrix ◽

Adaptive Method ◽

Data Sets ◽

Clustering Methods ◽

Normalized Mutual Information ◽

Adaptive Structure ◽

Concept Factorization ◽

Multiview Clustering

Most existing multiview clustering methods require that graph matrices in different views are computed beforehand and that each graph is obtained independently. However, this requirement ignores the correlation between multiple views. In this letter, we tackle the problem of multiview clustering by jointly optimizing the graph matrix to make full use of the data correlation between views. With the interview correlation, a concept factorization–based multiview clustering method is developed for data integration, and the adaptive method correlates the affinity weights of all views. This method differs from nonnegative matrix factorization–based clustering methods in that it can be applicable to data sets containing negative values. Experiments are conducted to demonstrate the effectiveness of the proposed method in comparison with state-of-the-art approaches in terms of accuracy, normalized mutual information, and purity.

Download Full-text

DeepMAsED: evaluating the quality of metagenomic assemblies

Bioinformatics ◽

10.1093/bioinformatics/btaa124 ◽

2020 ◽

Vol 36 (10) ◽

pp. 3011-3017 ◽

Cited By ~ 5

Author(s):

Olga Mineeva ◽

Mateo Rojas-Carulla ◽

Ruth E Ley ◽

Bernhard Schölkopf ◽

Nicholas D Youngblut

Keyword(s):

Large Scale ◽

State Of The Art ◽

Ground Truth ◽

Supplementary Information ◽

Learning Approach ◽

Wide Range ◽

Metagenome Assembly ◽

Model Training ◽

Reference Genomes

Abstract Motivation Methodological advances in metagenome assembly are rapidly increasing in the number of published metagenome assemblies. However, identifying misassemblies is challenging due to a lack of closely related reference genomes that can act as pseudo ground truth. Existing reference-free methods are no longer maintained, can make strong assumptions that may not hold across a diversity of research projects, and have not been validated on large-scale metagenome assemblies. Results We present DeepMAsED, a deep learning approach for identifying misassembled contigs without the need for reference genomes. Moreover, we provide an in silico pipeline for generating large-scale, realistic metagenome assemblies for comprehensive model training and testing. DeepMAsED accuracy substantially exceeds the state-of-the-art when applied to large and complex metagenome assemblies. Our model estimates a 1% contig misassembly rate in two recent large-scale metagenome assembly publications. Conclusions DeepMAsED accurately identifies misassemblies in metagenome-assembled contigs from a broad diversity of bacteria and archaea without the need for reference genomes or strong modeling assumptions. Running DeepMAsED is straight-forward, as well as is model re-training with our dataset generation pipeline. Therefore, DeepMAsED is a flexible misassembly classifier that can be applied to a wide range of metagenome assembly projects. Availability and implementation DeepMAsED is available from GitHub at https://github.com/leylabmpi/DeepMAsED. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Inference of an Integrative, Executable Network for Rheumatoid Arthritis Combining Data-Driven Machine Learning Approaches and a State-of-the-Art Mechanistic Disease Map

Journal of Personalized Medicine ◽

10.3390/jpm11080785 ◽

2021 ◽

Vol 11 (8) ◽

pp. 785

Author(s):

Quentin Miagoux ◽

Vidisha Singh ◽

Dereck de Mézquita ◽

Valerie Chaudru ◽

Mohamed Elati ◽

...

Keyword(s):

Rheumatoid Arthritis ◽

Machine Learning ◽

State Of The Art ◽

Biological Information ◽

Response To Treatment ◽

Patient Specific ◽

Learning Approaches ◽

Data Types ◽

Disease Heterogeneity ◽

Combining Data

Rheumatoid arthritis (RA) is a multifactorial, complex autoimmune disease that involves various genetic, environmental, and epigenetic factors. Systems biology approaches provide the means to study complex diseases by integrating different layers of biological information. Combining multiple data types can help compensate for missing or conflicting information and limit the possibility of false positives. In this work, we aim to unravel mechanisms governing the regulation of key transcription factors in RA and derive patient-specific models to gain more insights into the disease heterogeneity and the response to treatment. We first use publicly available transcriptomic datasets (peripheral blood) relative to RA and machine learning to create an RA-specific transcription factor (TF) co-regulatory network. The TF cooperativity network is subsequently enriched in signalling cascades and upstream regulators using a state-of-the-art, RA-specific molecular map. Then, the integrative network is used as a template to analyse patients’ data regarding their response to anti-TNF treatment and identify master regulators and upstream cascades affected by the treatment. Finally, we use the Boolean formalism to simulate in silico subparts of the integrated network and identify combinations and conditions that can switch on or off the identified TFs, mimicking the effects of single and combined perturbations.

Download Full-text

DeepPurpose: a deep learning library for drug–target interaction prediction

Bioinformatics ◽

10.1093/bioinformatics/btaa1005 ◽

2020 ◽

Author(s):

Kexin Huang ◽

Tianfan Fu ◽

Lucas M Glass ◽

Marinka Zitnik ◽

Cao Xiao ◽

...

Keyword(s):

Deep Learning ◽

Drug Target ◽

Prediction Models ◽

State Of The Art ◽

Supplementary Information ◽

Target Interaction ◽

Interaction Prediction ◽

Computer Scientists ◽

Benchmark Datasets ◽

Biomedical Field

Abstract Summary Accurate prediction of drug–target interactions (DTI) is crucial for drug discovery. Recently, deep learning (DL) models for show promising performance for DTI prediction. However, these models can be difficult to use for both computer scientists entering the biomedical field and bioinformaticians with limited DL experience. We present DeepPurpose, a comprehensive and easy-to-use DL library for DTI prediction. DeepPurpose supports training of customized DTI prediction models by implementing 15 compound and protein encoders and over 50 neural architectures, along with providing many other useful features. We demonstrate state-of-the-art performance of DeepPurpose on several benchmark datasets. Availability and implementation https://github.com/kexinhuang12345/DeepPurpose. Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

A Scene Text Detector for Text with Arbitrary Shapes

Mathematical Problems in Engineering ◽

10.1155/2020/8916028 ◽

2020 ◽

Vol 2020 ◽

pp. 1-11

Author(s):

Weijia Wu ◽

Jici Xing ◽

Cheng Yang ◽

Yuxing Wang ◽

Hong Zhou

Keyword(s):

State Of The Art ◽

Recognition Task ◽

Text Detection ◽

Supplementary Information ◽

Complex Environment ◽

Irregular Shapes ◽

Scene Text Detection ◽

Scene Text ◽

F Measure ◽

Instance Segmentation

The performance of text detection is crucial for the subsequent recognition task. Currently, the accuracy of the text detector still needs further improvement, particularly those with irregular shapes in a complex environment. We propose a pixel-wise method based on instance segmentation for scene text detection. Specifically, a text instance is split into five components: a Text Skeleton and four Directional Pixel Regions, then restoring itself based on these elements and receiving supplementary information from other areas when one fails. Besides, a Confidence Scoring Mechanism is designed to filter characters similar to text instances. Experiments on several challenging benchmarks demonstrate that our method achieves state-of-the-art results in scene text detection with an F-measure of 84.6% on Total-Text and 86.3% on CTW1500.

Download Full-text