Simultaneous clustering of multiview biomedical data using manifold optimization

Abstract Motivation Multiview clustering has attracted much attention in recent years. Several models and algorithms have been proposed for finding the clusters. However, these methods are developed either to find the consistent/common clusters across different views, or to identify the differential clusters among different views. In reality, both consistent and differential clusters may exist in multiview datasets. Thus, developing simultaneous clustering methods that can identify both the consistent and the differential clusters is of great importance. Results In this paper, we propose a method for simultaneous clustering of multiview data based on manifold optimization. The binary optimization model for finding the clusters is relaxed to a real-valued optimization problem on the Stiefel manifold, which is solved by a line-search algorithm on the manifold. We applied the proposed method to both simulation data and four real datasets from TCGA. Both studies show that when the underlying clusters are consistent, our method performs competitively with the state-of-the-art algorithms. When there are differential clusters, our method performs much better. In the real data study, we performed experiments on cancer stratification and differential cluster (module) identification across multiple cancer subtypes. For the patients of different subtypes, both consistent clusters and differential clusters are identified at the same time. The proposed method identifies more clusters that are enriched in gene ontology terms and KEGG pathways. The differential clusters could be used to explain the different mechanisms of cancer development in the patients of different subtypes. Availability and implementation Codes can be downloaded from: http://homepage.fudan.edu.cn/sqzhang/files/2018/12/MVCMOcode.zip. Supplementary information Supplementary data are available at Bioinformatics online.
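The abstract does not give the objective function or the details of the line search, so the following is only a minimal sketch of the general technique it names: gradient ascent with backtracking line search over matrices with orthonormal columns (the Stiefel manifold). The objective trace(XᵀAX), the QR retraction, and all function names are assumptions chosen for illustration, not the authors' model or code.

```python
import numpy as np

def qr_retraction(Y):
    """Map a full-rank n x k matrix back onto the Stiefel manifold via thin QR."""
    Q, R = np.linalg.qr(Y)
    # Flip column signs so that diag(R) > 0, which makes the factor unique.
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0
    return Q * signs

def riemannian_gradient(X, G):
    """Project the Euclidean gradient G onto the tangent space at X."""
    XtG = X.T @ G
    return G - X @ (XtG + XtG.T) / 2.0

def objective(A, X):
    """Illustrative objective (an assumption): trace(X^T A X) for symmetric A."""
    return np.trace(X.T @ A @ X)

def stiefel_line_search(A, k, iters=200, seed=0):
    """Maximize trace(X^T A X) subject to X^T X = I using Armijo backtracking."""
    rng = np.random.default_rng(seed)
    X = qr_retraction(rng.standard_normal((A.shape[0], k)))
    for _ in range(iters):
        grad = riemannian_gradient(X, 2.0 * A @ X)
        gnorm2 = np.sum(grad * grad)
        if gnorm2 < 1e-12:
            break
        step = 1.0
        while step > 1e-10:
            X_new = qr_retraction(X + step * grad)
            # Accept the step only if it gives a sufficient increase.
            if objective(A, X_new) >= objective(A, X) + 1e-4 * step * gnorm2:
                X = X_new
                break
            step *= 0.5
        else:
            break  # no acceptable step found; stop iterating
    return X
```

Every accepted step keeps the iterate on the manifold through the retraction, so the orthonormality constraint holds throughout rather than only approximately at the end.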


Pooled variable scaling for cluster analysis

Bioinformatics ◽

10.1093/bioinformatics/btaa243 ◽

2020 ◽

Vol 36 (12) ◽

pp. 3849-3855

Author(s):

Jakob Raymaekers ◽

Ruben H Zamar

Keyword(s):

Cluster Analysis ◽

Scale Invariance ◽

Real Data ◽

Supplementary Information ◽

Clustering Methods ◽

Medical Sciences ◽

Scale Invariant ◽

Measurement Units ◽

Scaling Variables ◽

Invariant Distances

Abstract Motivation Many popular clustering methods are not scale-invariant because they are based on Euclidean distances. Even methods using scale-invariant distances, such as the Mahalanobis distance, lose their scale invariance when combined with regularization and/or variable selection. Therefore, the results from these methods are very sensitive to the measurement units of the clustering variables. A simple way to achieve scale invariance is to scale the variables before clustering. However, scaling variables is a very delicate issue in cluster analysis: A bad choice of scaling can adversely affect the clustering results. On the other hand, reporting clustering results that depend on measurement units is not satisfactory. Hence, a safe and efficient scaling procedure is needed for applications in bioinformatics and medical sciences research. Results We propose a new approach for scaling prior to cluster analysis based on the concept of pooled variance. Unlike available scaling procedures, such as the SD and the range, our proposed scale avoids dampening the beneficial effect of informative clustering variables. We confirm through an extensive simulation study and applications to well-known real-data examples that the proposed scaling method is safe and generally useful. Finally, we use our approach to cluster a high-dimensional genomic dataset consisting of gene expression data for several specimens of breast cancer cells tissue obtained from human patients. Availability and implementation An R-implementation of the algorithms presented is available at https://wis.kuleuven.be/statdatascience/robust/software. Supplementary information Supplementary data are available at Bioinformatics online.
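To illustrate why a pooled scale can behave differently from the ordinary SD, here is a minimal sketch of the textbook pooled standard deviation for one variable. It assumes group labels are already available; the abstract does not say how the authors obtain a grouping, so `labels` and the function name `pooled_sd` are hypothetical and this is not their R implementation.

```python
import numpy as np

def pooled_sd(x, labels):
    """Pooled within-group standard deviation of one variable.

    `labels` is an assumed, pre-existing grouping of the observations.
    """
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    groups = np.unique(labels)
    # Sum of squared deviations from each group's own mean.
    within_ss = sum(((x[labels == g] - x[labels == g].mean()) ** 2).sum()
                    for g in groups)
    dof = len(x) - len(groups)
    return np.sqrt(within_ss / dof)

# A variable that separates two groups well.
x = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
labels = np.array([0, 0, 0, 1, 1, 1])

overall = np.std(x, ddof=1)     # inflated by the between-group gap
pooled = pooled_sd(x, labels)   # reflects only within-group spread
```

Dividing by the overall SD shrinks the gap between the two groups, whereas dividing by the pooled SD leaves it large relative to the within-group spread, which is the dampening effect the abstract refers to.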


Clustering by Detecting Density Peaks and Assigning Points by Similarity-First Search Based on Weighted K-Nearest Neighbors Graph

Complexity ◽

10.1155/2020/1731075 ◽

2020 ◽

Vol 2020 ◽

pp. 1-17

Author(s):

Qi Diao ◽

Yaping Dai ◽

Qichao An ◽

Weixing Li ◽

Xiaoxue Feng ◽

...

Keyword(s):

Clustering Algorithm ◽

Spatial Clustering ◽

Local Density ◽

Search Algorithm ◽

Real Data ◽

Nearest Neighbors ◽

Adjusted Rand Index ◽

Clustering Methods ◽

K Nearest Neighbors ◽

Density Peaks

This paper presents an improved clustering algorithm for categorizing data with arbitrary shapes. Most conventional clustering approaches work only with round-shaped clusters. Clustering by fast search and find of density peaks (DPC) can handle arbitrary shapes, but in some cases it is limited by how it defines density peaks and by its allocation strategy. To overcome these limitations, two improvements are proposed in this paper. To describe the clustering center more comprehensively, the definitions of local density and relative distance are fused with multiple distances, including K-nearest neighbors (KNN) and shared-nearest neighbors (SNN). A similarity-first search algorithm is designed to search for the best-matching cluster centers for noncenter points in a weighted KNN graph. Extensive comparison with several existing clustering methods, e.g., the traditional DPC algorithm, density-based spatial clustering of applications with noise (DBSCAN), affinity propagation (AP), FKNN-DPC, and K-means, has been carried out. Experiments based on synthetic data and real data show that the proposed clustering algorithm can outperform DPC, DBSCAN, AP, and K-means in terms of the clustering accuracy (ACC), the adjusted mutual information (AMI), and the adjusted Rand index (ARI).
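For orientation, the two quantities that the traditional DPC algorithm is built on can be sketched as follows: the local density ρ (number of neighbours within a cutoff) and the relative distance δ (distance to the nearest point of strictly higher density). This is the baseline formulation only; the KNN/SNN-fused definitions and the similarity-first search proposed in the paper are not reproduced here, and the function name and the cutoff `d_c` are illustrative assumptions.

```python
import numpy as np

def density_peaks_scores(X, d_c):
    """Return (rho, delta) of baseline DPC for points X with cutoff distance d_c."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Local density: neighbours closer than d_c, excluding the point itself.
    rho = (D < d_c).sum(axis=1) - 1
    delta = np.empty(len(X))
    for i in range(len(X)):
        higher = rho > rho[i]
        # Points of maximal density get the largest distance by convention.
        delta[i] = D[i, higher].min() if higher.any() else D[i].max()
    return rho, delta

points = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0], [13.0]])
rho, delta = density_peaks_scores(points, d_c=1.5)
```

Candidate cluster centers are the points where both ρ and δ are large; the remaining points are then assigned to centers, which is the allocation step the paper sets out to improve.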


Optimizing target nodes selection for the control energy of directed complex networks

Scientific Reports ◽

10.1038/s41598-020-75101-w ◽

2020 ◽

Vol 10 (1) ◽

Author(s):

Hong Chen ◽

Ee Hou Yong

Keyword(s):

Search Algorithm ◽

Practical Importance ◽

Stiefel Manifold ◽

Scale Free ◽

Selection Strategies ◽

The Matrix ◽

Manifold Optimization ◽

Network Topologies ◽

Selection For ◽

Optimizing Control

Abstract The energy needed in controlling a complex network is a problem of practical importance. Recent works have focused on the reduction of control energy either via strategic placement of driver nodes, or by decreasing the cardinality of nodes to be controlled. However, optimizing control energy with respect to target node selection has not yet been considered. In this work, we propose an iterative method based on Stiefel manifold optimization of the selectable target node matrix to reduce control energy. We derive the matrix derivative gradient needed for the search algorithm in a general way, and search for target nodes which result in reduced control energy, assuming that driver node placement is fixed. Our findings reveal that the control energy is optimal when the path distances from driver nodes to target nodes are minimized. We corroborate our algorithm with extensive simulations on elementary network topologies, random and scale-free networks, as well as various real networks. The simulation results show that the control energy found using our algorithm outperforms heuristic selection strategies for choosing target nodes by a few orders of magnitude. Our work may be applicable to opinion networks, where one is interested in identifying the optimal group of individuals that the driver nodes can influence.


sepal: identifying transcript profiles with spatial patterns by diffusion-based modeling

Bioinformatics ◽

10.1093/bioinformatics/btab164 ◽

2021 ◽

Author(s):

Alma Andersson ◽

Joakim Lundeberg

Keyword(s):

Spatial Patterns ◽

Expression Profiles ◽

Synthetic Data ◽

Real Data ◽

Cell Types ◽

Statistical Hypothesis ◽

Supplementary Information ◽

Statistical Hypothesis Testing ◽

Transcriptomics Data ◽

Transcript Profiles

Abstract Motivation Collection of spatial signals in large numbers has become a routine task in multiple omics-fields, but parsing of these rich datasets still poses certain challenges. In whole or near-full transcriptome spatial techniques, spurious expression profiles are intermixed with those exhibiting an organized structure. To distinguish profiles with spatial patterns from the background noise, a metric that enables quantification of spatial structure is desirable. Current methods designed for similar purposes tend to be built around a framework of statistical hypothesis testing, hence we were compelled to explore a fundamentally different strategy. Results We propose an unexplored approach to analyze spatial transcriptomics data, simulating diffusion of individual transcripts to extract genes with spatial patterns. The method performed as expected when presented with synthetic data. When applied to real data, it identified genes with distinct spatial profiles, involved in key biological processes or characteristic for certain cell types. Compared to existing methods, ours seemed to be less informed by the genes’ expression levels and showed better time performance when run with multiple cores. Availability and implementation Open-source Python package with a command line interface (CLI), freely available at https://github.com/almaan/sepal under an MIT licence. A mirror of the GitHub repository can be found at Zenodo, doi: 10.5281/zenodo.4573237. Supplementary information Supplementary data are available at Bioinformatics online.
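The idea of scoring spatial structure by simulated diffusion can be illustrated loosely as follows. This sketch does not use the sepal package or its API; it assumes a regular square grid, a four-neighbour discrete Laplacian with zero-flux borders, and a hypothetical score defined as the number of explicit time steps until the field stops changing. The function name, step size, and tolerance are all illustrative choices.

```python
import numpy as np

def diffusion_time(grid, dt=0.1, tol=1e-3, max_steps=10000):
    """Number of explicit diffusion steps until the largest update falls below tol."""
    u = np.asarray(grid, dtype=float).copy()
    for step in range(1, max_steps + 1):
        # Replicating edge values gives a zero-flux (reflecting) boundary.
        p = np.pad(u, 1, mode="edge")
        lap = (p[:-2, 1:-1] + p[2:, 1:-1] +
               p[1:-1, :-2] + p[1:-1, 2:] - 4.0 * u)
        u = u + dt * lap
        if np.abs(lap).max() < tol:
            return step
    return max_steps

n = 8
# Large-scale structure: left half expressed, right half not.
structured = np.zeros((n, n))
structured[:, : n // 2] = 1.0
# Fine-grained alternation with no large-scale structure.
checker = (np.indices((n, n)).sum(axis=0) % 2).astype(float)

t_structured = diffusion_time(structured)
t_checker = diffusion_time(checker)
```

A profile with large-scale spatial organisation takes longer to flatten than one that alternates from spot to spot, so under these assumptions a longer diffusion time indicates more spatial structure.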


Clustering of cancer data based on Stiefel manifold for multiple views

BMC Bioinformatics ◽

10.1186/s12859-021-04195-4 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Jing Tian ◽

Jianping Zhao ◽

Chunhou Zheng

Keyword(s):

Optimization Problem ◽

Nearest Neighbor ◽

Search Algorithm ◽

Stiefel Manifold ◽

Omics Data ◽

K Nearest Neighbor ◽

Cancer Data ◽

Clustering Problem ◽

Multiple Datasets ◽

Cluster Class

Abstract Background In recent years, various sequencing techniques have been used to collect biomedical omics datasets. It is usually possible to obtain multiple types of omics data from a single patient sample. Clustering of omics data plays an indispensable role in biological and medical research, and it helps to reveal data structures from multiple collections. Nevertheless, clustering of omics data poses many challenges. The primary challenges in omics data analysis come from the high dimension of the data and the small sample size. Therefore, it is difficult to find a suitable integration method for structural analysis of multiple datasets. Results In this paper, a multi-view clustering method based on the Stiefel manifold (MCSM) is proposed. The MCSM method comprises three core steps. Firstly, we established a binary optimization model for the simultaneous clustering problem. Secondly, we solved the optimization problem by a linear search algorithm based on the Stiefel manifold. Finally, we integrated the clustering results obtained from three omics by using the k-nearest neighbor method. We applied this approach to four cancer datasets from TCGA. The result shows that our method is superior to several state-of-the-art methods, which depend on the hypothesis that the underlying omics cluster class is the same. Conclusion Particularly, our approach has better performance than the compared approaches when the underlying clusters are inconsistent. For patients with different subtypes, both consistent and differential clusters can be identified at the same time.


CNV-BAC: Copy number Variation Detection in Bacterial Circular Genome

Bioinformatics ◽

10.1093/bioinformatics/btaa208 ◽

2020 ◽

Vol 36 (12) ◽

pp. 3890-3891

Author(s):

Linjie Wu ◽

Han Wang ◽

Yuchao Xia ◽

Ruibin Xi

Keyword(s):

Copy Number Variation ◽

Copy Number ◽

Genome Structure ◽

Real Data ◽

Read Depth ◽

Supplementary Information ◽

Circular Genome ◽

Number Variation ◽

Copy Number Variation Detection ◽

Cnv Detection

Abstract Motivation Whole-genome sequencing (WGS) is widely used for copy number variation (CNV) detection. However, for most bacteria, their circular genome structure and high replication rate make reads more enriched near the replication origin. CNV detection based on read depth could be seriously influenced by such replication bias. Results We show that the replication bias is widespread using ∼200 bacterial WGS data. We develop CNV-BAC (CNV-Bacteria) that can properly normalize the replication bias and other known biases in bacterial WGS data and can accurately detect CNVs. Simulation and real data analysis show that CNV-BAC achieves the best performance in CNV detection compared with available algorithms. Availability and implementation CNV-BAC is available at https://github.com/XiDsLab/CNV-BAC. Supplementary information Supplementary data are available at Bioinformatics online.


Detection of differentially methylated CpG sites between tumor samples with uneven tumor purities

Bioinformatics ◽

10.1093/bioinformatics/btz885 ◽

2019 ◽

Vol 36 (7) ◽

pp. 2017-2024

Author(s):

Weiwei Zhang ◽

Ziyi Li ◽

Nana Wei ◽

Hua-Jun Wu ◽

Xiaoqi Zheng

Keyword(s):

Real Data ◽

R Package ◽

Differential Methylation ◽

Least Square ◽

Epigenetic Mechanism ◽

Supplementary Information ◽

Cpg Sites ◽

Tumor Purity ◽

Different Sources ◽

Normal Controls

Abstract Motivation Inference of differentially methylated (DM) CpG sites between two groups of tumor samples with different geno- or pheno-types is a critical step to uncover the epigenetic mechanism of tumorigenesis, and identify biomarkers for cancer subtyping. However, as a major confounding factor, uneven distributions of tumor purity between two groups of tumor samples will lead to biased discovery of DM sites if not properly accounted for. Results We here propose InfiniumDM, a generalized least square model to adjust for the tumor purity effect in differential methylation analysis. Our method is applicable to a variety of experimental designs, including those with or without normal controls and with different sources of normal tissue contamination. We compared our method with conventional methods including minfi, limma and limma corrected by tumor purity using simulated datasets. Our method shows significantly better performance at different levels of differential methylation thresholds, sample sizes, mean purity deviations and so on. We also applied the proposed method to breast cancer samples from the TCGA database to further evaluate its performance. Overall, both simulation and real data analyses demonstrate favorable performance over existing methods serving a similar purpose. Availability and implementation InfiniumDM is a part of R package InfiniumPurify, which is freely available from GitHub (https://github.com/Xiaoqizheng/InfiniumPurify). Supplementary information Supplementary data are available at Bioinformatics online.
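As background for the term "generalized least square", the generic estimator β̂ = (XᵀΣ⁻¹X)⁻¹XᵀΣ⁻¹y can be sketched as below. This is the textbook formula only: the design matrix, the covariance Σ, and the function name `gls_fit` are illustrative assumptions, and the sketch does not reproduce InfiniumDM's purity-adjusted model or the InfiniumPurify R interface.

```python
import numpy as np

def gls_fit(X, y, cov):
    """Generalized least squares estimate for y = X beta + e, with Cov(e) = cov."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    cov = np.asarray(cov, dtype=float)
    # Solve linear systems instead of forming the explicit inverse of cov.
    cov_inv_X = np.linalg.solve(cov, X)
    cov_inv_y = np.linalg.solve(cov, y)
    return np.linalg.solve(X.T @ cov_inv_X, X.T @ cov_inv_y)

# Toy data: intercept 1, slope 2, unequal error variances (all assumed).
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x
cov = np.diag([1.0, 2.0, 3.0, 4.0])

beta = gls_fit(X, y, cov)
```

Observations with larger error variance receive less weight in the fit, which is the property that makes this family of models suitable when noise differs across samples.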


Robust partial reference-free cell composition estimation from tissue expression

Bioinformatics ◽

10.1093/bioinformatics/btaa184 ◽

2020 ◽

Vol 36 (11) ◽

pp. 3431-3438

Author(s):

Ziyi Li ◽

Zhenxing Guo ◽

Ying Cheng ◽

Peng Jin ◽

Hao Wu

Keyword(s):

Expression Profiles ◽

Gene Expression Profiles ◽

Real Data ◽

Estimation Procedure ◽

Free Cell ◽

Biological Information ◽

Supplementary Information ◽

Tissue Samples ◽

Cell Composition ◽

Heterogeneous Tissues

Abstract Motivation In the analysis of high-throughput omics data from tissue samples, estimating and accounting for cell composition have been recognized as important steps. High cost, intensive labor requirements and technical limitations hinder the cell composition quantification using cell-sorting or single-cell technologies. Computational methods for cell composition estimation are available, but they are either limited by the availability of a reference panel or suffer from low accuracy. Results We introduce TOols for the Analysis of heterogeneouS Tissues TOAST/-P and TOAST/+P, two partial reference-free algorithms for estimating cell composition of heterogeneous tissues based on their gene expression profiles. TOAST/-P and TOAST/+P incorporate additional biological information, including cell-type-specific markers and prior knowledge of compositions, in the estimation procedure. Extensive simulation studies and real data analyses demonstrate that the proposed methods provide more accurate and robust cell composition estimation than existing methods. Availability and implementation The proposed methods TOAST/-P and TOAST/+P are implemented as part of the R/Bioconductor package TOAST at https://bioconductor.org/packages/TOAST. Supplementary information Supplementary data are available at Bioinformatics online.


Driver network as a biomarker: systematic integration and network modeling of multi-omics data to derive driver signaling pathways for drug combination prediction

Bioinformatics ◽

10.1093/bioinformatics/btz109 ◽

2019 ◽

Vol 35 (19) ◽

pp. 3709-3717 ◽

Cited By ~ 12

Author(s):

Lei Huang ◽

David Brunell ◽

Clifford Stephan ◽

James Mancuso ◽

Xiaohui Yu ◽

...

Keyword(s):

Signaling Pathways ◽

Cancer Patients ◽

Functional Module ◽

B Cell Lymphoma ◽

Drug Combinations ◽

Supplementary Information ◽

Gene Copy ◽

Omics Data ◽

Multiple Cancer ◽

Synergistic Drug Combinations

Abstract Motivation Drug combinations that simultaneously suppress multiple cancer driver signaling pathways increase therapeutic options and may reduce drug resistance. We have developed a computational systems biology tool, DrugComboExplorer, to identify driver signaling pathways and predict synergistic drug combinations by integrating the knowledge embedded in vast amounts of available pharmacogenomics and omics data. Results This tool generates driver signaling networks by processing DNA sequencing, gene copy number, DNA methylation and RNA-seq data from individual cancer patients using an integrated pipeline of algorithms, including bootstrap aggregating-based Markov random field, weighted co-expression network analysis and supervised regulatory network learning. It uses a systems pharmacology approach to infer the combinatorial drug efficacies and synergy mechanisms through drug functional module-induced regulation of target expression analysis. Application of our tool on diffuse large B-cell lymphoma and prostate cancer demonstrated how synergistic drug combinations can be discovered to inhibit multiple driver signaling pathways. Compared with existing computational approaches, DrugComboExplorer had higher prediction accuracy based on in vitro experimental validation and probability concordance index. These results demonstrate that our network-based drug efficacy screening approach can reliably prioritize synergistic drug combinations for cancer and uncover potential mechanisms of drug synergy, warranting further studies in individual cancer patients to derive personalized treatment plans. Availability and implementation DrugComboExplorer is available at https://github.com/Roosevelt-PKU/drugcombinationprediction. Supplementary information Supplementary data are available at Bioinformatics online.


Detecting Differential Item Functioning Using Multiple-Group Cognitive Diagnosis Models

Applied Psychological Measurement ◽

10.1177/0146621620965745 ◽

2020 ◽

Vol 45 (1) ◽

pp. 37-53

Author(s):

Wenchao Ma ◽

Ragip Terzi ◽

Jimmy de la Torre

Keyword(s):

Differential Item Functioning ◽

Type I Error ◽

Search Algorithm ◽

Real Data ◽

Error Rates ◽

Cognitive Diagnosis ◽

Type I ◽

Wald Tests ◽

Multiple Group ◽

Item Functioning

This study proposes a multiple-group cognitive diagnosis model to account for the fact that students in different groups may use distinct attributes or use the same attributes but in different manners (e.g., conjunctive, disjunctive, and compensatory) to solve problems. Based on the proposed model, this study systematically investigates the performance of the likelihood ratio (LR) test and Wald test in detecting differential item functioning (DIF). A forward anchor item search procedure was also proposed to identify a set of anchor items with invariant item parameters across groups. Results showed that the LR and Wald tests with the forward anchor item search algorithm produced better calibrated Type I error rates than the ordinary LR and Wald tests, especially when items were of low quality. A set of real data were also analyzed to illustrate the use of these DIF detection procedures.
