It is time to apply biclustering: a comprehensive review of biclustering applications in biological and biomedical data

2018 ◽  
Vol 20 (4) ◽  
pp. 1450-1465 ◽  
Author(s):  
Juan Xie ◽  
Anjun Ma ◽  
Anne Fennell ◽  
Qin Ma ◽  
Jing Zhao

Abstract Biclustering is a powerful data mining technique that clusters the rows and columns of a matrix-format data set simultaneously. It was first applied to gene expression data in 2000, aiming to identify co-expressed genes under a subset of all conditions/samples. During the past 17 years, dozens of biclustering algorithms and tools have been developed to enhance our ability to make sense of the large data sets generated in the wake of high-throughput omics technologies. These algorithms and tools have been applied to a wide variety of data types, including, but not limited to, genomes, transcriptomes, exomes, epigenomes, phenomes and pharmacogenomes. However, there is still a considerable gap between biclustering methodology development and comprehensive data interpretation, mainly because of a lack of guidance on selecting appropriate biclustering tools and supporting computational techniques for specific studies. Here, we first give a brief introduction to the biclustering algorithms and tools available in the public domain, and then systematically summarize the basic applications of biclustering to biological data and the more advanced applications of biclustering to biomedical data. This review will help researchers analyze their big data effectively and generate valuable biological knowledge and novel insights with higher efficiency.
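As a concrete illustration of the technique the review surveys, the sketch below runs a biclustering algorithm on a simulated gene-by-condition matrix. It uses scikit-learn's SpectralCoclustering purely as an example; the review covers many other algorithms, and the matrix here is synthetic.

```python
# Minimal biclustering sketch (not a method from the review): scikit-learn's
# SpectralCoclustering applied to a simulated gene-by-condition matrix.
import numpy as np
from sklearn.datasets import make_biclusters
from sklearn.cluster import SpectralCoclustering

# Simulate a 300-gene x 40-condition matrix containing 4 planted biclusters.
data, rows, cols = make_biclusters(shape=(300, 40), n_clusters=4,
                                   noise=5, random_state=0)

model = SpectralCoclustering(n_clusters=4, random_state=0)
model.fit(data)

# Genes and conditions assigned to the first recovered bicluster.
bicluster_genes = np.where(model.rows_[0])[0]
bicluster_conditions = np.where(model.columns_[0])[0]
print(len(bicluster_genes), "genes x", len(bicluster_conditions), "conditions")
```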

Author(s):  
José Caldas ◽  
Samuel Kaski

Biclustering is the unsupervised learning task of mining a data matrix for useful submatrices, for instance groups of genes that are co-expressed under particular biological conditions. As these submatrices are expected to partly overlap, a significant challenge in biclustering is to develop methods that can detect overlapping biclusters. The authors propose a probabilistic mixture modelling framework for biclustering biological data that lends itself to various data types and allows biclusters to overlap. Their framework is akin to the latent feature and mixture-of-experts model families, with inference and parameter estimation performed via a variational expectation-maximization algorithm. The model compares favorably with competing approaches on both a binary DNA copy number variation data set and a miRNA expression data set, indicating that it may serve as a general problem-solving tool for biclustering.
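The authors' variational EM model is not reproduced here; the hedged sketch below only illustrates the underlying idea that soft (posterior) membership probabilities from a mixture model can yield overlapping gene groups when thresholded, using scikit-learn's GaussianMixture on simulated data.

```python
# Simplified stand-in (not the authors' model): a Gaussian mixture over genes
# whose posterior probabilities are thresholded, so a gene may belong to more
# than one group (overlap).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Simulated expression profiles: 200 genes x 10 conditions, two shifted groups.
genes = np.vstack([rng.normal(0, 1, (100, 10)), rng.normal(1.5, 1, (100, 10))])

gmm = GaussianMixture(n_components=2, covariance_type="diag", random_state=0)
gmm.fit(genes)
posteriors = gmm.predict_proba(genes)          # shape (200, 2)

# Soft membership: a gene joins every component whose posterior exceeds 0.3,
# so genes near the boundary can appear in both groups.
membership = posteriors > 0.3
overlapping = np.where(membership.sum(axis=1) > 1)[0]
print(len(overlapping), "genes assigned to more than one group")
```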


2013 ◽  
Vol 12 (3-4) ◽  
pp. 291-307 ◽  
Author(s):  
Ilir Jusufi ◽  
Andreas Kerren ◽  
Falk Schreiber

Ontologies and hierarchical clustering are both important tools in biology and medicine for studying high-throughput data such as transcriptomics and metabolomics data. Ontology term enrichment analysis identifies statistically overrepresented terms in the data, giving insight into relevant biological processes or functional modules. Hierarchical clustering is a standard method for analyzing and visualizing data to find relatively homogeneous clusters of experimental data points. Both methods support the analysis of the same data set but are usually applied independently. Often, however, a combined view is desired: visualizing a large data set in the context of an ontology while taking a clustering of the data into account. This article proposes new visualization methods for this task. They allow interactive selection and navigation to explore the data under consideration, as well as visual analysis of mappings between ontology-based and cluster-based space-filling representations. In this context, we discuss our approach together with specific properties of the biological input data and identify features that make our approach easily usable for domain experts.
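The article's contribution is the visualization itself, which is not captured here; the sketch below only illustrates the two underlying analyses it combines, hierarchical clustering (SciPy) and a hypergeometric test for ontology term overrepresentation within a cluster, on simulated data with a made-up annotation.

```python
# Minimal sketch of the two analyses the article combines (not its
# visualization method): hierarchical clustering of expression profiles and a
# hypergeometric test for overrepresentation of an ontology term in a cluster.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import hypergeom

rng = np.random.default_rng(1)
expression = rng.normal(size=(500, 12))          # 500 genes x 12 samples
clusters = fcluster(linkage(expression, method="average"),
                    t=6, criterion="maxclust")

# Hypothetical annotation: 60 of the 500 genes carry some ontology term.
term_genes = set(rng.choice(500, size=60, replace=False))
cluster_members = np.where(clusters == 1)[0]
hits = len(term_genes.intersection(cluster_members))

# P(X >= hits) when drawing len(cluster_members) genes out of 500,
# 60 of which carry the term.
p_value = hypergeom.sf(hits - 1, 500, 60, len(cluster_members))
print(f"cluster 1: {hits} annotated genes, enrichment p = {p_value:.3g}")
```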


2006 ◽  
Vol 4 (1) ◽  
pp. 68-80 ◽  
Author(s):  
Pavlina Simeonova ◽  
Costel Sarbu ◽  
Thomas Spanos ◽  
Vasil Simeonov ◽  
Stefan Tsakovski

Abstract The present paper deals with the application of classical and fuzzy principal components analysis to a large data set from coastal sediment analysis. Altogether, 126 sampling sites from the Atlantic Coast of the USA are considered, and at each site 16 chemical parameters are measured. It is found that four latent factors are responsible for the data structure (“natural”, “anthropogenic”, “bioorganic”, and “organic anthropogenic”). Additionally, examining the scatter plots of the factor scores reveals the similarity between sampling sites. Geographical and urban factors are found to contribute to the sediment chemical composition. It is shown that the use of fuzzy PCA enables better data interpretation, especially in the presence of outliers.
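A minimal sketch of classical PCA on a simulated matrix of the same shape as the study's data (126 sites x 16 parameters) is shown below; the fuzzy PCA variant used by the authors is not implemented here, and the simulated values are not the study's measurements.

```python
# Classical PCA sketch on simulated sediment-style data (126 sites x 16
# chemical parameters); the fuzzy PCA variant is not shown.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
measurements = rng.lognormal(mean=1.0, sigma=0.5, size=(126, 16))

# Standardize so each chemical parameter contributes comparably.
scaled = StandardScaler().fit_transform(measurements)

pca = PCA(n_components=4)                 # four latent factors, as in the paper
scores = pca.fit_transform(scaled)        # site coordinates in factor space
loadings = pca.components_                # parameter contributions per factor

print("explained variance ratios:", np.round(pca.explained_variance_ratio_, 3))
```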


2020 ◽  
Vol 48 (W1) ◽  
pp. W385-W394
Author(s):  
Federico Taverna ◽  
Jermaine Goveia ◽  
Tobias K Karakach ◽  
Shawez Khan ◽  
Katerina Rohlenova ◽  
...  

Abstract The amount of biological data generated with (single-cell) omics technologies is rapidly increasing, exacerbating bottlenecks in the analysis and interpretation of omics experiments. Data mining platforms that enable experimental scientists without bioinformatics training to analyze a wide range of experimental designs and data types can alleviate such bottlenecks and aid the exploration of newly generated or publicly available omics datasets. Here, we present BIOMEX, browser-based software designed to facilitate the Biological Interpretation Of Multi-omics EXperiments by bench scientists. BIOMEX integrates state-of-the-art statistical tools and field-tested algorithms into a flexible but well-defined workflow that accommodates metabolomics, transcriptomics, proteomics, mass cytometry and single-cell data from different platforms and organisms. The BIOMEX workflow is accompanied by a manual and video tutorials that provide the background needed to navigate the interface and become acquainted with the employed methods. BIOMEX guides the user through omics-tailored analyses such as data pretreatment and normalization, dimensionality reduction, differential and enrichment analysis, pathway mapping, clustering, marker analysis, trajectory inference and meta-analysis, among others. BIOMEX is fully interactive, allowing users to easily change parameters and generate customized plots exportable as high-quality, publication-ready figures. BIOMEX is open source and freely available at https://www.vibcancer.be/software-tools/biomex.
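BIOMEX itself is a browser-based application, so the sketch below is not its API; it only illustrates, with generic Python libraries and simulated counts, two of the workflow steps the abstract lists (normalization and differential analysis).

```python
# Hedged sketch of two generic omics workflow steps (normalization and
# differential analysis), not BIOMEX code, on simulated count data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Simulated counts: 1000 genes x 6 samples (3 control, 3 treated).
counts = rng.poisson(lam=20, size=(1000, 6)).astype(float)
counts[:50, 3:] *= 3                       # spike in 50 "differential" genes

# Library-size normalization followed by a log transform (common pretreatment).
normalized = counts / counts.sum(axis=0) * counts.sum(axis=0).mean()
logged = np.log2(normalized + 1)

# Per-gene Welch's t-test between the two groups, with Bonferroni correction
# as a simple multiple-testing adjustment.
t_stat, p = stats.ttest_ind(logged[:, :3], logged[:, 3:], axis=1, equal_var=False)
significant = np.where(p * len(p) < 0.05)[0]
print(len(significant), "genes called differential")
```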


2017 ◽  
Vol 3 (4) ◽  
pp. 265
Author(s):  
Widyartini Made Sudania ◽  
Zulfikar Achmad Tanjung ◽  
Nurita Toruan-Mathius ◽  
Tony Liwang

The application of DNA sequencing technologies has had a major impact on molecular biology, especially in understanding gene interactions under a given condition. Because of the large number of genes produced by this high-throughput technology, a proper analysis tool is needed for data interpretation. ClueGO is a bioinformatics tool, an easy-to-use Cytoscape plug-in, that strongly improves the biological interpretation of gene functions. It analyzes a single cluster, or compares two clusters, and comprehensively visualizes their functional groups. This tool was applied to identify biological networks of genes involved in embryogenesis of oil palm, the most critical phase in the oil palm tissue culture process. Two EST sequencing data sets from the GenBank database, under accession numbers EY396120-EY413718 and DW247764-DW248770, were used in this study. Fifty-two and one hundred eight groups of genes were identified from the EY396120-EY413718 and DW247764-DW248770 data sets, respectively, using the biological process category of Gene Ontology. Thirty-one groups of genes occurred consistently in both EST sets. According to the literature, these genes play important roles in cell formation and development, stress and stimulus responses, photosynthesis and metabolic processes, indicating the involvement of these groups of genes in oil palm embryogenesis. ClueGO is an appropriate tool for analyzing a large set of genes under a specific condition, such as embryogenesis of oil palm.

Keywords: callus embryogenesis; Cytoscape plug-in; DNA sequencing; expressed sequence tag; KEGG pathways


2014 ◽  
Author(s):  
R Daniel Kortschak ◽  
David L Adelson

bíogo is a framework designed to ease the development and maintenance of computationally intensive bioinformatics applications. The library is written in the Go programming language, a garbage-collected, strictly typed, compiled language with built-in support for concurrent processing and performance comparable to C and Java. It provides a variety of data types and utility functions to facilitate the manipulation and analysis of large-scale genomic and other biological data. bíogo uses a concise and expressive syntax, lowering the barrier to entry for researchers who need to process large data sets with custom analyses while retaining computational safety and ease of code review. We believe bíogo provides an excellent environment for training and research in computational biology because of its combination of strict typing, simple and expressive syntax, and high performance.


2019 ◽  
Vol 29 (1) ◽  
pp. 169-178
Author(s):  
Anna Papiez ◽  
Christophe Badie ◽  
Joanna Polanska

Abstract The focus of this research is to combine statistical and machine learning tools and apply them to a high-throughput biological data set on ionizing radiation response. The analyzed data consist of two gene expression sets obtained in studies of radiosensitive and radioresistant breast cancer patients undergoing radiotherapy. The data sets were similar in principle; however, the treatment dose differed. It is shown that introducing mathematical adjustments in data preprocessing, differentiation and trend testing, and classification, coupled with current biological knowledge, allows efficient data analysis and yields accurate results. The tools used to customize the analysis workflow were batch-effect filtration with empirical Bayes models, identification of gene trends with the Jonckheere–Terpstra test, and linear interpolation adjustment according to specific gene profiles for multiple random validation. The application of these non-standard techniques enabled successful sample classification at a rate of 93.5% and the identification of potential biomarkers of radiation response in breast cancer, which were confirmed with an independent Monte Carlo feature selection approach and by literature references. This study shows that customized analysis workflows are a necessary step towards novel discoveries in complex fields such as personalized individual therapy.
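The authors' pipeline (empirical Bayes batch correction, the Jonckheere–Terpstra test, profile-specific interpolation) is not reproduced here; the sketch below is a deliberately simplified stand-in showing the overall shape of such a workflow: a crude per-batch adjustment followed by cross-validated classification on simulated data.

```python
# Simplified stand-in for a batch-aware classification workflow (not the
# authors' pipeline): naive per-batch mean-centering followed by
# cross-validated classification of hypothetical sample labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 200))               # 60 samples x 200 genes
y = rng.integers(0, 2, size=60)              # hypothetical class labels
batch = np.repeat([0, 1], 30)                # two data sets measured separately

# Crude batch adjustment: center each gene within its batch.
for b in np.unique(batch):
    X[batch == b] -= X[batch == b].mean(axis=0)

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5)
print("cross-validated accuracy:", scores.mean().round(3))
```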


Author(s):  
Nikitas Papangelopoulos ◽  
Dimitrios Vlachakis ◽  
Arianna Filntisi ◽  
Paraskevas Fakourelis ◽  
Louis Papageorgiou ◽  
...  

The exponential growth of available biological data in recent years, coupled with their increasing complexity, has made their analysis a computationally challenging process. Traditional central processing units (CPUs) are reaching their limit in processing power and are not designed primarily for multithreaded applications. Graphics processing units (GPUs), on the other hand, are affordable, scalable computing powerhouses that, thanks to the ever-increasing demand for higher-quality graphics, have yet to reach their limit. Typical high-end CPUs have 8-16 cores, whereas GPUs can have more than 2,500 cores. GPUs are also, by design, highly parallel, multicore and multithreaded, capable of handling thousands of threads that perform the same calculation on different subsets of a large data set. This ability is what makes them perfectly suited for biological analysis tasks. Lately, this potential has been recognized by many bioinformatics researchers, and a wide variety of tools and algorithms have been ported to GPUs or designed from the ground up to maximize the usage of available cores. Here, we present a comprehensive review of available bioinformatics tools, ranging from sequence and image analysis to protein structure prediction and systems biology, that use the NVIDIA Compute Unified Device Architecture (CUDA) programming model for general-purpose computing on graphics processing units (GPGPU).
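The data-parallel pattern described above can be illustrated with CuPy, a NumPy-compatible GPU array library; this is an illustration only, not one of the reviewed tools, and it requires an NVIDIA GPU with CUDA installed.

```python
# Illustration of GPU data parallelism with CuPy: the same elementwise
# transformation runs across thousands of GPU threads, one slice of the
# data per thread.
import numpy as np
import cupy as cp

# A large matrix of simulated per-base error probabilities, moved to GPU memory.
probs_cpu = np.random.rand(10_000, 1_000).astype(np.float32)
probs_gpu = cp.asarray(probs_cpu)

# Convert probabilities to Phred-like quality scores on the GPU.
phred_gpu = -10 * cp.log10(cp.clip(probs_gpu, 1e-6, None))
row_means = phred_gpu.mean(axis=1)

# Bring the reduced result back to the host for downstream analysis.
print(cp.asnumpy(row_means)[:5])
```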


2010 ◽  
Vol 7 (3) ◽  
Author(s):  
Wim De Mulder ◽  
Martin Kuiper ◽  
René Boel

Summary Clustering is an important approach in the analysis of biological data, and often a first step to identify interesting patterns of coexpression in gene expression data. Because of the high complexity and diversity of gene expression data, many genes cannot be easily assigned to a cluster, but even if the dissimilarity of these genes to all other gene groups is large, they are ultimately forced to become members of a cluster. In this paper we show how to detect such elements, called unstable elements. We have developed an approach for iterative clustering algorithms in which unstable elements are deleted, making the iterative algorithm less dependent on the initial centers. Although the approach is unsupervised, the clusters into which the reduced data set is subdivided are less likely to contain false positives. This clustering yields a more differentiated approach for biological data, since the cluster analysis is divided into two parts: the pruned data set is divided into highly consistent clusters in an unsupervised way, and the removed, unstable elements, for which no meaningful cluster exists in unsupervised terms, can be assigned a cluster using biological knowledge and information about the likelihood of cluster membership. We illustrate our framework on both an artificial and a real biological data set.
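The authors' exact criterion for instability is not reproduced here; the sketch below illustrates the general idea with k-means: points that lie far from every centroid are flagged as unstable, set aside, and the remaining data are re-clustered. The distance threshold is arbitrary and the data are simulated.

```python
# Hedged sketch of the general idea (not the authors' exact method): run
# k-means, flag points whose distance to every centroid is unusually large as
# "unstable", and re-cluster the pruned data set.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
profiles = np.vstack([rng.normal(0, 1, (150, 8)),
                      rng.normal(4, 1, (150, 8)),
                      rng.uniform(-8, 12, (20, 8))])   # 20 poorly fitting genes

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(profiles)
dist_to_nearest = np.min(km.transform(profiles), axis=1)

# Call a gene unstable if it is farther from its nearest centroid than the
# 95th percentile of all such distances (an arbitrary illustrative threshold).
unstable = dist_to_nearest > np.percentile(dist_to_nearest, 95)
pruned = profiles[~unstable]
km_pruned = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pruned)
print(unstable.sum(), "genes flagged as unstable and set aside")
```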


F1000Research ◽  
2020 ◽  
Vol 9 ◽  
pp. 136
Author(s):  
Rutger A. Vos ◽  
Toshiaki Katayama ◽  
Hiroyuki Mishima ◽  
Shin Kawano ◽  
Shuichi Kawashima ◽  
...  

We report on the activities of the 2015 edition of the BioHackathon, an annual event that brings together researchers and developers from around the world to develop tools and technologies that promote the reusability of biological data. We discuss issues surrounding the representation, publication, integration, mining and reuse of biological data and metadata across a wide range of biomedical data types relevant to the life sciences, including chemistry, genotypes and phenotypes, orthology and phylogeny, proteomics, genomics, glycomics, and metabolomics. We describe our progress in addressing ongoing challenges to the reusability and reproducibility of research results, and identify outstanding issues that continue to impede the progress of bioinformatics research. We share our perspective on the state of the art, continued challenges, and goals for future research and development for the life sciences Semantic Web.

