Web Tools for Molecular Biological Data Analysis

Bioinformatics means solving problems arising from biology using methods from computer science. The National Center for Biotechnology Information (www.ncbi.nih.gov) defines bioinformatics as: “…the field of science in which biology, computer science, and information technology merge into a single discipline...There are three important sub-disciplines within bioinformatics: the development of new algorithms and statistics with which to access relationships among members of large data sets; the analysis and interpretation of various types of data including nucleotide and amino acid sequences, protein domains, and protein structures; and the development and implementation of tools that enable efficient access and management of different types of information.”

Download Full-text

BioNetLink - An Architecture for Working with Network Data

Journal of Integrative Bioinformatics ◽

10.1515/jib-2014-241 ◽

2014 ◽

Vol 11 (2) ◽

pp. 68-79

Author(s):

Matthias Klapperstück ◽

Falk Schreiber

Keyword(s):

Experimental Data ◽

Biological Networks ◽

Regulatory Networks ◽

Large Data ◽

Biological Data ◽

Network Data ◽

Data Sets ◽

Dynamic Views ◽

Gene Regulatory ◽

Multiple Network

Summary The visualization of biological data gained increasing importance in the last years. There is a large number of methods and software tools available that visualize biological data including the combination of measured experimental data and biological networks. With growing size of networks their handling and exploration becomes a challenging task for the user. In addition, scientists also have an interest in not just investigating a single kind of network, but on the combination of different types of networks, such as metabolic, gene regulatory and protein interaction networks. Therefore, fast access, abstract and dynamic views, and intuitive exploratory methods should be provided to search and extract information from the networks. This paper will introduce a conceptual framework for handling and combining multiple network sources that enables abstract viewing and exploration of large data sets including additional experimental data. It will introduce a three-tier structure that links network data to multiple network views, discuss a proof of concept implementation, and shows a specific visualization method for combining metabolic and gene regulatory networks in an example.

Download Full-text

clusTCR: a Python interface for rapid clustering of large sets of CDR3 sequences

10.1101/2021.02.22.432291 ◽

2021 ◽

Author(s):

Sebastiaan Valkiers ◽

Max Van Houcke ◽

Kris Laukens ◽

Pieter Meysman

Keyword(s):

T Cell ◽

Large Data ◽

Cell Receptor ◽

Amino Acid Sequences ◽

Large Data Sets ◽

Data Sets ◽

Clustering Methods ◽

Link Type ◽

Large Sets ◽

Similar Accuracy

The T-cell receptor (TCR) determines the specificity of a T-cell towards an epitope. As of yet, the rules for antigen recognition remain largely undetermined. Current methods for grouping TCRs according to their epitope specificity remain limited in performance and scalability. Multiple methodologies have been developed, but all of them fail to efficiently cluster large data sets exceeding 1 million sequences. To account for this limitation, we developed clusTCR, a rapid TCR clustering alternative that efficiently scales up to millions of CDR3 amino acid sequences. Benchmarking comparisons revealed similar accuracy of clusTCR with other TCR clustering methods. clusTCR offers a drastic improvement in clustering speed, which allows clustering of millions of TCR sequences in just a few minutes through efficient similarity searching and sequence hashing.clusTCR was written in Python 3. It is available as an anaconda package (https://anaconda.org/svalkiers/clustcr) and on github (https://github.com/svalkiers/clusTCR).

Download Full-text

Uncertainty-Based Clustering Algorithms for Large Data Sets

Modern Technologies for Big Data Classification and Clustering - Advances in Data Mining and Database Management ◽

10.4018/978-1-5225-2805-0.ch001 ◽

2018 ◽

pp. 1-33 ◽

Cited By ~ 1

Author(s):

B. K. Tripathy ◽

Hari Seetha ◽

M. N. Murty

Keyword(s):

Big Data ◽

Data Clustering ◽

Clustering Algorithms ◽

Large Data ◽

Large Data Sets ◽

Mining Machine ◽

Data Sets ◽

Fuzzy C Means ◽

Intuitionistic Fuzzy ◽

New Algorithms

Data clustering plays a very important role in Data mining, machine learning and Image processing areas. As modern day databases have inherent uncertainties, many uncertainty-based data clustering algorithms have been developed in this direction. These algorithms are fuzzy c-means, rough c-means, intuitionistic fuzzy c-means and the means like rough fuzzy c-means, rough intuitionistic fuzzy c-means which base on hybrid models. Also, we find many variants of these algorithms which improve them in different directions like their Kernelised versions, possibilistic versions, and possibilistic Kernelised versions. However, all the above algorithms are not effective on big data for various reasons. So, researchers have been trying for the past few years to improve these algorithms in order they can be applied to cluster big data. The algorithms are relatively few in comparison to those for datasets of reasonable size. It is our aim in this chapter to present the uncertainty based clustering algorithms developed so far and proposes a few new algorithms which can be developed further.

Download Full-text

Toward Psychoinformatics: Computer Science Meets Psychology

Computational and Mathematical Methods in Medicine ◽

10.1155/2016/2983685 ◽

2016 ◽

Vol 2016 ◽

pp. 1-10 ◽

Cited By ~ 40

Author(s):

Christian Montag ◽

Éilish Duke ◽

Alexander Markowetz

Keyword(s):

Computer Science ◽

Online Social Network ◽

Large Data ◽

Research Field ◽

Large Data Sets ◽

Future Research ◽

Data Sets ◽

Psychological Traits ◽

Scientific Methods ◽

Insight Into

The present paper provides insight into an emerging research discipline calledPsychoinformatics. In the context ofPsychoinformatics, we emphasize the cooperation between the disciplines of psychology and computer science in handling large data sets derived from heavily used devices, such as smartphones or online social network sites, in order to shed light on a large number of psychological traits, including personality and mood. New challenges await psychologists in light of the resulting “Big Data” sets, because classic psychological methods will only in part be able to analyze this data derived from ubiquitous mobile devices, as well as other everyday technologies. As a consequence, psychologists must enrich their scientific methods through the inclusion of methods from informatics. The paper provides a brief review of one area of this research field, dealing mainly with social networks and smartphones. Moreover, we highlight how data derived fromPsychoinformaticscan be combined in a meaningful way with data from human neuroscience. We close the paper with some observations of areas for future research and problems that require consideration within this new discipline.

Download Full-text

Understanding large and complex biological data sets using visualization

The Biochemist ◽

10.1042/bio_2021_165 ◽

2021 ◽

Author(s):

Stephen Taylor

Keyword(s):

Large Data ◽

Physical Structure ◽

Biological Data ◽

Large Data Sets ◽

Data Driven ◽

Data Sets ◽

Epigenetic Changes ◽

Mass Cytometry ◽

High Level ◽

Insight Into

Molecular biology experiments are generating an unprecedented amount of information from a variety of different experimental modalities. DNA sequencing machines, proteomics mass cytometry and microscopes generate huge amounts of data every day. Not only is the data large, but it is also multidimensional. Understanding trends and getting actionable insights from these data requires techniques that allow comprehension at a high level but also insight into what underlies these trends. Lots of small errors or poor summarization can lead to false results and reproducibility issues in large data sets. Hence it is essential we do not cherry-pick results to suit a hypothesis but instead examine all data and publish accurate insights in a data-driven way. This article will give an overview of some of the problems faced by the researcher in understanding epigenetic changes (which are related to changes in the physical structure of DNA) when presented with raw analysis results using visualization methods. We will also discuss the new challenges faced by using machine learning which can be helped by visualization.

Download Full-text

bíogo: a simple high-performance bioinformatics toolkit for the Go language

10.1101/005033 ◽

2014 ◽

Cited By ~ 6

Author(s):

R Daniel Kortschak ◽

David L Adelson

Keyword(s):

High Performance ◽

Large Scale ◽

Large Data ◽

Biological Data ◽

Data Sets ◽

Barriers To Entry ◽

Data Types ◽

Concurrent Processing ◽

Computationally Intensive ◽

And Performance

bíogo is a framework designed to ease development and maintenance of computationally intensive bioinformatics applications. The library is written in the Go programming language, a garbage-collected, strictly typed compiled language with built in support for concurrent processing, and performance comparable to C and Java. It provides a variety of data types and utility functions to facilitate manipulation and analysis of large scale genomic and other biological data. bíogo uses a concise and expressive syntax, lowering the barriers to entry for researchers needing to process large data sets with custom analyses while retaining computational safety and ease of code review. We believe bíogo provides an excellent environment for training and research in computational biology because of its combination of strict typing, simple and expressive syntax, and high performance.

Download Full-text

Competency Mining in Large Data Sets - Preparing Large Scale Investigations in Computer Science Education

Proceedings of the International Conference on Knowledge Discovery and Information Retrieval ◽

10.5220/0005129203150322 ◽

2014 ◽

Cited By ~ 2

Author(s):

Peter Hubwieser ◽

Andreas Mühling

Keyword(s):

Science Education ◽

Computer Science ◽

Large Scale ◽

Computer Science Education ◽

Large Data ◽

Large Data Sets ◽

Data Sets

Download Full-text

Bitmap Indices for Data Warehouses

Data Warehousing and Mining ◽

10.4018/978-1-59904-951-9.ch091 ◽

2008 ◽

pp. 1590-1605

Author(s):

Kurt Stockinger ◽

Kesheng Wu

Keyword(s):

Time Complexity ◽

Large Data ◽

Database Systems ◽

Large Data Sets ◽

Future Research ◽

Data Sets ◽

Bitmap Index ◽

Access Method ◽

Efficient Access ◽

Efficient Query Processing

In this chapter we discuss various bitmap index technologies for efficient query processing in data warehousing applications. We review the existing literature and organize the technology into three categories, namely bitmap encoding, compression and binning. We introduce an efficient bitmap compression algorithm and examine the space and time complexity of the compressed bitmap index on large data sets from real applications. According to the conventional wisdom, bitmap indices are only efficient for low-cardinality attributes. However, we show that the compressed bitmap indices are also efficient for high-cardinality attributes. Timing results demonstrate that the bitmap indices significantly outperform the projection index, which is often considered to be the most efficient access method for multi-dimensional queries. Finally, we review the bitmap index technology currently supported by commonly used commercial database systems and discuss open issues for future research and development.

Download Full-text

Combined approaches from physics, statistics, and computer science for ab initio protein structure prediction: ex unitate vires (unity is strength)?

F1000Research ◽

10.12688/f1000research.14870.1 ◽

2018 ◽

Vol 7 ◽

pp. 1125 ◽

Cited By ~ 5

Author(s):

Marc Delarue ◽

Patrice Koehl

Keyword(s):

Protein Structure ◽

Computer Science ◽

Computer Architecture ◽

Protein Structure Prediction ◽

Structure Prediction ◽

Statistical Physics ◽

High Throughput Sequencing ◽

Large Data ◽

Data Sets ◽

Protein Instability

Connecting the dots among the amino acid sequence of a protein, its structure, and its function remains a central theme in molecular biology, as it would have many applications in the treatment of illnesses related to misfolding or protein instability. As a result of high-throughput sequencing methods, biologists currently live in a protein sequence-rich world. However, our knowledge of protein structure based on experimental data remains comparatively limited. As a consequence, protein structure prediction has established itself as a very active field of research to fill in this gap. This field, once thought to be reserved for theoretical biophysicists, is constantly reinventing itself, borrowing ideas informed by an ever-increasing assembly of scientific domains, from biology, chemistry, (statistical) physics, mathematics, computer science, statistics, bioinformatics, and more recently data sciences. We review the recent progress arising from this integration of knowledge, from the development of specific computer architecture to allow for longer timescales in physics-based simulations of protein folding to the recent advances in predicting contacts in proteins based on detection of coevolution using very large data sets of aligned protein sequences.

Download Full-text

Interactive and coordinated visualization approaches for biological data analysis

Briefings in Bioinformatics ◽

10.1093/bib/bby019 ◽

2018 ◽

Vol 20 (4) ◽

pp. 1513-1523 ◽

Cited By ~ 4

Author(s):

António Cruz ◽

Joel P Arrais ◽

Penousal Machado

Keyword(s):

Data Analysis ◽

Biological Data ◽

Data Sets ◽

Protein Protein Interaction ◽

Biological Data Analysis ◽

Time Series Gene Expression ◽

Protein Protein Interaction Networks ◽

Complex Relationships ◽

Meaningful Relationships ◽

Different Sources

AbstractThe field of computational biology has become largely dependent on data visualization tools to analyze the increasing quantities of data gathered through the use of new and growing technologies. Aside from the volume, which often results in large amounts of noise and complex relationships with no clear structure, the visualization of biological data sets is hindered by their heterogeneity, as data are obtained from different sources and contain a wide variety of attributes, including spatial and temporal information. This requires visualization approaches that are able to not only represent various data structures simultaneously but also provide exploratory methods that allow the identification of meaningful relationships that would not be perceptible through data analysis algorithms alone. In this article, we present a survey of visualization approaches applied to the analysis of biological data. We focus on graph-based visualizations and tools that use coordinated multiple views to represent high-dimensional multivariate data, in particular time series gene expression, protein–protein interaction networks and biological pathways. We then discuss how these methods can be used to help solve the current challenges surrounding the visualization of complex biological data sets.

Download Full-text