CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices


10.1101/2021.12.06.471436 ◽

2021 ◽

Author(s):

Shaopeng Liu ◽

David Koslicki

Keyword(s):

Data Structure ◽

Computational Biology ◽

Search Tree ◽

Reference Database ◽

Data Sets ◽

Running Time ◽

Link Type ◽

Optimal Value ◽

Computationally Expensive ◽

Resolution Estimation

Abstract K-mer based methods are used ubiquitously in the field of computational biology. However, determining the optimal value of k for a specific application often remains heuristic. Simply reconstructing a new k-mer set with another k-mer size is computationally expensive, especially in metagenomic analysis where data sets are large. Here, we introduce a hashing-based technique that leverages a kind of bottom-m sketch as well as a k-mer ternary search tree (KTST) to obtain k-mer based similarity estimates for a range of k values. By truncating k-mers stored in a pre-built KTST with a large k = kmax value, we can simultaneously obtain k-mer based estimates for all k values up to kmax. This truncation approach circumvents the reconstruction of new k-mer sets when changing k values, making analysis more time- and space-efficient. For example, we show that when using a KTST to estimate the containment index between a RefSeq-based microbial reference database and simulated metagenome data for 10 values of k, the running time is close to 10x faster compared to a classic MinHash approach while using less than one-fifth the space to store the data structure. A Python implementation of this method, CMash, is available at https://github.com/dkoslicki/CMash. The reproduction of all experiments presented herein can be accessed via https://github.com/KoslickiLab/CMASH-reproducibles.
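The truncation idea in this abstract can be illustrated with a minimal Python sketch. It is not the CMash implementation: it uses exact k-mer sets instead of a bottom-m sketch, a plain Python set instead of a KTST, and all function names (`kmers`, `truncate`, `containment`, `multi_k_containment`) are hypothetical. It only shows how one k-mer set built at kmax can be reused for smaller k by keeping length-k prefixes. Prefix truncation misses the last few k-mers at the end of each sequence, so the values for k < kmax are approximations of the exact k-mer containment.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def truncate(kmer_set, k):
    """Truncate every stored k-mer to its length-k prefix."""
    return {kmer[:k] for kmer in kmer_set}


def containment(a, b):
    """Containment index of set a in set b: |a & b| / |a|."""
    return len(a & b) / len(a) if a else 0.0


def multi_k_containment(ref, query, k_max, k_values):
    """Estimate the containment of ref in query for several k values.

    The k-mer sets are built once, at k_max, and then truncated for each
    smaller k instead of being rebuilt from the sequences.
    """
    ref_max = kmers(ref, k_max)
    query_max = kmers(query, k_max)
    return {
        k: containment(truncate(ref_max, k), truncate(query_max, k))
        for k in k_values
    }
```

Calling `multi_k_containment("AAAT", "AAAC", 3, [2, 3])` builds the 3-mer sets once and reuses them for k = 2. A real implementation would store only a sketch of the k-mers and use a prefix-searchable structure, as the abstract describes.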


Improving the Red-Black tree delete algorithm

10.21203/rs.3.rs-1194654/v1 ◽

2021 ◽

Author(s):

ZEGOUR Djamel Eddine

Keyword(s):

Data Structure ◽

Search Tree ◽

Binary Search ◽

Binary Search Tree ◽

Search Performance ◽

Running Time ◽

Standard Algorithm ◽

Color Changes ◽

Red Black Tree ◽

Symbol Tables

Abstract Today, Red-Black trees are becoming a popular data structure typically used to implement dictionaries, associative arrays, symbol tables within some compilers (C++, Java …) and many other systems. In this paper, we present an improvement of the delete algorithm of this kind of binary search tree. The proposed algorithm is very promising since it colors the tree differently while reducing color changes by a factor of about 29%. Moreover, the maintenance operations re-establishing Red-Black tree balance properties are reduced by a factor of about 11%. As a consequence, the proposed algorithm saves about 4% on running time when insert and delete operations are used together, while conserving the search performance of the standard algorithm.


Optimizing viral genome subsampling by genetic diversity and temporal distribution (TARDiS) for Phylogenetics

10.1101/2021.01.15.426832 ◽

2021 ◽

Author(s):

Simone Marini ◽

Carla Mavian ◽

Alberto Riva ◽

Marco Salemi ◽

Brittany Rife Magalis

Keyword(s):

Genetic Algorithm ◽

Genetic Diversity ◽

Viral Genome ◽

Temporal Distribution ◽

Data Sets ◽

Link Type

Abstract TARDiS for Phylogenetics is a novel tool for optimal genetic sub-sampling. It optimizes both genetic diversity and temporal distribution through a genetic algorithm. TARDiS, along with example data sets and a user manual, is available at https://github.com/smarini/tardis-phylogenetics


Optimal Purely Functional Priority Queues

BRICS Report Series ◽

10.7146/brics.v3i37.20019 ◽

1996 ◽

Vol 3 (37) ◽

Author(s):

Gerth Stølting Brodal ◽

Chris Okasaki

Keyword(s):

Data Structure ◽

Higher Order ◽

Priority Queues ◽

Worst Case ◽

Minimum Element ◽

Asymptotically Optimal ◽

Running Time ◽

Recursive Structures ◽

Functional Setting

Brodal recently introduced the first implementation of imperative priority queues to support findMin, insert, and meld in O(1) worst-case time, and deleteMin in O(log n) worst-case time. These bounds are asymptotically optimal among all comparison-based priority queues. In this paper, we adapt Brodal's data structure to a purely functional setting. In doing so, we both simplify the data structure and clarify its relationship to the binomial queues of Vuillemin, which support all four operations in O(log n) time. Specifically, we derive our implementation from binomial queues in three steps: first, we reduce the running time of insert to O(1) by eliminating the possibility of cascading links; second, we reduce the running time of findMin to O(1) by adding a global root to hold the minimum element; and finally, we reduce the running time of meld to O(1) by allowing priority queues to contain other priority queues. Each of these steps is expressed using ML-style functors. The last transformation, known as data-structural bootstrapping, is an interesting application of higher-order functors and recursive structures.
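As a rough illustration of the second step only (a global root holding the minimum), the following Python sketch wraps a plain persistent binomial queue. It is not the paper's construction: the paper uses ML-style functors, and the first and third steps (avoiding cascading links, and bootstrapping) are not shown, so insert here still costs O(log n). All names are hypothetical. Trees are immutable tuples `(rank, root, children)` and a queue is a tuple of trees in increasing rank order.

```python
def link(t1, t2):
    """Link two trees of equal rank; the smaller root becomes the parent."""
    r, x1, c1 = t1
    _, x2, c2 = t2
    if x1 <= x2:
        return (r + 1, x1, (t2,) + c1)
    return (r + 1, x2, (t1,) + c2)


def ins_tree(t, ts):
    """Insert tree t into queue ts, linking while ranks collide."""
    if not ts or t[0] < ts[0][0]:
        return (t,) + ts
    return ins_tree(link(t, ts[0]), ts[1:])


def insert(x, ts):
    return ins_tree((0, x, ()), ts)


def meld(ts1, ts2):
    """Merge two binomial queues, similar to binary addition with carries."""
    if not ts1:
        return ts2
    if not ts2:
        return ts1
    t1, t2 = ts1[0], ts2[0]
    if t1[0] < t2[0]:
        return (t1,) + meld(ts1[1:], ts2)
    if t2[0] < t1[0]:
        return (t2,) + meld(ts1, ts2[1:])
    return ins_tree(link(t1, t2), meld(ts1[1:], ts2[1:]))


def find_min(ts):
    """O(log n): scan the roots of a non-empty binomial queue."""
    return min(t[1] for t in ts)


def delete_min(ts):
    """Remove the tree with the smallest root and meld its children back."""
    i = min(range(len(ts)), key=lambda j: ts[j][1])
    children = ts[i][2]  # stored in decreasing rank order
    return meld(tuple(reversed(children)), ts[:i] + ts[i + 1:])


# Global-root wrapper: a queue is None (empty) or (minimum, binomial queue).
def rooted_insert(x, q):
    if q is None:
        return (x, ())
    m, ts = q
    return (x, insert(m, ts)) if x < m else (m, insert(x, ts))


def rooted_find_min(q):
    """O(1): the minimum is stored at the global root."""
    return q[0]


def rooted_delete_min(q):
    _, ts = q
    if not ts:
        return None
    return (find_min(ts), delete_min(ts))


def drain(q):
    """Return all elements of q in ascending order, leaving q unchanged."""
    out = []
    while q is not None:
        out.append(rooted_find_min(q))
        q = rooted_delete_min(q)
    return out
```

Because every operation returns a new tuple and never mutates its input, older versions of a queue remain valid after an insert or delete, which is the persistence property a purely functional setting provides.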


clusTCR: a Python interface for rapid clustering of large sets of CDR3 sequences

10.1101/2021.02.22.432291 ◽

2021 ◽

Author(s):

Sebastiaan Valkiers ◽

Max Van Houcke ◽

Kris Laukens ◽

Pieter Meysman

Keyword(s):

T Cell ◽

Large Data ◽

Cell Receptor ◽

Amino Acid Sequences ◽

Large Data Sets ◽

Data Sets ◽

Clustering Methods ◽

Link Type ◽

Large Sets ◽

Similar Accuracy

The T-cell receptor (TCR) determines the specificity of a T-cell towards an epitope. As of yet, the rules for antigen recognition remain largely undetermined. Current methods for grouping TCRs according to their epitope specificity remain limited in performance and scalability. Multiple methodologies have been developed, but all of them fail to efficiently cluster large data sets exceeding 1 million sequences. To address this limitation, we developed clusTCR, a rapid TCR clustering alternative that efficiently scales up to millions of CDR3 amino acid sequences. Benchmarking comparisons revealed similar accuracy of clusTCR with other TCR clustering methods. clusTCR offers a drastic improvement in clustering speed, which allows clustering of millions of TCR sequences in just a few minutes through efficient similarity searching and sequence hashing. clusTCR was written in Python 3. It is available as an Anaconda package (https://anaconda.org/svalkiers/clustcr) and on GitHub (https://github.com/svalkiers/clusTCR).


Quickomics: exploring omics data in an intuitive, interactive and informative manner

10.1101/2021.01.19.427296 ◽

2021 ◽

Author(s):

Benbo Gao ◽

Jing Zhu ◽

Soumya Negi ◽

Xinmin Zhang ◽

Stefka Gyoneva ◽

...

Keyword(s):

Modular Design ◽

Functional Module ◽

Supplementary Information ◽

Data Sets ◽

Omics Data ◽

Proteomics Data ◽

Primary Analysis ◽

Link Type ◽

R Shiny ◽

Advanced Analysis

Abstract Summary: We developed Quickomics, a feature-rich R Shiny-powered tool to enable biologists to fully explore complex omics data and perform advanced analysis in an easy-to-use interactive interface. It covers a broad range of secondary and tertiary analytical tasks after primary analysis of omics data is completed. Each functional module is equipped with customized configurations and generates both interactive and publication-ready high-resolution plots to uncover biological insights from data. The modular design makes the tool extensible with ease. Availability: Researchers can experience the functionalities with their own data or demo RNA-Seq and proteomics data sets by using the app hosted at http://quickomics.bxgenomics.com and following the tutorial, https://bit.ly/3rXIyhL. The source code under GPLv3 license is provided at https://github.com/interactivereport/[email protected], [email protected]. Supplementary information: Supplementary materials are available at https://bit.ly/37HP17g.


Microbiome Search Engine 2: a Platform for Taxonomic and Functional Search of Global Microbiomes on the Whole-Microbiome Level

mSystems ◽

10.1128/msystems.00943-20 ◽

2021 ◽

Vol 6 (1) ◽

Author(s):

Gongchao Jing ◽

Lu Liu ◽

Zengbin Wang ◽

Yufeng Zhang ◽

Li Qian ◽

...

Keyword(s):

Big Data ◽

User Interface ◽

Search Engine ◽

Functional Similarity ◽

Metagenomic Data ◽

Data Sets ◽

Data Space ◽

Link Type ◽

Database Platform ◽

Microbiome Data

ABSTRACT Metagenomic data sets from diverse environments have been growing rapidly. To ensure accessibility and reusability, tools that quickly and informatively correlate new microbiomes with existing ones are in demand. Here, we introduce Microbiome Search Engine 2 (MSE 2), a microbiome database platform for searching query microbiomes in the global metagenome data space based on the taxonomic or functional similarity of a whole microbiome to those in the database. MSE 2 consists of (i) a well-organized and regularly updated microbiome database that currently contains over 250,000 metagenomic shotgun and 16S rRNA gene amplicon samples associated with unified metadata collected from 798 studies, (ii) an enhanced search engine that enables real-time and fast (<0.5 s per query) searches against the entire database for best-matched microbiomes using overall taxonomic or functional profiles, and (iii) a Web-based graphical user interface for user-friendly searching, data browsing, and tutoring. MSE 2 is freely accessible via http://mse.ac.cn. For standalone searches of customized microbiome databases, the kernel of the MSE 2 search engine is provided at GitHub (https://github.com/qibebt-bioinfo/meta-storms). IMPORTANCE A search-based strategy is useful for large-scale mining of microbiome data sets, such as a bird’s-eye view of the microbiome data space and disease diagnosis via microbiome big data. Here, we introduce Microbiome Search Engine 2 (MSE 2), a microbiome database platform for searching query microbiomes against the existing microbiome data sets on the basis of their similarity in taxonomic structure or functional profile. Key improvements include database extension, data compatibility, a search engine kernel, and a user interface. The new ability to search the microbiome space via functional similarity greatly expands the scope of search-based mining of the microbiome big data.


2017 ISCB Accomplishment by a Senior Scientist Award: Pavel Pevzner

F1000Research ◽

10.12688/f1000research.11588.1 ◽

2017 ◽

Vol 6 ◽

pp. 1001

Author(s):

Christiana N. Fogg ◽

Diane E. Kovats ◽

Bonnie Berger

Keyword(s):

Computational Biology ◽

Intelligent Systems ◽

University Of California ◽

Joint Meeting ◽

Massachusetts Institute ◽

Massachusetts Institute Of Technology ◽

Senior Scientist ◽

Link Type ◽

Institute Of Technology ◽

Research Service

The International Society for Computational Biology (ISCB) recognizes an established scientist each year with the Accomplishment by a Senior Scientist Award for significant contributions he or she has made to the field. This award honors scientists who have contributed to the advancement of computational biology and bioinformatics through their research, service, and education work. Pavel Pevzner, PhD, Ronald R. Taylor Professor of Computer Science and Director of the NIH Center for Computational Mass Spectrometry at University of California, San Diego, has been selected as the winner of the 2017 Accomplishment by a Senior Scientist Award. The ISCB awards committee, chaired by Dr. Bonnie Berger of the Massachusetts Institute of Technology, selected Pevzner as the 2017 winner. Pevzner will receive his award and deliver a keynote address at the 2017 Intelligent Systems for Molecular Biology-European Conference on Computational Biology joint meeting (ISMB/ECCB 2017) held in Prague, Czech Republic from July 21-July 25, 2017. ISMB/ECCB is a biennial joint meeting that brings together leading scientists in computational biology and bioinformatics from around the globe.


A scalable eigenspace-based fuzzy c-means for topic detection

Data Technologies and Applications ◽

10.1108/dta-11-2020-0262 ◽

2021 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Hendri Murfi

Keyword(s):

Representation Learning ◽

Detection Methods ◽

Data Sets ◽

Topic Detection ◽

Data Set ◽

Content Type ◽

Running Time ◽

Fuzzy C Means ◽

Coherence Score ◽

Value Decomposition

Purpose: The aim of this research is to develop an eigenspace-based fuzzy c-means method for scalable topic detection. Design/methodology/approach: The eigenspace-based fuzzy c-means (EFCM) combines representation learning and clustering. The textual data are transformed into a lower-dimensional eigenspace using truncated singular value decomposition. Fuzzy c-means is performed on the eigenspace to identify the centroids of each cluster. The topics are provided by transforming the centroids back into the nonnegative subspace of the original space. In this paper, we extend the EFCM method for scalability using two approaches, i.e. single-pass and online. We call the developed topic detection methods oEFCM and spEFCM. Findings: Our simulation shows that both oEFCM and spEFCM methods provide faster running times than EFCM for data sets that do not fit in memory. However, there is a decrease in the average coherence score. For both data sets that fit and do not fit into memory, the oEFCM method provides a tradeoff between running time and coherence score, which is better than spEFCM. Originality/value: This research produces a scalable topic detection method. Besides scalability, the developed method also provides a faster running time for the data set that fits in memory.


Application of Computational Biology and Artificial Intelligence Technologies in Cancer Precision Drug Discovery

BioMed Research International ◽

10.1155/2019/8427042 ◽

2019 ◽

Vol 2019 ◽

pp. 1-15 ◽

Cited By ~ 2

Author(s):

Nagasundaram Nagarajan ◽

Edward K. Y. Yapp ◽

Nguyen Quoc Khanh Le ◽

Balu Kamaraj ◽

Abeer Mohammed Al-Subaie ◽

...

Keyword(s):

Artificial Intelligence ◽

Drug Discovery ◽

Computational Biology ◽

New Drugs ◽

Pharmaceutical Companies ◽

Data Sets ◽

Aggregated Data ◽

Usable Knowledge ◽

Knowledge Of Cancer ◽

Generation Sequencing

Artificial intelligence (AI) has enormous potential in many areas of healthcare, including research and chemical discoveries. Using large amounts of aggregated data, AI can discover and learn, further transforming these data into “usable” knowledge. Being well aware of this, the world’s leading pharmaceutical companies have already begun to use artificial intelligence to improve their research regarding new drugs. The goal is to exploit modern computational biology and machine learning systems to predict the molecular behaviour and the likelihood of getting a useful drug, thus saving time and money on unnecessary tests. Clinical studies, electronic medical records, high-resolution medical images, and genomic profiles can be used as resources to aid drug development. Pharmaceutical and medical researchers have extensive data sets that can be analyzed by strong AI systems. This review focused on how computational biology and artificial intelligence technologies can be implemented by integrating the knowledge of cancer drugs, drug resistance, next-generation sequencing, genetic variants, and structural biology in cancer precision drug discovery.


A Balanced Multiway Search Tree for Multi-Dimension Searching

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.44-47.3574 ◽

2010 ◽

Vol 44-47 ◽

pp. 3574-3578

Author(s):

Ai Guo Li ◽

Chi Zhang ◽

Jiu Long Zhang ◽

Zhen Hai Zhang

Keyword(s):

Search Tree ◽

Experimental Results ◽

Extra Cost ◽

Leaf Node ◽

Index Structure ◽

Data Sets ◽

Comprehensive Performance ◽

Data Files ◽

Tree Index ◽

Tree Building

A new multi-dimensional index structure called RSR-tree is proposed, which is based on the RS-tree. In the RSR-tree, index records of a leaf node are split to ensure the sequence ordering of index records in a leaf node, which effectively reduces the addressing cost of I/O operations when reading data files. The entries of a non-leaf node are split to decrease the overlap between sibling nodes, which effectively reduces the time of reading data from data files. Experimental results on different data sets show that, compared to the RS-tree, the RSR-tree has better comprehensive performance with regard to tree building and querying. The querying performance is increased and no extra cost is produced.
