Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers

The identification of reference genomes and taxonomic labels from metagenome data underlies many microbiome studies. Here we describe two algorithms for compositional analysis of metagenome sequencing data. We first investigate the FracMinHash sketching technique, a derivative of modulo hash that supports Jaccard containment estimation between sets of different sizes. We implement FracMinHash in the sourmash software, evaluate its accuracy, and demonstrate large-scale containment searches of metagenomes using 700,000 microbial reference genomes. We next frame shotgun metagenome compositional analysis as the problem of finding a minimum collection of reference genomes that "cover" the known k-mers in a metagenome, a minimum set cover problem. We implement a greedy approximate solution using FracMinHash sketches, and evaluate its accuracy for taxonomic assignment using a CAMI community benchmark. Finally, we show that the minimum metagenome cover can be used to guide the selection of reference genomes for read mapping. sourmash is available as open source software under the BSD 3-Clause license at github.com/dib-lab/sourmash/.

Download Full-text

M3S: a comprehensive model selection for multi-modal single-cell RNA sequencing data

BMC Bioinformatics ◽

10.1186/s12859-019-3243-1 ◽

2019 ◽

Vol 20 (S24) ◽

Cited By ~ 1

Author(s):

Yu Zhang ◽

Changlin Wan ◽

Pengcheng Wang ◽

Wennan Chang ◽

Yan Huo ◽

...

Keyword(s):

Gene Expression ◽

Model Selection ◽

Statistical Model ◽

Single Cell ◽

Differential Gene Expression ◽

Large Scale ◽

Expression Profiles ◽

Sequencing Data ◽

Differential Gene ◽

Selection Of

Abstract Background Various statistical models have been developed to model the single cell RNA-seq expression profiles, capture its multimodality, and conduct differential gene expression test. However, for expression data generated by different experimental design and platforms, there is currently lack of capability to determine the most proper statistical model. Results We developed an R package, namely Multi-Modal Model Selection (M3S), for gene-wise selection of the most proper multi-modality statistical model and downstream analysis, useful in a single-cell or large scale bulk tissue transcriptomic data. M3S is featured with (1) gene-wise selection of the most parsimonious model among 11 most commonly utilized ones, that can best fit the expression distribution of the gene, (2) parameter estimation of a selected model, and (3) differential gene expression test based on the selected model. Conclusion A comprehensive evaluation suggested that M3S can accurately capture the multimodality on simulated and real single cell data. An open source package and is available through GitHub at https://github.com/zy26/M3S.

Download Full-text

Sketch distance-based clustering of chromosomes for large genome database compression

BMC Genomics ◽

10.1186/s12864-019-6310-0 ◽

2019 ◽

Vol 20 (S10) ◽

Author(s):

Tao Tang ◽

Yuansheng Liu ◽

Buzhong Zhang ◽

Benyue Su ◽

Jinyan Li

Keyword(s):

Compression Ratio ◽

Large Scale ◽

Rapid Development ◽

Rice Genome ◽

Sequencing Data ◽

Genome Database ◽

Compression Algorithms ◽

Large Genome ◽

Reference Selection ◽

Reference Genomes

Abstract Background The rapid development of Next-Generation Sequencing technologies enables sequencing genomes with low cost. The dramatically increasing amount of sequencing data raised crucial needs for efficient compression algorithms. Reference-based compression algorithms have exhibited outstanding performance on compressing single genomes. However, for the more challenging and more useful problem of compressing a large collection of n genomes, straightforward application of these reference-based algorithms suffers a series of issues such as difficult reference selection and remarkable performance variation. Results We propose an efficient clustering-based reference selection algorithm for reference-based compression within separate clusters of the n genomes. This method clusters the genomes into subsets of highly similar genomes using MinHash sketch distance, and uses the centroid sequence of each cluster as the reference genome for an outstanding reference-based compression of the remaining genomes in each cluster. A final reference is then selected from these reference genomes for the compression of the remaining reference genomes. Our method significantly improved the performance of the-state-of-art compression algorithms on large-scale human and rice genome databases containing thousands of genome sequences. The compression ratio gain can reach up to 20-30% in most cases for the datasets from NCBI, the 1000 Human Genomes Project and the 3000 Rice Genomes Project. The best improvement boosts the performance from 351.74 compression folds to 443.51 folds. Conclusions The compression ratio of reference-based compression on large scale genome datasets can be improved via reference selection by applying appropriate data preprocessing and clustering methods. Our algorithm provides an efficient way to compress large genome database.

Download Full-text

Kuhn–Munkres Parallel Genetic Algorithm for the Set Cover Problem and Its Application to Large-Scale Wireless Sensor Networks

IEEE Transactions on Evolutionary Computation ◽

10.1109/tevc.2015.2511142 ◽

2016 ◽

Vol 20 (5) ◽

pp. 695-710 ◽

Cited By ~ 44

Author(s):

Xin-Yuan Zhang ◽

Jun Zhang ◽

Yue-Jiao Gong ◽

Zhi-Hui Zhan ◽

Wei-Neng Chen ◽

...

Keyword(s):

Genetic Algorithm ◽

Wireless Sensor Networks ◽

Sensor Networks ◽

Large Scale ◽

Wireless Sensor ◽

Set Cover ◽

Parallel Genetic Algorithm ◽

Set Cover Problem ◽

Cover Problem

Download Full-text

Sequence variation aware genome references and read mapping with the variation graph toolkit

10.1101/234856 ◽

2017 ◽

Cited By ~ 12

Author(s):

Erik Garrison ◽

Jouni Sirén ◽

Adam M. Novak ◽

Glenn Hickey ◽

Jordan M. Eizenga ◽

...

Keyword(s):

Dna Sequence ◽

Large Scale ◽

De Novo ◽

Sequence Data ◽

Variant Calling ◽

Read Mapping ◽

Dna Sequence Data ◽

Suffix Arrays ◽

Improved Accuracy ◽

Reference Genomes

AbstractReference genomes guide our interpretation of DNA sequence data. However, conventional linear references are fundamentally limited in that they represent only one version of each locus, whereas the population may contain multiple variants. When the reference represents an individual’s genome poorly, it can impact read mapping and introduce bias. Variation graphs are bidirected DNA sequence graphs that compactly represent genetic variation, including large scale structural variation such as inversions and duplications.1 Equivalent structures are produced by de novo genome assemblers.2,3 Here we present vg, a toolkit of computational methods for creating, manipulating, and utilizing these structures as references at the scale of the human genome. vg provides an efficient approach to mapping reads onto arbitrary variation graphs using generalized compressed suffix arrays,4 with improved accuracy over alignment to a linear reference, creating data structures to support downstream variant calling and genotyping. These capabilities make using variation graphs as reference structures for DNA sequencing practical at the scale of vertebrate genomes, or at the topological complexity of new species assemblies.

Download Full-text

High-quality carnivore genomes from roadkill samples enable species delimitation in aardwolf and bat-eared fox

10.1101/2020.09.15.297622 ◽

2020 ◽

Author(s):

Rémi Allio ◽

Marie-Ka Tilak ◽

Céline Scornavacca ◽

Nico L. Avenant ◽

Erwan Corre ◽

...

Keyword(s):

Large Scale ◽

Incomplete Lineage Sorting ◽

Cost Effective ◽

Genomic Diversity ◽

Sequencing Data ◽

Lineage Sorting ◽

Long Reads ◽

Genomic Studies ◽

Eastern And Southern Africa ◽

Reference Genomes

AbstractIn a context of ongoing biodiversity erosion, obtaining genomic resources from wildlife is becoming essential for conservation. The thousands of yearly mammalian roadkill could potentially provide a useful source material for genomic surveys. To illustrate the potential of this underexploited resource, we used roadkill samples to sequence reference genomes and study the genomic diversity of the bat-eared fox (Otocyon megalotis) and the aardwolf (Proteles cristata) for which subspecies have been defined based on similar disjunct distributions in Eastern and Southern Africa. By developing an optimized DNA extraction protocol, we successfully obtained long reads using the Oxford Nanopore Technologies (ONT) MinION device. For the first time in mammals, we obtained two reference genomes with high contiguity and gene completeness by combining ONT long reads with Illumina short reads using hybrid assembly. Based on re-sequencing data from few other roakill samples, the comparison of the genetic differentiation between our two pairs of subspecies to that of pairs of well-defined species across Carnivora showed that the two subspecies of aardwolf might warrant species status (P. cristata and P. septentrionalis), whereas the two subspecies of bat-eared fox might not. Moreover, using these data, we conducted demographic analyses that revealed similar trajectories between Eastern and Southern populations of both species, suggesting that their population sizes have been shaped by similar environmental fluctuations. Finally, we obtained a well resolved genome-scale phylogeny for Carnivora with evidence for incomplete lineage sorting among the three main arctoid lineages. Overall, our cost-effective strategy opens the way for large-scale population genomic studies and phylogenomics of mammalian wildlife using roadkill.

Download Full-text

Objective and Comprehensive Evaluation of Bisulfite Short Read Mapping Tools

Advances in Bioinformatics ◽

10.1155/2014/472045 ◽

2014 ◽

Vol 2014 ◽

pp. 1-11 ◽

Cited By ~ 20

Author(s):

Hong Tran ◽

Jacob Porter ◽

Ming-an Sun ◽

Hehuang Xie ◽

Liqing Zhang

Keyword(s):

Large Scale ◽

Developmental Stages ◽

Good Choice ◽

Environment Interaction ◽

Sequencing Data ◽

Read Mapping ◽

Short Read ◽

Short Reads ◽

Short Read Mapping ◽

Gene And Environment Interaction

Background. Large-scale bisulfite treatment and short reads sequencing technology allow comprehensive estimation of methylation states of Cs in the genomes of different tissues, cell types, and developmental stages. Accurate characterization of DNA methylation is essential for understanding genotype phenotype association, gene and environment interaction, diseases, and cancer. Aligning bisulfite short reads to a reference genome has been a challenging task. We compared five bisulfite short read mapping tools, BSMAP, Bismark, BS-Seeker, BiSS, and BRAT-BW, representing two classes of mapping algorithms (hash table and suffix/prefix tries). We examined their mapping efficiency (i.e., the percentage of reads that can be mapped to the genomes), usability, running time, and effects of changing default parameter settings using both real and simulated reads. We also investigated how preprocessing data might affect mapping efficiency. Conclusion. Among the five programs compared, in terms of mapping efficiency, Bismark performs the best on the real data, followed by BiSS, BSMAP, and finally BRAT-BW and BS-Seeker with very similar performance. If CPU time is not a constraint, Bismark is a good choice of program for mapping bisulfite treated short reads. Data quality impacts a great deal mapping efficiency. Although increasing the number of mismatches allowed can increase mapping efficiency, it not only significantly slows down the program, but also runs the risk of having increased false positives. Therefore, users should carefully set the related parameters depending on the quality of their sequencing data.

Download Full-text

Platelet Anti-Aggregating Activity and Tolerance of Clopidogrel in Atherosclerotic Patients

Thrombosis and Haemostasis ◽

10.1055/s-0038-1650689 ◽

1996 ◽

Vol 76 (06) ◽

pp. 0939-0943 ◽

Cited By ~ 74

Author(s):

B Boneu ◽

G Destelle ◽

Keyword(s):

Platelet Aggregation ◽

Large Scale ◽

Bleeding Time ◽

Antithrombotic Activity ◽

Ambulatory Patients ◽

Patients At Risk ◽

Set Up ◽

Ischemic Events ◽

Large Scale Clinical Trial ◽

Selection Of

SummaryThe anti-aggregating activity of five rising doses of clopidogrel has been compared to that of ticlopidine in atherosclerotic patients. The aim of this study was to determine the dose of clopidogrel which should be tested in a large scale clinical trial of secondary prevention of ischemic events in patients suffering from vascular manifestations of atherosclerosis [CAPRIE (Clopidogrel vs Aspirin in Patients at Risk of Ischemic Events) trial]. A multicenter study involving 9 haematological laboratories and 29 clinical centers was set up. One hundred and fifty ambulatory patients were randomized into one of the seven following groups: clopidogrel at doses of 10, 25, 50,75 or 100 mg OD, ticlopidine 250 mg BID or placebo. ADP and collagen-induced platelet aggregation tests were performed before starting treatment and after 7 and 28 days. Bleeding time was performed on days 0 and 28. Patients were seen on days 0, 7 and 28 to check the clinical and biological tolerability of the treatment. Clopidogrel exerted a dose-related inhibition of ADP-induced platelet aggregation and bleeding time prolongation. In the presence of ADP (5 \lM) this inhibition ranged between 29% and 44% in comparison to pretreatment values. The bleeding times were prolonged by 1.5 to 1.7 times. These effects were non significantly different from those produced by ticlopidine. The clinical tolerability was good or fair in 97.5% of the patients. No haematological adverse events were recorded. These results allowed the selection of 75 mg once a day to evaluate and compare the antithrombotic activity of clopidogrel to that of aspirin in the CAPRIE trial.

Download Full-text

Plasmids or no plasmids? A comparison between the agilent TapeStation and whole-genome sequencing data in a large-scale bacterial sequencing project

10.26226/morressier.56d5ba27d462b80296c95fe7 ◽

2016 ◽

Author(s):

Sarah Alexander

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Large Scale ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Sequencing Project

Download Full-text

Methodology for Determining the Location of River Ports on a Modernized Waterway Based on Non-Cost Criteria: A Case Study of the Odra River Waterway

Sustainability ◽

10.3390/su13063571 ◽

2021 ◽

Vol 13 (6) ◽

pp. 3571

Author(s):

Bogusz Wiśnicki ◽

Dorota Dybkowska-Stefek ◽

Justyna Relisko-Rybak ◽

Łukasz Kolanda

Keyword(s):

Large Scale ◽

Multicriteria Decision Making ◽

Research Process ◽

Optimal Selection ◽

Investment Projects ◽

Odra River ◽

Research Problems ◽

Selection Of ◽

Construction Works

The paper responds to research problems related to the implementation of large-scale investment projects in waterways in Europe. As part of design and construction works, it is necessary to indicate river ports that play a major role within the European transport network as intermodal nodes. This entails a number of challenges, the cardinal one being the optimal selection of port locations, taking into account the new transport, economic, and geopolitical situation that will be brought about by modernized waterways. The aim of the paper was to present an original methodology for determining port locations for modernized waterways based on non-cost criteria, as an extended multicriteria decision-making method (MCDM) and employing GIS (Geographic Information System)-based tools for spatial analysis. The methodology was designed to be applicable to the varying conditions of a river’s hydroengineering structures (free-flowing river, canalized river, and canals) and adjustable to the requirements posed by intermodal supply chains. The method was applied to study the Odra River Waterway, which allowed the formulation of recommendations regarding the application of the method in the case of different river sections at every stage of the research process.

Download Full-text

BonMOLière: Small-Sized Libraries of Readily Purchasable Compounds, Optimized to Produce Genuine Hits in Biological Screens across the Protein Space

International Journal of Molecular Sciences ◽

10.3390/ijms22157773 ◽

2021 ◽

Vol 22 (15) ◽

pp. 7773

Author(s):

Neann Mathai ◽

Conrad Stork ◽

Johannes Kirchmair

Keyword(s):

Large Scale ◽

Computational Approach ◽

Large Sets ◽

Compound Libraries ◽

Wide Range ◽

Protein Space ◽

High Chance ◽

Large Scale Screening ◽

Early Drug ◽

Selection Of

Experimental screening of large sets of compounds against macromolecular targets is a key strategy to identify novel bioactivities. However, large-scale screening requires substantial experimental resources and is time-consuming and challenging. Therefore, small to medium-sized compound libraries with a high chance of producing genuine hits on an arbitrary protein of interest would be of great value to fields related to early drug discovery, in particular biochemical and cell research. Here, we present a computational approach that incorporates drug-likeness, predicted bioactivities, biological space coverage, and target novelty, to generate optimized compound libraries with maximized chances of producing genuine hits for a wide range of proteins. The computational approach evaluates drug-likeness with a set of established rules, predicts bioactivities with a validated, similarity-based approach, and optimizes the composition of small sets of compounds towards maximum target coverage and novelty. We found that, in comparison to the random selection of compounds for a library, our approach generates substantially improved compound sets. Quantified as the “fitness” of compound libraries, the calculated improvements ranged from +60% (for a library of 15,000 compounds) to +184% (for a library of 1000 compounds). The best of the optimized compound libraries prepared in this work are available for download as a dataset bundle (“BonMOLière”).

Download Full-text