Digestiflow: from BCL to FASTQ with ease

10.7287/peerj.preprints.27717v4 ◽

2019 ◽

Author(s):

Manuel Holtgrewe ◽

Mikko Nieminen ◽

Clemens Messerschmidt ◽

Dieter Beule

Keyword(s):

Quality Control ◽

Software Package ◽

Flow Cell ◽

Automated Extraction ◽

Sequencing Data ◽

Report Generation ◽

Raw Data ◽

Cell Sample ◽

Cell Data

Management of raw sequencing data and its preprocessing (conversion into sequences and demultiplexing) remains a challenging topic for groups running sequencing devices. They face many challenges in such efforts, and solutions range from manual management of spreadsheets to very complex and customized LIMS systems that handle much more than just raw sequencing data. In this manuscript, we describe the software package DigestiFlow, which focuses on the management of Illumina flow cell sample sheets and raw data. It allows for automated extraction of information from flow cell data and management of sample sheets. Furthermore, it allows for the automated and reproducible conversion of Illumina base calls to sequences and the demultiplexing thereof using bcl2fastq and Picard Tools, followed by quality control report generation.
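To make the term "demultiplexing" concrete, the following is a minimal sketch of the general idea: each read carries an index (barcode) sequence, and the read is assigned to the sample whose barcode matches within a small number of mismatches. This is an illustration only and not how DigestiFlow, bcl2fastq or Picard implement it; the function names, the sample names and the barcodes are all hypothetical.

```python
from typing import Dict, Optional


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def assign_sample(
    index_read: str, barcodes: Dict[str, str], max_mismatches: int = 1
) -> Optional[str]:
    """Return the sample whose barcode matches the index read.

    A read is assigned only if exactly one barcode lies within
    `max_mismatches`; ambiguous or unmatched reads return None.
    """
    hits = [
        sample
        for sample, barcode in barcodes.items()
        if len(barcode) == len(index_read)
        and hamming(index_read, barcode) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None


# Hypothetical barcodes, for illustration only.
BARCODES = {"sample_a": "ACGTAC", "sample_b": "TTGCAA"}

exact = assign_sample("ACGTAC", BARCODES)      # perfect match
one_off = assign_sample("ACGTAA", BARCODES)    # one mismatch to sample_a
unmatched = assign_sample("GGGGGG", BARCODES)  # no barcode close enough
```

Returning None for ambiguous reads, rather than picking the first hit, is a deliberate choice in this sketch: it avoids silently assigning a read to the wrong sample.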


Digestiflow: from BCL to FASTQ with ease

10.7287/peerj.preprints.27717 ◽

2019 ◽

Author(s):

Manuel Holtgrewe ◽

Mikko Nieminen ◽

Clemens Messerschmidt ◽

Dieter Beule

Keyword(s):

Quality Control ◽

Software Package ◽

Flow Cell ◽

Automated Extraction ◽

Sequencing Data ◽

Report Generation ◽

Raw Data ◽

Cell Sample ◽

Cell Data

Management of raw sequencing data and its preprocessing (conversion into sequences and demultiplexing) remains a challenging topic for groups running sequencing devices. They face many challenges in such efforts, and solutions range from manual management of spreadsheets to very complex and customized LIMS systems that handle much more than just raw sequencing data. In this manuscript, we describe the software package DigestiFlow, which focuses on the management of Illumina flow cell sample sheets and raw data. It allows for automated extraction of information from flow cell data and management of sample sheets. Furthermore, it allows for the automated and reproducible conversion of Illumina base calls to sequences and the demultiplexing thereof using bcl2fastq and Picard Tools, followed by quality control report generation.


DigestiFlow - reproducible demultiplexing for the single cell era

10.7287/peerj.preprints.27717v3 ◽

2019 ◽

Author(s):

Manuel Holtgrewe ◽

Mikko Nieminen ◽

Clemens Messerschmidt ◽

Dieter Beule

Keyword(s):

Single Cell ◽

Information Management ◽

Software Package ◽

Flow Cell ◽

Software Components ◽

Automated Extraction ◽

Sequencing Data ◽

Raw Data ◽

Cell Sample

Summary. Managing raw sequencing data and conversion into sequences (demultiplexing) remains a challenging topic for groups running sequencing devices. They face many challenges in such efforts and solutions range from manual management of spreadsheets to very complex and customized LIMS systems handling much more than just sequencing raw data. In this manuscript, we describe the software package DigestiFlow that focuses on the management of Illumina flow cell sample sheets and raw data. Namely, it allows for automated extraction of flow cell raw data information, management of sample sheets, and the automated (and thus reproducible) demultiplexing of Illumina base calls data. Availability and Implementation. The software is available under the MIT license at https://github.com/bihealth/digestiflow-server. The client and demux software components are available via Bioconda.
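As a rough illustration of what managing sample sheets involves, the sketch below reads the `[Data]` section of an Illumina-style sample sheet into a list of rows. It assumes the common layout in which section names appear in square brackets and the `[Data]` section is a CSV table with a header line; the column names and values in the example are hypothetical, and the code is not taken from DigestiFlow.

```python
import csv
import io
from typing import Dict, List


def read_data_section(sheet_text: str) -> List[Dict[str, str]]:
    """Return the rows of the [Data] section as dictionaries.

    Assumes a sectioned CSV layout: a line starting with "[Data]"
    opens the table, and the next line starting with "[" (or the end
    of the file) closes it. Blank lines are ignored.
    """
    lines = sheet_text.splitlines()
    start = None
    for i, line in enumerate(lines):
        if line.strip().startswith("[Data]"):
            start = i + 1
            break
    if start is None:
        return []

    body = []
    for line in lines[start:]:
        stripped = line.strip()
        if stripped.startswith("["):
            break  # next section begins
        if stripped:
            body.append(line)
    return list(csv.DictReader(io.StringIO("\n".join(body))))


# Hypothetical sample sheet content, for illustration only.
EXAMPLE_SHEET = """[Header]
Date,2019-01-01

[Data]
Sample_ID,index
sample_a,ACGTAC
sample_b,TTGCAA
"""

rows = read_data_section(EXAMPLE_SHEET)
```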


SGTK: a toolkit for visualization and assessment of scaffold graphs

Bioinformatics ◽

10.1093/bioinformatics/bty956 ◽

2018 ◽

Vol 35 (13) ◽

pp. 2303-2305 ◽

Cited By ~ 2

Author(s):

Olga Kunyavskaya ◽

Andrey D Prjibelski

Keyword(s):

Software Package ◽

Supplementary Information ◽

Sequencing Data ◽

Software Developers ◽

Long Reads ◽

Mate Pair ◽

Linkage Information ◽

Assembly Pipeline ◽

Genome Assemblies ◽

Assembly Software

Abstract Summary Scaffolding is an important step in every genome assembly pipeline; it orders contigs into longer sequences using various types of linkage information, such as mate-pair libraries and long reads. In this work, we operate with the notion of a scaffold graph: a graph whose vertices correspond to the assembled contigs and whose edges represent connections between them. We present a software package called Scaffold Graph ToolKit that constructs and visualizes scaffold graphs using different kinds of sequencing data. We show that the scaffold graph appears to be useful for analyzing and assessing genome assemblies, and demonstrate several use cases that can be helpful for both assembly software developers and their users. Availability and implementation SGTK is implemented in C++, Python and JavaScript and is freely available at https://github.com/olga24912/SGTK. Supplementary information Supplementary data are available at Bioinformatics online.
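The scaffold graph described above can be sketched as a small data structure: contigs are vertices, and an edge records how many pieces of linkage evidence (for example read pairs) connect two contigs. The sketch below is a simplification that ignores contig orientation and distance estimates; the function name, the `min_support` threshold and the contig names are assumptions for illustration and do not reflect SGTK's implementation.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def build_scaffold_graph(
    links: Iterable[Tuple[str, str]], min_support: int = 2
) -> Dict[str, Dict[str, int]]:
    """Build a directed, weighted scaffold graph from contig links.

    Each link (src, dst) is one piece of evidence that contig `src`
    is followed by contig `dst`. Edges supported by fewer than
    `min_support` links are dropped as likely noise.
    """
    weights: Dict[Tuple[str, str], int] = defaultdict(int)
    for src, dst in links:
        weights[(src, dst)] += 1

    graph: Dict[str, Dict[str, int]] = defaultdict(dict)
    for (src, dst), weight in weights.items():
        if weight >= min_support:
            graph[src][dst] = weight
    return {src: dict(neighbours) for src, neighbours in graph.items()}


# Hypothetical links between three contigs.
LINKS = [
    ("c1", "c2"), ("c1", "c2"),
    ("c2", "c3"), ("c2", "c3"), ("c2", "c3"),
    ("c1", "c3"),  # single link, below the support threshold
]

graph = build_scaffold_graph(LINKS)
```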


SCHNEL: Scalable clustering of high dimensional single-cell data

10.1101/2020.03.30.015925 ◽

2020 ◽

Cited By ~ 1

Author(s):

Tamim Abdelaal ◽

Paul de Raadt ◽

Boudewijn P.F. Lelieveldt ◽

Marcel J.T. Reinders ◽

Ahmed Mahfouz

Keyword(s):

Single Cell ◽

Graph Clustering ◽

Original Data ◽

Large Datasets ◽

Supplementary Information ◽

High Dimensional ◽

Sequencing Data ◽

Novel Approach ◽

Cellular Markers ◽

Cell Data

Abstract Motivation Single cell data measures multiple cellular markers at the single-cell level for thousands to millions of cells. Identification of distinct cell populations is a key step for further biological understanding, usually performed by clustering this data. Dimensionality reduction based clustering tools are either not scalable to large datasets containing millions of cells, or not fully automated requiring an initial manual estimation of the number of clusters. Graph clustering tools provide automated and reliable clustering for single cell data, but suffer heavily from scalability to large datasets. Results We developed SCHNEL, a scalable, reliable and automated clustering tool for high-dimensional single-cell data. SCHNEL transforms large high-dimensional data to a hierarchy of datasets containing subsets of data points following the original data manifold. The novel approach of SCHNEL combines this hierarchical representation of the data with graph clustering, making graph clustering scalable to millions of cells. Using seven different cytometry datasets, SCHNEL outperformed three popular clustering tools for cytometry data, and was able to produce meaningful clustering results for datasets of 3.5 and 17.2 million cells within workable timeframes. In addition, we show that SCHNEL is a general clustering tool by applying it to single-cell RNA sequencing data, as well as a popular machine learning benchmark dataset MNIST. Availability and Implementation Implementation is available on GitHub (https://github.com/paulderaadt/HSNE-clustering). Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.


SCHNEL: scalable clustering of high dimensional single-cell data

Bioinformatics ◽

10.1093/bioinformatics/btaa816 ◽

2020 ◽

Vol 36 (Supplement_2) ◽

pp. i849-i856

Author(s):

Tamim Abdelaal ◽

Paul de Raadt ◽

Boudewijn P F Lelieveldt ◽

Marcel J T Reinders ◽

Ahmed Mahfouz

Keyword(s):

Single Cell ◽

Graph Clustering ◽

Original Data ◽

Large Datasets ◽

Supplementary Information ◽

High Dimensional ◽

Sequencing Data ◽

Novel Approach ◽

Cellular Markers ◽

Cell Data

Abstract Motivation Single cell data measures multiple cellular markers at the single-cell level for thousands to millions of cells. Identification of distinct cell populations is a key step for further biological understanding, usually performed by clustering this data. Dimensionality reduction based clustering tools are either not scalable to large datasets containing millions of cells, or not fully automated requiring an initial manual estimation of the number of clusters. Graph clustering tools provide automated and reliable clustering for single cell data, but suffer heavily from scalability to large datasets. Results We developed SCHNEL, a scalable, reliable and automated clustering tool for high-dimensional single-cell data. SCHNEL transforms large high-dimensional data to a hierarchy of datasets containing subsets of data points following the original data manifold. The novel approach of SCHNEL combines this hierarchical representation of the data with graph clustering, making graph clustering scalable to millions of cells. Using seven different cytometry datasets, SCHNEL outperformed three popular clustering tools for cytometry data, and was able to produce meaningful clustering results for datasets of 3.5 and 17.2 million cells within workable time frames. In addition, we show that SCHNEL is a general clustering tool by applying it to single-cell RNA sequencing data, as well as a popular machine learning benchmark dataset MNIST. Availability and implementation Implementation is available on GitHub (https://github.com/biovault/SCHNELpy). All datasets used in this study are publicly available. Supplementary information Supplementary data are available at Bioinformatics online.


Using single-cell cytometry to illustrate integrated multi-perspective evaluation of clustering algorithms using Pareto fronts

Bioinformatics ◽

10.1093/bioinformatics/btab038 ◽

2021 ◽

Author(s):

Givanna H Putri ◽

Irena Koprinska ◽

Thomas M Ashhurst ◽

Nicholas J C King ◽

Mark N Read

Keyword(s):

Single Cell ◽

Performance Metrics ◽

Clustering Algorithms ◽

Latin Hypercube Sampling ◽

Supplementary Information ◽

Sequencing Data ◽

Evaluation Protocol ◽

Benchmark Datasets ◽

Pareto Fronts ◽

Parameter Values

Abstract Motivation Many ‘automated gating’ algorithms now exist to cluster cytometry and single-cell sequencing data into discrete populations. Comparative algorithm evaluations on benchmark datasets rely either on a single performance metric, or a few metrics considered independently of one another. However, single metrics emphasize different aspects of clustering performance and do not rank clustering solutions in the same order. This underlies the lack of consensus between comparative studies regarding optimal clustering algorithms and undermines the translatability of results onto other non-benchmark datasets. Results We propose the Pareto fronts framework as an integrative evaluation protocol, wherein individual metrics are instead leveraged as complementary perspectives. Judged superior are algorithms that provide the best trade-off between the multiple metrics considered simultaneously. This yields a more comprehensive and complete view of clustering performance. Moreover, by broadly and systematically sampling algorithm parameter values using the Latin Hypercube sampling method, our evaluation protocol minimizes (un)fortunate parameter value selections as confounding factors. Furthermore, it reveals how meticulously each algorithm must be tuned in order to obtain good results, vital knowledge for users with novel data. We exemplify the protocol by conducting a comparative study between three clustering algorithms (ChronoClust, FlowSOM and Phenograph) using four common performance metrics applied across four cytometry benchmark datasets. To our knowledge, this is the first time Pareto fronts have been used to evaluate the performance of clustering algorithms in any application domain. Availability and implementation Implementation of our Pareto front methodology and all scripts and datasets to reproduce this article are available at https://github.com/ghar1821/ParetoBench. Supplementary information Supplementary data are available at Bioinformatics online.
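The core idea of a Pareto front, as used in the abstract above, can be sketched in a few lines: a clustering solution is on the front if no other solution is at least as good on every metric and strictly better on at least one. The sketch assumes that higher scores are better for all metrics; the solution names and scores are hypothetical and are not results from the study.

```python
from typing import Dict, List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` is no worse than `b` on every metric and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(
        x > y for x, y in zip(a, b)
    )


def pareto_front(scores: Dict[str, Sequence[float]]) -> List[str]:
    """Names of all solutions that no other solution dominates."""
    return sorted(
        name
        for name, score in scores.items()
        if not any(
            dominates(other, score)
            for other_name, other in scores.items()
            if other_name != name
        )
    )


# Hypothetical (metric_1, metric_2) scores for four clustering runs.
SCORES = {
    "run_a": (0.90, 0.60),
    "run_b": (0.70, 0.80),
    "run_c": (0.65, 0.75),  # dominated by run_b
    "run_d": (0.90, 0.55),  # dominated by run_a
}

front = pareto_front(SCORES)
```

Here `run_a` and `run_b` both remain on the front because each is better than the other on one metric, which is exactly the trade-off a single-metric ranking would hide.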


484 Bioturing browser: interactively explore public single cell sequencing data

Journal for ImmunoTherapy of Cancer ◽

10.1136/jitc-2020-sitc2020.0484 ◽

2020 ◽

Vol 8 (Suppl 3) ◽

pp. A520-A520

Author(s):

Son Pham ◽

Tri Le ◽

Tan Phan ◽

Minh Pham ◽

Huy Nguyen ◽

...

Keyword(s):

Single Cell ◽

Immune Cell ◽

Expression Profiles ◽

Meta Analysis ◽

Cell Types ◽

Sequencing Data ◽

Single Cell Sequencing ◽

Data Formats ◽

Cancer Types ◽

Cell Data

Background Single-cell sequencing technology has opened an unprecedented ability to interrogate cancer. It reveals significant insights into intratumoral heterogeneity, metastasis, and therapeutic resistance, which facilitates target discovery and validation in cancer treatment. With rapid advancements in throughput and strategies, a particular immuno-oncology study can produce multi-omics profiles for several thousands of individual cells. This overflow of single-cell data poses formidable challenges, including standardizing data formats across studies, performing reanalysis for individual datasets and meta-analysis. Methods N/A Results We present BioTuring Browser, an interactive platform for accessing and reanalyzing published single-cell omics data. The platform is currently hosting a curated database of more than 10 million cells from 247 projects, covering more than 120 immune cell types and subtypes, and 15 different cancer types. All data are processed and annotated with standardized labels of cell types, diseases, therapeutic responses, etc. to be instantly accessed and explored in a uniform visualization and analytics interface. Based on this massive curated database, BioTuring Browser supports searching similar expression profiles, querying a target across datasets and automatic cell type annotation. The platform supports single-cell RNA-seq, CITE-seq and TCR-seq data. BioTuring Browser is now available for download at www.bioturing.com. Conclusions N/A


Introducing mothur: Open-Source, Platform-Independent, Community-Supported Software for Describing and Comparing Microbial Communities

Applied and Environmental Microbiology ◽

10.1128/aem.01541-09 ◽

2009 ◽

Vol 75 (23) ◽

pp. 7537-7541 ◽

Cited By ~ 11597

Author(s):

Patrick D. Schloss ◽

Sarah L. Westcott ◽

Thomas Ryabin ◽

Justine R. Hall ◽

Martin Hartmann ◽

...

Keyword(s):

16S rRNA ◽

Software Package ◽

Sequence Data ◽

rRNA Gene ◽

Sequencing Data ◽

Laptop Computer ◽

Operational Taxonomic Units ◽

Β Diversity ◽

Single Piece

ABSTRACT mothur aims to be a comprehensive software package that allows users to use a single piece of software to analyze community sequence data. It builds upon previous tools to provide a flexible and powerful software package for analyzing sequencing data. As a case study, we used mothur to trim, screen, and align sequences; calculate distances; assign sequences to operational taxonomic units; and describe the α and β diversity of eight marine samples previously characterized by pyrosequencing of 16S rRNA gene fragments. This analysis of more than 222,000 sequences was completed in less than 2 h with a laptop computer.


Global sequence characterization of rice centromeric satellite based on oligomer frequency analysis in large-scale sequencing data

Bioinformatics ◽

10.1093/bioinformatics/btq343 ◽

2010 ◽

Vol 26 (17) ◽

pp. 2101-2108 ◽

Cited By ~ 27

Author(s):

Jiří Macas ◽

Pavel Neumann ◽

Petr Novák ◽

Jiming Jiang

Keyword(s):

Large Scale ◽

Rice Genome ◽

Supplementary Information ◽

Sequencing Data ◽

Satellite Repeat ◽

Frequency Spectra ◽

Consensus Sequences ◽

Chip Sequencing ◽

Conserved Sequence ◽

Centromeric Satellite

Abstract Motivation: Satellite DNA makes up a significant portion of many eukaryotic genomes, yet it is relatively poorly characterized even in extensively sequenced species. This is, in part, due to methodological limitations of traditional methods of satellite repeat analysis, which are based on multiple alignments of monomer sequences. Therefore, we employed an alternative, alignment-free approach utilizing k-mer frequency statistics, which is in principle more suitable for analyzing large sets of satellite repeat data, including sequence reads from next generation sequencing technologies. Results: k-mer frequency spectra were determined for two sets of rice centromeric satellite CentO sequences, including 454 reads from ChIP-sequencing of CENH3-bound DNA (7.6 Mb) and the whole genome Sanger sequencing reads (5.8 Mb). k-mer frequencies were used to identify the most conserved sequence regions and to reconstruct consensus sequences of complete monomers. Reconstructed consensus sequences as well as the assessment of overall divergence of k-mer spectra revealed high similarity of the two datasets, suggesting that CentO sequences associated with functional centromeres (CENH3-bound) do not significantly differ from the total population of CentO, which includes both centromeric and pericentromeric repeat arrays. On the other hand, considerable differences were revealed when these methods were used for comparison of CentO populations between individual chromosomes of the rice genome assembly, demonstrating preferential sequence homogenization of the clusters within the same chromosome. k-mer frequencies were also successfully used to identify and characterize smRNAs derived from CentO repeats. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
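The alignment-free approach above rests on counting k-mers, that is, all substrings of length k, across a set of reads. The following minimal sketch shows how such a frequency spectrum can be computed; it is a generic illustration rather than the authors' pipeline, and the function names and the toy reads are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable

DNA_ALPHABET = set("ACGT")


def kmer_counts(reads: Iterable[str], k: int) -> Counter:
    """Count every k-mer in the reads, skipping k-mers with ambiguous bases."""
    counts: Counter = Counter()
    for read in reads:
        read = read.upper()
        for start in range(len(read) - k + 1):
            kmer = read[start:start + k]
            if set(kmer) <= DNA_ALPHABET:
                counts[kmer] += 1
    return counts


def kmer_frequencies(counts: Counter) -> Dict[str, float]:
    """Convert raw k-mer counts into relative frequencies."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: count / total for kmer, count in counts.items()}


# Toy reads, for illustration only.
READS = ["ACGTAC", "CGTACG"]

counts = kmer_counts(READS, k=3)
frequencies = kmer_frequencies(counts)
```

Spectra computed this way for two read sets can then be compared k-mer by k-mer, which is the kind of comparison the abstract describes between the ChIP-derived and whole-genome datasets.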
