Random forest based similarity learning for single cell RNA sequencing data

Mapping Intimacies ◽

10.1101/258699 ◽

2018 ◽

Author(s):

Maziyar Baran Pouyan ◽

Dennis Kostka

Keyword(s):

Data Analysis ◽

Random Forest ◽

Single Cells ◽

R Package ◽

Similarity Learning ◽

Sequencing Data ◽

Genome Wide ◽

Step Procedure ◽

Exploratory Data ◽

Cell Cell

Abstract. Motivation: Genome-wide transcriptome sequencing applied to single cells (scRNA-seq) is rapidly becoming an assay of choice across many fields of biological and biomedical research. Scientific objectives often revolve around discovery or characterization of types or sub-types of cells, and therefore obtaining accurate cell–cell similarities from scRNA-seq data is a critical step in many studies. While rapid advances are being made in the development of tools for scRNA-seq data analysis, few approaches exist that explicitly address this task. Furthermore, the abundance and type of noise present in scRNA-seq datasets suggest that application of generic methods, or of methods developed for bulk RNA-seq data, is likely suboptimal. Results: Here we present RAFSIL, a random forest based approach to learn cell–cell similarities from scRNA-seq data. RAFSIL implements a two-step procedure, where feature construction geared towards scRNA-seq data is followed by similarity learning. It is designed to be adaptable and expandable, and RAFSIL similarities can be used for typical exploratory data analysis tasks like dimension reduction, visualization, and clustering. We show that our approach compares favorably with current methods across a diverse collection of datasets, and that it can be used to detect and highlight unwanted technical variation in scRNA-seq datasets in situations where other methods fail. Overall, RAFSIL implements a flexible approach yielding a useful tool that improves the analysis of scRNA-seq data. Availability and Implementation: The RAFSIL R package is available online at www.kostkalab.net/software.html
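The similarity-learning idea in the abstract above can be illustrated with a small sketch. It is not the RAFSIL implementation: it assumes a forest of completely random trees instead of trained random forests, skips the feature-construction step entirely, and runs on made-up toy data. The names `grow_tree`, `forest_similarity`, and `cells` are hypothetical. Two cells are scored as similar in proportion to the number of trees in which they end up in the same leaf.

```python
import random


def grow_tree(rows, idx, rng, depth=0, max_depth=3):
    """Split cell indices on a random feature at a random threshold.

    Returns the list of leaves, each leaf being a list of cell indices.
    """
    if depth >= max_depth or len(idx) < 2:
        return [idx]
    feature = rng.randrange(len(rows[0]))
    values = [rows[i][feature] for i in idx]
    lo, hi = min(values), max(values)
    if lo == hi:
        return [idx]  # nothing to split on for this feature
    threshold = rng.uniform(lo, hi)
    left = [i for i in idx if rows[i][feature] <= threshold]
    right = [i for i in idx if rows[i][feature] > threshold]
    if not left or not right:
        return [idx]
    return (grow_tree(rows, left, rng, depth + 1, max_depth)
            + grow_tree(rows, right, rng, depth + 1, max_depth))


def forest_similarity(rows, n_trees=200, max_depth=3, seed=0):
    """Cell-cell similarity: fraction of trees in which two cells share a leaf."""
    rng = random.Random(seed)
    n = len(rows)
    shared = [[0] * n for _ in range(n)]
    for _ in range(n_trees):
        for leaf in grow_tree(rows, list(range(n)), rng, 0, max_depth):
            for i in leaf:
                for j in leaf:
                    shared[i][j] += 1
    return [[count / n_trees for count in row] for row in shared]


# Toy data (assumption): 6 cells near 0 and 6 cells near 10, 5 features each.
_data_rng = random.Random(1)
cells = ([[_data_rng.random() for _ in range(5)] for _ in range(6)]
         + [[10 + _data_rng.random() for _ in range(5)] for _ in range(6)])
sim = forest_similarity(cells)
```

On this toy input, cells from the same group should share leaves more often than cells from different groups. A matrix like `sim` is the kind of object that could then be passed to clustering or dimension reduction.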

Computational Approaches in Next-Generation Sequencing Data Analysis for Genome-Wide DNA Methylation Studies

Computational Methods for Next Generation Sequencing Data Analysis ◽

10.1002/9781119272182.ch9 ◽

2016 ◽

pp. 197-226

Author(s):

Jeong-Hyeon Choi ◽

Huidong Shi

Keyword(s):

Dna Methylation ◽

Data Analysis ◽

Next Generation Sequencing ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Computational Approaches ◽

Genome Wide ◽

Generation Sequencing ◽

Sequencing Data Analysis

Exploring cell-specific miRNA regulation with single-cell miRNA-mRNA co-sequencing data

10.1101/2020.10.14.340299 ◽

2020 ◽

Author(s):

Junpeng Zhang ◽

Lin Liu ◽

Taosheng Xu ◽

Wu Zhang ◽

Chunwen Zhao ◽

...

Keyword(s):

Single Cell ◽

Regulatory Networks ◽

Single Cells ◽

Small Scale ◽

Mirna Regulation ◽

Sequencing Data ◽

Resolution Level ◽

Novel Strategy ◽

Cell Cell

Abstract. Background: Existing computational methods for studying miRNA regulation are mostly based on bulk miRNA and mRNA expression data. However, bulk data only allows the analysis of miRNA regulation regarding a group of cells, rather than the miRNA regulation unique to individual cells. Recent advances in single-cell miRNA-mRNA co-sequencing technology have opened a way for investigating miRNA regulation at the single-cell level. However, as single-cell miRNA-mRNA co-sequencing data is currently just emerging and only available at small scale, there is a strong need for novel methods to exploit existing single-cell data for the study of cell-specific miRNA regulation. Results: In this work, we propose a new method, CSmiR (Cell-Specific miRNA regulation), to use single-cell miRNA-mRNA co-sequencing data to identify miRNA regulatory networks at the resolution of individual cells. We apply CSmiR to the miRNA-mRNA co-sequencing data in 19 K562 single-cells to identify cell-specific miRNA-mRNA regulatory networks to understand miRNA regulation in each K562 single-cell. By analyzing the obtained cell-specific miRNA-mRNA regulatory networks, we observe that the miRNA regulation in each K562 single-cell is unique. Moreover, we conduct detailed analysis on the cell-specific miRNA regulation associated with the miR-17/92 family as a case study. Finally, through exploring the cell-cell similarity matrix characterized by cell-specific miRNA regulation, CSmiR provides a novel strategy for clustering single-cells to help understand cell-cell crosstalk. Conclusions: To the best of our knowledge, CSmiR is the first method to explore miRNA regulation at a single-cell resolution level, and we believe that it can be a useful method to enhance the understanding of cell-specific miRNA regulation.

SLOW5: a new file format enables massive acceleration of nanopore sequencing data analysis

10.1101/2021.06.29.450255 ◽

2021 ◽

Author(s):

Hasindu Gamaarachchi ◽

Hiruna Samarakoon ◽

Sasha P. Jenner ◽

James M Ferguson ◽

Timothy G. Amos ◽

...

Keyword(s):

Data Analysis ◽

High Performance ◽

Nanopore Sequencing ◽

File Format ◽

Sequencing Data ◽

Genome Wide ◽

Wide Range ◽

File Structure ◽

Order Of Magnitude ◽

Methylation Profiling

Nanopore sequencing is an emerging genomic technology with great potential. However, the storage and analysis of nanopore sequencing data have become major bottlenecks preventing more widespread adoption in research and clinical genomics. Here, we elucidate an inherent limitation in the file format used to store raw nanopore data, known as FAST5, that prevents efficient analysis on high-performance computing (HPC) systems. To overcome this we have developed SLOW5, an alternative file format that permits efficient parallelisation and, thereby, acceleration of nanopore data analysis. For example, we show that using SLOW5 format, instead of FAST5, reduces the time and cost of genome-wide DNA methylation profiling by an order of magnitude on common HPC systems, and delivers consistent improvements on a wide range of different architectures. With a simple, accessible file structure and a ~25% reduction in size compared to FAST5, SLOW5 format will deliver substantial benefits to all areas of the nanopore community.

rANOMALY: AmplicoN wOrkflow for Microbial community AnaLYsis

F1000Research ◽

10.12688/f1000research.27268.1 ◽

2021 ◽

Vol 10 ◽

pp. 7

Author(s):

Sebastien Theil ◽

Etienne Rifa

Keyword(s):

Data Analysis ◽

Microbial Community ◽

Statistical Tests ◽

Marker Gene ◽

R Package ◽

Microbial Community Analysis ◽

Differential Analysis ◽

Sequencing Data ◽

Statistical Validation ◽

Bioinformatic Tools

Bioinformatic tools for marker gene sequencing data analysis are continuously and rapidly evolving, so integrating the most recent techniques and tools is challenging. We present an R package for data analysis of 16S and ITS amplicon-based sequencing. This workflow is based on several R functions and performs automatic treatments from fastq sequence files to diversity and differential analysis with statistical validation. The main purpose of this package is to automate bioinformatic analysis, ensure reproducibility between projects, and be flexible enough to quickly integrate new bioinformatic tools or statistical methods. rANOMALY is an easy-to-install and customizable R package that uses the amplicon sequence variant (ASV) level for microbial community characterization. It integrates all assets of the latest bioinformatics methods, such as better sequence tracking, decontamination from control samples, use of multiple reference databases for taxonomic annotation, all main ecological analyses, for which we propose advanced statistical tests, and a cross-validated differential analysis by four different methods. Our package produces ready-to-publish figures, and all of its outputs are made to be integrated in Rmarkdown code to produce automated reports.

Random forest based similarity learning for single cell RNA sequencing data

Bioinformatics ◽

10.1093/bioinformatics/bty260 ◽

2018 ◽

Vol 34 (13) ◽

pp. i79-i88 ◽

Cited By ~ 12

Author(s):

Maziyar Baran Pouyan ◽

Dennis Kostka

Keyword(s):

Random Forest ◽

Single Cell ◽

Rna Sequencing ◽

Similarity Learning ◽

Sequencing Data ◽

Single Cell Rna Sequencing

scds: computational annotation of doublets in single-cell RNA sequencing data

Bioinformatics ◽

10.1093/bioinformatics/btz698 ◽

2019 ◽

Cited By ~ 3

Author(s):

Abha S Bais ◽

Dennis Kostka

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Binary Classification ◽

Single Cells ◽

Computational Cost ◽

Original Data ◽

R Package ◽

Supplementary Information ◽

Sequencing Data ◽

Single Cell Rna Sequencing

Abstract. Motivation: Single-cell RNA sequencing (scRNA-seq) technologies enable the study of transcriptional heterogeneity at the resolution of individual cells and have an increasing impact on biomedical research. However, it is known that these methods sometimes wrongly consider two or more cells as single cells, and that a number of so-called doublets is present in the output of such experiments. Treating doublets as single cells in downstream analyses can severely bias a study’s conclusions, and therefore computational strategies for the identification of doublets are needed. Results: With scds, we propose two new approaches for in silico doublet identification: co-expression based doublet scoring (cxds) and binary classification based doublet scoring (bcds). The co-expression based approach, cxds, utilizes binarized (absence/presence) gene expression data and, employing a binomial model for the co-expression of pairs of genes, yields interpretable doublet annotations. bcds, on the other hand, uses a binary classification approach to discriminate artificial doublets from original data. We apply our methods and existing computational doublet identification approaches to four datasets with experimental doublet annotations and find that our methods perform at least as well as the state of the art, at comparably little computational cost. We observe appreciable differences between methods and across datasets and that no approach dominates all others. In summary, scds presents a scalable, competitive approach that allows for doublet annotation of datasets with thousands of cells in a matter of seconds. Availability and implementation: scds is implemented as a Bioconductor R package (doi: 10.18129/B9.bioc.scds). Supplementary information: Supplementary data are available at Bioinformatics online.
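The co-expression idea described for cxds can be made concrete with a rough sketch. This is an illustration under stated assumptions, not the scoring formula used by scds: it assumes that a gene pair is weighted by how surprisingly rare its co-expression is under a binomial model with independent genes, and that a cell's score is the sum of the weights of the pairs it co-expresses. The function names and the toy count matrix are hypothetical.

```python
import math
from itertools import combinations


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k + 1))


def coexpression_doublet_scores(counts):
    """Score each cell by the rarely co-expressed gene pairs it co-expresses.

    counts: list of cells, each a list of per-gene read counts.
    """
    n_cells, n_genes = len(counts), len(counts[0])
    # Binarize: a gene is either present (count > 0) or absent in a cell.
    present = [[c > 0 for c in cell] for cell in counts]
    freq = [sum(cell[g] for cell in present) / n_cells for g in range(n_genes)]
    scores = [0.0] * n_cells
    for g, h in combinations(range(n_genes), 2):
        both = [cell[g] and cell[h] for cell in present]
        observed = sum(both)
        expected_p = freq[g] * freq[h]  # assumption: genes are independent
        tail = min(1.0, max(binom_cdf(observed, n_cells, expected_p), 1e-300))
        pair_weight = -math.log(tail)  # large when co-expression is rarer than expected
        for c in range(n_cells):
            if both[c]:
                scores[c] += pair_weight
    return scores


# Toy data (assumption): genes 0-1 mark type A, genes 2-3 mark type B.
toy_counts = [[3, 2, 0, 0]] * 5 + [[0, 0, 4, 1]] * 5 + [[2, 1, 3, 1]]
toy_scores = coexpression_doublet_scores(toy_counts)
```

In the toy matrix, the last cell expresses markers of both types, so it is the only cell that co-expresses the cross-type pairs and receives the highest score.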

Pilot study of changes in the level of piRNA in plasma and serum in women at different stages of physiological pregnancy

The Siberian Journal of Clinical and Experimental Medicine ◽

10.29001/2073-8552-2021-36-4-62-69 ◽

2022 ◽

Vol 36 (4) ◽

pp. 62-69

Author(s):

A. S. Glotov ◽

P. Yu. Kozyulina ◽

E. S. Vashukova ◽

R. A. Illarionov ◽

N. O. Yurkina ◽

...

Keyword(s):

Data Analysis ◽

Pregnant Women ◽

Small Rnas ◽

Web Application ◽

R Package ◽

Sequencing Data ◽

Third Trimester ◽

Singleton Pregnancy ◽

Pilot Work ◽

Analysis Center

Aim. To study changes in the level of piRNA in plasma and serum of pregnant women at different stages of gestation. Material and Methods. A total of 42 samples of plasma and blood serum were obtained from seven women with physiological singleton pregnancy without obstetric and gynecological pathology. The study was carried out at three time points corresponding to 8–13, 18–25, and 30–35 weeks of pregnancy, respectively. To assess the spectrum and levels of piRNA by the NGS method, whole genome sequencing of small RNAs was carried out. Sequencing data analysis was performed using the GeneGlobe Data Analysis Center web application. Differential expression was assessed using the DESeq2 R package. Results and Discussion. The piRNA contents among all small RNAs were 2.29%, 2.61%, and 4.16% in plasma and 7.29%, 7.02%, and 10.82% in serum during the first, second, and third trimesters, respectively. The contents of the following piRNAs increased in blood plasma from the first to the third trimester: piR 000765, piR 020326, piR 019825, piR 020497, piR 015026, piR 001312, and piR 017716. The study showed that the levels of piR 000765, piR 020326, piR 019825, piR 015026, piR 020497, piR 001312, piR 017716, and piR 004153 were significantly higher in serum compared with the corresponding values in plasma, whereas the content of only one molecule, piR 018849, was higher in plasma. Conclusion. This pilot work created a basis for understanding the processes of piRNA expression in plasma and serum of pregnant women and can become the foundation for the search for biomarkers of various complications in pregnancy.

Exploring cell-specific miRNA regulation with single-cell miRNA-mRNA co-sequencing data

BMC Bioinformatics ◽

10.1186/s12859-021-04498-6 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Junpeng Zhang ◽

Lin Liu ◽

Taosheng Xu ◽

Wu Zhang ◽

Chunwen Zhao ◽

...

Keyword(s):

Single Cell ◽

Regulatory Networks ◽

Single Cells ◽

Small Scale ◽

Mirna Regulation ◽

Sequencing Data ◽

Comparison Results ◽

Resolution Level ◽

Novel Strategy ◽

Cell Cell

Abstract. Background: Existing computational methods for studying miRNA regulation are mostly based on bulk miRNA and mRNA expression data. However, bulk data only allows the analysis of miRNA regulation regarding a group of cells, rather than the miRNA regulation unique to individual cells. Recent advances in single-cell miRNA-mRNA co-sequencing technology have opened a way for investigating miRNA regulation at the single-cell level. However, as single-cell miRNA-mRNA co-sequencing data is currently just emerging and only available at small scale, there is a strong need for novel methods to exploit existing single-cell data for the study of cell-specific miRNA regulation. Results: In this work, we propose a new method, CSmiR (Cell-Specific miRNA regulation), to combine single-cell miRNA-mRNA co-sequencing data and putative miRNA-mRNA binding information to identify miRNA regulatory networks at the resolution of individual cells. We apply CSmiR to the miRNA-mRNA co-sequencing data in 19 K562 single-cells to identify cell-specific miRNA-mRNA regulatory networks for understanding miRNA regulation in each K562 single-cell. By analyzing the obtained cell-specific miRNA-mRNA regulatory networks, we observe that the miRNA regulation in each K562 single-cell is unique. Moreover, we conduct detailed analysis on the cell-specific miRNA regulation associated with the miR-17/92 family as a case study. The comparison results indicate that CSmiR is effective in predicting cell-specific miRNA targets. Finally, through exploring the cell–cell similarity matrix characterized by cell-specific miRNA regulation, CSmiR provides a novel strategy for clustering single-cells and helps to understand cell–cell crosstalk. Conclusions: To the best of our knowledge, CSmiR is the first method to explore miRNA regulation at a single-cell resolution level, and we believe that it can be a useful method to enhance the understanding of cell-specific miRNA regulation.

RETA: An R package for whole exome and targeted region sequencing data analysis

10.1101/121384 ◽

2017 ◽

Author(s):

Mengbiao Guo ◽

Jing Yang ◽

Yu lung Lau ◽

Wanling Yang

Keyword(s):

Data Analysis ◽

R Package ◽

Targeted Sequencing ◽

Sequencing Data ◽

Comprehensive Understanding ◽

Mendelian Diseases ◽

Whole Exome ◽

One Stop ◽

Sequencing Data Analysis ◽

Advanced Visualization

Abstract. Whole exome and targeted sequencing have been playing a major role in diagnoses of Mendelian diseases, but analysis of these data involves using many complicated tools, and comprehensive understanding of the analysis results is difficult. Here, we report RETA, an R package that provides a one-stop analysis of these data and a comprehensive, interactive and easy-to-understand report with many advanced visualization features. It helps clinicians and scientists alike to better analyze and interpret this type of sequencing data for disease diagnoses. Availability and implementation: https://github.com/reta-s/reta/

DNAModAnnot: a R toolbox for DNA modification filtering and annotation

Bioinformatics ◽

10.1093/bioinformatics/btab032 ◽

2021 ◽

Author(s):

Alexis Hardy ◽

Mélody Matelot ◽

Amandine Touzeau ◽

Christophe Klopp ◽

Céline Lopez-Roques ◽

...

Keyword(s):

Global Analysis ◽

R Package ◽

Supplementary Information ◽

Dna Modification ◽

Paramecium Tetraurelia ◽

Sequencing Data ◽

Genome Wide ◽

A Genome ◽

Dna Modifications ◽

Long Read

Abstract. Motivation: Long-read sequencing technologies can be employed to detect and map DNA modifications at nucleotide resolution on a genome-wide scale. However, published software packages neglect the integration of genomic annotation and comprehensive filtering when analyzing patterns of modified bases detected using Pacific Biosciences (PacBio) or Oxford Nanopore Technologies (ONT) data. Here, we present DNAModAnnot, an R package designed for the global analysis of DNA modification patterns using adapted filtering and visualization tools. Results: We tested our package using PacBio sequencing data to analyze patterns of 6-methyladenine (6mA) in the ciliate Paramecium tetraurelia, in which high 6mA amounts were previously reported. We found the genome-wide distribution of 6mA in Paramecium tetraurelia to be similar to that of other ciliates. We also performed 5-methylcytosine (5mC) analysis in human lymphoblastoid cells using ONT data and confirmed previously known patterns of 5mC. DNAModAnnot provides a toolbox for the genome-wide analysis of different DNA modifications using PacBio and ONT long-read sequencing data. Availability: DNAModAnnot is distributed as an R package available via GitHub (https://github.com/AlexisHardy/DNAModAnnot). Supplementary information: Supplementary data are available at Bioinformatics online.
