Fast analysis of scATAC-seq data using a predefined set of genomic regions

F1000Research ◽

10.12688/f1000research.22731.2 ◽

2020 ◽

Vol 9 ◽

pp. 199 ◽

Cited By ~ 2

Author(s):

Valentina Giansanti ◽

Ming Tang ◽

Davide Cittaro

Keyword(s):

De Novo ◽

Marker Genes ◽

Fast Analysis ◽

Public Data ◽

Cell Groups ◽

Hypersensitive Sites ◽

Reliable Quantification ◽

Genomic Regions ◽

Computational Resources ◽

Cell Data

Background: Analysis of scATAC-seq data has recently been scaled to thousands of cells. While processing of other types of single-cell data has been boosted by the implementation of alignment-free techniques, the pipelines available to process scATAC-seq data still require large computational resources. We propose here an approach based on pseudoalignment, which reduces execution times and hardware needs at little cost in precision. Methods: Public data for 10k PBMC were downloaded from the 10x Genomics website. Reads were aligned to various references derived from DNase I Hypersensitive Sites (DHS) using kallisto and quantified with bustools. We compared our results with the publicly available ones derived by cellranger-atac. We subsequently tested our approach on scATAC-seq data for the K562 cell line. Results: We found that kallisto does not introduce biases in the quantification of known peaks; the cell groups identified are consistent with those identified by the standard method. We also found that cell identification is robust when analysis is performed using a DHS-derived reference in place of de novo identification of ATAC peaks. Lastly, we found that our approach is suitable for reliable quantification of gene activity based on scATAC-seq signal, and thus allows for efficient labelling of cell groups based on marker genes. Conclusions: Analysis of scATAC-seq data by means of kallisto produces results in line with standard pipelines while being considerably faster; using a set of known DHS sites as reference does not affect the ability to characterize the cell populations.
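The abstract does not say how gene activity is derived from the per-region counts, so the following is only a minimal sketch of one plausible scheme: summing, for each cell, the counts of all DHS regions that overlap a window around a gene. The names `Region`, `overlaps` and `gene_activity`, the half-open coordinates and the toy data are all assumptions made for illustration, not the authors' method.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """Genomic interval, assumed half-open: [start, end)."""
    chrom: str
    start: int
    end: int


def overlaps(a: Region, b: Region) -> bool:
    """True if the two intervals share at least one base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def gene_activity(
    counts: dict[str, dict[str, int]],
    regions: dict[str, Region],
    gene_windows: dict[str, Region],
) -> dict[str, dict[str, int]]:
    """Sum per-cell region counts over the regions overlapping each gene window.

    counts:       cell barcode -> {region id -> count} (hypothetical layout)
    regions:      region id -> coordinates of a reference DHS region
    gene_windows: gene name -> window assumed to represent the gene
    """
    # Work out once which regions contribute to each gene.
    gene_to_regions = {
        gene: [rid for rid, reg in regions.items() if overlaps(reg, window)]
        for gene, window in gene_windows.items()
    }
    # Aggregate counts cell by cell; regions with no entry count as zero.
    return {
        cell: {
            gene: sum(cell_counts.get(rid, 0) for rid in rids)
            for gene, rids in gene_to_regions.items()
        }
        for cell, cell_counts in counts.items()
    }


# Toy example with made-up coordinates and counts.
regions = {
    "dhs1": Region("chr1", 100, 200),
    "dhs2": Region("chr1", 150, 400),
    "dhs3": Region("chr2", 100, 200),
}
gene_windows = {
    "GENE_A": Region("chr1", 180, 300),
    "GENE_B": Region("chr2", 0, 50),
}
counts = {
    "cell1": {"dhs1": 2, "dhs2": 3, "dhs3": 5},
    "cell2": {"dhs2": 1},
}
activity = gene_activity(counts, regions, gene_windows)
```

With a fixed reference of known regions, the region-to-gene mapping can be computed once and reused across datasets, which is one practical appeal of a predefined region set.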

Mapping single-cell data to reference atlases by transfer learning

Nature Biotechnology ◽

10.1038/s41587-021-01001-7 ◽

2021 ◽

Cited By ~ 2

Author(s):

Mohammad Lotfollahi ◽

Mohsen Naghipourfar ◽

Malte D. Luecken ◽

Matin Khajavi ◽

Maren Büttner ◽

...

Keyword(s):

Single Cell ◽

Transfer Learning ◽

Learning Strategy ◽

De Novo ◽

Specific Cell ◽

Batch Effects ◽

Raw Data ◽

Computational Resources ◽

Cell Data ◽

Biological State

Abstract: Large single-cell atlases are now routinely generated to serve as references for analysis of smaller-scale studies. Yet learning from reference data is complicated by batch effects between datasets, limited availability of computational resources and sharing restrictions on raw data. Here we introduce a deep learning strategy for mapping query datasets on top of a reference called single-cell architectural surgery (scArches). scArches uses transfer learning and parameter optimization to enable efficient, decentralized, iterative reference building and contextualization of new datasets with existing references without sharing raw data. Using examples from mouse brain, pancreas, immune and whole-organism atlases, we show that scArches preserves biological state information while removing batch effects, despite using four orders of magnitude fewer parameters than de novo integration. scArches generalizes to multimodal reference mapping, allowing imputation of missing modalities. Finally, scArches retains coronavirus disease 2019 (COVID-19) disease variation when mapping to a healthy reference, enabling the discovery of disease-specific cell states. scArches will facilitate collaborative projects by enabling iterative construction, updating, sharing and efficient use of reference atlases.

Publisher Correction: Germline de novo mutation clusters arise during oocyte aging in genomic regions with high double-strand-break incidence

Nature Genetics ◽

10.1038/s41588-021-00905-z ◽

2021 ◽

Author(s):

Jakob M. Goldmann ◽

Vladimir B. Seplyarskiy ◽

Wendy S. W. Wong ◽

Thierry Vilboux ◽

Pieter B. Neerincx ◽

...

Keyword(s):

De Novo ◽

Strand Break ◽

Double Strand Break ◽

De Novo Mutation ◽

Oocyte Aging ◽

Genomic Regions

Targeted sequencing and integrative analysis to prioritize candidate genes in neurodevelopmental disorders

Molecular Neurobiology ◽

10.1007/s12035-021-02377-y ◽

2021 ◽

Author(s):

Yi Zhang ◽

Tao Wang ◽

Yan Wang ◽

Kun Xia ◽

Jinchen Li ◽

...

Keyword(s):

Candidate Genes ◽

Neurodevelopmental Disorders ◽

De Novo ◽

Genetic Data ◽

Chinese Patients ◽

The Novel ◽

Mutational Spectrum ◽

Functional Variants ◽

Public Data ◽

Chromosomal Genes

Abstract: Neurodevelopmental disorders (NDDs) are a group of diseases characterized by high heterogeneity and frequently co-occurring symptoms. The mutational spectrum in patients with NDDs is largely incomplete. Here, we sequenced 547 genes from 1102 patients with NDDs and validated 1271 potential functional variants, including 108 de novo variants (DNVs) in 78 autosomal genes and seven inherited hemizygous variants in six X chromosomal genes. Notably, 36 of these 78 genes are the first to be reported in Chinese patients with NDDs. By integrating our genetic data with public data, we prioritized 212 NDD candidate genes with FDR < 0.1, including 17 novel genes. The novel candidate genes interacted or were co-expressed with known candidate genes, forming a functional network involved in known pathways. We highlighted MSL2, which carried two de novo protein-truncating variants (p.L192Vfs*3 and p.S486Ifs*11) and was frequently connected with known candidate genes. This study provides the mutational spectrum of NDDs in China and prioritizes 212 NDD candidate genes for further functional validation and genetic counseling.

Long-Term Waterlogging as Factor Contributing to Hypoxia Stress Tolerance Enhancement in Cucumber: Comparative Transcriptome Analysis of Waterlogging Sensitive and Tolerant Accessions

Genes ◽

10.3390/genes12020189 ◽

2021 ◽

Vol 12 (2) ◽

pp. 189

Author(s):

Kinga Kęska ◽

Michał Wojciech Szcześniak ◽

Izabela Makałowska ◽

Małgorzata Czernicka

Keyword(s):

Stress Tolerance ◽

De Novo ◽

Average Length ◽

Marker Genes ◽

Long Term Memory ◽

Low Oxygen ◽

Waterlogging Tolerance ◽

Pre Treatment ◽

Waterlogging Treatment

Waterlogging (WL), excess water in the soil, is a phenomenon that often occurs during plant cultivation and causes low oxygen levels (hypoxia) in the soil. The aim of this study was to identify candidate genes involved in long-term waterlogging tolerance in cucumber using RNA sequencing. Here, we also determined how waterlogging pre-treatment (priming) influenced long-term memory in WL-tolerant (WL-T) and WL-sensitive (WL-S) accessions, i.e., DH2 and DH4, respectively. This work uncovered various differentially expressed genes (DEGs) activated in the long-term recovery in both accessions. De novo assembly generated 36,712 transcripts with an average length of 2236 bp. The results revealed that long-term waterlogging had divergent impacts on gene expression in WL-T DH2 and WL-S DH4 cucumber accessions: after 7 days of waterlogging, more DEGs in comparison to control conditions were identified in WL-S DH4 (8927) than in WL-T DH2 (5957). Additionally, 11,619 and 5007 DEGs were identified after a second waterlogging treatment in the WL-S and WL-T accessions, respectively. We identified genes associated with WL in cucumber that were especially related to enhanced glycolysis, adventitious root development, and amino acid metabolism. qRT-PCR assays for hypoxia marker genes, i.e., alcohol dehydrogenase (adh), 1-aminocyclopropane-1-carboxylate oxidase (aco) and long chain acyl-CoA synthetase 6 (lacs6), confirmed differences in response to waterlogging stress between sensitive and tolerant cucumbers and the effectiveness of priming in enhancing stress tolerance.

Population genomics of parallel hybrid zones in the mimetic butterflies, H. melpomene and H. erato

10.1101/000208 ◽

2013 ◽

Author(s):

Nicola Nadeau ◽

Mayte Ruiz ◽

Patricio Salazar ◽

Brian Counterman ◽

Jose Alejandro Medina ◽

...

Keyword(s):

Population Genomics ◽

De Novo ◽

Reference Sequence ◽

Colour Pattern ◽

Adaptive Divergence ◽

Population Divergence ◽

Hybrid Zones ◽

Data Alignment ◽

Parallel Hybrid ◽

Genomic Regions

Hybrid zones can be valuable tools for studying evolution and identifying genomic regions responsible for adaptive divergence and underlying phenotypic variation. Hybrid zones between subspecies of Heliconius butterflies can be very narrow and are maintained by strong selection acting on colour pattern. The co-mimetic species H. erato and H. melpomene have parallel hybrid zones where both species undergo a change from one colour pattern form to another. We use restriction-associated DNA sequencing to obtain several thousand genome-wide sequence markers and use these to analyse patterns of population divergence across two pairs of parallel hybrid zones in Peru and Ecuador. We compare two approaches for analysis of this type of data: alignment to a reference genome and de novo assembly. We find that alignment gives the best results for species both closely (H. melpomene) and distantly (H. erato, ~15% divergent) related to the reference sequence. Our results confirm that the loci controlling colour pattern account for the majority of divergent regions across the genome, but we also detect other divergent regions apparently unlinked to colour pattern differences. We also use association mapping to identify previously unmapped colour pattern loci, in particular the Ro locus. Finally, we identify within our sample a new cryptic population of H. timareta in Ecuador, which occurs at relatively low altitude and is mimetic with H. melpomene malleti.

A De Novo Divide-and-Merge Paradigm for Acoustic Model Optimization in Automatic Speech Recognition

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/513 ◽

2020 ◽

Author(s):

Conghui Tan ◽

Di Jiang ◽

Jinhua Peng ◽

Xueyang Wu ◽

Qian Xu ◽

...

Keyword(s):

Speech Recognition ◽

Automatic Speech Recognition ◽

De Novo ◽

Superior Performance ◽

Acoustic Model ◽

Acoustic Models ◽

Public Data ◽

Speech Data ◽

Low Efficiency ◽

Novel Algorithms

Due to the rising awareness of privacy protection and the voluminous scale of speech data, it is becoming infeasible for Automatic Speech Recognition (ASR) system developers to train the acoustic model with complete data as before. In this paper, we propose a novel Divide-and-Merge paradigm to solve salient problems plaguing the ASR field. In the Divide phase, multiple acoustic models are trained based upon different subsets of the complete speech data, while in the Merge phase two novel algorithms are utilized to generate a high-quality acoustic model based upon those trained on data subsets. We first propose the Genetic Merge Algorithm (GMA), which is a highly specialized algorithm for optimizing acoustic models but suffers from low efficiency. We further propose the SGD-Based Optimizational Merge Algorithm (SOMA), which effectively alleviates the efficiency bottleneck of GMA and maintains superior performance. Extensive experiments on public data show that the proposed methods can significantly outperform the state-of-the-art.

A B73 x Palomero Toluqueño mapping population reveals local adaptation in Mexican highland maize

G3 Genes|Genome|Genetics ◽

10.1093/g3journal/jkab447 ◽

2022 ◽

Author(s):

Sergio Perez-Limón ◽

Meng Li ◽

G Carolina Cintora-Martinez ◽

M Rocio Aguilar-Rangel ◽

M Nancy Salazar-Vidal ◽

...

Keyword(s):

Local Adaptation ◽

De Novo ◽

Local Environment ◽

Cultural Value ◽

De Novo Mutation ◽

Quantitative Trait Mapping ◽

Farmer Selection ◽

Maize Varieties ◽

Highland Maize ◽

Genomic Regions

Abstract: Generations of farmer selection in the central Mexican highlands have produced unique maize varieties adapted to the challenges of the local environment. In addition to possessing great agronomic and cultural value, Mexican highland maize represents a good system for the study of local adaptation and acquisition of adaptive phenotypes under cultivation. In this study we characterize a recombinant inbred line population derived from the B73 reference line and the Mexican highland maize variety Palomero Toluqueño. B73 and Palomero Toluqueño showed classic rank-changing differences in performance between lowland and highland field sites, indicative of local adaptation. Quantitative trait mapping identified genomic regions linked to effects on yield components that were conditionally expressed depending on the environment. For the principal genomic regions associated with ear weight and total kernel number, the Palomero Toluqueño allele conferred an advantage specifically in the highland site, consistent with local adaptation. We identified Palomero Toluqueño alleles associated with expression of characteristic highland traits, including reduced tassel branching, increased sheath pigmentation and the presence of sheath macrohairs. The oligogenic architecture of these three morphological traits supports their role in adaptation, suggesting they have arisen from consistent directional selection acting at distinct points across the genome. We discuss these results in the context of the origin of phenotypic novelty during selection, commenting on the role of de novo mutation and the acquisition of adaptive variation by gene flow from endemic wild relatives.

Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline

Genome Biology ◽

10.1186/s13059-019-1905-y ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 26

Author(s):

Shujun Ou ◽

Weija Su ◽

Yi Liao ◽

Kapeel Chougule ◽

Jireh R. A. Agda ◽

...

Keyword(s):

Transposable Elements ◽

Animal Species ◽

Performance Metrics ◽

De Novo ◽

Terminal Inverted Repeat ◽

Miniature Inverted Transposable Elements ◽

Sensitivity Specificity ◽

Genomic Regions ◽

Assembly Algorithms ◽

Eukaryotic Genomes

Abstract: Background: Sequencing technology and assembly algorithms have matured to the point that high-quality de novo assembly is possible for large, repetitive genomes. Current assemblies traverse transposable elements (TEs) and provide an opportunity for comprehensive annotation of TEs. Numerous methods exist for annotation of each class of TEs, but their relative performances have not been systematically compared. Moreover, a comprehensive pipeline is needed to produce a non-redundant library of TEs for species lacking this resource to generate whole-genome TE annotations. Results: We benchmark existing programs based on a carefully curated library of rice TEs. We evaluate the performance of methods annotating long terminal repeat (LTR) retrotransposons, terminal inverted repeat (TIR) transposons, short TIR transposons known as miniature inverted transposable elements (MITEs), and Helitrons. Performance metrics include sensitivity, specificity, accuracy, precision, FDR, and F1. Using the most robust programs, we create a comprehensive pipeline called Extensive de-novo TE Annotator (EDTA) that produces a filtered non-redundant TE library for annotation of structurally intact and fragmented elements. EDTA also deconvolutes nested TE insertions frequently found in highly repetitive genomic regions. Using other model species with curated TE libraries (maize and Drosophila), EDTA is shown to be robust across both plant and animal species. Conclusions: The benchmarking results and pipeline developed here will greatly facilitate TE annotation in eukaryotic genomes. These annotations will promote a much more in-depth understanding of the diversity and evolution of TEs at both intra- and inter-species levels. EDTA is open-source and freely available: https://github.com/oushujun/EDTA.
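The six performance metrics named in the abstract have standard definitions in terms of true/false positives and negatives. The sketch below computes them from a confusion matrix; the `Confusion` class, the `metrics` function and the example counts are hypothetical, and the abstract does not state which unit (for instance bases or elements) the benchmark counts.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int  # annotated as TE, truly TE
    fp: int  # annotated as TE, truly non-TE
    tn: int  # annotated as non-TE, truly non-TE
    fn: int  # annotated as non-TE, truly TE


def metrics(c: Confusion) -> dict[str, float]:
    """Standard classification metrics from a confusion matrix."""
    sensitivity = c.tp / (c.tp + c.fn)          # recall / true positive rate
    specificity = c.tn / (c.tn + c.fp)          # true negative rate
    accuracy = (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)
    precision = c.tp / (c.tp + c.fp)
    fdr = 1.0 - precision                       # false discovery rate
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "precision": precision,
        "FDR": fdr,
        "F1": f1,
    }


# Made-up counts, for illustration only.
result = metrics(Confusion(tp=80, fp=20, tn=880, fn=20))
```

Reporting several metrics together matters for repetitive genomes: when negatives dominate, accuracy and specificity can stay high even if precision is poor, which is what FDR and F1 expose.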

Minirmd: accurate and fast duplicate removal tool for short reads via multiple minimizers

Bioinformatics ◽

10.1093/bioinformatics/btaa915 ◽

2020 ◽

Author(s):

Yuansheng Liu ◽

Xiaocai Zhang ◽

Quan Zou ◽

Xiangxiang Zeng

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

De Novo ◽

Supplementary Information ◽

Supplementary Data ◽

Complementary Strand ◽

Short Reads ◽

Sequencing Technologies ◽

Computational Resources

Abstract: Summary: Removing duplicate and near-duplicate reads generated by high-throughput sequencing technologies can reduce the computational resources needed in downstream applications. Here we develop minirmd, a de novo tool to remove duplicate reads via multiple rounds of clustering using minimizers of different lengths. Experiments demonstrate that minirmd removes more near-duplicate reads than existing clustering approaches and is faster than existing multi-core tools. To the best of our knowledge, minirmd is the first tool to remove near-duplicates on the reverse-complementary strand. Availability and implementation: https://github.com/yuansliu/minirmd. Supplementary information: Supplementary data are available at Bioinformatics online.
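To make the idea of minimizer-based clustering concrete, here is a simplified sketch and not the minirmd implementation. It buckets reads by the lexicographically smallest k-mer of their canonical form, drops any read within a small Hamming distance of an already kept read in the same bucket, and repeats with a different k. All function names, the values of `ks` and the mismatch threshold are assumptions chosen for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def canonical(seq: str) -> str:
    """Strand-independent form: the smaller of a read and its reverse complement."""
    return min(seq, reverse_complement(seq))


def minimizer(seq: str, k: int) -> str:
    """Lexicographically smallest k-mer of seq (whole read if shorter than k)."""
    if len(seq) < k:
        return seq
    return min(seq[i:i + k] for i in range(len(seq) - k + 1))


def hamming(a: str, b: str) -> int:
    """Mismatch count; reads of different length are treated as far apart."""
    if len(a) != len(b):
        return max(len(a), len(b))
    return sum(x != y for x, y in zip(a, b))


def one_round(reads: list[str], k: int, max_mismatches: int) -> list[str]:
    """Keep the first read of every near-duplicate group sharing a minimizer."""
    buckets: dict[str, list[str]] = {}
    kept = []
    for read in reads:
        canon = canonical(read)
        representatives = buckets.setdefault(minimizer(canon, k), [])
        if any(hamming(canon, rep) <= max_mismatches for rep in representatives):
            continue  # near-duplicate of a read kept earlier
        representatives.append(canon)
        kept.append(read)
    return kept


def remove_near_duplicates(reads, ks=(5, 3), max_mismatches=1):
    """Run several rounds, each bucketing by a different minimizer length."""
    for k in ks:
        reads = one_round(reads, k, max_mismatches)
    return reads


reads = ["AAACCCGG", "CCGGGTTT", "AAACCCGT", "TTTTGGGG"]
unique = remove_near_duplicates(reads)
```

In this toy input the second read is the reverse complement of the first and the third differs from it by one base, so both are dropped. Several rounds help because a mismatch that falls inside the minimizer sends two near-duplicates to different buckets for one k but not necessarily for another. The sketch ignores the case where a mismatch flips which strand is canonical.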
