Fishing for DNA? Designing baits for population genetics in target enrichment experiments: guidelines, considerations and the new tool supeRbaits

Author(s):  
Belen Jimenez Mena ◽  
Hugo Flávio ◽  
Romina Henriques ◽  
Alice Manuzzi ◽  
Miguel Ramos ◽  
...  

Targeted sequencing is an increasingly popular Next Generation Sequencing (NGS) approach for studying populations, focusing sequencing efforts on specific parts of the genome of a species of interest. Methodologies and tools for designing targeted baits are scarce but in high demand. Here, we present specific guidelines and considerations for designing capture sequencing experiments for population genetics, for both neutral genomic regions and regions subject to selection. We describe the bait design process carried out in our research group for three diverse fish species: Atlantic salmon, Atlantic cod and tiger shark, and provide an evaluation of the performance of our approach across both historical and modern samples. The workflow used for designing these three bait sets has been implemented in the R package supeRbaits, which encompasses our considerations and guidelines for bait design to benefit researchers and practitioners. The supeRbaits package is user-friendly and versatile; it is written in C++ and implemented in R. supeRbaits and its manual are available from GitHub: https://github.com/BelenJM/supeRbaits
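The tiling logic at the heart of any bait design can be sketched in a few lines of R. The snippet below is a conceptual illustration only, not the supeRbaits interface; the 120-bp bait length, 60-bp tiling step and region coordinates are invented for the example.

```r
# Conceptual sketch of tiling baits across target regions
# (not the supeRbaits API; bait length, step and coordinates are assumed values).
tile_baits <- function(regions, bait_length = 120, step = 60) {
  pieces <- lapply(seq_len(nrow(regions)), function(i) {
    last_start <- regions$end[i] - bait_length + 1
    if (last_start < regions$start[i]) return(NULL)  # region shorter than one bait
    starts <- seq(regions$start[i], last_start, by = step)
    data.frame(chrom      = regions$chrom[i],
               bait_start = starts,
               bait_end   = starts + bait_length - 1)
  })
  do.call(rbind, pieces)
}

# Hypothetical target regions (e.g. neutral regions or loci under selection)
regions <- data.frame(chrom = c("chr1", "chr2"),
                      start = c(1000, 5000),
                      end   = c(1500, 5300))
head(tile_baits(regions))
```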

F1000Research ◽  
2021 ◽  
Vol 10 ◽  
pp. 524
Author(s):  
Sanjeev Sariya ◽  
Giuseppe Tosto

Next-generation sequencing (NGS) has enabled analysis of rare and uncommon variants in large study cohorts. A common strategy to overcome their low frequencies and/or small effect sizes relies on collapsing, i.e. binning variants within genes or regions. Several tools are available for advanced statistical analyses; however, tools to perform basic tasks, such as obtaining allelic counts within defined genetic boundaries, are unavailable or require complex coding. GARCOM, an open-source, freely available R package, returns a matrix of allelic counts within defined genetic boundaries. GARCOM accepts input data in PLINK or VCF formats, with additional options to subset data for refined analyses.
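The core operation described here, collapsing per-variant allele counts into per-gene counts, can be illustrated in base R as follows; the toy genotypes, coordinates and column names are invented and do not reflect GARCOM's actual input format or function names.

```r
# Toy sketch of gene-level allele counting (illustrative only; not GARCOM's API).
# 'geno' holds per-sample alternate-allele counts (0/1/2) for each variant.
geno <- data.frame(variant = c("v1", "v2", "v3", "v4"),
                   pos     = c(105, 180, 950, 1020),
                   S1      = c(0, 1, 2, 0),
                   S2      = c(1, 0, 0, 1))

# Gene boundaries (hypothetical coordinates)
genes <- data.frame(gene  = c("GENE_A", "GENE_B"),
                    start = c(100, 900),
                    end   = c(200, 1100))

# Sum allele counts of all variants falling within each gene, per sample
gene_counts <- t(sapply(seq_len(nrow(genes)), function(i) {
  in_gene <- geno$pos >= genes$start[i] & geno$pos <= genes$end[i]
  colSums(geno[in_gene, c("S1", "S2"), drop = FALSE])
}))
rownames(gene_counts) <- genes$gene
gene_counts
# Expected: GENE_A -> S1 = 1, S2 = 1; GENE_B -> S1 = 2, S2 = 1
```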


2019 ◽  
Vol 36 (8) ◽  
pp. 2587-2588 ◽  
Author(s):  
Christopher M Ward ◽  
Thu-Hien To ◽  
Stephen M Pederson

Abstract Motivation High-throughput next-generation sequencing (NGS) has become exceedingly cheap, facilitating studies with large numbers of samples. Quality control (QC) is an essential stage of analytic pipelines, and the outputs of popular bioinformatics tools such as FastQC and Picard can provide information on individual samples. Although these tools provide considerable power for QC, large sample numbers can make inspection of all samples and identification of systemic bias a challenge. Results We present ngsReports, an R package designed for the management and visualization of NGS reports from within an R environment. The available methods allow FastQC reports, along with outputs from other tools, to be imported directly into R. Visualization can be carried out across many samples using default, highly customizable plots, with options for hierarchical clustering to quickly identify outlier libraries. Moreover, these can be displayed in an interactive shiny app or HTML report for ease of analysis. Availability and implementation The ngsReports package is available on Bioconductor and the GUI shiny app is available at https://github.com/UofABioinformaticsHub/shinyNgsreports. Supplementary information Supplementary data are available at Bioinformatics online.
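A minimal usage sketch of this workflow might look like the following; the directory path is hypothetical, and the function names are reproduced from the package documentation as we recall it, so they should be checked against the current ngsReports vignette.

```r
# Minimal sketch of importing FastQC reports with ngsReports
# (path is hypothetical; verify function names against the package vignette).
library(ngsReports)

# Collect the FastQC zip archives produced for each library
fastqc_files <- list.files("fastqc_output", pattern = "_fastqc\\.zip$", full.names = TRUE)

# Import all reports at once
fdl <- FastqcDataList(fastqc_files)

# Summary heatmap of PASS/WARN/FAIL flags and read totals across samples
plotSummary(fdl)
plotReadTotals(fdl)
```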


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Lélia Polit ◽  
Gwenneg Kerdivel ◽  
Sebastian Gregoricchio ◽  
Michela Esposito ◽  
Christel Guillouf ◽  
...  

Abstract Background Multiple studies rely on ChIP-seq experiments to assess the effect of gene modulation and drug treatments on protein binding and chromatin structure. However, most methods commonly used for the normalization of ChIP-seq binding intensity signals across conditions, e.g., normalization to the same number of reads, either assume a constant signal-to-noise ratio across conditions or base the estimates of correction factors on genomic regions with intrinsically different signals between conditions. Inaccurate normalization of ChIP-seq signal may, in turn, lead to erroneous biological conclusions. Results We developed a new R package, CHIPIN, that allows normalizing ChIP-seq signals across different conditions/samples when spike-in information is not available but gene expression data are at hand. Our normalization technique is based on the assumption that, on average, no differences in ChIP-seq signals should be observed in the regulatory regions of genes whose expression levels are constant across samples/conditions. In addition to normalizing ChIP-seq signals, CHIPIN provides as output a number of graphs and statistics allowing the user to assess the efficiency of the normalization and the specificity of the antibody used. Beyond ChIP-seq, CHIPIN can be used without restriction on open-chromatin ATAC-seq or DNase hypersensitivity data. We validated the CHIPIN method on several ChIP-seq data sets and documented its superior performance in comparison to several commonly used normalization techniques. Conclusions The CHIPIN method provides a new way to normalize ChIP-seq signal across conditions when spike-in experiments are not available. The method is implemented in a user-friendly R package available on GitHub: https://github.com/BoevaLab/CHIPIN
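The key assumption, that signal over the regulatory regions of constant-expression genes should match across conditions, can be made concrete with a toy scaling example. This is a conceptual sketch only, not CHIPIN's actual interface or full method; the data, gene set and simple mean-based scaling are invented for illustration.

```r
# Conceptual sketch of normalization anchored on constant-expression genes
# (not CHIPIN's interface; the toy data and scaling scheme are illustrative).

# Mean ChIP-seq coverage over the promoter of each gene, in two conditions
signal <- data.frame(gene  = paste0("g", 1:6),
                     condA = c(10, 12,  8, 30,  5, 14),
                     condB = c(20, 25, 15, 90, 11, 27))

# Genes whose expression is (assumed) constant across the two conditions
constant_genes <- c("g1", "g2", "g3", "g5", "g6")
idx <- signal$gene %in% constant_genes

# Scale condition B so its average signal over constant genes matches condition A
scale_factor <- mean(signal$condA[idx]) / mean(signal$condB[idx])
signal$condB_norm <- signal$condB * scale_factor

signal
# After scaling, condA and condB_norm agree on the constant genes; any remaining
# difference (e.g. at g4) reflects a change in binding rather than sequencing depth.
```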


2019 ◽  
Vol 35 (21) ◽  
pp. 4419-4421 ◽  
Author(s):  
Sun Ah Kim ◽  
Myriam Brossard ◽  
Delnaz Roshandel ◽  
Andrew D Paterson ◽  
Shelley B Bull ◽  
...  

Abstract Summary For the analysis of high-throughput genomic data produced by next-generation sequencing (NGS) technologies, researchers need to identify linkage disequilibrium (LD) structure in the genome. In this work, we developed an R package, gpart, which provides clustering algorithms to define LD blocks or analysis units consisting of SNPs. The visualization tool in gpart can display the LD structure and gene positions for up to 20 000 SNPs in one image. The gpart functions facilitate construction of LD blocks and SNP partitions for vast amounts of genome sequencing data within reasonable time and memory limits in personal computing environments. Availability and implementation The R package is available at https://bioconductor.org/packages/gpart. Supplementary information Supplementary data are available at Bioinformatics online.
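As a toy illustration of what an LD-based partition of SNPs looks like, the snippet below uses a naive greedy rule on simulated genotypes; it is not the block definition algorithm implemented in gpart, and the r² threshold is an invented value.

```r
# Toy illustration of grouping adjacent SNPs into LD-based units
# (a greedy sketch only; gpart implements more sophisticated block definitions).
set.seed(1)

# Genotype matrix: 100 individuals x 8 SNPs, coded 0/1/2
geno <- matrix(rbinom(100 * 8, 2, 0.3), nrow = 100)

# Pairwise r^2 between adjacent SNPs
r2_adj <- sapply(seq_len(ncol(geno) - 1),
                 function(j) cor(geno[, j], geno[, j + 1])^2)

# Start a new block whenever adjacent r^2 drops below a (hypothetical) threshold
threshold <- 0.1
block_id <- cumsum(c(1, r2_adj < threshold))
split(seq_len(ncol(geno)), block_id)   # SNP indices grouped into blocks
```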


Author(s):  
Matthew L Bendall ◽  
Keylie M Gibson ◽  
Margaret C Steiner ◽  
Uzma Rentia ◽  
Marcos Pérez-Losada ◽  
...  

Abstract Deep sequencing of viral populations using next-generation sequencing (NGS) offers opportunities to understand and investigate evolution, transmission dynamics, and population genetics. Currently, the standard practice for processing NGS data to study viral populations is to summarize all the observed sequences from a sample as a single consensus sequence, thus discarding valuable information about intra-host viral molecular epidemiology. Furthermore, existing analytical pipelines may only analyze genomic regions involved in drug resistance and are thus not suited for full viral genome analysis. Here we present HAPHPIPE, a HAplotype and PHylodynamics PIPEline for genome-wide assembly of viral consensus sequences and haplotypes. The HAPHPIPE protocol includes modules for quality trimming, error correction, de novo assembly, alignment, and haplotype reconstruction. The resulting consensus sequences, haplotypes, and alignments can be further analyzed using a variety of phylogenetic and population genetic software. HAPHPIPE is designed to provide users with a single pipeline to rapidly analyze sequences from viral populations generated on NGS platforms and to provide quality output properly formatted for downstream evolutionary analyses.
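HAPHPIPE itself is a command-line pipeline, but one of the steps it automates, calling a consensus sequence from aligned reads, is easy to picture. The R snippet below is a conceptual majority-rule sketch on a toy alignment, not HAPHPIPE's code.

```r
# Conceptual sketch of majority-rule consensus calling, one of the steps a
# pipeline like HAPHPIPE automates (illustrative only; toy alignment).

# Toy alignment: one read per row, one position per column
aln <- rbind(c("A", "C", "G", "T"),
             c("A", "C", "G", "A"),
             c("A", "T", "G", "T"))

consensus <- apply(aln, 2, function(column) {
  counts <- table(column)
  names(counts)[which.max(counts)]   # most frequent base at this position
})
paste(consensus, collapse = "")      # "ACGT"
```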


2018 ◽  
Author(s):  
Yi Zhang ◽  
Mohith Manjunath ◽  
Yeonsung Kim ◽  
Joerg Heintz ◽  
Jun S. Song

Abstract Next-generation sequencing (NGS) techniques are revolutionizing biomedical research by providing powerful methods for generating genomic and epigenomic profiles. This rapid progress poses an acute challenge for students and researchers trying to stay acquainted with the numerous available methods. We have developed an interactive online educational resource called SequencEnG (acronym for Sequencing Techniques Engine for Genomics) to provide a tree-structured knowledge base of 66 different sequencing techniques and step-by-step NGS data analysis pipelines comparing popular tools. SequencEnG is designed to facilitate barrier-free learning of current NGS techniques and provides a user-friendly interface for searching through experimental and analysis methods. SequencEnG is part of the project KnowEnG (Knowledge Engine for Genomics) and is freely available at http://education.knoweng.org/sequenceng/.


2021 ◽  
Vol 12 ◽  
Author(s):  
Samuel Daniel Lup ◽  
David Wilson-Sánchez ◽  
Sergio Andreu-Sánchez ◽  
José Luis Micol

Mapping-by-sequencing strategies combine next-generation sequencing (NGS) with classical linkage analysis, allowing rapid identification of the causal mutations underlying the phenotypes exhibited by mutants isolated in a genetic screen. Computer programs are available that identify a causal mutation by analyzing NGS data from a mapping population derived from the mutant of interest; however, installing and using such programs requires bioinformatic skills, modifying or combining pieces of existing software, or purchasing licenses. To ease this process, we developed Easymap, an open-source program that simplifies the data analysis workflow from raw NGS reads to candidate mutations. Easymap can perform bulked segregant mapping of point mutations induced by ethyl methanesulfonate (EMS) with DNA-seq or RNA-seq datasets, as well as tagged-sequence mapping for large insertions, such as transposons or T-DNAs. The mapping analyses implemented in Easymap have been validated with experimental and simulated datasets from different plant and animal model species. Easymap was designed to be accessible to all users regardless of their bioinformatics skills by implementing a user-friendly graphical interface, a simple universal installation script, and detailed mapping reports, including informative images and complementary data for assessment of the mapping results. Easymap is available at http://genetics.edu.umh.es/resources/easymap; its Quickstart Installation Guide details the recommended procedure for installation.
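The bulked segregant signal that mapping-by-sequencing exploits, mutant-allele frequencies drifting towards 1 near the causal site in a mutant bulk, can be visualized with a small simulation. The data below are invented for illustration and have nothing to do with Easymap's own code or input formats.

```r
# Conceptual sketch of the bulked segregant signal exploited by mapping-by-sequencing
# (simulated data; not Easymap's code or input format).
set.seed(2)
pos    <- seq(0.1e6, 20e6, by = 0.1e6)   # SNP positions along a chromosome
causal <- 12e6                           # hypothetical causal mutation position

# In a mutant bulk, the EMS allele frequency is ~0.5 far from the causal site
# and approaches 1 in linkage with it
af <- 0.5 + 0.5 * exp(-abs(pos - causal) / 2e6) + rnorm(length(pos), sd = 0.03)
af <- pmin(pmax(af, 0), 1)

# A smoothed allele-frequency curve peaks near the candidate interval
fit <- loess(af ~ pos, span = 0.2)
pos[which.max(predict(fit))]             # close to 12 Mb
```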


2020 ◽  
Vol 2 (4) ◽  
Author(s):  
Thomas P Quinn ◽  
Ionas Erb

Abstract Many next-generation sequencing datasets contain only relative information because of biological and technical factors that limit the total number of transcripts observed for a given sample; as a consequence, it is not possible to interpret any one component in isolation. The field of compositional data analysis has emerged with alternative methods for relative data based on log-ratio transforms. However, these data often contain many more features than samples, and thus require creative new ways to reduce the dimensionality of the data. The summation of parts, called amalgamation, is a practical way of reducing dimensionality, but can introduce a non-linear distortion to the data. We exploit this non-linearity to propose a powerful yet interpretable dimension reduction method called data-driven amalgamation. Our new method, implemented in the user-friendly R package amalgam, can reduce the dimensionality of compositional data by finding amalgamations that optimally (i) preserve the distance between samples, or (ii) classify samples as diseased or not. Our benchmark on 13 real datasets confirms that these amalgamations compete with state-of-the-art methods in terms of performance, but result in new features that are easily understood: they are groups of parts added together.
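Amalgamation itself, summing groups of parts into fewer composite parts, is a one-liner; the package's contribution is searching for the grouping. The sketch below fixes a grouping by hand on toy data purely to show the operation and the preserved unit-sum constraint; it does not use the amalgam package's API.

```r
# Sketch of amalgamation: summing groups of parts into fewer composite parts
# (the grouping here is fixed by hand; amalgam searches for an optimal grouping).
set.seed(3)

# Toy relative-abundance table: 4 samples x 6 features, rows sum to 1
counts <- matrix(rpois(4 * 6, lambda = 50), nrow = 4)
comp   <- counts / rowSums(counts)

# Hypothetical assignment of the 6 features to 2 amalgams
groups <- c(1, 1, 2, 2, 2, 1)

amalgamated <- sapply(sort(unique(groups)),
                      function(g) rowSums(comp[, groups == g, drop = FALSE]))
colnames(amalgamated) <- paste0("amalgam_", sort(unique(groups)))

rowSums(amalgamated)   # still 1: amalgamation preserves the compositional constraint
```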


RNA ◽  
2021 ◽  
pp. rna.078969.121
Author(s):  
Andrea Di Gioacchino ◽  
Rachel Legendre ◽  
Yannis Rahou ◽  
Valérie Najburg ◽  
Pierre Charneau ◽  
...  

Coronavirus RNA-dependent RNA polymerases produce subgenomic RNAs (sgRNAs) that encode viral structural and accessory proteins. User-friendly bioinformatic tools to detect and quantify sgRNA production are urgently needed to study the growing amount of SARS-CoV-2 next-generation sequencing (NGS) data. We introduced sgDI-tector to identify and quantify sgRNAs in SARS-CoV-2 NGS data. sgDI-tector allowed detection of sgRNAs without initial knowledge of the transcription-regulatory sequences. We produced NGS data and successfully detected the nested set of sgRNAs with the ranking M>ORF3a>N>ORF6>ORF7a>ORF8>S>E>ORF7b. We also compared the level of sgRNA production with that of other types of viral RNA products, such as defective interfering viral genomes.
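One way to picture how junction-supporting reads translate into an sgRNA ranking is the toy tabulation below; the junction positions and read counts are invented, the ORF start coordinates are only approximate, and the snippet does not reflect sgDI-tector's actual algorithm or output.

```r
# Toy sketch of ranking sgRNAs by junction-read support
# (invented junctions; approximate ORF coordinates; not sgDI-tector's algorithm).

# Body-side positions of leader-body junctions recovered from NGS reads
junctions <- c(21555, 21560, 25380, 25385, 25390, 28270, 28274, 26240)

# Annotated ORF start coordinates on the SARS-CoV-2 genome (approximate)
orf_starts <- c(S = 21563, ORF3a = 25393, E = 26245, N = 28274)

# Assign each junction to the first ORF starting at or after it, then tabulate
assign_orf <- function(j) {
  downstream <- orf_starts[orf_starts >= j]
  if (length(downstream) == 0) NA_character_ else names(downstream)[which.min(downstream)]
}
sort(table(vapply(junctions, assign_orf, character(1))), decreasing = TRUE)
```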

