benchNGS: An approach to benchmark short reads alignment tools

2015 ◽  
Author(s):  
Farzana Rahman ◽  
Mehedi Hassan ◽  
Alona Martin Kryshchenko ◽  
Inna Dubchak ◽  
Nikolai Nickolai Alexandrov ◽  
...  

In the last decade, a number of algorithms and associated software were developed to align next-generation sequencing (NGS) reads to relevant reference genomes. The results of these programs may vary significantly, especially when the NGS reads contain mutations not found in the reference genome. Yet there is no standard way to compare these programs and assess their biological relevance. We propose a benchmark to assess the accuracy of short-read mapping based on the precomputed global alignment of closely related genome sequences. In this paper we outline the method and also present a short report of an experiment performed on five popular alignment tools.
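The benchmarking idea in this abstract can be illustrated with a minimal sketch. It assumes that reads are drawn from genome A at known positions, mapped by an aligner to a closely related genome B, and scored against the position that a precomputed global alignment of A and B predicts. The function names, the gapped-string input format, and the `tolerance` parameter are hypothetical and are not taken from benchNGS itself.

```python
def build_coordinate_map(aligned_a, aligned_b):
    """Map each 0-based position in genome A to its aligned position in genome B.

    Both inputs are rows of one global alignment (equal length, '-' for gaps).
    A position in A that faces a gap in B maps to None.
    """
    a_to_b = []
    pos_b = 0
    for base_a, base_b in zip(aligned_a, aligned_b):
        if base_a != "-":
            a_to_b.append(pos_b if base_b != "-" else None)
        if base_b != "-":
            pos_b += 1
    return a_to_b


def mapping_accuracy(true_starts_a, reported_starts_b, a_to_b, tolerance=0):
    """Fraction of scorable reads placed within `tolerance` bases of the expected start.

    true_starts_a: read id -> start in genome A (where the read was drawn from).
    reported_starts_b: read id -> start in genome B reported by the aligner;
    unmapped reads are simply absent and count as incorrect.
    """
    scored = correct = 0
    for read_id, start_a in true_starts_a.items():
        expected = a_to_b[start_a]
        if expected is None:
            continue  # no counterpart in B, so the read cannot be scored
        scored += 1
        reported = reported_starts_b.get(read_id)
        if reported is not None and abs(reported - expected) <= tolerance:
            correct += 1
    return correct / scored if scored else 0.0
```

Running `mapping_accuracy` once per aligner on the same read set would give directly comparable scores; a nonzero `tolerance` is one possible way to forgive small offsets near indels.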

2015 ◽  
Author(s):  
Farzana Rahman ◽  
Mehedi Hassan ◽  
Alona Kryshchenko ◽  
Inna Dubchak ◽  
Tatiana V Tatarinova ◽  
...  

In the last decade, a number of algorithms and associated software were developed to align next-generation sequencing (NGS) reads to relevant reference genomes. The results of these programs may vary significantly, especially when the NGS reads contain mutations not found in the reference genome. Yet there is no standard way to compare these programs and assess their biological relevance. We propose a benchmark to assess the accuracy of short-read mapping based on the pre-computed global alignment of closely related genome sequences. In this paper we outline the method and also present a short report of an experiment performed on five popular alignment tools.


2019 ◽  
Vol 5 (Supplement_1) ◽  
Author(s):  
Julia Hillung ◽  
María Alma Bracho ◽  
Javier Pons Tamarit ◽  
Fernando González-Candelas

Abstract Next-generation sequencing (NGS) is a technique that can capture the variability of viral populations in transmission studies. The conventional sample preparation for NGS, based on amplicons, is a potential source of errors, derived from the variable affinity of specific primers for different viral variants and from irregular DNA polymerase efficiency. In this context, we propose a more reliable method for viral whole genome sample preparation, starting from nucleic acids obtained and stored with conventional procedures. Our goal was to obtain complete hepatitis C virus (HCV) genome sequences to subsequently perform extensive phylogenetic analyses. Additionally, we aimed to test the effectiveness of nuclease treatment used to remove contaminating host DNA. Nucleic acids were obtained from almost cell-free blood plasma of HCV-infected patients. As a source for Illumina library preparation, double-stranded cDNA was generated using random primers. The HCV genome was not amplified before library preparation, avoiding possible biases derived from unequal copying. To get rid of possible host contaminants in the samples, a DNase treatment step was added. Libraries were paired-end sequenced on the Illumina platform using MiSeq reagent kit v3. After conservative filtering of contaminant human reads by alignment with the human reference genome using Burrows-Wheeler Aligner (BWA), the remaining reads were mapped to the HCV reference genome using BWA. Primary maximum likelihood phylogenetic analyses were performed using ClustalW and IQTREE to infer the phylogenetic relationships of the sequenced samples in the context of complete genome sequences of the same genotype. NGS sample preparation method of HCV from blood plasma was established. Complete genome sequences of HCV could be obtained with variable coverage depending on the viral load of plasma samples. 
No significant reduction of host DNA proportion in DNase treated samples in comparison to the controls was observed. The new sequences clustered within the Los Alamos National Laboratory database-deposited HCV subtype 4d samples. The method can be used to obtain full-length sequences of HCV from nucleic acid samples not previously planned for NGS. No improvement was observed when DNase pre-treatment of nucleic acids extracted from blood plasma was performed.


2016 ◽  
Vol 4 (2) ◽  
Author(s):  
Deborah Moine ◽  
Mohamed Kassam ◽  
Leen Baert ◽  
Yanjie Tang ◽  
Caroline Barretto ◽  
...  

Cronobacter is associated with infant infections and the consumption of reconstituted infant formula. Here we sequenced and closed six genomes of C. condimentiᵀ, C. muytjensiiᵀ, C. universalisᵀ, C. malonaticusᵀ, C. dublinensisᵀ, and C. sakazakii that can be used as reference genomes in single nucleotide polymorphism (SNP)-based next-generation sequencing (NGS) analysis for source tracking investigations.


Author(s):  
Tao Zhou ◽  
Liang Lu ◽  
Chenhong Li

A combination of next-generation sequencing technologies and mate-pair libraries of large insert sizes is used as a standard method to generate genome assemblies with high contiguity. The third-generation sequencing techniques also are used to improve the quality of assembled genomes. However, both mate-pair libraries and the third-generation libraries require high-molecular-weight DNA, making the use of these libraries inappropriate for samples with only degraded DNA. An in silico method that generates mate-pair libraries using a reference genome was devised for the task of assembling target genomes. Although the contiguity and completeness of assembled genomes were significantly improved by this method, a high level of errors manifested in the assembly, further to which the methods for using reference genomes were not optimized. Here, we tested different strategies for using reference genomes to generate in silico mate-pairs. The results showed that using a closely related reference genome from the same genus was more effective than using divergent references. Conservation of in silico mate-pairs by comparing two references and using those to guide genome assembly reduced the number of misassemblies (18.6% – 46.1%) and increased the contiguity of assembled genomes (9.7% – 70.7%), while maintaining gene completeness at a level that was either similar or marginally lower than that obtained via the current method. Finally, we compared the optimized method with another reference-guided assembler, RaGOO. We found that RaGOO produced longer scaffolds (17.8 Mbp vs 3.0 Mbp), but resulted in a much higher misassembly rate (85.68%) than our optimized in silico mate-pair method.


2020 ◽  
Author(s):  
Inácio Gomes Medeiros ◽  
André Salim Khayat ◽  
Beatriz Stransky ◽  
Sidney Emanuel Batista dos Santos ◽  
Paulo Pimentel de Assumpção ◽  
...  

Abstract This protocol describes the building of a database of SARS-CoV-2 targets for siRNA approaches. Starting from the virus reference genome, we will derive sequences 18 to 21 nt long and verify their similarity against the human genome and the coding and non-coding transcriptome, as well as genomes from related viruses. We will also calculate a set of thermodynamic features for those sequences and infer their efficiencies using three different predictors. The protocol has two main phases: in the first, we align sequences against reference genomes; in the second, we extract the features. The duration of the first phase varies with the computational power of the running machine and the number of reference genomes. The second phase takes about thirty minutes to execute, also depending on the number of cores of the running machine. The constructed database aims to speed up the design process by providing a broad set of possible SARS-CoV-2 target sequences and siRNA sequences.
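The first step of this protocol, deriving every 18 to 21 nt subsequence from a reference genome, can be sketched as a sliding window. This is a minimal illustration under stated assumptions: the function names are hypothetical, and GC fraction is used only as a simple stand-in feature, not as one of the thermodynamic features or predictors the protocol actually computes.

```python
def candidate_targets(genome, min_len=18, max_len=21):
    """Yield (start, length, subsequence) for every window of min_len..max_len nt.

    `start` is 0-based. The genome is upper-cased so that mixed-case input
    gives consistent subsequences.
    """
    genome = genome.upper()
    for length in range(min_len, max_len + 1):
        for start in range(len(genome) - length + 1):
            yield start, length, genome[start:start + length]


def gc_fraction(seq):
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq.upper()) / len(seq)
```

Each yielded candidate would then be passed to the alignment phase (similarity checks against the human and related viral genomes) before feature extraction.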


Genes ◽  
2021 ◽  
Vol 12 (12) ◽  
pp. 1917
Author(s):  
Francesco Pepe ◽  
Pasquale Pisapia ◽  
Gianluca Russo ◽  
Mariantonia Nacchio ◽  
Pierlorenzo Pallante ◽  
...  

High-grade serous ovarian carcinoma (HGSOC) is the most common subtype of all ovarian carcinomas. HGSOC harboring BRCA1/2 germline or somatic mutations are sensitive to the poly (adenosine diphosphate-ribose) polymerase inhibitors (PARPi). Therefore, detecting these mutations is crucial to identifying patients for PARPi-targeted treatment. In the clinical setting, next generation sequencing (NGS) has proven to be a reliable diagnostic approach for BRCA1/2 molecular evaluation. Here, we review the results of our BRCA1/2 NGS analysis obtained in a year and a half of diagnostic routine practice. BRCA1/2 molecular NGS records of HGSOC patients were retrieved from our institutional archive covering the period from January 2020 to September 2021. NGS analysis was performed on the Ion S5™ System (Thermo Fisher Scientific, Waltham, MA, USA) with the Oncomine™ BRCA Research Assay panel (Thermo Fisher Scientific). Variants were classified as pathogenic or likely pathogenic according to the guidelines of the American College of Medical Genetics and Genomics by inspecting the Evidence-based Network for the Interpretation of Germline Mutant Alleles (ENIGMA) and ClinVar (NCBI) databases. Sixty-five HGSOC patient samples were successfully analyzed. Overall, 11 (16.9%) out of 65 cases harbored a pathogenic alteration in BRCA1/2, in particular, six BRCA1 and five BRCA2 pathogenic variations. This study confirms the efficiency and high sensitivity of NGS analysis in detecting BRCA1/2 germline or somatic variations in patients with HGSOC.


2018 ◽  
Author(s):  
Mohammad Shabbir Hasan ◽  
Xiaowei Wu ◽  
Liqing Zhang

Abstract In current practice, Next Generation Sequencing (NGS) applications start with mapping/aligning short reads to the reference genome, with the aim of identifying genetic mutations. While most short reads can be mapped to the reference genome accurately by existing alignment tools, a significant number remain unmapped and excluded from downstream analyses, thus potentially discarding important biological information hidden in the unmapped reads. This paper describes Genesis-indel, a computational pipeline that explores the unmapped reads to identify novel indels that are initially missed in the alignment procedure. Genesis-indel is applied to the unmapped reads of 30 Breast Cancer patients from TCGA. Results show that the unmapped reads are conserved between the two subtypes of breast cancer investigated in this study and might contribute to the divergence between the subtypes. Genesis-indel is able to leverage the unmapped reads to identify 72,997 small to large novel high-quality indels previously not found in the original alignments and among them, 16,141 have not been annotated in the widely used mutation database. Statistical analysis shows that these new indels mostly altered the oncogenes and tumor suppressor genes. Functional annotation further reveals that these indels are strongly correlated to pathways of cancer and can have high to moderate impact on protein functions. Additionally, these indels overlap with the genes that are missed in the indels from the originally mapped reads and contribute to the tumorigenesis in multiple carcinomas.


2020 ◽  
Author(s):  
Pablo Duchen ◽  
Nicolas Salamin

Abstract Next-generation-sequencing haplotype callers are commonly used in studies to call variants from newly-sequenced species. However, due to the current availability of genomic resources, it is still common practice to use only one reference genome for a given genus, or even one reference for an entire clade of a higher taxon. The problem with traditional haplotype callers, such as the one from GATK, is that they are optimized for variant calling at the population level, but not at the phylogenetic level. Thus, the consequences for downstream analyses can be substantial. Here, through simulations, we compare the performance between the haplotype callers of GATK and ATLAS, and present their differences at various phylogenetic scales. We show how the haplotype caller of GATK substantially underestimates the number of variants at the phylogenetic level, but not at the population level. We also quantified the level at which the accuracy of heterozygote calls declines with increasing distance to the reference genome. Such decrease is very sharp in GATK, while ATLAS maintains a high accuracy in variant calling, even at moderately-divergent species from the reference. We further suggest that efforts should be taken towards acquiring more reference genomes per species, before pursuing high-scale phylogenomic studies.

