read length
Recently Published Documents


TOTAL DOCUMENTS

284
(FIVE YEARS 154)

H-INDEX

20
(FIVE YEARS 6)

2025 ◽  
Vol 77 (11) ◽  
pp. 6589-2025
Author(s):  
ALEKSANDRA GIZA ◽  
EWELINA IWAN ◽  
ARKADIUSZ BOMBA ◽  
DARIUSZ WASYL

Sequencing can provide genomic characterisation of a specific organism, as well as of a whole environmental or clinical sample. High Throughput Sequencing (HTS) makes it possible to generate an enormous amount of genomic data at gradually decreasing costs and almost in real time. HTS is used in medicine, veterinary medicine, microbiology, virology and epidemiology, among other fields. The paper presents practical aspects of the HTS technology. It describes generations of sequencing, which vary in throughput, read length, accuracy and costs, and thus are used for different applications. The stages of HTS, as well as their purposes and pitfalls, are presented: extraction of the genetic material, library preparation, sequencing and data processing. For the whole process to succeed, all stages need to follow strict quality control measures. Choosing the right sequencing platform, proper sample and library preparation procedures, and adequate bioinformatic tools is crucial for high-quality results.


2022 ◽  
Author(s):  
Jun Ma ◽  
Manuel Cáceres ◽  
Leena Salmela ◽  
Veli Mäkinen ◽  
Alexandru I. Tomescu

Aligning reads to a variation graph is a standard task in pangenomics, with downstream applications such as improving variant calling. While the vg toolkit (Garrison et al., Nature Biotechnology, 2018) is a popular aligner of short reads, GraphAligner (Rautiainen and Marschall, Genome Biology, 2020) is the state-of-the-art aligner of long reads. GraphAligner works by finding candidate read occurrences based on individually extending the best seeds of the read in the variation graph. However, a more principled approach recognized in the community is to co-linearly chain multiple seeds. We present a new algorithm to co-linearly chain a set of seeds in an acyclic variation graph, together with the first efficient implementation of such a co-linear chaining algorithm into a new aligner of long reads to variation graphs, GraphChainer. Compared to GraphAligner, at a normalized edit distance threshold of 40%, it aligns 9% to 12% more reads, and 15% to 19% more total read length, on real PacBio reads from human chromosomes 1 and 22. On both simulated and real data, GraphChainer aligns between 97% and 99% of all reads, and of total read length. At the more stringent normalized edit distance threshold of 30%, GraphChainer aligns up to 29% more total real read length than GraphAligner. GraphChainer is freely available at https://github.com/algbio/GraphChainer
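
To illustrate what co-linear chaining means, the sketch below chains seeds against a linear reference rather than a variation graph. It is a simplified quadratic-time dynamic program, not GraphChainer's algorithm; the `Anchor` type, the `chain_anchors` function and the scoring (total length of chained seeds, no overlaps allowed) are assumptions made for illustration.

```python
from typing import List, NamedTuple, Tuple


class Anchor(NamedTuple):
    """A seed match; intervals are half-open and assumed to have positive length."""
    read_start: int
    read_end: int
    ref_start: int
    ref_end: int


def chain_anchors(anchors: List[Anchor]) -> Tuple[int, List[Anchor]]:
    """Return (score, chain) for the co-linear chain with maximum total seed length.

    Anchor a may precede anchor b only if a ends before b starts
    in BOTH the read and the reference (hypothetical, simplified rule).
    """
    if not anchors:
        return 0, []
    # Sorting by read start gives a valid processing order for the DP.
    order = sorted(anchors, key=lambda a: (a.read_start, a.ref_start))
    n = len(order)
    best = [a.read_end - a.read_start for a in order]  # best score ending at each anchor
    prev = [-1] * n                                    # back-pointers for traceback
    for j in range(n):
        for i in range(j):
            a, b = order[i], order[j]
            if a.read_end <= b.read_start and a.ref_end <= b.ref_start:
                candidate = best[i] + (b.read_end - b.read_start)
                if candidate > best[j]:
                    best[j] = candidate
                    prev[j] = i
    # Trace back from the best-scoring anchor.
    k = max(range(n), key=best.__getitem__)
    score = best[k]
    chain = []
    while k != -1:
        chain.append(order[k])
        k = prev[k]
    chain.reverse()
    return score, chain
```

Chaining on a graph additionally has to decide which anchors can reach each other along paths, which is the part the abstract describes as new; the linear case above only shows the underlying idea of selecting a mutually consistent subset of seeds.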


BMC Genomics ◽  
2022 ◽  
Vol 23 (1) ◽  
Author(s):  
Nadja Brait ◽  
Büşra Külekçi ◽  
Irene Goerzer

Abstract Background: Short-read sequencing has been used extensively to decipher the genome diversity of human cytomegalovirus (HCMV) strains, but falls short of revealing individual genomes in mixed HCMV strain populations. Novel third-generation sequencing platforms offer an extended read length and promise to resolve how distant polymorphic sites along individual genomes are linked. In the present study, we established a long amplicon PacBio sequencing workflow to identify the absolute and relative quantities of unique HCMV haplotypes spanning multiple hypervariable sites in mixtures. Initial validation of this approach was performed with defined HCMV DNA templates derived from cell-culture enriched viruses and was further tested for its suitability on patient samples carrying mixed HCMV infections. Results: Total substitution and indel error rate of mapped reads ranged from 0.17 to 0.43% depending on the stringency of quality trimming. Artificial HCMV DNA mixtures were correctly determined down to 1% abundance of the minor DNA source when the total HCMV DNA input was 4 × 10^4 copies/ml. PCR products of up to 7.7 kb and a GC content < 55% were efficiently generated when DNA was directly isolated from patient samples. In a single sample, up to three distinct haplotypes were identified showing varying relative frequencies. Alignments of distinct haplotype sequences within patient samples showed uneven distribution of sequence diversity, interspersed by long identical stretches. Moreover, diversity estimation at single polymorphic regions as assessed by short amplicon sequencing may markedly underestimate the overall diversity of mixed haplotype populations. Conclusions: Quantitative haplotype determination by long amplicon sequencing provides a novel approach for HCMV strain characterisation in mixed infected samples which can be scaled up to cover the majority of the genome by multi-amplicon panels. 
This will substantially improve our understanding of intra-host HCMV strain diversity and its dynamic behaviour.


Plant Disease ◽  
2022 ◽  
Author(s):  
Laurence Svanella ◽  
Armelle Marais ◽  
Thierry Candresse ◽  
Marie Lefebvre ◽  
Jerome Lluch ◽  
...  

Grapevine virus L (GVL) is a recently described vitivirus (family Betaflexiviridae) with a positive-sense single-stranded RNA genome. It has so far been reported from China, Croatia, New Zealand, the United States and Tunisia (Debat et al. 2019; Diaz-Lara et al. 2019; Alabi et al. 2020; Ben Amar et al. 2020). It has significant genetic variability (up to 26% nucleotide divergence between isolates) and the existence of four phylogroups has been proposed (Alabi et al. 2020). As part of a project investigating the possible links between grapevine trunk diseases and the grapevine virome, viral high throughput sequencing (HTS)-based testing was performed on symptomatic and asymptomatic grapevines collected in July 2019 in vineyards of four areas in France (Bourgogne, Charentes, Gard, Gironde) corresponding to five cultivars of Vitis vinifera (Cabernet franc, Cabernet Sauvignon, Chardonnay, Sauvignon, Ugni blanc). Total RNAs were purified from powder of 105 trunk wood samples using the Spectrum™ Plant Total RNA Kit (Sigma-Aldrich, Saint-Quentin-Fallavier, France) and RNA-seq libraries were prepared using the Zymo-Seq RiboFree Total RNA Library Prep Kit (Ozyme, Saint Cyr l’Ecole, France). HTS was performed on an S4 lane of an Illumina NovaSeq 6000 using a paired-end read length of 2x150 bp. The trimmed sequence reads obtained from Chardonnay plants CH30-75M (99.9 M) and CH37-19S (114 M) from a vineyard in Gard were analyzed using CLC Genomics Workbench v21 (Qiagen, Courtaboeuf, France) and revealed complex mixed infections. Besides contigs representing a complete GVL genome (average scaffold coverage: 6,197x and 2,970x, respectively), contigs from grapevine rupestris stem pitting virus (1,697x; 1,124x), grapevine virus A (82x; 95x), grapevine pinot gris virus (1,475x; 866x), grapevine leafroll-associated virus 3 (5,122x; 1,042x), hop stunt viroid (13,783x; 29,514x) and grapevine yellow speckle viroid 1 (690x; 1,158x) were also identified. 
Plant CH37-19S was also co-infected by grapevine rupestris vein feathering virus (164x). The GVL contigs integrated 320,000 and 152,000 reads, respectively (corresponding to 0.32% and 0.11% of filtered/trimmed reads, respectively). The GVL genomic sequences from each sample (7,616 nt) have been deposited in GenBank (Accession nos. OK042110 and OK042111, respectively). The two contigs are nearly identical (99.9% nt identity) and share 97.5% and 95.9% nt identity, respectively, with GVL-KA from the USA (MH643739) and GVL-RS from China (MH248020), the closest isolates present in GenBank. To confirm the presence of GVL, the original grapevines were resampled in the field and total RNAs were extracted as described above from cambial scrapings and leaves. Total RNAs were used for RT-PCR tests using primers targeting a 279-bp fragment corresponding to the 3’ end of the coat protein gene and part of the nucleic acid binding protein gene (Debat et al. 2019). The Sanger-derived sequences from the amplicons shared 100% nt identity with the corresponding sequences of the HTS-assembled genomes, confirming the presence of GVL in both tissues of both grapevine samples. To our knowledge, this represents the first report of the occurrence of GVL in vineyards in France. Given the complex mixed infection present in the two analyzed grapevines, no conclusions can be drawn on the pathogenicity of GVL. Further efforts are needed to better understand GVL distribution and its potential pathogenicity to grapevine. References: Alabi, O. J., et al. 2020. Arch. of Virol. 165:1905-1909. Ben Amar, A., et al. 2020. Plant Disease 104:3274. Debat, H., et al. 2019. Eur J Plant Pathol. 155:319. Diaz-Lara, A., et al. 2019. Arch. of Virol. 164:2573. Acknowledgments: The authors are grateful to the “Plan National Dépérissement du Vignoble” (Mycovir project) for the financial support.


2021 ◽  
Author(s):  
Chen Yang ◽  
Theodora Lo ◽  
Ka Ming Nip ◽  
Saber Hafezqorani ◽  
René L Warren ◽  
...  

Abstract Background: Nanopore sequencing is crucial to metagenomic studies as its kilobase-long reads can contribute to resolving genomic structural differences among microbes. However, sequencing platform-specific challenges, including high base-call error rate, non-uniform read lengths, and the presence of chimeric artifacts, necessitate specifically designed analytical tools, such as microbial abundance estimation and metagenome assembly algorithms. When developing and testing bioinformatics tools and pipelines, the use of simulated datasets with characteristics that are true to the sequencing platform under evaluation is a cost-effective way to provide a ground truth and assess the performance in a controlled environment. Results: Here, we present Meta-NanoSim, a fast and versatile utility that characterizes and simulates the unique properties of nanopore metagenomic reads. It improves upon state-of-the-art methods on microbial abundance estimation through a base-level quantification algorithm. Meta-NanoSim can simulate complex microbial communities composed of both linear and circular genomes, and can stream reference genomes from online servers directly. Simulated datasets showed high congruence with experimental data in terms of read length, error profiles, and abundance levels. We demonstrate that Meta-NanoSim simulated data can facilitate the development of metagenomic algorithms and guide experimental design through a metagenome assembly benchmarking task. Conclusions: The Meta-NanoSim characterization module investigates read features including chimeric information and abundance levels, while the simulation module simulates large and complex multi-sample microbial communities with different abundance profiles. All trained models and the software are freely accessible on GitHub: https://github.com/bcgsc/NanoSim


2021 ◽  
Author(s):  
Taobo Hu ◽  
Jingjing Li ◽  
Mengping Long ◽  
Jinbo Wu ◽  
Zhen Zhang ◽  
...  

Abstract Background: Structural variations (SVs) are common genetic alterations in the human genome that can cause different phenotypes and various diseases, including cancer. However, the detection of structural variations using second-generation sequencing has been limited by its short read length, which in turn has restricted our understanding of structural variations. Methods: In this study, we developed a 28-gene panel for long-read sequencing and applied it on both Oxford Nanopore Technologies and Pacific Biosciences platforms. We analyzed structural variations in the 28 breast cancer-related genes through long-read genomic and transcriptomic sequencing of tumor, para-tumor and blood samples in 19 breast cancer patients. Results: Our results showed that some somatic SVs were recurring among the selected genes, though the majority of them occurred in non-exonic regions. We found evidence supporting the existence of hotspot regions for SVs, which extends the previous understanding that they exist only for single nucleotide variations. Conclusions: We employed long-read genomic and transcriptomic sequencing to identify SVs in breast cancer patients and showed that this approach holds great potential for clinical application.


Biology ◽  
2021 ◽  
Vol 10 (12) ◽  
pp. 1274
Author(s):  
Yunqing Liu ◽  
Xin Liao ◽  
Tingyu Han ◽  
Ao Su ◽  
Zhuojun Guo ◽  
...  

Coral–zooxanthellae holobionts are one of the most productive ecosystems in the ocean. With global warming and ocean acidification, coral ecosystems are facing unprecedented challenges. To save the coral ecosystems, we need to understand the symbiosis of coral–zooxanthellae. Although some Scleractinia (stony corals) transcriptomes have been sequenced, a reliable full-length transcriptome is still lacking due to the short read length of second-generation sequencing and the uncertainty of the assembly results. Herein, PacBio Sequel II sequencing technology polished with the Illumina RNA-seq platform was used to obtain relatively complete scleractinian coral M. foliosa transcriptome data and to quantify M. foliosa gene expression. A total of 38,365 consensus sequences and 20,751 unique genes were identified. Seven databases were used for the gene function annotation, and 19,972 genes were annotated in at least one database. We found 131 zooxanthellae transcripts and 18,829 M. foliosa transcripts. A total of 6328 lncRNAs, 847 M. foliosa transcription factors (TFs), and 2 zooxanthellae TFs were identified. In zooxanthellae we found pathways related to symbiosis, such as photosynthesis and nitrogen metabolism. Pathways related to symbiosis in M. foliosa include oxidative phosphorylation and nitrogen metabolism. We summarized the isoforms and expression levels of the symbiont recognition genes. Among the membrane proteins, we found three pathways of glycan biosynthesis, which may be involved in organic matter storage and monosaccharide stabilization in M. foliosa. Our results provide better material for studying coral symbiosis.


2021 ◽  
Author(s):  
Ram Ayyala ◽  
Junghyun Jung ◽  
Sergey Knyazev ◽  
SERGHEI MANGUL

Although precise identification of the human leukocyte antigen (HLA) allele is crucial for various clinical and research applications, HLA typing remains challenging due to the high polymorphism of the HLA loci. With Next-Generation Sequencing (NGS) data becoming widely accessible, many computational tools have been developed to predict HLA types from RNA sequencing (RNA-seq) data. However, there is a lack of comprehensive and systematic benchmarking of RNA-seq HLA callers using large-scale and realistic gold standards. To address this limitation, we rigorously compared the performance of 12 HLA callers over 50,000 HLA tasks, including searching 30 pairwise combinations of HLA callers and reference in over 1,500 samples. In each case, we measured accuracy, defined as the percentage of correctly predicted alleles (at two- and four-digit resolution), based on six gold standard datasets spanning 650 RNA-seq samples. To determine how read length over the HLA region influences the prediction quality of each tool, we explored the read length effect across the range 37-126 bp, which was available in our gold standard datasets. Moreover, using the Genotype-Tissue Expression (GTEx) v8 data, we calculated the concordance of the same HLA type across different tissues from the same individual to evaluate how consistently the HLA callers perform across tissues. This study offers crucial information for researchers regarding appropriate choices of methods for an HLA analysis.
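
The accuracy metric described above can be sketched as follows. This is a minimal illustration that assumes standard HLA nomenclature (`A*02:01:01`, where the first colon-separated field is the two-digit level and the first two fields are the four-digit level) and two alleles per locus; the function names and the multiset-matching rule are assumptions, and the benchmark's own scoring may differ in detail.

```python
from collections import Counter
from typing import Dict, List


def truncate_allele(allele: str, fields: int) -> str:
    """Keep the gene name and the first `fields` colon-separated fields.

    fields=1 gives two-digit resolution (A*02), fields=2 gives four-digit (A*02:01).
    """
    gene, _, rest = allele.partition("*")
    return gene + "*" + ":".join(rest.split(":")[:fields])


def hla_accuracy(
    truth: Dict[str, List[str]],
    predicted: Dict[str, List[str]],
    fields: int,
) -> float:
    """Percentage of gold-standard alleles matched by the prediction.

    Alleles are compared per locus as multisets, so the order of the two
    alleles does not matter and a homozygous call counts twice.
    """
    correct = 0
    total = 0
    for locus, true_alleles in truth.items():
        t = Counter(truncate_allele(a, fields) for a in true_alleles)
        p = Counter(truncate_allele(a, fields) for a in predicted.get(locus, []))
        correct += sum((t & p).values())  # multiset intersection
        total += sum(t.values())
    return 100.0 * correct / total if total else 0.0
```

For example, a prediction of `A*24:03` against a gold-standard `A*24:02` counts as correct at two-digit resolution but incorrect at four-digit resolution, which is why the two levels are reported separately.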


PeerJ ◽  
2021 ◽  
Vol 9 ◽  
pp. e12446
Author(s):  
Darlene D. Wagner ◽  
Heather A. Carleton ◽  
Eija Trees ◽  
Lee S. Katz

Background: Whole genome sequencing (WGS) has gained increasing importance in responses to enteric bacterial outbreaks. Common analysis procedures for WGS, single nucleotide polymorphism (SNP) analysis and genome assembly, are highly dependent upon WGS data quality. Methods: Raw, unprocessed WGS reads from Escherichia coli, Salmonella enterica, and Shigella sonnei outbreak clusters were characterized for four quality metrics: PHRED score, read length, library insert size, and ambiguous nucleotide composition. PHRED scores were strongly correlated with improved SNP analysis results in E. coli and S. enterica clusters. Results: Assembly quality showed only moderate correlations with PHRED scores and library insert size, and then only for Salmonella. To improve SNP analyses and assemblies, we compared seven read-healing pipelines to improve these four quality metrics and to see how well they improved SNP analysis and genome assembly. The most effective read-healing pipelines for SNP analysis incorporated quality-based trimming, fixed-width trimming, or both. The Lyve-SET SNP pipeline showed a more marked improvement than the CFSAN SNP Pipeline, but the latter performed better on raw, unhealed reads. For genome assembly, SPAdes enabled significant improvements in healed E. coli reads only, while Skesa yielded no significant improvements on healed reads. Conclusions: PHRED scores will continue to be a crucial quality metric, albeit not of equal impact across all types of analyses for all enteric bacteria. While trimming-based read healing performed well for SNP analyses, different read-healing approaches are likely needed for genome assembly or other, emerging WGS analysis methodologies.
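
Three of the four quality metrics named above can be computed from a single FASTQ record, and quality-based trimming can be sketched in a few lines. The example below assumes Phred+33 quality encoding and a simple 3' trimming rule with a hypothetical threshold of Q20; it is not the procedure of any pipeline compared in the study. Library insert size is omitted because it requires paired-end alignment.

```python
from typing import Dict, Tuple


def read_metrics(seq: str, qual: str, offset: int = 33) -> Dict[str, float]:
    """Per-read length, mean PHRED score and fraction of ambiguous (N) bases.

    Assumes Phred+33 encoding: score = ord(char) - 33.
    """
    scores = [ord(c) - offset for c in qual]
    return {
        "length": len(seq),
        "mean_phred": sum(scores) / len(scores) if scores else 0.0,
        "n_fraction": seq.upper().count("N") / len(seq) if seq else 0.0,
    }


def trim_3prime(
    seq: str, qual: str, min_q: int = 20, offset: int = 33
) -> Tuple[str, str]:
    """Remove trailing bases whose PHRED score is below min_q (illustrative rule)."""
    end = len(qual)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]
```

Trimming trades read length for mean quality: removing low-quality tails raises the mean PHRED score of what remains but shortens the read.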


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Tao Jiang ◽  
Shiqi Liu ◽  
Shuqi Cao ◽  
Yadong Liu ◽  
Zhe Cui ◽  
...  

Abstract Background: With the rapid development of long-read sequencing technologies, it is possible to reveal the full spectrum of genetic structural variation (SV). However, the high cost, finite read length and high sequencing error rate of long-read data greatly limit the widespread adoption of SV calling. Therefore, guidance concerning sequencing coverage, read length, and error rate is urgently needed to maintain high SV yields at the lowest possible cost. Results: In this study, we generated a full range of simulated error-prone long-read datasets containing various sequencing settings and comprehensively evaluated the performance of SV calling with state-of-the-art long-read SV detection methods. The benchmark results demonstrate that almost all SV callers perform better when the long-read data reach 20× coverage, 20 kbp average read length, and approximately 10–7.5% or below 1% error rates. Furthermore, high sequencing coverage is the most influential factor in promoting SV calling, although it also directly drives cost. Conclusions: Based on the comprehensive evaluation results, we provide important guidelines for selecting long-read sequencing settings for efficient SV calling. We believe these recommended settings of long-read sequencing will provide valuable guidance for cutting-edge genomic studies and clinical practice.
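
The link between coverage, read length and sequencing effort follows from the standard relation coverage = (number of reads × mean read length) / genome size. The sketch below applies it to the 20× and 20 kbp figures quoted above; the function names and the genome size of roughly 3.1 Gbp for human are assumptions added for illustration, not values from the study.

```python
import math


def coverage(n_reads: int, mean_read_length: float, genome_size: int) -> float:
    """Expected average depth: total sequenced bases divided by genome size."""
    return n_reads * mean_read_length / genome_size


def reads_needed(
    target_coverage: float, mean_read_length: float, genome_size: int
) -> int:
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_size / mean_read_length)


# Assumed human genome size of about 3.1 Gbp (illustrative only).
HUMAN_GENOME_BP = 3_100_000_000

n = reads_needed(target_coverage=20, mean_read_length=20_000,
                 genome_size=HUMAN_GENOME_BP)
print(n)  # number of 20 kbp reads for 20x average coverage
```

Because the required number of reads scales linearly with target coverage, coverage is the setting that most directly drives sequencing cost.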

