Prowler: A novel trimming algorithm for Oxford Nanopore sequence data

2021 ◽  
Author(s):  
Simon Lee ◽  
Loan T. Nguyen ◽  
Ben J. Hayes ◽  
Elizabeth M Ross

Motivation: Quality control (QC) tools are critical in DNA sequencing analysis because they increase the accuracy of sequence alignments and thus the reliability of results. Oxford Nanopore Technologies (ONT) QC is currently rudimentary, generally based on whole-read average quality. This results in discarding reads that contain regions of high-quality sequence. Here we propose Prowler, a multi-window approach inspired by algorithms used to QC short-read data. Importantly, we retain the phase and read length information by optionally replacing trimmed sections with Ns. Results: Prowler was applied to mammalian and bacterial datasets to assess effects on alignment and assembly, respectively. Compared to Nanofilt, alignments of data QCed with Prowler had lower error rates and more mapped reads. Assemblies of Prowler QCed data had a lower error rate than Nanofilt QCed data; however, this came at some cost to assembly contiguity. Availability and implementation: Prowler is implemented in Python and is available at https://github.com/ProwlerForNanopore/ProwlerTrimmer

Author(s):  
Simon Lee ◽  
Loan T Nguyen ◽  
Ben J Hayes ◽  
Elizabeth M Ross

Motivation: Trimming and filtering tools are useful in DNA sequencing analysis because they increase the accuracy of sequence alignments and thus the reliability of results. Oxford Nanopore Technologies (ONT) trimming and filtering tools are currently rudimentary, generally only filtering reads based on whole-read average quality. This results in discarding reads that contain regions of high-quality sequence. Here, we propose Prowler, a trimmer that uses a window-based approach inspired by algorithms used to trim short-read data. Importantly, we retain the phase and read length information by optionally replacing trimmed sections with Ns. Results: Prowler was applied to mammalian and bacterial datasets to assess its effect on alignment and assembly, respectively. Compared to data filtered with Nanofilt, alignments of data trimmed with Prowler had lower error rates and more mapped reads. Assemblies of Prowler-trimmed data had a lower error rate than those filtered with Nanofilt; however, this came at some cost to assembly contiguity. Availability and implementation: Prowler is implemented in Python and is available at https://github.com/ProwlerForNanopore/ProwlerTrimmer. Supplementary information: Supplementary data are available at Bioinformatics online.
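The contrast between whole-read filtering and window-based trimming can be illustrated with a minimal sketch. This is not Prowler's actual algorithm or interface: the function names, the window size, the quality threshold, and the use of a plain arithmetic mean of Phred scores are all assumptions made for illustration only.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def whole_read_passes(quality_string, min_mean_q=7):
    """Whole-read filter: keep the read only if its mean score meets the threshold."""
    scores = phred_scores(quality_string)
    if not scores:
        return False
    return sum(scores) / len(scores) >= min_mean_q


def mask_low_quality_windows(sequence, quality_string, window=100, min_mean_q=7):
    """Hypothetical window-based trimmer: replace each fixed-size window whose
    mean score is below the threshold with Ns, so read length is preserved."""
    scores = phred_scores(quality_string)
    masked = []
    for start in range(0, len(sequence), window):
        chunk_seq = sequence[start:start + window]
        chunk_scores = scores[start:start + window]
        mean_q = sum(chunk_scores) / len(chunk_scores)
        masked.append(chunk_seq if mean_q >= min_mean_q else "N" * len(chunk_seq))
    return "".join(masked)


# Toy read: 4 high-quality bases ('I' = Q40) followed by 28 low-quality bases ('!' = Q0).
read_seq = "ACGT" * 8
read_qual = "IIII" + "!" * 28

kept_by_filter = whole_read_passes(read_qual)            # whole read is discarded
masked_read = mask_low_quality_windows(read_seq, read_qual, window=4)
```

In this toy case the whole-read filter discards the read, while the windowed version keeps the high-quality prefix and pads the rest with Ns, which is the behaviour the abstract describes as retaining read length information.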


2012 ◽  
Vol 62 (2) ◽  
pp. 451-458 ◽  
Author(s):  
Domenico Davolos ◽  
Biancamaria Pietrangeli ◽  
Anna Maria Persiani ◽  
Oriana Maggi

The morphology of three phenetically identical Penicillium isolates, collected from the bioaerosol in a restoration laboratory in Italy, displayed macro- and microscopic characteristics that were similar though not completely ascribable to Penicillium raistrickii. For this reason, a phylogenetic approach based on DNA sequencing analysis was performed to establish both the taxonomic status and the evolutionary relationships of these three peculiar isolates in relation to previously described species of the genus Penicillium. We used four nuclear loci (both rRNA and protein coding genes) that have previously proved useful for the molecular investigation of taxa belonging to the genus Penicillium at various evolutionary levels. The internal transcribed spacer region (ITS1–5.8S–ITS2), domains D1 and D2 of the 28S rDNA, a region of the tubulin beta chain gene (benA) and part of the calmodulin gene (cmd) were amplified by PCR and sequenced. Analysis of the rRNA genes and of the benA and cmd sequence data indicates the presence of three isogenic isolates belonging to a genetically distinct species of the genus Penicillium, here described and named Penicillium simile sp. nov. (ATCC MYA-4591T  = CBS 129191T). This novel species is phylogenetically different from P. raistrickii and other related species of the genus Penicillium (e.g. Penicillium scabrosum), from which it can be distinguished on the basis of morphological trait analysis.


Author(s):  
Bilgenur Baloğlu ◽  
Zhewei Chen ◽  
Vasco Elbrecht ◽  
Thomas Braukmann ◽  
Shanna MacDonald ◽  
...  

Metabarcoding has become a common approach to the rapid identification of the species composition in a mixed sample. The majority of studies use established short-read high-throughput sequencing platforms. The Oxford Nanopore MinION™, a portable sequencing platform, represents a low-cost alternative allowing researchers to generate sequence data in the field. However, a major drawback is the high raw read error rate that can range from 10% to 22%. To test if the MinION™ represents a viable alternative to other sequencing platforms we used rolling circle amplification (RCA) to generate full-length consensus DNA barcodes (658 bp of cytochrome oxidase I - COI) for a bulk mock sample of 50 aquatic invertebrate species. By applying two different laboratory protocols, we generated two MinION™ runs that were used to build consensus sequences. We also developed a novel Python pipeline, ASHURE, for processing, consensus building, clustering, and taxonomic assignment of the resulting reads. We were able to show that it is possible to reduce error rates, reaching a median accuracy of up to 99.3% for long RCA fragments (>45 barcodes). Our pipeline successfully identified all 50 species in the mock community and exhibited comparable sensitivity and accuracy to MiSeq. The use of RCA was integral for increasing consensus accuracy, but it was also the most time-consuming step during the laboratory workflow, and most RCA reads were skewed towards a shorter read length range with a median RCA fragment length of up to 1262 bp. Our study demonstrates that Nanopore sequencing can be used for metabarcoding, but we recommend the exploration of other isothermal amplification procedures to improve consensus length.
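The idea that repeated copies of one barcode within an RCA read can be combined to reduce error can be sketched as a per-column majority vote. This is only an illustration, not the ASHURE implementation: the function name is hypothetical, and the sketch assumes the copies have already been aligned to the same length, with "-" marking gaps.

```python
from collections import Counter


def majority_consensus(aligned_copies):
    """Build a consensus from pre-aligned, equal-length copies of one barcode.

    Each column takes its most common symbol; columns where a gap wins are dropped.
    Ties are resolved in favour of the symbol seen first in that column.
    """
    if not aligned_copies:
        return ""
    length = len(aligned_copies[0])
    if any(len(copy) != length for copy in aligned_copies):
        raise ValueError("all copies must have the same aligned length")
    consensus = []
    for column in zip(*aligned_copies):
        base, _count = Counter(column).most_common(1)[0]
        if base != "-":
            consensus.append(base)
    return "".join(consensus)


# Five noisy copies of the toy barcode "ACGTAC", each with at most one error.
copies = ["ACGTAC", "ACCTAC", "ACGTAA", "ACGTAC", "TCGTAC"]
consensus = majority_consensus(copies)
```

Because the errors in the toy copies fall in different columns, each is outvoted, which is the intuition behind accuracy improving as the number of barcode copies per fragment grows.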


Author(s):  
Amy Gernon ◽  
Ermias Woldu ◽  
Michele Godlevski ◽  
Willie Wilson ◽  
Rodney C. Gilmore ◽  
...  

Demands for higher quantity and quality of sequence data during genome sequencing projects have led to a need for completely automated reagent systems designed to isolate, process, and analyze DNA samples. While much attention has been given to methodologies aimed at increasing the throughput of sample preparation and reaction setup, purification of the products of sequencing reactions has received less scrutiny despite the profound influence that purification has on sequence quality. Commonly used and commercially available sequencing reaction cleanup methods are not optimal for purifying sequencing reactions generated from larger templates, including bacterial artificial chromosomes (BACs) and those generated by rolling circle amplification. Theoretically, these methods would not remove the original template since they only exclude small molecules and retain large molecules in the sample. If the large template remains in the purified sample, it could understandably interfere with electrokinetic injection and capillary performance. We demonstrate that the use of MagneSil® paramagnetic particles (PMPs) to purify ABI PRISM® BigDye® sequencing reactions increases the quality and read length of sequences from large templates. The high-quality sequence data obtained by our procedure is independent of the size of template DNA used and can be completely automated on a variety of automated platforms.


Author(s):  
Yun Gyeong Lee ◽  
Sang Chul Choi ◽  
Yuna Kang ◽  
Kyeong Min Kim ◽  
Chon-Sik Kang ◽  
...  

Whole genome sequencing (WGS) has become a crucial tool to understand genome structure and genetic variation. The MinION sequencing of Oxford Nanopore Technologies (ONT) is an excellent approach for performing WGS and has advantages in comparison with other Next-Generation Sequencing (NGS) platforms: it is relatively inexpensive, portable, has simple library preparation, can be monitored in real time, and has no theoretical limits on read length. Sorghum bicolor (L.) Moench is diploid (2n = 2x = 20) with a genome size of about 730 Mb, and its genome sequence information has been released in the Phytozome database. Therefore, sorghum can be used as a good reference. However, plant species have complex and large genomes compared to animals or microorganisms. As a result, complete genome sequencing is difficult for plant species. MinION sequencing that produces long reads can be an excellent tool to overcome the weak assembly of short reads generated from NGS by minimizing the generation of gaps or covering the repetitive sequence that appears on the plant genome. Here, we conducted the genome sequencing for S. bicolor cv. BTx623 using the MinION platform and obtained 895,678 reads and 17.9 gigabytes (Gb) (ca. 25X coverage of reference) from long-read sequence data. Through de novo assembly using two different tools and mapping of the assembled contigs against the sorghum reference genome, a total of 6,124 contigs (covering 45.9%) were generated from Canu, and a total of 2,661 contigs (covering 50%) were generated from Minimap and Miniasm with a Racon pipeline. Our results provide a pipeline of long-read sequencing analysis for plant species using the MinION platform and a clue to determine the total sequencing scale for optimal coverage based on various genome sizes.
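The reported coverage can be checked with simple arithmetic. The sketch below assumes the 17.9 Gb figure counts sequenced bases (an interpretation, since that is the reading under which the stated ca. 25X follows from a 730 Mb genome); the function names are illustrative.

```python
def sequencing_depth(total_bases, genome_size):
    """Average depth of coverage = total sequenced bases / genome size."""
    return total_bases / genome_size


def bases_for_depth(genome_size, target_depth):
    """Total bases needed to reach a target average depth for a given genome size."""
    return genome_size * target_depth


# Figures from the abstract; treating 17.9 Gb as bases is an assumption.
total_bases = 17_900_000_000
genome_size = 730_000_000
read_count = 895_678

depth = sequencing_depth(total_bases, genome_size)
mean_read_length = total_bases / read_count

print(f"depth: {depth:.1f}X")
print(f"mean read length: {mean_read_length:.0f} bp")
```

Under that assumption the depth comes out at about 24.5X, consistent with the stated ca. 25X, and the mean read length is close to 20 kb. The same relation, inverted in `bases_for_depth`, gives a rough sequencing target for other genome sizes.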


2019 ◽  
Vol 56 (5) ◽  
pp. 1253-1259 ◽  
Author(s):  
Samin Jafari ◽  
Mohammad Ali Oshaghi ◽  
Kamran Akbarzadeh ◽  
Mohammad Reza Abai ◽  
Mona Koosha ◽  
...  

Forensically important flesh flies (Diptera: Sarcophagidae) are often not morphologically distinguishable, especially at the immature stage. In addition, female flies are quite similar in general morphology, making accurate identifications difficult. DNA-based technologies, particularly mitochondrial DNA (mtDNA), have been used for species-level identification. The cytochrome oxidase subunits I and II (COI-COII) sequences of Iranian Sarcophagidae are still unavailable in GenBank. In this study, 648 fly specimens (540 males and 106 females) from the family Sarcophagidae, representing 10 sarcophagid species, including eight forensically important species, were collected from seven locations in five Iranian provinces. Of these, 150 male specimens were identified based on both morphology of male genitalia and DNA sequencing analysis. Sequence data from the COI-COII regions for 10 flesh fly species collected in Iran were generated for the first time. Digestion of the COI-COII region by restriction enzymes RsaI, EcoRV, and HinfI provided distinct restriction fragment length polymorphism profiles among the species and can serve as molecular markers for species determination. Phylogenetic analysis indicated that the COI-COII sequences are helpful for delimitation of sarcophagid species and implementation in forensic entomology. However, the application of the COI-COII fragment as a species identifier requires great caution, and additional species and markers should be studied to ensure accurate species identification in the future.
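How a restriction digest yields a species-specific fragment profile can be sketched as an in-silico digest. The recognition sites used are the standard ones for these enzymes (RsaI GT^AC, EcoRV GAT^ATC, HinfI G^ANTC); the function names and the toy sequences are illustrative assumptions and are not data from the study. The sketch treats the amplicon as a linear molecule and scans one strand only, which suffices because all three sites are palindromic.

```python
import re

# Recognition site and cut offset (number of site bases left of the cut).
ENZYMES = {
    "RsaI": ("GTAC", 2),     # GT^AC
    "EcoRV": ("GATATC", 3),  # GAT^ATC
    "HinfI": ("GANTC", 1),   # G^ANTC, N = any base
}


def cut_positions(sequence, site, offset):
    """Return cut coordinates for every (possibly overlapping) occurrence of site."""
    pattern = site.replace("N", "[ACGT]")
    # A lookahead keeps overlapping matches from being skipped.
    return [m.start() + offset for m in re.finditer(f"(?=({pattern}))", sequence)]


def fragment_lengths(sequence, enzyme):
    """Fragment lengths, in order, from a complete digest of a linear sequence."""
    site, offset = ENZYMES[enzyme]
    cuts = cut_positions(sequence.upper(), site, offset)
    bounds = [0] + cuts + [len(sequence)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


# Toy amplicon (not real COI-COII sequence): one RsaI site and one EcoRV site.
toy_amplicon = "AAAAGTACCCCCGATATCTTTT"
rsai_profile = fragment_lengths(toy_amplicon, "RsaI")
ecorv_profile = fragment_lengths(toy_amplicon, "EcoRV")
hinfi_profile = fragment_lengths(toy_amplicon, "HinfI")
```

Two species whose amplicons differ at a recognition site would give different fragment-length lists for the same enzyme, which is what makes the profiles usable as markers.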


2017 ◽  
Vol 15 (12) ◽  
pp. 857-867
Author(s):  
Plaipol DEDVISITSAKUL ◽  
Sichon HUADRAKSASAT ◽  
Supenya CHITTAPUN ◽  
Theppanya CHAROENRAT ◽  
Chanitchote PIYAPITTAYANUN

C-Phycocyanin, a blue-colored and water-soluble protein, is a class of phycobiliproteins that are the major light-harvesting pigments of a photosynthetic system in cyanobacteria. C-phycocyanins are utilized in many industries, including as natural colorants in food and cosmetics and as antioxidant compounds. However, the uses of C-phycocyanins have been limited due to their vulnerability to high temperatures. Therefore, the objective of this study was to identify and analyze the C-phycocyanin gene isolated from Thermosynechococcus sp. TUBT-T01, living in a hot spring in Surat Thani province, in the hope that this C-phycocyanin would exhibit thermostable properties and that its applications could be expanded over a wide range of industries. In the present study, the polymerase chain reaction of the gene encoding alpha subunits of C-phycocyanin (cpcA) was performed, using primers designed based upon the sequence alignments of cpcA from Thermosynechococcus sp. available in the GenBank database. The putative cpcA, with an approximate size of 500 base pairs, was detected on an agarose gel. The DNA sequencing analysis indicated that the cpcA was 489 base pairs in length, and its nucleotide sequence was 94% identical to those of thermophilic Thermosynechococcus sp. NK55, T. elongatus BP-1, and Synechococcus vulcanus. The deduced amino acid sequence was very similar to those of Thermosynechococcus sp. NK55, T. elongatus BP-1, and S. vulcanus. The data derived from the homologous model revealed that the presence of Asp28, Lys32, and Ser72 in the alpha subunit of C-phycocyanin from Thermosynechococcus sp. TUBT-T01 could account for the high thermostability of this protein.


2016 ◽  
Author(s):  
A. Bernardo Carvalho ◽  
Eduardo G Dupim ◽  
Gabriel Nassar

Genome assembly depends critically on read length. Two recent technologies, PacBio and Oxford Nanopore, produce read lengths above 20 kb, which yield genome assemblies that are vastly superior to those based on Sanger or short reads. However, the very high error rates of both technologies (around 15%-20%) make assembly computationally expensive and imprecise at repeats longer than the read length. Here we show that the efficiency and quality of the assembly of these noisy reads can be significantly improved at minimal cost by leveraging the low error rate and low cost of Illumina short reads. Namely, k-mers from the PacBio raw reads that are not present in the Illumina reads (which account for ~95% of the distinct k-mers) are deemed sequencing errors and ignored at the seed alignment step. By focusing on the ~5% of k-mers that are error-free, read overlap sensitivity is dramatically increased. Equally important, the validation procedure can be extended to exclude repetitive k-mers, which avoids read miscorrection at repeats and further improves the resulting assemblies. We tested the k-mer validation procedure with one long-read technology (PacBio) and one assembler (MHAP/Celera Assembler), but it is likely to yield analogous improvements with alternative long-read technologies and overlappers, such as Oxford Nanopore and BLASR/DAligner.
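The k-mer validation step can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the function names and the `max_count` repeat cutoff are hypothetical, and reverse complements are ignored for brevity.

```python
from collections import Counter


def kmers(sequence, k):
    """Yield every overlapping k-mer of a sequence, left to right."""
    return (sequence[i:i + k] for i in range(len(sequence) - k + 1))


def build_trusted_set(short_reads, k, max_count=None):
    """Collect k-mers seen in accurate short reads.

    If max_count is given, k-mers occurring more often than that are treated
    as repetitive and excluded (the extension mentioned in the abstract).
    """
    counts = Counter()
    for read in short_reads:
        counts.update(kmers(read, k))
    return {
        kmer for kmer, count in counts.items()
        if max_count is None or count <= max_count
    }


def validated_kmers(long_read, trusted, k):
    """Return (position, k-mer) pairs from a noisy long read that are trusted."""
    return [
        (pos, kmer) for pos, kmer in enumerate(kmers(long_read, k))
        if kmer in trusted
    ]


short_reads = ["ACGTTGCA", "GTTGCATC"]   # toy error-free short reads
noisy_long_read = "ACGTAGCA"             # toy long read with one substitution

trusted_all = build_trusted_set(short_reads, k=4)
trusted_unique = build_trusted_set(short_reads, k=4, max_count=1)

seeds = validated_kmers(noisy_long_read, trusted_all, k=4)
unique_seeds = validated_kmers("GTTGCATC", trusted_unique, k=4)
```

Only the k-mers returned by `validated_kmers` would be passed on as seeds to an overlapper; k-mers overlapping the substitution are absent from the short reads and are therefore dropped.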


Gigabyte ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-26
Author(s):  
John M. Sutton ◽  
Joshua D. Millwood ◽  
A. Case McCormack ◽  
Janna L. Fierst

High-quality reference genome sequences are the core of modern genomics. Oxford Nanopore Technologies (ONT) produces inexpensive DNA sequences, but has high error rates, which make sequence assembly and analysis difficult as genome size and complexity increase. Robust experimental design is necessary for ONT genome sequencing and assembly, but few studies have addressed eukaryotic organisms. Here, we present novel results using simulated and empirical ONT and DNA libraries to identify best practices for sequencing and assembly for several model species. We find that the unique error structure of ONT libraries causes errors to accumulate and assembly statistics to plateau as sequence depth increases. High-quality assembled eukaryotic sequences require high-molecular-weight DNA extractions that increase sequence read length, and computational protocols that reduce error through pre-assembly correction and read selection. Our quantitative results will be helpful for researchers seeking guidance for de novo assembly projects.

