Using QC-Blind for Quality Control and Contamination Screening of Bacteria DNA Sequencing Data Without Reference Genome

2019 ◽  
Vol 10 ◽  
Author(s):  
Wang Xi ◽  
Yan Gao ◽  
Zhangyu Cheng ◽  
Chaoyun Chen ◽  
Maozhen Han ◽  
...  
2019 ◽  
Vol 12 (1) ◽  
Author(s):  
Yeye Zhao ◽  
Yuanqing Si ◽  
Longfei Mei ◽  
Jiadi Wu ◽  
Jing Shao ◽  
...  

Abstract. Objectives: The purpose of this experiment is to analyze transcriptome changes in Pseudomonas aeruginosa under the action of sodium houttuyfonate (SH), in order to reveal the possible mechanism by which SH inhibits P. aeruginosa. We analyzed these data to compare the transcriptomes of P. aeruginosa in the SH treatment and blank control groups. Data description: In this project, RNA-seq on the BGISEQ-500 platform was used to sequence the transcriptome of P. aeruginosa, and sequencing data were generated for 8 samples: SH treatment (SH1, SH2, SH3, SH4) and negative control (Control 1, Control 2, Control 3, Control 4). Quality control was carried out on the raw reads to determine whether the sequencing data were suitable for subsequent analysis. In total, 170.53 MB of transcriptome sequencing data were obtained. The filtered clean reads were then aligned to the reference genome for a second round of quality control. After completion, 5938 genes were assembled from the sequencing data. Further quantitative analysis of genes and screening of differentially expressed genes based on expression level revealed 2047 significantly differentially expressed genes under SH treatment, including 368 up-regulated and 1679 down-regulated genes.


2016 ◽  
Author(s):  
Joseph Ward ◽  
Christian Cole ◽  
Melanie Febrer ◽  
Geoffrey Barton

Abstract. Motivation: The current generation of DNA sequencing technologies produces a large amount of data quickly. All of these data need to pass some form of quality control processing and checking before they can be used for any analysis. The large number of samples that are run through Illumina sequencing machines makes the process of quality control an onerous and time-consuming task that requires multiple pieces of information from several sources. Results: AlmostSignificant is an open-source platform for aggregating multiple sources of quality metrics as well as meta-data associated with DNA sequencing runs from Illumina sequencing machines. AlmostSignificant is a graphical platform to streamline the quality control of DNA sequencing data, to collect and store these data for future reference and to collect extra meta-data associated with the sequencing runs to check for errors and monitor the volume of data produced by the associated machines. AlmostSignificant has been used to track the quality of over 80 sequencing runs covering over 2500 samples produced over the last three years. Availability: The code and documentation for AlmostSignificant are freely available at https://github.com/bartongroup/


2021 ◽  
Author(s):  
Jianhong Hu ◽  
Viktoriya Korchina ◽  
Hana Zouk ◽  
Maegan V. Harden ◽  
David Murdock ◽  
...  

Background: Next generation DNA sequencing (NGS) has been rapidly adopted by clinical testing laboratories for detection of germline and somatic genetic variants. The complexity of sample processing in a clinical DNA sequencing laboratory creates multiple opportunities for sample identification errors, demanding stringent quality control procedures. Methods: We utilized DNA genotyping via a 96-SNP PCR panel applied at sample acquisition in comparison to the final sequence, for tracking of sample identity throughout the sequencing pipeline. The 96-SNP PCR panel's inclusion of sex SNPs also provides a mechanism for a genotype-based comparison to recorded sex at sample collection for identification. This approach was implemented in the clinical genomic testing pathways, in the multi-center Electronic Medical Records and Genomics (eMERGE) Phase III program. Results: We identified 110 inconsistencies from 25,015 (0.44%) clinical samples, when comparing the 96-SNP PCR panel data to the test requisition-provided sex. The 96-SNP PCR panel genetic sex predictions were confirmed using additional SNP sites in the sequencing data or high-density hybridization-based genotyping arrays. Results identified clerical errors, samples from transgender participants and stem cell or bone marrow transplant patients and undetermined sample mix-ups. Conclusion: The 96-SNP PCR panel provides a cost-effective, robust tool for tracking samples within DNA sequencing laboratories, while the ability to predict sex from genotyping data provides an additional quality control measure for all procedures, beginning with sample collections. While not sufficient to detect all sample mix-ups, the inclusion of genetic versus reported sex matching can give estimates of the rate of errors in sample collection systems.
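
The sex-concordance check described in this abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the eMERGE pipeline: the record fields (`sample_id`, `recorded_sex`, `genotype_sex`) and both function names are hypothetical, and the toy records are made up. Only the counts 110 and 25,015 come from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleRecord:
    sample_id: str      # hypothetical identifier
    recorded_sex: str   # sex stated on the test requisition, e.g. "F" or "M"
    genotype_sex: str   # sex predicted from the SNP panel, e.g. "F" or "M"


def find_sex_discordances(records):
    """Return the records whose genotype-predicted sex differs from the recorded sex."""
    return [r for r in records if r.recorded_sex != r.genotype_sex]


def discordance_rate_percent(n_discordant, n_total):
    """Discordant samples as a percentage of all samples."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return 100.0 * n_discordant / n_total


# Toy data, invented for illustration only.
toy = [
    SampleRecord("S1", "F", "F"),
    SampleRecord("S2", "M", "F"),  # mismatch: flagged for follow-up
    SampleRecord("S3", "M", "M"),
]
flagged = find_sex_discordances(toy)

# Counts reported in the abstract: 110 inconsistencies among 25,015 samples.
reported_rate = discordance_rate_percent(110, 25_015)
print([r.sample_id for r in flagged], f"{reported_rate:.2f}%")
```

A flagged record only marks a sample for follow-up; as the abstract notes, a mismatch can reflect a clerical error, a transgender participant, a transplant recipient or a sample mix-up.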


2018 ◽  
Author(s):  
Wang Xi ◽  
Yan Gao ◽  
Zhangyu Cheng ◽  
Chaoyun Chen ◽  
Maozhen Han ◽  
...  

Abstract. Quality control in next generation sequencing has become increasingly important as the technique becomes widely used. Tools have been developed for filtering possible contaminants in the sequencing data of species with a known reference genome. Unfortunately, these tools require reference genomes for all the species involved, including the contaminants. This precludes many real-life samples for which the complete genome of the target species is unavailable and which are contaminated with unknown microbial species. In this work we propose QC-Blind, a novel quality control pipeline for removing contaminants without any use of reference genomes. The pipeline requires only a small amount of information from the marker genes of the target species. The entire pipeline consists of unsupervised read assembly, contig binning, read clustering and marker gene assignment. When evaluated on in silico, ab initio and in vivo datasets, QC-Blind proved effective in removing unknown contaminants with high specificity and accuracy, while preserving most of the genomic information of the target bacterial species. Therefore, QC-Blind could serve well in situations where limited information is available for both target and contamination species. Importance: At present, many sequencing projects are still performed on potentially contaminated samples, which brings their accuracy into question. However, current reference-based quality control methods are limited, as they need the genome of either the target species or the contaminants. In this work we propose QC-Blind, a novel quality control pipeline for removing contaminants without any use of reference genomes. When evaluated on in silico, ab initio and in vivo datasets, QC-Blind proved effective in removing unknown contaminants with high specificity and accuracy, while preserving most of the genomic information of the target bacterial species. Therefore, QC-Blind is suitable for real-life samples where limited information is available for both target and contamination species.
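
The final step named in this abstract, marker gene assignment, can be illustrated with a toy sketch. This is not the QC-Blind implementation: the function names, the input dictionaries and the rule of picking the bin with the most marker-gene hits are all assumptions made for illustration, and the assembly, binning and read-mapping steps are assumed to have been run already by other tools.

```python
from collections import Counter


def pick_target_bin(contig_to_bin, marker_hits):
    """Pick the bin whose contigs carry the most target marker-gene hits.

    contig_to_bin: dict mapping contig name -> bin label (output of some binning step)
    marker_hits:   dict mapping contig name -> number of target marker genes found on it
    Returns the winning bin label, or None if no binned contig has a hit.
    """
    hits_per_bin = Counter()
    for contig, n_hits in marker_hits.items():
        if n_hits > 0 and contig in contig_to_bin:
            hits_per_bin[contig_to_bin[contig]] += n_hits
    if not hits_per_bin:
        return None
    return hits_per_bin.most_common(1)[0][0]


def keep_target_reads(read_to_contig, contig_to_bin, target_bin):
    """Keep reads whose contig belongs to the target bin; the rest are treated as contaminants."""
    return [read for read, contig in read_to_contig.items()
            if contig_to_bin.get(contig) == target_bin]


# Toy inputs, invented for illustration only.
contig_to_bin = {"c1": "binA", "c2": "binA", "c3": "binB"}
marker_hits = {"c1": 3, "c3": 1}
read_to_contig = {"r1": "c1", "r2": "c3", "r3": "c2"}

target = pick_target_bin(contig_to_bin, marker_hits)
kept = keep_target_reads(read_to_contig, contig_to_bin, target)
print(target, kept)
```

The point of the sketch is that only marker-gene evidence for the target species is consulted; no reference genome for the target or for any contaminant is needed to decide which reads to keep.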


2021 ◽  
Vol 16 (1) ◽  
Author(s):  
Leah L. Weber ◽  
Mohammed El-Kebir

Abstract. Background: Cancer arises from an evolutionary process where somatic mutations give rise to clonal expansions. Reconstructing this evolutionary process is useful for treatment decision-making as well as understanding evolutionary patterns across patients and cancer types. In particular, classifying a tumor’s evolutionary process as either linear or branched and understanding what cancer types and which patients have each of these trajectories could provide useful insights for both clinicians and researchers. While comprehensive cancer phylogeny inference from single-cell DNA sequencing data is challenging due to limitations with current sequencing technology and the complexity of the resulting problem, current data might provide sufficient signal to accurately classify a tumor’s evolutionary history as either linear or branched. Results: We introduce the Linear Perfect Phylogeny Flipping (LPPF) problem as a means of testing two alternative hypotheses for the pattern of evolution, which we prove to be NP-hard. We develop Phyolin, which uses constraint programming to solve the LPPF problem. Through both in silico experiments and real data application, we demonstrate the performance of our method, outperforming a competing machine learning approach. Conclusion: Phyolin is an accurate, easy to use and fast method for classifying an evolutionary trajectory as linear or branched given a tumor’s single-cell DNA sequencing data.
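
To make the linear-versus-branched distinction concrete, the sketch below checks whether an error-free binary cells-by-mutations matrix is consistent with a linear (chain-like) phylogeny. It is not Phyolin and does not solve the LPPF problem: the flipping of entries named in the problem title is not modeled, and the characterization used here (the sets of cells carrying each mutation must be nested) is an assumption stated for illustration. The function name and toy matrices are hypothetical.

```python
def is_linear_perfect_phylogeny(matrix):
    """Check whether a binary cells-by-mutations matrix fits a linear phylogeny.

    matrix[i][j] == 1 means cell i carries mutation j.
    Under the nesting assumption, the sets of cells carrying each mutation
    must form a chain under inclusion. Sorting the sets by size (largest first)
    and checking each set against its predecessor is enough to verify this.
    """
    if not matrix:
        return True
    n_mutations = len(matrix[0])
    supports = [
        frozenset(i for i, row in enumerate(matrix) if row[j] == 1)
        for j in range(n_mutations)
    ]
    supports.sort(key=len, reverse=True)
    return all(smaller <= larger for larger, smaller in zip(supports, supports[1:]))


# Toy matrices, invented for illustration only.
linear = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
]
branched = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
]
print(is_linear_perfect_phylogeny(linear), is_linear_perfect_phylogeny(branched))
```

In the branched example, mutations 1 and 2 occur in disjoint sets of cells, so neither set contains the other and the chain condition fails.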


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Nae-Chyun Chen ◽  
Brad Solomon ◽  
Taher Mun ◽  
Sheila Iyer ◽  
Ben Langmead

Abstract. Most sequencing data analyses start by aligning sequencing reads to a linear reference genome, but failure to account for genetic variation leads to reference bias and confounding of results downstream. Other approaches replace the linear reference with structures like graphs that can include genetic variation, incurring major computational overhead. We propose the reference flow alignment method that uses multiple population reference genomes to improve alignment accuracy and reduce reference bias. Compared to the graph aligner vg, reference flow achieves a similar level of accuracy and bias avoidance but with 14% of the memory footprint and 5.5 times the speed.


2018 ◽  
Vol 35 (15) ◽  
pp. 2654-2656 ◽  
Author(s):  
Guoli Ji ◽  
Wenbin Ye ◽  
Yaru Su ◽  
Moliang Chen ◽  
Guangzao Huang ◽  
...  

Abstract. Summary: Alternative splicing (AS) is a well-established mechanism for increasing transcriptome and proteome diversity. However, detecting AS events and distinguishing among AS types in organisms without available reference genomes remains challenging. We developed a de novo approach called AStrap for AS analysis without using a reference genome. AStrap identifies AS events by extensive pair-wise alignments of transcript sequences and predicts AS types by a machine-learning model integrating more than 500 assembled features. We evaluated AStrap using collected AS events from reference genomes of rice and human as well as single-molecule real-time sequencing data from Amborella trichopoda. Results show that AStrap can identify many more AS events with comparable or higher accuracy than the competing method. AStrap also possesses a unique feature of predicting AS types, which achieves an overall accuracy of ∼0.87 for different species. Extensive evaluation of AStrap using different parameters, sample sizes and machine-learning models on different species also demonstrates the robustness and flexibility of AStrap. AStrap could be a valuable addition to the community for the study of AS in non-model organisms with limited genetic resources. Availability and implementation: AStrap is available for download at https://github.com/BMILAB/AStrap. Supplementary information: Supplementary data are available at Bioinformatics online.


2015 ◽  
Vol 14 ◽  
pp. CIN.S26470 ◽  
Author(s):  
Richard P. Finney ◽  
Qing-Rong Chen ◽  
Cu V. Nguyen ◽  
Chih Hao Hsu ◽  
Chunhua Yan ◽  
...  

The name Alview is a contraction of the term Alignment Viewer. Alview is a software tool, compiled to native architecture, for visualizing the alignment of sequencing data. Inputs are files of short-read sequences aligned to a reference genome in the SAM/BAM format and files containing reference genome data. Outputs are visualizations of these aligned short reads. Alview is written in portable C with optional graphical user interface (GUI) code written in C, C++, and Objective-C. The application can run in three different ways: as a web server, as a command line tool, or as a native GUI program. Alview is compatible with Microsoft Windows, Linux, and Apple OS X. It is available as a web demo at https://cgwb.nci.nih.gov/cgi-bin/alview. The source code and Windows/Mac/Linux executables are available via https://github.com/NCIP/alview.


2021 ◽  
Author(s):  
Myung-Shin Kim ◽  
Taeyoung Lee ◽  
Jeonghun Baek ◽  
Ji Hong Kim ◽  
Changhoon Kim ◽  
...  

Abstract. Massive resequencing efforts have been undertaken to catalog allelic variants in major crop species including soybean, but the scope of the information for genetic variation often depends on short sequence reads mapped to the extant reference genome. Additional de novo assembled genome sequences provide a unique opportunity to explore a dispensable genome fraction in the pan-genome of a species. Here, we report the de novo assembly and annotation of Hwangkeum, a popular soybean cultivar in Korea. The assembly was constructed using PromethION nanopore sequencing data and two genetic maps, and was then error-corrected using Illumina short reads and PacBio SMRT reads. The 933.12 Mb assembly was annotated with 79,870 transcripts for 58,550 genes using RNA-Seq data and the public soybean annotation set. Comparison of the Hwangkeum assembly with the Williams 82 soybean reference genome sequence revealed 1.8 million single-nucleotide polymorphisms, 0.5 million indels, and 25 thousand putative structural variants. However, there was no natural megabase-scale chromosomal rearrangement. Incidentally, by adding two novel groups, we found that soybean contains four clearly separated groups of centromeric satellite repeats. Analyses of satellite repeats and gene content suggested that the Hwangkeum assembly is a high-quality assembly. This was further supported by comparison of the marker arrangement of anthocyanin biosynthesis genes and of gene arrangement at the Rsv3 locus. Therefore, the results indicate that the de novo assembly of Hwangkeum is a valuable additional reference genome resource for characterizing traits for the improvement of this important crop species.

