The Evidential Statistics of Genetic Assembly: Bootstrapping a Reference Sequence

2021 ◽  
Vol 9 ◽  
Author(s):  
Yukihiko Toquenaga ◽  
Takuya Gagné

Reference sequences play an essential role in genome assembly, much as type specimens do in taxonomy. Yet a reference is itself a sample, obtained at a particular time and location with a specific method. How can we evaluate and distinguish the uncertainties of the reference itself and of the assembly methods? Here we bootstrapped 50 random read data sets from the small circular genome of an Escherichia coli bacteriophage, phiX174, and tried to reconstruct the reference with 14 free assembly programs. Nine of the 14 assembly programs were capable of circular genome reconstruction. Unicycler correctly reconstructed the reference for 44 of the 50 data sets; each contig reconstructed for the six failed data sets had minor defects. The other assembly programs could reconstruct the reference, but with minor defects. The defect regions differed among the assembly programs, and the defect locations were far from randomly distributed in the reference genome. Every Trinity contig included one perfect copy of the reference, and every Minia contig two, in addition to an imperfect copy. For every assembly program except Unicycler, the centroid of the contigs differed from the reference by at most 75 bases. Nonmetric multidimensional scaling (NMDS) plots of the centroids indicated that even the reference sequence was located slightly off from the estimated location of the true reference. We propose that the combination of bootstrapping a reference, making consensus contigs as centroids in an edit distance, and NMDS plotting will provide an evidential-statistics approach to genetic assembly for non-fragmented base sequences.
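
The bootstrapping and centroid steps described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names (`sample_reads`, `edit_distance`, `medoid`) are hypothetical, reads are drawn uniformly and error-free, and the "centroid" is approximated by a medoid (the existing contig with the smallest total edit distance to the others) rather than a constructed consensus sequence. Rotation of circular contigs to a common start position is assumed to have been done beforehand.

```python
import random


def sample_reads(genome, read_len, n_reads, rng):
    """Draw error-free reads uniformly from a circular genome.

    Reads that start near the end wrap around to the beginning,
    which is what makes the sampling 'circular'.
    Assumes read_len <= len(genome).
    """
    doubled = genome + genome[:read_len - 1]
    starts = [rng.randrange(len(genome)) for _ in range(n_reads)]
    return [doubled[s:s + read_len] for s in starts]


def edit_distance(a, b):
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def medoid(contigs):
    """Return the contig with the smallest summed edit distance to all others.

    Assumption: a medoid stands in for the edit-distance centroid here.
    """
    return min(contigs, key=lambda c: sum(edit_distance(c, o) for o in contigs))


# Usage: 50 bootstrap read sets from a toy circular genome.
rng = random.Random(0)
toy_genome = "ACGTACGGTCA"
read_sets = [sample_reads(toy_genome, 5, 20, rng) for _ in range(50)]
```

Each read set would then be passed to an assembler, and the resulting contigs compared with `medoid` or a true consensus; the pairwise edit distances are also the natural input for an NMDS plot.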

2015 ◽  
Vol 2015 ◽  
pp. 1-10 ◽  
Author(s):  
Krisztian Buza ◽  
Bartek Wilczynski ◽  
Norbert Dojer

Background. Next-generation sequencing technologies are now producing multiple times the genome size in total reads from a single experiment. This is enough information to reconstruct at least some of the differences between the individual genome studied in the experiment and the reference genome of the species. However, in most typical protocols, this information is disregarded and the reference genome is used. Results. We provide a new approach that allows researchers to reconstruct genomes very closely related to the reference genome (e.g., mutants of the same species) directly from the reads used in the experiment. Our approach applies de novo assembly software to experimental reads and so-called pseudoreads and uses the resulting contigs to generate a modified reference sequence. In this way, it can very quickly, and at no additional sequencing cost, generate a new, modified reference sequence that is closer to the actual sequenced genome and has full coverage. In this paper, we describe our approach and test its implementation, called RECORD. We evaluate RECORD on both simulated and real data. We made our software publicly available on SourceForge. Conclusion. Our tests show that on closely related sequences RECORD outperforms more general assisted-assembly software.


Elem Sci Anth ◽  
2021 ◽  
Vol 9 (1) ◽  
Author(s):  
Georgios I. Gkatzelis ◽  
Jessica B. Gilman ◽  
Steven S. Brown ◽  
Henk Eskes ◽  
A. Rita Gomes ◽  
...  

The coronavirus disease 2019 (COVID-19) pandemic led to government interventions to limit the spread of the disease that are unprecedented in recent history; for example, stay-at-home orders led to sudden decreases in atmospheric emissions from the transportation sector. In this review article, the current understanding of the influence of emission reductions on atmospheric pollutant concentrations and air quality is summarized for nitrogen dioxide (NO2), particulate matter (PM2.5), ozone (O3), ammonia, sulfur dioxide, black carbon, volatile organic compounds, and carbon monoxide (CO). In the first 7 months following the onset of the pandemic, more than 200 papers were accepted by peer-reviewed journals utilizing observations from ground-based and satellite instruments. Only about one-third of this literature incorporates a specific method for meteorological correction or normalization for comparing data from the lockdown period with prior reference observations, despite the importance of doing so for the interpretation of results. We use the government stringency index (SI) as an indicator for the severity of lockdown measures and show how key air pollutants change as the SI increases. The observed decrease of NO2 with increasing SI is in general agreement with emission inventories that account for the lockdown. Other compounds such as O3, PM2.5, and CO are also broadly covered. Due to the importance of atmospheric chemistry for O3 and PM2.5 concentrations, their responses may not be linear with respect to primary pollutants. At most sites, we found O3 increased, whereas PM2.5 decreased slightly, with increasing SI. Changes of other compounds are found to be understudied. We highlight future research needs for utilizing the emerging data sets as a preview of a future state of the atmosphere in a world with targeted permanent reductions of emissions.
Finally, we emphasize the need to account for the effects of meteorology, emission trends, and atmospheric chemistry when determining the lockdown effects on pollutant concentrations.


BMC Genomics ◽  
2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Ran Li ◽  
Xiaomeng Tian ◽  
Peng Yang ◽  
Yingzhi Fan ◽  
Ming Li ◽  
...  

Background. Non-reference sequences (NRS) represent structural variations in the human genome with potential functional significance. However, besides the known insertions, it is currently unknown whether other types of structural variations with NRS exist. Results. Here, we compared 31 human de novo assemblies with the current reference genome to identify the NRS and their location. We resolved the precise location of 6113 NRS adding up to 12.8 Mb. Besides 1571 insertions, we detected 3041 alternate alleles, which were defined as having less than 90% (or no) identity with the reference alleles. These alternate alleles overlapped with 1143 protein-coding genes including a putative novel MHC haplotype. Further, we demonstrated that the alternate alleles and their flanking regions had a high content of tandem repeats, indicating that their origin was associated with tandem repeats. Conclusions. Our study detected a large number of NRS, including many alternate alleles that were previously uncharacterized. We suggest that the origin of alternate alleles was associated with tandem repeats. Our results enrich the spectrum of genetic variations in the human genome.


2020 ◽  
Vol 11 (1) ◽  
Author(s):  
Karen H. Y. Wong ◽  
Walfred Ma ◽  
Chun-Yu Wei ◽  
Erh-Chan Yeh ◽  
Wan-Jia Lin ◽  
...  

The current human reference genome is predominantly derived from a single individual and it does not adequately reflect human genetic diversity. Here, we analyze 338 high-quality human assemblies of genetically divergent human populations to identify missing sequences in the human reference genome with breakpoint resolution. We identify 127,727 recurrent non-reference unique insertions spanning 18,048,877 bp, some of which disrupt exons and known regulatory elements. To improve genome annotations, we linearly integrate these sequences into the chromosomal assemblies and construct a Human Diversity Reference. Leveraging this reference, an average of 402,573 previously unmapped reads can be recovered for a given genome sequenced to ~40X coverage. Transcriptomic diversity among these non-reference sequences can also be directly assessed. We successfully map tens of thousands of previously discarded RNA-Seq reads to this reference and identify transcription evidence in 4781 gene loci, underlining the importance of these non-reference sequences in functional genomics. Our extensive datasets are important advances toward a comprehensive reference representation of global human genetic diversity.


Author(s):  
Fabian Sievers ◽  
Desmond G Higgins

Motivation. Secondary structure prediction accuracy (SSPA) in the QuanTest benchmark can be used to measure the accuracy of a multiple sequence alignment. SSPA correlates well with the sum-of-pairs (SP) score if the results are averaged over many alignments, but not on an alignment-by-alignment basis. This is due to a sub-optimal selection of reference and non-reference sequences in QuanTest. Results. We develop an improved strategy for selecting reference and non-reference sequences for a new benchmark, QuanTest2. In QuanTest2, SSPA and SP correlate better on an alignment-by-alignment basis than in QuanTest. Guide-trees for QuanTest2 are more balanced with respect to reference sequences than in QuanTest. QuanTest2 scores correlate well with other well-established benchmarks. Availability and implementation. QuanTest2 is available at http://bioinf.ucd.ie/quantest2.tar and comprises reference and non-reference sequence sets and a scoring script. Supplementary information. Supplementary data are available at Bioinformatics online.


Blood ◽  
2005 ◽  
Vol 106 (11) ◽  
pp. 605-605
Author(s):  
Marco A. Marra ◽  
Martin Krzywinski ◽  
Readman Chiu ◽  
Matthew Field ◽  
Inanc Birol ◽  
...  

With the aim of identifying and sequencing mutations in follicular lymphoma genomes, we have begun a project to generate at least 24 deeply redundant sequence-ready Bacterial Artificial Chromosome (BAC)-based whole genome maps, each from a different individual’s lymphoma. BAC-array CGH and Affymetrix whole-genome sampling assays (WGSA) will be used along with the mapping data to identify genomic amplifications and losses in the lymphomas. Results from the mapping and array studies will be used to prioritize BAC clones for sequence analysis. Because each map will span essentially the entire genome of the corresponding lymphoma, we anticipate that essentially all regions of each tumor genome will be represented in easily sequenced BAC clones. This approach facilitates targeted sequencing of genomic regions of interest, including those containing genes relevant to cancer or harboring amplifications or deletions. Our mapping strategy hinges on the successful creation of deeply redundant, high-quality BAC libraries from primary lymphomas and large-scale, high-throughput restriction enzyme fingerprinting of individual BACs with a version of the technology we used to map the human, mouse, rat and other genomes. The effort is large-scale, and will result in the generation of at least 2.5 million fingerprinted BAC clones over the next three years. Using the fingerprints, we will align the BACs to the reference human genome to assess genome coverage and to identify candidate genome rearrangements. In parallel, we will assemble the fingerprints into genome maps, looking for larger-scale genome variations between the lymphoma maps and the reference genome sequence. To test the feasibility of our approach, we obtained two restriction digest fingerprints from each of 140,000 individual BAC clones. BACs were sampled from a 7-fold redundant BAC library that had been created from genomic DNA purified from a primary follicular lymphoma sample.
The fingerprints are being assembled into a clone map with the intent of reconstructing the entire tumor genome. 90,377 fingerprinted clones with unambiguous single alignments to the reference sequence were automatically assembled into 15,538 contigs. Subsequent rounds of semi-automatic contig merging further reduced the number of contigs to 5,433. Only 1,241 clones remained unassembled. We anchored the tumor genome map to the reference human genome sequence by aligning the clone fingerprints to the restriction map computed from the reference sequence assembly. As a result of this, we identified a BAC that captured the canonical t(14;18) translocation characteristic of follicular lymphomas. We sequenced this BAC and confirmed that it contains the expected translocation. Almost 2.6 gigabases (~91%) of the reference genome are represented in the evolving map, with an additional 50,000 clone fingerprints awaiting incorporation into the map assembly. Among these are repeat-rich and other clones that may well harbor genome rearrangements. Additional prioritization of sequencing targets will be undertaken when map construction and analysis of genome copy number alterations are complete.


2014 ◽  
Vol 610 ◽  
pp. 905-909
Author(s):  
Chun Hui Ren ◽  
Zhong Quan He

The capture (acquisition) of long pseudo-codes (PN codes) is one of the most important technologies in spread spectrum systems. Methods such as XFAST and AVERAGE have traditionally been used to solve this problem. A new algorithm is proposed that is based on time-domain samples and a binary search exploiting the autocorrelation of the PN code, and that improves the speed of capturing long PN codes in spread spectrum systems. First, the sample rate of the received spread spectrum signal is reduced to a quarter of the chip rate, determined with a specific method; then the local PN code is divided into four parts, which are accumulated into a new sequence. Finally, the synchronous pseudo-code is captured from the correlation of the two new reference sequences. Compared with conventional methods such as XFAST, capture time and precision are improved.
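
The correlation principle that PN-code capture relies on can be shown with a short Python sketch. This is the plain exhaustive-search baseline, not the folded, binary-search algorithm proposed in the abstract, and it is not XFAST or AVERAGE. The names `m_sequence` and `acquire` are hypothetical, and a 31-chip maximal-length sequence generated by a 5-stage linear feedback shift register is assumed as a stand-in for a long PN code.

```python
def m_sequence(length=31):
    """Generate a 31-chip maximal-length sequence as +1/-1 chips.

    Uses the recurrence a[n] = a[n-3] XOR a[n-5], a 5-stage LFSR
    with period 31 for any non-zero seed.
    """
    bits = [1, 0, 0, 0, 0]
    while len(bits) < length:
        bits.append(bits[-3] ^ bits[-5])
    return [1 - 2 * b for b in bits[:length]]  # map 0 -> +1, 1 -> -1


def acquire(received, code):
    """Return the circular shift of `code` that best matches `received`.

    Exhaustive search: correlate against every candidate code phase
    and keep the phase with the largest correlation.
    """
    n = len(code)

    def corr(shift):
        return sum(received[i] * code[(i - shift) % n] for i in range(n))

    return max(range(n), key=corr)


# Usage: delay the code by 7 chips, corrupt three chips, and recover the phase.
code = m_sequence()
received = code[-7:] + code[:-7]
for k in (2, 11, 20):
    received[k] = -received[k]
phase = acquire(received, code)
```

Because a maximal-length sequence has a circular autocorrelation of 31 at zero shift and -1 at every other shift, the correct phase stands out clearly even with a few corrupted chips. The cost of this baseline grows with the square of the code length, which is the motivation for faster schemes on long codes.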


2016 ◽  
Author(s):  
Zhikai Liang ◽  
James C Schnable

B73 is a variety of maize (Zea mays ssp. mays) widely used in genetic, genomic, and phenotypic research around the world. B73 also served as the reference genotype for the original maize genome sequencing project. The advent of large-scale RNA-sequencing as a method of measuring gene expression presents a unique opportunity to assess the level of relatedness among individuals identified as variety B73. The level of haplotype conservation and divergence across the genome was assessed using 27 RNA-seq data sets from 20 independent research groups in three countries. Several clearly distinct clades were identified among putatively B73 samples. A number of these clades were defined by the presence of clearly defined genomic blocks containing a haplotype which did not match the published B73 reference genome. In a number of cases the relationship among B73 samples generated by different research groups recapitulated mentor/mentee relationships within the maize genetics community. A number of regions with distinct, dissimilar haplotypes were identified in our study. However, considering the age of the B73 accession (greater than 40 years) and the challenges of maintaining isogenic lines of a naturally outcrossing species, a strikingly high overall level of conservation was exhibited among B73 samples from around the globe.


2016 ◽  
Author(s):  
Afif Elghraoui ◽  
Samuel J Modlin ◽  
Faramarz Valafar

The genetic basis of virulence in Mycobacterium tuberculosis has been investigated through genome comparisons of its virulent (H37Rv) and attenuated (H37Ra) sister strains. Such analysis, however, relies heavily on the accuracy of the sequences. While the H37Rv reference genome has had several corrections to date, that of H37Ra is unmodified since its original publication. Here, we report the assembly and finishing of the H37Ra genome from single-molecule, real-time (SMRT) sequencing. Our assembly reveals that the number of H37Ra-specific variants is less than half of what the Sanger-based H37Ra reference sequence indicates, undermining and, in some cases, invalidating the conclusions of several studies. PE_PPE family genes, which are intractable to commonly used sequencing platforms because of their repetitive and GC-rich nature, are overrepresented in the set of genes in which all reported H37Ra-specific variants are contradicted. We discuss how our results change the picture of virulence attenuation and the power of SMRT sequencing for producing high-quality reference genomes.


2020 ◽  
Vol 8 (6) ◽  
pp. 4253-4259

A number of assembly algorithms have emerged, but owing to the constraints of genome sequencing techniques none is perfect. Various methods for comparing assemblers have been developed, but none is yet a recognized standard. The problem of evaluating assemblies of previously unsequenced species has not been considered, because most existing methods for comparing assemblies are applicable only to new assemblies of finished genomes. To compare and evaluate genome assemblies we used QUAST (Quality Assessment Tool), which assesses the quality of leading assembly software by evaluating quality metrics. QUAST can evaluate assemblies both with and without a reference genome, and it is a modern tool for evaluating genome assemblies based on the alignment of contigs to a reference. In this study we demonstrate QUAST performance by comparing several leading genome assemblers on three metagenomic datasets.
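
As an illustration of the kind of reference-free quality metric such an evaluation reports, the following Python sketch computes N50, a contiguity statistic that QUAST includes among its outputs. The function name `n50` and the contig lengths are hypothetical; this is a sketch of the standard definition, not QUAST's own code.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembled bases.
    """
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


# Usage: compare two hypothetical assemblies of the same data set.
assembly_a = [80, 70, 50, 40, 30, 20]
assembly_b = [2, 3, 4, 5, 6, 7, 8, 9, 10]
scores = {"a": n50(assembly_a), "b": n50(assembly_b)}
```

N50 needs no reference genome, so it applies to previously unsequenced species, but it rewards contiguity only: a misassembled contig can raise N50, which is why reference-based alignment metrics are reported alongside it when a reference exists.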

