sequence assembly
Recently Published Documents


TOTAL DOCUMENTS

251
(FIVE YEARS 32)

H-INDEX

41
(FIVE YEARS 4)

2021 ◽  
Vol 22 (S6) ◽  
Author(s):  
Xuan Zhang ◽  
Yuansheng Liu ◽  
Zuguo Yu ◽  
Michael Blumenstein ◽  
Gyorgy Hutvagner ◽  
...  

Abstract Background Genomic reads from sequencing platforms contain random errors. Global correction algorithms have been developed, aiming to rectify all possible errors in the reads using generic genome-wide patterns. However, the non-uniform sequencing depths hinder the global approach to conduct effective error removal. As some genes may get under-corrected or over-corrected by the global approach, we conduct instance-based error correction for short reads of disease-associated genes or pathways. The paramount requirement is to ensure the relevant reads, instead of the whole genome, are error-free to provide significant benefits for single-nucleotide polymorphism (SNP) or variant calling studies on the specific genes. Results To rectify possible errors in the short reads of disease-associated genes, our novel idea is to exploit local sequence features and statistics directly related to these genes. Extensive experiments are conducted in comparison with state-of-the-art methods on both simulated and real datasets of lung cancer associated genes (including single-end and paired-end reads). The results demonstrated the superiority of our method with the best performance on precision, recall and gain rate, as well as on sequence assembly results (e.g., N50, the length of contig and contig quality). Conclusion Instance-based strategy makes it possible to explore fine-grained patterns focusing on specific genes, providing high precision error correction and convincing gene sequence assembly. SNP case studies show that errors occurring at some traditional SNP areas can be accurately corrected, providing high precision and sensitivity for investigations on disease-causing point mutations.


2021 ◽  
Vol 10 (17) ◽  
Author(s):  
Thidathip Wongsurawat ◽  
Nuntaya Punyadee ◽  
Piroon Jenjaroenpun ◽  
Dumrong Mairiang ◽  
Nattaya Tangthawornchaikul ◽  
...  

ABSTRACT We present RNA sequencing data sets and their genome sequence assembly for dengue virus that was isolated from a patient with dengue hemorrhagic fever and serially propagated in Vero cells. RNA sequencing data obtained from the first, third, and fifth passages and their corresponding whole-genome sequences are provided in this work.


Genes ◽  
2021 ◽  
Vol 12 (4) ◽  
pp. 518
Author(s):  
Zequn Chen ◽  
Xiwu Qi ◽  
Xu Yu ◽  
Ying Zheng ◽  
Zhiqi Liu ◽  
...  

Terpenoids are a wide variety of natural products and terpene synthase (TPS) plays a key role in the biosynthesis of terpenoids. Mentha plants are rich in essential oils, whose main components are terpenoids, and their biosynthetic pathways have been basically elucidated. However, there is a lack of systematic identification and study of TPS in Mentha plants. In this work, we genome-widely identified and analyzed the TPS gene family in Mentha longifolia, a model plant for functional genomic research in the genus Mentha. A total of 63 TPS genes were identified in the M. longifolia genome sequence assembly, which could be divided into six subfamilies. The TPS-b subfamily had the largest number of genes, which might be related to the abundant monoterpenoids in Mentha plants. The TPS-e subfamily had 18 members and showed a significant species-specific expansion compared with other sequenced Lamiaceae plant species. The 63 TPS genes could be mapped to nine scaffolds of the M. longifolia genome sequence assembly and the distribution of these genes is uneven. Tandem duplicates and fragment duplicates contributed greatly to the increase in the number of TPS genes in M. longifolia. The conserved motifs (RR(X)8W, NSE/DTE, RXR, and DDXXD) were analyzed in M. longifolia TPSs, and significant differentiation was found between different subfamilies. Adaptive evolution analysis showed that M. longifolia TPSs were subjected to purifying selection after the species-specific expansion, and some amino acid residues under positive selection were identified. Furthermore, we also cloned and analyzed the catalytic activity of a single terpene synthase, MlongTPS29, which belongs to the TPS-b subfamily. MlongTPS29 could encode a limonene synthase and catalyze the biosynthesis of limonene, an important precursor of essential oils from the genus Mentha. This study provides useful information for the biosynthesis of terpenoids in the genus Mentha.


2021 ◽  
Author(s):  
Martin Mascher ◽  
Thomas Wicker ◽  
Jerry Jenkins ◽  
Christopher Plott ◽  
Thomas Lux ◽  
...  

Abstract Sequence assembly of large and repeat-rich plant genomes has been challenging, requiring substantial computational resources and often several complementary sequence assembly and genome mapping approaches. The recent development of fast and accurate long-read sequencing by circular consensus sequencing (CCS) on the PacBio platform may greatly increase the scope of plant pan-genome projects. Here, we compare current long-read sequencing platforms regarding their ability to rapidly generate contiguous sequence assemblies in pan-genome studies of barley (Hordeum vulgare). Most long-read assemblies are clearly superior to the current barley reference sequence based on short-reads. Assemblies derived from accurate long reads excel in most metrics, but the CCS approach was the most cost-effective strategy for assembling tens of barley genomes. A downsampling analysis indicated that 20-fold CCS coverage can yield very good sequence assemblies, while even 5-fold CCS data may capture the complete sequence of most genes. We present an updated reference genome assembly for barley with near-complete representation of the repeat-rich intergenic space. Long-read assembly can underpin the construction of accurate and complete sequences of multiple genomes of a species to build pan-genome infrastructures in Triticeae crops and their wild relatives.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Zachary Deng ◽  
Eric Delwart

Abstract Background Metagenomics is the study of microbial genomes for pathogen detection and discovery in human clinical, animal, and environmental samples via Next-Generation Sequencing (NGS). Metagenome de novo sequence assembly is a crucial analytical step in which longer contigs, ideally whole chromosomes/genomes, are formed from shorter NGS reads. However, the contigs generated from the de novo assembly are often very fragmented and rarely longer than a few kilo base pairs (kb). Therefore, a time-consuming extension process is routinely performed on the de novo assembled contigs. Results To facilitate this process, we propose a new tool for metagenome contig extension after de novo assembly. ContigExtender employs a novel recursive extending strategy that explores multiple extending paths to achieve highly accurate longer contigs. We demonstrate that ContigExtender outperforms existing tools in synthetic, animal, and human metagenomics datasets. Conclusions A novel software tool ContigExtender has been developed to assist and enhance the performance of metagenome de novo assembly. ContigExtender effectively extends contigs from a variety of sources and can be incorporated in most viral metagenomics analysis pipelines for a wide variety of applications, including pathogen detection and viral discovery.


2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Haihe Shi ◽  
Gang Wu

With the continuous development of sequencing technology, the amount of bioinformatics data has increased geometrically, and the massive amount of bioinformatics data puts forward more stringent requirements for sequence assembly problems. The sequence assembly algorithm based on DBG (De Bruijn graph) strategy is a key algorithm in bioinformatics, which is widely used in the domain of gene sequence assembly. Current research on the domain of sequence assembly always focuses on optimization of specific steps to a specific algorithm and lack of research on domain-level high-abstract algorithm frameworks. To some extent, it leads to the redundancy of the sequence assembly algorithm, and some problems may be caused by the artificial selection algorithm. This paper analyzes the domain of DBGSA and establishes a feature model of this domain. Based on the production programming method, the DBGSA algorithm component is interactively designed. With the support of the PAR platform, the DBGSA algorithm component library is formally implemented, and furthermore, the DBGSA component library is used to assemble the specific algorithm. This research adds domain-level research to the domain of sequence assembly and implements the DBGSA component library, which can assemble specific sequence assembly algorithms, ensuring the efficiency of algorithm development and the reliability of assembly generation algorithms. At the same time, it also provides a valuable reference for solving problems in the domain of sequence assembly.


Gigabyte ◽  
2020 ◽  
Vol 2020 ◽  
pp. 1-13
Author(s):  
Shivraj Braich ◽  
Rebecca C. Baillie ◽  
German C. Spangenberg ◽  
Noel O. I. Cogan

Cannabis is a diploid species (2n = 20), the estimated haploid genome sizes of the female and male plants using flow cytometry are 818 and 843 Mb respectively. Although the genome of Cannabis has been sequenced (from hemp, wild and high-THC strains), all assemblies have significant gaps. In addition, there are inconsistencies in the chromosome numbering which limits their use. A new comprehensive draft genome sequence assembly (∼900 Mb) has been generated from the medicinal cannabis strain Cannbio-2, that produces a balanced ratio of cannabidiol and delta-9-tetrahydrocannabinol using long-read sequencing. The assembly was subsequently analysed for completeness by ordering the contigs into chromosome-scale pseudomolecules using a reference genome assembly approach, annotated and compared to other existing reference genome assemblies. The Cannbio-2 genome sequence assembly was found to be the most complete genome sequence available based on nucleotides assembled and BUSCO evaluation in Cannabis sativa with a comprehensive genome annotation. The new draft genome sequence is an advancement in Cannabis genomics permitting pan-genome analysis, genomic selection as well as genome editing.


2020 ◽  
Author(s):  
Shivraj Braich ◽  
Rebecca C. Baillie ◽  
German Spangenberg ◽  
Noel O.I. Cogan

Cannabis is a diploid species (2n = 20), the estimated haploid genome sizes of the female and male plants using flow cytometry are 818 and 843 Mb respectively. Although the genome of Cannabis has been sequenced (from hemp, wild and high-THC strains), all assemblies have significant gaps. In addition, there are inconsistencies in the chromosome numbering which limits their use. A new comprehensive draft genome sequence assembly (~900 Mb) has been generated from the medicinal cannabis strain Cannbio-2, that produces a balanced ratio of cannabidiol and delta-9-tetrahydrocannabinol using long-read sequencing. The assembly was subsequently analysed for completeness by ordering the contigs into chromosome-scale pseudomolecules using a reference genome assembly approach, annotated and compared to other existing reference genome assemblies. The Cannbio-2 genome sequence assembly was found to be the most complete genome sequence available based on nucleotides assembled and BUSCO evaluation in Cannabis sativa with a comprehensive genome annotation. The new draft genome sequence is an advancement in Cannabis genomics permitting pan-genome analysis, genomic selection as well as genome editing.


Sign in / Sign up

Export Citation Format

Share Document