Chloroplast Genomes of Two Species of Cypripedium: Expanded Genome Size and Proliferation of AT-Biased Repeat Sequences

The size of the chloroplast genome (plastome) of autotrophic angiosperms is generally conserved. However, the chloroplast genomes of some lineages are greatly expanded, which may make assembling these genomes from short-read sequencing data more challenging. Here, we present the sequencing, assembly, and annotation of the chloroplast genomes of Cypripedium tibeticum and Cypripedium subtropicum. We assembled the chloroplast genomes of the two species de novo with a combination of short-read Illumina data and long-read PacBio data. The plastomes of the two species are characterized by expanded genome size, proliferated AT-rich repeat sequences, low GC content and gene density, and low substitution rates in the coding genes. The plastomes of C. tibeticum (197,815 bp) and C. subtropicum (212,668 bp) are substantially larger than those of the three species sequenced in previous studies. The plastome of C. subtropicum is the longest reported in Orchidaceae to date. Despite the increase in genome size, the gene order and gene number of the plastomes are conserved, with the exception of an ∼75 kb inversion in the large single-copy (LSC) region shared by the two species. Most striking is the record-setting low GC content of C. subtropicum (28.2%). Moreover, the plastome expansion of the two species is strongly correlated with the proliferation of AT-biased non-coding regions: the non-coding content of C. subtropicum exceeds 57%. The genus provides a typical example of plastome expansion driven by the expansion of non-coding regions. Considering the pros and cons of different sequencing technologies, we recommend hybrid assembly based on long and short reads for sequencing plastomes with AT-biased base composition.
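
As a point of reference for the GC-content figure quoted above, the following minimal Python sketch shows how such a percentage is typically computed from a nucleotide sequence. The function name `gc_content` and the toy sequence are illustrative assumptions; they are not taken from the study, and the authors' own tooling is not described in the abstract.

```python
def gc_content(sequence: str) -> float:
    """Return the fraction of G and C among the A/C/G/T bases of a sequence."""
    seq = sequence.upper()
    # Count only unambiguous bases so that N or gap characters do not skew the ratio.
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return (seq.count("G") + seq.count("C")) / counted


# Hypothetical AT-rich toy sequence, not drawn from either plastome.
toy = "ATATTTAGCATATTAAAT"
print(f"GC content: {gc_content(toy):.1%}")
```

Applied to a full plastome sequence read from a FASTA file, the same ratio would give the genome-wide value; restricting it to coding or non-coding intervals would show how each compartment contributes.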

De Novo Sequencing and Hybrid Assembly of the Biofuel Crop Jatropha curcas L.: Identification of Quantitative Trait Loci for Geminivirus Resistance

Genes ◽

10.3390/genes10010069 ◽

2019 ◽

Vol 10 (1) ◽

pp. 69 ◽

Cited By ~ 9

Author(s):

Nagesh Kancharla ◽

Saakshi Jalali ◽

J. Narasimham ◽

Vinod Nair ◽

Vijay Yepuri ◽

...

Keyword(s):

Ssr Markers ◽

Genome Assembly ◽

Jatropha Curcas ◽

Quantitative Trait ◽

De Novo ◽

Mapping Population ◽

Single Copy ◽

Sequencing Data ◽

De Novo Genome Assembly ◽

Sequencing Technologies

Jatropha curcas is an important perennial, drought-tolerant plant that has been identified as a potential biodiesel crop. We report here the hybrid de novo genome assembly of J. curcas generated using Illumina and PacBio sequencing technologies, and the identification of quantitative trait loci for Jatropha Mosaic Virus (JMV) resistance. In this study, we generated scaffolds of 265.7 Mbp in length, which correspond to 84.8% of the gene space according to Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis. Additionally, 96.4% of predicted protein-coding genes were captured in RNA sequencing data, which supports the accuracy of the assembled genome. The genome was used to identify 12,103 dinucleotide simple sequence repeat (SSR) markers, which were exploited in genetic diversity analysis to identify genetically distinct lines. A total of 207 polymorphic SSR markers were employed to construct a genetic linkage map for JMV resistance, using an interspecific F2 mapping population involving susceptible J. curcas and resistant Jatropha integerrima as parents. Quantitative trait locus (QTL) analysis led to the identification of three minor QTLs for JMV resistance, which were validated in an alternate F2 mapping population. These validated QTLs were utilized in marker-assisted breeding for JMV resistance. Comparative genomics of oil-producing genes across selected oil-producing species revealed 27 conserved genes and 2986 orthologous protein clusters in Jatropha. This reference genome assembly gives insight into the complex genetic structure of Jatropha and serves as a source for the development of agronomically improved virus-resistant and oil-producing lines.
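
To illustrate what a dinucleotide SSR is, the sketch below scans a sequence for tandem repeats of a two-base unit using a regular expression. It is a simplified illustration only: the function name `find_dinucleotide_ssrs` and the threshold of six repeats are assumptions made here, and the abstract does not state which SSR-mining tool or parameters the authors used.

```python
import re


def find_dinucleotide_ssrs(sequence: str, min_repeats: int = 6):
    """Return (start, unit, repeat_count) for each tandem dinucleotide repeat.

    A unit is two different bases (so "AA" is excluded), repeated at least
    `min_repeats` times in a row. The threshold is an assumed default.
    """
    # Group 1 is the two-base unit; the lookahead forces the second base to
    # differ from the first. \1{n,} then requires n further copies of the unit.
    pattern = re.compile(r"(([ACGT])(?!\2)[ACGT])\1{%d,}" % (min_repeats - 1))
    hits = []
    for match in pattern.finditer(sequence.upper()):
        unit = match.group(1)
        hits.append((match.start(), unit, len(match.group(0)) // 2))
    return hits


# Hypothetical toy sequence containing one (AT)6 repeat.
example = "GG" + "AT" * 6 + "CC"
print(find_dinucleotide_ssrs(example))
```

In practice, candidate SSRs found this way would still need flanking primers and polymorphism screening before they could serve as markers in a linkage map.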

W2RAP: a pipeline for high quality, robust assemblies of large complex genomes from short read data

10.1101/110999 ◽

2017 ◽

Cited By ~ 9

Author(s):

Bernardo J. Clavijo ◽

Gonzalo Garcia Accinelli ◽

Jonathan Wright ◽

Darren Heavens ◽

Katie Barr ◽

...

Keyword(s):

De Novo ◽

Low Cost ◽

Cost Effective ◽

Data Generation ◽

Sequencing Data ◽

High Quality ◽

Crop Species ◽

Short Read ◽

Link Type ◽

Sequencing Technologies

Abstract: Producing high-quality whole-genome shotgun de novo assemblies from plant and animal species with large and complex genomes using low-cost short-read sequencing technologies remains a challenge. But when the right sequencing data, with appropriate quality control, are assembled using approaches focused on robustness of the process rather than maximization of a single metric such as the usual contiguity estimators, good-quality assemblies with informative value for comparative analyses can be produced. Here we present a complete method, described from data generation and QC all the way up to scaffolding of complex genomes using Illumina short reads, and its application to plant and human datasets. We show how to use the w2rap pipeline following a metric-guided approach to produce cost-effective assemblies. The assemblies are highly accurate, provide good coverage of the genome and show good short-range contiguity. Our pipeline has already enabled the rapid, cost-effective generation of de novo genome assemblies from large, polyploid crop species with a focus on comparative genomics. Availability: w2rap is available under the MIT license, with some subcomponents under GPL licenses. A ready-to-run Docker image with all software prerequisites and example data is also available. http://github.com/bioinfologics/w2rap http://github.com/bioinfologics/w2rap-contigger

A comparative analysis of the complete chloroplast genomes of three Chrysanthemum boreale strains

PeerJ ◽

10.7717/peerj.9448 ◽

2020 ◽

Vol 8 ◽

pp. e9448

Author(s):

Swati Tyagi ◽

Jae-A Jung ◽

Jung Sun Kim ◽

So Youn Won

Keyword(s):

Phylogenetic Analysis ◽

De Novo ◽

Evolutionary Relationship ◽

Gc Content ◽

Rrna Genes ◽

Trna Genes ◽

Nucleotide Polymorphisms ◽

Coding Regions ◽

Chloroplast Genomes ◽

Chrysanthemum Boreale

Background: Chrysanthemum boreale Makino (Anthemideae, Asteraceae) is a plant of economic, ornamental and medicinal importance. We characterized and compared the chloroplast genomes of three C. boreale strains. These were collected from different geographic regions of Korea and varied in floral morphology. Methods: The chloroplast genomes were obtained by next-generation sequencing techniques, assembled de novo, annotated, and compared with one another. Phylogenetic analysis placed them within the Anthemideae tribe. Results: The sizes of the complete chloroplast genomes of the C. boreale strains were 151,012 bp (strain 121002), 151,098 bp (strain IT232531) and 151,010 bp (strain IT301358). Each genome contained 80 unique protein-coding genes, 4 rRNA genes and 29 tRNA genes. Comparative analyses revealed a high degree of conservation in the overall sequence, gene content, gene order and GC content among the strains. We identified 298 single nucleotide polymorphisms (SNPs) and 106 insertions/deletions (indels) in the chloroplast genomes. These variations were more abundant in non-coding regions than in coding regions. Long dispersed repeats and simple sequence repeats were present in both coding and noncoding regions, with greater frequency in the latter. Regardless of their location, these repeats can be used for molecular marker development. Phylogenetic analysis revealed the evolutionary relationship of the species in the Anthemideae tribe. The three complete chloroplast genomes will be valuable genetic resources for studying the population genetics and evolutionary relationships of Asteraceae species.

Complete Chloroplast Genome Sequencing and Phylogenetic Analysis of Two Dracocephalum Plants

BioMed Research International ◽

10.1155/2020/4374801 ◽

2020 ◽

Vol 2020 ◽

pp. 1-9

Author(s):

Junjun Yao ◽

Fangyu Zhao ◽

Yuanjiang Xu ◽

Kaihui Zhao ◽

Hong Quan ◽

...

Keyword(s):

Phylogenetic Analysis ◽

Chloroplast Genome ◽

De Novo ◽

Gc Content ◽

Single Copy ◽

Rrna Genes ◽

Trna Genes ◽

Complete Chloroplast Genome ◽

Ssr Analysis ◽

Chloroplast Genomes

Dracocephalum tanguticum and Dracocephalum moldavica are important herbs from Lamiaceae and have great medicinal value. We used Illumina sequencing technology to sequence the complete chloroplast genomes of D. tanguticum and D. moldavica and then conducted de novo assembly. The two chloroplast genomes have a typical quadripartite structure, with genome lengths given as 82,221 bp and 81,450 bp, large single-copy (LSC) region lengths of 82,221 bp and 81,450 bp, small single-copy (SSC) region lengths of 17,363 bp and 17,066 bp, and inverted repeat (IR) region lengths of 51,370 bp and 51,352 bp, respectively. The GC content of the two chloroplast genomes was 37.80% and 37.83%, respectively. The chloroplast genomes of the two plants encode 133 and 132 genes, respectively, comprising 88 and 87 protein-coding genes, respectively, as well as 37 tRNA genes and 8 rRNA genes. The rps2 gene is present in D. tanguticum but was not found in D. moldavica. Through SSR analysis, we also found 6 mutation hotspot regions, which can be used as molecular markers for taxonomic studies. Phylogenetic analysis showed that Dracocephalum was more closely related to Mentha.
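
The first pair of lengths quoted in the abstract above coincides with the LSC pair, so it cannot be cross-checked directly against the regions. A quadripartite plastome is the sum of its LSC, SSC and two IR copies, and the short Python sketch below adds up the quoted region lengths. It assumes that the IR figures (51,370 bp and 51,352 bp) are the combined length of both copies, which the abstract does not state; the names `regions` and `total_length` are illustrative. Under that assumption the regions sum to 150,954 bp and 149,868 bp.

```python
# Region lengths in bp, copied from the abstract.
regions = {
    "D. tanguticum": {"LSC": 82_221, "SSC": 17_363, "IR": 51_370},
    "D. moldavica": {"LSC": 81_450, "SSC": 17_066, "IR": 51_352},
}


def total_length(parts: dict, ir_is_combined: bool = True) -> int:
    """Sum LSC + SSC + IR for a quadripartite plastome.

    ir_is_combined=True treats the IR value as both copies together (an
    assumption made here); set it to False if the value is a single copy.
    """
    ir = parts["IR"] if ir_is_combined else 2 * parts["IR"]
    return parts["LSC"] + parts["SSC"] + ir


for species, parts in regions.items():
    print(f"{species}: {total_length(parts):,} bp")
```

If the IR values were instead per-copy lengths, calling `total_length(parts, ir_is_combined=False)` would give the corresponding totals, which would be unusually large for Lamiaceae plastomes.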

Comparative Analysis of the Complete Chloroplast Genome of Mainland Aster spathulifolius and Other Aster Species

Plants ◽

10.3390/plants9050568 ◽

2020 ◽

Vol 9 (5) ◽

pp. 568

Author(s):

Swati Tyagi ◽

Jae-A Jung ◽

Jung Sun Kim ◽

So Youn Won

Keyword(s):

Chloroplast Genome ◽

Gc Content ◽

Single Copy ◽

The Other ◽

Complete Chloroplast Genome ◽

Coding Regions ◽

Chloroplast Genomes ◽

Redundant Genes ◽

Intron Structure ◽

Contraction And Expansion

Aster spathulifolius, a common ornamental and medicinal plant, is widely distributed in Korea and Japan, and is genetically classified into mainland and island types. Here, we sequenced the whole chloroplast genome of mainland A. spathulifolius and compared it with those of the island type and other Aster species. The chloroplast genome of mainland A. spathulifolius is 152,732 bp with a conserved quadripartite structure, has 37.28% guanine-cytosine (GC) content, and contains 114 non-redundant genes. Comparison of the chloroplast genomes between the two A. spathulifolius lines and the other Aster species revealed that their sequences, GC contents, gene contents and orders, and exon-intron structure were well conserved; however, differences were observed in their lengths, repeat sequences, and the contraction and expansion of the inverted repeats. The variations were mostly in the single-copy regions and non-coding regions, which, together with the detected simple sequence repeats, could be used for the development of molecular markers to distinguish between these plants. All Aster species clustered into a monophyletic group, but the chloroplast genome of mainland A. spathulifolius was more similar to the other Aster species than to that of the island A. spathulifolius. The accD and ndhF genes were detected to be under positive selection within the Aster lineage compared to other related taxa. The complete chloroplast genome of mainland A. spathulifolius presented in this study will be helpful for species identification and the analysis of the genetic diversity, evolution, and phylogenetic relationships in the Aster genus and the Asteraceae.

Comprehensive identification of transposable element insertions using multiple sequencing technologies

Nature Communications ◽

10.1038/s41467-021-24041-8 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Chong Chu ◽

Rebeca Borges-Monroy ◽

Vinayak V. Viswanadham ◽

Soohyun Lee ◽

Heng Li ◽

...

Keyword(s):

Transposable Element ◽

Structure And Function ◽

Endogenous Retroviruses ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Short Read ◽

Sequencing Technologies ◽

Long Read ◽

And Function

Abstract: Transposable elements (TEs) help shape the structure and function of the human genome. When inserted into some locations, TEs may disrupt gene regulation and cause diseases. Here, we present xTea (x-Transposable element analyzer), a tool for identifying TE insertions in whole-genome sequencing data. Whereas existing methods are mostly designed for short-read data, xTea can be applied to both short-read and long-read data. Our analysis shows that xTea outperforms other short read-based methods for both germline and somatic TE insertion discovery. With long-read data, we created a catalogue of polymorphic insertions with full assembly and annotation of insertional sequences for various types of retroelements, including pseudogenes and endogenous retroviruses. Notably, we find that individual genomes have an average of nine groups of full-length L1s in centromeres, suggesting that centromeres and other highly repetitive regions such as telomeres are a significant yet unexplored source of active L1s. xTea is available at https://github.com/parklab/xTea.

Facile, High Quality Sequencing of Bacterial Genomes from Small Amounts of DNA

International Journal of Genomics ◽

10.1155/2014/434575 ◽

2014 ◽

Vol 2014 ◽

pp. 1-8

Author(s):

Momchilo Vuyisich ◽

Ayesha Arefin ◽

Karen Davenport ◽

Shihai Feng ◽

Cheryl Gleasner ◽

...

Keyword(s):

Genomic Dna ◽

De Novo ◽

Gc Content ◽

Library Preparation ◽

Sequencing Data ◽

Bacterial Genomes ◽

Dna Amount ◽

High Quality ◽

Preparation Methods

Sequencing bacterial genomes has traditionally required large amounts of genomic DNA (~1 μg). There have been few studies to determine the effects of the input DNA amount or library preparation method on the quality of sequencing data. Several new commercially available library preparation methods enable shotgun sequencing from as little as 1 ng of input DNA. In this study, we evaluated the NEBNext Ultra library preparation reagents for sequencing bacterial genomes. We have evaluated the utility of NEBNext Ultra for resequencing and de novo assembly of four bacterial genomes and compared its performance with the TruSeq library preparation kit. The NEBNext Ultra reagents enable high quality resequencing and de novo assembly of a variety of bacterial genomes when using 100 ng of input genomic DNA. For the two most challenging genomes (Burkholderia spp.), which have the highest GC content and are the longest, we also show that the quality of both resequencing and de novo assembly is not decreased when only 10 ng of input genomic DNA is used.

CAMISIM: Simulating metagenomes and microbial communities

10.1101/300970 ◽

2018 ◽

Cited By ~ 4

Author(s):

Adrian Fritz ◽

Peter Hofmann ◽

Stephan Majda ◽

Eik Dahms ◽

Johannes Dröge ◽

...

Keyword(s):

Microbial Communities ◽

De Novo ◽

Real Data ◽

Small Data ◽

Data Sets ◽

Sequencing Data ◽

Taxonomic Profiling ◽

Benchmark Data ◽

Sequencing Technologies ◽

Wide Range

Shotgun metagenome data sets of microbial communities are highly diverse, not only due to the natural variation of the underlying biological systems, but also due to differences in laboratory protocols, replicate numbers, and sequencing technologies. Accordingly, to effectively assess the performance of metagenomic analysis software, a wide range of benchmark data sets are required. Here, we describe the CAMISIM microbial community and metagenome simulator. The software can model different microbial abundance profiles, multi-sample time series and differential abundance studies, includes real and simulated strain-level diversity, and generates second and third generation sequencing data from taxonomic profiles or de novo. Gold standards are created for sequence assembly, genome binning, taxonomic binning, and taxonomic profiling. CAMISIM generated the benchmark data sets of the first CAMI challenge. For two simulated multi-sample data sets of the human and mouse gut microbiomes we observed high functional congruence to the real data. As further applications, we investigated the effect of varying evolutionary genome divergence, sequencing depth, and read error profiles on two popular metagenome assemblers, MEGAHIT and metaSPAdes, on several thousand small data sets generated with CAMISIM. CAMISIM can simulate a wide variety of microbial communities and metagenome data sets together with truth standards for method evaluation. All data sets and the software are freely available at: https://github.com/CAMI-challenge/CAMISIM

Mapping and phasing of structural variation in patient genomes using nanopore sequencing

10.1101/129379 ◽

2017 ◽

Cited By ~ 4

Author(s):

Mircea Cretu Stancu ◽

Markus J. van Roosmalen ◽

Ivo Renkens ◽

Marleen Nieboer ◽

Sjors Middelkamp ◽

...

Keyword(s):

Single Molecule ◽

De Novo ◽

Structural Variants ◽

Human Genetic Disease ◽

Structural Genomic ◽

Short Read ◽

Sequencing Technologies ◽

Genome Wide ◽

Long Read ◽

Complex Structural

Abstract: Structural genomic variants form a common type of genetic alteration underlying human genetic disease and phenotypic variation. Despite major improvements in genome sequencing technology and data analysis, the detection of structural variants still poses challenges, particularly when variants are of high complexity. Emerging long-read single-molecule sequencing technologies provide new opportunities for detection of structural variants. Here, we demonstrate sequencing of the genomes of two patients with congenital abnormalities using the ONT MinION at 11x and 16x mean coverage, respectively. We developed a bioinformatic pipeline, NanoSV, to efficiently map genomic structural variants (SVs) from the long-read data. We demonstrate that the nanopore data are superior to corresponding short-read data with regard to detection of de novo rearrangements originating from complex chromothripsis events in the patients. Additionally, genome-wide surveillance of SVs revealed 3,253 (33%) novel variants that were missed in short-read data of the same sample, the majority of which are duplications < 200 bp in size. Long sequencing reads enabled efficient phasing of genetic variations, allowing the construction of genome-wide maps of phased SVs and SNVs. We employed read-based phasing to show that all de novo chromothripsis breakpoints occurred on paternal chromosomes and we resolved the long-range structure of the chromothripsis. This work demonstrates the value of long-read sequencing for screening whole genomes of patients for complex structural variants.

Complete Chloroplast Genomes from Sanguisorba: Identity and Variation Among Four Species

Molecules ◽

10.3390/molecules23092137 ◽

2018 ◽

Vol 23 (9) ◽

pp. 2137 ◽

Cited By ~ 6

Author(s):

Xiang-Xiao Meng ◽

Yan-Fang Xian ◽

Li Xiang ◽

Dong Zhang ◽

Yu-Hua Shi ◽

...

Keyword(s):

Gc Content ◽

Single Copy ◽

Rrna Genes ◽

Trna Genes ◽

Protein Coding ◽

Future Studies ◽

Chloroplast Genomes ◽

Close Relationship ◽

Cp Genome ◽

Sanguisorba Officinalis

The genus Sanguisorba, which contains about 30 species around the world and seven species in China, is the source of the medicinal plant Sanguisorba officinalis, which is commonly used as a hemostatic agent as well as to treat burns and scalds. Here we report the complete chloroplast (cp) genome sequences of four Sanguisorba species (S. officinalis, S. filiformis, S. stipulata, and S. tenuifolia var. alba). These four Sanguisorba cp genomes exhibit typical quadripartite and circular structures, and are 154,282 to 155,479 bp in length, consisting of large single-copy regions (LSC; 84,405–85,557 bp), small single-copy regions (SSC; 18,550–18,768 bp), and a pair of inverted repeats (IRs; 25,576–25,615 bp). The average GC content was ~37.24%. The four Sanguisorba cp genomes harbored 112 different genes arranged in the same order; these identical sections include 78 protein-coding genes, 30 tRNA genes, and four rRNA genes, if duplicated genes in IR regions are counted only once. A total of 39–53 long repeats and 79–91 simple sequence repeats (SSRs) were identified in the four Sanguisorba cp genomes, which provides opportunities for future studies of the population genetics of Sanguisorba medicinal plants. A phylogenetic analysis using the maximum parsimony (MP) method strongly supports a close relationship between S. officinalis and S. tenuifolia var. alba, followed by S. stipulata, and finally S. filiformis. The availability of these cp genomes provides valuable genetic information for future studies of Sanguisorba identification and provides insights into the evolution of the genus Sanguisorba.
