Chloroplast Genome Sequencing, Comparative Analysis, and Discovery of Unique Cytoplasmic Variants in Pomegranate (Punica granatum L.)

Here we report a comprehensive chloroplast (cp) genome analysis of 16 pomegranate (Punica granatum L.) genotypes representing commercial cultivars and ornamental and wild types, based on large-scale sequencing and assembly using next-generation sequencing (NGS) technology. Comparative genome analysis revealed that the size of the cp genomes varied from 158,593 bp (in the wild genotypes “1201” and “1181”) to 158,662 bp (in the cultivar “Gul-e-Shah Red”), with characteristic quadripartite structures separated by a pair of inverted repeats (IRs). High conservation was observed across all the cp genomes in the total number and sizes of coding and non-coding genes (rRNA and tRNA) and in the IRs (IR-A and IR-B). Interestingly, comparatively high variation was observed in the sizes of the large single copy (LSC, 88,976 to 89,044 bp) and small single copy (SSC, 18,682 to 18,684 bp) regions. Although the structural organization of the newly assembled cp genomes was comparable to that of previously reported cp genomes of pomegranate (“Helow,” “Tunisia,” and “Bhagawa”), striking differences were observed with the Lagerstroemia lines, viz., Lagerstroemia intermedia (NC_0346620) and Lagerstroemia speciosa (NC_031414), which confirmed previous findings. Furthermore, phylogenetic analysis revealed that members outside the genus Punica were grouped into a separate clade. The contraction and expansion analysis revealed that structural variations in the IRs, LSC, and SSC have contributed significantly to the evolution of the cp genomes of Punica and L. intermedia over time. A microsatellite survey across the cp genomes identified a total of 233 to 234 SSRs, the majority of them being mono- (A/T or C/G, 164–165), followed by di- (AT/AT or AG/CT, 54), tri- (6), tetra- (8), and pentanucleotides (1). Furthermore, comparative structural variant analyses across the cp genomes identified many variety-specific SNP/indel markers.
In summary, our study developed large-scale cp genomic resources to support future genetic, taxonomic, and phylogenetic studies in pomegranate.
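
To illustrate what a microsatellite survey of this kind involves, the following is a minimal Python sketch that scans a sequence for perfect SSRs with motifs of 1 to 5 nucleotides. The abstract does not name the tool or the thresholds that were used, so the minimum repeat counts, the function names, and the toy sequence are all assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical minimum repeat counts per motif length (NOT taken from the study).
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3}


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (e.g. 'ATAT')."""
    k = len(motif)
    return not any(k % d == 0 and motif == motif[:d] * (k // d) for d in range(1, k))


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for k, min_n in sorted(min_repeats.items()):
        i = 0
        while i + k <= len(seq):
            motif = seq[i:i + k]
            if set(motif) <= set("ACGT") and is_primitive(motif):
                # Count how many times the motif repeats back to back from i.
                n = 1
                while seq[i + n * k:i + (n + 1) * k] == motif:
                    n += 1
                if n >= min_n:
                    hits.append((i, motif, n))
                    i += n * k  # jump past the repeat just recorded
                    continue
            i += 1
    return sorted(hits)


# Toy sequence (hypothetical): a 12-base A run and six AT repeats.
toy = "GC" + "A" * 12 + "GCGC" + "AT" * 6 + "CCG"
ssrs = find_ssrs(toy)
by_motif_length = Counter(len(motif) for _, motif, _ in ssrs)
print(ssrs)
print(dict(by_motif_length))
```

Tallying hits by motif length, as in the last lines, gives the kind of mono-/di-/tri-nucleotide breakdown the abstract reports. Dedicated SSR tools also handle compound and imperfect repeats, which this sketch ignores.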

Complete Chloroplast Genomes of Chlorophytum comosum and Chlorophytum gallabatense: Genome Structures, Comparative and Phylogenetic Analysis

Plants ◽

10.3390/plants9030296 ◽

2020 ◽

Vol 9 (3) ◽

pp. 296 ◽

Cited By ~ 3

Author(s):

Jacinta N. Munyao ◽

Xiang Dong ◽

Jia-Xin Yang ◽

Elijah M. Mbandi ◽

Vincent O. Wanga ◽

...

Keyword(s):

Phylogenetic Analysis ◽

Single Copy ◽

rRNA Genes ◽

tRNA Genes ◽

Important Species ◽

Genomic Resources ◽

Phylogenetic Studies ◽

Chloroplast Genomes ◽

Cp Genome ◽

High Level

The genus Chlorophytum includes many economically important species well known for their medicinal, ornamental, and horticultural value. However, few molecular genomic resources have been reported for this genus to date. Consequently, knowledge of its phylogeny is limited, and the only available chloroplast (cp) genome of Chlorophytum (C. rhizopendulum) does not provide enough information on the genus. In this study, we present genomic resources for C. comosum and C. gallabatense, whose cp genomes were 154,248 and 154,154 base pairs (bp) long, respectively. Each had a pair of inverted repeats (IRa and IRb) of 26,114 and 26,254 bp each, separating the large single-copy (LSC) region of 84,004 and 83,686 bp from the small single-copy (SSC) region of 18,016 and 17,960 bp in C. comosum and C. gallabatense, respectively. There were 112 distinct genes in each cp genome, comprising 78 protein-coding genes, 30 tRNA genes, and four rRNA genes. Comparative analysis with five other selected species showed a generally high level of similarity in structural organization, gene content, and arrangement. Additionally, the phylogenetic analysis confirmed the previous phylogeny and produced a tree with similar topology. It showed that the Chlorophytum species (C. comosum, C. gallabatense, and C. rhizopendulum) clustered together in the same clade and were more closely related to Anthericum ramosum than to other plants. This research therefore provides valuable records for further molecular evolutionary and phylogenetic studies, helping to fill the gap in genomic resources and to resolve the taxonomic complexes of the genus.
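
The region sizes quoted above can be cross-checked against the reported genome lengths, since a quadripartite cp genome consists of one LSC, one SSC, and two IR copies. The short Python sketch below performs that check. The class and method names are hypothetical, and only the numbers come from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpGenome:
    """Region sizes (bp) of a quadripartite chloroplast genome."""
    name: str
    total: int  # reported genome length
    lsc: int    # large single-copy region
    ssc: int    # small single-copy region
    ir: int     # length of ONE inverted repeat copy

    def expected_total(self) -> int:
        # One LSC + one SSC + two IR copies (IRa and IRb).
        return self.lsc + self.ssc + 2 * self.ir

    def is_consistent(self) -> bool:
        return self.expected_total() == self.total


# Values as quoted in the abstract.
genomes = [
    CpGenome("C. comosum", total=154_248, lsc=84_004, ssc=18_016, ir=26_114),
    CpGenome("C. gallabatense", total=154_154, lsc=83_686, ssc=17_960, ir=26_254),
]

for g in genomes:
    status = "consistent" if g.is_consistent() else "MISMATCH"
    print(f"{g.name}: LSC + SSC + 2*IR = {g.expected_total():,} bp ({status})")
```

Both genomes pass the check, which indicates that the four region sizes and the total lengths in the abstract agree with each other.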

Massive gene presence/absence variation in the mussel genome as an adaptive strategy: first evidence of a pan-genome in Metazoa

10.1101/781377 ◽

2019 ◽

Cited By ~ 7

Author(s):

Marco Gerdol ◽

Rebeca Moreira ◽

Fernando Cruz ◽

Jessica Gómez-Garrido ◽

Anna Vlasova ◽

...

Keyword(s):

Large Scale ◽

Single Copy ◽

Genomic Diversity ◽

Small Scale ◽

Nucleotide Polymorphisms ◽

Structural Variations ◽

Pan Genome ◽

Adaptive Value ◽

Mediterranean Mussel ◽

The Mediterranean

Mussels are ecologically and economically relevant edible marine bivalves, highly invasive and resilient to biotic and abiotic stressors that cause recurrent massive mortalities in other species. Here we show that the Mediterranean mussel Mytilus galloprovincialis has a complex pan-genomic architecture, which includes a core set of 45,000 genes shared by all individuals plus a surprisingly high number of dispensable genes (∼15,000). The latter are subject to presence/absence variation (PAV), i.e., they may be entirely missing in a given individual and, when present, they are frequently found as a single copy. The enrichment of dispensable genes in survival functions suggests an adaptive value for PAV, which might be the key to explaining the extraordinary capabilities of adaptation and invasiveness of this species. Our study underpins a unique metazoan pan-genome architecture previously described only in prokaryotes and in a few non-metazoan eukaryotes, but that might also characterize other marine invertebrates. Significance statement: In animals, intraspecific genomic diversity is generally thought to derive from relatively small-scale variants, such as single nucleotide polymorphisms, small indels, duplications, inversions, and translocations. On the other hand, large-scale structural variations that involve the loss of genomic regions encoding protein-coding genes in some individuals (i.e., presence/absence variation, PAV) have so far been described only in bacteria and, occasionally, in plants and fungi. Here we report the first evidence of a pan-genome in the animal kingdom, revealing that 25% of the genes of the Mediterranean mussel are subject to PAV. We show that this unique feature might have an adaptive value, due to the involvement of dispensable genes in functions related to defense and survival.
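
As a rough illustration of the core/dispensable distinction described above, the Python sketch below splits genes by whether every individual carries them. It then checks that the gene counts quoted in the abstract match the stated 25% figure. The function name and the toy presence/absence data are hypothetical, and the authors' actual pipeline is not described here.

```python
def split_pan_genome(carriers, individuals):
    """Split genes into core (present in every individual) and dispensable.

    carriers maps each gene to the set of individuals in which it was found.
    """
    everyone = set(individuals)
    core = {gene for gene, who in carriers.items() if everyone <= set(who)}
    dispensable = set(carriers) - core
    return core, dispensable


# Hypothetical toy data: four genes scored in three individuals.
individuals = ["ind1", "ind2", "ind3"]
carriers = {
    "geneA": {"ind1", "ind2", "ind3"},
    "geneB": {"ind1", "ind3"},          # missing in ind2 -> subject to PAV
    "geneC": {"ind1", "ind2", "ind3"},
    "geneD": {"ind2"},                  # present in a single individual
}
core, dispensable = split_pan_genome(carriers, individuals)
print("core:", sorted(core))
print("dispensable:", sorted(dispensable))

# Gene counts quoted in the abstract (the dispensable count is approximate).
CORE_GENES = 45_000
DISPENSABLE_GENES = 15_000
pav_fraction = DISPENSABLE_GENES / (CORE_GENES + DISPENSABLE_GENES)
print(f"share of genes subject to PAV: {pav_fraction:.0%}")
```

With about 15,000 dispensable genes out of roughly 60,000 in total, the share is one quarter, in line with the 25% stated in the significance statement.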

A study of transposable element-associated structural variations (TASVs) using a de novo-assembled Korean genome

Experimental & Molecular Medicine ◽

10.1038/s12276-021-00586-y ◽

2021 ◽

Author(s):

Seyoung Mun ◽

Songmi Kim ◽

Wooseok Lee ◽

Keunsoo Kang ◽

Thomas J. Meyer ◽

...

Keyword(s):

Genome Sequencing ◽

Genome Assembly ◽

De Novo ◽

Personal Genome ◽

Human Populations ◽

Whole Genome ◽

Structural Variations ◽

Insert Size ◽

Human Genomes ◽

Next Generation Sequencing (NGS)

Advances in next-generation sequencing (NGS) technology have made personal genome sequencing possible, and indeed, many individual human genomes have now been sequenced. Comparisons of these individual genomes have revealed substantial genomic differences between human populations as well as between individuals from closely related ethnic groups. Transposable elements (TEs) are known to be one of the major sources of these variations and act through various mechanisms, including de novo insertion, insertion-mediated deletion, and TE–TE recombination-mediated deletion. In this study, we carried out de novo whole-genome sequencing of one Korean individual (KPGP9) via multiple insert-size libraries. The de novo whole-genome assembly resulted in 31,305 scaffolds with a scaffold N50 size of 13.23 Mb. Furthermore, through computational data analysis and experimental verification, we revealed that 182 TE-associated structural variation (TASV) insertions and 89 TASV deletions contributed 64,232 bp in sequence gain and 82,772 bp in sequence loss, respectively, in the KPGP9 genome relative to the hg19 reference genome. We also verified structural differences associated with TASVs by comparative analysis with TASVs in recent genomes (AK1 and TCGA genomes) and reported their details. Here, we constructed a new Korean de novo whole-genome assembly and provide the first study, to our knowledge, focused on the identification of TASVs in an individual Korean genome. Our findings again highlight the role of TEs as a major driver of structural variations in individual human genomes.
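
For readers unfamiliar with the scaffold N50 statistic quoted above, here is a minimal Python sketch of how it is commonly computed. N50 is the length of the scaffold at which the running total, taken from the longest scaffold downward, first reaches half of the assembly size. The function name and the toy scaffold lengths are hypothetical and are not data from the KPGP9 assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # First scaffold at which the cumulative length reaches half the total.
        if running * 2 >= total:
            return length
    raise ValueError("lengths must contain at least one positive value")


# Hypothetical toy assembly: six scaffolds, 290 units in total.
toy_scaffolds = [80, 70, 50, 40, 30, 20]
print(n50(toy_scaffolds))  # 80 + 70 = 150 >= 145, so this prints 70
```

A larger N50 means that half of the assembled sequence sits in longer scaffolds, which is why it is often reported as a measure of assembly contiguity.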

3DIV update for 2021: a comprehensive resource of 3D genome and 3D cancer genome

Nucleic Acids Research ◽

10.1093/nar/gkaa1078 ◽

2020 ◽

Vol 49 (D1) ◽

pp. D38-D46

Author(s):

Kyukwang Kim ◽

Insu Jang ◽

Mooyoung Kim ◽

Jinhyuk Choi ◽

Min-Seo Kim ◽

...

Keyword(s):

Cell Line ◽

Large Scale ◽

Three Dimensional ◽

Cancer Cell Line ◽

Cancer Genome ◽

Structural Variations ◽

3D Genome ◽

Tightly Coupled ◽

Regulatory Effects ◽

The Impact

Three-dimensional (3D) genome organization is tightly coupled with gene regulation in various biological processes and diseases. In cancer, various types of large-scale genomic rearrangements can disrupt the 3D genome, leading to oncogenic gene expression. However, unraveling the pathogenicity of the 3D cancer genome remains a challenge since closer examinations have been greatly limited due to the lack of appropriate tools specialized for disorganized higher-order chromatin structure. Here, we updated a 3D-genome Interaction Viewer and database named 3DIV by uniformly processing ∼230 billion raw Hi-C reads to expand our contents to the 3D cancer genome. The updates of 3DIV are listed as follows: (i) the collection of 401 samples including 220 cancer cell line/tumor Hi-C data, 153 normal cell line/tissue Hi-C data, and 28 promoter capture Hi-C data, (ii) the live interactive manipulation of the 3D cancer genome to simulate the impact of structural variations and (iii) the reconstruction of Hi-C contact maps by user-defined chromosome order to investigate the 3D genome of the complex genomic rearrangement. In summary, the updated 3DIV will be the most comprehensive resource to explore the gene regulatory effects of both the normal and cancer 3D genome. ‘3DIV’ is freely available at http://3div.kr.

The Complete Chloroplast Genome of the Vulnerable Oreocharis esquirolii (Gesneriaceae): Structural Features, Comparative and Phylogenetic Analysis

Plants ◽

10.3390/plants9121692 ◽

2020 ◽

Vol 9 (12) ◽

pp. 1692

Author(s):

Li Gu ◽

Ting Su ◽

Ming-Tai An ◽

Guo-Xiong Hu

Keyword(s):

Phylogenetic Analysis ◽

Sequence Similarity ◽

Single Copy ◽

Structural Features ◽

rRNA Genes ◽

tRNA Genes ◽

Sequencing Data ◽

High Sequence Similarity ◽

Plastid Genomes ◽

Cp Genome

Oreocharis esquirolii, a member of Gesneriaceae, is also known as Thamnocharis esquirolii, which has been regarded as a synonym of the former. The species is endemic to Guizhou, southwestern China, and is evaluated as vulnerable (VU) under the International Union for Conservation of Nature (IUCN) criteria. Until now, the sequence and genome information of O. esquirolii have remained unknown. In this study, we assembled and characterized the complete chloroplast (cp) genome of O. esquirolii for the first time, using Illumina sequencing data. The total length of the cp genome was 154,069 bp, with a typical quadripartite structure consisting of a pair of inverted repeats (IRs) of 25,392 bp separated by a large single copy region (LSC) of 85,156 bp and a small single copy region (SSC) of 18,129 bp. The genome comprised 114 unique genes, including 80 protein-coding genes, 30 tRNA genes, and four rRNA genes. Thirty-one repeat sequences and 74 simple sequence repeats (SSRs) were identified. Genome alignment across five plastid genomes of Gesneriaceae indicated a high sequence similarity. Four highly variable sites (rps16-trnQ, trnS-trnG, ndhF-rpl32, and ycf1) were identified. Phylogenetic analysis indicated that O. esquirolii grouped together with O. mileensis, supporting resurrection of the name Oreocharis esquirolii from Thamnocharis esquirolii. The complete cp genome sequence will contribute to further studies in molecular identification, genetic diversity, and phylogeny.

EARRINGS: an efficient and accurate adapter trimmer entails no a priori adapter sequences

Bioinformatics ◽

10.1093/bioinformatics/btab025 ◽

2021 ◽

Author(s):

Ting-Hsuan Wang ◽

Cheng-Ching Huang ◽

Jui-Hung Hung

Keyword(s):

Open Source Software ◽

Large Scale ◽

A Priori ◽

Supplementary Information ◽

Supplementary Data ◽

Comparable Accuracy ◽

Meta Analyses ◽

Next Generation Sequencing (NGS) ◽

Adapter Trimming ◽

Generation Sequencing

Motivation: Cross-sample comparisons or large-scale meta-analyses based on next-generation sequencing (NGS) involve replicable and universal data preprocessing, including removing adapter fragments in contaminated reads (i.e., adapter trimming). Because modern adapter trimmers require users to provide candidate adapter sequences for each sample, which are sometimes unavailable or falsely documented in the repositories (such as GEO or SRA), large-scale meta-analyses are jeopardized by suboptimal adapter trimming. Results: Here we introduce a set of fast and accurate adapter detection and trimming algorithms that entail no a priori adapter sequences. These algorithms were implemented in modern C++ with SIMD and multithreading to accelerate them. Our experiments and benchmarks show that the implementation (i.e., EARRINGS), without being given any hint of adapter sequences, can reach comparable accuracy and higher throughput than that of existing adapter trimmers. EARRINGS is particularly useful in meta-analyses of a large batch of datasets and can be incorporated into sequence analysis pipelines at any scale. Availability and implementation: EARRINGS is open-source software and is available at https://github.com/jhhung/EARRINGS. Supplementary information: Supplementary data are available at Bioinformatics online.

Detection of structural variations in densely-labelled optical DNA barcodes: A hidden Markov model approach

PLoS ONE ◽

10.1371/journal.pone.0259670 ◽

2021 ◽

Vol 16 (11) ◽

pp. e0259670

Author(s):

Albertas Dvirnas ◽

Callum Stewart ◽

Vilhelm Müller ◽

Santosh Kumar Bikkarolla ◽

Karolin Frykholm ◽

...

Keyword(s):

Markov Model ◽

Hidden Markov Model ◽

Large Scale ◽

Hidden Markov ◽

Sequence Information ◽

True Positive ◽

DNA Barcodes ◽

Structural Variations ◽

Genomic Alterations ◽

Data Set

Large-scale genomic alterations play an important role in disease, gene expression, and chromosome evolution. Optical DNA mapping (ODM), commonly categorized into sparsely-labelled ODM and densely-labelled ODM, provides sequence-specific continuous intensity profiles (DNA barcodes) along single DNA molecules and is a technique well-suited for detecting such alterations. For sparsely-labelled barcodes, the possibility to detect large genomic alterations has been investigated extensively, while densely-labelled barcodes have not received as much attention. In this work, we introduce HMMSV, a hidden Markov model (HMM) based algorithm for detecting structural variations (SVs) directly in densely-labelled barcodes without access to sequence information. We evaluate our approach using simulated data-sets with 5 different types of SVs, and combinations thereof, and demonstrate that the method reaches a true positive rate greater than 80% for randomly generated barcodes with single variations of size 25 kilobases (kb). Increasing the length of the SV further leads to larger true positive rates. For a real data-set with experimental barcodes on bacterial plasmids, we successfully detect matching barcode pairs and SVs without any particular assumption of the types of SVs present. Instead, our method effectively goes through all possible combinations of SVs. Since ODM works on length scales typically not reachable with other techniques, our methodology is a promising tool for identifying arbitrary combinations of genomic alterations.

SVFX: a machine-learning framework to quantify the pathogenicity of structural variants

10.1101/739474 ◽

2019 ◽

Cited By ~ 2

Author(s):

Sushant Kumar ◽

Arif Harmanci ◽

Jagath Vytheeswaran ◽

Mark B. Gerstein

Keyword(s):

Machine Learning ◽

Large Scale ◽

Three Dimensional ◽

Point Mutations ◽

Dimensional Structure ◽

Rapid Decline ◽

Ras Signaling ◽

Cancer Genes ◽

Structural Variations ◽

Pathogenic Variants

A rapid decline in sequencing cost has made large-scale genome sequencing studies feasible. One of the fundamental goals of these studies is to catalog all pathogenic variants. Numerous methods and tools have been developed to interpret point mutations and small insertions and deletions. However, there is a lack of approaches for identifying pathogenic genomic structural variations (SVs). That said, SVs are known to play a crucial role in many diseases by altering the sequence and three-dimensional structure of the genome. Previous studies have suggested a complex interplay of genomic and epigenomic features in the emergence and distribution of SVs. However, the exact mechanism of pathogenesis for SVs in different diseases is not straightforward to decipher. Thus, we built an agnostic machine-learning-based workflow, called SVFX, to assign a “pathogenicity score” to somatic and germline SVs in various diseases. In particular, we generated somatic and germline training models, which included genomic, epigenomic, and conservation-based features for SV call sets in diseased and healthy individuals. We then applied SVFX to SVs in six different cancer cohorts and a cardiovascular disease (CVD) cohort. Overall, SVFX achieved high accuracy in identifying pathogenic SVs. Moreover, we found that predicted pathogenic SVs in cancer cohorts were enriched among known cancer genes and many cancer-related pathways (including Wnt signaling, Ras signaling, DNA repair, and ubiquitin-mediated proteolysis). Finally, we note that SVFX is flexible and can be easily extended to identify pathogenic SVs in additional disease cohorts.

Genome Assemblies of the Warthog and Kenyan Domestic Pig Provide Insights into Suidae Evolution and Candidate Genes for African Swine Fever Tolerance

10.1101/2021.12.17.473133 ◽

2021 ◽

Author(s):

Wen Feng ◽

Lei Zhou ◽

Pengju Zhao ◽

Heng Du ◽

Chenguang Diao ◽

...

Keyword(s):

Large Scale ◽

Genetic Resistance ◽

African Swine Fever ◽

Gene Families ◽

Specific Gene ◽

Chromosome 2 ◽

Sequencing Data ◽

Domestic Pig ◽

Phacochoerus Africanus ◽

Contraction And Expansion

As the warthog (Phacochoerus africanus) has innate immunity against African swine fever (ASF), it is critical to understand the evolutionary novelty of the warthog in order to explain its specific ASF resistance. Here, we present two new complete genomes, one from a warthog and one from a Kenyan domestic pig, as fundamental genomic references for decoding the genetic mechanism of ASF tolerance. Our results indicated that multiple genomic variations, including gene losses and the independent contraction and expansion of specific gene families, likely moulded the warthog's genome to adapt to its environment. Importantly, the analysis of presence and absence of genomic sequences revealed that the warthog genome lacks a DNA sequence of the lactate dehydrogenase B (LDHB) gene on chromosome 2 compared to the reference genome. Overexpression and siRNA treatment of LDHB indicated that it inhibits the replication of the ASF virus (ASFV). Combined with large-scale sequencing data from 123 pigs from all over the world, the contraction and expansion of TRIM gene families revealed that TRIM family genes in the warthog genome were potentially responsible for its tolerance to ASF. Our results will help further improve the understanding of genetic resistance to ASF in pigs.

Assessing species coverage and assembly quality of rapidly accumulating sequenced genomes

10.1101/2021.10.15.464561 ◽

2021 ◽

Author(s):

Romain Feron ◽

Robert Michael Waterhouse

Keyword(s):

Large Scale ◽

Automated Analysis ◽

Single Copy ◽

The United States ◽

Community Resource ◽

Assembly Quality ◽

Sequencing Technologies ◽

Analysis Workflow ◽

Genome Assemblies

Ambitious initiatives to coordinate genome sequencing of Earth's biodiversity mean that the accumulation of genomic data is growing rapidly. In addition to cataloguing biodiversity, these data provide the basis for understanding biological function and evolution. Accurate and complete genome assemblies offer a comprehensive and reliable foundation upon which to advance our understanding of organismal biology at genetic, species, and ecosystem levels. However, ever-changing sequencing technologies and analysis methods mean that available data are often heterogeneous in quality. In order to guide forthcoming genome generation efforts and promote efficient prioritisation of resources, it is thus essential to define and monitor taxonomic coverage and quality of the data. Here we present an automated analysis workflow that surveys genome assemblies from the United States National Center for Biotechnology Information (NCBI), assesses their completeness using the relevant Benchmarking Universal Single-Copy Orthologue (BUSCO) datasets, and collates the results into an interactively browsable resource. We apply our workflow to produce a community resource of available assemblies from the phylum Arthropoda, the Arthropoda Assembly Assessment Catalogue. Using this resource, we survey current taxonomic coverage and assembly quality at the NCBI, we examine how key assembly metrics relate to gene content completeness, and we compare results from using different BUSCO lineage datasets. These results demonstrate how the workflow can be used to build a community resource that enables large-scale assessments to survey species coverage and data quality of available genome assemblies, and to guide prioritisations for ongoing and future sampling, sequencing, and genome generation initiatives.
