Quick and efficient approach to develop genomic resources in orphan species: Application in Lavandula angustifolia

Next-Generation Sequencing (NGS) technologies, by reducing the cost and increasing the throughput of sequencing, have opened the door to generating genomic data in a range of previously poorly studied species. In this study, we propose a method for the rapid development of large-scale molecular resources for orphan species. As an example, we studied true lavender (Lavandula angustifolia Mill.), a perennial sub-shrub native to the Mediterranean region whose essential oil has numerous applications in cosmetics, pharmaceuticals, and alternative medicines. The heterozygous clone “Maillette” was used as a reference for DNA and RNA sequencing. We first built a reference Unigene, composed of coding sequences, by de novo RNA-seq assembly. Then, we reconstructed the complete gene sequences (with introns and exons) using a Unigene-guided DNA-seq assembly approach. This aimed to maximize the chances of finding polymorphisms between genetically close individuals despite the lack of a reference genome. Finally, we used these resources for SNP mining within a collection of 16 commercial lavender clones and tested the SNPs in a genetic distance analysis. We obtained a cleaned reference of 8,030 functionally annotated genes (in silico annotation). We found 359K polymorphic sites and observed a high SNP frequency (mean of 1 SNP per 90 bp) and a high level of heterozygosity (more than 60% heterozygous SNPs per genotype). Overall, we found similar genetic distances between pairs of clones, which is probably related to the out-crossing nature of the species and the restricted area of cultivation. The proposed method is transferable to other orphan species, requires few bioinformatics resources and can be carried out within a year. This is also the first reported large-scale SNP development in Lavandula angustifolia. All the genomic resources developed herein are publicly available and provide a rich pool of molecular resources to explore and exploit lavender genetic diversity in breeding programs.
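
The summary statistics quoted in this abstract (SNP frequency, per-genotype heterozygosity, pairwise genetic distance) can be illustrated with a minimal sketch. The genotype encoding, the clone names, the toy data and the allele-dosage distance below are all assumptions made for illustration; the abstract does not say which distance measure or file format the authors used.

```python
from itertools import combinations

# Hypothetical genotype calls: one string per clone, one character per SNP site.
# 'R' = homozygous reference, 'H' = heterozygous, 'A' = homozygous alternate.
genotypes = {
    "clone_A": "HHRAH",
    "clone_B": "HRRAH",
    "clone_C": "AHHHR",
}

def bases_per_snp(n_snps: int, n_bases: int) -> float:
    """Average number of bases per SNP (e.g. 90.0 means 1 SNP per 90 bp)."""
    return n_bases / n_snps

def heterozygosity(calls: str) -> float:
    """Fraction of SNP sites called heterozygous in one genotype."""
    return calls.count("H") / len(calls)

def dosage_distance(a: str, b: str) -> float:
    """Mean per-site difference in alternate-allele dosage, scaled to [0, 1]."""
    dosage = {"R": 0, "H": 1, "A": 2}
    diffs = [abs(dosage[x] - dosage[y]) for x, y in zip(a, b)]
    return sum(diffs) / (2 * len(diffs))

if __name__ == "__main__":
    print(bases_per_snp(n_snps=1000, n_bases=90000))
    for name, calls in genotypes.items():
        print(name, heterozygosity(calls))
    for (n1, g1), (n2, g2) in combinations(genotypes.items(), 2):
        print(n1, n2, dosage_distance(g1, g2))
```

With real data the per-site calls would come from a variant-calling step rather than hand-written strings; the sketch only shows how the three quantities relate to a genotype table.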


Quick and efficient approach to develop genomic resources in orphan species: application in Lavandula angustifolia

10.1101/381400 ◽

2018 ◽

Author(s):

Berline Fopa Fomeju ◽

Dominique Brunel ◽

Aurélie Bérard ◽

Jean-Baptiste Rivoal ◽

Philippe Gallois ◽

...

Keyword(s):

Large Scale ◽

De Novo ◽

Rapid Development ◽

Genetic Distances ◽

Lavandula Angustifolia ◽

Alternative Medicines ◽

Snp Development ◽

High Level ◽

The Cost ◽

Next Generation Sequencing Ngs

Abstract Next-Generation Sequencing (NGS) technologies, by reducing the cost and increasing the throughput of sequencing, have opened the door to research efforts generating genomic data for a range of previously poorly studied species. In this study, we propose a method for the rapid development of large-scale molecular resources for orphan species. As an example, we studied Lavandula angustifolia, a perennial sub-shrub native to the Mediterranean region whose essential oil has numerous applications in cosmetics, pharmaceuticals, and alternative medicines. We first built a ‘Maillette’ reference Unigene, composed of coding sequences, by de novo RNA-seq assembly. Then, we reconstructed the complete gene sequences (with exons and introns) using a transcriptome-guided DNA-seq assembly approach in order to maximize the chances of finding polymorphisms between genetically close individuals. Finally, we used these resources for SNP mining within a collection of 16 lavender clones and tested the SNPs in a phylogeny analysis. We obtained a cleaned reference of 8,030 functionally annotated ‘genes’ (in silico annotation). We found up to 400K polymorphic sites, depending on the genotype analyzed, and observed a high SNP frequency (mean of 1 SNP per 90 bp) and a high level of heterozygosity (more than 60% heterozygous SNPs per genotype). We found similar genetic distances between pairs of clones, related to the out-crossing nature of the species, the restricted area of cultivation and the clonal propagation of the varieties. The proposed method is transferable to other orphan species, requires few bioinformatics resources and can be carried out within a year. This is the first reported large-scale SNP development in Lavandula angustifolia. All these data provide a rich pool of molecular resources to explore and exploit biodiversity in breeding programs.


Comparative genomics and pangenome-oriented studies reveal high homogeneity of the agronomically relevant enterobacterial plant pathogen Dickeya solani

10.21203/rs.3.rs-20034/v3 ◽

2020 ◽

Author(s):

Agata Motyka-Pomagruk ◽

Sabina Zoledowska ◽

Agnieszka Emilia Misztak ◽

Wojciech Sledz ◽

Alessio Mengoni ◽

...

Keyword(s):

Comparative Genomics ◽

Large Scale ◽

De Novo ◽

Genetic Material ◽

Soft Rot ◽

Potato Production ◽

Core Gene ◽

Ecological Niches ◽

Dickeya Solani ◽

High Level

Abstract Background: Dickeya solani is an important plant pathogenic bacterium causing severe losses in European potato production. This species draws a lot of attention due to its remarkable virulence, great devastating potential and easier spread in contrast to other Dickeya spp. In view of a high need for extensive studies on economically important soft rot Pectobacteriaceae, we performed a comparative genomics analysis on D. solani strains to search for genetic foundations that would explain the differences in the observed virulence levels within the D. solani population. Results: High quality assemblies of 8 de novo sequenced D. solani genomes have been obtained. Whole-sequence comparison, ANIb, ANIm, Tetra and pangenome-oriented analyses performed on these genomes and the sequences of 14 additional strains revealed an exceptionally high level of homogeneity among the studied genetic material of D. solani strains. With the use of 22 genomes, the pangenome of D. solani, comprising 84.7% core, 7.2% accessory and 8.1% unique genes, has been almost completely determined, suggesting the presence of a nearly closed pangenome structure. Attribution of the genes included in the D. solani pangenome fractions to functional COG categories showed that higher percentages of accessory and unique pangenome parts in contrast to the core section are encountered in phage/mobile elements- and transcription-associated groups, with the genome of RNS 05.1.2A strain having the most significant impact. Also, the first D. solani large-scale genome-wide phylogeny computed on concatenated core gene alignments is herein reported. Conclusions: The almost closed status of D. solani pangenome achieved in this work points to the fact that the unique gene pool of this species should no longer expand. Such a feature is characteristic of taxa whose representatives either occupy isolated ecological niches or lack efficient mechanisms for gene exchange and recombination, which seems rational concerning a strictly pathogenic species with clonal population structure. Finally, no obvious correlations between the geographical origin of D. solani strains and their phylogeny were found, which might reflect the specificity of the international seed potato market.
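
The split of a pangenome into core, accessory and unique fractions can be sketched as follows. This is only an illustration under assumed definitions (core = present in every genome, unique = present in exactly one, accessory = everything in between); the strain names and gene family IDs are invented, and the study may have used different thresholds or tooling.

```python
from collections import Counter

# Hypothetical presence table: strain name -> set of gene family identifiers.
strains = {
    "strain_1": {"g1", "g2", "g3", "g4"},
    "strain_2": {"g1", "g2", "g3"},
    "strain_3": {"g1", "g2", "g5"},
}

def partition_pangenome(strains):
    """Return (core, accessory, unique) gene sets for a presence table."""
    # Count in how many genomes each gene family occurs.
    counts = Counter(g for genes in strains.values() for g in genes)
    n = len(strains)
    core = {g for g, c in counts.items() if c == n}
    unique = {g for g, c in counts.items() if c == 1}
    accessory = set(counts) - core - unique
    return core, accessory, unique

def fractions(core, accessory, unique):
    """Percentage of the pangenome in each fraction."""
    total = len(core) + len(accessory) + len(unique)
    return tuple(round(100 * len(s) / total, 1) for s in (core, accessory, unique))

if __name__ == "__main__":
    core, accessory, unique = partition_pangenome(strains)
    print(sorted(core), sorted(accessory), sorted(unique))
    print(fractions(core, accessory, unique))
```

A pangenome is described as nearly closed when adding further genomes contributes few or no new unique gene families, which is why the core share dominates in the figures reported above.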


Views on Differential Management in Safety Supervision of Construction Engineering

Journal of Architectural Research and Development ◽

10.26689/jard.v2i2.313 ◽

2018 ◽

Vol 2 (2) ◽

Author(s):

Song Yinghua

Keyword(s):

Large Scale ◽

Rapid Development ◽

Science And Technology ◽

Deep Foundation ◽

Safety Risk ◽

Construction Engineering ◽

Engineering Construction ◽

Engineering Structure ◽

High Level ◽

Socialist Market

Given the advances in science and technology, the rapid development of the socialist market economy and the continuous advance of urbanization, it is necessary to enlarge the scale of engineering construction. As the form of engineering structures becomes more complex, large-scale and high-level projects with deep foundations have appeared in engineering construction. For construction engineering, part of the technical work is solving the difficulties that arise during construction. It is required to deal with construction safety risks in time to guarantee safe construction, to resolve the management difficulties and conflicts of the project promptly, and to ensure both the safety of engineering construction and a rational institutional setup for safety supervision of the project.


A parallel computational framework for ultra-large-scale sequence clustering analysis

Bioinformatics ◽

10.1093/bioinformatics/bty617 ◽

2018 ◽

Vol 35 (3) ◽

pp. 380-388 ◽

Cited By ~ 2

Author(s):

Wei Zheng ◽

Qi Mao ◽

Robert J Genco ◽

Jean Wactawski-Wende ◽

Michael Buck ◽

...

Keyword(s):

Parallel Computing ◽

High Performance ◽

Large Scale ◽

De Novo ◽

Rapid Development ◽

Operational Taxonomic Unit ◽

Supplementary Information ◽

Computational Framework ◽

Speed Up ◽

Scale Sequence

Abstract Motivation: The rapid development of sequencing technology has led to an explosive accumulation of genomic data. Clustering is often the first step to be performed in sequence analysis. However, existing methods scale poorly with respect to the unprecedented growth of input data size. As high-performance computing systems are becoming widely accessible, it is highly desired that a clustering method can easily scale to handle large-scale sequence datasets by leveraging the power of parallel computing. Results: In this paper, we introduce SLAD (Separation via Landmark-based Active Divisive clustering), a generic computational framework that can be used to parallelize various de novo operational taxonomic unit (OTU) picking methods and comes with theoretical guarantees on both accuracy and efficiency. The proposed framework was implemented on Apache Spark, which allows for easy and efficient utilization of parallel computing resources. Experiments performed on various datasets demonstrated that SLAD can significantly speed up a number of popular de novo OTU picking methods and meanwhile maintains the same level of accuracy. In particular, the experiment on the Earth Microbiome Project dataset (∼2.2B reads, 437 GB) demonstrated the excellent scalability of the proposed method. Availability and implementation: Open-source software for the proposed method is freely available at https://www.acsu.buffalo.edu/~yijunsun/lab/SLAD.html. Supplementary information: Supplementary data are available at Bioinformatics online.


Multi-scale structural analysis of proteins by deep semantic segmentation

Bioinformatics ◽

10.1093/bioinformatics/btz650 ◽

2019 ◽

Vol 36 (6) ◽

pp. 1740-1749 ◽

Cited By ~ 2

Author(s):

Raphael R Eguchi ◽

Po-Ssu Huang

Keyword(s):

Protein Design ◽

Large Scale ◽

De Novo ◽

Protein Structures ◽

Semantic Segmentation ◽

Structural Features ◽

Supplementary Information ◽

Structural Prediction ◽

Structure Accuracy ◽

High Level

Abstract Motivation: Recent advances in computational methods have facilitated large-scale sampling of protein structures, leading to breakthroughs in protein structural prediction and enabling de novo protein design. Establishing methods to identify candidate structures that can lead to native folds or designable structures remains a challenge, since few existing metrics capture high-level structural features such as architectures, folds and conformity to conserved structural motifs. Convolutional Neural Networks (CNNs) have been successfully used in semantic segmentation, a subfield of image classification in which a class label is predicted for every pixel. Here, we apply semantic segmentation to protein structures as a novel strategy for fold identification and structure quality assessment. Results: We train a CNN that assigns each residue in a multi-domain protein to one of 38 architecture classes designated by the CATH database. Our model achieves a high per-residue accuracy of 90.8% on the test set (95.0% average per-class accuracy; 87.8% average per-structure accuracy). We demonstrate that individual class probabilities can be used as a metric that indicates the degree to which a randomly generated structure assumes a specific fold, as well as a metric that highlights non-conformative regions of a protein belonging to a known class. These capabilities yield a powerful tool for guiding structural sampling for both structural prediction and design. Availability and implementation: The trained classifier network, parser network, and entropy calculation scripts are available for download at https://git.io/fp6bd, with detailed usage instructions provided at the download page. A step-by-step tutorial for setup is provided at https://goo.gl/e8GB2S. All Rosetta commands, RosettaRemodel blueprints, and predictions for all datasets used in the study are available in the Supplementary Information. Supplementary information: Supplementary data are available at Bioinformatics online.
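
The three accuracy figures quoted in this abstract average the same per-residue predictions in different ways, which is why they differ. The sketch below shows one plausible reading of each metric on toy labels; the exact definitions used in the paper are an assumption here, and the class labels are invented.

```python
from collections import defaultdict

def per_residue_accuracy(true, pred):
    """Fraction of residues whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(true, pred)) / len(true)

def per_class_accuracy(true, pred):
    """Accuracy computed within each true class, then averaged over classes."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(true, pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

def per_structure_accuracy(structures):
    """Per-residue accuracy computed per structure, then averaged."""
    scores = [per_residue_accuracy(t, p) for t, p in structures]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    # Hypothetical labels for two small "structures" with classes "a" and "b".
    s1 = (["a", "a", "a", "a", "b", "b"], ["a", "a", "a", "a", "b", "a"])
    s2 = (["b", "b"], ["b", "b"])
    print(per_residue_accuracy(*s1))
    print(per_class_accuracy(*s1))
    print(per_structure_accuracy([s1, s2]))
```

Per-class averaging weights rare classes as heavily as common ones, while per-structure averaging weights short structures as heavily as long ones.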


SpaRC: Scalable Sequence Clustering using Apache Spark

10.1101/246496 ◽

2018 ◽

Author(s):

Lizhen Shi ◽

Xiandong Meng ◽

Elizabeth Tseng ◽

Michael Mascagni ◽

Zhong Wang

Keyword(s):

Large Scale ◽

De Novo ◽

Sequence Data ◽

Rapid Development ◽

Cost Effective ◽

Apache Spark ◽

Next Generation ◽

Individual Gene ◽

Sequence Clustering ◽

Sequencing Technologies

Abstract Whole genome shotgun based next generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) of sequence data derived from tens of thousands of different genes or microbial species. De novo assembly of these data requires a solution that both scales with data size and optimizes for individual genes or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions the reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short read and long read sequencing technologies. It achieved near-linear scalability with respect to input data size and number of compute nodes. SpaRC can run on different cloud computing environments without modifications while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar large-scale sequence data analysis problems. The software is available under the Apache 2.0 license at https://bitbucket.org/LizhenShi/sparc.
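
The idea of partitioning reads by their likely molecule of origin can be illustrated on a single machine by grouping reads that share k-mers. The sketch below uses a union-find structure for this; it is not SpaRC's algorithm or its Spark implementation, the function names and toy reads are hypothetical, and reverse complements and sequencing errors are ignored for brevity.

```python
from collections import defaultdict

def kmers(seq, k):
    """Set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def cluster_reads(reads, k=4):
    """Group read indices into clusters of reads linked by shared k-mers."""
    parent = list(range(len(reads)))

    def find(i):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    # Remember the first read in which each k-mer was seen and merge later
    # reads containing the same k-mer into that read's cluster.
    first_seen = {}
    for idx, read in enumerate(reads):
        for km in kmers(read, k):
            if km in first_seen:
                union(first_seen[km], idx)
            else:
                first_seen[km] = idx

    clusters = defaultdict(list)
    for idx in range(len(reads)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())

if __name__ == "__main__":
    reads = ["ACGTACGG", "TACGGTTA", "GGGCCCAA", "CCCAATTT"]
    print(cluster_reads(reads, k=4))
```

A single shared k-mer is a very permissive linking rule; a practical tool would need stricter overlap criteria to avoid merging unrelated reads through repeats.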


Knowledge-driven drug repurposing using a comprehensive drug knowledge graph

Health Informatics Journal ◽

10.1177/1460458220937101 ◽

2020 ◽

Vol 26 (4) ◽

pp. 2737-2750 ◽

Cited By ~ 2

Author(s):

Yongjun Zhu ◽

Chao Che ◽

Bo Jin ◽

Ningrui Zhang ◽

Chang Su ◽

...

Keyword(s):

Large Scale ◽

De Novo ◽

Rapid Development ◽

Drug Repurposing ◽

Data Representation ◽

Knowledge Bases ◽

Knowledge Graph ◽

Multiple Drug ◽

Treatment Information ◽

Drug Knowledge

Due to the huge costs associated with new drug discovery and development, drug repurposing has become an important complement to the traditional de novo approach. With the increasing number of public databases and the rapid development of analytical methodologies, computational approaches have gained great momentum in the field of drug repurposing. In this study, we introduce an approach to knowledge-driven drug repurposing based on a comprehensive drug knowledge graph. We design and develop a drug knowledge graph by systematically integrating multiple drug knowledge bases. We describe path- and embedding-based data representation methods of transforming information in the drug knowledge graph into valuable inputs to allow machine learning models to predict drug repurposing candidates. The evaluation demonstrates that the knowledge-driven approach can produce high predictive results for known diabetes mellitus treatments by only using treatment information on other diseases. In addition, this approach supports exploratory investigation through the review of meta paths that connect drugs with diseases. This knowledge-driven approach is an effective drug repurposing strategy supporting large-scale prediction and the investigation of case studies.
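
The meta paths mentioned in this abstract are type-level patterns (for example Drug, targets, Gene, associated_with, Disease) followed by concrete paths in the knowledge graph. A minimal sketch of enumerating them is given below; the node names, relation names and the `find_meta_paths` helper are hypothetical and only illustrate the concept, not the authors' pipeline or data.

```python
from collections import defaultdict

# Hypothetical typed, directed edges: (source, relation, target).
edges = [
    ("drug_A", "targets", "gene_X"),
    ("gene_X", "associated_with", "disease_1"),
    ("drug_A", "treats", "disease_2"),
    ("disease_2", "shares_gene_with", "disease_1"),
]
node_type = {
    "drug_A": "Drug",
    "gene_X": "Gene",
    "disease_1": "Disease",
    "disease_2": "Disease",
}

def find_meta_paths(edges, node_type, start, goal, max_hops=3):
    """Return the type-level schema of every simple path from start to goal."""
    adj = defaultdict(list)
    for s, r, t in edges:
        adj[s].append((r, t))
    results = []

    def dfs(node, visited, schema):
        if node == goal:
            results.append(tuple(schema))
            return
        # schema alternates node types and relations: 1 + 2 * hops entries.
        if (len(schema) - 1) // 2 >= max_hops:
            return
        for relation, nxt in adj[node]:
            if nxt not in visited:
                dfs(nxt, visited | {nxt}, schema + [relation, node_type[nxt]])

    dfs(start, {start}, [node_type[start]])
    return results

if __name__ == "__main__":
    for path in find_meta_paths(edges, node_type, "drug_A", "disease_1"):
        print(" -> ".join(path))
```

Counts of such schemas per drug and disease pair are one way path-based features could be fed to a classifier, alongside the embedding-based representation the abstract also mentions.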


Comparative genomics and pangenome-oriented studies reveal high homogeneity of the agronomically relevant enterobacterial plant pathogen Dickeya solani

10.21203/rs.3.rs-20034/v2 ◽

2020 ◽

Author(s):

Agata Motyka-Pomagruk ◽

Sabina Zoledowska ◽

Agnieszka Emilia Misztak ◽

Wojciech Sledz ◽

Alessio Mengoni ◽

...

Keyword(s):

Comparative Genomics ◽

Large Scale ◽

De Novo ◽

Genetic Material ◽

Soft Rot ◽

Potato Production ◽

Core Gene ◽

Ecological Niches ◽

Dickeya Solani ◽

High Level


deGSM: memory scalable construction of large scale de Bruijn Graph

10.1101/388454 ◽

2018 ◽

Cited By ~ 2

Author(s):

Hongzhe Guo ◽

Yilei Fu ◽

Yan Gao ◽

Junyi Li ◽

Yadong Wang ◽

...

Keyword(s):

Genome Sequence ◽

Large Scale ◽

High Throughput Sequencing ◽

De Novo ◽

Rapid Development ◽

Main Idea ◽

Supplementary Information ◽

De Bruijn Graph ◽

External Sorting ◽

De Bruijn

Abstract Motivation: The de Bruijn graph, a fundamental data structure to represent and organize genome sequence, plays important roles in various kinds of sequence analysis tasks such as de novo assembly, high-throughput sequencing (HTS) read alignment, pan-genome analysis, metagenomics analysis, HTS read correction, etc. With the rapid development of HTS data and the ever-increasing number of assembled genomes, there is a high demand to construct de Bruijn graphs for sequences up to the tera-base-pair level. This is non-trivial since the graph to be constructed can be very large, consisting of hundreds of billions of vertices and edges. Existing approaches may have unaffordable memory footprints when handling such a large de Bruijn graph. Moreover, the construction approach must handle very large datasets efficiently, even in a relatively small RAM space. Results: We propose a lightweight parallel de Bruijn graph construction approach, de Bruijn Graph Constructor in Scalable Memory (deGSM). The main idea of deGSM is to efficiently construct the Burrows-Wheeler Transformation (BWT) of the unipaths of the de Bruijn graph in constant RAM space and transform the BWT into the original unitigs. It is mainly implemented by a fast parallel external sorting of k-mers, in which a novel organization of the k-mers allows only part of them to be kept in RAM. The experimental results demonstrate that, with just a commonly used machine, deGSM is able to handle very large genome sequence(s), e.g., the contigs (305 Gbp) and scaffolds (1.1 Tbp) recorded in the GenBank database and the Picea abies HTS dataset (9.7 Tbp). Moreover, deGSM has faster or comparable construction speed compared with state-of-the-art approaches. With its high scalability and efficiency, deGSM has enormous potential in many large-scale genomics studies. Availability: https://github.com/hitbc/[email protected] (YW) and [email protected] (BL) Supplementary information: Supplementary data are available online.
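
For readers unfamiliar with the data structure, a de Bruijn graph can be built from k-mers in a few lines. The sketch below is a plain in-memory construction on toy sequences, which is exactly the approach that stops scaling at the sizes discussed above; it does not reproduce deGSM's external sorting or BWT-based method, and the function name is hypothetical.

```python
from collections import defaultdict

def de_bruijn(sequences, k):
    """Map each (k-1)-mer node to the sorted list of its successor nodes.

    Every k-mer in the input contributes one edge from its (k-1)-base prefix
    to its (k-1)-base suffix.
    """
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return {node: sorted(succ) for node, succ in graph.items()}

if __name__ == "__main__":
    # Two toy sequences that share a prefix and then diverge.
    for node, successors in de_bruijn(["ACGT", "ACGA"], k=3).items():
        print(node, "->", successors)
```

Nodes with exactly one predecessor and one successor can be merged into longer non-branching paths, which is what the unitigs in the abstract refer to.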


Comparative genomics and pangenome-oriented studies reveal high homogeneity of the agronomically relevant enterobacterial plant pathogen Dickeya solani

10.21203/rs.3.rs-20034/v1 ◽

2020 ◽

Author(s):

Agata Motyka-Pomagruk ◽

Sabina Zoledowska ◽

Agnieszka Emilia Misztak ◽

Wojciech Sledz ◽

Alessio Mengoni ◽

...

Keyword(s):

Large Scale ◽

De Novo ◽

Genetic Material ◽

Soft Rot ◽

Potato Production ◽

Core Gene ◽

Ecological Niches ◽

Dickeya Solani ◽

Scientific Attention ◽

High Level

Abstract Background: Dickeya solani was identified as a significant threat to potato production in Europe and drew much scientific attention due to its remarkable virulence, great devastating potential and easier spread in contrast to other Dickeya spp. In view of a high need for extensive studies on economically important soft rot Pectobacteriaceae, we performed a nearly conclusive pangenome analysis on D. solani strains to search for genetic foundations that would explain the differences in the observed virulence levels within the D. solani population. Results: High quality assemblies of 8 de novo sequenced D. solani genomes have been obtained. Whole-sequence comparison, ANIb, ANIm, Tetra and pangenome-oriented analyses performed on these genome sequences and the sequences of 14 additional strains revealed an exceptionally high level of homogeneity among the studied genetic material of D. solani strains. With the use of 22 genomes, the pangenome of D. solani, comprising 84.7% core, 7.2% accessory and 8.1% unique genes, has been almost completely determined, suggesting the presence of a nearly closed pangenome structure. Attribution of the genes included in the D. solani pangenome fractions to functional COG categories revealed that higher percentages of accessory and unique pangenome parts in contrast to the core section are encountered in phage/mobile elements- and transcription-associated groups, with the genome of RNS 05.1.2A strain having the most significant impact. Also, the first D. solani large-scale genome-wide phylogeny computed on concatenated core gene alignments is herein reported. Conclusions: The almost closed status of D. solani pangenome achieved in this work points to the fact that the unique gene pool of this species should no longer expand. Such a feature is characteristic of taxa whose representatives either occupy isolated ecological niches or lack efficient mechanisms for gene exchange and recombination, which seems rational concerning a strictly pathogenic species with clonal spread and population structure. Finally, no obvious correlations between the geographical origin of D. solani strains and their phylogeny were found, which might reflect the specificity of the international seed potato market.
