The genomic data deficit: On the need to inform research subjects of the informational content of their genomic sequence data in consent for genomic research

2020 ◽  
Vol 37 ◽  
pp. 105427
Author(s):  
Dara Hallinan

2021 ◽  
Author(s):  
Ziheng Yang ◽  
Thomas Flouris

The multispecies coalescent with introgression (MSci) model accommodates both the coalescent process and cross-species introgression/hybridization events, two major processes that create genealogical fluctuations across the genome and gene-tree–species-tree discordance. Full likelihood implementations of the MSci model treat such fluctuations as a major source of information about the history of species divergence and gene flow, and provide a powerful tool for estimating the direction, timing and strength of cross-species introgression from multilocus sequence data. However, introgression models, in particular those that accommodate bidirectional introgression (BDI), are known to cause unidentifiability issues of the label-switching type, whereby different models or parameters make the same predictions about the genomic data and thus cannot be distinguished by the data. Nevertheless, there has been no systematic study of unidentifiability when full likelihood methods are applied. Here we characterize the unidentifiability of arbitrary BDI models and derive simple rules for identifying it. In general, an MSci model with k BDI events has 2^k unidentifiable towers in the posterior, with each BDI event between sister species creating within-model unidentifiability and each BDI event between non-sister species creating cross-model unidentifiability. We develop novel algorithms for processing Markov chain Monte Carlo (MCMC) samples to remove label switching and implement them in the BPP program. We analyze genomic sequence data from Heliconius butterflies as well as synthetic data to illustrate the utility of the BDI models and the new algorithms.
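To make the post-processing step concrete, the following is a minimal, generic relabelling sketch in Python. It is not the algorithm implemented in BPP: the column layout (theta_X, theta_Y, phi) and the symmetry map bdi_mirror are hypothetical stand-ins for the mirror mapping induced by a single sister-species BDI event.

    # Generic post-processing sketch to remove label switching from MCMC
    # samples under one bidirectional-introgression (BDI) event. This is
    # NOT BPP's algorithm; it illustrates the common strategy of mapping
    # each sample onto the tower nearest a running reference point.
    import numpy as np

    def bdi_mirror(row):
        """Hypothetical symmetry map for one sister-species BDI event:
        swap the two hybrid-node theta columns and reflect phi -> 1 - phi."""
        out = row.copy()
        out[[0, 1]] = row[[1, 0]]   # theta_X <-> theta_Y
        out[2] = 1.0 - row[2]       # phi -> 1 - phi
        return out

    def relabel(samples):
        """Keep, for each sample, the image closest to a running mean,
        so the relabelled chain explores a single posterior mode."""
        ref = samples[0].copy()                # seed reference point
        out = np.empty_like(samples)
        for i, row in enumerate(samples):
            mirrored = bdi_mirror(row)
            if np.linalg.norm(row - ref) <= np.linalg.norm(mirrored - ref):
                out[i] = row
            else:
                out[i] = mirrored
            ref += (out[i] - ref) / (i + 1)    # update running mean
        return out

    # usage: samples is an (n, 3) array of (theta_X, theta_Y, phi) draws
    samples = np.random.rand(1000, 3)
    clean = relabel(samples)

With k BDI events, each sample has 2^k candidate images under the composed mirror maps, and the same rule applies: keep the image closest to the reference.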


2020 ◽  
Vol 7 (9) ◽  
pp. 201206
Author(s):  
Margarita Hernandez ◽  
Mary K. Shenk ◽  
George H. Perry

Scholars have noted major disparities in the extent of scientific research conducted among taxonomic groups. Such trends may cascade if future scientists gravitate towards study species with more data and resources already available. As new technologies emerge, do research studies employing these technologies perpetuate these disparities? Here, using non-human primates as a case study, we identified disparities in massively parallel genomic sequencing data and conducted interviews with scientists who produced these data to learn their motivations when selecting study species. We tested whether variables including publication history and conservation status were significantly correlated with publicly available sequence data in the NCBI Sequence Read Archive (SRA). Of the 179.6 terabases (Tb) of sequence data in SRA for 519 non-human primate species, 135 Tb (approx. 75%) were from only five species: rhesus macaques, olive baboons, green monkeys, chimpanzees and crab-eating macaques. The strongest predictors of the amount of genomic data were the total number of non-medical publications (linear regression; r² = 0.37, p = 6.15 × 10⁻¹²) and the number of medical publications (r² = 0.27, p = 9.27 × 10⁻⁹). In a generalized linear model, the number of non-medical publications (p = 0.00064) and closer phylogenetic distance to humans (p = 0.024) were the most predictive of the amount of genomic sequence data. We interviewed 33 authors of genomic data-producing publications and analysed their responses using grounded theory. Consistent with our quantitative results, authors mentioned that their choice of species was motivated by sample accessibility, prior published work and relevance to human medicine. Our mixed-methods approach helped identify and contextualize some of the driving factors behind species-uneven patterns of scientific research, which can now be considered by funding agencies, scientific societies and research teams aiming to align their broader goals with future data generation efforts.
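A sketch of the kind of regression reported above: the amount of SRA sequence data per species regressed on publication counts. The file name, column names and log transform are illustrative assumptions, not the authors' exact pipeline.

    # Regress (log-scaled) SRA sequence data per species on the number of
    # non-medical publications; reports r^2 and p as in the abstract.
    import numpy as np
    import pandas as pd
    from scipy import stats

    df = pd.read_csv("primate_sra.csv")        # hypothetical: one row per species

    # log-transform the heavily skewed response (bases of SRA data)
    log_bases = np.log10(df["sra_bases"] + 1)

    res = stats.linregress(df["nonmedical_pubs"], log_bases)
    print(f"r^2 = {res.rvalue**2:.2f}, p = {res.pvalue:.2e}")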


With advances in technology and the development of high-throughput sequencing (HTS), the amount of genomic data generated per day per laboratory worldwide is outpacing Moore's law. This volume of data raises concerns for biologists regarding both storage and transmission across locations for further analysis. Compression of genomic data is a practical way to cope with the data deluge. This paper discusses various algorithms that exist for compression of genomic data, as well as a few general-purpose algorithms, and proposes an LZW-based compression algorithm that uses indexed multiple dictionaries. The proposed method exhibits an average compression ratio of 0.41 bits per base and an average compression time of 6.45 s for DNA sequences of average size 105.9 KB.
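For readers unfamiliar with the underlying scheme, here is textbook LZW over the 4-letter DNA alphabet: a single-dictionary baseline, not the indexed multiple-dictionary variant the paper proposes, shown only to make the approach concrete.

    # Classic LZW encoder seeded with the four DNA bases. Repeated motifs
    # are learned as dictionary phrases and emitted as single codes.
    def lzw_encode(seq):
        """Encode a DNA string into a list of integer codes."""
        table = {c: i for i, c in enumerate("ACGT")}  # seed with the 4 bases
        w, codes = "", []
        for c in seq:
            wc = w + c
            if wc in table:
                w = wc                      # extend the current match
            else:
                codes.append(table[w])      # emit code for longest match
                table[wc] = len(table)      # learn the new phrase
                w = c
        if w:
            codes.append(table[w])
        return codes

    codes = lzw_encode("ACGTACGTACGTACGT")
    print(codes)                            # repeated motifs collapse into few codes

The 0.41 bits per base reported above comes from the paper's indexed multi-dictionary variant, not from this single-dictionary baseline.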


2020 ◽  
Vol 15 ◽  
Author(s):  
Affan Alim ◽  
Abdul Rafay ◽  
Imran Naseem

Background: Proteins contribute significantly to every task of cellular life. Their functions encompass building and repairing tissues in humans and other organisms; they are the building blocks of bones, muscles, cartilage, skin, and blood. Antifreeze proteins (AFPs) are likewise of prime significance for organisms that live in very cold regions: with their help, cold-water organisms can survive below-zero temperatures and resist the water crystallization process, which may otherwise rupture internal cells and tissues. AFPs have also attracted attention and interest in the food industry and in cryopreservation. Objective: With the increasing availability of genomic protein sequence data, an automated and sophisticated tool for AFP recognition and identification is direly needed. Because the sequences and structures of AFPs are highly distinct, most proposed methods fail to show promising results across different structures. We propose a consolidated method that delivers competitive performance on highly distinct AFP structures. Methods: In this study, we propose machine learning algorithms, Principal Component Analysis (PCA) followed by Gradient Boosting (GB), for antifreeze protein identification. To analyze the performance and validity of the proposed model, various combinations of two segment compositions of amino acids and dipeptides are used. PCA, in particular, is used for dimensionality reduction while retaining the high-variance structure of the data; it is followed by gradient boosting, an ensemble method, for modelling and classification. Results: The proposed method obtained superior performance on the PDB, Pfam and UniProt datasets compared with the RAFP-Pred method. In experiment 3, using only 150 PCA components, a high accuracy of 89.63% was achieved, exceeding the 87.41% reported for RAFP-Pred using 300 significant features. Experiment 2 used two different datasets: non-AFPs from the PISCES server and AFPs from the Protein Data Bank. In this experiment, our proposed method attained a sensitivity of 79.16%, which is 12.50 percentage points higher than the state-of-the-art RAFP-Pred method. Conclusion: AFPs share a common function despite their distinct structures, so a single model developed for different sequences often fails to generalize across AFPs. Our proposed model shows robust results across diverse training and testing datasets and outperforms previous AFP prediction methods such as RAFP-Pred. The model consists of PCA for dimensionality reduction followed by gradient boosting for classification. Owing to its simplicity, scalability and high performance, our model can easily be extended to the analysis of proteomic and genomic datasets.
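A minimal sketch of the PCA-then-gradient-boosting pipeline described in the Methods, using scikit-learn. The 150-component setting comes from the abstract; the synthetic feature matrix below is a placeholder standing in for amino-acid/dipeptide composition vectors.

    # PCA for dimensionality reduction followed by gradient boosting for
    # binary AFP classification, assembled as one scikit-learn pipeline.
    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.ensemble import GradientBoostingClassifier

    clf = make_pipeline(
        StandardScaler(),                  # PCA is sensitive to feature scale
        PCA(n_components=150),             # keep 150 high-variance components
        GradientBoostingClassifier(),      # ensemble classifier
    )

    X = np.random.rand(400, 420)           # placeholder composition features
    y = np.random.randint(0, 2, 400)       # 1 = AFP, 0 = non-AFP
    clf.fit(X, y)
    print(clf.score(X, y))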


Genes ◽  
2021 ◽  
Vol 12 (8) ◽  
pp. 1212
Author(s):  
J. Spencer Johnston ◽  
Carl E. Hjelmen

Next-generation sequencing provides a nearly complete genomic sequence for model and non-model species alike; however, this wealth of sequence data includes no road map [...]


2014 ◽  
Vol 64 (Pt_2) ◽  
pp. 316-324 ◽  
Author(s):  
Jongsik Chun ◽  
Fred A. Rainey

The polyphasic approach used today in the taxonomy and systematics of the Bacteria and Archaea includes the use of phenotypic, chemotaxonomic and genotypic data. The use of 16S rRNA gene sequence data has revolutionized our understanding of the microbial world and led to a rapid increase in the number of descriptions of novel taxa, especially at the species level. In many cases it has allowed the demarcation of taxa into distinct species, but its limitations in a number of groups have resulted in the continued use of DNA–DNA hybridization. As technology has improved, next-generation sequencing (NGS) has provided a rapid and cost-effective approach to obtaining whole-genome sequences of microbial strains. Although some 12 000 bacterial or archaeal genome sequences are available for comparison, only 1725 of these are of actual type strains, which limits the use of genomic data in comparative taxonomic studies when there are nearly 11 000 type strains. Efforts to obtain complete genome sequences of all type strains are critical to the future of microbial systematics. The incorporation of genomics into the taxonomy and systematics of the Bacteria and Archaea, coupled with computational advances, will boost the credibility of taxonomy in the genomic era. This special issue of the International Journal of Systematic and Evolutionary Microbiology contains both original research and review articles covering the use of genomic sequence data in microbial taxonomy and systematics. It includes contributions on specific taxa as well as outlines of approaches for incorporating genomics into workflows from new strain isolation through to new taxon description.


2013 ◽  
Vol 5 (12) ◽  
pp. 2410-2419 ◽  
Author(s):  
Adam D. Leaché ◽  
Rebecca B. Harris ◽  
Max E. Maliska ◽  
Charles W. Linkem

2021 ◽  
Vol 16 ◽  
Author(s):  
Jinghao Peng ◽  
Jiajie Peng ◽  
Haiyin Piao ◽  
Zhang Luo ◽  
Kelin Xia ◽  
...  

Background: Open, accessible regions of the chromosome are more likely to be bound by transcription factors, which are important for nuclear processes and biological functions. Studying changes in chromosome flexibility can help discover and analyze disease markers and improve the efficiency of clinical diagnosis. Current methods for predicting chromosome flexibility based on Hi-C data include the flexibility-rigidity index (FRI) and the Gaussian network model (GNM), which have been proposed to characterize chromosome flexibility. However, these methods require chromosome structure data derived from 3D biological experiments, which are time-consuming and expensive. Objective: In general, the folding and coiling of the DNA double helix have a great impact on chromosome flexibility and function. Motivated by the success of genomic sequence analysis in biomolecular function analysis, we aim to predict chromosome flexibility from genomic sequence data alone. Method: We propose a new method, named "DeepCFP", that uses deep learning models to predict chromosome flexibility from genomic sequence features only. The model has been tested on the GM12878 cell line. Results: The maximum accuracy of our model reaches 91%, and the performance of DeepCFP is close to that of FRI and GNM. Conclusion: DeepCFP achieves high performance based on genomic sequence alone.
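A minimal sketch of a sequence-only classifier in the spirit of DeepCFP: a 1D convolutional network over one-hot encoded DNA that labels a window as flexible or rigid. The architecture, window size and binary labels are assumptions for illustration; the paper's actual model may differ.

    # 1D CNN over one-hot DNA windows (channels A, C, G, T) producing
    # two logits per window: flexible vs rigid.
    import torch
    import torch.nn as nn

    class FlexibilityCNN(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(4, 32, kernel_size=11, padding=5),  # 4 base channels
                nn.ReLU(),
                nn.MaxPool1d(4),
                nn.Conv1d(32, 64, kernel_size=11, padding=5),
                nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),   # pool over the whole window
                nn.Flatten(),
                nn.Linear(64, 2),          # flexible vs rigid logits
            )

        def forward(self, x):              # x: (batch, 4, window_length)
            return self.net(x)

    model = FlexibilityCNN()
    logits = model(torch.randn(8, 4, 1000))   # dummy batch of 8 windows
    print(logits.shape)                       # torch.Size([8, 2])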

