A Vector Representation of DNA Sequences Using Locality Sensitive Hashing

ABSTRACTDrawing from the analogy between natural language and "genomic sequence language", we explored the applicability of word embeddings in natural language processing (NLP) to represent DNA reads in Metagenomics studies. Here, k-mer is the equivalent concept of word in NLP and it has been widely used in analyzing sequence data. However, directly replacing word embedding with k-mer embedding is problematic due to two reasons: First, the number of k-mers is many times of the number of words in NLP, making the model too big to be useful. Second, sequencing errors create lots of rare k-mers (noise), making the model hard to be trained. In this work, we leverage Locality Sensitive Hashing (LSH) to overcoming these challenges. We then adopted the skip-gram with negative sampling model to learn k-mer embeddings. Experiments on metagenomic datasets with labels demonstrated that LSH can not only accelerate training time and reduce the memory requirements to store the model, but also achieve higher accuracy than alternative methods. Finally, we demonstrate the trained low-dimensional k-mer embeddings can be potentially used for accurate metagenomic read clustering and predict their taxonomy, and this method is robust on reads with high sequencing error rates (12-22%).

Download Full-text

Protecting Genomic Sequence Anonymity with Generalization Lattices

Methods of Information in Medicine ◽

10.1055/s-0038-1634025 ◽

2005 ◽

Vol 44 (05) ◽

pp. 687-692 ◽

Cited By ~ 18

Author(s):

B. A. Malin

Keyword(s):

Dna Sequences ◽

Genomic Sequence ◽

Sequence Data ◽

Personal Information ◽

Control Technique ◽

Single Nucleotide ◽

Specific Data ◽

Disclosure Control ◽

Genomic Privacy ◽

Nucleotide Region

Summary Objectives: Current genomic privacy technologies assume the identity of genomic sequence data is protected if personal information, such as demographics, are obscured, removed, or encrypted. While demographic features can directly compromise an individual’s identity, recent research demonstrates such protections are insufficient because sequence data itself is susceptible to re-identification. To counteract this problem, we introduce an algorithm for anonymizing a collection of person-specific DNA sequences. Methods: The technique is termed DNA lattice an-onymization (DNALA), and is based upon the formal privacy protection schema of k-anonymity. Under this model, it is impossible to observe or learn features that distinguish one genetic sequence from k-1 other entries in a collection. To maximize information retained in protected sequences, we incorporate a concept generalization lattice to learn the distance between two residues in a single nucleotide region. The lattice provides the most similar generalized concept for two residues (e.g. adenine and guanine are both purines). Results: The method is tested and evaluated with several publicly available human population datasets ranging in size from 30 to 400 sequences. Our findings imply the anonymization schema is feasible for the protection of sequences privacy. Conclusions: The DNALA method is the first computational disclosure control technique for general DNA sequences. Given the computational nature of the method, guarantees of anonymity can be formally proven. There is room for improvement and validation, though this research provides the groundwork from which future researchers can construct genomics anonymization schemas tailored to specific data-sharing scenarios.

Download Full-text

Minimizer-space de Bruijn graphs

10.1101/2021.06.09.447586 ◽

2021 ◽

Author(s):

Barış Ekim ◽

Bonnie Berger ◽

Rayan Chikhi

Keyword(s):

Human Genome ◽

Dna Sequences ◽

Graphical Representation ◽

Error Rates ◽

Sequencing Error ◽

Sequencing Data ◽

De Bruijn Graphs ◽

Human Genome Assembly ◽

Long Read ◽

Metagenome Assembly

DNA sequencing data continues to progress towards longer reads with increasingly lower sequencing error rates. We focus on the problem of assembling such reads into genomes, which poses challenges in terms of accuracy and computational resources when using cutting-edge assembly approaches, e.g. those based on overlapping reads using minimizer sketches. Here, we introduce the concept of minimizer-space sequencing data analysis, where the minimizers rather than DNA nucleotides are the atomic tokens of the alphabet. By projecting DNA sequences into ordered lists of minimizers, our key idea is to enumerate what we call k-min-mers, that are k-mers over a larger alphabet consisting of minimizer tokens. Our approach, mdBG or minimizer-dBG, achieves orders-of magnitude improvement in both speed and memory usage over existing methods without much loss of accuracy. We demonstrate three uses cases of mdBG: human genome assembly, metagenome assembly, and the representation of large pangenomes. For assembly, we implemented mdBG in software we call rust-mdbg, resulting in ultra-fast, low memory and highly-contiguous assembly of PacBio HiFi reads. A human genome is assembled in under 10 minutes using 8 cores and 10 GB RAM, and 60 Gbp of metagenome reads are assembled in 4 minutes using 1 GB RAM. For pangenome graphs, we newly allow a graphical representation of a collection of 661,405 bacterial genomes as an mdBG and successfully search it (in minimizer-space) for anti-microbial resistance (AMR) genes. We expect our advances to be essential to sequence analysis, given the rise of long-read sequencing in genomics, metagenomics and pangenomics.

Download Full-text

Pembentukan Pustaka Genom, Resekuensing, dan Identifikasi SNP Berdasarkan Sekuen Genom Total Genotipe Kedelai Indonesia

Jurnal AgroBiogen ◽

10.21082/jbio.v11n1.2015.p7-16 ◽

2016 ◽

Vol 11 (1) ◽

pp. 7 ◽

Cited By ~ 2

Author(s):

I Made Tasma ◽

Dani Satyawan ◽

Habib Rijzaani

Keyword(s):

Genomic Library ◽

Genomic Sequence ◽

Sequence Data ◽

Soybean Genome ◽

Quality Analysis ◽

Reference Sequence ◽

Sequencing Error ◽

Whole Genome ◽

Genome Wide ◽

Soybean Genotypes

Resequencing of the soybean genome facilitates SNP marker discoveries useful for supporting the national soybean breeding programs. The objectives of the present study were to construct soybean genomic libraries, to resequence the whole genome of five Indonesian soybean genotypes, and to identify SNPs based on the resequence data. The studies consisted of genomic library construction and quality analysis, resequencing the whole-genome of five soybean genotypes, and genome-wide SNP identification based on alignment of the resequence data with reference sequence, Williams 82. The five Indonesian soybean genotypes were Tambora, Grobogan, B3293, Malabar, and Davros. The results showed that soybean genomic library was successfully constructed having the size of 400 bp with library concentrations range from 21.2–64.5 ng/μl. Resequencing of the libraries resulted in 50.1 x 109 bp total genomic sequence. The quality of genomic library and sequence data resulted from this study was high as indicated by Q score of 88.6% with low sequencing error of only 0.97%. Bioinformatic analysis resulted in a total of 2,597,286 SNPs, 257,598 insertions, and 202,157 deletions. Of the total SNPs identified, only 95,207 SNPs (2.15%) were located within exons. Among those, 49,926 SNPs caused missense mutation and 1,535 SNPs caused nonsense mutation. SNPs resulted from this study upon verification will be very useful for genome-wide SNP chip development of the soybean genome to accelerate breeding program of the soybean.

Download Full-text

Bovine Genome Analysis to Unravel the Location and Feature of Target Sites of RNA-Guided Hyperactivated Recombinase Gin with Spacer Length Six

Indian Journal of Animal Research ◽

10.18805/ijar.b-4693 ◽

2022 ◽

Author(s):

Shalu Kumari Pathak ◽

Arvind Sonwane ◽

Subodh Kumar

Keyword(s):

Dna Sequences ◽

Genomic Sequence ◽

Sequence Data ◽

Bovine Genome ◽

Search Pattern ◽

Spacer Length ◽

Guide Rna ◽

Target Sites ◽

Emboss Package ◽

Programmable Nucleases

Background: Programmable nucleases are very promising tools of genome editing (GE), but they suffer from limitations including potential risk of genotoxicity which led to the exploration of safer approach of GE based on RNA-guided recombinase (RGR) platform. RNA-guided recombinase (RGR) platform operates on a typical recognition or target site comprised of the minimal pseudo-core recombinase site, a 5 to 6-base pair spacer flanking it and whole this central region is flanked by two guide RNA-specified DNA sequences or Cas9 binding sites followed by protospacer adjacent motifs (PAMs). Methods: The current study focuses on analysis of entire cattle genome to prepare a detailed map of target sites for RNA-guided hyperactivated recombinase Gin with spacer length six. For this, chromosome wise whole genomic sequence data was retrieved from Ensembl. After that search pattern for recombinase Gin with spacer length six was designed. By using this search pattern, RGR target sites were located by using dreg program of Emboss package. Result: Total number of RGR target sites identified in bovine genome for recombinase Gin was 677 with spacer length six. It was also investigated that whether these RGR target sites are present with in any gene or not and it was found that RGR target sites lies in both genic and intergenic region. Besides this, description of genes in context with these target sites was identified.

Download Full-text

Natural Language Processing and Biological Methods

Encyclopedia of Artificial Intelligence ◽

10.4018/978-1-59904-849-9.ch171 ◽

2011 ◽

pp. 1173-1178

Author(s):

Gemma Bel Enguix ◽

M. Dolores Jiménez López

Keyword(s):

Natural Language Processing ◽

Molecular Biology ◽

Natural Language ◽

Computer Science ◽

Genetic Code ◽

20Th Century ◽

Language Processing ◽

Dna Sequences ◽

Theoretical Computer Science ◽

Automated Generation

During the 20th century, biology—especially molecular biology—has become a pilot science, so that many disciplines have formulated their theories under models taken from biology. Computer science has become almost a bio-inspired field thanks to the great development of natural computing and DNA computing. From linguistics, interactions with biology have not been frequent during the 20th century. Nevertheless, because of the “linguistic” consideration of the genetic code, molecular biology has taken several models from formal language theory in order to explain the structure and working of DNA. Such attempts have been focused in the design of grammar-based approaches to define a combinatorics in protein and DNA sequences (Searls, 1993). Also linguistics of natural language has made some contributions in this field by means of Collado (1989), who applied generativist approaches to the analysis of the genetic code. On the other hand, and only from theoretical interest a strictly, several attempts of establishing structural parallelisms between DNA sequences and verbal language have been performed (Jakobson, 1973, Marcus, 1998, Ji, 2002). However, there is a lack of theory on the attempt of explaining the structure of human language from the results of the semiosis of the genetic code. And this is probably the only arrow that remains incomplete in order to close the path between computer science, molecular biology, biosemiotics and linguistics. Natural Language Processing (NLP) –a subfield of Artificial Intelligence that concerns the automated generation and understanding of natural languages— can take great advantage of the structural and “semantic” similarities between those codes. Specifically, taking the systemic code units and methods of combination of the genetic code, the methods of such entity can be translated to the study of natural language. Therefore, NLP could become another “bio-inspired” science, by means of theoretical computer science, that provides the theoretical tools and formalizations which are necessary for approaching such exchange of methodology. In this way, we obtain a theoretical framework where biology, NLP and computer science exchange methods and interact, thanks to the semiotic parallelism between the genetic code and natural language.

Download Full-text

An Approach Based Natural Language Processing for DNA Sequences Encoding Using the Global Vectors for Word Representation

Lecture Notes on Data Engineering and Communications Technologies - Innovative Systems for Intelligent Health Informatics ◽

10.1007/978-3-030-70713-2_53 ◽

2021 ◽

pp. 577-585

Author(s):

Brahim Matougui ◽

Hacene Belhadef ◽

Ilham Kitouni

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Dna Sequences ◽

Word Representation

Download Full-text

A text-based multi-span network for reading comprehension

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-200581 ◽

2021 ◽

pp. 1-13

Author(s):

Deguang Chen ◽

Ziping Ma ◽

Lin Wei ◽

Yanbin Zhu ◽

Jinlin Ma ◽

...

Keyword(s):

Reading Comprehension ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

State Of The Art ◽

Market Value ◽

The State ◽

Experimental Results ◽

Training Time ◽

The Public

Text-based reading comprehension models have great research significance and market value and are one of the main directions of natural language processing. Reading comprehension models of single-span answers have recently attracted more attention and achieved significant results. In contrast, multi-span answer models for reading comprehension have been less investigated and their performances need improvement. To address this issue, in this paper, we propose a text-based multi-span network for reading comprehension, ALBERT_SBoundary, and build a multi-span answer corpus, MultiSpan_NMU. We also conduct extensive experiments on the public multi-span corpus, MultiSpan_DROP, and our multi-span answer corpus, MultiSpan_NMU, and compare the proposed method with the state-of-the-art. The experimental results show that our proposed method achieves F1 scores of 84.10 and 92.88 on MultiSpan_DROP and MultiSpan_NMU datasets, respectively, while it also has fewer parameters and a shorter training time.

Download Full-text

Word Embedding for the French Natural Language in Health Care: Comparative Study

JMIR Medical Informatics ◽

10.2196/12310 ◽

2019 ◽

Vol 7 (3) ◽

pp. e12310 ◽

Cited By ~ 5

Author(s):

Emeric Dynomant ◽

Romain Lelong ◽

Badisse Dahamna ◽

Clément Massonnaud ◽

Gaétan Kerdelhué ◽

...

Keyword(s):

Natural Language ◽

Language Processing ◽

Feature Learning ◽

Discharge Summary ◽

Word Embedding ◽

University Hospital ◽

Training Time ◽

Formal Evaluation ◽

Wide Range ◽

Human Validation

Background Word embedding technologies, a set of language modeling and feature learning techniques in natural language processing (NLP), are now used in a wide range of applications. However, no formal evaluation and comparison have been made on the ability of each of the 3 current most famous unsupervised implementations (Word2Vec, GloVe, and FastText) to keep track of the semantic similarities existing between words, when trained on the same dataset. Objective The aim of this study was to compare embedding methods trained on a corpus of French health-related documents produced in a professional context. The best method will then help us develop a new semantic annotator. Methods Unsupervised embedding models have been trained on 641,279 documents originating from the Rouen University Hospital. These data are not structured and cover a wide range of documents produced in a clinical setting (discharge summary, procedure reports, and prescriptions). In total, 4 rated evaluation tasks were defined (cosine similarity, odd one, analogy-based operations, and human formal evaluation) and applied on each model, as well as embedding visualization. Results Word2Vec had the highest score on 3 out of 4 rated tasks (analogy-based operations, odd one similarity, and human validation), particularly regarding the skip-gram architecture. Conclusions Although this implementation had the best rate for semantic properties conservation, each model has its own qualities and defects, such as the training time, which is very short for GloVe, or morphological similarity conservation observed with FastText. Models and test sets produced by this study will be the first to be publicly available through a graphical interface to help advance the French biomedical research.

Download Full-text

Textual Profiles of Genes

Computational Text Analysis ◽

10.1093/oso/9780198567400.003.0010 ◽

2006 ◽

Author(s):

Soumya Raychaudhuri

Keyword(s):

Gene Expression ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Scientific Literature ◽

Sequence Data ◽

Expression Profiles ◽

Data Sets ◽

Expression Array ◽

Document Similarity

Using algorithms to analyze natural language text is a challenging task. Recent advances in algorithms, and increased availability of computational power and online text has resulted in incremental progress in text analysis (Rosenfeld 2000). For certain specific applications natural language processing algorithms can rival human performance. Even the simplest algorithms and approaches can glean information from the text and do it at a rate much faster than humans. In the case of functional genomics, where an individual assay might include thousands of genes, and tens of thousands of documents pertinent to those genes, the speed of text mining approaches offers a great advantage to investigators trying to understand the data. In this chapter, we will focus on techniques to convert text into simple numerical vectors to facilitate computation. Then we will go on to discuss how these vectors can be combined into textual profiles for genes; these profiles offer additional biologically meaningful information that can complement available genomics data sets. The previous chapter introduced methods to analyze gene expression data and sequence data. The focus of many analytical methods was comparing and grouping genes by similarity. Some sequence analysis methods like dynamic programming and BLAST offer opportunities to compare two sequences, while multiple sequence alignment and weight matrices provide a means to compare families of sequences. Similarly, gene expression array analysis approaches are mostly contingent on distance metrics that compare gene expression profiles to each other; clustering and classification algorithms provide a means to group similar genes. The primary goal of applying these methods was to transfer knowledge between similar genes. We can think of the scientific literature as yet another data type and define document similarity metrics. Algorithms that tap the knowledge locked in the scientific literature require sophisticated natural language processing approaches. On the other hand, assessing document similarity is a comparatively easier task. A measure of document similarity that corresponds to semantic similarity between documents can also be powerful. For example, we might conclude that two genes are related if documents that refer to them are semantically similar.

Download Full-text

A Preliminary Metagenome Analysis Based on a Combination of Protein Domains

Proteomes ◽

10.3390/proteomes7020019 ◽

2019 ◽

Vol 7 (2) ◽

pp. 19

Author(s):

Yoji Igarashi ◽

Daisuke Mori ◽

Susumu Mitsuyama ◽

Kazutoshi Yoshitake ◽

Hiroaki Ono ◽

...

Keyword(s):

Dna Sequences ◽

Genomic Sequence ◽

Sequence Data ◽

Bacterial Species ◽

Protein Domains ◽

Correlation Coefficients ◽

Environmental Data ◽

Protein Domain ◽

Homology Search ◽

Metagenomic Data

Metagenomic data have mainly been addressed by showing the composition of organisms based on a small part of a well-examined genomic sequence, such as ribosomal RNA genes and mitochondrial DNAs. On the contrary, whole metagenomic data obtained by the shotgun sequence method have not often been fully analyzed through a homology search because the genomic data in databases for living organisms on earth are insufficient. In order to complement the results obtained through homology-search-based methods with shotgun metagenomes data, we focused on the composition of protein domains deduced from the sequences of genomes and metagenomes, and we utilized them in characterizing genomes and metagenomes, respectively. First, we compared the relationships based on similarities in the protein domain composition with the relationships based on sequence similarities. We searched for protein domains of 325 bacterial species produced using the Pfam database. Next, the correlation coefficients of protein domain compositions between every pair of bacteria were examined. Every pairwise genetic distance was also calculated from 16S rRNA or DNA gyrase subunit B. We compared the results of these methods and found a moderate correlation between them. Essentially, the same results were obtained when we used partial random 100 bp DNA sequences of the bacterial genomes, which simulated raw sequence data obtained from short-read next-generation sequences. Then, we applied the method for analyzing the actual environmental data obtained by shotgun sequencing. We found that the transition of the microbial phase occurred because the seasonal change in water temperature was shown by the method. These results showed the usability of the method in characterizing metagenomic data based on protein domain compositions.

Download Full-text