approximate search
Recently Published Documents


TOTAL DOCUMENTS

66
(FIVE YEARS 12)

H-INDEX

8
(FIVE YEARS 2)

2021 ◽  
Vol 16 (4) ◽  
pp. 30-35
Author(s):  
Prachi Gurav ◽  
Sanjeev Panandikar

As the world progresses towards automation, manual search for data in large databases also needs to keep pace. When the database includes health data, even minute aspects need careful scrutiny. Keyword search techniques are helpful in extracting data from large databases, and they fall into two categories: exact and approximate. When a user searches electronic health records (EHR), a short search time is expected. To this end, this work investigates the Metaphone (exact search) and Similar_text (approximate search) techniques. We applied keyword search to data that includes symptoms and names of medicines. Our results indicate that the search time for Similar_text is better than for Metaphone.
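
To make the idea of approximate keyword search concrete, here is a minimal sketch in Python. It is not the Similar_text technique evaluated in the abstract above: it substitutes `difflib.SequenceMatcher` from the Python standard library as the similarity measure. The function name `approximate_search`, the threshold of 0.8, and the sample terms are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def approximate_search(query, terms, threshold=0.8):
    """Return (term, score) pairs whose similarity to `query` is at least
    `threshold`, sorted with the best match first.

    The score is SequenceMatcher.ratio(), a value in [0, 1]; this is an
    illustrative stand-in, not the measure used in the paper.
    """
    query = query.lower()
    scored = []
    for term in terms:
        score = SequenceMatcher(None, query, term.lower()).ratio()
        if score >= threshold:
            scored.append((term, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical vocabulary of medicine names and symptoms.
terms = ["paracetamol", "ibuprofen", "amoxicillin", "headache", "fever"]

# A misspelled query still finds the intended term.
matches = approximate_search("paracetemol", terms)
```

An exact lookup for the misspelled query would return nothing, whereas the thresholded similarity score tolerates the single wrong letter.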


2021 ◽  
Author(s):  
Ghadeer Mobasher ◽  
Lukrecia Mertova ◽  
Sucheta Ghosh ◽  
Olga Krebs ◽  
Bettina Heinlein ◽  
...  

Chemical named entity recognition (NER) is a significant step for many downstream applications, such as entity linking in the chemical text-mining pipeline. However, identifying chemical entities in biomedical text is challenging because of the diverse morphology of chemical entities and the different types of chemical nomenclature. In this work, we describe the approach we submitted to Track 2 of the BioCreative version 7 challenge, focusing on the "Chemical Identification" task of identifying chemical entities and linking them using MeSH. For this purpose, we applied a two-stage approach: (a) a fine-tuned BioBERT model to identify chemical entities, and (b) semantic approximate search in the MeSH and PubChem databases for entity linking. There was some friction between the two stages, as our rule-based approach did not harmonise optimally with the partially recognized words forwarded by the BERT component. In future work, we aim to resolve the artefacts arising from BERT tokenizers, develop joint learning of chemical named entity recognition and entity linking using pretrained transformer-based models, and compare their performance with our preliminary approach. Next, we will improve the efficiency of our approximate search in reference databases during entity linking. This task is non-trivial, as it entails determining similarity scores of large sets of trees with respect to a query tree. Ideally, this will enable flexible parametrization and rule selection for the entity linking search.


Author(s):  
Larissa C. Shimomura ◽  
Daniel S. Kaster

Similarity searching is a widely used approach to retrieve complex data (images, videos, time series, etc.). Similarity searches aim at retrieving similar data according to the intrinsic characteristics of the data. Recently, graph-based methods have emerged as a very efficient alternative for similarity retrieval, with reports indicating they have outperformed methods of other categories in several situations. This work presents two main contributions to graph-based methods for similarity searches. The first contribution is a survey on the main graph types currently employed for similarity searches and an experimental evaluation of the most representative graphs in a common platform for exact and approximate search algorithms. The second contribution is a new graph-based method called HGraph, which is a connected-partition approach to build a proximity graph and answer similarity searches. Both of our contributions and results were published and received awards in international conferences.
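
As a rough illustration of how a graph can answer an approximate similarity query, the sketch below builds a brute-force k-nearest-neighbour graph and walks it greedily towards the query. This is a generic proximity-graph search, not the HGraph method described above; the function names `build_knn_graph` and `greedy_search`, the Euclidean distance, and the toy data are assumptions for illustration.

```python
import math


def build_knn_graph(points, k):
    """Link every point to its k nearest neighbours (brute force, O(n^2 log n))."""
    graph = {}
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        graph[i] = others[:k]
    return graph


def greedy_search(points, graph, query, start=0):
    """Move to the neighbour closest to the query until no neighbour improves.

    The result is approximate: the walk can stop in a local minimum
    that is not the true nearest neighbour.
    """
    current = start
    current_dist = math.dist(points[current], query)
    while True:
        best, best_dist = current, current_dist
        for neighbour in graph[current]:
            d = math.dist(points[neighbour], query)
            if d < best_dist:
                best, best_dist = neighbour, d
        if best == current:
            return current, current_dist
        current, current_dist = best, best_dist


# Toy data: five points on a line, each linked to its two nearest neighbours.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
graph = build_knn_graph(points, k=2)
nearest_index, nearest_dist = greedy_search(points, graph, (3.2, 0.0))
```

Only the neighbours of visited vertices are compared against the query, which is where graph-based methods save distance computations relative to a full scan.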


2021 ◽  
Vol 16 (1) ◽  
Author(s):  
Galia R. Zimerman ◽  
Dina Svetlitsky ◽  
Meirav Zehavi ◽  
Michal Ziv-Ukelson

Abstract: Gene clusters are groups of genes that are co-locally conserved across various genomes, not necessarily in the same order. Their discovery and analysis is valuable in tasks such as gene annotation and prediction of gene interactions, and in the study of genome organization and evolution. The discovery of conserved gene clusters in a given set of genomes is a well-studied problem, but the rapid sequencing of prokaryotic genomes inspires a new one: given an already known gene cluster that was discovered and studied in one genomic dataset, identify all the instances of the gene cluster in a given new genomic sequence. Thus, we define a new problem in comparative genomics, denoted PQ-Tree Search, that takes as input a PQ-tree T representing the known gene orders of a gene cluster of interest, a gene-to-gene substitution scoring function h, integer arguments $$d_T$$ and $$d_S$$, and a new sequence of genes S. The objective is to identify in S approximate new instances of the gene cluster; these instances could vary from the known gene orders by genome rearrangements that are constrained by T, by gene substitutions that are governed by h, and by gene deletions and insertions that are bounded from above by $$d_T$$ and $$d_S$$, respectively. We prove that PQ-Tree Search is NP-hard and propose a parameterized algorithm that solves the optimization variant of PQ-Tree Search in $$O^*(2^{\gamma})$$ time, where $$\gamma$$ is the maximum degree of a node in T and $$O^*$$ is used to hide factors polynomial in the input size. The algorithm is implemented as a search tool, denoted PQFinder, and applied to search for instances of chromosomal gene clusters in plasmids, within a dataset of 1,487 prokaryotic genomes. We report on 29 chromosomal gene clusters that are rearranged in plasmids, where the rearrangements are guided by the corresponding PQ-trees. One of these results, coding for a heavy metal efflux pump, is further analysed to exemplify how PQFinder can be harnessed to reveal interesting new structural variants of known gene clusters.


2021 ◽  
Author(s):  
Silvana Ilie

Background: The most frequently used tools in bioinformatics are those searching for similarities, or local alignments, between biological sequences. Since the exact dynamic programming algorithm is quadratic, linear-time heuristics such as BLAST are used. Spaced seeds are much more sensitive than the consecutive seed of BLAST, and using several seeds represents the current state of the art in approximate search for biological sequences. The most important aspect is computing highly sensitive seeds. Since the problem seems hard, heuristic algorithms are used. The leading software in the common Bernoulli model is the SpEED program. Findings: SpEED uses a hill climbing method based on the overlap complexity heuristic. We propose a new algorithm for this heuristic that improves its speed by over one order of magnitude. We use the new implementation to compute improved seeds for several software programs. We also compute multiple seeds of the same weight as MegaBLAST, which greatly improve its sensitivity. Conclusion: Multiple spaced seeds are being successfully used in bioinformatics software programs. Enabling researchers to compute high-quality seeds very quickly will help expand the range of their applications.
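
The difference between a consecutive seed and a spaced seed can be sketched in a few lines of Python. This is only an illustration of seed matching, not SpEED or the seed-computation algorithm proposed above. The seed notation (`1` for a position that must match, `0` for a don't-care position), the function names, and the toy sequences are assumptions.

```python
from collections import defaultdict


def seed_key(seq, start, seed):
    """Extract the characters of seq at the '1' positions of the seed."""
    return "".join(seq[start + k] for k, c in enumerate(seed) if c == "1")


def seed_hits(a, b, seed):
    """Return all (i, j) where a[i:] and b[j:] agree at every '1' position."""
    span = len(seed)
    # Index every window of `a` by its seed key.
    index = defaultdict(list)
    for i in range(len(a) - span + 1):
        index[seed_key(a, i, seed)].append(i)
    # Look up every window of `b` in the index.
    hits = []
    for j in range(len(b) - span + 1):
        for i in index.get(seed_key(b, j, seed), []):
            hits.append((i, j))
    return hits


# Two toy sequences that differ at a single position (index 2).
a = "ACGTA"
b = "ACTTA"

consecutive_hits = seed_hits(a, b, "111")   # weight 3, no don't-care position
spaced_hits = seed_hits(a, b, "1101")       # weight 3, one don't-care position
```

Both seeds have weight 3, but here only the spaced seed produces a hit, because its don't-care position lines up with the mismatch.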


Author(s):  
Andrés Cano ◽  
Manuel Gómez-Olmedo ◽  
Serafín Moral ◽  
Serafín Moral-García

Given a set of uncertain discrete variables with a joint probability distribution and a set of observations for some of them, the most probable explanation is a configuration of values for the non-observed variables that maximizes the conditional probability of these variables given the observations. This is a hard problem that can be solved by a deletion algorithm with max marginalization, whose complexity is similar to that of computing conditional probabilities. When this approach is infeasible, an alternative is to carry out an approximate deletion algorithm, which can be used to guide the search for the most probable explanation using A* or branch and bound (the approximate+search approach). The most common approximation procedure has been the mini-bucket approach. This paper shows that representing potentials as probability trees, with pruning of branches that have similar values, can improve the performance of this procedure. This is corroborated by an experimental study in which computation times are compared using randomly generated Bayesian networks and benchmark Bayesian networks from UAI competitions.


2020 ◽  
Vol 67 ◽  
pp. 581-606 ◽  
Author(s):  
Lemao Liu ◽  
Andrew Finch ◽  
Masao Utiyama ◽  
Eiichiro Sumita

Recurrent neural networks are extremely appealing for sequence-to-sequence learning tasks. Despite their great success, they typically suffer from a shortcoming: they are prone to generating unbalanced targets with good prefixes but bad suffixes, and thus performance suffers when dealing with long sequences. We propose a simple yet effective approach to overcome this shortcoming. Our approach relies on the agreement between a pair of target-directional RNNs, which generates more balanced targets. In addition, we develop two efficient approximate search methods for agreement that are empirically shown to be almost optimal in terms of either sequence-level or non-sequence-level metrics. Extensive experiments were performed on three standard sequence-to-sequence transduction tasks: machine transliteration, grapheme-to-phoneme transformation, and machine translation. The results show that the proposed approach achieves consistent and substantial improvements compared to many state-of-the-art systems.


2020 ◽  
pp. 235-271
Author(s):  
Thomas Mailund