Fast text processing for information retrieval

Author(s):  
Tomek Strzalkowski ◽  
Barabara Vauthey
Author(s):  
Francisco M. Couto ◽  
Mário J. Silva ◽  
Vivian Lee ◽  
Emily Dimmer ◽  
Evelyn Camon ◽  
...  

Molecular Biology research projects produced vast amounts of data, part of which has been preserved in a variety of public databases. However, a large portion of the data contains a significant number of errors and therefore requires careful verification by curators, a painful and costly task, before being reliable enough to derive valid conclusions from it. On the other hand, research in biomedical information retrieval and information extraction are nowadays delivering Text Mining solutions that can support curators to improve the efficiency of their work to deliver better data resources. Over the past decades, automatic text processing systems have successfully exploited biomedical scientific literature to reduce the researchers’ efforts to keep up to date, but many of these systems still rely on domain knowledge that is integrated manually leading to unnecessary overheads and restrictions in its use. A more efficient approach would acquire the domain knowledge automatically from publicly available biological sources, such as BioOntologies, rather than using manually inserted domain knowledge. An example of this approach is GOAnnotator, a tool that assists the verification of uncurated protein annotations. It provided correct evidence text at 93% precision to the curators and thus achieved promising results. GOAnnotator was implemented as a web tool that is freely available at http://xldb.di.fc.ul.pt/rebil/tools/goa/.


2020 ◽  
Author(s):  
Rianto Rianto ◽  
Achmad Benny Mutiara ◽  
Eri Prasetyo Wibowo ◽  
Paulus Insap Santosa

Abstract Stemming has long been used in data pre-processing in information retrieval, which aims to make affix words into root words. However, there are not many stemming methods for non-formal Indonesian text processing. The existing stemming method has high accuracy for formal Indonesian, but low for non-formal Indonesian. Thus, the stemming method which has high accuracy for non-formal Indonesian classifier model is still an open-ended challenge. This study introduces a new stemming method to solve problems in the non-formal Indonesian text data pre-processing. Furthermore, this study aims to provide comprehensive research on improving the accuracy of text classifier models by strengthening on stemming method. Using the Support Vector Machine algorithm, a text classifier model is developed, and its accuracy is checked. The experimental evaluation was done by testing 550 datasets in Indonesian using two different stemming methods. The results show that using the proposed stemming method, the text classifier model has higher accuracy than the existing methods with a score of 0.85 and 0.73, respectively. In the future, the proposed stemming method can be used to develop the Indonesian text classifier model which can be used for various purposes including text clustering, summarization, detecting hate speech, and other text processing applications.


2011 ◽  
pp. 1360-1373
Author(s):  
Francisco M. Couto ◽  
Mário J. Silva ◽  
Vivian Lee ◽  
Emily Dimmer ◽  
Evelyn Camon ◽  
...  

Molecular Biology research projects produced vast amounts of data, part of which has been preserved in a variety of public databases. However, a large portion of the data contains a significant number of errors and therefore requires careful verification by curators, a painful and costly task, before being reliable enough to derive valid conclusions from it. On the other hand, research in biomedical information retrieval and information extraction are nowadays delivering Text Mining solutions that can support curators to improve the efficiency of their work to deliver better data resources. Over the past decades, automatic text processing systems have successfully exploited biomedical scientific literature to reduce the researchers’ efforts to keep up to date, but many of these systems still rely on domain knowledge that is integrated manually leading to unnecessary overheads and restrictions in its use. A more efficient approach would acquire the domain knowledge automatically from publicly available biological sources, such as BioOntologies, rather than using manually inserted domain knowledge. An example of this approach is GOAnnotator, a tool that assists the verification of uncurated protein annotations. It provided correct evidence text at 93% precision to the curators and thus achieved promising results. GOAnnotator was implemented as a web tool that is freely available at http://xldb.di.fc.ul.pt/rebil/tools/goa/.


2021 ◽  
Author(s):  
Fahed Mubarak Braik ◽  
Abdulla Sulaiman Al Shehhi ◽  
Luigi Saputelli ◽  
Carlos Mata ◽  
Dorzhi Badmaev ◽  
...  

Abstract The purpose of this paper is to communicate the experiences in the development of an innovative concept named "ASK Thamama" as an automated data and information retrieval engine driven by artificial intelligence techniques including text analytics and natural language processing. ASK is an AI enabled conversational search engine used to retrieve information from various internal data repositories using natural language queries. The text processing and conversational engine concept is built upon available open-source software requiring minimum coding of new libraries. A data set with 1000 documents was used to validate key functionalities with an accuracy of 90% of the search queries and able to provide specific answers for 80% of queries framed as questions. The results of this work show encouraging results and demonstrate value that AI-enabled methodologies can provide natural language search by enabling automated workflows for data information retrieval. The developed AI methodology has tremendous potential of integration in an end-to-end workflow of knowledge management by utilizing available document repositories to valuable insights, with little to no human intervention.


2018 ◽  
Vol Special Issue on... (Project presentations) ◽  
Author(s):  
Bastien Kindt

This paper presents some computer tools and linguistic resources of the GREgORI project. These developments allow automated processing of texts written in the main languages of the Christian Middel East, such as Greek, Arabic, Syriac, Armenian and Georgian. The main goal is to provide scholars with tools (lemmatized indexes and concordances) making corpus-based linguistic information available. It focuses on the questions of text processing, lemmatization, information retrieval, and bitext alignment.


Sign in / Sign up

Export Citation Format

Share Document