sOCRates - a post-OCR text correction method

2021 ◽  
Author(s):  
Danny Suarez Vargas ◽  
Lucas Lima de Oliveira ◽  
Viviane P. Moreira ◽  
Guilherme Torresan Bazzo ◽  
Gustavo Acauan Lorentz

A significant portion of the textual information of interest to an organization is stored in PDF files that should be converted into plain text before their contents can be processed by an information retrieval or text mining system. When the PDF documents consist of scanned documents, optical character recognition (OCR) is typically used to extract the textual contents. OCR errors can have a negative impact on the quality of information retrieval systems since the terms in the query will not match incorrectly extracted terms in the documents. This work introduces sOCRates, a post-OCR text correction method that relies on contextual word embeddings and on a classifier that uses format, semantic, and syntactic features. Our experimental evaluation on a test collection in Portuguese showed that sOCRates can accurately correct errors and improve retrieval results.

Vehicles play a vital role in everyday transport, but traffic rule violations, vehicle theft, unauthorized entry into highly restricted areas, and a growing number of accidents are also increasing, which gradually raises the crime rate. Each vehicle has its own identity, and recognizing that identity is important. In safety and security systems, LPDR plays a major role in recognizing vehicles, and the vehicle registration number can be recognized accurately from a certain distance. License plate recognition is an efficient and cost-effective technique for detection and recognition purposes. Automatic license plate recognition (ALPR) is used to locate the license plate on the vehicle. These methods and techniques vary with conditions such as image quality, vehicle position, lighting effects, and type of image. The objective is to design an efficient automatic system that identifies authorized and unauthorized vehicles in residential societies by using the vehicle number plate. The car image is obtained from the surveillance camera at the entrance, the number plate is recognized, and the characters are extracted using optical character recognition (OCR), which converts the characters in the image to plain text. The plain text of the license plate is then cross-verified with the database to check whether the vehicle belongs to a resident or a visitor. The system sends an alert message to the security official when a new visitor requests entry to the residential area. Log details are stored separately in the database for residents and visitors. The system also provides details about parking availability in the residential area. By counting the vehicles entering and leaving the area, the availability of parking slots is displayed. The system is robust to size and lighting effects and has a high detection rate.
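The cross-verification and parking count described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' system: the names `normalize_plate` and `GateLog`, the in-memory set standing in for the database, and the sample plate strings are all hypothetical.

```python
def normalize_plate(ocr_text):
    """Uppercase the OCR output and keep only letters and digits."""
    return "".join(ch for ch in ocr_text.upper() if ch.isalnum())


class GateLog:
    """Hypothetical in-memory stand-in for the resident/visitor database."""

    def __init__(self, resident_plates, total_slots):
        self.residents = {normalize_plate(p) for p in resident_plates}
        self.total_slots = total_slots
        self.inside = set()  # plates currently inside the area

    def vehicle_in(self, ocr_text):
        """Record an entry and classify the vehicle as resident or visitor."""
        plate = normalize_plate(ocr_text)
        self.inside.add(plate)
        return "resident" if plate in self.residents else "visitor"

    def vehicle_out(self, ocr_text):
        """Record an exit; unknown plates are ignored."""
        self.inside.discard(normalize_plate(ocr_text))

    def free_slots(self):
        """Parking availability derived from the in/out count."""
        return self.total_slots - len(self.inside)


gate = GateLog(["KA 01 AB 1234"], total_slots=2)
first = gate.vehicle_in("ka-01-ab-1234")   # spacing and case differ from the stored plate
second = gate.vehicle_in("TN 09 ZZ 0001")  # not in the resident list
```

Normalizing before the lookup matters because OCR output often differs from the stored plate in spacing, case, or punctuation. A real deployment would also need to handle misread characters, which this sketch does not attempt.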


2015 ◽  
Vol 67 (4) ◽  
pp. 408-421
Author(s):  
Sri Devi Ravana ◽  
Masumeh Sadat Taheri ◽  
Prabha Rajagopal

Purpose – The purpose of this paper is to propose a method that compares the performance of paired information retrieval (IR) systems more accurately than the current method, which is based on the mean effectiveness scores of the systems across a set of identified topics/queries. Design/methodology/approach – In the proposed approach, instead of the classic method of using a set of topic scores, document-level scores are considered as the evaluation unit. These document scores are the defined document weights, which play the role of the mean average precision (MAP) score of the systems as the significance test’s statistic. The experiments were conducted using the TREC 9 Web track collection. Findings – The p-values generated through two types of significance tests, namely Student’s t-test and the Mann-Whitney test, show that by using document-level scores as the evaluation unit, the difference between IR systems is more significant compared with utilizing topic scores. Originality/value – Utilizing a suitable test collection is a primary prerequisite for the comparative evaluation of IR systems. However, in addition to reusable test collections, accurate statistical testing is a necessity for these evaluations. The findings of this study will assist IR researchers in evaluating their retrieval systems and algorithms more accurately.
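For orientation, the classic topic-level comparison that this abstract uses as its baseline can be sketched as below. The sketch computes per-topic average precision, MAP, and a paired t statistic over topic scores. It does not implement the paper's document-level weights, which the abstract does not define. The function names and sample scores are hypothetical, and treating the t-test as paired is an assumption.

```python
import math
from statistics import mean, stdev


def average_precision(ranked_relevance, total_relevant):
    """Average precision for one topic.

    ranked_relevance: 0/1 relevance flags in rank order.
    total_relevant: number of relevant documents for the topic.
    """
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant if total_relevant else 0.0


def mean_average_precision(per_topic_ap):
    """MAP is the mean of the per-topic average precision scores."""
    return mean(per_topic_ap)


def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired per-topic scores of two systems.

    Requires at least two topics and non-identical differences.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


# Hypothetical per-topic scores for two systems.
system_a = [0.5, 0.6, 0.7]
system_b = [0.4, 0.4, 0.4]
t_value = paired_t_statistic(system_a, system_b)
```

Turning the statistic into a p-value requires the t distribution with n - 1 degrees of freedom, which a statistics library would normally supply.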


2014 ◽  
Vol 2014 ◽  
pp. 1-13 ◽  
Author(s):  
Parnia Samimi ◽  
Sri Devi Ravana

Test collections are used to evaluate information retrieval systems in laboratory-based evaluation experiments. In a classic setting, generating relevance judgments involves human assessors and is a costly and time-consuming task. Researchers and practitioners are still challenged by the need for reliable and low-cost evaluation of retrieval systems. Crowdsourcing, as a novel method of data acquisition, is broadly used in many research fields. It has been shown that crowdsourcing is an inexpensive and quick solution as well as a reliable alternative for creating relevance judgments. One application of crowdsourcing in IR is judging the relevance of query-document pairs. For a crowdsourcing experiment to succeed, the relevance judgment tasks should be designed precisely, with an emphasis on quality control. This paper explores the factors that influence the accuracy of relevance judgments made by workers and how to improve the reliability of judgments in crowdsourcing experiments.


2011 ◽  
Vol 17 (2) ◽  
pp. 265-282 ◽  
Author(s):  
Ulrich Reffle

Text correction systems rely on a core mechanism where suitable correction suggestions for garbled input tokens are generated. Current systems, which are designed for documents including modern language, use some form of approximate search in a given background lexicon. Due to the large amount of spelling variation found in historical documents, special lexica for historical language can only offer restricted coverage. Hence historical language is often described in terms of a matching procedure to be applied to modern words. Given such a procedure and a base lexicon of modern words, the question arises of how to generate correction suggestions for garbled historical variants. In this paper we suggest an efficient algorithm that solves this problem. The algorithm is used for postcorrection of optical character recognition results on historical document collections.
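The "approximate search in a given background lexicon" that this abstract attributes to current systems can be illustrated with a naive sketch. It scans the whole lexicon with Levenshtein distance, so it is not the efficient algorithm the paper proposes, and it does not model historical spelling variants. The function names, the sample lexicon, and the distance threshold are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def suggest(token, lexicon, max_distance=2):
    """Lexicon entries within max_distance of token, closest first."""
    scored = sorted((levenshtein(token, word), word) for word in lexicon)
    return [word for distance, word in scored if distance <= max_distance]


# Hypothetical lexicon and a garbled OCR token.
candidates = suggest("hcuse", ["house", "mouse", "horse", "tree"])
```

A linear scan like this costs one distance computation per lexicon entry. Practical systems typically index the lexicon so that candidates can be found without visiting every word.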


2020 ◽  
Vol 8 (4) ◽  
pp. 263-269
Author(s):  
Ahmad Syarif Rosidy ◽  
Tubagus Mohammad Akhriza ◽  
Mochammad Husni

Event organizers in Indonesia often use websites to disseminate information about events through digital posters. However, manually transferring information from posters to websites is time-consuming, given the increasing number of posters uploaded. Also, information retrieval methods such as Named Entity Recognition (NER) for Indonesian posters are still rarely discussed in the literature. Moreover, applying NER to an Indonesian corpus makes it difficult to improve accuracy, because Indonesian is a low-resource language and few corpora are available as a reference. This study proposes a solution to reduce the time needed to extract information from digital posters. The proposed solution combines the NER method with the Optical Character Recognition (OCR) method to recognize text on posters, and it is developed with the support of a relevant training corpus to improve accuracy. The experimental results show that the system can increase time efficiency by 94% with 82-92% accuracy for several extracted information entities from 50 test posters.


1981 ◽  
Vol 3 (4) ◽  
pp. 177-183 ◽  
Author(s):  
Martin Lennon ◽  
David S. Peirce ◽  
Brian D. Tarry ◽  
Peter Willett

The characteristics of conflation algorithms are discussed and examples given of some algorithms which have been used for information retrieval systems. Comparative experiments with a range of keyword dictionaries and with the Cranfield document test collection suggest that there is relatively little difference in the performance of the algorithms despite the widely disparate means by which they have been developed and by which they operate.
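As a rough illustration of what a conflation algorithm does, the sketch below strips the longest matching suffix from a word so that variants map to a shared stem. It is not one of the algorithms evaluated in the paper. The suffix list, the minimum stem length, and the function names are hypothetical choices made for the example.

```python
# Hypothetical suffix list; real conflation algorithms use far richer rules.
SUFFIXES = ("ations", "ation", "ings", "ing", "ies", "es", "ed", "s")


def conflate(word, min_stem=3):
    """Strip the longest matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word


def group_by_stem(words):
    """Group word variants under their conflated stem, preserving input order."""
    groups = {}
    for word in words:
        groups.setdefault(conflate(word), []).append(word)
    return groups


groups = group_by_stem(["retrieving", "retrieved", "retrieves", "index"])
```

With a stem-keyed index, a query for one variant can match documents that contain any other variant of the same stem.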


10.29007/jx6c ◽  
2018 ◽  
Author(s):  
Deepak Vala ◽  
Umeshkumar Baria ◽  
Urvi Bhagat ◽  
Mohan Khambalkar

This paper presents an optical character recognition (OCR) robot that converts images into a computer-processable format in the form of plain text, using a Raspberry Pi and a webcam server that can live-stream video over a local network. Our ultimate goal is to identify and address the requirements for building a web-controlled robot that recognizes textual messages placed in the real world and converts them into computer-readable text files. Our objective is to integrate the appropriate techniques to demonstrate such a capability using limited hardware and software. The objective of our work is to provide an internet-controlled mobile robot with the capability of reading characters in an image and outputting strings of characters. In the project we use MOTION software, which is open source software with a number of configuration options that can be changed according to our needs. Here, it is configured so that the video can be viewed from any computer on the local network, allowing the robot to be controlled in non-line-of-sight areas.


2019 ◽  
Vol 4 (2) ◽  
pp. 335
Author(s):  
Anis Nadiah Che Abdul Rahman ◽  
Imran Ho Abdullah ◽  
Intan Safinaz Zainuddin ◽  
Azhar Jaludin

Optical Character Recognition (OCR) is a computational technology that recognizes printed characters by using photoelectric devices and computer software. It works by converting previously scanned images or texts into machine-readable and editable text. Various OCR tools are available on the market for commercial and research use, some free and some requiring purchase. The accuracy of an OCR tool's results also depends on pre-processing and segmentation algorithms. This study investigates the performance of OCR tools in converting the Parliamentary Reports of Hansard Malaysia for developing the Malaysian Hansard Corpus (MHC). Comparing four OCR tools, the study converted ten Parliamentary Reports, totaling 62 pages, to measure the conversion accuracy and error rate of each tool. In this study, all of the tools are used to convert Adobe Portable Document Format (PDF) files into plain text files (txt). The objective of this study is to give an overview, based on accuracy and error rate, of how each OCR tool works and how it can be used to assist corpus building. The study indicates that the tools differ in accuracy and error rate when converting whole documents from PDF into plain text files. The study proposes that corpus building can be made easier and more manageable when a researcher understands how an OCR tool works and can therefore choose the best OCR tool before corpus development begins.
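The abstract does not say how accuracy and error rate were computed. As one plausible illustration only, the sketch below scores OCR output against a reference transcription at the character level using `difflib.SequenceMatcher` from the Python standard library. The function name, the definition of accuracy as matched characters over reference length, and the sample strings are assumptions.

```python
from difflib import SequenceMatcher


def char_accuracy(reference, ocr_output):
    """Return (accuracy, error_rate) of OCR output against a reference.

    Accuracy is the share of reference characters that SequenceMatcher
    aligns with the OCR output. This is a heuristic alignment, not a
    minimal edit distance.
    """
    if not reference:
        raise ValueError("reference must not be empty")
    matcher = SequenceMatcher(None, reference, ocr_output, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    accuracy = matched / len(reference)
    error_rate = (len(reference) - matched) / len(reference)
    return accuracy, error_rate


# Hypothetical example: the OCR tool misreads "m" as "rn".
accuracy, error_rate = char_accuracy("parliament", "parliarnent")
```

Averaging such scores per page or per report would give one figure per tool for comparison, provided a manually checked reference text exists for each page.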

