Corpus-based technique for improving Arabic OCR system

Author(s):  
Ahmed Hussain Aliwy ◽  
Basheer Al-Sadawi

Optical character recognition (OCR) refers to the process of converting text document images into editable and searchable text. OCR poses several challenges, particularly for the Arabic language, where it produces a high percentage of errors. In this paper, a method to improve the output of Arabic optical character recognition (AOCR) systems is proposed, based on a statistical language model built from the available large corpora. The method detects and corrects both non-word and real-word errors according to the context of the word in the sentence. The results show an improvement of up to 98% as the new accuracy of the AOCR output.
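The corrector described above scores candidate words with a statistical language model. A minimal sketch of that idea, using a toy English corpus and a smoothed bigram score; the corpus, alphabet, and function names are illustrative placeholders, not the authors' implementation:

```python
from collections import Counter

# Toy corpus standing in for the large corpora described in the paper.
corpus = "the cat sat on the mat the cat ran on the mat".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def candidates(word, vocab):
    """Vocabulary words within one edit (deletion/substitution/insertion) of `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    dels = {a + b[1:] for a, b in splits if b}
    subs = {a + c + b[1:] for a, b in splits if b for c in "abcdefghijklmnopqrstuvwxyz"}
    ins = {a + c + b for a, b in splits for c in "abcdefghijklmnopqrstuvwxyz"}
    near = dels | subs | ins
    return [w for w in vocab if w == word or w in near]

def correct(prev, word):
    """Pick the candidate maximizing a smoothed bigram score for P(word | prev)."""
    cands = candidates(word, unigrams) or [word]
    return max(cands, key=lambda w: (bigrams[(prev, w)] + 1) * unigrams[w])

print(correct("the", "cst"))  # a non-word OCR error: 'cst' -> 'cat'
```

Because the score conditions on the previous word, the same machinery can prefer one real word over another in context, which is how real-word errors become detectable at all.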

Author(s):  
Husni Al-Muhtaseb ◽  
Rami Qahwaji

Arabic text recognition is receiving increasing attention from both Arabic- and non-Arabic-speaking researchers. This chapter provides a general overview of the state of the art in Arabic Optical Character Recognition (OCR) and the associated text recognition technology. It also investigates the characteristics of the Arabic language with respect to OCR and discusses related research on the different phases of text recognition, including pre-processing and text segmentation, common feature extraction techniques, classification methods, and post-processing techniques. Moreover, the chapter discusses the available databases for Arabic OCR research and lists the available commercial software. Finally, it explores the challenges related to Arabic OCR and discusses possible future trends.


2015 ◽  
Vol 15 (01) ◽  
pp. 1550002
Author(s):  
Brij Mohan Singh ◽  
Rahul Sharma ◽  
Debashis Ghosh ◽  
Ankush Mittal

In many documents, such as maps, engineering drawings, and artistic documents, printed as well as handwritten material contains text regions and text-lines that are not parallel to each other, are curved in nature, and vary widely: different font sizes, text and non-text areas lying close to each other, and non-straight, skewed, and warped text-lines. Commercially available optical character recognition (OCR) systems, such as ABBYY FineReader and FreeOCR, are not capable of handling the full range of stylistic document images containing curved, multi-oriented, and stylish-font text-lines. Extracting individual text-lines and words from such documents is generally not straightforward. Most reported segmentation work addresses simple documents, and it remains a highly challenging task to implement an OCR system that works under all possible conditions and gives highly accurate results, especially for stylistic documents. This paper presents an approach based on dilation and flood-fill morphological operations that extracts multi-oriented text-lines and words from complex-layout or stylistic document images in subsequent stages. The segmentation results obtained with our method prove superior to the standard profiling-based method.
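The two operations the approach rests on can be sketched on a tiny binary grid: dilation merges nearby ink marks, and flood fill then labels each merged blob as one word/line component. This is a generic illustration of the morphology on a hand-made 3x5 "page", not the authors' pipeline:

```python
from collections import deque

def dilate(img, k=1):
    """Binary dilation: a pixel becomes 1 if any neighbour within `k` is 1."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if any(img[j][i]
                   for j in range(max(0, y - k), min(h, y + k + 1))
                   for i in range(max(0, x - k), min(w, x + k + 1))):
                out[y][x] = 1
    return out

def flood_fill_components(img):
    """Label 4-connected components; each approximates one word/line blob."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for y in range(h):
        for x in range(w):
            if img[y][x] and not seen[y][x]:
                q, blob = deque([(y, x)]), []
                seen[y][x] = True
                while q:
                    cy, cx = q.popleft()
                    blob.append((cy, cx))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = cy + dy, cx + dx
                        if 0 <= ny < h and 0 <= nx < w and img[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            q.append((ny, nx))
                comps.append(blob)
    return comps

# Two ink marks separated by a one-pixel gap merge into one blob after dilation.
page = [[0, 0, 0, 0, 0],
        [0, 1, 0, 1, 0],
        [0, 0, 0, 0, 0]]
print(len(flood_fill_components(page)))          # 2 separate marks
print(len(flood_fill_components(dilate(page))))  # 1 merged word blob
```

Because dilation is direction-agnostic, the same merge step works on curved or multi-oriented text-lines, which is what makes it attractive for the stylistic documents discussed above.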


Author(s):  
M A Mikheev ◽  
P Y Yakimov

The article is devoted to solving the problem of comparing document versions in electronic document management systems. Analogous systems were reviewed and the process of comparing text documents was studied. To recognize the text in a scanned image, optical character recognition technology and its implementation, the Tesseract library, were chosen. The Myers algorithm is applied to compare the recognized texts. The software module for text document comparison was implemented using the solutions described above.
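In the described pipeline, OCR (e.g. Tesseract) yields two recognized texts, and a diff algorithm produces the change list. The Myers algorithm computes a shortest edit script in O(ND) time; the sketch below computes the same kind of script with a plain LCS dynamic-programming table, which is slower but clearer for illustration. The document lines are invented:

```python
def diff(old, new):
    """Shortest edit script between two line lists via an LCS table.
    (Myers' O(ND) algorithm computes an equivalent script more efficiently.)"""
    m, n = len(old), len(new)
    lcs = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            lcs[i][j] = (lcs[i + 1][j + 1] + 1 if old[i] == new[j]
                         else max(lcs[i + 1][j], lcs[i][j + 1]))
    i = j = 0
    ops = []
    while i < m and j < n:
        if old[i] == new[j]:
            ops.append(("  ", old[i])); i += 1; j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            ops.append(("- ", old[i])); i += 1      # line removed in new version
        else:
            ops.append(("+ ", new[j])); j += 1      # line added in new version
    ops += [("- ", l) for l in old[i:]] + [("+ ", l) for l in new[j:]]
    return ops

doc_v1 = ["Introduction", "Scope of work", "Budget: 10"]
doc_v2 = ["Introduction", "Scope of work", "Budget: 12", "Schedule"]
for tag, line in diff(doc_v1, doc_v2):
    print(tag + line)
```

Running this prints the unchanged lines, the removed "Budget: 10", and the added "Budget: 12" and "Schedule", which is exactly the kind of version report the comparison module needs to render.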


2021 ◽  
pp. 894-911
Author(s):  
Bhavesh Kataria ◽  
Dr. Harikrishna B. Jethva

India's constitution recognizes 22 languages written in 17 different scripts. The source materials have a limited lifespan; as generations pass, they deteriorate and vital knowledge is lost. This work uses digital texts to convey that information to future generations. Optical Character Recognition (OCR) helps extract information from scanned manuscripts (printed text). This paper proposes a simple and effective solution for optical character recognition (OCR) of Sanskrit characters from text document images using long short-term memory (LSTM) neural networks. Existing methods focus only on single touching characters. Our main focus, however, is to design a robust method using a Bidirectional Long Short-Term Memory (BLSTM) architecture for overlapping lines, characters touching in the middle and upper zones, and half characters, which would increase the accuracy of present OCR systems for the recognition of poorly maintained Sanskrit literature.


Author(s):  
Dr. T. Kameswara Rao ◽  
K. Yashwanth Chowdary ◽  
I. Koushik Chowdary ◽  
K. Prasanna Kumar ◽  
Ch. Ramesh

In recent years, text extraction from document images has been one of the most widely studied topics in image analysis and optical character recognition. Such extraction can be used for document analysis, content analysis, document retrieval, and more. Many complex text-extraction processes, such as maximum likelihood (ML) estimation, edge point detection, and corner point detection, are used to extract text from document images. In this article, the corner point approach was used. To extract text from images we used a very simple approach based on the FAST algorithm. First, we divided the image into blocks and checked the corner-point density in each block. The denser blocks were labeled as text blocks, and the less dense ones as image regions or noise. We then checked the connectivity of the blocks and grouped them, so that the text part could be isolated from the image. This method is fast and versatile: it can be used to detect various languages and handwriting, and even works on images with substantial noise and blur. Even though it is a very simple program, the precision of this method is close to or higher than 90%. In conclusion, this method enables more accurate and less complex detection of text in document images.
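The block-density-and-connectivity steps can be sketched as below. The sketch assumes corner points have already been produced by a FAST detector (e.g. OpenCV's `cv2.FastFeatureDetector`); the block size, density threshold, and the corner list itself are arbitrary illustrative choices:

```python
def text_blocks(corners, width, height, block=10, min_corners=3):
    """Label each block as text if it holds at least `min_corners` corner points,
    then group 4-connected text blocks so text regions can be isolated."""
    cols, rows = width // block, height // block
    density = [[0] * cols for _ in range(rows)]
    for x, y in corners:
        density[y // block][x // block] += 1
    text = {(r, c) for r in range(rows) for c in range(cols)
            if density[r][c] >= min_corners}          # sparse blocks = image/noise
    groups, seen = [], set()
    for cell in sorted(text):
        if cell in seen:
            continue
        stack, group = [cell], []
        seen.add(cell)
        while stack:
            r, c = stack.pop()
            group.append((r, c))
            for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if nb in text and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        groups.append(group)
    return groups

# A dense corner cluster (text) plus one stray corner (noise).
corners = [(2, 2), (4, 3), (7, 5), (3, 8), (8, 8), (6, 2), (55, 55)]
print(text_blocks(corners, 60, 60))  # only the dense block survives
```

Text glyphs produce many FAST corners per unit area while smooth image regions produce few, which is why a simple per-block count separates the two so cheaply.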


Sensors ◽  
2020 ◽  
Vol 20 (10) ◽  
pp. 2914
Author(s):  
Hubert Michalak ◽  
Krzysztof Okarma

Image binarization is one of the key operations for reducing the amount of information used in further analysis of image data, and it significantly influences the final results. In some applications, where well-illuminated, high-contrast images can easily be captured, even simple global thresholding may be sufficient; other cases are more challenging, e.g., the analysis of natural images or images with quality degradations, such as historical document images. Considering the variety of image binarization methods, as well as their different applications and image types, one cannot expect a single universal thresholding method to be the best solution for all images. Nevertheless, since one of the most common operations preceded by binarization is Optical Character Recognition (OCR), which may also be applied to non-uniformly illuminated images captured by camera sensors mounted in mobile phones, the development of better binarization methods that maximize OCR accuracy is still expected. Therefore, this paper presents the idea of robust combined measures, making it possible to bring together the advantages of various methods, including some recently proposed approaches based on entropy filtering and a multi-layered stack of regions. The experimental results, obtained for a dataset of 176 non-uniformly illuminated document images, referred to as the WEZUT OCR Dataset, confirm the validity and usefulness of the proposed approach, leading to a significant increase in recognition accuracy.
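The general idea of combining binarization methods (not the paper's specific entropy-filtering or stack-of-regions measures) can be illustrated by voting between a global and a local threshold on a non-uniformly illuminated strip; the pixel values and the bias parameter are invented:

```python
def global_threshold(img):
    """Single threshold at the global mean intensity; 1 = ink."""
    pixels = [p for row in img for p in row]
    t = sum(pixels) / len(pixels)
    return [[1 if p < t else 0 for p in row] for row in img]

def local_threshold(img, win=1, bias=5):
    """Adaptive threshold at the local window mean minus a small bias."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            nb = [img[j][i]
                  for j in range(max(0, y - win), min(h, y + win + 1))
                  for i in range(max(0, x - win), min(w, x + win + 1))]
            out[y][x] = 1 if img[y][x] < sum(nb) / len(nb) - bias else 0
    return out

def combined(img):
    """Combine the individual binarizations (here: agreement of both methods)."""
    g, l = global_threshold(img), local_threshold(img)
    return [[gv & lv for gv, lv in zip(gr, lr)] for gr, lr in zip(g, l)]

# Bright left half (bg 200, ink 90), dim right half (bg 120, ink 40).
strip = [[200, 90, 200, 120, 40, 120]]
print(global_threshold(strip))  # global mean misreads the dim background as ink
print(combined(strip))          # agreement keeps only the true ink pixels
```

On this strip the global threshold falls between the two background levels and so marks the dim background as ink; requiring agreement with the local threshold removes those false positives while keeping both real ink pixels.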


Author(s):  
Shourya Roy ◽  
L. Venkata Subramaniam

Accdrnig to rscheearch at Cmabrigde Uinervtisy, it deosn’t mttaer in what oredr the ltteers in a wrod are, the olny iprmoetnt tihng is that the frist and lsat ltteer be at the rghit pclae. Tihs is bcuseae the human mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe. Unfortunately, computing systems are not yet as smart as the human mind. Over the last couple of years a significant number of researchers have been focussing on noisy text analytics. Noisy text data is found in informal settings (online chat, SMS, e-mails, message boards, among others) and in text produced through automated speech recognition or optical character recognition systems. Noise can degrade the performance of other information processing algorithms such as classification, clustering, summarization and information extraction. We will identify some of the key research areas for noisy text and give a brief overview of the state of the art. These areas are: (i) classification of noisy text, (ii) correcting noisy text, (iii) information extraction from noisy text. We cover the first one in this chapter and the latter two in the next chapter. We define noise in text as any kind of difference in the surface form of an electronic text from the intended, correct or original text. We see such noisy text every day in various forms. Each of them has unique characteristics and hence requires special handling. We introduce some such forms of noisy textual data in this section.

Online Noisy Documents: E-mails, chat logs, scrapbook entries, newsgroup postings, threads in discussion fora, blogs, etc., fall under this category. People are typically less careful about the sanity of written content in such informal modes of communication. These are characterized by frequent misspellings, commonly and not so commonly used abbreviations, incomplete sentences, missing punctuation and so on. Almost always, noisy documents are human-interpretable, if not by everyone, at least by the intended readers.

SMS: Short Message Services are becoming more and more common. Language usage in SMS text significantly differs from the standard form of the language. An urge towards shorter message length facilitating faster typing, and the need for semantic clarity, shape the structure of this non-standard form known as the texting language (Choudhury et al., 2007).

Text Generated by ASR Devices: ASR is the process of converting a speech signal to a sequence of words. An ASR system takes a speech signal, such as monologues, discussions between people, telephonic conversations, etc., as input and produces a string of words, typically not demarcated by punctuation, as transcripts. An ASR system consists of an acoustic model, a language model and a decoding algorithm. The acoustic model is trained on speech data and the corresponding manual transcripts. The language model is trained on a large monolingual corpus. ASR converts audio into text by searching the acoustic-model and language-model space using the decoding algorithm. Most conversations at contact centers today between agents and customers are recorded. To do any processing of this data to obtain customer intelligence, it is necessary to convert the audio into text.

Text Generated by OCR Devices: Optical character recognition, or ‘OCR’, is a technology that allows digital images of typed or handwritten text to be transferred into an editable text document. It takes the picture of the text and translates the text into Unicode or ASCII. For handwritten optical character recognition, the recognition rate is 80% to 90% with clean handwriting.

Call Logs in Contact Centers: Today’s contact centers (also known as call centers, BPOs, KPOs) produce huge amounts of unstructured data in the form of call logs, apart from e-mails, call transcriptions, SMS, chat transcripts, etc. Agents are expected to summarize an interaction as soon as they are done with it and before picking up the next one. As the agents work under immense time pressure, the summary logs are very poorly written and sometimes even difficult for humans to interpret. Analysis of such call logs is important to identify problem areas, agent performance, evolving problems, etc.

In this chapter we focus on automatic classification of noisy text. Automatic text classification refers to segregating documents into different topics depending on content, for example, categorizing customer e-mails according to topics such as billing problems, address changes or product enquiries. It has important applications in e-mail categorization, building and maintaining web directories (e.g. DMoz), spam filtering, automatic call and e-mail routing in contact centers, pornographic-material filtering and so on.
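Automatic classification of noisy text, e.g. routing noisy customer e-mails by topic, can be sketched with a bag-of-words Naive Bayes classifier; the training snippets and labels below are invented stand-ins for real contact-center data:

```python
import math
from collections import Counter, defaultdict

# Tiny hypothetical training set of noisy customer e-mails.
train = [
    ("billing", "wrng amnt on my bill pls fix"),
    ("billing", "bill charged twice refund pls"),
    ("address", "moved hse update my addr"),
    ("address", "new addr pls change address on file"),
]

def fit(data):
    """Collect per-class word counts and class frequencies."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for label, text in data:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    return word_counts, class_counts

def classify(text, word_counts, class_counts):
    """Pick the class maximizing the add-one-smoothed Naive Bayes log score."""
    vocab = {w for c in word_counts.values() for w in c}
    best, best_score = None, -math.inf
    for label, n in class_counts.items():
        total = sum(word_counts[label].values())
        score = math.log(n / sum(class_counts.values()))
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

model = fit(train)
print(classify("refund my bill pls", *model))  # -> billing
```

Note that the classifier never normalizes the noisy spellings; it simply learns that tokens like "pls" and "wrng" co-occur with certain topics, which is one reason bag-of-words methods degrade gracefully on noisy text.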


Author(s):  
Neha. N

Document image processing is an increasingly important technology, essential to all optical character recognition (OCR) systems and to the automation of various office documents. A document originally has zero skew (tilt), but when a page is scanned or photocopied, skew may be introduced due to various factors and is practically unavoidable. The presence of even a small amount of skew (0.5°) has detrimental effects on document analysis, as it directly affects the reliability and efficiency of the segmentation, recognition and feature extraction stages. Therefore, removal of skew is of paramount importance in the field of document analysis and OCR and is the first step to be accomplished. This paper presents a novel technique for skew detection and correction which is both language and content independent. The proposed technique is based on the maximum density of the foreground pixels and their orientation in the document image. Unlike other conventional algorithms, which work only for machine-printed textual documents scripted in English, this technique works well for all kinds of document images (machine-printed, handwritten, complex, noisy and simple). The technique presented here was tested on 150 different document image samples and was found to provide results with an accuracy of 0.1°.
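A common skew-detection scheme in the same spirit as the density idea above (though not necessarily the paper's exact method) tries candidate angles and keeps the one whose horizontal projection profile is most concentrated. A sketch on synthetic pixel data, with a coarse 1° search grid for brevity:

```python
import math

def profile_energy(points, angle_deg):
    """Sum of squared row counts of the projection profile after rotating the
    foreground pixels by -angle; a concentrated (sharp) profile scores high."""
    a = math.radians(angle_deg)
    rows = {}
    for x, y in points:
        r = round(-x * math.sin(a) + y * math.cos(a))
        rows[r] = rows.get(r, 0) + 1
    return sum(c * c for c in rows.values())

def detect_skew(points, search=range(-10, 11)):
    """Estimated skew = candidate angle with the most concentrated profile."""
    return max(search, key=lambda a: profile_energy(points, a))

# A synthetic text-line of foreground pixels skewed by 5 degrees.
skew = math.radians(5)
line = [(x, round(x * math.tan(skew))) for x in range(100)]
print(detect_skew(line))  # -> 5
```

At the true skew angle all pixels of a text-line collapse into a few profile rows, maximizing the energy; a real implementation would refine the search grid to reach sub-degree accuracy such as the 0.1° reported above.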


Author(s):  
Md. Anwar Hossain ◽  
Sadia Afrin

This paper presents an innovative design for optical character recognition (OCR) from text images using the template matching method. OCR is an important research area and one of the most successful applications of technology in the field of pattern recognition and artificial intelligence. OCR provides full alphanumeric recognition of printed and handwritten characters by scanning text images and converting them into a corresponding editable text document. The main objective of this work is to develop a prototype of the OCR system and to implement the template matching algorithm to drive the prototype. In this paper, grayscale, bitmap-format images of the alphabet (A-Z and a-z) and numbers (0-9) were used, and the alphabet and numbers were recognized by comparing two images. In addition, we checked the accuracy for different fonts of the alphabet and numbers. MATLAB R2018a was used for the implementation of the system.
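The core of template matching is a pixel-agreement score between a scanned glyph and each stored template. A minimal sketch with tiny hand-made 3x5 binary glyphs (real templates would be rendered from actual fonts at the scanned resolution):

```python
# Hypothetical 3x5 binary glyph templates, one string row per pixel row.
TEMPLATES = {
    "I": ["111", "010", "010", "010", "111"],
    "O": ["111", "101", "101", "101", "111"],
    "L": ["100", "100", "100", "100", "111"],
}

def similarity(glyph, template):
    """Fraction of pixels on which the scanned glyph and the template agree."""
    total = agree = 0
    for g_row, t_row in zip(glyph, template):
        for g, t in zip(g_row, t_row):
            total += 1
            agree += (g == t)
    return agree / total

def recognise(glyph):
    """Template matching: return the template character with the best score."""
    return max(TEMPLATES, key=lambda ch: similarity(glyph, TEMPLATES[ch]))

noisy_O = ["111",
           "101",
           "111",   # one flipped pixel in the middle row
           "101",
           "111"]
print(recognise(noisy_O))  # -> 'O'
```

Because the score is a simple agreement ratio, a glyph with one corrupted pixel still matches its correct template far better than any other, which is why template matching tolerates modest scanning noise.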

