Polarity Identification for Handwritten Text in Multilingual Documents Using Open Source Optical Character Recognition Tools

2020 ◽  
Vol 17 (9) ◽  
pp. 4045-4049
Author(s):  
C. P. Chandrika ◽  
Jagadish S. Kallimani

Sentimental analysis is a prerequisite for many applications. We propose a model which scans handwritten text in English and Kannada languages by a CamScanner and then translated into editable text by using various Open Source Optical Character Recognition tools. The performances of different OCRs are analyzed and tabulated. Sentimental analysis is performed on the statements written in both English and Kannada languages using Wordnet, Algorithmia Rest API and local dictionaries and we have obtained the satisfied results. The same sentimental analysis module is also applied on customer reviews for the mobile product and reviews are taken from Amazon Web Services. The opinion of the customer about the product can be identified correctly.

2020 ◽  
Vol 17 (9) ◽  
pp. 4267-4275
Author(s):  
Jagadish Kallimani ◽  
Chandrika Prasad ◽  
D. Keerthana ◽  
Manoj J. Shet ◽  
Prasada Hegde ◽  
...  

Optical character recognition is the process of conversion of images of text into machine-encoded text electronically or mechanically. The text on image can be handwritten, typed or printed. Some of the examples of image source can be a picture of a document, a scanned document or a text which is superimposed on an image. Most optical character recognition system does not give a 100% accurate result. This project aims at analyzing the error rate of a few open source optical character recognition systems (Boxoft OCR, ABBY, Tesseract, Free Online OCR etc.) on a set of diverse documents and makes a comparative study of the same. By this, we can study which OCR is the best suited for a document.


2019 ◽  
Vol 8 (1) ◽  
pp. 50-54
Author(s):  
Ashok Kumar Bathla . ◽  
Sunil Kumar Gupta .

Optical Character Recognition (OCR) technology allows a computer to “read” text (both typed and handwritten) the way a human brain does.Significant research efforts have been put in the area of Optical Character Segmentation (OCR) of typewritten text in various languages, however very few efforts have been put on the segmentation and skew correction of handwritten text written in Devanagari which is a scripting language of Hindi. This paper aims a novel technique for segmentation and skew correction of hand written Devanagari text. It shows the accuracy of 91% and takes less than one second to segment a particular handwritten word.


2019 ◽  
Vol 8 (04) ◽  
pp. 24586-24602
Author(s):  
Manpreet Kaur ◽  
Balwinder Singh

Text classification is a crucial step for optical character recognition. The output of the scanner is non- editable. Though one cannot make any change in scanned text image, if required. Thus, this provides the feed for the theory of optical character recognition. Optical Character Recognition (OCR) is the process of converting scanned images of machine printed or handwritten text into a computer readable format. The process of OCR involves several steps including pre-processing after image acquisition, segmentation, feature extraction, and classification. The incorrect classification is like a garbage in and garbage out. Existing methods focuses only upon the classification of unmixed characters in Arab, English, Latin, Farsi, Bangla, and Devnagari script. The Hybrid Techniques is solving the mixed (Machine printed and handwritten) character classification problem. Classification is carried out on different kind of daily use forms like as self declaration forms, admission forms, verification forms, university forms, certificates, banking forms, dairy forms, Punjab govt forms etc. The proposed technique is capable to classify the handwritten and machine printed text written in Gurumukhi script in mixed text. The proposed technique has been tested on 150 different kinds of forms in Gurumukhi and Roman scripts. The proposed techniques achieve 93% accuracy on mixed character form and 96% accuracy achieves on unmixed character forms. The overall accuracy of the proposed technique is 94.5%.


2019 ◽  
Vol 22 (1) ◽  
pp. 109-192
Author(s):  
Emily Chesley ◽  
Jillian Marcantonio ◽  
Abigail Pearson

Abstract This paper summarizes the results of an extensive test of Tesseract 4.0, an open-source Optical Character Recognition (OCR) engine with Syriac capabilities, and ascertains the current state of Syriac OCR technology. Three popular print types (S14, W64, and E22) representing the Syriac type styles Estrangela, Serto, and East Syriac were OCRed using Tesseract’s two different OCR modes (Syriac Language and Syriac Script). Handwritten manuscripts were also preliminarily tested for OCR. The tests confirm that Tesseract 4.0 may be relied upon for printed Estrangela texts but should be used with caution and human revision for Serto and East Syriac printed texts. Consonantal accuracy lies around 99% for Estrangela, between 89% and 94% for Serto, and around 89% for East Syriac. Scholars may use Tesseract to OCR Estrangela texts with a high degree of confidence, but further training of the engine will be required before Serto and East Syriac texts can be smoothly OCRed. In all type styles, human revision of the OCRed text is recommended when scholars desire an exact, error-free corpus.


Author(s):  
Sameer M. Patel ◽  
Sarvesh S. Pai ◽  
Mittal B. Jain ◽  
Vaibhav P. Vasani

Optical Character Recognition is basically the mechanical or electronic conversion of printed or handwritten text into machine understandable text. The complication of Optical Character Recognition in different conditions remains as relevant as it was in the past few years. At the present time of automation and innovations, Keyboarding remains the most common way of inputting or feeding data into computers. This is probably the most time consuming and labor-intensive operation in the industry. Automating the process of recognition of documents, credit cards, electronic invoices, and license plates of cars – all of this could help in saving time for analyzing and processing data. With the increased research and development of machine learning, the quality of text recognition is continuously growing better. Our paper is focused on providing a brief explanation of the different stages involved in the process of optical character recognition and through the proposed application; we aim to automate the process of extraction of important texts from electronic invoices. The main goal of the project is to develop a real time OCR web application with a micro service architecture, which would help in extracting necessary information from an invoice.


Optical Character Recognition or Optical Character Reader (OCR) is a pattern-based method consciousness that transforms the concept of electronic conversion of images of handwritten text or printed text in a text compiled. Equipment or tools used for that purpose are cameras and apartment scanners. Handwritten text is scanned using a scanner. The image of the scrutinized document is processed using the program. Identification of manuscripts is difficult compared to other western language texts. In our proposed work we will accept the challenge of identifying letters and letters and working to achieve the same. Image Preprocessing techniques can effectively improve the accuracy of an OCR engine. The goal is to design and implement a machine with a learning machine and Python that is best to work with more accurate than OCR's pre-built machines with unique technologies such as MatLab, Artificial Intelligence, Neural networks, etc.


Optical Character Recognition is the machine replication of human perusing. Electronic Conversion of examined pictures where picture can be type composed or printed content. It is executed utilizing Google's open source Optical Character Recognition programming called Tesseract. The OCR accepts picture as the information, gets content from that picture and afterward changes over it into whatever other language that the client needed. This framework can be helpful in different applications like banking, legitimate industry, explorers’ different ventures, and home and office robotization. It for the most part intended for individuals who are unfit to peruse any sort of content archives and to diminish the weight of information passage occupations.[4]


Sign in / Sign up

Export Citation Format

Share Document