OCR with Tesseract, Amazon Textract, and Google Document AI: a benchmarking experiment

Journal of Computational Social Science ◽

10.1007/s42001-021-00149-1 ◽

2021 ◽

Author(s):

Thomas Hegghammer

Keyword(s):

Character Recognition ◽

Optical Character Recognition ◽

English Language ◽

Computational Analysis ◽

Arabic Language ◽

Artificial Noise ◽

Arabic Text ◽

Optical Character ◽

Different Types ◽

Better Than

Abstract: Optical Character Recognition (OCR) can open up understudied historical documents to computational analysis, but the accuracy of OCR software varies. This article reports a benchmarking experiment comparing the performance of Tesseract, Amazon Textract, and Google Document AI on images of English and Arabic text. English-language book scans (n = 322) and Arabic-language article scans (n = 100) were replicated 43 times with different types of artificial noise for a corpus of 18,568 documents, generating 51,304 process requests. Document AI delivered the best results, and the server-based processors (Textract and Document AI) performed substantially better than Tesseract, especially on noisy documents. Accuracy for English was considerably higher than for Arabic. Specifying the relative performance of three leading OCR products and the differential effects of commonly found noise types can help scholars identify better OCR solutions for their research needs. The test materials have been preserved in the openly available “Noisy OCR Dataset” (NOD) for reuse in future benchmarking studies.
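The corpus and request totals quoted in this abstract can be checked with a little arithmetic. The sketch below is one reading that reproduces both figures. It rests on two assumptions that the abstract does not state: that each scan is kept in its original form alongside its 43 noisy replicas, and that the English documents were sent to all three engines while the Arabic documents were sent to only two.

```python
# Figures taken directly from the abstract.
ENGLISH_SCANS = 322
ARABIC_SCANS = 100
NOISY_REPLICAS = 43

# Assumption (not stated in the abstract): the original scan is kept
# alongside its 43 noisy replicas, giving 44 versions per scan.
versions_per_scan = 1 + NOISY_REPLICAS

english_docs = ENGLISH_SCANS * versions_per_scan
arabic_docs = ARABIC_SCANS * versions_per_scan
corpus_size = english_docs + arabic_docs

# Assumption (not stated in the abstract): English documents went to
# three engines, Arabic documents to two.
ENGINES_FOR_ENGLISH = 3
ENGINES_FOR_ARABIC = 2
total_requests = (english_docs * ENGINES_FOR_ENGLISH
                  + arabic_docs * ENGINES_FOR_ARABIC)

print(f"corpus size:      {corpus_size:,}")
print(f"process requests: {total_requests:,}")
```

Under these assumptions the script yields 18,568 documents and 51,304 requests, matching the abstract.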


OCR with Tesseract, Amazon Textract, and Google Document AI: A Benchmarking Experiment

10.31235/osf.io/6zfvs ◽

2021 ◽

Author(s):

Thomas Hegghammer

Keyword(s):

Character Recognition ◽

Optical Character Recognition ◽

English Language ◽

Computational Analysis ◽

Arabic Language ◽

Artificial Noise ◽

Arabic Text ◽

Optical Character ◽

Different Types ◽

Better Than

Optical Character Recognition (OCR) can open up understudied historical documents to computational analysis, but the accuracy of OCR software varies. This article reports a benchmarking experiment comparing the performance of Tesseract, Amazon Textract, and Google Document AI on images of English and Arabic text. English-language book scans (n=322) and Arabic-language article scans (n=100) were replicated 43 times with different types of artificial noise for a corpus of 18,568 documents, generating 51,304 process requests. Document AI delivered the best results, and the server-based processors (Textract and Document AI) were substantially more accurate than Tesseract, especially on noisy documents. Accuracy for English was considerably better than for Arabic. Specifying the relative performance of three leading OCR products and the differential effects of commonly found noise types can help scholars identify better OCR solutions for their research needs. The test materials have been preserved in the openly available "Noisy OCR Dataset" (NOD).


Optical character recognition (OCR) system for Roman script & English language using Artificial Neural Network (ANN) classifier

2016 International Conference on Research Advances in Integrated Navigation Systems (RAINS) ◽

10.1109/rains.2016.7764379 ◽

2016 ◽

Cited By ~ 4

Author(s):

Honey Mehta ◽

Sanjay Singla ◽

Aarti Mahajan

Keyword(s):

Neural Network ◽

Artificial Neural Network ◽

Character Recognition ◽

Optical Character Recognition ◽

English Language ◽

Optical Character ◽

Artificial Neural ◽

Artificial Neural Network (ANN) ◽

Roman Script ◽

ANN Classifier


Corpus-based technique for improving Arabic OCR system

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v21.i1.pp233-241 ◽

2021 ◽

Vol 21 (1) ◽

pp. 233-241

Author(s):

Ahmed Hussain Aliwy ◽

Basheer Al-Sadawi

Keyword(s):

Character Recognition ◽

Optical Character Recognition ◽

Language Model ◽

Arabic Language ◽

Document Images ◽

Statistical Language Model ◽

Text Document ◽

Optical Character ◽

Arabic OCR

Optical character recognition (OCR) is the process of converting text document images into editable and searchable text. OCR poses several challenges, particularly for the Arabic language, where it produces a high percentage of errors. This paper proposes a method for improving the output of Arabic optical character recognition (AOCR) systems, based on a statistical language model built from the large corpora available. The method detects and corrects non-word and real-word errors according to the context of the word in the sentence. The results show that the method improves AOCR output, with accuracy reaching up to 98%.
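The idea of context-based correction with a statistical language model can be illustrated with a minimal sketch. This is not the authors' method: it assumes a toy English corpus in place of large Arabic corpora, a simple bigram count model, and candidates limited to one character edit. It handles only non-word errors, leaving real-word errors aside. The names `CORPUS`, `edits1`, and `correct` are hypothetical.

```python
from collections import Counter

# Toy corpus standing in for large corpora (hypothetical data).
CORPUS = ("the cat sat on the mat . the cat ate the fish . "
          "the dog sat on the rug .").split()

UNIGRAMS = Counter(CORPUS)
BIGRAMS = Counter(zip(CORPUS, CORPUS[1:]))
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """Return all strings one delete, transpose, replace, or insert away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def correct(tokens):
    """Replace out-of-vocabulary tokens using the previous word as context."""
    out = []
    for tok in tokens:
        if tok in UNIGRAMS:
            # Known word: kept as-is (real-word errors are not handled here).
            out.append(tok)
            continue
        candidates = [c for c in edits1(tok) if c in UNIGRAMS]
        if not candidates:
            # No in-vocabulary candidate within one edit: leave unchanged.
            out.append(tok)
            continue
        prev = out[-1] if out else None
        # Rank by bigram count with the previous word, then unigram count;
        # the candidate string itself breaks remaining ties deterministically.
        best = max(candidates,
                   key=lambda c: (BIGRAMS[(prev, c)], UNIGRAMS[c], c))
        out.append(best)
    return out


print(" ".join(correct("the cst sat on the mat".split())))
```

A fuller system would score whole sentences, use smoothed probabilities instead of raw counts, and also consider replacing in-vocabulary words that are unlikely in context.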


Arabic Optical Character Recognition

Applied Signal and Image Processing ◽

10.4018/978-1-60960-477-6.ch019 ◽

2011 ◽

pp. 324-346 ◽

Cited By ~ 1

Author(s):

Husni Al-Muhtaseb ◽

Rami Qahwaji

Keyword(s):

Character Recognition ◽

Optical Character Recognition ◽

Arabic Language ◽

Text Recognition ◽

Text Segmentation ◽

Future Trends ◽

Optical Character ◽

Arabic OCR ◽

Processing Techniques ◽

Arabic Speaking

Arabic text recognition is receiving more attention from both Arabic-speaking and non-Arabic-speaking researchers. This chapter provides a general overview of the state of the art in Arabic Optical Character Recognition (OCR) and the associated text recognition technology. It also investigates the characteristics of the Arabic language with respect to OCR and discusses related research on the different phases of text recognition, including pre-processing and text segmentation, common feature extraction techniques, classification methods, and post-processing techniques. Moreover, the chapter discusses the available databases for Arabic OCR research and lists the available commercial software. Finally, it explores the challenges related to Arabic OCR and discusses possible future trends.


Improving post-processing optical character recognition documents with Arabic language using spelling error detection and correction

International Journal of Reasoning-based Intelligent Systems ◽

10.1504/ijris.2016.082957 ◽

2016 ◽

Vol 8 (3/4) ◽

pp. 91

Author(s):

Iyad Abu Doush ◽

Ahmed M. Al Trad

Keyword(s):

Error Detection ◽

Character Recognition ◽

Optical Character Recognition ◽

Arabic Language ◽

Spelling Error ◽

Post Processing ◽

Optical Character ◽

Error Detection And Correction


Improving post-processing optical character recognition documents with Arabic language using spelling error detection and correction

International Journal of Reasoning-based Intelligent Systems ◽

10.1504/ijris.2016.10003960 ◽

2016 ◽

Vol 8 (3/4) ◽

pp. 91 ◽

Cited By ~ 1

Author(s):

Ahmed M. Al Trad ◽

Iyad Abu Doush

Keyword(s):

Error Detection ◽

Character Recognition ◽

Optical Character Recognition ◽

Arabic Language ◽

Spelling Error ◽

Post Processing ◽

Optical Character ◽

Error Detection And Correction


A Proposed Arabic Text Encryption Method Using Multiple Ciphers

Webology ◽

10.14704/web/v18si04/web18131 ◽

2021 ◽

Vol 18 (Special Issue 04) ◽

pp. 319-326

Author(s):

Ammar Sabeeh Hmoud Altamimi ◽

Ali Mohsin Kaittan

Keyword(s):

Mathematical Model ◽

English Language ◽

Selection Process ◽

Arabic Language ◽

Arabic Text ◽

Method Selection ◽

Encryption Method ◽

Better Than

Most encryption techniques deal with the English language, while few deal with Arabic. As a result, many researchers are interested in ciphers applied to text written in Arabic, which is the motivation for this paper. Three cipher methods are implemented together on Arabic text, since using more than one cipher method increases the security of the algorithm. Each letter of the plaintext is encrypted by one of the three methods, chosen by a controlling process that selects one cipher method for each letter. The cipher methods used are RSA, Playfair, and Vigenère, each based on a different mathematical model. The proposed Arabic text encryption method gives better results than previous related papers.
