Comparison of text-image fusion models for high school diploma certificate classification

Chandra Ramadhan Atmaja Perdana; Hanung Adi Nugroho; Igi Ardiyanto

doi:10.21924/cst.5.1.2020.172

Automated hierarchical classification of scanned documents using convolutional neural network and regular expression

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v12i1.pp1018-1029 ◽

2022 ◽

Vol 12 (1) ◽

pp. 1018

Author(s):

Rifiana Arief ◽

Achmad Benny Mutiara ◽

Tubagus Maulana Kusuma ◽

Hustinawaty Hustinawaty

Keyword(s):

Neural Network ◽

Convolutional Neural Network ◽

Character Recognition ◽

Optical Character Recognition ◽

Regular Expression ◽

Hierarchical Classification ◽

Document Vector ◽

Classification Prediction ◽

Scanned Documents

<p>This research proposed automated hierarchical classification of scanned documents with characteristics content that have unstructured text and special patterns (specific and short strings) using convolutional neural network (CNN) and regular expression method (REM). The research data using digital correspondence documents with format PDF images from pusat data teknologi dan informasi (technology and information data center). The document hierarchy covers type of letter, type of manuscript letter, origin of letter and subject of letter. The research method consists of preprocessing, classification, and storage to database. Preprocessing covers extraction using Tesseract optical character recognition (OCR) and formation of word document vector with Word2Vec. Hierarchical classification uses CNN to classify 5 types of letters and regular expression to classify 4 types of manuscript letter, 15 origins of letter and 25 subjects of letter. The classified documents are stored in the Hive database in Hadoop big data architecture. The amount of data used is 5200 documents, consisting of 4000 for training, 1000 for testing and 200 for classification prediction documents. The trial result of 200 new documents is 188 documents correctly classified and 12 documents incorrectly classified. The accuracy of automated hierarchical classification is 94%. Next, the search of classified scanned documents based on content can be developed.</p>

Download Full-text

Algorithm for auto annotation of scanned documents based on subregion tiling and shallow networks

10.36227/techrxiv.14795592.v2 ◽

2021 ◽

Author(s):

Komuravelli Prashanth ◽

Kalidas Yeturu

Keyword(s):

Neural Network ◽

Character Recognition ◽

Optical Character Recognition ◽

High Performance ◽

Performance Metrics ◽

Training Data ◽

Mean Average Precision ◽

Average Precision ◽

Squared Error Loss Function ◽

Scanned Documents

<div>There are millions of scanned documents worldwide in around 4 thousand languages. Searching for information in a scanned document requires a text layer to be available and indexed. Preparation of a text layer requires recognition of character and sub-region patterns and associating with a human interpretation. Developing an optical character recognition (OCR) system for each and every language is a very difficult task if not impossible. There is a strong need for systems that add on top of the existing OCR technologies by learning from them and unifying disparate multitude of many a system. In this regard, we propose an algorithm that leverages the fact that we are dealing with scanned documents of handwritten text regions from across diverse domains and language settings. We observe that the text regions have consistent bounding box sizes and any large font or tiny font scenarios can be handled in preprocessing or postprocessing phases. The image subregions are smaller in size in scanned text documents compared to subregions formed by common objects in general purpose images. We propose and validate the hypothesis that a much simpler convolution neural network (CNN) having very few layers and less number of filters can be used for detecting individual subregion classes. For detection of several hundreds of classes, multiple such simpler models can be pooled to operate simultaneously on a document. The advantage of going by pools of subregion specific models is the ability to deal with incremental addition of hundreds of newer classes over time, without disturbing the previous models in the continual learning scenario. Such an approach has distinctive advantage over using a single monolithic model where subregions classes share and interfere via a bulky common neural network. We report here an efficient algorithm for building a subregion specific lightweight CNN models. The training data for the CNN proposed, requires engineering synthetic data points that consider both pattern of interest and non-patterns as well. We propose and validate the hypothesis that an image canvas in which optimal amount of pattern and non-pattern can be formulated using a means squared error loss function to influence filter for training from the data. The CNN hence trained has the capability to identify the character-object in presence of several other objects on a generalized test image of a scanned document. In this setting some of the key observations are in a CNN, learning a filter depends not only on the abundance of patterns of interest but also on the presence of a non-pattern context. Our experiments have led to some of the key observations - (i) a pattern cannot be over-expressed in isolation, (ii) a pattern cannot be under-xpressed as well, (iii) a non-pattern can be of salt and pepper type noise and finally (iv) it is sufficient to provide a non-pattern context to a modest representation of a pattern to result in strong individual sub-region class models. We have carried out studies and reported \textit{mean average precision} scores on various data sets including (1) MNIST digits(95.77), (2) E-MNIST capital alphabet(81.26), (3) EMNIST small alphabet(73.32) (4) Kannada digits(95.77), (5) Kannada letters(90.34), (6) Devanagari letters(100) (7) Telugu words(93.20) (8) Devanagari words(93.20) and also on medical prescriptions and observed high-performance metrics of mean average precision over 90%. The algorithm serves as a kernel in the automatic annotation of digital documents in diverse scenarios such as annotation of ancient manuscripts and hand-written health records.</div>

Download Full-text

Offline optical character recognition (OCR) method: An effective method for scanned documents

2019 22nd International Conference on Computer and Information Technology (ICCIT) ◽

10.1109/iccit48885.2019.9038593 ◽

2019 ◽

Author(s):

Mujibur Rahman Majumder ◽

Bahar Uddin Mahmud ◽

Busrat Jahan ◽

Mahbubul Alam

Keyword(s):

Character Recognition ◽

Optical Character Recognition ◽

Optical Character ◽

Scanned Documents

Download Full-text

A Generic System for Processing Insurance

10.3233/apc210091 ◽

2021 ◽

Author(s):

Samundeswari S ◽

Jeshoorin G ◽

Vasanth M

Keyword(s):

Character Recognition ◽

Optical Character Recognition ◽

Health Check ◽

Insurance Companies ◽

Visualization Tool ◽

Approval Process ◽

Standard Set ◽

Printed Text ◽

Scanned Documents ◽

Generic System

Insurance companies are regularly provided with health check reports by the buyers of insurance. Different forms of printed lab reports/health check reports have to be digitized for each value of captured parameters. Optical Character Recognition (OCR), is used to convert the images of handwritten, typed, printed text or any kind of scanned documents into machine-encoded text in order to digitize the values from the report. Conversion to this standard set of digital values will benefit in automating a lot of backend approval process. we collect the reports from the user and read the values from the report and scrutinize the values. Based on the values with the company’s standard set, the scrutinization is done and it is then visualized using any visualization tool. The result is presented to the user so that the user can get an idea whether he/she is eligible for insurance claim. The foremost objective of this paper is making the insurance backend approval process a lot easier and a quick response to the buyers.

Download Full-text

Javanese Character Feature Extraction Based on Shape Energy

EMITTER International Journal of Engineering Technology ◽

10.24003/emitter.v5i1.175 ◽

2017 ◽

Vol 5 (1) ◽

pp. 154-169 ◽

Cited By ~ 2

Author(s):

Galih Hendra Wibowo ◽

Riyanto Sigit ◽

Aliridho Barakbah

Keyword(s):

Feature Extraction ◽

Character Recognition ◽

Optical Character Recognition ◽

Performance Test ◽

Test Results ◽

K Nearest Neighbors ◽

Method Performance ◽

Feature Extraction Method ◽

Character Feature ◽

Better Than

Javanese character is one of Indonesia's noble culture, especially in Java. However, the number of Javanese people who are able to read the letter has decreased so that there need to be conservation efforts in the form of a system that is able to recognize the characters. One solution to these problem lies in Optical Character Recognition (OCR) studies, where one of its heaviest points lies in feature extraction which is to distinguish each character. Shape Energy is one of feature extraction method with the basic idea of how the character can be distinguished simply through its skeleton. Based on the basic idea, then the development of feature extraction is done based on its components to produce an angular histogram with various variations of multiples angle. Furthermore, the performance test of this method and its basic method is performed in Javanese character dataset, which has been obtained from various images, is 240 data with 19 labels by using K-Nearest Neighbors as its classification method. Performance values were obtained based on the accuracy which is generated through the Cross-Validation process of 80.83% in the angular histogram with an angle of 20 degrees, 23% better than Shape Energy. In addition, other test results show that this method is able to recognize rotated character with the lowest performance value of 86% at 180-degree rotation and the highest performance value of 96.97% at 90-degree rotation. It can be concluded that this method is able to improve the performance of Shape Energy in the form of recognition of Javanese characters as well as robust to the rotation.

Download Full-text

OCR with Tesseract, Amazon Textract, and Google Document AI: A Benchmarking Experiment

10.31235/osf.io/6zfvs ◽

2021 ◽

Author(s):

Thomas Hegghammer

Keyword(s):

Character Recognition ◽

Optical Character Recognition ◽

English Language ◽

Computational Analysis ◽

Arabic Language ◽

Artificial Noise ◽

Arabic Text ◽

Optical Character ◽

Different Types ◽

Better Than

Optical Character Recognition (OCR) can open up understudied historical documents to computational analysis, but the accuracy of OCR software varies. This article reports a benchmarking experiment comparing the performance of Tesseract, Amazon Textract, and Google Document AI on images of English and Arabic text. English-language book scans (n=322) and Arabic-language article scans (n=100) were replicated 43 times with different types of artificial noise for a corpus of 18,568 documents, generating 51,304 process requests. Document AI delivered the best results, and the server-based processors (Textract and Document AI) were substantially more accurate than Tesseract, especially on noisy documents. Accuracy for English was considerably better than for Arabic. Specifying the relative performance of three leading OCR products and the differential effects of commonly found noise types can help scholars identify better OCR solutions for their research needs. The test materials have been preserved in the openly available "Noisy OCR Dataset" (NOD).

Download Full-text

Algorithm for auto annotation of scanned documents based on subregion tiling and shallow networks

10.36227/techrxiv.14795592 ◽

2021 ◽

Author(s):

Komuravelli Prashanth ◽

Kalidas Yeturu

Keyword(s):

Neural Network ◽

Character Recognition ◽

Optical Character Recognition ◽

High Performance ◽

Performance Metrics ◽

Training Data ◽

Mean Average Precision ◽

Average Precision ◽

Squared Error Loss Function ◽

Scanned Documents

<div>There are millions of scanned documents worldwide in around 4 thousand languages. Searching for information in a scanned document requires a text layer to be available and indexed. Preparation of a text layer requires recognition of character and sub-region patterns and associating with a human interpretation. Developing an optical character recognition (OCR) system for each and every language is a very difficult task if not impossible. There is a strong need for systems that add on top of the existing OCR technologies by learning from them and unifying disparate multitude of many a system. In this regard, we propose an algorithm that leverages the fact that we are dealing with scanned documents of handwritten text regions from across diverse domains and language settings. We observe that the text regions have consistent bounding box sizes and any large font or tiny font scenarios can be handled in preprocessing or postprocessing phases. The image subregions are smaller in size in scanned text documents compared to subregions formed by common objects in general purpose images. We propose and validate the hypothesis that a much simpler convolution neural network (CNN) having very few layers and less number of filters can be used for detecting individual subregion classes. For detection of several hundreds of classes, multiple such simpler models can be pooled to operate simultaneously on a document. The advantage of going by pools of subregion specific models is the ability to deal with incremental addition of hundreds of newer classes over time, without disturbing the previous models in the continual learning scenario. Such an approach has distinctive advantage over using a single monolithic model where subregions classes share and interfere via a bulky common neural network. We report here an efficient algorithm for building a subregion specific lightweight CNN models. The training data for the CNN proposed, requires engineering synthetic data points that consider both pattern of interest and non-patterns as well. We propose and validate the hypothesis that an image canvas in which optimal amount of pattern and non-pattern can be formulated using a means squared error loss function to influence filter for training from the data. The CNN hence trained has the capability to identify the character-object in presence of several other objects on a generalized test image of a scanned document. In this setting some of the key observations are in a CNN, learning a filter depends not only on the abundance of patterns of interest but also on the presence of a non-pattern context. Our experiments have led to some of the key observations - (i) a pattern cannot be over-expressed in isolation, (ii) a pattern cannot be under-xpressed as well, (iii) a non-pattern can be of salt and pepper type noise and finally (iv) it is sufficient to provide a non-pattern context to a modest representation of a pattern to result in strong individual sub-region class models. We have carried out studies and reported \textit{mean average precision} scores on various data sets including (1) MNIST digits(95.77), (2) E-MNIST capital alphabet(81.26), (3) EMNIST small alphabet(73.32) (4) Kannada digits(95.77), (5) Kannada letters(90.34), (6) Devanagari letters(100) (7) Telugu words(93.20) (8) Devanagari words(93.20) and also on medical prescriptions and observed high-performance metrics of mean average precision over 90%. The algorithm serves as a kernel in the automatic annotation of digital documents in diverse scenarios such as annotation of ancient manuscripts and hand-written health records.</div>

Download Full-text

Development and customization of in-house developed OCR and its evaluation

The Electronic Library ◽

10.1108/el-01-2018-0011 ◽

2018 ◽

Vol 36 (5) ◽

pp. 766-781

Author(s):

Rajeswari S. ◽

Sai Baba Magapu

Keyword(s):

Character Recognition ◽

Optical Character Recognition ◽

Web Based ◽

Content Type ◽

Text Extraction ◽

Manual Intervention ◽

Key Phrases ◽

Microsoft Office ◽

Read Accuracy ◽

Scanned Documents

Purpose The purpose of this paper is to develop a text extraction tool for scanned documents that would extract text and build the keywords corpus and key phrases corpus for the document without manual intervention. Design/methodology/approach For text extraction from scanned documents, a Web-based optical character recognition (OCR) tool was developed. OCR is a well-established technology, so to develop the OCR, Microsoft Office document imaging tools were used. To account for the commonly encountered problem of skew being introduced, a method to detect and correct the skew introduced in the scanned documents was developed and integrated with the tool. The OCR tool was customized to build keywords and key phrases corpus for every document. Findings The developed tool was evaluated using a 100 document corpus to test the various properties of OCR. The tool had above 99 per cent word read accuracy for text only image documents. The customization of the OCR was tested with samples of Microfiches, sample of Journal pages from back volumes and samples from newspaper clips and the results are discussed in the summary. The tool was found to be useful for text extraction and processing. Social implications The scanned documents are converted to keywords and key phrases corpus. The tool could be used to build metadata for scanned documents without manual intervention. Originality/value The tool is used to convert unstructured data (in the form of image documents) to structured data (the document is converted into keywords, and key phrases database). In addition, the image document is converted to editable and searchable document.

Download Full-text

OCR with Tesseract, Amazon Textract, and Google Document AI: a benchmarking experiment

Journal of Computational Social Science ◽

10.1007/s42001-021-00149-1 ◽

2021 ◽

Author(s):

Thomas Hegghammer

Keyword(s):

Character Recognition ◽

Optical Character Recognition ◽

English Language ◽

Computational Analysis ◽

Arabic Language ◽

Artificial Noise ◽

Arabic Text ◽

Optical Character ◽

Different Types ◽

Better Than

AbstractOptical Character Recognition (OCR) can open up understudied historical documents to computational analysis, but the accuracy of OCR software varies. This article reports a benchmarking experiment comparing the performance of Tesseract, Amazon Textract, and Google Document AI on images of English and Arabic text. English-language book scans (n = 322) and Arabic-language article scans (n = 100) were replicated 43 times with different types of artificial noise for a corpus of 18,568 documents, generating 51,304 process requests. Document AI delivered the best results, and the server-based processors (Textract and Document AI) performed substantially better than Tesseract, especially on noisy documents. Accuracy for English was considerably higher than for Arabic. Specifying the relative performance of three leading OCR products and the differential effects of commonly found noise types can help scholars identify better OCR solutions for their research needs. The test materials have been preserved in the openly available “Noisy OCR Dataset” (NOD) for reuse in future benchmarking studies.

Download Full-text

FUZZY C-MEAN: A STATISTICAL FEATURE CLASSIFICATION OF TEXT AND IMAGE SEGMENTATION METHOD

International Journal of Uncertainty Fuzziness and Knowledge-Based Systems ◽

10.1142/s0218488501001137 ◽

2001 ◽

Vol 09 (06) ◽

pp. 661-671 ◽

Cited By ~ 23

Author(s):

S. CHUAI-AREE ◽

C. LURSINSAP ◽

P. SOPHASATHIT ◽

S. SIRIPANT

Keyword(s):

Image Segmentation ◽

Character Recognition ◽

Optical Character Recognition ◽

Segmentation Method ◽

Text And Image ◽

Optical Character ◽

Pixel Color ◽

Systematic Structure ◽

Cluster Location

Classification of text and image using statistical features (mean and standard deviation of pixel color values) is found to be a simple yet powerful method for text and image segmentation. The features constitute a systematic structure that segregates one from another. We identified this segregation in the form of class clustering by means of Fuzzy C-Mean method, which determined each cluster location using maximum membership defuzzification and neighborhood smoothing techniques. The method can then be applied to classify text, image, and background areas in optical character recognition (OCR) application for elaborated open document systems.

Download Full-text