Automated hierarchical classification of scanned documents using convolutional neural network and regular expression

Author(s):  
Rifiana Arief ◽  
Achmad Benny Mutiara ◽  
Tubagus Maulana Kusuma ◽  
Hustinawaty Hustinawaty

<p>This research proposes automated hierarchical classification of scanned documents whose content is characterized by unstructured text and special patterns (specific, short strings), using a convolutional neural network (CNN) and the regular expression method (REM). The research data are digital correspondence documents in PDF-image format from Pusat Data Teknologi dan Informasi (the Technology and Information Data Center). The document hierarchy covers type of letter, type of manuscript letter, origin of letter, and subject of letter. The research method consists of preprocessing, classification, and storage to a database. Preprocessing covers text extraction using Tesseract optical character recognition (OCR) and formation of document word vectors with Word2Vec. Hierarchical classification uses the CNN to classify 5 types of letters and regular expressions to classify 4 types of manuscript letters, 15 origins of letters, and 25 subjects of letters. The classified documents are stored in the Hive database in a Hadoop big data architecture. The dataset comprises 5,200 documents: 4,000 for training, 1,000 for testing, and 200 for classification prediction. In the trial of 200 new documents, 188 were classified correctly and 12 incorrectly, giving an automated hierarchical classification accuracy of 94%. Content-based search of the classified scanned documents can be developed next.</p>
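The regular-expression stage of such a hierarchy can be sketched as follows. The class names and patterns below are hypothetical stand-ins, since the paper's actual rules are not published; only the mechanism (first matching rule wins, fallback to an unknown class) reflects the described approach.

```python
import re

# Illustrative regex rules for the manuscript-letter level of the hierarchy.
# These label names and pattern strings are placeholders, not the paper's rules.
MANUSCRIPT_RULES = {
    "invitation": re.compile(r"\bundangan\b", re.IGNORECASE),
    "memo":       re.compile(r"\bnota\s+dinas\b", re.IGNORECASE),
    "decree":     re.compile(r"\bkeputusan\b", re.IGNORECASE),
    "report":     re.compile(r"\blaporan\b", re.IGNORECASE),
}

def classify_manuscript(ocr_text: str) -> str:
    """Return the first manuscript class whose pattern matches the OCR text."""
    for label, pattern in MANUSCRIPT_RULES.items():
        if pattern.search(ocr_text):
            return label
    return "unknown"

print(classify_manuscript("Perihal: Nota Dinas tentang rapat"))  # memo
```

Short, anchored patterns like these suit the "specific and short strings" the abstract mentions, which is why a rule-based stage can complement the CNN at the lower hierarchy levels.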

2019 ◽  
Vol 14 (1) ◽  
pp. 124-134 ◽  
Author(s):  
Shuai Zhang ◽  
Yong Chen ◽  
Xiaoling Huang ◽  
Yishuai Cai

Online feedback is an effective channel of communication between government departments and citizens. However, the high daily volume of public feedback has increased the burden on government administrators. Deep learning methods are good at automatically analyzing data and extracting deep features, thereby improving classification accuracy. In this study, we aim to use a text classification model to classify public feedback automatically and reduce administrators' workload. In particular, a convolutional neural network model combined with word embeddings and optimized by a differential evolution algorithm is adopted. We compared it with seven common text classification models, and the results show that the explored model performs well under different evaluation metrics, including accuracy, precision, recall, and F1-score.
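The differential evolution optimizer mentioned above can be illustrated with a minimal DE/rand/1/bin loop. The objective here is a toy sphere function standing in for the CNN validation loss; the population size, mutation factor, and crossover rate are generic defaults, not the paper's settings.

```python
import random

def differential_evolution(fitness, dim, bounds, pop_size=20, f=0.8, cr=0.9, gens=100):
    """Minimal DE/rand/1/bin: mutate with a scaled difference of two random
    members, binomially cross over with the current member, keep the better one."""
    random.seed(0)  # reproducible toy run
    lo, hi = bounds
    pop = [[random.uniform(lo, hi) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(gens):
        for i in range(pop_size):
            a, b, c = random.sample([p for j, p in enumerate(pop) if j != i], 3)
            mutant = [min(hi, max(lo, a[k] + f * (b[k] - c[k]))) for k in range(dim)]
            trial = [mutant[k] if random.random() < cr else pop[i][k] for k in range(dim)]
            if fitness(trial) <= fitness(pop[i]):
                pop[i] = trial
    return min(pop, key=fitness)

# Toy objective standing in for validation loss; a real run would train and
# evaluate the CNN for each candidate hyperparameter vector.
best = differential_evolution(lambda x: sum(v * v for v in x), dim=3, bounds=(-5, 5))
print(best)
```

In the paper's setting each fitness evaluation is expensive (a CNN training run), which is why small populations and few generations are typical for DE-based hyperparameter search.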


2021 ◽  
Author(s):  
Komuravelli Prashanth ◽  
Kalidas Yeturu

<div>There are millions of scanned documents worldwide in around 4,000 languages. Searching for information in a scanned document requires a text layer to be available and indexed. Preparing a text layer requires recognizing character and sub-region patterns and associating them with a human interpretation. Developing an optical character recognition (OCR) system for each and every language is very difficult, if not impossible. There is a strong need for systems that build on top of existing OCR technologies by learning from them and unifying a disparate multitude of systems. In this regard, we propose an algorithm that leverages the fact that we are dealing with scanned documents of handwritten text regions from diverse domains and language settings. We observe that the text regions have consistent bounding-box sizes, and any large-font or tiny-font scenarios can be handled in the preprocessing or postprocessing phases. The image subregions in scanned text documents are smaller than the subregions formed by common objects in general-purpose images. We propose and validate the hypothesis that a much simpler convolutional neural network (CNN) with very few layers and fewer filters can be used for detecting individual subregion classes. For detecting several hundreds of classes, multiple such simpler models can be pooled to operate simultaneously on a document. The advantage of pools of subregion-specific models is the ability to handle incremental addition of hundreds of newer classes over time without disturbing the previous models in a continual learning scenario. Such an approach has a distinctive advantage over a single monolithic model, where subregion classes share and interfere via a bulky common neural network. We report here an efficient algorithm for building subregion-specific lightweight CNN models.
The training data for the proposed CNN require engineering synthetic data points that consider both the pattern of interest and non-patterns. We propose and validate the hypothesis that an image canvas with an optimal amount of pattern and non-pattern can be formulated, using a mean squared error loss function to influence filter training from the data. The CNN thus trained can identify the character-object in the presence of several other objects on a generalized test image of a scanned document. In this setting, a key observation is that learning a filter in a CNN depends not only on the abundance of patterns of interest but also on the presence of a non-pattern context. Our experiments have led to some key observations: (i) a pattern cannot be over-expressed in isolation, (ii) a pattern cannot be under-expressed either, (iii) a non-pattern can be salt-and-pepper noise, and (iv) it is sufficient to provide a non-pattern context to a modest representation of a pattern to obtain strong individual sub-region class models. We have carried out studies and report mean average precision scores on various data sets, including (1) MNIST digits (95.77), (2) EMNIST capital letters (81.26), (3) EMNIST small letters (73.32), (4) Kannada digits (95.77), (5) Kannada letters (90.34), (6) Devanagari letters (100), (7) Telugu words (93.20), and (8) Devanagari words (93.20), and also on medical prescriptions, observing mean average precision over 90%. The algorithm serves as a kernel in the automatic annotation of digital documents in diverse scenarios, such as annotating ancient manuscripts and handwritten health records.</div>
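The synthetic data engineering described above can be sketched as placing a glyph (the pattern) on a larger canvas with salt-and-pepper noise as the non-pattern context, per observation (iii). The canvas size, noise fraction, and glyph used here are illustrative choices, not the paper's parameters.

```python
import numpy as np

def make_canvas(glyph, canvas_size=64, noise_frac=0.05, rng=None):
    """Place a small glyph (pattern) at a random position on a larger canvas
    and sprinkle salt-and-pepper noise across the canvas as non-pattern context."""
    if rng is None:
        rng = np.random.default_rng(0)
    canvas = np.zeros((canvas_size, canvas_size), dtype=np.float32)
    gh, gw = glyph.shape
    y = rng.integers(0, canvas_size - gh)
    x = rng.integers(0, canvas_size - gw)
    canvas[y:y + gh, x:x + gw] = glyph
    # Salt-and-pepper noise: randomly chosen pixels forced to 0 or 1.
    mask = rng.random(canvas.shape) < noise_frac
    canvas[mask] = rng.integers(0, 2, size=mask.sum()).astype(np.float32)
    return canvas

glyph = np.ones((8, 8), dtype=np.float32)  # stand-in for a character crop
sample = make_canvas(glyph)
print(sample.shape)  # (64, 64)
```

Varying the glyph-to-canvas ratio and noise fraction is one way to probe the over-/under-expression trade-off the abstract reports.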


2021 ◽  
Vol 15 ◽  
Author(s):  
Pooja Jain ◽  
Kavita Taneja ◽  
Harmunish Taneja

Background: Instant access to desired information is crucial for building an intelligent environment that creates value for people and steers towards Society 5.0. Online newspapers are one such example, providing instant access to information anywhere and anytime on our mobiles, tablets, laptops, desktops, etc. However, when it comes to searching for a specific advertisement, online newspapers do not provide easy advertisement search options. In addition, there are no specialized search portals for keyword-based advertisement searches across multiple online newspapers. As a result, finding a specific advertisement requires a sequential manual search across a range of online newspapers. Objective: This research paper proposes a keyword-based advertisement search framework that provides instant access to relevant advertisements from online English newspapers in a category of the reader's choice. Method: First, an image extraction algorithm is proposed to identify and extract the images from online newspapers without using any rules on advertisement placement or size. It is followed by a proposed deep learning Convolutional Neural Network (CNN) model named 'Adv_Recognizer' that separates advertisement images from non-advertisement images. Another CNN model, 'Adv_Classifier', is proposed to classify advertisement images into four pre-defined categories. Finally, the Optical Character Recognition (OCR) technique performs keyword-based advertisement searches of various categories across multiple newspapers. Results: The proposed image extraction algorithm can easily extract all types of well-bounded images from different online newspapers. This algorithm is used to create an 'English newspaper image dataset' of 11,000 images, including advertisements and non-advertisements. The proposed 'Adv_Recognizer' model separates advertisement and non-advertisement images with an accuracy of around 97.8%. In addition, the proposed 'Adv_Classifier' model classifies advertisements into the four pre-defined categories with an accuracy of approximately 73.5%. Conclusion: The proposed framework will help newspaper readers perform exhaustive advertisement searches across various online English newspapers in a category of their interest. It will also help in carrying out advertisement analysis and studies.
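The final search step can be sketched as a keyword lookup over an index of OCR'd, category-tagged advertisements. The record schema and field names below are illustrative assumptions, not the framework's actual data model.

```python
# Hypothetical in-memory index: each entry pairs an extracted advertisement's
# OCR text with its predicted category and source newspaper.
ads = [
    {"paper": "Daily A", "category": "property", "ocr_text": "2 BHK flat for sale, city centre"},
    {"paper": "Daily B", "category": "jobs",     "ocr_text": "Hiring software engineers, apply now"},
    {"paper": "Daily A", "category": "jobs",     "ocr_text": "Part-time sales staff wanted"},
]

def search_ads(keyword, category=None):
    """Case-insensitive keyword search over OCR text, optionally
    restricted to one pre-defined advertisement category."""
    keyword = keyword.lower()
    return [ad for ad in ads
            if keyword in ad["ocr_text"].lower()
            and (category is None or ad["category"] == category)]

print(len(search_ads("flat")))           # 1
print(len(search_ads("staff", "jobs")))  # 1
```

A production system would replace the list with a proper inverted index built over the OCR output of all extracted advertisements.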


Author(s):  
Oyeniran Oluwashina Akinloye ◽  
Oyebode Ebenezer Olukunle

Numerous works have been proposed and implemented for the computerization of various human languages; nevertheless, only minuscule effort has been made to put Yorùbá handwritten characters on the map of Optical Character Recognition. This study presents a novel technique for developing a Yorùbá alphabet recognition system through deep learning. The developed model was implemented in the Matlab R2018a environment using the developed framework, with 10,500 dataset samples used for training and 2,100 for testing. Training was conducted for 30 epochs at 164 iterations per epoch, for a total of 4,920 iterations, and the training period was estimated at 11296 minutes 41 seconds. The model yielded a network accuracy of 100%, while the test-set accuracy is 97.97%, with an F1 score of 0.9800, precision of 0.9803, and recall of 0.9797.
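The reported F1 score is consistent with the reported precision and recall, as F1 is their harmonic mean; a quick check:

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Reproducing the reported test-set metrics: P = 0.9803, R = 0.9797.
print(round(f1_score(0.9803, 0.9797), 4))  # 0.98
```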


2020 ◽  
Vol 23 (4) ◽  
pp. 44-48
Author(s):  
Ahmad Mahdi Salih ◽  
Ban Nadeem Dhannoon
For most people, e-mail is the preferred medium for official communication. E-mail service providers face an endless challenge called spamming: the exploitation of e-mail systems to send bulk unsolicited messages to a large number of recipients. Noisy image spam is one of the newer techniques for evading text-analysis-based and Optical Character Recognition (OCR)-based spam filtering. In the present paper, a Convolutional Neural Network (CNN) based on different color models was considered to address the image spam problem. The proposed method was evaluated on a public image spam dataset. The results show that the performance of the proposed CNN is affected by the color model used, and that the XYZ model yields the best accuracy rate among all considered color models.
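The XYZ color model mentioned above is a linear transform of RGB. As a sketch, the standard sRGB-to-CIE-XYZ (D65) matrix applied per pixel converts an image into the representation the paper found most effective; note the authors' exact gamma handling and scaling conventions are not specified, so this is one common convention.

```python
import numpy as np

# Standard linear-sRGB to CIE XYZ (D65) conversion matrix.
RGB_TO_XYZ = np.array([[0.4124, 0.3576, 0.1805],
                       [0.2126, 0.7152, 0.0722],
                       [0.0193, 0.1192, 0.9505]])

def rgb_to_xyz(image: np.ndarray) -> np.ndarray:
    """Convert an (H, W, 3) linear-RGB image with values in [0, 1] to XYZ
    by applying the matrix to every pixel's RGB triple."""
    return image @ RGB_TO_XYZ.T

white = np.ones((1, 1, 3))
print(np.round(rgb_to_xyz(white), 3))  # approximately the D65 white point
```

Feeding such channel-transformed tensors to the same CNN architecture is how the effect of the color model on spam detection can be isolated and compared.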


2019 ◽  
Vol 3 (3) ◽  
pp. 524-531
Author(s):  
Wahyu Andi Saputra ◽  
Muhammad Zidny Naf’an ◽  
Asyhar Nurrochman

A form sheet is an instrument for collecting someone's information, most commonly used in registration or submission processes. The challenge faced by physical form sheets (e.g., paper) is converting their content into digital form. As a branch of computer vision, Optical Character Recognition (OCR) has recently been utilized to identify handwritten characters by learning the pattern characteristics of an object. In this research, OCR is implemented to facilitate converting the content of paper-based form sheets for proper storage in digital form. To recognize character patterns, this research develops training and testing methods in a Convolutional Neural Network (CNN) environment. The research uses 262,924 handwritten character sample images and 29 paper-based form sheets from SDN 01 Gumilir Cilacap; the form sheets contain various samples of human handwriting. In early experiments, the research achieved 92% accuracy with 23% loss. However, when the model was applied to the real form sheets, it obtained an average accuracy of 63%, caused by several factors related to the characters' morphological features. From the research conducted, it is expected that converting handwritten form sheets becomes effortless.
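Before per-character CNN recognition, a form field must be segmented into individual character images. A simplified stand-in for that step, assuming comb-style boxes of uniform width (an assumption; the paper does not describe its segmentation in detail):

```python
import numpy as np

def split_into_cells(field: np.ndarray, n_chars: int):
    """Split a fixed-width form-field image into equal-width character cells,
    each of which would then be fed to the character classifier."""
    h, w = field.shape
    cell_w = w // n_chars
    return [field[:, i * cell_w:(i + 1) * cell_w] for i in range(n_chars)]

field = np.zeros((32, 160))  # a 5-character field, 32x32 pixels per cell
cells = split_into_cells(field, 5)
print(len(cells), cells[0].shape)  # 5 (32, 32)
```

Irregular handwriting that strays across cell boundaries is one plausible source of the accuracy drop on real form sheets that the abstract attributes to morphological features.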


2021 ◽  
Vol 7 (1) ◽  
pp. 52
Author(s):  
Agus Mulyanto ◽  
Erlina Susanti ◽  
Farli Rossi ◽  
Wajiran Wajiran ◽  
Rohmat Indra Borman

Lampung Province has a regional language and script, also known as Had Lampung or KaGaNga, the native Lampung script. Given the importance of a culture's continued existence and of preserving the Lampung script, technology is needed to help introduce it; one such technology is optical character recognition (OCR), which converts images into text. To recognize the patterns of Lampung script images and classify them, a Convolutional Neural Network (CNN) is used. A CNN has a convolution stage formed from a combination of convolutional layers, pooling layers, and fully connected layers. In this research, the dataset was developed by collecting handwriting samples from predetermined respondents, which were then scanned. Next, the images were labeled and saved in the YOLO format, i.e., TXT. Evaluation of the CNN architecture built shows loss and accuracy results: training accuracy reached 0.57 and precision reached 0.87. These accuracy and precision values indicate that the trained model is good, as they approach 1.
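The YOLO TXT label format mentioned above stores one line per bounding box: a class index followed by the box center, width, and height, all normalized by the image dimensions. A small serializer as a sketch (the box and image sizes are hypothetical):

```python
def yolo_label_line(class_id, box, img_w, img_h):
    """Serialize one (x_min, y_min, x_max, y_max) pixel box into the YOLO TXT
    format: 'class x_center y_center width height', normalized to [0, 1]."""
    x_min, y_min, x_max, y_max = box
    xc = (x_min + x_max) / 2 / img_w
    yc = (y_min + y_max) / 2 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"

# A hypothetical 100x50-pixel glyph box at (10, 20) in a 640x480 scanned page.
print(yolo_label_line(3, (10, 20, 110, 70), 640, 480))
```

One such `.txt` file per scanned image, with lines in this format, is what YOLO-style training pipelines consume.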

