Automated hierarchical classification of scanned documents using convolutional neural network and regular expression

Author(s):  
Rifiana Arief ◽  
Achmad Benny Mutiara ◽  
Tubagus Maulana Kusuma ◽  
Hustinawaty Hustinawaty

<p>This research proposes automated hierarchical classification of scanned documents whose content is characterized by unstructured text and special patterns (specific, short strings), using a convolutional neural network (CNN) and the regular expression method (REM). The research data are digital correspondence documents in PDF-image format from Pusat Data Teknologi dan Informasi (the Technology and Information Data Center). The document hierarchy covers type of letter, type of manuscript letter, origin of letter, and subject of letter. The research method consists of preprocessing, classification, and storage to a database. Preprocessing covers text extraction using Tesseract optical character recognition (OCR) and formation of document word vectors with Word2Vec. Hierarchical classification uses the CNN to classify 5 types of letters and regular expressions to classify 4 types of manuscript letters, 15 origins of letters, and 25 subjects of letters. The classified documents are stored in the Hive database in a Hadoop big data architecture. The dataset comprises 5,200 documents: 4,000 for training, 1,000 for testing, and 200 for classification prediction. In the trial of 200 new documents, 188 were classified correctly and 12 incorrectly, giving an automated hierarchical classification accuracy of 94%. Content-based search of the classified scanned documents can be developed next.</p>
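The regular-expression stage of such a hierarchy can be sketched as follows. The class names and patterns below are hypothetical stand-ins, since the paper's actual rules are not published; only the mechanism (first matching rule wins, fallback to an unknown class) reflects the described approach.

```python
import re

# Illustrative regex rules for the manuscript-letter level of the hierarchy.
# These label names and pattern strings are placeholders, not the paper's rules.
MANUSCRIPT_RULES = {
    "invitation": re.compile(r"\bundangan\b", re.IGNORECASE),
    "memo":       re.compile(r"\bnota\s+dinas\b", re.IGNORECASE),
    "decree":     re.compile(r"\bkeputusan\b", re.IGNORECASE),
    "report":     re.compile(r"\blaporan\b", re.IGNORECASE),
}

def classify_manuscript(ocr_text: str) -> str:
    """Return the first manuscript class whose pattern matches the OCR text."""
    for label, pattern in MANUSCRIPT_RULES.items():
        if pattern.search(ocr_text):
            return label
    return "unknown"

print(classify_manuscript("Perihal: Nota Dinas tentang rapat"))  # memo
```

Short, anchored patterns like these suit the "specific and short strings" the abstract mentions, which is why a rule-based stage can complement the CNN at the lower hierarchy levels.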

2019 ◽  
Vol 14 (1) ◽  
pp. 124-134 ◽  
Author(s):  
Shuai Zhang ◽  
Yong Chen ◽  
Xiaoling Huang ◽  
Yishuai Cai

Online feedback is an effective channel of communication between government departments and citizens. However, the high daily volume of public feedback has increased the burden on government administrators. Deep learning methods are good at automatically analyzing data and extracting deep features, thereby improving classification accuracy. In this study, we aim to use a text classification model to classify public feedback automatically and reduce administrators' workload. In particular, a convolutional neural network model combined with word embeddings and optimized by a differential evolution algorithm is adopted. We compared it with seven common text classification models, and the results show that the explored model performs well under different evaluation metrics, including accuracy, precision, recall, and F1-score.
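The differential evolution optimizer mentioned above can be illustrated with a minimal DE/rand/1/bin loop. The objective here is a toy sphere function standing in for the CNN validation loss; the population size, mutation factor, and crossover rate are generic defaults, not the paper's settings.

```python
import random

def differential_evolution(fitness, dim, bounds, pop_size=20, f=0.8, cr=0.9, gens=100):
    """Minimal DE/rand/1/bin: mutate with a scaled difference of two random
    members, binomially cross over with the current member, keep the better one."""
    random.seed(0)  # reproducible toy run
    lo, hi = bounds
    pop = [[random.uniform(lo, hi) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(gens):
        for i in range(pop_size):
            a, b, c = random.sample([p for j, p in enumerate(pop) if j != i], 3)
            mutant = [min(hi, max(lo, a[k] + f * (b[k] - c[k]))) for k in range(dim)]
            trial = [mutant[k] if random.random() < cr else pop[i][k] for k in range(dim)]
            if fitness(trial) <= fitness(pop[i]):
                pop[i] = trial
    return min(pop, key=fitness)

# Toy objective standing in for validation loss; a real run would train and
# evaluate the CNN for each candidate hyperparameter vector.
best = differential_evolution(lambda x: sum(v * v for v in x), dim=3, bounds=(-5, 5))
print(best)
```

In the paper's setting each fitness evaluation is expensive (a CNN training run), which is why small populations and few generations are typical for DE-based hyperparameter search.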


2021 ◽  
Author(s):  
Komuravelli Prashanth ◽  
Kalidas Yeturu

<div>There are millions of scanned documents worldwide in around 4,000 languages. Searching for information in a scanned document requires a text layer to be available and indexed. Preparing a text layer requires recognizing character and sub-region patterns and associating them with a human interpretation. Developing an optical character recognition (OCR) system for each and every language is very difficult, if not impossible. There is a strong need for systems that build on top of existing OCR technologies by learning from them and unifying a disparate multitude of systems. In this regard, we propose an algorithm that leverages the fact that we are dealing with scanned documents of handwritten text regions from diverse domains and language settings. We observe that the text regions have consistent bounding-box sizes, and any large-font or tiny-font scenarios can be handled in the preprocessing or postprocessing phases. The image subregions in scanned text documents are smaller than the subregions formed by common objects in general-purpose images. We propose and validate the hypothesis that a much simpler convolutional neural network (CNN) with very few layers and fewer filters can be used for detecting individual subregion classes. For detecting several hundreds of classes, multiple such simpler models can be pooled to operate simultaneously on a document. The advantage of pools of subregion-specific models is the ability to handle incremental addition of hundreds of newer classes over time without disturbing the previous models in a continual learning scenario. Such an approach has a distinctive advantage over a single monolithic model, where subregion classes share and interfere via a bulky common neural network. We report here an efficient algorithm for building subregion-specific lightweight CNN models.
The training data for the proposed CNN require engineering synthetic data points that consider both the pattern of interest and non-patterns. We propose and validate the hypothesis that an image canvas with an optimal amount of pattern and non-pattern can be formulated, using a mean squared error loss function to influence filter training from the data. The CNN thus trained can identify the character-object in the presence of several other objects on a generalized test image of a scanned document. In this setting, a key observation is that learning a filter in a CNN depends not only on the abundance of patterns of interest but also on the presence of a non-pattern context. Our experiments have led to some key observations: (i) a pattern cannot be over-expressed in isolation, (ii) a pattern cannot be under-expressed either, (iii) a non-pattern can be salt-and-pepper noise, and (iv) it is sufficient to provide a non-pattern context to a modest representation of a pattern to obtain strong individual sub-region class models. We have carried out studies and report mean average precision scores on various data sets, including (1) MNIST digits (95.77), (2) EMNIST capital letters (81.26), (3) EMNIST small letters (73.32), (4) Kannada digits (95.77), (5) Kannada letters (90.34), (6) Devanagari letters (100), (7) Telugu words (93.20), and (8) Devanagari words (93.20), and also on medical prescriptions, observing mean average precision over 90%. The algorithm serves as a kernel in the automatic annotation of digital documents in diverse scenarios, such as annotating ancient manuscripts and handwritten health records.</div>
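The synthetic data engineering described above can be sketched as placing a glyph (the pattern) on a larger canvas with salt-and-pepper noise as the non-pattern context, per observation (iii). The canvas size, noise fraction, and glyph used here are illustrative choices, not the paper's parameters.

```python
import numpy as np

def make_canvas(glyph, canvas_size=64, noise_frac=0.05, rng=None):
    """Place a small glyph (pattern) at a random position on a larger canvas
    and sprinkle salt-and-pepper noise across the canvas as non-pattern context."""
    if rng is None:
        rng = np.random.default_rng(0)
    canvas = np.zeros((canvas_size, canvas_size), dtype=np.float32)
    gh, gw = glyph.shape
    y = rng.integers(0, canvas_size - gh)
    x = rng.integers(0, canvas_size - gw)
    canvas[y:y + gh, x:x + gw] = glyph
    # Salt-and-pepper noise: randomly chosen pixels forced to 0 or 1.
    mask = rng.random(canvas.shape) < noise_frac
    canvas[mask] = rng.integers(0, 2, size=mask.sum()).astype(np.float32)
    return canvas

glyph = np.ones((8, 8), dtype=np.float32)  # stand-in for a character crop
sample = make_canvas(glyph)
print(sample.shape)  # (64, 64)
```

Varying the glyph-to-canvas ratio and noise fraction is one way to probe the over-/under-expression trade-off the abstract reports.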


2021 ◽  
Vol 15 ◽  
Author(s):  
Pooja Jain ◽  
Kavita Taneja ◽  
Harmunish Taneja

Background: Instant access to desired information is crucial for building an intelligent environment that creates value for people and steers towards Society 5.0. Online newspapers are one such example, providing instant access to information anywhere and anytime on our mobiles, tablets, laptops, desktops, etc. However, when it comes to searching for a specific advertisement, online newspapers do not provide easy advertisement search options. In addition, there are no specialized search portals for keyword-based advertisement searches across multiple online newspapers. As a result, finding a specific advertisement requires a sequential manual search across a range of online newspapers. Objective: This research paper proposes a keyword-based advertisement search framework that provides instant access to relevant advertisements from online English newspapers in a category of the reader's choice. Method: First, an image extraction algorithm is proposed to identify and extract the images from online newspapers without using any rules on advertisement placement or size. It is followed by a proposed deep learning Convolutional Neural Network (CNN) model named 'Adv_Recognizer' that separates advertisement images from non-advertisement images. Another CNN model, 'Adv_Classifier', is proposed to classify advertisement images into four pre-defined categories. Finally, the Optical Character Recognition (OCR) technique performs keyword-based advertisement searches of various categories across multiple newspapers. Results: The proposed image extraction algorithm can easily extract all types of well-bounded images from different online newspapers. This algorithm is used to create an 'English newspaper image dataset' of 11,000 images, including advertisements and non-advertisements. The proposed 'Adv_Recognizer' model separates advertisement and non-advertisement images with an accuracy of around 97.8%. In addition, the proposed 'Adv_Classifier' model classifies advertisements into the four pre-defined categories with an accuracy of approximately 73.5%. Conclusion: The proposed framework will help newspaper readers perform exhaustive advertisement searches across various online English newspapers in a category of their interest. It will also help in carrying out advertisement analysis and studies.
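The final search step can be sketched as a keyword lookup over an index of OCR'd, category-tagged advertisements. The record schema and field names below are illustrative assumptions, not the framework's actual data model.

```python
# Hypothetical in-memory index: each entry pairs an extracted advertisement's
# OCR text with its predicted category and source newspaper.
ads = [
    {"paper": "Daily A", "category": "property", "ocr_text": "2 BHK flat for sale, city centre"},
    {"paper": "Daily B", "category": "jobs",     "ocr_text": "Hiring software engineers, apply now"},
    {"paper": "Daily A", "category": "jobs",     "ocr_text": "Part-time sales staff wanted"},
]

def search_ads(keyword, category=None):
    """Case-insensitive keyword search over OCR text, optionally
    restricted to one pre-defined advertisement category."""
    keyword = keyword.lower()
    return [ad for ad in ads
            if keyword in ad["ocr_text"].lower()
            and (category is None or ad["category"] == category)]

print(len(search_ads("flat")))           # 1
print(len(search_ads("staff", "jobs")))  # 1
```

A production system would replace the list with a proper inverted index built over the OCR output of all extracted advertisements.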


Author(s):  
Oyeniran Oluwashina Akinloye ◽  
Oyebode Ebenezer Olukunle

Numerous works have been proposed and implemented for the computerization of various human languages; nevertheless, only minuscule effort has been made to put Yorùbá handwritten characters on the map of Optical Character Recognition. This study presents a novel technique for developing a Yorùbá alphabet recognition system through deep learning. The developed model was implemented in the Matlab R2018a environment using the developed framework, with 10,500 dataset samples used for training and 2,100 for testing. Training was conducted for 30 epochs at 164 iterations per epoch, for a total of 4,920 iterations, and the training period was estimated at 11296 minutes 41 seconds. The model yielded a network accuracy of 100%, while the test-set accuracy is 97.97%, with an F1 score of 0.9800, precision of 0.9803, and recall of 0.9797.
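The reported F1 score is consistent with the reported precision and recall, as F1 is their harmonic mean; a quick check:

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Reproducing the reported test-set metrics: P = 0.9803, R = 0.9797.
print(round(f1_score(0.9803, 0.9797), 4))  # 0.98
```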


2020 ◽  
Vol 23 (4) ◽  
pp. 44-48
Author(s):  
Ahmad Mahdi Salih ◽  
Ban Nadeem Dhannoon
For most people, e-mail is the preferred medium for official communication. E-mail service providers face an endless challenge called spamming: the exploitation of e-mail systems to send bulk unsolicited messages to a large number of recipients. Noisy image spam is one of the newer techniques for evading text-analysis-based and Optical Character Recognition (OCR)-based spam filtering. In the present paper, a Convolutional Neural Network (CNN) based on different color models was considered to address the image spam problem. The proposed method was evaluated on a public image spam dataset. The results show that the performance of the proposed CNN is affected by the color model used, and that the XYZ model yields the best accuracy rate among all considered color models.
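The XYZ color model mentioned above is a linear transform of RGB. As a sketch, the standard sRGB-to-CIE-XYZ (D65) matrix applied per pixel converts an image into the representation the paper found most effective; note the authors' exact gamma handling and scaling conventions are not specified, so this is one common convention.

```python
import numpy as np

# Standard linear-sRGB to CIE XYZ (D65) conversion matrix.
RGB_TO_XYZ = np.array([[0.4124, 0.3576, 0.1805],
                       [0.2126, 0.7152, 0.0722],
                       [0.0193, 0.1192, 0.9505]])

def rgb_to_xyz(image: np.ndarray) -> np.ndarray:
    """Convert an (H, W, 3) linear-RGB image with values in [0, 1] to XYZ
    by applying the matrix to every pixel's RGB triple."""
    return image @ RGB_TO_XYZ.T

white = np.ones((1, 1, 3))
print(np.round(rgb_to_xyz(white), 3))  # approximately the D65 white point
```

Feeding such channel-transformed tensors to the same CNN architecture is how the effect of the color model on spam detection can be isolated and compared.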


2019 ◽  
Vol 3 (3) ◽  
pp. 524-531
Author(s):  
Wahyu Andi Saputra ◽  
Muhammad Zidny Naf’an ◽  
Asyhar Nurrochman

A form sheet is an instrument for collecting someone's information, most commonly used in registration or submission processes. The challenge faced by physical form sheets (e.g., paper) is converting their content into digital form. As a branch of computer vision, Optical Character Recognition (OCR) has recently been utilized to identify handwritten characters by learning the pattern characteristics of an object. In this research, OCR is implemented to facilitate converting the content of paper-based form sheets for proper storage in digital form. To recognize character patterns, this research develops training and testing methods in a Convolutional Neural Network (CNN) environment. The research uses 262,924 handwritten character sample images and 29 paper-based form sheets from SDN 01 Gumilir Cilacap; the form sheets contain various samples of human handwriting. In early experiments, the research achieved 92% accuracy with 23% loss. However, when the model was applied to the real form sheets, it obtained an average accuracy of 63%, caused by several factors related to the characters' morphological features. From the research conducted, it is expected that converting handwritten form sheets becomes effortless.
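Before per-character CNN recognition, a form field must be segmented into individual character images. A simplified stand-in for that step, assuming comb-style boxes of uniform width (an assumption; the paper does not describe its segmentation in detail):

```python
import numpy as np

def split_into_cells(field: np.ndarray, n_chars: int):
    """Split a fixed-width form-field image into equal-width character cells,
    each of which would then be fed to the character classifier."""
    h, w = field.shape
    cell_w = w // n_chars
    return [field[:, i * cell_w:(i + 1) * cell_w] for i in range(n_chars)]

field = np.zeros((32, 160))  # a 5-character field, 32x32 pixels per cell
cells = split_into_cells(field, 5)
print(len(cells), cells[0].shape)  # 5 (32, 32)
```

Irregular handwriting that strays across cell boundaries is one plausible source of the accuracy drop on real form sheets that the abstract attributes to morphological features.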


2021 ◽  
Vol 7 (1) ◽  
pp. 52
Author(s):  
Agus Mulyanto ◽  
Erlina Susanti ◽  
Farli Rossi ◽  
Wajiran Wajiran ◽  
Rohmat Indra Borman

Lampung Province has a regional language and script, also known as Had Lampung or KaGaNga, the native Lampung script. Given the importance of a culture's continued existence and of preserving the Lampung script, technology is needed to help introduce it; one such technology is optical character recognition (OCR), which converts images into text. To recognize the patterns of Lampung script images and classify them, a Convolutional Neural Network (CNN) is used. A CNN has a convolution stage formed from a combination of convolutional layers, pooling layers, and fully connected layers. In this research, the dataset was developed by collecting handwriting samples from predetermined respondents, which were then scanned. Next, the images were labeled and saved in the YOLO format, i.e., TXT. Evaluation of the CNN architecture built shows loss and accuracy results: training accuracy reached 0.57 and precision reached 0.87. These accuracy and precision values indicate that the trained model is good, as they approach 1.
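The YOLO TXT label format mentioned above stores one line per bounding box: a class index followed by the box center, width, and height, all normalized by the image dimensions. A small serializer as a sketch (the box and image sizes are hypothetical):

```python
def yolo_label_line(class_id, box, img_w, img_h):
    """Serialize one (x_min, y_min, x_max, y_max) pixel box into the YOLO TXT
    format: 'class x_center y_center width height', normalized to [0, 1]."""
    x_min, y_min, x_max, y_max = box
    xc = (x_min + x_max) / 2 / img_w
    yc = (y_min + y_max) / 2 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"

# A hypothetical 100x50-pixel glyph box at (10, 20) in a 640x480 scanned page.
print(yolo_label_line(3, (10, 20, 110, 70), 640, 480))
```

One such `.txt` file per scanned image, with lines in this format, is what YOLO-style training pipelines consume.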

