scanned documents
Recently Published Documents

TOTAL DOCUMENTS: 135 (five years: 44)
H-INDEX: 10 (five years: 2)

Author(s): Rifiana Arief, Achmad Benny Mutiara, Tubagus Maulana Kusuma, Hustinawaty Hustinawaty

This research proposes automated hierarchical classification of scanned documents whose content consists of unstructured text with special patterns (specific, short strings), using a convolutional neural network (CNN) and a regular expression method (REM). The research data are digital correspondence documents in PDF image format from Pusat Data Teknologi dan Informasi (the Technology and Information Data Center). The document hierarchy covers type of letter, type of manuscript letter, origin of letter, and subject of letter. The research method consists of preprocessing, classification, and storage to a database. Preprocessing covers text extraction with Tesseract optical character recognition (OCR) and formation of document word vectors with Word2Vec. Hierarchical classification uses a CNN to classify 5 types of letters and regular expressions to classify 4 types of manuscript letters, 15 origins of letter, and 25 subjects of letter. The classified documents are stored in a Hive database on a Hadoop big data architecture. The dataset comprises 5,200 documents: 4,000 for training, 1,000 for testing, and 200 for classification prediction. In the trial on 200 new documents, 188 were classified correctly and 12 incorrectly, giving an accuracy of 94% for the automated hierarchical classification. Content-based search over the classified scanned documents can be developed next.
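A minimal Python sketch of the kind of pipeline this abstract describes, assuming pytesseract (with a Tesseract install), gensim, and TensorFlow/Keras are available; the network shape, the regular-expression patterns, and the class labels are illustrative placeholders, not the authors' actual configuration.

```python
# Illustrative sketch only, not the authors' code.
import re
import numpy as np
import pytesseract
from PIL import Image
from gensim.models import Word2Vec
from tensorflow.keras import layers, models

# Preprocessing: OCR a scanned page, then form a per-document word vector.
def ocr_page(path):
    return pytesseract.image_to_string(Image.open(path))

def train_word2vec(tokenized_docs, dim=100):
    # Word2Vec trained on the OCR'd corpus (gensim >= 4 API).
    return Word2Vec(sentences=tokenized_docs, vector_size=dim, min_count=1)

def doc_vector(tokens, w2v, dim=100):
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# CNN over document vectors for the 5 letter types (architecture is assumed).
def build_letter_cnn(dim=100, n_types=5):
    m = models.Sequential([
        layers.Reshape((dim, 1), input_shape=(dim,)),
        layers.Conv1D(32, 5, activation="relu"),
        layers.GlobalMaxPooling1D(),
        layers.Dense(n_types, activation="softmax"),
    ])
    m.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
    return m

# Regular-expression stage for the lower hierarchy levels; the patterns below
# are hypothetical examples, not the paper's actual rules.
MANUSCRIPT_PATTERNS = {
    "decree":   re.compile(r"\bSURAT\s+KEPUTUSAN\b", re.IGNORECASE),
    "circular": re.compile(r"\bSURAT\s+EDARAN\b", re.IGNORECASE),
}

def classify_manuscript(text):
    for label, pattern in MANUSCRIPT_PATTERNS.items():
        if pattern.search(text):
            return label
    return "unknown"
```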


Mathematics, 2021, Vol 10 (1), pp. 8
Author(s): Yongjin Hu, Xiyan Li, Jun Ma

This paper analyzes two forms of secret data: random bits and scanned documents. The secret data were pre-processed by halftone, quadtree, and S-Box transformations, reducing the size of the scanned document by a factor of 8.11. A novel LSB matching algorithm with low distortion was proposed for the embedding step. The golden ratio was first applied to find the optimal embedding position and was then used to design the matching function. Both theory and experiment demonstrate that the scheme offers a good trade-off between high capacity and low distortion and is superior to other related schemes.
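A hedged sketch of golden-ratio-guided LSB matching, for illustration only: the paper's actual position search and matching function are not reproduced here, and the golden-ratio stride below is an assumption standing in for them.

```python
import numpy as np

PHI = (1 + 5 ** 0.5) / 2  # golden ratio

def embed_lsb_matching(cover, bits, seed=1):
    """Embed a bit sequence into a flat uint8 cover signal.

    Positions are visited with a golden-ratio-derived stride (an assumption);
    when the LSB already matches the secret bit nothing changes, otherwise the
    pixel is randomly +/-1'd, the classic low-distortion LSB-matching rule.
    """
    stego = cover.copy().astype(np.int16)
    n = stego.size
    step = max(1, int(n / PHI) % n)
    rng = np.random.default_rng(seed)
    pos = 0
    for bit in bits:
        if (stego[pos] & 1) != bit:
            if stego[pos] == 0:
                stego[pos] += 1          # boundary: cannot go below 0
            elif stego[pos] == 255:
                stego[pos] -= 1          # boundary: cannot go above 255
            else:
                stego[pos] += rng.choice((-1, 1))
        pos = (pos + step) % n
    return stego.astype(np.uint8)

# Example: hide 8 random bits in a random 16x16 "document" patch.
cover = np.random.randint(0, 256, size=(16, 16), dtype=np.uint8).ravel()
secret = np.random.randint(0, 2, size=8)
stego = embed_lsb_matching(cover, secret)
```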


2021, Vol 34 (4), pp. 130-141
Author(s): Atheel Sabih Shaker

Brain magnetic resonance imaging (MRI) analysis is tasked with finding the pixels or voxels that establish where the brain is within a medical image. A convolutional neural network (CNN) can process the curved baselines that frequently occur in scanned documents; the lines are then separated into characters. For fonts with a fixed width, the gaps are analyzed and split; otherwise, a limited region above the baseline is analyzed, separated, and classified. The words with the lowest recognition score are split into further characters until the result improves. If this does not improve the recognition score, contours are merged and classified again to check the change in the recognition score. The features for classification are extracted from small fixed-size patches over neighboring contours and matched against the trained deep learning representations. This approach enables Tesseract to handle MRI sample results broken into multiple parts, which would be impossible if each contour were processed separately. The CNN Inception network proved a suitable choice for evaluating the synthetic MRI samples, using 3,000 features and 12,000 image samples from data augmentation, which favors data similar to the original training set and is thus unlikely to contain new information. The achieved accuracy is 98.68%, with an error of only 1.32% as the number of training samples increases; the most significant reduction in error comes from increasing the number of samples.
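A hedged sketch of an Inception-style CNN evaluated on MRI slices with light data augmentation, in Keras; the class count, input size, and augmentation strengths are assumptions for illustration and are not taken from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import InceptionV3

def build_mri_classifier(n_classes=2, input_shape=(299, 299, 3)):
    # Pretrained Inception backbone with a small classification head.
    base = InceptionV3(include_top=False, weights="imagenet",
                       input_shape=input_shape, pooling="avg")
    base.trainable = False  # start by training only the new head
    model = models.Sequential([
        base,
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Mild augmentation that stays close to the originals, matching the abstract's
# remark that augmented data resembles the training set.
augment = tf.keras.Sequential([
    layers.RandomRotation(0.05),
    layers.RandomZoom(0.1),
    layers.RandomTranslation(0.05, 0.05),
])
```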


2021
Author(s): Samundeswari S, Jeshoorin G, Vasanth M

Insurance companies are regularly provided with health check reports by buyers of insurance. Different forms of printed lab reports and health check reports have to be digitized, capturing the value of each parameter. Optical character recognition (OCR) is used to convert images of handwritten, typed, or printed text, and scanned documents of any kind, into machine-encoded text in order to digitize the values from the report. Conversion to this standard set of digital values helps automate much of the backend approval process. We collect the reports from the user, read the values from each report, and scrutinize them against the company's standard set; the outcome is then visualized using a visualization tool. The result is presented to the user, so the user can see whether he or she is eligible for an insurance claim. The foremost objective of this paper is to make the insurance backend approval process much easier and to give buyers a quick response.
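A minimal sketch of the report-scrutiny step described above: OCR a scanned lab report with pytesseract, pull out "parameter: value" pairs, and flag values outside a company-defined range. The regex and the standard ranges are hypothetical placeholders, not the paper's actual rules.

```python
import re
import pytesseract
from PIL import Image

# Hypothetical standard set; a real insurer would maintain its own table.
STANDARD_RANGES = {
    "glucose": (70.0, 140.0),      # mg/dL
    "cholesterol": (0.0, 200.0),   # mg/dL
}

VALUE_RE = re.compile(r"(?P<name>[A-Za-z ]+?)\s*[:=]\s*(?P<value>\d+(?:\.\d+)?)")

def extract_values(image_path):
    # OCR the scanned report and collect parameter/value pairs.
    text = pytesseract.image_to_string(Image.open(image_path))
    found = {}
    for m in VALUE_RE.finditer(text):
        found[m.group("name").strip().lower()] = float(m.group("value"))
    return found

def scrutinize(values):
    # Compare extracted values against the standard ranges.
    report = {}
    for name, (lo, hi) in STANDARD_RANGES.items():
        if name in values:
            report[name] = "ok" if lo <= values[name] <= hi else "out of range"
    return report
```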


2021
Author(s): Petar Prvulović, Jelena Vasiljević, Dhinaharan Nagamalai

This paper explains a method used to detect the presence of impulse noise in a set of scanned documents as part of OCR preprocessing. As the document set is meant to be processed at large scale, the primary concern of the noise detection method was efficiency within existing project constraints. Following the nature of the noise, the method seeks to detect its presence in document margins. The method works in two stages: the first stage is margin detection, based on color spectrum analysis; the second stage is noise recognition in margin samples, based on a pixel contrast score. The resulting implementation proved efficient both in terms of detection accuracy and algorithmic complexity.
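A hedged sketch of the two-stage idea: find near-blank margins from the page's intensity distribution, then score isolated high-contrast pixels there as likely impulse (salt-and-pepper) noise. The thresholds and the intensity-based margin test are assumptions standing in for the paper's color spectrum analysis and contrast score.

```python
import numpy as np
from PIL import Image

def margin_columns(gray, blank_ratio=0.99, white=200):
    """Columns that are almost entirely background are treated as margin."""
    col_white = (gray > white).mean(axis=0)
    return np.where(col_white > blank_ratio)[0]

def impulse_noise_score(gray, margin_cols, contrast=100):
    """Fraction of margin pixels that differ sharply from the local median,
    a simple stand-in for the paper's pixel contrast score."""
    if margin_cols.size == 0:
        return 0.0
    patch = gray[:, margin_cols].astype(np.int16)
    med = np.median(patch)
    return float((np.abs(patch - med) > contrast).mean())

# Example usage (assumes a scanned page image on disk):
# gray = np.asarray(Image.open("page.png").convert("L"))
# noisy = impulse_noise_score(gray, margin_columns(gray)) > 0.001
```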


2021, Vol 24 (4), pp. 667-688
Author(s): Rustem Damirovich Saitgareev, Bulat Rifatovich Giniatullin, Vladislav Yurievich Toporov, Artur Aleksandrovich Atnagulov, Farid Radikovich Aglyamov

Currently, the major part of transmitted and stored data is unstructured, and the amount of unstructured data grows rapidly each year, although it is hardly searchable or queryable and its processing is not automated. At the same time, electronic document management systems are growing. This paper proposes a solution for extracting data from paper documents, taking their structure and layout into account, based on document photos. We examine different approaches, including neural networks and plain algorithmic methods, present their results, and discuss them.
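A hedged sketch of a plain algorithmic baseline for layout-aware extraction from a document photo: pytesseract's word-level boxes are grouped into lines and blocks so extracted values keep their positional context. This is an illustrative baseline under assumed tooling, not the authors' method.

```python
import pytesseract
from PIL import Image
from pytesseract import Output

def words_with_layout(image_path):
    # Word-level OCR results with block/line indices and pixel coordinates.
    img = Image.open(image_path)
    data = pytesseract.image_to_data(img, output_type=Output.DICT)
    words = []
    for i, text in enumerate(data["text"]):
        if text.strip():
            words.append({
                "text": text,
                "block": data["block_num"][i],
                "line": data["line_num"][i],
                "left": data["left"][i],
                "top": data["top"][i],
            })
    return words

def group_by_line(words):
    # Group words by (block, line) and order them left to right.
    lines = {}
    for w in words:
        lines.setdefault((w["block"], w["line"]), []).append(w)
    return [sorted(ws, key=lambda w: w["left"]) for ws in lines.values()]
```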


2021, Vol 2021
Author(s): Cyprien Plateau-Holleville, Enzo Bonnot, Franck Gechter, Laurent Heyberger

Vital records are rich in meaningful historical data concerning city as well as countryside inhabitants, which can be used, among other things, to study former populations and reveal their social, economic, and demographic characteristics. However, such studies face a major difficulty in collecting the needed data, since most of these records are scanned documents that require a manual transcription step before all the data can be gathered and exploited from a historical point of view. This step consequently slows down historical research and is an obstacle to a better knowledge of population habits depending on social conditions. In this paper, we therefore present a modular and self-sufficient analysis pipeline, using state-of-the-art algorithms and mostly independent of document layout, that aims to automate this data extraction process.


2021
Author(s): Komuravelli Prashanth, Kalidas Yeturu

There are millions of scanned documents worldwide in around 4 thousand languages. Searching for information in a scanned document requires a text layer to be available and indexed. Preparing a text layer requires recognizing character and sub-region patterns and associating them with a human interpretation. Developing an optical character recognition (OCR) system for each and every language is very difficult, if not impossible. There is a strong need for systems that build on top of existing OCR technologies by learning from them and unifying a disparate multitude of systems. In this regard, we propose an algorithm that leverages the fact that we are dealing with scanned documents of handwritten text regions from across diverse domains and language settings. We observe that the text regions have consistent bounding box sizes, and any large-font or tiny-font scenarios can be handled in preprocessing or postprocessing phases. The image subregions in scanned text documents are smaller than the subregions formed by common objects in general-purpose images. We propose and validate the hypothesis that a much simpler convolutional neural network (CNN), having very few layers and few filters, can be used for detecting individual subregion classes. For detection of several hundreds of classes, multiple such simpler models can be pooled to operate simultaneously on a document. The advantage of using pools of subregion-specific models is the ability to deal with incremental addition of hundreds of newer classes over time, without disturbing the previous models in the continual learning scenario. Such an approach has a distinctive advantage over using a single monolithic model, in which subregion classes share and interfere via a bulky common neural network. We report here an efficient algorithm for building subregion-specific lightweight CNN models. The training data for the proposed CNN requires engineering synthetic data points that consider both the pattern of interest and non-patterns. We propose and validate the hypothesis that an image canvas containing an optimal amount of pattern and non-pattern can be formulated, using a mean squared error loss function to influence the filters trained from the data. The CNN hence trained has the capability to identify the character object in the presence of several other objects in a generalized test image of a scanned document. In this setting, a key observation is that, in a CNN, learning a filter depends not only on the abundance of patterns of interest but also on the presence of a non-pattern context. Our experiments have led to some key observations: (i) a pattern cannot be over-expressed in isolation, (ii) a pattern cannot be under-expressed either, (iii) a non-pattern can be salt-and-pepper type noise, and finally (iv) it is sufficient to provide a non-pattern context to a modest representation of a pattern to obtain strong individual sub-region class models. We have carried out studies and report mean average precision scores on various data sets, including (1) MNIST digits (95.77), (2) EMNIST capital alphabet (81.26), (3) EMNIST small alphabet (73.32), (4) Kannada digits (95.77), (5) Kannada letters (90.34), (6) Devanagari letters (100), (7) Telugu words (93.20), and (8) Devanagari words (93.20), and also on medical prescriptions, observing high performance with mean average precision over 90%.
The algorithm serves as a kernel in the automatic annotation of digital documents in diverse scenarios such as annotation of ancient manuscripts and handwritten health records.
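A hedged sketch of the pooled lightweight-CNN idea: one tiny per-class model trained independently, so new classes can be added without touching earlier members. Layer sizes, the 32x32 patch size, and the training loop are assumptions for illustration; only the mean squared error loss echoes the abstract.

```python
from tensorflow.keras import layers, models

def build_subregion_model(patch=32):
    # Very few layers and filters, as argued in the abstract.
    m = models.Sequential([
        layers.Conv2D(8, 3, activation="relu", input_shape=(patch, patch, 1)),
        layers.MaxPooling2D(),
        layers.Conv2D(16, 3, activation="relu"),
        layers.GlobalMaxPooling2D(),
        layers.Dense(1, activation="sigmoid"),   # pattern vs. non-pattern
    ])
    # Mean squared error loss, as mentioned in the abstract.
    m.compile(optimizer="adam", loss="mse", metrics=["accuracy"])
    return m

class ModelPool:
    """One model per character/sub-region class; classes can be added
    incrementally without retraining the existing members."""
    def __init__(self):
        self.models = {}

    def add_class(self, label, patches, targets, epochs=5):
        m = build_subregion_model()
        m.fit(patches, targets, epochs=epochs, verbose=0)
        self.models[label] = m

    def detect(self, patch_batch):
        # Run every class model on the same patches; return per-class scores.
        return {label: m.predict(patch_batch, verbose=0).ravel()
                for label, m in self.models.items()}
```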



