Multi-Oriented Text Extraction in Stylistic Documents

2015 ◽  
Vol 15 (01) ◽  
pp. 1550002
Author(s):  
Brij Mohan Singh ◽  
Rahul Sharma ◽  
Debashis Ghosh ◽  
Ankush Mittal

In many documents, such as maps, engineering drawings, and artistic documents, printed and handwritten material coexists: text regions and text-lines are not parallel to each other, are curved in nature, and contain varied content such as different font sizes, text and non-text areas lying close together, and non-straight, skewed, and warped text-lines. Commercially available optical character recognition (OCR) systems, such as ABBYY FineReader and FreeOCR, cannot handle the full range of stylistic document images containing curved, multi-oriented, and stylized-font text-lines. Extracting individual text-lines and words from these documents is generally not straightforward. Most reported segmentation work addresses simple documents, and implementing an OCR that works under all possible conditions and gives highly accurate results remains highly challenging, especially for stylistic documents. This paper presents an approach based on dilation and flood-fill morphological operations that extracts multi-oriented text-lines and words from complex-layout or stylistic document images in subsequent stages. The segmentation results obtained with our method prove superior to the standard profiling-based method.
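
A minimal sketch of the dilation-and-grouping step described above, assuming a binarized input (white text on black background); the kernel size, iteration count, and area threshold are illustrative choices, not values from the paper. Connected-component labeling, which internally behaves like a flood fill, then groups each merged blob:

```python
import cv2

# Dilate so the characters of a word or line merge into a single blob, even
# when the line is curved or rotated, then label the blobs as candidate regions.
def extract_text_regions(binary_img, kernel_size=(5, 5), iterations=3):
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, kernel_size)
    dilated = cv2.dilate(binary_img, kernel, iterations=iterations)

    # Each connected component of the dilated image approximates one
    # text line or word; label 0 is the background.
    num_labels, _, stats, _ = cv2.connectedComponentsWithStats(dilated)
    regions = []
    for i in range(1, num_labels):
        x, y, w, h, area = stats[i]
        if area > 50:  # drop small noise blobs (illustrative threshold)
            regions.append((x, y, w, h))
    return regions
```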

Author(s):  
Andrew Brock ◽  
Theodore Lim ◽  
J. M. Ritchie ◽  
Nick Weston

End-to-end machine analysis of engineering drawings requires a reliable and precise vision frontend capable of localizing and classifying the various characters in context. We develop an object detection framework, based on convolutional networks, designed specifically for optical character recognition in engineering drawings. Our approach achieves classification and localization under 10-fold cross-validation on an internal dataset for which other techniques prove unsuitable.
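
The abstract does not describe the architecture, so the following is only an illustrative sketch of a single-stage convolutional detector with separate per-cell classification and box-regression heads; all layer sizes are assumptions:

```python
import torch.nn as nn

# Tiny convolutional backbone with two prediction heads, in the spirit of
# single-stage detectors: for each cell of the feature map we predict class
# scores and an (x, y, w, h) box offset for a character at that location.
class CharDetector(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.cls_head = nn.Conv2d(64, num_classes, 1)  # class scores per cell
        self.box_head = nn.Conv2d(64, 4, 1)            # box offsets per cell

    def forward(self, x):
        feats = self.backbone(x)
        return self.cls_head(feats), self.box_head(feats)
```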


Author(s):  
Ahmed Hussain Aliwy ◽  
Basheer Al-Sadawi

Optical character recognition (OCR) refers to the process of converting text document images into editable and searchable text. OCR poses several challenges, particularly for the Arabic language, where it yields a high percentage of errors. In this paper, a method to improve the output of Arabic optical character recognition (AOCR) systems is suggested, based on a statistical language model built from large available corpora. The method detects and corrects both non-word and real-word errors according to the context of the word in the sentence. The results show an improvement of up to 98% in the accuracy of the AOCR output.
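
A minimal sketch of context-sensitive correction with a Laplace-smoothed bigram language model, the kind of statistical model the abstract describes; `candidates_fn` is a hypothetical helper returning vocabulary words within a small edit distance of the OCR output word:

```python
# `bigrams` and `unigrams` are count tables (e.g. collections.Counter)
# built from a large corpus.
def bigram_prob(prev_word, word, bigrams, unigrams, v_size, alpha=1.0):
    # Laplace-smoothed P(word | prev_word).
    return (bigrams[(prev_word, word)] + alpha) / (unigrams[prev_word] + alpha * v_size)

def correct(prev_word, word, vocab, bigrams, unigrams, candidates_fn):
    # Non-word errors: `word` is not in the vocabulary, so only candidates compete.
    # Real-word errors: an in-vocabulary word still competes against its
    # candidates, so a word that is unlikely in this context can be replaced.
    candidates = set(candidates_fn(word))
    if word in vocab:
        candidates.add(word)
    return max(candidates,
               key=lambda c: bigram_prob(prev_word, c, bigrams, unigrams, len(vocab)))
```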


Author(s):  
Dr. T. Kameswara Rao ◽  
K. Yashwanth Chowdary ◽  
I. Koushik Chowdary ◽  
K. Prasanna Kumar ◽  
Ch. Ramesh

In recent years, text extraction from document images has become one of the most widely studied topics in image analysis and optical character recognition. The extracted content can be used for document analysis, content analysis, document retrieval, and more. Many complex text-extraction processes, such as maximum likelihood (ML) estimation, edge-point detection, and corner-point detection, are used to extract text from document images. In this article, the corner-point approach is used. To extract text from document images we use a very simple approach based on the FAST algorithm. First, we divide the image into blocks and check the corner density of each block. Denser blocks are labeled as text blocks, and the less dense ones as image regions or noise. We then check the connectivity of the blocks and group them, so that the text can be isolated from the rest of the image. The method is fast and versatile: it can handle various languages, handwriting, and even images with substantial noise and blur. Although it is a very simple method, its precision approaches or exceeds 90%. In conclusion, this method enables more accurate and less complex detection of text in document images.
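
A sketch of the block-density idea with FAST corners: text produces dense corner responses, so blocks whose corner count exceeds a threshold are labeled as text. The block size and corner threshold below are illustrative, not the article's tuned values:

```python
import cv2
import numpy as np

def text_block_mask(gray, block=32, min_corners=20):
    fast = cv2.FastFeatureDetector_create()
    keypoints = fast.detect(gray, None)          # corner points over the page

    h, w = gray.shape
    counts = np.zeros((h // block + 1, w // block + 1), dtype=int)
    for kp in keypoints:
        x, y = kp.pt
        counts[int(y) // block, int(x) // block] += 1

    # Dense blocks are text; sparse ones are image regions or noise.
    # Adjacent text blocks can then be grouped with connected-component
    # labeling on this mask, as described above.
    return counts >= min_corners
```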


Sensors ◽  
2020 ◽  
Vol 20 (10) ◽  
pp. 2914
Author(s):  
Hubert Michalak ◽  
Krzysztof Okarma

Image binarization is one of the key operations for reducing the amount of information used in further analysis of image data, and it significantly influences the final results. In some applications, where well-illuminated, high-contrast images can easily be captured, even simple global thresholding may be sufficient; other cases are more challenging, e.g., the analysis of natural images or of images with quality degradations, such as historical document images. Considering the variety of image binarization methods, as well as their different applications and image types, no single universal thresholding method can be expected to be the best solution for all images. Nevertheless, since one of the most common operations preceded by binarization is Optical Character Recognition (OCR), which may also be applied to non-uniformly illuminated images captured by camera sensors in mobile phones, the development of better binarization methods that maximize OCR accuracy is still needed. Therefore, this paper presents the idea of robust combined measures, which make it possible to bring together the advantages of various methods, including some recently proposed approaches based on entropy filtering and a multi-layered stack of regions. The experimental results, obtained for a dataset of 176 non-uniformly illuminated document images, referred to as the WEZUT OCR Dataset, confirm the validity and usefulness of the proposed approach, leading to a significant increase in recognition accuracy.
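
The combined measures themselves (entropy filtering, the multi-layered stack of regions) go beyond an abstract-level sketch, but the underlying idea of bringing several binarization methods together can be illustrated by a simple per-pixel majority vote over standard thresholding methods; the block sizes and offsets below are arbitrary:

```python
import cv2
import numpy as np

def combined_binarization(gray):
    # Three standard thresholding results, each mapped to {0, 1}.
    otsu = cv2.threshold(gray, 0, 1, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
    mean = cv2.adaptiveThreshold(gray, 1, cv2.ADAPTIVE_THRESH_MEAN_C,
                                 cv2.THRESH_BINARY, 31, 10)
    gauss = cv2.adaptiveThreshold(gray, 1, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                  cv2.THRESH_BINARY, 31, 10)

    # A pixel is kept as foreground only if at least two methods agree,
    # combining the global and the locally adaptive views of the image.
    votes = otsu + mean + gauss
    return (votes >= 2).astype(np.uint8) * 255
```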


2021 ◽  
Vol 11 (6) ◽  
pp. 7968-7973
Author(s):  
M. Kazmi ◽  
F. Yasir ◽  
S. Habib ◽  
M. S. Hayat ◽  
S. A. Qazi

Urdu optical character recognition (OCR) based on character-level recognition (the analytical approach) is less popular than ligature-level recognition (the holistic approach) due to its added complexity and the overlapping of characters and strokes. This paper presents a holistic Urdu ligature extraction technique. The proposed Photometric Ligature Extraction (PLE) technique is independent of font size and column layout and can handle non-overlapping as well as all inter- and intra-overlapping ligatures. It uses a customized photometric filter along with X-shearing, padding, and connected component analysis to extract complete ligatures instead of extracting primary and secondary ligatures separately. A total of approximately 267,800 ligatures were extracted from scanned Urdu Nastaliq printed text images with an accuracy of 99.4%. The proposed framework thus outperforms existing Urdu Nastaliq text extraction and segmentation algorithms. The PLE framework can also be applied to other languages that use the Nastaliq script style, such as Arabic, Persian, Pashto, and Sindhi.
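
A sketch of the X-shearing plus connected-component step, assuming a binarized image with white ink on black; the shear factor and padding are illustrative, not the calibrated values of the PLE framework:

```python
import cv2
import numpy as np

def extract_ligatures(binary, shear=0.5, pad=20):
    # Padding keeps sheared strokes inside the canvas.
    padded = cv2.copyMakeBorder(binary, pad, pad, pad, pad,
                                cv2.BORDER_CONSTANT, value=0)
    h, w = padded.shape

    # X-shearing slides diagonally stacked Nastaliq marks over their base
    # stroke, so a whole ligature becomes one connected component instead
    # of separate primary and secondary pieces.
    M = np.float32([[1, shear, 0], [0, 1, 0]])
    sheared = cv2.warpAffine(padded, M, (w + int(shear * h), h))

    n, _, stats, _ = cv2.connectedComponentsWithStats(sheared)
    # Bounding boxes are in sheared coordinates; the inverse shear maps
    # them back onto the original image.
    return [tuple(stats[i][:4]) for i in range(1, n)]
```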


Author(s):  
Neha. N

Document image processing is an increasingly important technology, essential in all optical character recognition (OCR) systems and for the automation of various office documents. A document originally has zero skew (tilt), but when a page is scanned or photocopied, skew may be introduced by various factors and is practically unavoidable. The presence of even a small amount of skew (0.5°) has detrimental effects on document analysis, as it directly affects the reliability and efficiency of the segmentation, recognition, and feature extraction stages. Therefore, skew removal is of paramount importance in document analysis and OCR, and it is the first step to be accomplished. This paper presents a novel technique for skew detection and correction that is both language and content independent. The proposed technique is based on the maximum density of the foreground pixels and their orientation in the document image. Unlike conventional algorithms that work only for machine-printed textual documents in English, this technique works well for all kinds of document images (machine-printed, handwritten, complex, noisy, and simple). The technique was tested on 150 different document image samples and was found to provide results with an accuracy of 0.1°.
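
The paper's own density-and-orientation method is not reproduced here; as a generic stand-in, the classic projection-profile search below shows how a skew angle can be detected and then undone, with the angle range and step chosen arbitrarily:

```python
import numpy as np
from scipy.ndimage import rotate

def detect_skew(binary, max_angle=15.0, step=0.1):
    best_angle, best_score = 0.0, -1.0
    for angle in np.arange(-max_angle, max_angle + step, step):
        rotated = rotate(binary, angle, reshape=False, order=0)
        profile = rotated.sum(axis=1)   # foreground pixels per row
        score = np.var(profile)         # sharply peaked profile = aligned lines
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle                   # rotate by -best_angle to deskew
```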


2016 ◽  
Vol 7 (4) ◽  
pp. 77-93 ◽  
Author(s):  
K.G. Srinivasa ◽  
B.J. Sowmya ◽  
D. Pradeep Kumar ◽  
Chetan Shetty

Vast reserves of information are found in ancient texts, scripts, stone tablets, etc. However, owing to the difficulty of creating new physical copies of such texts, the knowledge they contain is limited to the few who have access to these resources. With the advent of Optical Character Recognition (OCR), efforts have been made to digitize such information, increasing its availability by making it easier to share, search, and edit. Many documents are held back because they are damaged, which gives rise to an interesting problem: removing the noise from such documents so that OCR can be applied to them more easily. Here the authors aim to develop a model that denoises images of such documents while retaining only the text. The primary goal of their project is to ease document digitization. They study the effects of combining image processing techniques with neural networks: techniques such as thresholding, filtering, edge detection, and morphological operations are applied to pre-process images so that the neural network models achieve higher accuracy.
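
A minimal sketch of such a pre-processing stage, combining filtering, adaptive thresholding, and a morphological opening before the image is handed to a denoising network; all parameter values are illustrative:

```python
import cv2
import numpy as np

def preprocess(gray):
    blur = cv2.medianBlur(gray, 3)      # suppress salt-and-pepper noise
    binary = cv2.adaptiveThreshold(blur, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY, 25, 15)
    kernel = np.ones((2, 2), np.uint8)
    # Opening removes isolated specks while preserving stroke shapes,
    # giving the neural network a cleaner input.
    return cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)
```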


Author(s):  
Sk. Md. Obaidullah ◽  
K. C. Santosh ◽  
Nibaran Das ◽  
Chayan Halder ◽  
Kaushik Roy

Script identification is crucial for automating optical character recognition (OCR) in multi-script documents, since OCR systems are script-dependent. In this paper, we present a comprehensive survey of techniques developed for handwritten Indic script identification. Pre-processing, feature extraction, and classification techniques used for script identification are categorized, and their merits and demerits are discussed. We also provide information about several handwritten Indic script datasets. Finally, we highlight extensions and the future scope of this work, together with open challenges.

