Printed Japanese Character Recognition Using Multiple Commercial OCRs

Hidetoshi Miyao; Yasuaki Nakano; Atsuhiko Tani; Hirosato Tabaru; Toshihiro Hananoi

doi:10.20965/jaciii.2004.p0200

Journal of Advanced Computational Intelligence and Intelligent Informatics ◽

10.20965/jaciii.2004.p0200 ◽

2004 ◽

Vol 8 (2) ◽

pp. 200-207 ◽

Cited By ~ 1

Author(s):

Hidetoshi Miyao ◽

Yasuaki Nakano ◽

Atsuhiko Tani ◽

Hirosato Tabaru ◽

...

Keyword(s):

Character Recognition ◽

Document Images ◽

Text Documents ◽

Matching Algorithm ◽

Majority Logic ◽

Character String ◽

Optical Character ◽

Japanese Character ◽

On Line ◽

Standard Character

This paper proposes two algorithms for matching lines and characters across the text output of multiple commercial optical character readers (OCRs): (1) a line matching algorithm based on dynamic programming (DP) matching and (2) a character matching algorithm based on character string division and standard character strings. It also proposes a recognition method that applies majority logic and reject processing to the matched characters. To demonstrate the feasibility of the method, we conducted line matching and recognition experiments on 127 document images using five commercial OCRs. With appropriate line matching, the method extracted character areas more accurately than a single OCR. In experiments using the character matching algorithm, the method raised the recognition rate from 97.61% for a single OCR to 98.83%. The method is expected to be particularly useful for correcting locations where unwanted lines or characters appear or where required lines or characters are missing.
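The combination of DP matching, majority logic, and reject processing can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes the lines are already matched, uses the first OCR's output as a stand-in for a standard character string, and aligns every other output to it with a plain edit-distance DP. The names `align_to_reference`, `vote`, `min_votes`, and the `?` reject symbol are hypothetical.

```python
from collections import Counter

GAP = None  # marks a reference position for which an OCR produced no character


def align_to_reference(ref, hyp):
    """For each position in ref, return the character of hyp aligned to it, or GAP."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = edit distance between ref[:i] and hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Trace the cheapest path back from the end of both strings.
    aligned = [GAP] * n
    i, j = n, m
    while i > 0 and j > 0:
        if cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            aligned[i - 1] = hyp[j - 1]  # match or substitution
            i, j = i - 1, j - 1
        elif cost[i][j] == cost[i - 1][j] + 1:
            i -= 1  # reference character missing from hyp
        else:
            j -= 1  # extra character in hyp, ignored in this sketch
    return aligned


def vote(outputs, min_votes, reject="?"):
    """Majority vote per reference position; reject when support is too low."""
    ref = outputs[0]  # assumption: the first OCR output acts as the reference
    columns = [align_to_reference(ref, out) for out in outputs]
    result = []
    for pos in range(len(ref)):
        counts = Counter(col[pos] for col in columns if col[pos] is not GAP)
        char, support = counts.most_common(1)[0]
        result.append(char if support >= min_votes else reject)
    return "".join(result)
```

Because the vote is taken per reference position, characters that only some OCRs insert are dropped in this sketch; handling insertions and deletions symmetrically would need a multiple alignment rather than a pairwise one.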

Optical Character Recognition from Printed Text Images

International Journal of Scientific Research in Computer Science Engineering and Information Technology ◽

10.32628/cseit1952175 ◽

2019 ◽

pp. 597-604 ◽

Cited By ~ 1

Author(s):

Dr. T. Kameswara Rao ◽

K. Yashwanth Chowdary ◽

I. Koushik Chowdary ◽

K. Prasanna Kumar ◽

Ch. Ramesh

Keyword(s):

Character Recognition ◽

Optical Character Recognition ◽

Corner Point ◽

Document Retrieval ◽

Document Images ◽

Image Region ◽

Text Documents ◽

Optical Character ◽

Simple Program ◽

Point Detection

In recent years, text extraction from document images has been one of the most widely studied topics in image analysis and optical character recognition. The extracted text can be used for document analysis, content analysis, document retrieval, and more. Many complex text extraction techniques, such as Maximum Likelihood (ML), edge point detection, and corner point detection, are used to extract text from images. In this article, the corner point approach is used: a very simple method based on the FAST algorithm. First, we divide the image into blocks and check the corner density in each block. The denser blocks are labeled as text blocks, and the less dense ones are treated as image regions or noise. Then we check the connectivity of the blocks and group them so that the text can be isolated from the rest of the image. The method is fast and versatile: it can detect text in various languages, handwriting, and even images with a lot of noise and blur. Although the program is very simple, its precision is close to or higher than 90%. In conclusion, the method enables more accurate and less complex detection of text in document images.
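The block density and connectivity steps can be sketched as follows. This is an illustration only: it assumes corner points have already been detected (for example by a FAST detector) and are given as integer `(x, y)` pixel coordinates inside the image. The function name `text_block_groups` and the `block` and `min_corners` values are hypothetical, not taken from the article.

```python
from collections import deque


def text_block_groups(corners, width, height, block=32, min_corners=5):
    """Group dense blocks of corner points into connected text regions."""
    cols = (width + block - 1) // block
    rows = (height + block - 1) // block

    # Count corner points per block.
    density = [[0] * cols for _ in range(rows)]
    for x, y in corners:
        density[y // block][x // block] += 1

    # Blocks with enough corners are treated as text blocks.
    dense = {
        (r, c)
        for r in range(rows)
        for c in range(cols)
        if density[r][c] >= min_corners
    }

    # Breadth-first search over 4-connected dense blocks.
    groups, seen = [], set()
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        queue, group = deque([start]), []
        while queue:
            r, c = queue.popleft()
            group.append((r, c))
            for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        groups.append(sorted(group))
    return groups
```

Each returned group is a list of `(row, column)` block indices; multiplying by `block` gives the pixel bounds of a candidate text region.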

Multi-Oriented Text Extraction in Stylistic Documents

International Journal of Image and Graphics ◽

10.1142/s0219467815500023 ◽

2015 ◽

Vol 15 (01) ◽

pp. 1550002

Author(s):

Brij Mohan Singh ◽

Rahul Sharma ◽

Debashis Ghosh ◽

Ankush Mittal

Keyword(s):

Character Recognition ◽

Optical Character Recognition ◽

Morphological Operations ◽

Document Images ◽

Font Size ◽

Text Extraction ◽

Optical Character ◽

Engineering Drawings ◽

Flood Fill

In many documents, such as maps, engineering drawings, and artistic documents, printed and handwritten text regions and text-lines are not parallel to each other and may be curved. Such documents also mix font sizes, place text and non-text areas close together, and contain skewed and warped text-lines. Commercially available optical character recognition (OCR) systems, such as ABBYY FineReader and Free OCR, cannot handle the full range of stylistic document images containing curved, multi-oriented, and stylish-font text-lines. Extracting individual text-lines and words from these documents is generally not straightforward. Most reported segmentation work addresses simple documents, and implementing an OCR that works under all conditions and gives highly accurate results remains a highly challenging task, especially for stylistic documents. This paper presents an approach based on dilation and flood fill morphological operations that extracts multi-oriented text-lines and words from complex-layout or stylistic document images in successive stages. The segmentation results obtained with our method prove to be superior to those of the standard profiling-based method.
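The idea of pairing dilation with flood fill can be sketched on a small binary image: dilation merges nearby strokes into one blob, and flood fill then labels each blob as a candidate word or text-line. The following is a minimal illustration under assumed settings (square structuring element, 4-connectivity); the function names and the `radius` parameter are hypothetical and the paper's actual staging is not reproduced.

```python
def dilate(img, radius=1):
    """Binary dilation with a (2*radius+1) x (2*radius+1) square element."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if img[y][x]:
                for dy in range(-radius, radius + 1):
                    for dx in range(-radius, radius + 1):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w:
                            out[ny][nx] = 1
    return out


def flood_fill_label(img):
    """Label 4-connected foreground regions; return (label grid, region count)."""
    h, w = len(img), len(img[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if img[y][x] and not labels[y][x]:
                count += 1
                labels[y][x] = count
                stack = [(y, x)]
                while stack:  # iterative flood fill from the seed pixel
                    cy, cx = stack.pop()
                    for ny, nx in ((cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)):
                        if 0 <= ny < h and 0 <= nx < w and img[ny][nx] and not labels[ny][nx]:
                            labels[ny][nx] = count
                            stack.append((ny, nx))
    return labels, count
```

In this sketch the dilation radius controls the grouping level: a small radius tends to merge characters into words, while a larger one tends to merge words into text-lines.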

Development of the documents comparison module for an electronic document management system

Information Technology and Nanotechnology ◽

10.18287/1613-0073-2019-2416-527-533 ◽

2019 ◽

pp. 527-533

Author(s):

M A Mikheev ◽

P Y Yakimov

Keyword(s):

Character Recognition ◽

Optical Character Recognition ◽

Document Management ◽

Electronic Document ◽

Text Documents ◽

Text Document ◽

Document Management System ◽

Optical Character ◽

Electronic Document Management ◽

Scanned Image

The article addresses the problem of comparing document versions in electronic document management systems. Analogous systems were reviewed, and the process of comparing text documents was studied. To recognize text in scanned images, optical character recognition was chosen, using the Tesseract library as its implementation. The Myers algorithm is applied to compare the recognized texts. The text document comparison module was implemented in software using these solutions.
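The comparison step can be illustrated with a short sketch of the greedy O(ND) search from Myers' difference algorithm, which finds the length of the shortest edit script (insertions plus deletions) between two sequences. This is an illustration of the technique only, not the module described in the article; the function name `myers_distance` is hypothetical, and the inputs would in practice be lists of lines or words recognized by OCR.

```python
def myers_distance(a, b):
    """Length of the shortest insert/delete edit script turning a into b."""
    n, m = len(a), len(b)
    max_d = n + m
    # v[k] = furthest x reached on diagonal k (where k = x - y).
    v = {1: 0}
    for d in range(max_d + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]      # step down: insert an element of b
            else:
                x = v[k - 1] + 1  # step right: delete an element of a
            y = x - k
            # Follow the "snake" of matching elements for free.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            v[k] = x
            if x >= n and y >= m:
                return d
    return max_d
```

For example, comparing the word lists of "the quick fox" and "the slow fox" gives a distance of 2: one deletion and one insertion. Recovering the edit script itself would require keeping the `v` table of each round and tracing back, which is omitted here.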

A robust method for coarse classifier construction from a large number of basic recognizers for on-line handwritten Chinese/Japanese character recognition

Pattern Recognition ◽

10.1016/j.patcog.2013.08.011 ◽

2014 ◽

Vol 47 (2) ◽

pp. 685-693 ◽

Cited By ~ 6

Author(s):

Bilan Zhu ◽

Masaki Nakagawa

Keyword(s):

Character Recognition ◽

Robust Method ◽

Japanese Character ◽

On Line

Using Informational Confidence Values for Classifier Combination: An Experiment with Combined On-Line/Off-Line Japanese Character Recognition

Ninth International Workshop on Frontiers in Handwriting Recognition ◽

10.1109/iwfhr.2004.108 ◽

2004 ◽

Cited By ~ 3

Author(s):

S. Jaeger

Keyword(s):

Character Recognition ◽

Classifier Combination ◽

Japanese Character ◽

On Line

Automatic prototype stroke generation based on stroke clustering for on-line handwritten Japanese character recognition

Proceedings of the Fifth International Conference on Document Analysis and Recognition. ICDAR '99 (Cat. No.PR00318) ◽

10.1109/icdar.1999.791877 ◽

1999 ◽

Cited By ~ 4

Author(s):

K. Yamasaki

Keyword(s):

Character Recognition ◽

Japanese Character ◽

On Line

A Survey on Arabic Handwritten Script Recognition Systems

International Journal of Artificial Intelligence and Machine Learning ◽

10.4018/ijaiml.20210701.oa9 ◽

2021 ◽

Vol 11 (2) ◽

pp. 1-17

Author(s):

Soumia Djaghbellou ◽

Abderraouf Bouziane ◽

Abdelouahab Attia ◽

Zahid Akhtar

Keyword(s):

Character Recognition ◽

Optical Character Recognition ◽

Research Field ◽

Research Directions ◽

Optical Character ◽

On Line ◽

Recognition Systems ◽

Active Research ◽

Handwritten Arabic ◽

Open Issues

Optical character recognition (OCR) is still an active research field in pattern recognition. OCR systems can electronically identify, recognize, and distinguish characters and texts, whether printed or handwritten. They can also transform such data into machine-processable form to facilitate interaction between user and machine in various applications. In this paper, we present the global structure of an OCR system, its types (on-line and off-line), its categories (printed and handwritten), and its main steps. We also focus on off-line handwritten Arabic character recognition and provide a list of the main publicly available datasets. The paper also surveys the work carried out in recent years. Finally, some open issues and potential research directions are highlighted.

Corpus-based technique for improving Arabic OCR system

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v21.i1.pp233-241 ◽

2021 ◽

Vol 21 (1) ◽

pp. 233

Author(s):

Ahmed Hussain Aliwy ◽

Basheer Al-Sadawi

Keyword(s):

Character Recognition ◽

Optical Character Recognition ◽

Language Model ◽

Arabic Language ◽

Document Images ◽

Statistical Language Model ◽

Text Document ◽

Optical Character ◽

Arabic Ocr

Optical character recognition (OCR) is the process of converting text document images into editable and searchable text. OCR poses several challenges, particularly for the Arabic language, where it produces a high percentage of errors. This paper suggests a method for improving the output of Arabic optical character recognition (AOCR) systems, based on a statistical language model built from large available corpora. The method detects and corrects non-word and real-word errors according to the context of the word in the sentence. The results show that the method improves the accuracy of AOCR output to as much as 98%.
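Context-based correction with a statistical language model can be sketched with unigram and bigram counts. The sketch below is an assumption-laden illustration, not the paper's method: it uses a toy English corpus instead of Arabic corpora, handles only non-word errors (tokens absent from the vocabulary), and considers only candidates within one deletion, substitution, or insertion. The names `build_model`, `edits1`, and `correct` are hypothetical.

```python
from collections import Counter


def build_model(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def edits1(word, alphabet):
    """All strings one deletion, substitution, or insertion away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    replaces = [left + ch + right[1:] for left, right in splits if right for ch in alphabet]
    inserts = [left + ch + right for left, right in splits for ch in alphabet]
    return set(deletes + replaces + inserts)


def correct(tokens, unigrams, bigrams):
    """Replace out-of-vocabulary tokens with the candidate best supported by context."""
    alphabet = sorted({ch for word in unigrams for ch in word})
    corrected = []
    for token in tokens:
        if token in unigrams:
            corrected.append(token)
            continue
        candidates = [c for c in edits1(token, alphabet) if c in unigrams]
        if not candidates:
            corrected.append(token)  # nothing plausible: leave the token as-is
            continue
        previous = corrected[-1] if corrected else None
        # Prefer the candidate seen most often after the previous word,
        # then the most frequent word overall; the word itself breaks ties.
        best = max(candidates, key=lambda c: (bigrams[(previous, c)], unigrams[c], c))
        corrected.append(best)
    return corrected
```

Detecting real-word errors, which the abstract also mentions, would require scoring in-vocabulary tokens against their context as well; that step is left out of this sketch.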

Robust Combined Binarization Method of Non-Uniformly Illuminated Document Images for Alphanumerical Character Recognition

Sensors ◽

10.3390/s20102914 ◽

2020 ◽

Vol 20 (10) ◽

pp. 2914

Author(s):

Hubert Michalak ◽

Krzysztof Okarma

Keyword(s):

Character Recognition ◽

Optical Character Recognition ◽

Recognition Accuracy ◽

Image Data ◽

Document Images ◽

Historical Document ◽

Image Binarization ◽

Optical Character ◽

Binarization Method ◽

Camera Sensors

Image binarization is one of the key operations for reducing the amount of information used in further analysis of image data, and it significantly influences the final results. In applications where well-illuminated, high-contrast images can be captured easily, even simple global thresholding may be sufficient. Other cases are more challenging, e.g., those based on the analysis of natural images or involving quality degradations, such as historical document images. Given the variety of image binarization methods, applications, and image types, one cannot expect a single universal thresholding method to be the best solution for all images. Nevertheless, one of the most common operations that follows binarization is optical character recognition (OCR), which may also be applied to non-uniformly illuminated images captured by camera sensors mounted in mobile phones, so better binarization methods that maximize OCR accuracy are still needed. Therefore, this paper presents the idea of using robust combined measures, which makes it possible to bring together the advantages of various methods, including some recently proposed approaches based on entropy filtering and a multi-layered stack of regions. The experimental results, obtained for a dataset of 176 non-uniformly illuminated document images referred to as the WEZUT OCR Dataset, confirm the validity and usefulness of the proposed approach, which leads to a significant increase in recognition accuracy.
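For reference, the simple global thresholding that the abstract treats as a baseline can be sketched with Otsu's method, which picks the threshold that maximizes the between-class variance of the gray-level histogram. This is a well-known technique swapped in for illustration; it is not the combined method proposed in the paper. The sketch assumes 8-bit gray levels given as a flat list of integers, and the function names are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the gray level t maximizing between-class variance (classes: <= t, > t)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0  # pixel count and gray-level sum of the dark class
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break  # bright class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(pixels, threshold):
    """Map each pixel to 1 (bright) if above the threshold, else 0 (dark)."""
    return [1 if p > threshold else 0 for p in pixels]
```

A single global threshold of this kind is exactly what tends to fail under non-uniform illumination, since one value cannot separate text from background in both the bright and the shaded parts of the page.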
