Robust Text Line, Word And Character Extraction from Telugu Document Image

Author(s):  
Vijaya Kumar Koppula ◽  
Negi Atul ◽  
Utpal Garain
2014 ◽  
Vol 23 (3) ◽  
pp. 245-260 ◽  
Author(s):  
Ram Sarkar ◽  
Nibaran Das ◽  
Subhadip Basu ◽  
Mahantapas Kundu ◽  
Mita Nasipuri

Abstract: A novel piecewise water flow technique is presented for extracting text lines from multi-skewed document images of handwritten text in different scripts. The basic water flow technique assumes that hypothetical water flows from both the left and right sides of the image frame. The water fills the gaps between consecutive objects (text) but is obstructed by any object that lies in its path. All unwetted regions in the document image are then labeled distinctly to extract the text lines. However, the technique fails when two neighboring text lines touch each other, because the water is obstructed by the touching segment(s). To overcome this difficulty, we have modified the basic water flow technique by applying it iteratively over vertically segmented document images. The main purpose of this vertical segmentation is to localize the text line segment(s) where two text lines are joined. These segments are then horizontally fragmented, and each fragment is assigned to the text line to which it actually belongs. In this way, the probable data loss during isolation of the touching text line segment is minimized. Both techniques (the current and the basic one) have been tested on three different databases: CMATERdb 1.1.1, CMATERdb 1.1.2, and the ICDAR2009 handwritten segmentation contest pages. The test results show that the present technique outperforms the basic one on all three databases.
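The basic water flow idea described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes a fixed 45-degree flow angle, water entering from the left side only, and a binary image stored as a list of rows. The function name `unwetted_from_left` is hypothetical.

```python
def unwetted_from_left(text):
    """Mark pixels that stay dry when water flows in from the left.

    text: list of rows, each a list of 0/1 values (1 = text pixel).
    Assumption: a fixed 45-degree flow angle, so the dry "shadow" behind
    an obstacle loses one pixel at the top and bottom per column.
    """
    rows, cols = len(text), len(text[0])
    dry = [[False] * cols for _ in range(rows)]
    for c in range(cols):
        for r in range(rows):
            if text[r][c]:
                # Text pixels are obstacles and are never wetted.
                dry[r][c] = True
            elif c > 0:
                # A background pixel stays dry only if the three pixels
                # to its left (above, level, below) are all dry.
                above = r > 0 and dry[r - 1][c - 1]
                level = dry[r][c - 1]
                below = r < rows - 1 and dry[r + 1][c - 1]
                dry[r][c] = above and level and below
    return dry


# A vertical stroke of height 5 in column 1 of a 7 x 6 image.
image = [[0] * 6 for _ in range(7)]
for r in range(1, 6):
    image[r][1] = 1
mask = unwetted_from_left(image)
```

Flow from the right could be obtained by mirroring each row, running the same function, and mirroring the result back. Combining both masks and labeling the connected dry regions would then give candidate text lines, which is the step that breaks down when two lines touch.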


2014 ◽  
Vol 14 (01n02) ◽  
pp. 1450003 ◽  
Author(s):  
S. A. Angadi ◽  
M. M. Kodabagi

Reliable extraction/segmentation of text lines, words and characters is an important step in the development of automated systems for understanding the text in low resolution display board images. In this paper, a new approach for segmentation of text lines, words and characters from Kannada text in low resolution display board images is presented. The proposed method uses projection profile features and on-pixel distribution statistics for segmentation of text lines. The method also detects text lines containing consonant modifiers and merges them with the corresponding text lines, and efficiently separates overlapped text lines as well. The character extraction process computes character boundaries using vertical profile features to extract character images from every text line. Further, the word segmentation process uses k-means clustering to group inter-character gaps into character and word cluster spaces, which are used to compute thresholds for extracting words. The method also takes care of variations in character and word gaps. The proposed methodology is evaluated on a data set of 1008 low resolution images of display boards containing Kannada text, captured with 2-megapixel cameras on mobile phones at sizes of 240 × 320, 480 × 640 and 960 × 1280. The method achieves a text line segmentation accuracy of 97.17%, a word segmentation accuracy of 97.54% and a character extraction accuracy of 99.09%. The proposed method is tolerant to font variability, spacing variations between characters and words, absence of a free segmentation path due to consonant and vowel modifiers, noise and other degradations. Experiments with images containing overlapped text lines have given promising results.
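Two of the ingredients named above, projection profiles and clustering of gaps, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's method: the image is a list of 0/1 rows, a text line is any run of rows containing on-pixels, and the gap clustering is a plain two-cluster 1-D k-means. The names `row_runs` and `gap_threshold` are hypothetical, and the handling of consonant modifiers and overlapped lines is not covered.

```python
def row_runs(image):
    """Return (start, end) row ranges, end exclusive, that contain on-pixels.

    Uses the horizontal projection profile: the on-pixel count per row.
    """
    profile = [sum(row) for row in image]
    runs, start = [], None
    for r, count in enumerate(profile):
        if count > 0 and start is None:
            start = r                      # a text line begins
        elif count == 0 and start is not None:
            runs.append((start, r))        # a text line ends
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def gap_threshold(gaps, max_iter=50):
    """Split gap widths into two clusters with 1-D k-means (k = 2).

    Returns the midpoint between the two cluster centres; gaps wider
    than this value are treated as word gaps (an assumption here).
    """
    lo, hi = float(min(gaps)), float(max(gaps))
    for _ in range(max_iter):
        small = [g for g in gaps if abs(g - lo) <= abs(g - hi)]
        large = [g for g in gaps if abs(g - lo) > abs(g - hi)]
        if not large:                      # all gaps look alike
            break
        new_lo = sum(small) / len(small)
        new_hi = sum(large) / len(large)
        if (new_lo, new_hi) == (lo, hi):   # centres stopped moving
            break
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2


page = [[0, 0, 0], [1, 1, 0], [0, 1, 0], [0, 0, 0], [1, 0, 1], [0, 0, 0]]
lines = row_runs(page)
threshold = gap_threshold([2, 3, 2, 10, 3, 12, 2])
word_breaks = [g for g in [2, 3, 2, 10, 3, 12, 2] if g > threshold]
```

The same run-finding logic applied to column sums within one text line would give character boundaries, and the widths of the empty runs between them are the gaps passed to `gap_threshold`.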


2013 ◽  
Vol 64 (4) ◽  
pp. 238-243 ◽  
Author(s):  
Darko Brodić ◽  
Zoran N. Milivojević

The paper presents an algorithm for text line segmentation based on the oriented anisotropic Gaussian kernel. Initially, the document image is split into connected components, each represented by its bounding box. These connected components are cleared of redundant fragments. Binary moments are then applied to each connected component to estimate the local text skew, and the orientation of the anisotropic Gaussian kernel is set according to this information. After the algorithm is applied, boundary-growing areas around the connected components are established. These areas are of major importance for the evaluation of text line segmentation. For testing purposes, the algorithm is evaluated on different text samples. A comparative analysis is made between the algorithm with and without orientation of the anisotropic Gaussian kernel. The results show an improvement in text line segmentation.
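An oriented anisotropic Gaussian kernel of the kind mentioned above can be generated as in the sketch below. It uses the standard rotated 2-D Gaussian formula; the function name `oriented_gaussian_kernel`, the parameter names, and the values in the usage lines are illustrative assumptions and are not taken from the paper.

```python
import math


def oriented_gaussian_kernel(half_size, sigma_long, sigma_short, theta):
    """Build a normalized anisotropic Gaussian kernel rotated by theta.

    half_size: the kernel spans (2 * half_size + 1) pixels per side.
    sigma_long / sigma_short: spread along and across the orientation.
    theta: orientation in radians, e.g. an estimated local text skew.
    """
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    kernel = []
    for y in range(-half_size, half_size + 1):
        row = []
        for x in range(-half_size, half_size + 1):
            # Rotate (x, y) into the kernel's own coordinate frame.
            u = x * cos_t + y * sin_t
            v = -x * sin_t + y * cos_t
            row.append(math.exp(-0.5 * ((u / sigma_long) ** 2
                                        + (v / sigma_short) ** 2)))
        kernel.append(row)
    total = sum(sum(row) for row in kernel)
    return [[value / total for value in row] for row in kernel]


# Kernel elongated along the x axis (theta = 0) and along the y axis.
horizontal = oriented_gaussian_kernel(3, 3.0, 1.0, 0.0)
vertical = oriented_gaussian_kernel(3, 3.0, 1.0, math.pi / 2)
```

Setting `theta` from a per-component skew estimate makes the kernel spread mainly along the local writing direction, which is the intuition behind orienting it in the first place.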


2015 ◽  
Vol 66 (3) ◽  
pp. 132-141 ◽  
Author(s):  
Darko Brodić

Abstract: This manuscript proposes an extension to the water flow algorithm for text line segmentation. The basic algorithm assumes that hypothetical water flows at a few specified angles across the document image frame, from left to right and vice versa. As a result, the unwetted image regions that contain text are extracted. These regions are of major importance for text line segmentation. The extension of the basic algorithm modifies the water flow function that creates the unwetted region: the linear water flow function used in the basic algorithm is replaced by its power function counterpart. The extended method was tested, examined and evaluated on different text samples. The results are encouraging because they improve text line segmentation, which is a key processing stage.

