BYANJON: A Ground Truth Preparation System for Online Handwritten Bangla Documents

Author(s):  
Shibaprasad Sen ◽  
Ankan Bhattacharyya ◽  
Ram Sarkar ◽  
Kaushik Roy

The work reported in this article presents a ground truth generation scheme for online handwritten Bangla documents at the text-line, word, and stroke levels. The aim of the proposed scheme is twofold: first, to build a document-level database that future researchers can use in this field; second, to provide ground truth information that helps other researchers evaluate the performance of their algorithms for text-line extraction, word extraction, word segmentation, stroke recognition, and word recognition. The scheme starts by extracting text-lines from the online handwritten Bangla documents, then extracts words from the text-lines, and finally segments those words into basic strokes. After word segmentation, the basic strokes are assigned appropriate class labels using a modified distance-based feature extraction procedure and a multi-layer perceptron (MLP) classifier. The Unicode representation of each word is then generated from the sequence of stroke labels. XML files store the stroke-, word-, and text-line-level ground truth information for the corresponding documents. The proposed system is semi-automatic, and each step (text-line extraction, word extraction, word segmentation, and stroke recognition) is implemented with a different algorithm. The procedure thus substantially reduces manual intervention by lowering the number of mouse clicks required to extract text-lines and words from a document and to segment the words into basic strokes. The integrated stroke recognition module also reduces the manual labor needed to assign appropriate stroke labels. The system is freely available and can be accessed at https://byanjon.herokuapp.com/ .
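
The abstract says that XML files hold ground truth at the text-line, word, and stroke levels, but it does not give the schema. The following is a minimal sketch of how such a three-level hierarchy might be serialized. The element and attribute names (`document`, `textline`, `word`, `stroke`, `label`, `unicode`) and the input layout are assumptions made for illustration, not the format used by BYANJON.

```python
import xml.etree.ElementTree as ET


def build_ground_truth(doc_id, lines):
    """Build a hypothetical three-level ground truth tree.

    lines: list of text-lines; each text-line is a list of words;
    each word is a dict with 'unicode' (str) and 'strokes', a list of
    dicts with 'label' (str) and 'points' (list of (x, y) pairs).
    All tag and attribute names here are illustrative assumptions.
    """
    root = ET.Element("document", id=doc_id)
    for li, words in enumerate(lines):
        line_el = ET.SubElement(root, "textline", index=str(li))
        for wi, word in enumerate(words):
            word_el = ET.SubElement(
                line_el, "word", index=str(wi), unicode=word["unicode"]
            )
            for si, stroke in enumerate(word["strokes"]):
                stroke_el = ET.SubElement(
                    word_el, "stroke", index=str(si), label=stroke["label"]
                )
                # Store pen coordinates as space-separated "x,y" pairs.
                stroke_el.text = " ".join(
                    f"{x},{y}" for x, y in stroke["points"]
                )
    return root


if __name__ == "__main__":
    # One text-line holding one word made of a single labeled stroke.
    sample = [[{
        "unicode": "\u0995",
        "strokes": [{"label": "S1", "points": [(0, 0), (1, 2)]}],
    }]]
    print(ET.tostring(build_ground_truth("doc-001", sample), encoding="unicode"))
```

Nesting strokes inside words and words inside text-lines lets one file serve evaluation at any of the three levels mentioned in the abstract.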

2014 ◽  
Vol 23 (3) ◽  
pp. 245-260 ◽  
Author(s):  
Ram Sarkar ◽  
Nibaran Das ◽  
Subhadip Basu ◽  
Mahantapas Kundu ◽  
Mita Nasipuri

Abstract: A novel piecewise water flow technique for text line extraction from multi-skewed document images of handwritten text of different scripts is presented here. The basic water flow technique assumes that hypothetical water flows from both the left and right sides of the image frame. This flow of water fills up the gaps between consecutive objects (texts) but is obstructed if any object lies in its path. All unwetted regions in the document image are then labeled distinctly to extract the text lines. However, the technique fails when two neighboring text lines touch each other, as the water is obstructed by the touching segment(s). To overcome this difficulty, we have modified the basic water flow technique by applying it iteratively over vertically segmented document images. The main purpose of this vertical segmentation is to localize the text line segment(s) where two text lines are joined. These segments are then horizontally fragmented, and each fragment is assigned to the text line to which it actually belongs. This way, the probable data loss during isolation of the touching text line segment is minimized. Both techniques (the current and the basic one) have been tested on three different databases, viz., CMATERdb 1.1.1, CMATERdb 1.1.2, and the ICDAR2009 handwritten segmentation contest pages. The test results show that the present technique outperforms the basic one for all three databases.
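
The basic water flow idea described above can be illustrated with a small sketch on a binary image. This is a simplified variant written for illustration, not the authors' implementation: the `spread_every` parameter (water moves one row up or down only at every `spread_every`-th column) is an assumed stand-in for a flow angle, the function names are hypothetical, and the piecewise modification for touching lines is not covered.

```python
from collections import deque


def flow_from_left(ink, spread_every):
    """Mark pixels wetted by water entering from the left edge."""
    h, w = len(ink), len(ink[0])
    wet = [[False] * w for _ in range(h)]
    for r in range(h):
        wet[r][0] = not ink[r][0]
    for c in range(1, w):
        for r in range(h):
            if ink[r][c]:
                continue  # ink obstructs the flow
            reached = wet[r][c - 1]
            # Assumed angle model: vertical spread every few columns.
            if not reached and c % spread_every == 0:
                up = r > 0 and wet[r - 1][c - 1]
                down = r + 1 < h and wet[r + 1][c - 1]
                reached = up or down
            wet[r][c] = reached
    return wet


def label_unwetted(ink, spread_every=4):
    """Label connected regions that neither flow could wet."""
    h, w = len(ink), len(ink[0])
    left = flow_from_left(ink, spread_every)
    # Flow from the right is the left flow on the mirrored image.
    mirrored = [row[::-1] for row in ink]
    right = [row[::-1] for row in flow_from_left(mirrored, spread_every)]
    labels = [[0] * w for _ in range(h)]
    count = 0
    for r0 in range(h):
        for c0 in range(w):
            if left[r0][c0] or right[r0][c0] or labels[r0][c0]:
                continue
            count += 1
            labels[r0][c0] = count
            queue = deque([(r0, c0)])
            while queue:  # breadth-first fill, 4-connectivity
                r, c = queue.popleft()
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = r + dr, c + dc
                    if (0 <= nr < h and 0 <= nc < w
                            and not labels[nr][nc]
                            and not (left[nr][nc] or right[nr][nc])):
                        labels[nr][nc] = count
                        queue.append((nr, nc))
    return labels, count
```

Here `ink` is a list of rows with truthy values for text pixels. Each nonzero label in the result marks one candidate text-line region; gaps between words on the same line stay unwetted because the words on either side shield them from both flows.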


2014 ◽  
Vol 35 ◽  
pp. 23-33 ◽  
Author(s):  
Raid Saabni ◽  
Abedelkadir Asi ◽  
Jihad El-Sana
