Optical character recognition system for Baybayin scripts using support vector machine

PeerJ Computer Science ◽

10.7717/peerj-cs.360 ◽

2021 ◽

Vol 7 ◽

pp. e360

Author(s):

Rodney Pino ◽

Renier Mendoza ◽

Rachelle Sambayan

Keyword(s):

Support Vector Machine ◽

Character Recognition ◽

Optical Character Recognition ◽

The Philippines ◽

Support Vector ◽

Classification Problems ◽

Script Recognition ◽

Individual Character ◽

Latin Script ◽

Character Classification

In 2018, the Philippine Congress signed House Bill 1022 declaring the Baybayin script as the Philippines’ national writing system. As a result, Baybayin and Latin scripts are highly likely to appear together in a single document. In this work, we propose a system that discriminates between the characters of both scripts. The proposed system normalizes each individual character, identifies whether it belongs to the Baybayin or the Latin script, and then classifies which unit it represents. This gives four classification problems: (1) Baybayin and Latin script recognition, (2) Baybayin character classification, (3) Latin character classification, and (4) Baybayin diacritical marks classification. To the best of our knowledge, this is the first study that uses a Support Vector Machine (SVM) for Baybayin script recognition. This work also provides a new dataset of Baybayin characters, Baybayin diacritics, and Latin characters. Classification problems (1) and (4) use binary SVM, while (2) and (3) use multiclass SVM. On average, our numerical experiments yield the following results: (1) has 98.5% accuracy, 98.5% precision, 98.49% recall, and 98.5% F1 score; (2) has 96.51% accuracy, 95.62% precision, 95.61% recall, and 95.62% F1 score; (3) has 95.8% accuracy, 95.85% precision, 95.8% recall, and 95.83% F1 score; and (4) has 100% accuracy, 100% precision, 100% recall, and 100% F1 score.
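The four metrics reported in this abstract can all be derived from a confusion matrix. The sketch below is only an illustration of how such figures are typically computed. The abstract does not say how precision, recall, and F1 were averaged across classes, so macro-averaging is an assumption here, and the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(confusion):
    """Compute accuracy and macro-averaged precision, recall, and F1.

    confusion[i][j] is the number of samples whose true class is i
    and whose predicted class is j. Macro-averaging (unweighted mean
    over classes) is an assumption, not taken from the paper.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(n_classes))

    precisions, recalls, f1_scores = [], [], []
    for k in range(n_classes):
        true_positive = confusion[k][k]
        predicted_as_k = sum(confusion[i][k] for i in range(n_classes))
        actually_k = sum(confusion[k])
        # Guard against empty rows or columns to avoid division by zero.
        precision = true_positive / predicted_as_k if predicted_as_k else 0.0
        recall = true_positive / actually_k if actually_k else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)

    return {
        "accuracy": correct / total,
        "precision": sum(precisions) / n_classes,
        "recall": sum(recalls) / n_classes,
        "f1": sum(f1_scores) / n_classes,
    }
```

For a binary problem such as (1) or (4), `confusion` would be a 2×2 matrix. For the multiclass problems (2) and (3), it would have one row and one column per character class.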

A feature fusion based optical character recognition of Bangla characters using support vector machine

2017 3rd International Conference on Electrical Information and Communication Technology (EICT) ◽

10.1109/eict.2017.8275138 ◽

2017 ◽

Cited By ~ 2

Author(s):

Mst. Tasnim Pervin ◽

Shyla Afroge ◽

Aminul Huq

Keyword(s):

Support Vector Machine ◽

Character Recognition ◽

Optical Character Recognition ◽

Feature Fusion ◽

Support Vector ◽

Optical Character

Optical Character Recognition Based on Least Square Support Vector Machine

2009 Third International Symposium on Intelligent Information Technology Application ◽

10.1109/iita.2009.327 ◽

2009 ◽

Cited By ~ 8

Author(s):

Jianhong Xie

Keyword(s):

Support Vector Machine ◽

Character Recognition ◽

Optical Character Recognition ◽

Least Square ◽

Support Vector ◽

Optical Character

A new tree-like fuzzy binary support vector machine for optical character recognition

10.1117/12.572086 ◽

2005 ◽

Author(s):

Guo-yun Zhang ◽

Jing Zhang

Keyword(s):

Support Vector Machine ◽

Character Recognition ◽

Optical Character Recognition ◽

Support Vector ◽

Optical Character

Printed Arabic optical character recognition using support vector machine

2017 International Conference on Mathematics and Information Technology (ICMIT) ◽

10.1109/mathit.2017.8259707 ◽

2017 ◽

Author(s):

Ouled Jaafri Yamina ◽

Mamouni El Mamoun ◽

Sadouni Kaddour

Keyword(s):

Support Vector Machine ◽

Character Recognition ◽

Optical Character Recognition ◽

Support Vector ◽

Optical Character

A novel framework for Farsi and Latin script identification and Farsi handwritten digit recognition

Journal of Automatic Control ◽

10.2298/jac1001017b ◽

2010 ◽

Vol 20 (1) ◽

pp. 17-25 ◽

Cited By ~ 3

Author(s):

Alireza Behrad ◽

Malike Khoddami ◽

Mehdi Salehpour

Keyword(s):

Character Recognition ◽

Optical Character Recognition ◽

Scale Space ◽

Connected Components ◽

Support Vector ◽

Script Language ◽

Scale Invariant ◽

Script Identification ◽

Digital Format ◽

Latin Script

Optical character recognition is an important task for converting handwritten and printed documents to digital format. In multilingual systems, script identification is a necessary step before the OCR algorithm is applied. In this paper, novel methods for script language identification and for the recognition of Farsi handwritten digits are proposed. Our method for script identification is based on curvature scale space features. The proposed features are rotation and scale invariant and can be used to identify scripts with different fonts. We assume that bilingual documents may contain Farsi and English words and characters together; therefore, the algorithm is designed to recognize scripts at the connected-component level. The output of the recognition is then generalized to the word, line, and page levels. For the classification and recognition of Farsi handwritten digits, we use a cluster-based weighted support vector machine that is reasonably robust against rotation and scaling. The algorithm extracts the required features using principal component analysis (PCA) and linear discriminant analysis (LDA). The extracted features are then classified using a new classification algorithm called cluster-based weighted SVM (CBWSVM). The experimental results show the promise of the algorithms.
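The abstract states that script labels are predicted at the connected-component level and then generalized to word, line, and page levels, but it does not say how. One simple possibility is majority voting at each level, sketched below. Majority voting, the nested-list page layout, and the names `majority_label` and `aggregate_script_labels` are all assumptions made for illustration, not the authors' method.

```python
from collections import Counter


def majority_label(labels):
    """Return the most frequent label in a non-empty sequence.

    Ties are resolved in favour of the label seen first, which is
    the ordering Counter.most_common gives for equal counts.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    return Counter(labels).most_common(1)[0][0]


def aggregate_script_labels(page):
    """Generalize component-level labels to word, line, and page levels.

    `page` is a list of lines, each line is a list of words, and each
    word is a list of per-component script labels (hypothetical layout).
    """
    word_labels = [[majority_label(word) for word in line] for line in page]
    line_labels = [majority_label(words) for words in word_labels]
    page_label = majority_label(line_labels)
    return word_labels, line_labels, page_label
```

With this layout, a word whose components are labelled `["farsi", "farsi", "latin"]` would be assigned `"farsi"`, and the same vote is repeated over words to label a line and over lines to label the page.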

Sentiment Analysis Classification of Political Meme Images Using the Tesseract Library and the Support Vector Machine Algorithm

Journal of Informatics Information System Software Engineering and Applications (INISTA) ◽

10.20895/inista.v2i1.96 ◽

2019 ◽

Vol 2 (1) ◽

pp. 56-64

Author(s):

Eko Sanjaya ◽

Agi Prasetiadi ◽

Wahyu Andi Saputra

Keyword(s):

Support Vector Machine ◽

Character Recognition ◽

Optical Character Recognition ◽

Support Vector ◽

Optical Character ◽

Non Linear

Memes are a way of spreading information in image form. According to the data obtained, meme creation began to increase ahead of the 2019 election. The information conveyed by political memes varies: some express support for a party or political figure, while others criticize or insult a political party or figure. A system that can classify memes by class is therefore needed. This study aims to build a system that classifies political memes by class. The classification algorithm is the Support Vector Machine (SVM) with TF-IDF feature extraction, and the library used for optical character recognition (OCR) is Tesseract. The test results show that the linear SVM is more accurate than the non-linear SVM. The best accuracy, obtained with the linear SVM combined with TF-IDF, is 75.71%.
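TF-IDF is named in this abstract as the feature extraction step applied to the text recovered from each meme before SVM classification. The abstract does not specify which TF-IDF variant was used, so the sketch below assumes the plain form, term frequency multiplied by log(N / document frequency), with whitespace tokenization and no smoothing. The function name `tf_idf` is hypothetical, and the OCR and SVM stages are not shown.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dictionary per input document.

    weight = (count of term in document / document length)
             * log(number of documents / documents containing term)
    This unsmoothed variant is an assumption, not the paper's stated choice.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Count in how many documents each term appears at least once.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        length = len(tokens)
        vectors.append({
            term: (count / length) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors
```

A term that occurs in every document receives weight zero under this variant, so only words that distinguish one meme's text from the others contribute to the feature vector passed to the classifier.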

Comparison Between Neural Network and Support Vector Machine in Optical Character Recognition

Procedia Computer Science ◽

10.1016/j.procs.2017.10.061 ◽

2017 ◽

Vol 116 ◽

pp. 351-357 ◽

Cited By ~ 10

Author(s):

Michael Reynaldo Phangtriastu ◽

Jeklin Harefa ◽

Dian Felita Tanoto

Keyword(s):

Neural Network ◽

Support Vector Machine ◽

Character Recognition ◽

Optical Character Recognition ◽

Support Vector ◽

Optical Character

Design and Implementation of Directional Feature Extraction and Support Vector Machines for Translating Words from Japanese to Indonesian via Hiragana Character Recognition on Android

TEKTRIKA - Jurnal Penelitian dan Pengembangan Telekomunikasi Kendali Komputer Elektrik dan Elektronika ◽

10.25124/tektrika.v2i1.1658 ◽

2018 ◽

Vol 2 (1) ◽

Author(s):

Fardilla Zardi Putri ◽

Budhi Irawan ◽

Umar Ali Ahmad

Keyword(s):

Support Vector Machine ◽

Feature Extraction ◽

Support Vector Machines ◽

Character Recognition ◽

Optical Character Recognition ◽

Support Vector ◽

Optical Character ◽

Vector Machines ◽

Bahasa Indonesia

In this global era, mastering a language other than Indonesian is an important need for everyone. Many people visit other countries for activities such as working, studying, or vacationing. One frequently visited country is Japan. Japan uses characters whose forms differ from the Latin alphabet, so learning Japanese requires an understanding of its characters. With the development of technology, character recognition, often called Optical Character Recognition (OCR), has become one application of technology in the field of character or pattern recognition and artificial intelligence, serving as a reading machine. In this study, an Android-based application for translating Japanese words is designed that applies the basic principles of OCR using the Directional Feature Extraction and Support Vector Machine methods. In testing, the best accuracy achieved with the Directional Feature Extraction and Support Vector Machine methods was 85.71%. The study used 104 training samples. Beta testing on four points (application appearance, system response time, translation accuracy, and application usefulness) indicates that the application can be rated as good.

A Hybrid Swarm and Gravitation-based feature selection algorithm for handwritten Indic script classification problem

Complex & Intelligent Systems ◽

10.1007/s40747-020-00237-1 ◽

2021 ◽

Author(s):

Ritam Guha ◽

Manosij Ghosh ◽

Pawan Kumar Singh ◽

Ram Sarkar ◽

Mita Nasipuri

Keyword(s):

Feature Selection ◽

Character Recognition ◽

Optical Character Recognition ◽

Classification Problem ◽

Classification Model ◽

Support Vector ◽

Intermediate Step ◽

Hybrid Swarm ◽

Feature Vectors ◽

Indic Script

In any multi-script environment, handwritten script classification is an unavoidable prerequisite before document images are fed to their respective Optical Character Recognition (OCR) engines. Over the years, researchers have addressed this complex pattern classification problem by proposing various feature vectors, mostly of large dimension, thereby increasing the computational complexity of the whole classification model. Feature Selection (FS) can serve as an intermediate step that reduces the size of the feature vectors by restricting them to the essential and relevant features. In the present work, we address this issue by introducing a new FS algorithm called Hybrid Swarm and Gravitation-based FS (HSGFS). This algorithm has been applied to three feature vectors recently introduced in the literature: Distance-Hough Transform (DHT), Histogram of Oriented Gradients (HOG), and Modified log-Gabor (MLG) filter Transform. Three state-of-the-art classifiers, namely Multi-Layer Perceptron (MLP), K-Nearest Neighbour (KNN), and Support Vector Machine (SVM), are used to evaluate the optimal subset of features generated by the proposed FS model. Handwritten datasets at block, text line, and word level, covering 12 officially recognized Indic scripts, are prepared for experimentation. An average improvement of 2–5% in classification accuracy is achieved while using only about 75–80% of the original feature vectors on all three datasets. The proposed method also performs better than some popularly used FS models. The code used for implementing HSGFS can be found at the following GitHub link: https://github.com/Ritam-Guha/HSGFS.

A Structural Analysis Based Feature Extraction Method for OCR System For Myanmar Printed Document Images

International Journal of Computer Vision and Image Processing ◽

10.4018/ijcvip.2012010102 ◽

2012 ◽

Vol 2 (1) ◽

pp. 16-41 ◽

Cited By ~ 1

Author(s):

Htwe Pa Pa Win ◽

Phyo Thu Thu Khine ◽

Khin Nwe Ni Tun

Keyword(s):

Feature Extraction ◽

Structural Analysis ◽

Character Recognition ◽

Optical Character Recognition ◽

Extraction Method ◽

Recognition Performance ◽

Extraction Methods ◽

Support Vector ◽

Svm Classifier ◽

Feature Extraction Method

This paper proposes a new feature extraction method for off-line recognition of Myanmar printed documents. One of the most important factors in achieving high recognition performance in an Optical Character Recognition (OCR) system is the selection of the feature extraction method. Existing OCR systems use various feature extraction methods because of the diverse nature of the scripts. One major contribution of this work is the design of logically rigorous coding-based features. To show the effectiveness of the proposed method, the paper assumes that the documents have been successfully segmented into characters and extracts features from these isolated Myanmar characters using structural analysis of the Myanmar script. Experiments are carried out using a Support Vector Machine (SVM) classifier, and the results are compared with those of the previously proposed feature extraction method.
