Which OCR toolset is good and why? A comparative study

Pooja Jain;  ; Dr. Kavita Taneja; Dr. Harmunish Taneja;  ;

doi:10.48129/kjs.v48i2.9589

Which OCR toolset is good and why? A comparative study

Kuwait Journal of Science ◽

10.48129/kjs.v48i2.9589 ◽

2021 ◽

Vol 48 (2) ◽

Author(s):

Pooja Jain ◽

◽

Dr. Kavita Taneja ◽

Dr. Harmunish Taneja ◽

◽

...

Keyword(s):

Comparative Study ◽

Language Processing ◽

Character Recognition ◽

Optical Character Recognition ◽

Research Area ◽

Real World Applications ◽

Banking Education ◽

Active Research ◽

Active Research Area ◽

Computational Technology

Optical Character Recognition (OCR) is a very active research area in many challenging fields like pattern recognition, natural language processing (NLP), computer vision, biomedical informatics, machine learning (ML), and artificial intelligence (AI). This computational technology extracts the text in an editable format (MS Word/Excel, text files, etc.) from PDF files, scanned or hand-written documents, images (photographs, advertisements, and alike), etc. for further processing and has been utilized in many real-world applications including banking, education, insurance, finance, healthcare and keyword-based search in documents, etc. Many OCR toolsets are available under various categories, including open-source, proprietary, and online services. This research paper provides a comparative study of various OCR toolsets considering a variety of parameters.

Download Full-text

Word and Chracter Segmentation in Devnagari and Odia Script – A Comparitive Analysis

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.i7060.079920 ◽

2020 ◽

Vol 9 (9) ◽

pp. 377-382

Keyword(s):

Character Recognition ◽

Optical Character Recognition ◽

Research Area ◽

Character Segmentation ◽

National Language ◽

Eastern India ◽

Regional Language ◽

Optical Character ◽

Active Research ◽

Active Research Area

Optical Character Recognition has been an active research area in computer science for several years. Several research works undertaken on various languages in India. In this paper an attempt has been made to find out the percentage of accuracy in word and character segmentation of Hindi (National language of India) and Odia is one of the Regional Language mostly spoken in Odisha and a few Eastern India states. A comparative article has been published under this article. 10 sets of each printed Odia and Devanagari scripts with different word limits were used in this study. The documents were scanned at 300dpi before adopting pre-processing and segmentation procedure. The result shows that the percentage of accuracy both in word and character segmentation is higher in Odia language as compared to Hindi language. One of the reasons is the use of headers line in Hindi which makes the segmentation process cumbersome. Thus, it can be concluded that the accuracy level can vary from one language to the other and from word segmentation to that of the character segmentation.

Download Full-text

Machine Translation

The Oxford Handbook of Computational Linguistics 2nd edition ◽

10.1093/oxfordhb/9780199573691.013.26 ◽

2016 ◽

Author(s):

Lucia Specia ◽

Yorick Wilks

Keyword(s):

Machine Translation ◽

Language Processing ◽

State Of The Art ◽

Research Area ◽

Rule Based ◽

Active Research ◽

Translation Methods ◽

The Cost ◽

Translation Systems ◽

Active Research Area

Machine Translation (MT) is and always has been a core application in the field of natural-language processing. It is a very active research area and it has been attracting significant commercial interest, most of which has been driven by the deployment of corpus-based, statistical approaches, which can be built in a much shorter time and at a fraction of the cost of traditional, rule-based approaches, and yet produce translations of comparable or superior quality. This chapter aims at introducing MT and its main approaches. It provides a historical overview of the field, an introduction to different translation methods, both rationalist (rule-based) and empirical, and a more in depth description of state-of-the-art statistical methods. Finally, it covers popular metrics to evaluate the output of machine translation systems.

Download Full-text

An Arabic Dialects Dictionary Using Word Embeddings

International Journal of Rough Sets and Data Analysis ◽

10.4018/ijrsda.2019070102 ◽

2019 ◽

Vol 6 (3) ◽

pp. 18-31

Author(s):

Azroumahli Chaimae ◽

Yacine El Younoussi ◽

Otman Moussaoui ◽

Youssra Zahidi

Keyword(s):

Language Processing ◽

Arabic Language ◽

Research Area ◽

Word Embeddings ◽

Language Resources ◽

Data Set ◽

Social Media Platforms ◽

Active Research ◽

Active Research Area ◽

Modern Standard

The dialectical Arabic and the Modern Standard Arabic lacks sufficient standardized language resources to enable the tasks of Arabic language processing, despite it being an active research area. This work addresses this issue by firstly highlighting the steps and the issues related to building a multi Arabic dialect corpus using web data from blogs and social media platforms (i.e. Facebook, Twitter, etc.). This is to create a vectorized dictionary for the crawled data using the word Embeddings. In other terms, the goal of this article is to build an updated multi-dialect data set, and then, to extract an annotated corpus from it.

Download Full-text

A Person-Centered Design Framework for Serious Games for Dementia

Proceedings of the Human Factors and Ergonomics Society Annual Meeting ◽

10.1177/1071181320641005 ◽

2020 ◽

Vol 64 (1) ◽

pp. 18-22

Author(s):

Bella Yigong Zhang ◽

Mark Chignell

Keyword(s):

Theoretical Basis ◽

Serious Games ◽

Aging Population ◽

Research Area ◽

Design Framework ◽

Strategic Design ◽

Health And Wellbeing ◽

Initial Design ◽

Active Research ◽

Active Research Area

With the rapidly aging population and the rising number of people living with dementia (PLWD), there is an urgent need for programming and activities that can promote the health and wellbeing of PLWD. Due to staffing and budgetary constraints, there is considerable interest in using technology to support this effort. Serious games for dementia have become a very active research area. However, much of the work is being done without a strong theoretical basis. We incorporate a Montessori approach with highly tactile interactions. We have developed a person-centered design framework for serious games for dementia with initial design recommendations. This framework has the potential to facilitate future strategic design and development in the field of serious games for dementia.

Download Full-text

ARGO, Automatic Record Generator for Oncology: a natural language process-based tool to capture pathology features from onco-hematological reports (Preprint)

10.2196/preprints.27295 ◽

2021 ◽

Author(s):

Gian Maria Zaccaria ◽

Vito Colella ◽

Simona Colucci ◽

Felice Clemente ◽

Fabio Pavone ◽

...

Keyword(s):

Natural Language ◽

Translational Research ◽

Language Processing ◽

Character Recognition ◽

Web Application ◽

Optical Character Recognition ◽

Anatomical Site ◽

Cell Of Origin ◽

Molecular Features ◽

Clinical And Translational Research

BACKGROUND The unstructured nature of medical data from Real-World (RW) patients and the scarce accessibility for researchers to integrated systems restrain the use of RW information for clinical and translational research purposes. Natural Language Processing (NLP) might help in transposing unstructured reports in electronic health records (EHR), thus prompting their standardization and sharing. OBJECTIVE We aimed at designing a tool to capture pathological features directly from hemo-lymphopathology reports and automatically record them into electronic case report forms (eCRFs). METHODS We exploited Optical Character Recognition and NLP techniques to develop a web application, named ARGO (Automatic Record Generator for Oncology), that recognizes unstructured information from diagnostic paper-based reports of diffuse large B-cell lymphomas (DLBCL), follicular lymphomas (FL), and mantle cell lymphomas (MCL). ARGO was programmed to match data with standard diagnostic criteria of the National Institute of Health, automatically assign diagnosis and, via Application Programming Interface, populate specific eCRFs on the REDCap platform, according to the College of American Pathologists templates. A selection of 239 reports (n. 106 DLBCL, n.79 FL, and n. 54 MCL) from the Pathology Unit at the IRCCS - Istituto Tumori “Giovanni Paolo II” of Bari (Italy) was used to assess ARGO performance in terms of accuracy, precision, recall and F1-score. RESULTS By applying our workflow, we successfully converted 233 paper-based reports into corresponding eCRFs incorporating structured information about diagnosis, tissue of origin and anatomical site of the sample, major molecular markers and cell-of-origin subtype. Overall, ARGO showed high performance (nearly 90% of accuracy, precision, recall and F1-score) in capturing identification report number, biopsy date, specimen type, diagnosis, and additional molecular features. CONCLUSIONS We developed and validated an easy-to-use tool that converts RW paper-based diagnostic reports of major lymphoma subtypes into structured eCRFs. ARGO is cheap, feasible, and easily transferable into the daily practice to generate REDCap-based EHR for clinical and translational research purposes.

Download Full-text

A Finding Aid for The Equity

Inquiry@Queen's Undergraduate Research Conference Proceedings ◽

10.24908/iqurcp.11597 ◽

2018 ◽

Author(s):

Jeff Blackadar

Keyword(s):

Language Processing ◽

Character Recognition ◽

Optical Character Recognition ◽

Large Collection ◽

Digital History ◽

R Language ◽

Potential Value ◽

Optical Character ◽

Text Searching ◽

Person Location

Bibliothèque et Archives Nationales du Québec digitally scanned and converted to text a large collection of newspapers to create a resource of tremendous potential value to historians. Unfortunately, the text files are difficult to search reliably due to many errors caused by the optical character recognition (OCR) text conversion process. This digital history project applied natural language processing in an R language computer program to create a new and useful index of this corpus of digitized content despite OCR related errors. The project used editions of The Equity, published in Shawville, Quebec since 1883. The program extracted the names of all the person, location and organization entities that appeared in each edition. Each of the entities was cataloged in a database and related to the edition of the newspaper it appeared in. The database was published to a public website to allow other researchers to use it. The resulting index or finding aid allows researchers to access The Equity in a different way than just full text searching. People, locations and organizations appearing in the Equity are listed on the website and each entity links to a page that lists all of the issues that entity appeared in as well as the other entities that may be related to it. Rendering the text files of each scanned newspaper into entities and indexing them in a database allows the content of the newspaper to be interacted with by entity name and type rather than just a set of large text files. Website: http://www.jeffblackadar.ca/graham_fellowship/corpus_entities_equity/

Download Full-text

Digitization and Visualization of Folk Dances in Cultural Heritage: A Review

Inventions ◽

10.3390/inventions3040072 ◽

2018 ◽

Vol 3 (4) ◽

pp. 72 ◽

Cited By ~ 6

Author(s):

Iris Kico ◽

Nikos Grammalidis ◽

Yiannis Christidis ◽

Fotis Liarokapis

Keyword(s):

Performance Evaluation ◽

Cultural Heritage ◽

Computer Science ◽

Learning Process ◽

Research Area ◽

Active Research ◽

Active Research Area

According to UNESCO, cultural heritage does not only include monuments and collections of objects, but also contains traditions or living expressions inherited from our ancestors and passed to our descendants. Folk dances represent part of cultural heritage and their preservation for the next generations appears of major importance. Digitization and visualization of folk dances form an increasingly active research area in computer science. In parallel to the rapidly advancing technologies, new ways for learning folk dances are explored, making the digitization and visualization of assorted folk dances for learning purposes using different equipment possible. Along with challenges and limitations, solutions that can assist the learning process and provide the user with meaningful feedback are proposed. In this paper, an overview of the techniques used for the recording of dance moves is presented. The different ways of visualization and giving the feedback to the user are reviewed as well as ways of performance evaluation. This paper reviews advances in digitization and visualization of folk dances from 2000 to 2018.

Download Full-text

Enhancing Big Data Auditing

Computer and Information Science ◽

10.5539/cis.v11n1p90 ◽

2018 ◽

Vol 11 (1) ◽

pp. 90

Author(s):

Sara Alomari ◽

Mona Alghamdi ◽

Fahd S. Alotaibi

Keyword(s):

Big Data ◽

Research Area ◽

Provable Data Possession ◽

Integrity Verification ◽

The Core ◽

Outsourced Data ◽

Active Research ◽

Data Auditing ◽

Active Research Area ◽

Proof Of Retrievability

The auditing services of the outsourced data, especially big data, have been an active research area recently. Many schemes of remotely data auditing (RDA) have been proposed. Both categories of RDA, which are Provable Data Possession (PDP) and Proof of Retrievability (PoR), mostly represent the core schemes for most researchers to derive new schemes that support additional capabilities such as batch and dynamic auditing. In this paper, we choose the most popular PDP schemes to be investigated due to the existence of many PDP techniques which are further improved to achieve efficient integrity verification. We firstly review the work of literature to form the required knowledge about the auditing services and related schemes. Secondly, we specify a methodology to be adhered to attain the research goals. Then, we define each selected PDP scheme and the auditing properties to be used to compare between the chosen schemes. Therefore, we decide, if possible, which scheme is optimal in handling big data auditing.

Download Full-text

MELHISSA: a multilingual entity linking architecture for historical press articles

International Journal on Digital Libraries ◽

10.1007/s00799-021-00319-6 ◽

2021 ◽

Author(s):

Elvys Linhares Pontes ◽

Luis Adrián Cabrera-Diego ◽

Jose G. Moreno ◽

Emanuela Boros ◽

Ahmed Hamdi ◽

...

Keyword(s):

Language Processing ◽

Digital Libraries ◽

Character Recognition ◽

Optical Character Recognition ◽

Historical Documents ◽

Entity Linking ◽

Named Entities ◽

European Languages ◽

Meta Information ◽

The Impact

AbstractDigital libraries have a key role in cultural heritage as they provide access to our culture and history by indexing books and historical documents (newspapers and letters). Digital libraries use natural language processing (NLP) tools to process these documents and enrich them with meta-information, such as named entities. Despite recent advances in these NLP models, most of them are built for specific languages and contemporary documents that are not optimized for handling historical material that may for instance contain language variations and optical character recognition (OCR) errors. In this work, we focused on the entity linking (EL) task that is fundamental to the indexation of documents in digital libraries. We developed a Multilingual Entity Linking architecture for HIstorical preSS Articles that is composed of multilingual analysis, OCR correction, and filter analysis to alleviate the impact of historical documents in the EL task. The source code is publicly available. Experimentation has been done over two historical documents covering five European languages (English, Finnish, French, German, and Swedish). Results have shown that our system improved the global performance for all languages and datasets by achieving an F-score@1 of up to 0.681 and an F-score@5 of up to 0.787.

Download Full-text

E-Mail Usage in South Pacific Distance Education

Encyclopedia of Information Science and Technology, First Edition ◽

10.4018/978-1-59140-553-5.ch183 ◽

2005 ◽

pp. 1034-1039

Author(s):

Jonathan Frank ◽

Janet Toland ◽

Karen D. Schenk

Keyword(s):

Learning Styles ◽

Research Area ◽

Web Based ◽

Group Interactions ◽

Face To Face ◽

Web Based Learning ◽

Active Research ◽

The Impact ◽

E Mail ◽

Active Research Area

The impact of cultural diversity on group interactions through technology is an active research area. Current research has found that a student’s culture appears to influence online interactions with teachers and other students (Freedman & Liu, 1996). Students from Asian and Western cultures have different Web-based learning styles (Liang & McQueen, 1999), and Scandinavian students demonstrate a more restrained online presence compared to their more expressive American counterparts (Bannon, 1995). Differences were also found across cultures in online compared to face-to-face discussions (Warschauer, 1996). Student engagement, discourse, and interaction are valued highly in “western” universities. With growing internationalization of western campuses, increasing use of educational technology both on and off campus, and rising distance learning enrollments, intercultural frictions are bound to increase.

Download Full-text