TreatmentBank: Plazi's strategies and its implementation to most efficiently liberate data from scholarly publications

Author(s):  
Marcus Guidoti ◽  
Carolina Sokolowicz ◽  
Felipe Simoes ◽  
Valdenar Gonçalves ◽  
Tatiana Ruschel ◽  
...  

Plazi's TreatmentBank is a research infrastructure and partner of the recent European Union-funded Biodiversity Community Integrated Knowledge Library (BiCIKL) project, which aims to provide a single knowledge portal to open, interlinked, machine-readable, findable, accessible, interoperable and reusable (FAIR) data. Plazi liberates published biodiversity data that is trapped in so-called flat formats, such as the portable document format (PDF), to increase its FAIRness. This poses a variety of challenges for both data mining and curation of the extracted data. The automation of such a complex process requires internal organization and a well-established workflow of specific steps (e.g., decoding of the PDF, extraction of data) to handle the challenges imposed by the immense variety of graphic layouts in the biodiversity publishing landscape. These challenges vary according to the origin of the document: scanned documents that were not born digital need optical character recognition before they can be processed. Processing a document can either be an individual, one-time-only task or a batch process, in which a template for a specific document type must be produced. Templates consist of a set of parameters that tell Plazi-dedicated software how to read a document and where to find key pieces of information for the extraction process, such as the related metadata. These parameters aim to improve the outcome of the data extraction process and lead to more consistent results than manual extraction. To produce such templates, a set of tests and accompanying statistics are evaluated, and these same statistics are continuously checked against ongoing processing tasks to assess template performance. In addition to these steps intrinsic to the automated process, different granularity levels were defined to accommodate the needs of particular projects and user requirements (e.g., a low granularity level might consist of a treatment and its subsections, whereas a high granularity level includes material citations down to named entities such as collection codes, collectors, and collecting dates). The higher the granularity level, the more thoroughly the resulting data are expected to be checked. Additionally, quality-control (QC) steps, namely "pre-QC", "QC" and "extended QC", were designed and implemented to ensure data quality and enhanced data accuracy. Data on all these stages of the processing workflow are constantly collected and assessed in order to improve those same stages, aiming for a more reliable and efficient operation. This is also associated with a current data architecture plan to move this assessment to a cloud provider, enabling real-time evaluation and continuous analysis of template performance and of the processing stages as a whole. In this talk, the steps of the entire process are explained in detail, highlighting how data are used to improve these steps towards a more efficient, accurate, and less costly operation.
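As a purely illustrative sketch of what such a template might carry, the parameters could be grouped roughly as below. Every field name here is hypothetical and does not reflect Plazi's actual template schema; it only shows the kind of layout, metadata-location, granularity, and QC information the abstract describes.

```python
# Purely illustrative sketch of a document-processing template.
# All field names are hypothetical, not Plazi's actual schema.
example_template = {
    "document_type": "Example Journal of Taxonomy",       # publication the template applies to
    "page_layout": {"columns": 2, "header_height_px": 60, "footer_height_px": 40},
    "metadata_regions": {                                  # where to find key metadata on the page
        "title": {"page": 1, "region": "top"},
        "authors": {"page": 1, "region": "below_title"},
    },
    # low: treatments and their subsections; high: material citations down to
    # named entities such as collection codes, collectors, collecting dates
    "granularity": "high",
    "qc_stages": ["pre-QC", "QC", "extended-QC"],          # quality-control steps applied afterwards
}
```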

2019 ◽  
Author(s):  
Rajasekhar Ponakala ◽  
Hari Krishna Adda ◽  
Ch. Aravind Kumar ◽  
Kavya Avula ◽  
K. Anitha Sheela

License plate recognition is an application-specific optimization of Optical Character Recognition (OCR) software that enables computer systems to automatically read vehicle license plates from digital images. This thesis discusses character extraction from vehicle license plates and the problems encountered in the extraction process. An OCR training algorithm based on k-nearest neighbors, built with the predefined OpenCV libraries, is implemented and evaluated on the BeagleBone Black open hardware. In OCR, character extraction involves several steps: image acquisition, pre-processing, feature extraction, detection/segmentation, high-level processing, and decision making. A key advantage of the method is that it is a fairly straightforward technique: the k-nearest neighbor algorithm classifies the normalized character segments and returns the result as text. The results show that training with this algorithm gives better results than other algorithms.
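A minimal sketch of the k-nearest-neighbor classification step with OpenCV's ml module is shown below. The file names and array shapes are assumptions for illustration, not the thesis code; it assumes characters have already been segmented from the plate and size-normalized.

```python
# Minimal sketch of k-NN character classification with OpenCV's ml module.
# File names and array shapes are assumptions, not the thesis implementation.
import cv2
import numpy as np

# Training data: each row is a flattened, size-normalized character image
# (e.g. 20x30 pixels -> 600 values); the label is the character's ASCII code.
train_samples = np.load("train_chars.npy").astype(np.float32)   # shape (N, 600), hypothetical file
train_labels = np.load("train_labels.npy").astype(np.float32)   # shape (N, 1), hypothetical file

knn = cv2.ml.KNearest_create()
knn.train(train_samples, cv2.ml.ROW_SAMPLE, train_labels)

# Classify characters segmented from a license-plate image.
test_samples = np.load("plate_chars.npy").astype(np.float32)    # shape (M, 600), hypothetical file
_, results, _, _ = knn.findNearest(test_samples, k=3)

plate_text = "".join(chr(int(label)) for label in results.ravel())
print(plate_text)
```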


2022 ◽  
Vol 16 (1) ◽  
pp. 54
Author(s):  
Imam Husni Al amin ◽  
Awan Aprilino

Currently, vehicle number plate detection systems in general still use manual methods, which take a great deal of time and human effort. An automatic vehicle number plate detection system is therefore needed, because the continuously increasing number of vehicles would otherwise burden human labor. In addition, existing methods for number plate detection still have low accuracy because they depend on the characteristics of the object being used. This study develops a YOLO-based automatic vehicle number plate detection system, using a YOLOv3 model trained on a dataset of 700 images. The number plate text is then extracted with the Tesseract Optical Character Recognition (OCR) library, and the results are stored in a database. The system is web-based and exposes an API, so it can be used online and across platforms. The test results show that the automatic number plate detection reaches 100% accuracy with sufficient lighting and a threshold of 0.5, and that with the Tesseract library the recognition accuracy is 92.32%, with the system successfully recognizing all characters on car and motorcycle license plates in the form of alphanumeric strings of 7-8 characters.
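A minimal sketch of the text-extraction step with Tesseract (via pytesseract) is given below. The bounding box is assumed to come from the YOLOv3 detector the paper describes; the image name, coordinates, and preprocessing are illustrative, not the authors' implementation.

```python
# Minimal sketch of plate-text extraction with Tesseract OCR (pytesseract).
# The bounding box is assumed to come from a YOLOv3 detector; values are illustrative.
import cv2
import pytesseract

image = cv2.imread("car.jpg")                  # hypothetical input image
x, y, w, h = 120, 340, 220, 60                 # plate box as returned by the detector (assumed)

plate = image[y:y + h, x:x + w]
gray = cv2.cvtColor(plate, cv2.COLOR_BGR2GRAY)
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# --psm 7 treats the crop as a single line of text, which suits 7-8 character plates.
text = pytesseract.image_to_string(binary, config="--psm 7")
print(text.strip())
```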


Author(s):  
Anurag Tiwari

The paperwork involved in maintaining various types of documents in our daily lives is tiresome and inefficient: it consumes a lot of time, and it is difficult to maintain and remember the relevant documents. This project provides a solution to these problems by introducing Optical Character Recognition (OCR) technology running on the Tesseract OCR Engine. The project specifically aims at increasing data accessibility and usability and improving customer experience by decreasing the time spent processing, saving, and maintaining user data. Another objective is to eliminate human error, which is substantial in the manual handling of data records; the software uses certain techniques to minimize these errors. OCR is used for extracting text and characters from an image, which helps in maintaining records and data digitally and securely. This project uses the Tesseract OCR Engine, which has high accuracy rates for clean images. We have implemented a web version of OCR that runs on TesseractJS; other JavaScript frameworks are also used. The outcome is that the project is able to successfully extract text and characters from a provided image using the Tesseract OCR Engine. For high-resolution images, the observed accuracy is above 90%. This web-based application is useful for small businesses, as they do not have to install any extra software: all that is needed is a file uploaded to an online interface, which can be accessed remotely. It will also help students save notes and documents online, making their important documents easily accessible on the web. The whole process is time and memory efficient.
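As a rough sketch of the same idea of extracting text from an uploaded document image, the snippet below uses Python's pytesseract for brevity rather than the project's TesseractJS stack; the file name is a placeholder, and the per-word confidence reporting is an illustrative addition rather than a feature claimed by the project.

```python
# Rough sketch of extracting text plus per-word confidence from a document image.
# Shown with pytesseract for brevity; the project itself uses TesseractJS in the browser.
from PIL import Image
import pytesseract

image = Image.open("scanned_document.png")       # hypothetical uploaded file

data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
words = [
    (word, float(conf))
    for word, conf in zip(data["text"], data["conf"])
    if word.strip() and float(conf) >= 0         # conf of -1 marks non-text blocks
]

extracted_text = " ".join(word for word, _ in words)
average_confidence = sum(conf for _, conf in words) / max(len(words), 1)
print(extracted_text)
print(f"average word confidence: {average_confidence:.1f}%")
```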


2020 ◽  
Author(s):  
Samuel Shepard ◽  
Audrey Wise ◽  
Bradley S. Johnson ◽  
Matt Vassar

Background: Given the increased amount of research being funded in the field of urology, reducing research waste is vital. Systematic reviews are an essential tool for reducing waste in research; they are a comprehensive summary of the current data on a clinical question. The aim of this study is to evaluate the use of systematic reviews as justification for conducting randomized clinical trials (RCTs) in high-impact urology journals. Methods: On December 13, 2019, one of us (BJ) conducted a PubMed search for randomized controlled trials published in the top four urology journals according to their Google Scholar h5-index. Using a masked data extraction process, each RCT was searched for systematic reviews. Each review was then evaluated to determine whether it served as justification for conducting the trial, based on the context in which the systematic review was used. Results: Of the 566 articles retrieved, 281 were included. Overall, 60.5% (170/281) of trials cited a systematic review. Only 47.6% (134/281) of studies cited a systematic review as "verbatim" justification for conducting the trial. Regression analysis showed that trials of medical devices were significantly more likely to cite a systematic review than trials on other topics (adjusted odds ratio 2.01, 95% CI 1.08-3.73). A total of 409 different systematic review citations were recorded in the 281 trials. Conclusion: Less than half of clinical trials cited a systematic review as justification for conducting the trial. If clinical trials were required to support their studies with systematic reviews, we believe this would greatly reduce the amount of research waste within clinical research.


2019 ◽  
Vol 4 (2) ◽  
pp. 335
Author(s):  
Anis Nadiah Che Abdul Rahman ◽  
Imran Ho Abdullah ◽  
Intan Safinaz Zainuddin ◽  
Azhar Jaludin

Optical Character Recognition (OCR) is a computational technology that allows printed characters to be recognized by means of photoelectric devices and computer software. It works by converting scanned images or texts into machine-readable, editable text. There are various OCR tools on the market for commercial and research use, available for free or requiring purchase. The accuracy of an OCR tool's results also depends on its pre-processing and segmentation algorithms. This study investigates the performance of OCR tools in converting the Parliamentary Reports of Hansard Malaysia for developing the Malaysian Hansard Corpus (MHC). Comparing four OCR tools, the study converted ten Parliamentary Reports comprising 62 pages to measure the conversion accuracy and error rate of each tool. In this study, all of the tools were used to convert Adobe Portable Document Format (PDF) files into plain text files (txt). The objective is to give an overview, based on accuracy and error rate, of how each OCR tool essentially works and how it can be used to assist corpus building. The study indicates that the tools vary in accuracy and error rate when converting whole documents from PDF to plain text. The study suggests that this step of corpus building can be made easier and more manageable when a researcher understands how an OCR tool works and chooses the best tool before starting corpus development.
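One common way to score a converted page against a hand-checked reference, in the spirit of the accuracy and error rates compared here, is the character error rate (CER) based on edit distance. The sketch below is a generic illustration, not the study's evaluation code; the file names are placeholders for one OCR-converted Hansard page and its reference transcription.

```python
# Minimal sketch: character error rate (CER) of an OCR output vs. a reference page,
# computed with Levenshtein edit distance. File names are placeholders.

def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]

with open("ocr_output.txt", encoding="utf-8") as f:
    hypothesis = f.read()
with open("reference.txt", encoding="utf-8") as f:
    reference = f.read()

cer = levenshtein(hypothesis, reference) / max(len(reference), 1)
print(f"character error rate: {cer:.2%}  accuracy: {1 - cer:.2%}")
```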


2017 ◽  
Vol 105 (3) ◽  
Author(s):  
Lori Gawdyda ◽  
Kimbroe Carter ◽  
Mark Willson ◽  
Denise Bedford

Background: Harold Jeghers, a well-known medical educator of the twentieth century, maintained a print collection of about one million medical articles from the late 1800s to the 1990s. This case study discusses how this print collection was transformed into a digital database. Case Presentation: Staff in the Jeghers Medical Index, St. Elizabeth Youngstown Hospital, converted paper articles to Adobe portable document format (PDF)/A-1a files. Optical character recognition was used to obtain searchable text. The data were then incorporated into a specialized database. Lastly, articles were matched to PubMed bibliographic metadata through automation and human review. An online database of the collection was ultimately created. The collection was made part of a discovery search service, and semantic technologies have been explored as a method of creating access points. Conclusions: This case study shows how a small medical library made medical writings of the nineteenth and twentieth centuries available in electronic format for historic or semantic research, highlighting the efficiencies of contemporary information technology.
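As a rough sketch of the automated matching idea, an OCR-derived article title can be checked against PubMed through NCBI's public E-utilities API. This illustrates the general approach, not the Jeghers Medical Index software; the function name and the example title are placeholders.

```python
# Rough sketch: match an OCR-derived article title to PubMed bibliographic metadata
# via NCBI's public E-utilities (esearch). Illustrative only, not the index's software.
import requests

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def find_pmids(title: str) -> list[str]:
    """Return candidate PubMed IDs for a title string recovered by OCR."""
    params = {
        "db": "pubmed",
        "term": f"{title}[Title]",
        "retmode": "json",
        "retmax": 5,
    }
    response = requests.get(ESEARCH_URL, params=params, timeout=30)
    response.raise_for_status()
    return response.json()["esearchresult"]["idlist"]

# Candidate matches would still go to human review, as in the case study.
print(find_pmids("Example article title recovered by OCR"))
```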

