Wikisource as a tool for OCR transcription correction: The National Library of Scotland's response to COVID-19

Author(s):  
Gavin Willshaw

This chapter focuses on the National Library of Scotland's Wikisource transcription correction project, an organization-wide effort during lockdown that generated 1,000 fully accurate transcriptions from the 3,000 Scottish chapbooks the Library had uploaded to Wikisource, Wikimedia's online library of digitized, out-of-copyright works. The project, which contributed to the Library being awarded Partnership of the Year 2020 at the Wikimedia UK AGM, is thought to be the largest ever staff engagement with Wikimedia and has brought significant benefits to the Library and its staff well beyond the project's original aims. Initially set up to improve the quality of optical character recognition (OCR) transcriptions in order to make the chapbooks more discoverable and searchable, the project gave staff a purpose and sense of belonging during lockdown, provided an opportunity to work with a varied and fascinating collection, and enabled them to develop new skills in editing Wikisource, drafting guidance documentation, and managing projects. Further to this, the initiative greatly increased library staff engagement with Wikimedia, led to the formation of a Wikimedia Community of Interest, and resulted in the embedding of Wikimedia activity in staff work.

1979, Vol 73 (10), pp. 389-399
Author(s):  
Gregory L. Goodrich, Richard R. Bennett, William R. De L'aune, Harvey Lauer, Leonard Mowinski

This study was designed to assess the Kurzweil Reading Machine's ability to read three different type styles produced by five different means. The results indicate that the Kurzweil Reading Machines tested have different error rates depending upon the means of producing the copy and upon the type style used; there was a significant interaction between copy method and type style. The interaction indicates that some type styles are better read when the copy is made by one means rather than another. Error rates ranged from less than one percent to more than twenty percent. In general, the user will find that high-quality printed materials will be read with a relatively high level of accuracy, but as the quality of the material decreases, the number of errors made by the machine also increases. As this error rate increases, the user will find it increasingly difficult to understand the spoken output.


2021, Vol 4 (1), pp. 57-70
Author(s):  
Marina V. Polyakova, Alexandr G. Nesteryuk

Optical character recognition systems are used to convert books and documents into electronic form, to automate accounting systems in business, to recognize markers in augmented reality technologies, and so on. When binarization is applied, the quality of optical character recognition is largely determined by how well foreground pixels are separated from the background. Existing methods of text image binarization are analyzed and their insufficient quality is noted. As the research approach, a minimum-distance classifier is used to improve an existing method of binarizing color text images. To improve binarization quality, it is advisable to divide image pixels into the two classes “Foreground” and “Background” using classification methods, namely a minimum-distance classifier, rather than heuristic threshold selection. To reduce the amount of information processed before applying the classifier, blocks of pixels are selected for subsequent processing by analyzing the connected components of the original image. An improved method of color text image binarization using connected-component analysis and a minimum-distance classifier has been elaborated. Evaluation of the elaborated method showed that it outperforms existing binarization methods in terms of robustness, but is worse in terms of the error in determining the boundaries of objects. Among the recognition errors, pixels of the class labeled “Foreground” were more often mistaken for the class labeled “Background”. With a single prototype per class, the proposed binarization method is recommended for processing color images of printed text, where the error in determining character boundaries introduced by binarization is compensated by the thickness of the letters. With multiple prototypes per class, the method is recommended for processing color images of handwritten text, provided that high performance is not required. The improved binarization method has shown its efficiency under slow changes in the color and illumination of the text and background; however, abrupt changes in color and illumination, as well as a textured background, prevent it from reaching the binarization quality required for practical problems.
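As an illustration of the core idea, the following minimal sketch assigns each pixel to whichever of two class prototypes (“Foreground” or “Background”) is nearer in RGB space. It is a sketch under stated assumptions only: the prototype estimation shown here, from the darkest and brightest pixels, stands in for the paper's connected-component analysis step, and the function name is illustrative.

```python
import numpy as np

def binarize_min_distance(rgb, fg_prototype=None, bg_prototype=None):
    """Assign each pixel to the nearer of two class prototypes in RGB space.

    If prototypes are not supplied, they are estimated from the darkest and
    brightest pixels, a crude stand-in for the paper's connected-component step.
    """
    pixels = rgb.reshape(-1, 3).astype(np.float64)
    if fg_prototype is None or bg_prototype is None:
        luminance = pixels.mean(axis=1)
        fg_prototype = pixels[luminance <= np.percentile(luminance, 5)].mean(axis=0)
        bg_prototype = pixels[luminance >= np.percentile(luminance, 95)].mean(axis=0)
    # Minimum-distance rule: a pixel is "Foreground" when it lies closer
    # (in Euclidean distance) to the foreground prototype than to the background one.
    d_fg = np.linalg.norm(pixels - fg_prototype, axis=1)
    d_bg = np.linalg.norm(pixels - bg_prototype, axis=1)
    return (d_fg < d_bg).reshape(rgb.shape[:2])
```

With multiple prototypes per class, as the abstract suggests for handwritten text, the same rule would compare each pixel against every prototype and take the class of the nearest one.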


Author(s):  
Michael Plotnikov, Paul W. Shuldiner

The ability of an automated license plate reading (ALPR) system to convert video images of license plates into computer records depends on many factors. Of these, two are readily controlled by the operator: the quality of the video images captured in the field and the internal settings of the ALPR used to transcribe these images. A third factor, the light conditions under which the license plate images are acquired, is less easily managed, especially when camcorders are used in the field under ambient light conditions. A set of experiments was conducted to test the effects of ambient light conditions, video camcorder adjustments, and internal ALPR settings on the percentage of correct reads attained by a specific type of ALPR, one whose optical character recognition process is based on template matching. Images of rear license plates were collected under four ambient light conditions: overcast with no shadows, and full sunlight with the sun in front of the camcorder, behind the camcorder, and orthogonal to the line of sight. Three camcorder exposure settings were tested. Two of the settings made use of the camcorder's internal light meter, and the third relied solely on operator judgment. The percentage of license plates read correctly ranged from 41% to 72%, depending most strongly on ambient light conditions. In all cases, careful adjustment of the ALPR led to significantly improved read rates over those obtained by using the manufacturer's recommended default settings. Exposure settings based on the operator's judgment worked best in all instances.
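The template-matching step at the heart of such an ALPR engine can be sketched as follows: each segmented plate character is scored against a set of reference glyphs with normalised cross-correlation, and the best-scoring label is kept. This is a generic OpenCV illustration, not the commercial system evaluated in the study; the function and template names are assumptions.

```python
import cv2

def read_character(glyph, templates):
    """Classify one segmented plate character by template matching.

    `templates` maps a character label to a grayscale template image; the glyph
    is resized to the template size and scored with normalised cross-correlation.
    """
    best_label, best_score = None, -1.0
    for label, tmpl in templates.items():
        # cv2.resize expects (width, height); tmpl.shape is (height, width).
        resized = cv2.resize(glyph, (tmpl.shape[1], tmpl.shape[0]))
        score = cv2.matchTemplate(resized, tmpl, cv2.TM_CCOEFF_NORMED).max()
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score
```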


Author(s):  
Sameer M. Patel, Sarvesh S. Pai, Mittal B. Jain, Vaibhav P. Vasani

Optical Character Recognition (OCR) is the mechanical or electronic conversion of printed or handwritten text into machine-understandable text. The challenge of optical character recognition under varying conditions remains as relevant today as it was in past years. Even in the present era of automation and innovation, keyboarding remains the most common way of feeding data into computers, and it is probably the most time-consuming and labor-intensive operation in industry. Automating the recognition of documents, credit cards, electronic invoices, and car license plates could save considerable time in analyzing and processing data. With increased research and development in machine learning, the quality of text recognition continues to improve. Our paper provides a brief explanation of the different stages involved in the process of optical character recognition and, through the proposed application, aims to automate the extraction of important text from electronic invoices. The main goal of the project is to develop a real-time OCR web application with a microservice architecture that helps extract the necessary information from an invoice.
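A minimal sketch of one such OCR microservice endpoint is shown below, using Flask and pytesseract; the route name, field patterns, and extracted fields are illustrative assumptions rather than the authors' implementation.

```python
# Sketch of a single OCR microservice endpoint for invoice text extraction.
import re
from flask import Flask, request, jsonify
from PIL import Image
import pytesseract

app = Flask(__name__)

@app.route("/extract", methods=["POST"])
def extract():
    # The invoice image arrives as a multipart file upload named "invoice".
    image = Image.open(request.files["invoice"].stream)
    text = pytesseract.image_to_string(image)
    # Pull a couple of typical fields out of the raw OCR text (patterns are illustrative).
    invoice_no = re.search(r"Invoice\s*(?:No\.?|#)\s*[:\-]?\s*(\S+)", text, re.I)
    total = re.search(r"Total\s*[:\-]?\s*([\d.,]+)", text, re.I)
    return jsonify({
        "invoice_number": invoice_no.group(1) if invoice_no else None,
        "total": total.group(1) if total else None,
        "raw_text": text,
    })

if __name__ == "__main__":
    app.run(port=5000)
```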


2019, Vol 34 (4), pp. 825-843
Author(s):  
Mark J Hill, Simon Hengchen

This article aims to quantify the impact optical character recognition (OCR) has on the quantitative analysis of historical documents. Using Eighteenth Century Collections Online as a case study, we first explore and explain the differences between the OCR corpus and its keyed-in counterpart, created by the Text Creation Partnership. We then conduct a series of specific analyses common to the digital humanities: topic modelling, authorship attribution, collocation analysis, and vector space modelling. The article concludes by offering some preliminary thoughts on how these conclusions can be applied to other datasets, by reflecting on the potential for predicting the quality of OCR where no ground truth exists.
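Before running downstream analyses such as topic modelling or authorship attribution, the gap between an OCR corpus and its keyed-in counterpart can be quantified with a simple token-level comparison like the sketch below. The function and the similarity measure are illustrative assumptions, not the measures used in the article.

```python
from collections import Counter
from difflib import SequenceMatcher

def ocr_divergence(ocr_text, keyed_text):
    """Rough indicators of how far an OCR transcript drifts from a keyed-in one.

    Returns a token-level similarity ratio and the tokens appearing only in the
    OCR version, a quick proxy for recognition noise.
    """
    ocr_tokens = ocr_text.lower().split()
    keyed_tokens = keyed_text.lower().split()
    ratio = SequenceMatcher(None, ocr_tokens, keyed_tokens).ratio()
    noise = Counter(ocr_tokens) - Counter(keyed_tokens)
    return ratio, noise.most_common(20)
```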


2022, Vol 20 (8), pp. 3080
Author(s):  
A. A. Komkov, V. P. Mazaev, S. V. Ryazanova, D. N. Samochatov, E. V. Koshkina, ...

RuPatient health information system (HIS) is a computer program consisting of a doctor-patient web user interface that includes algorithms for recognizing medical record text and entering it into the corresponding fields of the system. Aim. To evaluate the effectiveness of RuPatient HIS in actual clinical practice. Material and methods. The study involved 10 cardiologists and intensivists of the department of cardiology and the cardiovascular intensive care unit of the L. A. Vorokhobov City Clinical Hospital 67. We analyzed images (scanned copies, photos) of discharge reports from patients admitted to the relevant departments in 2021. The following fields of medical documentation were recognized: Name, Complaints, Anamnesis of life and illness, Examination, Recommendations. The correctness and accuracy of recognition of the entered information were analyzed. We compared the recognition quality of RuPatient HIS with that of a popular optical character recognition application (FineReader for Mac). Results. The study included 77 pages of discharge reports from 50 patients (men, 52%) from various hospitals in Russia. The mean age of patients was 57.7±7.9 years. The number of reports with correctly recognized fields in various categories using the program algorithms was distributed as follows: Name — 14 (28%), Diagnosis — 13 (26%), Complaints — 40 (80%), Anamnesis — 14 (28%), Examination — 24 (48%), Recommendations — 46 (92%). Data that did not fall into any category were also recognized and entered in the comments field. The number of recognized words was 549±174.9 vs 522.4±215.6 (p=0.5), critical errors in words — 2.1±1.6 vs 4.4±2.8 (p<0.001), non-critical errors — 10.3±4.3 vs 5.6±3.3 (p<0.001) for RuPatient HIS and the desktop optical character recognition application, respectively. Conclusion. The developed RuPatient HIS, which includes a module for recognizing medical records and entering data into the corresponding fields, significantly increases document management efficiency, with high-quality optical character recognition based on neural network technologies and automation of the filling process.
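A rough illustration of the field-assignment step is sketched below: recognized discharge-report text is split on heading keywords and routed to named fields, with text outside any field kept as comments. The English headings and the keyword-matching rule are assumptions for illustration only; RuPatient's own module works on Russian-language documents and uses neural network recognition rather than this keyword split.

```python
import re

# Illustrative field headings; the actual RuPatient headings are in Russian.
FIELDS = ["Name", "Diagnosis", "Complaints", "Anamnesis", "Examination", "Recommendations"]

def split_into_fields(recognized_text):
    """Assign recognized discharge-report text to named fields by heading keywords."""
    pattern = r"(?P<field>" + "|".join(FIELDS) + r")\s*[:\-]"
    matches = list(re.finditer(pattern, recognized_text, flags=re.I))
    result = {f: "" for f in FIELDS}
    # Text before the first recognized heading goes to the comments field.
    result["Comments"] = (recognized_text[: matches[0].start()] if matches else recognized_text).strip()
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(recognized_text)
        result[m.group("field").capitalize()] = recognized_text[m.end(): end].strip()
    return result
```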


2018, Vol 7 (4.36), pp. 780
Author(s):  
Sajan A. Jain, N. Shobha Rani, N. Chandan

Enhancement of document images is an interesting research challenge in the process of character recognition. It is quite significant to have a document with a uniform illumination gradient in order to achieve higher recognition accuracy through a document processing system such as Optical Character Recognition (OCR). Complex document images are one of the image categories that are more difficult to process than other types of images. It is the quality of the document that decides the precision of a character recognition system; hence, transforming complex document images to a uniform illumination gradient is desirable. In the proposed research, ancient document images of the UMIACS Tobacco 800 database are considered for removal of marginal noise. The proposed technique carries out a block-wise interpretation of document contents to remove the marginal noise that is usually present at the borders of images. Further, Hu moment features are computed for the detection of marginal noise in every block. An empirical analysis is carried out to classify blocks as noisy or non-noisy, and the outcomes produced by the algorithm are satisfactory and feasible for subsequent analysis.
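The block-wise feature computation can be sketched as follows: the page is tiled into fixed-size blocks, each block is binarised, and the seven Hu moment invariants are computed per block. The block size, the Otsu binarisation choice, and the placeholder noisy/non-noisy threshold are assumptions; the paper derives its classification rule empirically.

```python
import cv2

def block_hu_features(gray_page, block_size=128):
    """Compute the seven Hu moment invariants for each block of a grayscale page image."""
    h, w = gray_page.shape
    features = []
    for y in range(0, h, block_size):
        for x in range(0, w, block_size):
            block = gray_page[y:y + block_size, x:x + block_size]
            # Otsu binarisation so that dark content (text, marginal noise) becomes foreground.
            _, binary = cv2.threshold(block, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
            hu = cv2.HuMoments(cv2.moments(binary)).flatten()
            # The threshold on the first invariant is a placeholder "noisy" flag, not the paper's rule.
            features.append(((y, x), hu, hu[0] > 0.005))
    return features
```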


2015, pp. 45
Author(s):  
Miikka Silfverberg, Jack Rueter

Optical Character Recognition (OCR) can substantially improve the usability of digitized documents. Language modeling using word lists is known to improve OCR quality for English. For morphologically rich languages, however, even large word lists do not reach high coverage on unseen text. Morphological analyzers offer a more sophisticated approach, which is useful in many language processing applications. This paper investigates language modeling in the open-source OCR engine Tesseract using morphological analyzers. We present experiments on two Uralic languages, Finnish and Erzya. According to our experiments, word lists may still be superior to morphological analyzers in OCR even for languages with rich morphology. Our error analysis indicates that morphological analyzers can cause a large number of real-word OCR errors.
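For context, the sketch below shows the simplest way to bias Tesseract with an external word list from Python. Note that the --user-words mechanism applies to Tesseract's legacy engine (--oem 0), and the paper's experiments integrate word lists and morphological analyzers into the language model more directly, so this is an illustrative approximation only; the file names and language code are assumptions.

```python
from PIL import Image
import pytesseract

def ocr_with_wordlist(image_path, lang="fin", user_words="wordlist.txt"):
    """Run Tesseract with an external word list biasing recognition (legacy engine)."""
    config = f"--oem 0 --user-words {user_words}"
    return pytesseract.image_to_string(Image.open(image_path), lang=lang, config=config)

print(ocr_with_wordlist("page.png"))
```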


2020, Vol 72 (4), pp. 545-559
Author(s):  
Hrvoje Stančić, Željko Trbušić

Purpose: The authors investigate optical character recognition (OCR) technology and discuss its implementation in the context of digitisation of archival materials.
Design/methodology/approach: Typewritten transcripts of the Croatian Writers' Society from the mid-1960s are used as the test data. The optimal digitisation setup is investigated in order to obtain the best OCR results, using a sample of 123 pages digitised at different resolution settings and binarisation levels.
Findings: A series of tests showed that different settings produce significantly different results. The best OCR accuracy achieved on the test sample of typewritten documents was 95.02%. The results show that resolution is significantly more important than the binarisation pre-processing procedure for achieving better OCR results.
Originality/value: Based on the research results, the authors give recommendations for achieving an optimal digitisation setup with the aim of increasing the quality of OCR results. Finally, the authors put the research results in the context of digitisation of cultural heritage in general and discuss further investigation possibilities.
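A minimal sketch of the kind of accuracy measurement behind such findings appears below: the same page digitised at several settings is run through an OCR engine and compared against a ground-truth transcript. The file names, the Croatian language pack, and the similarity-ratio metric are assumptions, not the study's exact protocol.

```python
from difflib import SequenceMatcher
from PIL import Image
import pytesseract

# Illustrative file naming; the study's sample, engine, and settings differ.
SETTINGS = [("300dpi", "page_300.png"), ("400dpi", "page_400.png"), ("600dpi", "page_600.png")]

def character_accuracy(ocr_text, ground_truth):
    """Character-level accuracy approximated as the similarity ratio to the transcript."""
    return SequenceMatcher(None, ocr_text, ground_truth).ratio()

ground_truth = open("page_groundtruth.txt", encoding="utf-8").read()
for label, path in SETTINGS:
    text = pytesseract.image_to_string(Image.open(path), lang="hrv")
    print(f"{label}: {100 * character_accuracy(text, ground_truth):.2f}%")
```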

