Data Extraction from Images through OCR
Paper-based record keeping is tedious and inefficient: it consumes a lot of time, and the documents concerned are difficult to maintain and keep track of. This project addresses these problems with Optical Character Recognition (OCR), a technology that extracts text and characters from an image, so that records and data can be maintained digitally and securely. The project specifically aims to increase data accessibility and usability and to improve the customer experience by decreasing the time spent processing, saving, and maintaining user data. A further objective is to minimize the human error that is prevalent in manual handling of data records; the software used in the solution applies certain techniques to reduce these errors. We use the Tesseract OCR Engine, which achieves high accuracy on clean images, and have implemented a web version of OCR that runs on Tesseract.js alongside other JavaScript frameworks. The resulting application successfully extracts text and characters from a provided image, and for high-resolution images the observed accuracy is above 90%. Because it is web based, the application is useful for small businesses: they do not have to install any extra software, and all that is needed is to upload a file through an online interface that can be accessed remotely. It will also help students save notes and documents online, making their important documents easily accessible on the web. The whole process is efficient in both time and memory.
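The abstract reports accuracy above 90% but does not say how accuracy is measured. As an illustration only, the sketch below assumes one common choice, character-level accuracy, defined as one minus the edit distance between a reference transcript and the OCR output, divided by the reference length. The function names `editDistance` and `characterAccuracy` are hypothetical and are not taken from the project.

```javascript
// Character-level accuracy of OCR output against a reference transcript.
// Assumption (not stated in the source): accuracy is defined as
//   1 - (Levenshtein distance / reference length), clamped to [0, 1].

// Levenshtein distance between two strings, compared by UTF-16 code unit.
function editDistance(a, b) {
  // prev[j] is the distance between the first i-1 characters of `a`
  // and the first j characters of `b`.
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i]; // deleting all i characters of `a` so far
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // match or substitution
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Fraction of the reference text that the OCR output reproduced correctly.
function characterAccuracy(reference, recognized) {
  if (reference.length === 0) {
    // With nothing to recognize, only an empty output counts as correct.
    return recognized.length === 0 ? 1 : 0;
  }
  const errors = editDistance(reference, recognized);
  return Math.max(0, 1 - errors / reference.length);
}
```

In a browser setup of the kind described here, the `recognized` string would typically be the `data.text` field of the result returned by the Tesseract.js `recognize` call, and `reference` would be a hand-typed transcript of the same image. Under this assumed metric, a score above 0.9 would correspond to the reported figure.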