A multi-pronged accurate approach to optical character recognition, using nearest neighborhood and neural-network-based principles

Sadhana ◽ 2021 ◽ Vol 46 (4)
Author(s): G Kishor Kumar ◽ R Raja Kumar ◽ Ram Chakka ◽ P Viswanath
2021 ◽ Vol 15
Author(s): Pooja Jain ◽ Kavita Taneja ◽ Harmunish Taneja

Background: Instant access to desired information is crucial for building an intelligent environment that creates value for people and for steering towards Society 5.0. Online newspapers are one example: they provide instant access to information anywhere and anytime on mobiles, tablets, laptops, and desktops. However, online newspapers do not provide easy options for searching for a specific advertisement, and there are no specialized search portals for keyword-based advertisement searches across multiple online newspapers. As a result, finding a specific advertisement requires a sequential manual search across a range of online newspapers.

Objective: This research paper proposes a keyword-based advertisement search framework that provides instant access to relevant advertisements from online English newspapers in a category of the reader’s choice.

Method: First, an image extraction algorithm is proposed to identify and extract images from online newspapers without using any rules on advertisement placement and size. Next, a proposed deep learning Convolutional Neural Network (CNN) model named ‘Adv_Recognizer’ separates the advertisement images from the non-advertisement images. A second CNN model, ‘Adv_Classifier’, classifies the advertisement images into four pre-defined categories. Finally, Optical Character Recognition (OCR) is used to perform keyword-based advertisement searches across the various advertisement types and multiple newspapers.

Results: The proposed image extraction algorithm can extract all types of well-bounded images from different online newspapers. This algorithm is used to create an ‘English newspaper image dataset’ of 11,000 images, including advertisements and non-advertisements. The proposed ‘Adv_Recognizer’ model separates advertisement and non-advertisement images with an accuracy of around 97.8%. In addition, the proposed ‘Adv_Classifier’ model classifies the advertisements into four pre-defined categories with an accuracy of approximately 73.5%.

Conclusion: The proposed framework will help newspaper readers perform exhaustive advertisement searches across various online English newspapers in a category of their interest. It will also help in carrying out advertisement analysis and studies.
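
To make the last stage of the described framework concrete, the following is a minimal Python sketch of a keyword search over advertisements, assuming the earlier stages (image extraction, advertisement recognition, category classification, and OCR) have already run. The `Advertisement` record, the `search_ads` function, the newspaper names, and the category labels are all hypothetical and chosen for illustration; the abstract does not name the four categories or describe the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class Advertisement:
    """One advertisement after the (assumed) earlier pipeline stages."""
    newspaper: str  # hypothetical source label
    category: str   # hypothetical label, as a classifier might assign
    ocr_text: str   # text assumed to have been produced by an OCR engine


def search_ads(ads, keyword, category=None):
    """Return ads whose OCR text contains the keyword, ignoring case.

    If a category is given, only ads in that category are considered.
    """
    needle = keyword.casefold()
    return [
        ad for ad in ads
        if (category is None or ad.category == category)
        and needle in ad.ocr_text.casefold()
    ]


# Illustrative data only; not taken from the paper's dataset.
ads = [
    Advertisement("Paper A", "education", "Admissions open for MBA 2021"),
    Advertisement("Paper B", "property", "2 BHK flat for sale"),
    Advertisement("Paper C", "education", "Coaching classes: MBA entrance"),
]

matches = search_ads(ads, "mba", category="education")
print([ad.newspaper for ad in matches])
```

Plain substring matching is used here for simplicity. Because OCR output often contains recognition errors, a practical system would probably need tolerant matching, which this sketch does not attempt.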

