Unconstrained Bilingual Scene Text Reading Using Octave as a Feature Extractor

2020, Vol 10 (13), pp. 4474
Author(s): Direselign Addis Tadesse, Chuan-Ming Liu, Van-Dai Ta

Reading text, and unified text detection and recognition, in natural images are among the most challenging applications in computer vision and document analysis. Previously proposed end-to-end scene text reading methods do not consider the frequency content of input images during feature extraction, which slows down the system, requires more memory, and degrades recognition accuracy. In this paper, we propose an octave convolution (OctConv) feature extractor and a time-restricted attention encoder-decoder module for end-to-end scene text reading. OctConv extracts features by factorizing the input image according to frequency. It is a direct replacement for standard convolutions, orthogonal and complementary to other redundancy-reduction methods, and boosts text reading speed while lowering memory requirements. In the text reading process, features are first extracted from the input image using a Feature Pyramid Network (FPN) on an OctConv Residual Network of depth 50 (ResNet50). A Region Proposal Network (RPN) then predicts the locations of text areas from the extracted features. Finally, the time-restricted attention encoder-decoder module is applied after Region of Interest (RoI) pooling. A bilingual real and synthetic scene text dataset is prepared for training and testing the proposed model. Additionally, well-known datasets including ICDAR2013, ICDAR2015, and Total-Text are used for fine-tuning and for comparing its performance with previously proposed state-of-the-art methods. The proposed model shows promising results on both regular and irregular (curved) text detection and reading tasks.
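
The core idea, a convolution whose input and output feature maps are each split into a full-resolution high-frequency part and a half-resolution low-frequency part, can be illustrated with a minimal PyTorch sketch. This follows the general OctConv formulation rather than this paper's exact configuration; the split ratio `alpha` is an illustrative assumption.

```python
# A minimal sketch of octave convolution (OctConv): channels are split into a
# full-resolution high-frequency branch and a half-resolution low-frequency
# branch, with four convolution paths between them. `alpha` is illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OctConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, alpha=0.5):
        super().__init__()
        lo_in, lo_out = int(alpha * in_ch), int(alpha * out_ch)
        hi_in, hi_out = in_ch - lo_in, out_ch - lo_out
        pad = kernel_size // 2
        # Four information paths between the two frequency branches.
        self.hh = nn.Conv2d(hi_in, hi_out, kernel_size, padding=pad)  # high -> high
        self.hl = nn.Conv2d(hi_in, lo_out, kernel_size, padding=pad)  # high -> low
        self.ll = nn.Conv2d(lo_in, lo_out, kernel_size, padding=pad)  # low  -> low
        self.lh = nn.Conv2d(lo_in, hi_out, kernel_size, padding=pad)  # low  -> high

    def forward(self, x_hi, x_lo):
        # High-frequency output: same-resolution conv plus upsampled low branch.
        y_hi = self.hh(x_hi) + F.interpolate(self.lh(x_lo), scale_factor=2,
                                             mode="nearest")
        # Low-frequency output: same-resolution conv plus downsampled high branch.
        y_lo = self.ll(x_lo) + self.hl(F.avg_pool2d(x_hi, 2))
        return y_hi, y_lo

# Usage: a 64-channel feature map split 32/32 across the two branches.
hi = torch.randn(1, 32, 64, 64)   # full resolution
lo = torch.randn(1, 32, 32, 32)   # half resolution
y_hi, y_lo = OctConv(64, 64)(hi, lo)
```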

2020, Vol 34 (07), pp. 11899-11907
Author(s): Liang Qiao, Sanli Tang, Zhanzhan Cheng, Yunlu Xu, Yi Niu, ...

Many approaches have recently been proposed to detect irregular scene text and have achieved promising results. However, their localization results may not serve the subsequent text recognition stage well, mainly for two reasons: 1) recognizing arbitrarily shaped text is still a challenging task, and 2) prevalent non-trainable pipeline strategies between text detection and text recognition lead to suboptimal performance. To handle this incompatibility problem, in this paper we propose an end-to-end trainable text spotting approach named Text Perceptron. Concretely, Text Perceptron first employs an efficient segmentation-based text detector that learns latent text reading order and boundary information. A novel Shape Transform Module (abbr. STM) then transforms the detected feature regions into regular morphologies without extra parameters. It unites text detection and the subsequent recognition stage into a single framework and helps the whole network achieve global optimization. Experiments show that our method achieves competitive performance on two standard text benchmarks, ICDAR 2013 and ICDAR 2015, and clearly outperforms existing methods on the irregular text benchmarks SCUT-CTW1500 and Total-Text.
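
The rectification step can be pictured with a minimal sketch in the spirit of the Shape Transform Module: given fiducial points on the top and bottom boundaries of a detected text region, a sampling grid is interpolated between them and applied with `grid_sample`, so the warp itself adds no learnable parameters. The simple linear interpolation and point counts here are illustrative assumptions, not the paper's exact formulation.

```python
# A minimal sketch of rectifying an irregular text region into a regular
# axis-aligned strip from K boundary fiducial points. The interpolation
# scheme and sizes are illustrative assumptions.
import torch
import torch.nn.functional as F

def rectify(feat, top_pts, bot_pts, out_h=8, out_w=32):
    """feat: (1, C, H, W); top_pts/bot_pts: (K, 2) xy coords in [-1, 1]."""
    K = top_pts.size(0)
    # Resample the K boundary points to out_w columns along the text line.
    t = torch.linspace(0, K - 1, out_w)
    i0 = t.floor().long().clamp(max=K - 2)
    frac = (t - i0.float()).unsqueeze(1)
    top = top_pts[i0] * (1 - frac) + top_pts[i0 + 1] * frac     # (out_w, 2)
    bot = bot_pts[i0] * (1 - frac) + bot_pts[i0 + 1] * frac     # (out_w, 2)
    # Blend top->bottom to fill out_h rows, giving an (out_h, out_w, 2) grid.
    w = torch.linspace(0, 1, out_h).view(out_h, 1, 1)
    grid = top.unsqueeze(0) * (1 - w) + bot.unsqueeze(0) * w
    return F.grid_sample(feat, grid.unsqueeze(0), align_corners=True)

# Usage: straighten a curved 4-point boundary on a toy feature map.
feat = torch.randn(1, 8, 64, 128)
top = torch.tensor([[-0.9, -0.5], [-0.3, -0.8], [0.3, -0.8], [0.9, -0.5]])
bot = torch.tensor([[-0.9,  0.1], [-0.3, -0.2], [0.3, -0.2], [0.9,  0.1]])
patch = rectify(feat, top, bot)   # (1, 8, 8, 32) rectified region
```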


2021, Vol 2021, pp. 1-12
Author(s): Sajid Hussain, Hammad Afzal, Ramsha Saeed, Naima Iltaf, Mir Yasir Umair

Adverse drug reactions (ADRs) are the undesirable effects associated with the use of a drug, arising from its pharmacological action. Over the last few years, social media has become a popular platform where people discuss their health problems, and it has therefore become a popular source of natural-language reports of ADRs. This paper presents an end-to-end system for ADR detection from text, built by fine-tuning BERT with the highly modular Framework for Adapting Representation Models (FARM). BERT surpassed the previously dominant neural network architectures with remarkable performance gains. However, training BERT is computationally expensive, which limits its use in production environments and makes it difficult to determine the most important hyperparameters for a downstream task. Furthermore, developing an end-to-end ADR extraction system comprising two downstream tasks, i.e., text classification to filter text containing ADRs and sequence labeling to extract ADR mentions from the classified text, is also challenging. The framework used in this work, FARM-BERT, supports multitask learning by combining multiple prediction heads, which makes training end-to-end systems easier and computationally faster. In the proposed model, one prediction head is used for text classification and the other for ADR sequence labeling. Experiments are performed on the Twitter, PubMed, TwiMed-Twitter, and TwiMed-PubMed datasets. The proposed model is compared with baseline models and state-of-the-art techniques and yields better results for the given task, with F-scores of 89.6%, 97.6%, 84.9%, and 95.9% on the Twitter, PubMed, TwiMed-Twitter, and TwiMed-PubMed datasets, respectively. Moreover, the training and testing times of the proposed model are compared with BERT's, showing that the proposed model is computationally faster than BERT.
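
The multitask setup, one shared encoder feeding a text-classification head and a token-labeling head whose losses are summed, can be sketched in plain PyTorch with Hugging Face `transformers` (shown here instead of the FARM API itself; head sizes and the BIO tag set are illustrative assumptions).

```python
# A minimal sketch of a shared BERT encoder with two prediction heads:
# sentence-level ADR classification and token-level ADR mention labeling.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MultiHeadADR(nn.Module):
    def __init__(self, name="bert-base-uncased", n_tags=3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)
        hidden = self.encoder.config.hidden_size
        self.cls_head = nn.Linear(hidden, 2)       # contains ADR: yes / no
        self.tag_head = nn.Linear(hidden, n_tags)  # BIO tags for ADR mentions

    def forward(self, **enc):
        out = self.encoder(**enc)
        cls_logits = self.cls_head(out.last_hidden_state[:, 0])  # [CLS] token
        tag_logits = self.tag_head(out.last_hidden_state)        # every token
        return cls_logits, tag_logits

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = MultiHeadADR()
enc = tok(["this drug gave me severe headaches"], return_tensors="pt")
cls_logits, tag_logits = model(**enc)
# Training would sum the two cross-entropy losses, e.g.
# loss = ce(cls_logits, doc_labels) + ce(tag_logits.flatten(0, 1), tags.flatten())
```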


2019, pp. 108-117
Author(s): В’ячеслав Васильович Москаленко, Микола Олександрович Зарецький, Альона Сергіївна Москаленко

A classification model consisting of a motion detector, an object tracker, a convolutional sparse-coded feature extractor, and a stacked information-extreme classifier is developed. The motion detector is built on the difference of consecutive aligned frames, where alignment is performed via keypoint matching, homography estimation, and projective transformation. The motion detector simplifies the object classification task by reducing input data variation, and it saves resources because the motion-region search model requires no training. The proposed model has low computational complexity and can be used as a dataset-labeling tool for a deep moveable-object detector. Furthermore, a training method for the moving-object detector is developed. The method consists of unsupervised pretraining of the feature extractor based on sparse-coding neural gas, followed by supervised pretraining and fine-tuning of the stacked information-extreme classifier. The soft-competitive learning scheme in sparse-coding neural gas facilitates robust convergence to a near-optimal distribution of the neurons over the data, and it reduces the required volume of labeled observations and computational resources. As a criterion of the effectiveness of the classifier's training, a normalized modification of S. Kullback's information measure is considered. Labeling newly emerging data, via self-labeling for high-prediction-score cases and manual labeling for low-prediction-score cases, followed by tracking of the labeled objects, is also proposed. Class balancing is performed by undersampling within the dichotomous "one-against-all" strategy. The set of classes includes bicycle, bus, car, motorcycle, pickup truck, articulated truck, and background. Simulation results on the MIO-TCD dataset confirm the suitability of the proposed model and training method for practical use.
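
The training-free motion detector described above maps directly onto a standard OpenCV pipeline: match keypoints between consecutive frames, estimate a homography with RANSAC, projectively warp the previous frame onto the current one, and threshold the difference. The sketch below follows that recipe; the ORB detector and all parameter values are illustrative assumptions.

```python
# A minimal sketch of a frame-alignment motion detector: keypoint matching,
# homography estimation, projective warping, and frame differencing.
import cv2
import numpy as np

def motion_mask(prev_gray, curr_gray):
    orb = cv2.ORB_create(1000)
    kp1, des1 = orb.detectAndCompute(prev_gray, None)
    kp2, des2 = orb.detectAndCompute(curr_gray, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)[:200]
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    # Projectively align the previous frame to the current viewpoint.
    h, w = curr_gray.shape
    aligned = cv2.warpPerspective(prev_gray, H, (w, h))
    # Motion regions are the thresholded, denoised frame difference.
    diff = cv2.absdiff(curr_gray, aligned)
    _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
```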


2020, Vol 34 (07), pp. 12160-12167
Author(s): Hao Wang, Pu Lu, Hui Zhang, Mingkun Yang, Xiang Bai, ...

Recently, end-to-end text spotting, which aims to detect and recognize text in cluttered images simultaneously, has received growing interest in computer vision. Unlike existing approaches that formulate text detection as bounding-box extraction or instance segmentation, we localize a set of points on the boundary of each text instance. With this boundary-point representation, we establish a simple yet effective scheme for end-to-end text spotting that can read text of arbitrary shapes. Experiments on three challenging datasets, ICDAR2015, Total-Text, and COCO-Text, demonstrate that the proposed method consistently surpasses the state-of-the-art in both scene text detection and end-to-end text recognition tasks.
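
A minimal sketch of the boundary-point representation: resampling a text instance's polygon contour into a fixed number of evenly spaced boundary points with NumPy. The choice of 16 points is an illustrative assumption.

```python
# Resample a polygon contour to n evenly spaced boundary points by
# interpolating along the accumulated arc length.
import numpy as np

def sample_boundary(polygon, n_points=16):
    """polygon: (K, 2) array of xy vertices; returns (n_points, 2)."""
    pts = np.vstack([polygon, polygon[:1]])             # close the contour
    seg = np.linalg.norm(np.diff(pts, axis=0), axis=1)  # edge lengths
    cum = np.concatenate([[0.0], np.cumsum(seg)])       # arc length at vertices
    targets = np.linspace(0, cum[-1], n_points, endpoint=False)
    # Interpolate x and y separately along the accumulated arc length.
    x = np.interp(targets, cum, pts[:, 0])
    y = np.interp(targets, cum, pts[:, 1])
    return np.stack([x, y], axis=1)

# Usage: a quadrilateral resampled to 16 evenly spaced boundary points.
quad = np.array([[0, 0], [100, 10], [100, 40], [0, 30]], dtype=float)
print(sample_boundary(quad).shape)   # (16, 2)
```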


2020, Vol 10 (6), pp. 2096
Author(s): Minjun Jeon, Young-Seob Jeong

Scene text detection is the task of detecting word boxes in given images. The accuracy of text detection has been greatly improved by deep learning models, especially convolutional neural networks. Previous studies commonly aimed at developing more accurate models, but their models became computationally heavy and less efficient. In this paper, we propose a new efficient model for text detection. The proposed model, the Compact and Accurate Scene Text detector (CAST), consists of a MobileNetV2 backbone and a balanced decoder. Unlike previous studies that used standard convolutional layers as a decoder, we carefully design a balanced decoder. Through experiments with three well-known datasets, we demonstrate that the balanced decoder and the proposed CAST are efficient and effective. CAST was about 1.1x worse in terms of F1 score, but 30∼115x better in terms of floating-point operations per second (FLOPS).
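
A rough sketch of this backbone-plus-decoder layout: torchvision's MobileNetV2 features feed a small upsampling decoder that fuses two pyramid levels into a per-pixel text score map. The channel widths and single skip connection are illustrative assumptions, not the paper's exact balanced decoder.

```python
# A minimal MobileNetV2-backbone text detector: two feature levels are
# projected, fused by upsampling, and decoded into a text score map.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import mobilenet_v2

class TinyTextDetector(nn.Module):
    def __init__(self):
        super().__init__()
        feats = mobilenet_v2(weights=None).features
        self.stem, self.rest = feats[:7], feats[7:]   # 1/8 and 1/32 stages
        self.lateral = nn.Conv2d(32, 64, 1)           # project 1/8 features
        self.top = nn.Conv2d(1280, 64, 1)             # project 1/32 features
        self.head = nn.Conv2d(64, 1, 3, padding=1)    # text score map

    def forward(self, x):
        c3 = self.stem(x)                 # (N, 32, H/8, W/8)
        c5 = self.rest(c3)                # (N, 1280, H/32, W/32)
        p = self.lateral(c3) + F.interpolate(self.top(c5), size=c3.shape[2:],
                                             mode="bilinear", align_corners=False)
        return torch.sigmoid(self.head(p))  # per-pixel text probability

score = TinyTextDetector()(torch.randn(1, 3, 256, 256))  # (1, 1, 32, 32)
```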


Cybersecurity, 2021, Vol 4 (1)
Author(s): Jianguo Jiang, Baole Wei, Min Yu, Gang Li, Boquan Li, ...

Automatically reading text in images has become an attractive research topic in computer vision. In particular, end-to-end spotting of scene text has attracted significant research attention, and relatively high accuracy has been achieved on several datasets. However, most existing works overlook the semantic connections between scene text instances and struggle in situations such as occlusion, blurring, and unseen characters, where semantic information in the text regions is lost. Text instances within a scene image are generally related, and, from the perspective of cognitive psychology, humans often combine nearby easy-to-recognize text to infer unidentifiable text. In this paper, we propose a novel graph-based method for enhancing intermediate semantic features, called Text Relation Networks. Specifically, we model the co-occurrence relationships of scene texts as a graph. The nodes in the graph represent the text instances in a scene image, and the corresponding semantic features serve as node representations. The relative positions between text instances determine the weights of the edges in the graph. A convolution operation is then performed on the graph to aggregate semantic information and enhance the intermediate features corresponding to text instances. We evaluate the proposed method through comprehensive experiments on several mainstream benchmarks and obtain highly competitive results; for example, on one of the benchmarks, our method surpasses the previous top works by 2.1% on the word spotting task.
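
One graph-convolution step of this kind can be sketched as follows: nodes are text instances, edge weights decay with the distance between instance centers, and each node's semantic feature is mixed with its neighbors'. The Gaussian distance weighting and row normalization are illustrative assumptions, not the paper's exact edge definition.

```python
# A minimal sketch of position-weighted graph aggregation over text instances.
import torch
import torch.nn as nn

class TextRelationLayer(nn.Module):
    def __init__(self, dim, sigma=50.0):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.sigma = sigma  # illustrative distance scale, in pixels

    def forward(self, feats, centers):
        """feats: (N, dim) instance features; centers: (N, 2) box centers."""
        dist = torch.cdist(centers, centers)               # pairwise distances
        adj = torch.exp(-dist ** 2 / (2 * self.sigma ** 2))  # closer = stronger
        adj = adj / adj.sum(dim=1, keepdim=True)           # row-normalize
        return torch.relu(self.proj(adj @ feats))          # aggregate + project

# Usage: enhance the features of 5 detected text instances.
feats = torch.randn(5, 256)
centers = torch.rand(5, 2) * 200
enhanced = TextRelationLayer(256)(feats, centers)   # (5, 256)
```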


2021, pp. 95-108
Author(s): Jiedong Hao, Yafei Wen, Jie Deng, Jun Gan, Shuai Ren, ...

2020, Vol 10 (3), pp. 1117
Author(s): Birhanu Belay, Tewodros Habtegebrial, Million Meshesha, Marcus Liwicki, Gebeyehu Belay, ...

In this paper, we introduce an end-to-end Amharic text-line image recognition approach based on recurrent neural networks. Amharic is written in an indigenous Ethiopic script with a unique syllabic writing system adopted from the ancient Geez script. The script uses 34 consonant characters, each with seven vowel variants (together called basic characters), plus further labialized characters derived by adding diacritical marks to and/or removing parts of the basic characters. These diacritics on the basic characters are relatively small, visually similar, and hard to distinguish in the derived characters. Motivated by the recent success of end-to-end learning in pattern recognition, we propose a model that integrates a feature extractor, a sequence learner, and a transcriber in a unified module trained in an end-to-end fashion. Experimental results on ADOCR, a printed and synthetic benchmark database for Amharic Optical Character Recognition (OCR), demonstrate that the proposed model outperforms state-of-the-art methods by 6.98% and 1.05% on the printed and synthetic test sets, respectively.
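
The feature extractor / sequence learner / transcriber pipeline is the classic CRNN-with-CTC layout, sketched below: a small CNN collapses the line-image height into column features, a bidirectional LSTM models the character sequence, and CTC provides the transcription loss. Layer sizes and the alphabet size are illustrative assumptions, not the ADOCR configuration.

```python
# A minimal CRNN + CTC sketch for text-line recognition.
import torch
import torch.nn as nn

class TextLineRecognizer(nn.Module):
    def __init__(self, n_classes=300, hidden=128):  # ~300 symbols incl. CTC blank
        super().__init__()
        self.cnn = nn.Sequential(                   # feature extractor
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.rnn = nn.LSTM(64 * 8, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, n_classes)  # transcriber logits

    def forward(self, x):                           # x: (N, 1, 32, W)
        f = self.cnn(x)                             # (N, 64, 8, W/4)
        f = f.permute(0, 3, 1, 2).flatten(2)        # (N, W/4, 512) column features
        return self.fc(self.rnn(f)[0])              # (N, W/4, n_classes)

model = TextLineRecognizer()
logits = model(torch.randn(2, 1, 32, 128))          # (2, 32, 300)
# CTC training: log-probs must be (T, N, C) for nn.CTCLoss.
ctc = nn.CTCLoss(blank=0)
log_probs = logits.log_softmax(2).permute(1, 0, 2)
targets = torch.randint(1, 300, (2, 10))
loss = ctc(log_probs, targets,
           input_lengths=torch.full((2,), 32), target_lengths=torch.full((2,), 10))
```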


IEEE Access, 2017, Vol 5, pp. 3193-3204
Author(s): Xiaohang Ren, Yi Zhou, Zheng Huang, Jun Sun, Xiaokang Yang, ...
