TextScanner: Reading Characters in Order for Robust Scene Text Recognition

Zhaoyi Wan; Minghang He; Haoran Chen; Xiang Bai; Cong Yao

doi:10.1609/aaai.v34i07.6891

TextScanner: Reading Characters in Order for Robust Scene Text Recognition

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i07.6891 ◽

2020 ◽

Vol 34 (07) ◽

pp. 12120-12127 ◽

Cited By ~ 1

Author(s):

Zhaoyi Wan ◽

Minghang He ◽

Haoran Chen ◽

Xiang Bai ◽

Cong Yao

Keyword(s):

State Of The Art ◽

Semantic Segmentation ◽

Text Recognition ◽

Context Modeling ◽

Alternative Approach ◽

Scene Text ◽

Benchmark Datasets ◽

Character Position ◽

Scene Text Recognition ◽

Character Class

Driven by deep learning and a large volume of data, scene text recognition has evolved rapidly in recent years. Formerly, RNN-attention-based methods have dominated this field, but suffer from the problem of attention drift in certain situations. Lately, semantic segmentation based algorithms have proven effective at recognizing text of different forms (horizontal, oriented and curved). However, these methods may produce spurious characters or miss genuine characters, as they rely heavily on a thresholding procedure operated on segmentation maps. To tackle these challenges, we propose in this paper an alternative approach, called TextScanner, for scene text recognition. TextScanner bears three characteristics: (1) Basically, it belongs to the semantic segmentation family, as it generates pixel-wise, multi-channel segmentation maps for character class, position and order; (2) Meanwhile, akin to RNN-attention-based methods, it also adopts RNN for context modeling; (3) Moreover, it performs parallel prediction for character position and class, and ensures that characters are transcribed in the correct order. The experiments on standard benchmark datasets demonstrate that TextScanner outperforms the state-of-the-art methods. Moreover, TextScanner shows its superiority in recognizing more difficult text such as Chinese transcripts and aligning with target characters.

Multi-granularity Deep Local Representations for Irregular Scene Text Recognition

ACM/IMS Transactions on Data Science ◽

10.1145/3446971 ◽

2021 ◽

Vol 2 (2) ◽

pp. 1-18

Author(s):

Hongchao Gao ◽

Yujia Li ◽

Jiao Dai ◽

Xi Wang ◽

Jizhong Han ◽

...

Keyword(s):

State Of The Art ◽

Visual Representation ◽

Text Recognition ◽

Natural Scene ◽

Attention Network ◽

Training Time ◽

Scene Text ◽

Benchmark Datasets ◽

Local Representations ◽

Scene Text Recognition

Recognizing irregular text from natural scene images is challenging due to the unconstrained appearance of text, such as curvature, orientation, and distortion. Recent recognition networks regard this task as a text sequence labeling problem and most networks capture the sequence only from a single-granularity visual representation, which to some extent limits the performance of recognition. In this article, we propose a hierarchical attention network to capture multi-granularity deep local representations for recognizing irregular scene text. It consists of several hierarchical attention blocks, and each block contains a Local Visual Representation Module (LVRM) and a Decoder Module (DM). Based on the hierarchical attention network, we propose a scene text recognition network. The extensive experiments show that our proposed network achieves the state-of-the-art performance on several benchmark datasets including IIIT-5K, SVT, CUTE, SVT-Perspective, and ICDAR datasets under shorter training time.

Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33018610 ◽

2019 ◽

Vol 33 ◽

pp. 8610-8617 ◽

Cited By ~ 18

Author(s):

Hui Li ◽

Peng Wang ◽

Chunhua Shen ◽

Guyu Zhang

Keyword(s):

State Of The Art ◽

Text Recognition ◽

Natural Scene ◽

Fine Grained ◽

Word Level ◽

Scene Text ◽

Sophisticated Model ◽

Regular Text ◽

Algorithm Implementation ◽

Scene Text Recognition

Recognizing irregular text in natural scene images is challenging due to the large variance in text appearance, such as curvature, orientation and distortion. Most existing approaches rely heavily on sophisticated model designs and/or extra fine-grained annotations, which, to some extent, increase the difficulty in algorithm implementation and data collection. In this work, we propose an easy-to-implement strong baseline for irregular scene text recognition, using off-the-shelf neural network components and only word-level annotations. It is composed of a 31-layer ResNet, an LSTM-based encoder-decoder framework and a 2-dimensional attention module. Despite its simplicity, the proposed method is robust. It achieves state-of-the-art performance on irregular text recognition benchmarks and comparable results on regular text datasets. The code will be released.

GTC: Guided Training of CTC towards Efficient and Accurate Scene Text Recognition

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i07.6735 ◽

2020 ◽

Vol 34 (07) ◽

pp. 11005-11012 ◽

Cited By ~ 1

Author(s):

Wenyang Hu ◽

Xiaocong Cai ◽

Jun Hou ◽

Shuai Yi ◽

Zhiping Lin

Keyword(s):

State Of The Art ◽

Text Recognition ◽

Convolutional Network ◽

Attentional Guidance ◽

Feature Representations ◽

Lower Accuracy ◽

Scene Text ◽

Local Correlations ◽

Scene Text Recognition ◽

Connectionist Temporal Classification

Connectionist Temporal Classification (CTC) and attention mechanism are two main approaches used in recent scene text recognition works. Compared with attention-based methods, CTC decoder has a much shorter inference time, yet a lower accuracy. To design an efficient and effective model, we propose the guided training of CTC (GTC), where CTC model learns a better alignment and feature representations from a more powerful attentional guidance. With the benefit of guided training, CTC model achieves robust and accurate prediction for both regular and irregular scene text while maintaining a fast inference speed. Moreover, to further leverage the potential of CTC decoder, a graph convolutional network (GCN) is proposed to learn the local correlations of extracted features. Extensive experiments on standard benchmarks demonstrate that our end-to-end model achieves a new state-of-the-art for regular and irregular scene text recognition and needs 6 times shorter inference time than attention-based methods.
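
As background for the CTC decoding mentioned in this abstract, the following is a minimal sketch of standard greedy (best-path) CTC decoding, not the GTC method itself. It assumes the per-frame argmax labels have already been computed; the function name `ctc_greedy_decode`, the blank index 0, and the toy alphabet are illustrative assumptions.

```python
from itertools import groupby


def ctc_greedy_decode(frame_labels, blank=0):
    """Best-path CTC decoding: merge consecutive repeats, then drop blanks."""
    # Step 1: collapse runs of identical labels into a single label.
    collapsed = [label for label, _ in groupby(frame_labels)]
    # Step 2: remove the blank symbol that separates genuine repeats.
    return [label for label in collapsed if label != blank]


# Hypothetical alphabet: index 0 is the CTC blank, shown here as "-".
alphabet = "-abc"
# Hypothetical per-frame argmax indices from a recognizer.
frames = [1, 1, 0, 1, 2, 2, 0, 0, 3]
decoded = "".join(alphabet[i] for i in ctc_greedy_decode(frames))
print(decoded)
```

Because each frame is decoded independently and only a single linear pass is needed, this kind of decoding avoids the step-by-step dependence of an attention decoder, which is consistent with the shorter inference time the abstract reports for CTC.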

Scene Text Recognition from Two-Dimensional Perspective

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33018714 ◽

2019 ◽

Vol 33 ◽

pp. 8714-8721 ◽

Cited By ~ 26

Author(s):

Minghui Liao ◽

Jian Zhang ◽

Zhaoyi Wan ◽

Fengming Xie ◽

Jiajun Liang ◽

...

Keyword(s):

Dimensional Space ◽

Semantic Segmentation ◽

Word Formation ◽

Text Recognition ◽

Two Dimensional ◽

Convolutional Network ◽

One Dimensional ◽

Sequence Prediction ◽

Scene Text ◽

Scene Text Recognition

Inspired by speech recognition, recent state-of-the-art algorithms mostly consider scene text recognition as a sequence prediction problem. Though achieving excellent performance, these methods usually neglect an important fact that text in images is actually distributed in two-dimensional space. It is a nature quite different from that of speech, which is essentially a one-dimensional signal. In principle, directly compressing features of text into a one-dimensional form may lose useful information and introduce extra noise. In this paper, we approach scene text recognition from a two-dimensional perspective. A simple yet effective model, called Character Attention Fully Convolutional Network (CA-FCN), is devised for recognizing the text of arbitrary shapes. Scene text recognition is realized with a semantic segmentation network, where an attention mechanism for characters is adopted. Combined with a word formation module, CA-FCN can simultaneously recognize the script and predict the position of each character. Experiments demonstrate that the proposed algorithm outperforms previous methods on both regular and irregular text datasets. Moreover, it is proven to be more robust to imprecise localizations in the text detection phase, which are very common in practice.

Analysis of Text Identification Techniques Using Scene Text and Optical Character Recognition

International Journal of Computer Vision and Image Processing ◽

10.4018/ijcvip.2021100104 ◽

2021 ◽

Vol 11 (4) ◽

pp. 39-62

Author(s):

Monica Gupta ◽

Alka Choudhary ◽

Jyotsna Parmar

Keyword(s):

Character Recognition ◽

Optical Character Recognition ◽

Short Term Memory ◽

Handwriting Recognition ◽

Text Recognition ◽

Stroke Width ◽

Optical Character ◽

Scene Text ◽

Benchmark Datasets ◽

Scene Text Recognition

In today's era, data in digitalized form is needed for faster processing and performing of all tasks. The best way to digitalize documents is by extracting the text from them. Text extraction can be performed by various text identification tasks such as scene text recognition, optical character recognition, handwriting recognition, and more. This paper presents, reviews, and analyses recent research in the area of optical character recognition and scene text recognition based on various existing models such as convolutional neural network, long short-term memory, cognitive reading for image processing, maximally stable extreme regions, and stroke width transformation, which have achieved results of up to 90.34% F-score on benchmark datasets such as ICDAR 2013, ICDAR 2019, and IIIT5k. Researchers have done outstanding work in the text recognition field. Yet text detection performance on low-quality images still needs improvement, as text identification should not be limited by the input quality of the image.

Simultaneous Recognition of Horizontal and Vertical Text in Natural Images

10.20944/preprints201812.0114.v1 ◽

2018 ◽

Author(s):

Chankyu Choi ◽

Youngmin Yoon ◽

Junsu Lee ◽

Junseok Kim

Keyword(s):

State Of The Art ◽

Asian Countries ◽

Directional Information ◽

Proposed Model ◽

Art Scene ◽

Scene Text ◽

Benchmark Datasets ◽

Tv Commercials ◽

Different Characteristics ◽

Scene Text Recognition

Recent state-of-the-art scene text recognition methods have primarily focused on horizontal text in images. However, in several Asian countries, including China, large amounts of text in signs, books, and TV commercials are vertically directed. Because the horizontal and vertical texts exhibit different characteristics, developing an algorithm that can simultaneously recognize both types of text in real environments is necessary. To address this problem, we adopted the direction encoding mask (DEM) and selective attention network (SAN) methods based on supervised learning. DEM contains directional information to compensate in cases that lack text direction; therefore, our network is trained using this information to handle the vertical text. The SAN method is designed to work individually for both types of text. To train the network to recognize both types of text and to evaluate the effectiveness of the designed model, we prepared a new synthetic vertical text dataset and collected an actual vertical text dataset (VTD142) from the Web. Using these datasets, we proved that our proposed model can accurately recognize both vertical and horizontal text and can achieve state-of-the-art results in experiments using benchmark datasets, including Street View Text (SVT), IIIT-5k, and ICDAR. Although our model is relatively simple as compared to its predecessors, it maintains the accuracy and is trained in an end-to-end manner.

2D Positional Embedding-based Transformer for Scene Text Recognition

Journal of Computational Vision and Imaging Systems ◽

10.15353/jcvis.v6i1.3533 ◽

2021 ◽

Vol 6 (1) ◽

pp. 1-4

Author(s):

Zobeir Raisi ◽

Mohamed A. Naiel ◽

Paul Fieguth ◽

Steven Wardell ◽

John Zelek

Keyword(s):

Spatial Information ◽

State Of The Art ◽

Image Features ◽

Text Recognition ◽

Recognition Method ◽

One Dimensional ◽

Art Scene ◽

Scene Text ◽

In The Wild ◽

Scene Text Recognition

Recent state-of-the-art scene text recognition methods are primarily based on Recurrent Neural Networks (RNNs); however, these methods require one-dimensional (1D) features and are not designed for recognizing irregular-text instances due to the loss of spatial information present in the original two-dimensional (2D) images. In this paper, we leverage a Transformer-based architecture for recognizing both regular and irregular text-in-the-wild images. The proposed method takes advantage of using a 2D positional encoder with the Transformer architecture to better preserve the spatial information of 2D image features than previous methods. The experiments on popular benchmarks, including the challenging COCO-Text dataset, demonstrate that the proposed scene text recognition method outperformed the state-of-the-art in most cases, especially on irregular-text recognition.
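
To illustrate what a 2D positional encoding can look like, here is a minimal sketch of one common construction: half of the channels carry a sinusoidal code for the row index and the other half for the column index. The abstract does not specify the form of its 2D positional encoder, so the sinusoidal scheme, the function names, and the channel split are all assumptions made for illustration.

```python
import math


def sinusoid_1d(position, dim):
    """Sinusoidal code of a scalar position as `dim` values (dim must be even)."""
    code = []
    for i in range(0, dim, 2):
        # Frequencies decrease geometrically across channel pairs.
        freq = 1.0 / (10000 ** (i / dim))
        code.append(math.sin(position * freq))
        code.append(math.cos(position * freq))
    return code


def positional_encoding_2d(height, width, channels):
    """Return a height x width grid of `channels`-dimensional vectors.

    The first half of each vector encodes the row (y), the second half
    encodes the column (x). This split is an illustrative assumption.
    """
    if channels % 4 != 0:
        raise ValueError("channels must be divisible by 4")
    half = channels // 2
    return [
        [sinusoid_1d(y, half) + sinusoid_1d(x, half) for x in range(width)]
        for y in range(height)
    ]


pe = positional_encoding_2d(height=2, width=3, channels=8)
print(len(pe), len(pe[0]), len(pe[0][0]))
```

Unlike a 1D encoding over a flattened sequence, two cells in the same row share the row half of the vector and two cells in the same column share the column half, so vertical and horizontal neighbourhood remain distinguishable after the feature map is flattened for the Transformer.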

TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance

Electronics ◽

10.3390/electronics10222780 ◽

2021 ◽

Vol 10 (22) ◽

pp. 2780

Author(s):

Yue Tao ◽

Zhiwei Jia ◽

Runze Ma ◽

Shugong Xu

Keyword(s):

Text Recognition ◽

Context Modeling ◽

Research Attention ◽

Global Context ◽

Scene Text ◽

Text Features ◽

Three Stages ◽

The Relationship ◽

Scene Text Recognition ◽

Remarkable Progress

Scene text recognition (STR) is an important bridge between images and text, attracting abundant research attention. While convolutional neural networks (CNNs) have achieved remarkable progress in this task, most of the existing works need an extra module (context modeling module) to help the CNN capture global dependencies to solve the inductive bias and strengthen the relationship between text features. Recently, the transformer has been proposed as a promising network for global context modeling by self-attention mechanism, but one of its main shortcomings, when applied to recognition, is efficiency. We propose a 1-D split to address the challenges of complexity and replace the CNN with the transformer encoder to reduce the need for a context modeling module. Furthermore, recent methods use a frozen initial embedding to guide the decoder to decode the features to text, leading to a loss of accuracy. We propose to use a learnable initial embedding learned from the transformer encoder to make it adaptive to different input images. Above all, we introduce a novel architecture for text recognition, named TRansformer-based text recognizer with Initial embedding Guidance (TRIG), composed of three stages (transformation, feature extraction, and prediction). Extensive experiments show that our approach can achieve state-of-the-art on text recognition benchmarks.

Arabic Cursive Text Recognition from Natural Scene Images

Applied Sciences ◽

10.3390/app9020236 ◽

2019 ◽

Vol 9 (2) ◽

pp. 236 ◽

Cited By ~ 6

Author(s):

Saad Ahmed ◽

Saeeda Naz ◽

Muhammad Razzak ◽

Rubiyah Yusof

Keyword(s):

Recognition System ◽

Document Image ◽

Text Recognition ◽

Chinese Script ◽

Challenging Problem ◽

Future Directions ◽

Scene Text ◽

Comprehensive Survey ◽

Recognition Systems ◽

Scene Text Recognition

This paper presents a comprehensive survey on Arabic cursive scene text recognition. The recent years’ publications in this field have witnessed the interest shift of document image analysis researchers from recognition of optical characters to recognition of characters appearing in natural images. Scene text recognition is a challenging problem due to the text having variations in font styles, size, alignment, orientation, reflection, illumination change, blurriness and complex background. Among cursive scripts, Arabic scene text recognition is considered a more challenging problem due to joined writing, same character variations, a large number of ligatures, the number of baselines, etc. Surveys on the Latin and Chinese script-based scene text recognition system can be found, but the Arabic-like scene text recognition problem is yet to be addressed in detail. In this manuscript, a description is provided to highlight some of the latest techniques presented for text classification. The presented techniques following a deep learning architecture are equally suitable for the development of Arabic cursive scene text recognition systems. The issues pertaining to text localization and feature extraction are also presented. Moreover, this article emphasizes the importance of having a benchmark cursive scene text dataset. Based on the discussion, future directions are outlined, some of which may provide insight about cursive scene text to researchers.

A discriminative semi-Markov model for robust scene text recognition

2008 19th International Conference on Pattern Recognition ◽

10.1109/icpr.2008.4761818 ◽

2008 ◽

Cited By ~ 3

Author(s):

Jerod J. Weinman ◽

Erik Learned-Miller ◽

Allen Hanson

Keyword(s):

Markov Model ◽

Text Recognition ◽

Scene Text ◽

Scene Text Recognition
