Simultaneous Recognition of Horizontal and Vertical Text in Natural Images

Mapping Intimacies ◽

10.20944/preprints201812.0114.v1 ◽

2018 ◽

Author(s):

Chankyu Choi ◽

Youngmin Yoon ◽

Junsu Lee ◽

Junseok Kim

Keyword(s):

State Of The Art ◽

Asian Countries ◽

Directional Information ◽

Proposed Model ◽

Art Scene ◽

Scene Text ◽

Benchmark Datasets ◽

Tv Commercials ◽

Different Characteristics ◽

Scene Text Recognition

Recent state-of-the-art scene text recognition methods have primarily focused on horizontal text in images. However, in several Asian countries, including China, large amounts of text in signs, books, and TV commercials are vertically directed. Because the horizontal and vertical texts exhibit different characteristics, developing an algorithm that can simultaneously recognize both types of text in real environments is necessary. To address this problem, we adopted the direction encoding mask (DEM) and selective attention network (SAN) methods based on supervised learning. DEM contains directional information to compensate in cases that lack text direction; therefore, our network is trained using this information to handle the vertical text. The SAN method is designed to work individually for both types of text. To train the network to recognize both types of text and to evaluate the effectiveness of the designed model, we prepared a new synthetic vertical text dataset and collected an actual vertical text dataset (VTD142) from the Web. Using these datasets, we proved that our proposed model can accurately recognize both vertical and horizontal text and can achieve state-of-the-art results in experiments using benchmark datasets, including the street view test (SVT), IIIT-5k, and ICDAR. Although our model is relatively simple as compared to its predecessors, it maintains the accuracy and is trained in an end-to-end manner.

Multi-granularity Deep Local Representations for Irregular Scene Text Recognition

ACM/IMS Transactions on Data Science ◽

10.1145/3446971 ◽

2021 ◽

Vol 2 (2) ◽

pp. 1-18

Author(s):

Hongchao Gao ◽

Yujia Li ◽

Jiao Dai ◽

Xi Wang ◽

Jizhong Han ◽

...

Keyword(s):

State Of The Art ◽

Visual Representation ◽

Text Recognition ◽

Natural Scene ◽

Attention Network ◽

Training Time ◽

Scene Text ◽

Benchmark Datasets ◽

Local Representations ◽

Scene Text Recognition

Recognizing irregular text from natural scene images is challenging due to the unconstrained appearance of text, such as curvature, orientation, and distortion. Recent recognition networks regard this task as a text sequence labeling problem and most networks capture the sequence only from a single-granularity visual representation, which to some extent limits the performance of recognition. In this article, we propose a hierarchical attention network to capture multi-granularity deep local representations for recognizing irregular scene text. It consists of several hierarchical attention blocks, and each block contains a Local Visual Representation Module (LVRM) and a Decoder Module (DM). Based on the hierarchical attention network, we propose a scene text recognition network. The extensive experiments show that our proposed network achieves the state-of-the-art performance on several benchmark datasets including IIIT-5K, SVT, CUTE, SVT-Perspective, and ICDAR datasets under shorter training time.

TextScanner: Reading Characters in Order for Robust Scene Text Recognition

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i07.6891 ◽

2020 ◽

Vol 34 (07) ◽

pp. 12120-12127 ◽

Cited By ~ 1

Author(s):

Zhaoyi Wan ◽

Minghang He ◽

Haoran Chen ◽

Xiang Bai ◽

Cong Yao

Keyword(s):

State Of The Art ◽

Semantic Segmentation ◽

Text Recognition ◽

Context Modeling ◽

Alternative Approach ◽

Scene Text ◽

Benchmark Datasets ◽

Character Position ◽

Scene Text Recognition ◽

Character Class

Driven by deep learning and a large volume of data, scene text recognition has evolved rapidly in recent years. Formerly, RNN-attention-based methods dominated this field, but they suffer from attention drift in certain situations. Lately, semantic-segmentation-based algorithms have proven effective at recognizing text of different forms (horizontal, oriented and curved). However, these methods may produce spurious characters or miss genuine characters, as they rely heavily on a thresholding procedure applied to segmentation maps. To tackle these challenges, we propose in this paper an alternative approach, called TextScanner, for scene text recognition. TextScanner has three characteristics: (1) it belongs to the semantic segmentation family, as it generates pixel-wise, multi-channel segmentation maps for character class, position and order; (2) akin to RNN-attention-based methods, it also adopts an RNN for context modeling; (3) it predicts character position and class in parallel and ensures that characters are transcribed in the correct order. Experiments on standard benchmark datasets demonstrate that TextScanner outperforms state-of-the-art methods. Moreover, TextScanner shows its superiority in recognizing more difficult text such as Chinese transcripts and in aligning with target characters.

2D Positional Embedding-based Transformer for Scene Text Recognition

Journal of Computational Vision and Imaging Systems ◽

10.15353/jcvis.v6i1.3533 ◽

2021 ◽

Vol 6 (1) ◽

pp. 1-4

Author(s):

Zobeir Raisi ◽

Mohamed A. Naiel ◽

Paul Fieguth ◽

Steven Wardell ◽

John Zelek

Keyword(s):

Spatial Information ◽

State Of The Art ◽

Image Features ◽

Text Recognition ◽

Recognition Method ◽

One Dimensional ◽

Art Scene ◽

Scene Text ◽

In The Wild ◽

Scene Text Recognition

Recent state-of-the-art scene text recognition methods are primarily based on Recurrent Neural Networks (RNNs). However, these methods require one-dimensional (1D) features and are not designed for recognizing irregular-text instances, because the spatial information present in the original two-dimensional (2D) images is lost. In this paper, we leverage a Transformer-based architecture for recognizing both regular and irregular text-in-the-wild images. The proposed method uses a 2D positional encoder with the Transformer architecture to preserve the spatial information of 2D image features better than previous methods. Experiments on popular benchmarks, including the challenging COCO-Text dataset, demonstrate that the proposed scene text recognition method outperforms the state of the art in most cases, especially on irregular-text recognition.
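
To make the idea of a 2D positional encoder concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not give the exact formula, so this assumes the common sinusoidal variant in which half of the channels encode the row index and the other half encode the column index. The function names `sinusoid_1d` and `positional_encoding_2d` are hypothetical.

```python
import math


def sinusoid_1d(position: int, dim: int) -> list[float]:
    """Standard Transformer sinusoidal encoding of a single integer position."""
    return [
        math.sin(position / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(position / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]


def positional_encoding_2d(height: int, width: int, channels: int) -> list[list[list[float]]]:
    """Build a height x width x channels grid of positional codes.

    Assumption: the first half of the channels encodes the row (y) and the
    second half encodes the column (x). Each half needs sin/cos pairs, so
    `channels` must be divisible by 4.
    """
    if channels % 4 != 0:
        raise ValueError("channels must be divisible by 4")
    half = channels // 2
    return [
        [sinusoid_1d(y, half) + sinusoid_1d(x, half) for x in range(width)]
        for y in range(height)
    ]


# Usage: encode a 2 x 3 feature map with 8 channels per location.
pe = positional_encoding_2d(2, 3, 8)
```

In a model, such a grid would typically be added element-wise to the 2D feature map before it is flattened into a sequence, so that each token still carries both its row and its column.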

Two-Level Transformer and Auxiliary Coherence Modeling for Improved Text Segmentation

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6284 ◽

2020 ◽

Vol 34 (05) ◽

pp. 7797-7804

Author(s):

Goran Glavaš ◽

Swapna Somasundaran

Keyword(s):

State Of The Art ◽

Language Transfer ◽

Text Segmentation ◽

Word Embeddings ◽

Neural Architecture ◽

Text Coherence ◽

Sentence Level ◽

Proposed Model ◽

Benchmark Datasets ◽

Cross Lingual

Breaking down the structure of long texts into semantically coherent segments makes the texts more readable and supports downstream applications like summarization and retrieval. Starting from an apparent link between text coherence and segmentation, we introduce a novel supervised model for text segmentation with simple but explicit coherence modeling. Our model – a neural architecture consisting of two hierarchically connected Transformer networks – is a multi-task learning model that couples the sentence-level segmentation objective with the coherence objective that differentiates correct sequences of sentences from corrupt ones. The proposed model, dubbed Coherence-Aware Text Segmentation (CATS), yields state-of-the-art segmentation performance on a collection of benchmark datasets. Furthermore, by coupling CATS with cross-lingual word embeddings, we demonstrate its effectiveness in zero-shot language transfer: it can successfully segment texts in languages unseen in training.
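
The coupling of a segmentation objective with a coherence objective can be illustrated with a small loss sketch. This is an illustration under stated assumptions rather than the CATS training code: the abstract does not specify the loss functions, so binary cross-entropy for segmentation, a hinge loss for coherence, and the `weight` and `margin` parameters are all hypothetical choices.

```python
import math


def segmentation_loss(probs: list[float], labels: list[int]) -> float:
    """Mean binary cross-entropy; label 1 means the sentence starts a new segment."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for p, y in zip(probs, labels):
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(labels)


def coherence_loss(score_correct: float, score_corrupt: float, margin: float = 1.0) -> float:
    """Hinge loss: the genuine sentence sequence should outscore the corrupted one by `margin`."""
    return max(0.0, margin - (score_correct - score_corrupt))


def joint_loss(
    probs: list[float],
    labels: list[int],
    score_correct: float,
    score_corrupt: float,
    weight: float = 0.5,
) -> float:
    """Multi-task objective: segmentation loss plus a weighted coherence loss."""
    return segmentation_loss(probs, labels) + weight * coherence_loss(score_correct, score_corrupt)


# Usage: two sentences, the first of which starts a segment; the genuine
# sequence already beats the corrupted one by more than the margin.
loss = joint_loss([0.9, 0.1], [1, 0], score_correct=2.0, score_corrupt=0.5)
```

Here the coherence term is zero because the margin is satisfied, so the joint loss reduces to the segmentation term alone.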

Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33018610 ◽

2019 ◽

Vol 33 ◽

pp. 8610-8617 ◽

Cited By ~ 18

Author(s):

Hui Li ◽

Peng Wang ◽

Chunhua Shen ◽

Guyu Zhang

Keyword(s):

State Of The Art ◽

Text Recognition ◽

Natural Scene ◽

Fine Grained ◽

Word Level ◽

Scene Text ◽

Sophisticated Model ◽

Regular Text ◽

Algorithm Implementation ◽

Scene Text Recognition

Recognizing irregular text in natural scene images is challenging due to the large variance in text appearance, such as curvature, orientation and distortion. Most existing approaches rely heavily on sophisticated model designs and/or extra fine-grained annotations, which, to some extent, increase the difficulty in algorithm implementation and data collection. In this work, we propose an easy-to-implement strong baseline for irregular scene text recognition, using off-the-shelf neural network components and only word-level annotations. It is composed of a 31-layer ResNet, an LSTM-based encoder-decoder framework and a 2-dimensional attention module. Despite its simplicity, the proposed method is robust. It achieves state-of-the-art performance on irregular text recognition benchmarks and comparable results on regular text datasets. The code will be released.

Hybrid pooling with wavelets for convolutional neural networks

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-219223 ◽

2022 ◽

pp. 1-10

Author(s):

Daniel Trevino-Sanchez ◽

Vicente Alarcon-Aquino

Keyword(s):

Neural Networks ◽

Convolutional Neural Networks ◽

State Of The Art ◽

Computational Cost ◽

Relevant Information ◽

Accuracy Improvement ◽

Proposed Model ◽

Benchmark Datasets ◽

Augmentation Techniques ◽

High Computational Cost

Detecting and classifying objects correctly is a constant challenge: recognizing them at different scales and in different scenarios, sometimes cropped or badly lit, is not an easy task. Convolutional neural networks (CNNs) have become a widely applied technique since they are completely trainable and well suited to feature extraction. However, the growing number of CNN applications constantly pushes for improvements in accuracy. Initially, those improvements involved the use of large datasets, augmentation techniques, and complex algorithms, which may have a high computational cost. Nevertheless, feature extraction is known to be the heart of the problem. As a result, other approaches combine different technologies to extract better features and improve accuracy without the need for more powerful hardware. In this paper, we propose a hybrid pooling method that incorporates multiresolution analysis within the CNN layers to reduce the feature map size without losing details. To prevent relevant information from being lost during downsampling, an existing pooling method is combined with the wavelet transform, keeping those details "alive" and enriching other stages of the CNN. Extracting higher-quality features improves CNN accuracy. To validate this study, ten pooling methods, including the proposed model, are tested using four benchmark datasets. The results are compared with four of the evaluated methods, which are also considered state of the art.
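
As an illustration of combining an existing pooling method with a wavelet transform, here is a minimal sketch and not the paper's method. The abstract does not say which pooling operator, which wavelet, or which combination rule is used, so max pooling, the level-1 Haar approximation (LL) subband, and a linear blend controlled by a hypothetical `alpha` parameter are all assumptions.

```python
def haar_approximation(block: tuple[tuple[float, float], tuple[float, float]]) -> float:
    """Level-1 2D Haar approximation (LL) coefficient of a 2x2 block, orthonormal scaling."""
    (a, b), (c, d) = block
    return (a + b + c + d) / 2.0


def hybrid_pool(feature_map: list[list[float]], alpha: float = 0.5) -> list[list[float]]:
    """Halve each spatial dimension by blending max pooling with the Haar LL subband.

    alpha = 1.0 gives plain 2x2 max pooling; alpha = 0.0 gives the rescaled
    Haar approximation, which equals 2x2 average pooling.
    """
    height, width = len(feature_map), len(feature_map[0])
    if height % 2 or width % 2:
        raise ValueError("feature map height and width must be even")
    pooled = []
    for i in range(0, height, 2):
        row = []
        for j in range(0, width, 2):
            block = (
                (feature_map[i][j], feature_map[i][j + 1]),
                (feature_map[i + 1][j], feature_map[i + 1][j + 1]),
            )
            max_value = max(max(block[0]), max(block[1]))
            # Divide the LL coefficient by 2 so it is on the same scale as the inputs.
            smooth_value = haar_approximation(block) / 2.0
            row.append(alpha * max_value + (1.0 - alpha) * smooth_value)
        pooled.append(row)
    return pooled


# Usage: pool a 4 x 4 feature map down to 2 x 2.
fm = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
pooled = hybrid_pool(fm)
```

The blend keeps the strongest activation from max pooling while the wavelet term retains a smoothed summary of the whole block, which is one plausible reading of keeping details "alive" during downsampling.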

Histopathological Classification of Breast Cancer Images Using a Multi-Scale Input and Multi-Feature Network

Cancers ◽

10.3390/cancers12082031 ◽

2020 ◽

Vol 12 (8) ◽

pp. 2031 ◽

Cited By ~ 2

Author(s):

Taimoor Shakeel Sheikh ◽

Yonghee Lee ◽

Migyung Cho

Keyword(s):

State Of The Art ◽

Texture Features ◽

Feature Maps ◽

Histopathological Classification ◽

Multi Scale ◽

Machine Learning Methods ◽

Proposed Model ◽

Benchmark Datasets ◽

Histopathological Images

Diagnosis of pathologies using histopathological images can be time-consuming when many images with different magnification levels need to be analyzed. State-of-the-art computer vision and machine learning methods can help automate the diagnostic pathology workflow and thus reduce the analysis time. Automated systems can also be more efficient and accurate, and can increase the objectivity of diagnosis by reducing operator variability. We propose a multi-scale input and multi-feature network (MSI-MFNet) model, which can learn the overall structures and texture features of different scale tissues by fusing multi-resolution hierarchical feature maps from the network’s dense connectivity structure. The MSI-MFNet predicts the probability of a disease on the patch and image levels. We evaluated the performance of our proposed model on two public benchmark datasets. Furthermore, through ablation studies of the model, we found that multi-scale input and multi-feature maps play an important role in improving the performance of the model. Our proposed model outperformed the existing state-of-the-art models by demonstrating better accuracy, sensitivity, and specificity.

EPOC: Efficient Perception via Optimal Communication

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.5830 ◽

2020 ◽

Vol 34 (04) ◽

pp. 4107-4114 ◽

Cited By ~ 1

Author(s):

Masoumeh Heidari Kapourchali ◽

Bonny Banerjee

Keyword(s):

Recognition Accuracy ◽

State Of The Art ◽

The State ◽

Belief State ◽

Sensor Data ◽

Communication Policy ◽

Agent Model ◽

Proposed Model ◽

Benchmark Datasets ◽

Communication Policies

We propose an agent model capable of actively and selectively communicating with other agents to predict its environmental state efficiently. Selecting whom to communicate with is a challenge when the internal model of other agents is unobservable. Our agent learns a communication policy, a mapping from its belief state to the agent it should communicate with, in an online and unsupervised manner, without any reinforcement. Human activity recognition from multimodal, multisource and heterogeneous sensor data is used as a testbed to evaluate the proposed model, where each sensor is assumed to be monitored by an agent. The recognition accuracy on benchmark datasets is comparable to the state of the art even though our model uses significantly fewer parameters and infers the state in a localized manner. The learned policy reduces the number of communications. The agent is tolerant to communication failures and can recognize unreliable agents through their communication messages. To the best of our knowledge, this is the first work on learning communication policies by an agent for predicting its environmental state.

Fake or Genuine? Contextualised Text Representation for Fake Review Detection

10.5121/csit.2021.112311 ◽

2021 ◽

Author(s):

Rami Mohawesh ◽

Shuxiang Xu ◽

Matthew Springer ◽

Muna Al-Hawawreh ◽

Sumbal Maqsood

Keyword(s):

State Of The Art ◽

Online Reviews ◽

Learning Approaches ◽

Text Representation ◽

Purchasing Decisions ◽

Linguistic Features ◽

Proposed Model ◽

Benchmark Datasets ◽

Hidden Patterns ◽

Fake Reviews

Online reviews have a significant influence on customers' purchasing decisions for any products or services. However, fake reviews can mislead both consumers and companies. Several models have been developed to detect fake reviews using machine learning approaches. Many of these models have some limitations resulting in low accuracy in distinguishing between fake and genuine reviews. These models focused only on linguistic features to detect fake reviews and failed to capture the semantic meaning of the reviews. To deal with this, this paper proposes a new ensemble model that employs transformer architecture to discover the hidden patterns in a sequence of fake reviews and detect them precisely. The proposed approach combines three transformer models to improve the robustness of fake and genuine behaviour profiling and modelling to detect fake reviews. The experimental results using semi-real benchmark datasets showed the superiority of the proposed model over state-of-the-art models.

Latent Emotion Memory for Multi-Label Emotion Classification

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6271 ◽

2020 ◽

Vol 34 (05) ◽

pp. 7692-7699

Author(s):

Hao Fei ◽

Yue Zhang ◽

Yafeng Ren ◽

Donghong Ji

Keyword(s):

State Of The Art ◽

Research Topic ◽

Context Information ◽

Important Research ◽

Emotion Classification ◽

Proposed Model ◽

Important Research Topic ◽

Benchmark Datasets ◽

Memory Network ◽

Emotion Memory

Identifying multiple emotions in a sentence is an important research topic. Existing methods usually model the problem as a multi-label classification task. However, previous methods have two issues that limit performance on the task. First, these models do not consider the prior emotion distribution in a sentence. Second, they fail to effectively capture the context information closely related to the corresponding emotion. In this paper, we propose a Latent Emotion Memory network (LEM) for multi-label emotion classification. The proposed model can learn the latent emotion distribution without external knowledge and can effectively incorporate it into the classification network. Experimental results on two benchmark datasets show that the proposed model outperforms strong baselines, achieving state-of-the-art performance.
