Handwritten Annotation Spotting in Printed Documents Using Top-Down Visual Saliency Models

Shilpa Pandey ◽  
Gaurav Harit

In this article, we address the problem of localizing text and symbolic annotations on the scanned image of a printed document. Previous approaches have treated annotation extraction as binary classification into printed and handwritten text. In this work, we further subcategorize the annotations as underlines, encirclements, inline text, and marginal text. We have collected a new dataset of 300 documents containing all classes of annotations marked around or in between printed text. Using the dataset as a benchmark, we report the results of two saliency formulations, CRF Saliency and Discriminant Saliency, for predicting salient patches, which can correspond to different types of annotations. We also compare our work with recent semantic segmentation techniques using deep models. Our analysis shows that Discriminant Saliency can be considered the preferred approach for fast localization of patches containing different types of annotations. The saliency models were learned on a small dataset but still give performance comparable to deep networks for pixel-level semantic segmentation. We show that saliency-based methods give better outcomes with limited annotated data than more sophisticated segmentation techniques, which require a large training set to learn the model.
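The patch-level localization described above can be sketched as a thresholding pass over a per-pixel saliency map; the patch size, threshold, and function name below are illustrative assumptions, not the paper's actual formulation:

```python
def salient_patches(sal, patch, thresh):
    """Return top-left (row, col) of patches whose mean saliency exceeds thresh.

    sal: 2D list holding a per-pixel saliency map; patch: patch side length.
    """
    h, w = len(sal), len(sal[0])
    hits = []
    for y0 in range(0, h, patch):
        for x0 in range(0, w, patch):
            # average the saliency values inside this (possibly clipped) patch
            vals = [sal[y][x]
                    for y in range(y0, min(y0 + patch, h))
                    for x in range(x0, min(x0 + patch, w))]
            if sum(vals) / len(vals) > thresh:
                hits.append((y0, x0))
    return hits
```

Patches flagged this way would then be classified into annotation types (underline, encirclement, inline, marginal) by a downstream model.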

2019 ◽  
Vol 9 (24) ◽  
pp. 5378 ◽  
Maria Wahid ◽  
Asim Waris ◽  
Syed Omer Gilani ◽  
Ramanathan Subramanian

Saliency is the quality of an object that makes it stand out from neighbouring items and grab the viewer's attention. In image processing, it refers to the pixel or group of pixels that stand out in an image or a video clip and capture the attention of the viewer. Our eye movements are usually guided by saliency while inspecting a scene. Rapid detection of emotive stimuli is an ability possessed by humans. Visual objects in a scene are also emotionally salient. As different images and clips can elicit different emotional responses in a viewer, such as happiness or sadness, there is a need to measure these emotions along with visual saliency. This study was conducted to determine whether the existing visual saliency models can also measure emotional saliency. The classical Graph-Based Visual Saliency (GBVS) model is used in the study. Results show that sad movies exhibit lower saliency, with a significant difference of 0.05 between happy and sad videos and a large difference in means (76.57 vs. 57.0), making these videos less emotionally salient. However, overall visual content does not capture emotional salience. The applied Graph-Based Visual Saliency model identified happy emotions well but could not analyze sad emotions.
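The comparison above rates clips by the strength of their bottom-up saliency. A minimal center-surround contrast map (a toy stand-in for illustration only, not the actual GBVS graph formulation) shows the kind of per-pixel score being averaged per frame:

```python
def center_surround_saliency(img, radius=1):
    """Toy saliency: absolute difference between a pixel and its neighbourhood mean.

    img: 2D list of grayscale intensities.
    """
    h, w = len(img), len(img[0])
    sal = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            neigh = []
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ny, nx = y + dy, x + dx
                    if (dy or dx) and 0 <= ny < h and 0 <= nx < w:
                        neigh.append(img[ny][nx])
            sal[y][x] = abs(img[y][x] - sum(neigh) / len(neigh))
    return sal
```

A frame's mean over such a map gives one scalar per frame, which is the sort of quantity the happy/sad comparison averages.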

Y. Cao ◽  
M. Previtali ◽  
M. Scaioni

Abstract. In the wake of the success of Deep Learning Networks (DLN) for image recognition, object detection, shape classification and semantic segmentation, this approach has proven to be both a major breakthrough and an excellent tool in point cloud classification. However, an understanding of how the different types of DLN achieve their results is still lacking. In several studies the output of the segmentation/classification process is compared against benchmarks, but the network is treated as a “black box” and intermediate steps are not deeply analysed. Specifically, the following questions are discussed here: (1) what exactly does a DLN learn from a point cloud? (2) On the basis of what information does a DLN make its decisions? To conduct a quantitative investigation of these questions for DLN applied to point clouds, this paper examines the visual interpretability of the decision-making process. Firstly, we introduce a reconstruction network able to reconstruct and visualise the learned features, in order to address question (1). Then, we propose 3DCAM to indicate the discriminative point cloud regions used by these networks to identify a given category, thus dealing with question (2). By answering the above two questions, the paper offers some initial solutions towards a better understanding of the application of DLN to point clouds.

Jingtan Li ◽  
Maolin Xu ◽  
Hongling Xiu

As the resolution of remote sensing images keeps increasing, high-resolution remote sensing images are widely used in many areas. Among their uses, image information extraction is one of the basic applications of remote sensing images. In the face of massive high-resolution remote sensing image data, traditional target-recognition methods struggle to cope. Therefore, this paper proposes a remote sensing image extraction method based on the U-net network. Firstly, the U-net semantic segmentation network is trained on the training set, with the validation set used for validation during training; finally, the test set is used for testing. The experimental results show that U-net can be applied to the extraction of buildings.
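Building extraction of this kind is typically scored pixel-wise against ground truth; a minimal intersection-over-union helper (an assumed metric for illustration, the abstract does not name its evaluation measure) looks like:

```python
def iou(pred, truth):
    """Intersection-over-union of two pixel sets labeled 'building'.

    pred, truth: sets of (row, col) coordinates.
    """
    inter = len(pred & truth)
    union = len(pred | truth)
    # empty-vs-empty counts as a perfect match
    return inter / union if union else 1.0
```

The U-net's binary mask and the reference mask are converted to coordinate sets and compared per test tile.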

2020 ◽  
Vol 10 (18) ◽  
pp. 6386
Xing Bai ◽  
Jun Zhou

Benefiting from the boom in deep learning, state-of-the-art models have achieved great progress. But they are huge in terms of parameters and floating-point operations, which makes them hard to apply to real-time applications. In this paper, we propose a novel deep neural network architecture, named MPDNet, for fast and efficient semantic segmentation under resource constraints. First, we use a lightweight classification model pretrained on ImageNet as the encoder. Second, we use a cost-effective upsampling datapath to restore prediction resolution and convert features for classification into features for segmentation. Finally, we propose a multi-path decoder to extract different types of features, which are not ideal to process inside a single convolutional neural network. Our model outperforms other models aimed at real-time semantic segmentation on Cityscapes. Based on the proposed MPDNet, we achieve 76.7% mean IoU on the Cityscapes test set with only 118.84 GFLOPs, and reach 37.6 Hz on 768 × 1536 images on a standard GPU.
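The "upsampling datapath to restore prediction resolution" can be illustrated with the simplest such operator, nearest-neighbour upsampling on a 2D feature map (an illustrative sketch only; MPDNet's actual upsampling design is not specified in the abstract):

```python
def upsample_nearest(feat, scale):
    """Nearest-neighbour upsampling of a 2D feature map by an integer factor.

    feat: 2D list (low-resolution map); returns a (h*scale) x (w*scale) list.
    """
    return [[feat[y // scale][x // scale]
             for x in range(len(feat[0]) * scale)]
            for y in range(len(feat) * scale)]
```

In a real decoder this is interleaved with convolutions that convert classification features into segmentation features, which is where the "cost-effective" trade-off lives.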

2019 ◽  
Vol 10 (1) ◽  
pp. 13 ◽  
Shichao Zhang ◽  
Zhe Zhang ◽  
Libo Sun ◽  
Wenhu Qin

Generally, most approaches use methods such as cropping, rotating, and flipping to obtain more data for training models, improving the accuracy of detection and segmentation. However, due to the difficulty of labeling such data, especially semantic segmentation data, those traditional data augmentation methodologies cannot help much when the training set is really limited. In this paper, a model named OFA-Net (One For All Network) is proposed to combine the object detection and semantic segmentation tasks. Meanwhile, a strategy called “1-N Alternation” is used to train the OFA-Net model, which fuses features from detection and segmentation data. The results show that object detection data can be recruited to improve segmentation accuracy, and, furthermore, segmentation data help considerably in enhancing the confidence of object detection predictions. Finally, the OFA-Net model is trained without traditional data augmentation methodologies and tested on the KITTI test server. The model performs well on the KITTI Road Segmentation challenge and does a good job on the object detection task.
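A plausible reading of the “1-N Alternation” schedule is one detection batch followed by N segmentation batches, repeated; the generator below is a hedged sketch of that interpretation (the abstract does not spell out the schedule, so the name and semantics here are assumptions):

```python
def one_n_schedule(det_batches, seg_batches, n):
    """Yield ('det', batch), then n ('seg', batch) items, repeating until
    either stream of batches is exhausted."""
    det, seg = iter(det_batches), iter(seg_batches)
    while True:
        try:
            yield ('det', next(det))
            for _ in range(n):
                yield ('seg', next(seg))
        except StopIteration:
            return
```

Each yielded pair would drive one optimizer step on the shared OFA-Net backbone with the task-specific head selected by the tag.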

Sensors ◽  
2020 ◽  
Vol 20 (18) ◽  
pp. 5318
Dongnian Li ◽  
Changming Li ◽  
Chengjun Chen ◽  
Zhengxu Zhao

Locating and identifying the components mounted on a printed circuit board (PCB) based on machine vision is an important and challenging problem for automated PCB inspection and automated PCB recycling. In this paper, we propose a PCB semantic segmentation method based on depth images that segments and recognizes components in the PCB through pixel classification. The image training set for the PCB was automatically synthesized with graphic rendering. Based on a series of concentric circles centered at a given depth pixel, we extracted depth-difference features from the depth images in the training set to train a random forest pixel classifier. Using the constructed classifier, we performed semantic segmentation of the PCB to segment and recognize its components. Experiments on both synthetic and real test sets were conducted to verify the effectiveness of the proposed method. The experimental results demonstrate that our method can segment and recognize most of the components from a real depth image of a PCB. Our method is immune to illumination changes and can be implemented in parallel on a GPU.
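The concentric-circle depth-difference features can be sketched as follows; the radii, sample count per circle, and clamping at image borders are illustrative assumptions, not the paper's exact design:

```python
import math

def depth_difference_features(depth, y, x, radii, samples=8):
    """Feature vector for pixel (y, x): depth(center) - depth(sample point),
    for `samples` points on each concentric circle of radius r in `radii`.

    depth: 2D list of depth values; out-of-range samples are clamped to the border.
    """
    h, w = len(depth), len(depth[0])
    feats = []
    for r in radii:
        for k in range(samples):
            a = 2 * math.pi * k / samples
            ny = min(max(int(round(y + r * math.sin(a))), 0), h - 1)
            nx = min(max(int(round(x + r * math.cos(a))), 0), w - 1)
            feats.append(depth[y][x] - depth[ny][nx])
    return feats
```

Each pixel's feature vector would then be fed to the random forest, which votes on a component-class label per pixel; because the features are pure depth differences, they are insensitive to illumination.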

Information ◽  
2019 ◽  
Vol 10 (8) ◽  
pp. 257 ◽  
Bashir Ghariba ◽  
Mohamed S. Shehata ◽  
Peter McGuire

Human eye movement is one of the most important functions for understanding our surroundings. When a human eye processes a scene, it quickly focuses on dominant parts of the scene; predicting this behaviour is commonly known as visual saliency detection or visual attention prediction. Recently, neural networks have been used to predict visual saliency. This paper proposes a deep learning encoder-decoder architecture, based on a transfer learning technique, to predict visual saliency. In the proposed model, visual features are extracted through convolutional layers from raw images to predict visual saliency. In addition, the proposed model uses the VGG-16 network for semantic segmentation, which uses a pixel classification layer to predict the categorical label for every pixel in an input image. The proposed model is applied to several datasets, including TORONTO, MIT300, MIT1003, and DUT-OMRON, to illustrate its efficiency. The results of the proposed model are quantitatively and qualitatively compared to classic and state-of-the-art deep learning models. Using the proposed deep learning model, a global accuracy of up to 96.22% is achieved for the prediction of visual saliency.
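The "pixel classification layer" mentioned above reduces, after the softmax, to an argmax over per-pixel class scores; a minimal sketch of that final step (the logit layout here is an assumption for illustration):

```python
def pixel_labels(logits):
    """Assign each pixel the class with the highest score.

    logits: H x W x C nested list of per-pixel class scores;
    returns an H x W list of integer class labels.
    """
    return [[max(range(len(px)), key=lambda c: px[c]) for px in row]
            for row in logits]
```

In the full model this sits on top of the VGG-16-based decoder output, turning the dense score volume into a categorical label map.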
