image understanding
Recently Published Documents


TOTAL DOCUMENTS: 852 (five years: 110)

H-INDEX: 38 (five years: 7)

Author(s): Santosh Kumar Mishra, Gaurav Rai, Sriparna Saha, Pushpak Bhattacharyya

Image captioning refers to the process of generating a textual description of the objects and activities present in a given image. It connects two fields of artificial intelligence: computer vision and natural language processing, which deal with image understanding and language modeling, respectively. In the existing literature, most work on image captioning has been carried out for the English language. This article presents a novel method for image captioning in the Hindi language using an encoder–decoder based deep learning architecture with efficient channel attention. The key contribution of this work is the deployment of an efficient channel attention mechanism together with Bahdanau attention and a gated recurrent unit for developing an image captioning model in the Hindi language. Color images usually consist of three channels: red, green, and blue. The channel attention mechanism focuses on an image's important channels while performing the convolution, essentially assigning higher importance to some channels than to others, and it has been shown to have great potential for improving the efficiency of deep convolutional neural networks (CNNs). The proposed encoder–decoder architecture uses the recently introduced ECA-Net CNN to integrate the channel attention mechanism. Hindi, India's official language, is the fourth most spoken language globally and is widely spoken in India and South Asia. A dataset for image captioning in Hindi was created manually by translating the well-known MSCOCO dataset from English to Hindi. The efficiency of the proposed method is compared with other baselines in terms of Bilingual Evaluation Understudy (BLEU) scores, and the results show that the proposed method outperforms the baselines, attaining improvements of 0.59%, 2.51%, 4.38%, and 3.30% in BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores, respectively, over the state of the art. The quality of the generated captions is further assessed manually in terms of adequacy and fluency to illustrate the proposed method's efficacy.
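For readers who want to see the channel-attention idea in code, below is a minimal PyTorch sketch of an ECA-style module; the framework choice, kernel size, and wiring follow the public ECA-Net recipe, not necessarily the authors' exact model. Each channel is squeezed to a scalar by global average pooling, a 1-D convolution shares information across neighboring channels, and a sigmoid produces per-channel weights.

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention (ECA-Net style): a 1-D convolution over
    pooled channel descriptors replaces the FC layers of SE-style blocks."""
    def __init__(self, k_size: int = 3):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)          # squeeze: B x C x 1 x 1
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size,
                              padding=k_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.avg_pool(x)                             # B x C x 1 x 1
        y = self.conv(y.squeeze(-1).transpose(-1, -2))   # conv across channels
        y = self.sigmoid(y.transpose(-1, -2).unsqueeze(-1))
        return x * y.expand_as(x)                        # reweight each channel

feats = torch.randn(2, 64, 32, 32)      # a dummy feature map
print(ECA()(feats).shape)               # torch.Size([2, 64, 32, 32])
```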


2022
Author(s): Elissa M Aminoff, Shira Baror, Eric W Roginek, Daniel D Leeds

Contextual associations facilitate object recognition in human vision. However, the role of context in artificial vision remains elusive, as do the characteristics that humans use to define context. We investigated whether contextually related objects (bicycle–helmet) are represented more similarly in convolutional neural networks (CNNs) used for image understanding than unrelated objects (bicycle–fork). Stimuli depicted objects against a white background and covered a diverse set of contexts (N = 73). CNN representations of contextually related objects were more similar to one another than to those of unrelated objects across all CNN layers. Critically, the similarity found in CNNs correlated with human behavior across three experiments assessing contextual relatedness, with the correlation emerging as significant only in the later layers. The results demonstrate that context is inherently represented in CNNs as a result of object recognition training, and that the representations in the later layers of the network tap into the contextual regularities that predict human behavior.
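A minimal sketch of the kind of analysis the abstract describes, assuming a stock torchvision ResNet-50 as the CNN and hypothetical image files for the object pairs; the study's actual networks and stimuli are not specified here.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import models, transforms

# ResNet-50 is a stand-in CNN; the .jpg file names are hypothetical stimuli.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
prep = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

acts = {}
# Capture one "later layer" of the network via a forward hook.
model.layer4.register_forward_hook(lambda m, i, o: acts.update(late=o.flatten(1)))

def embed(path):
    """Later-layer activation for one white-background object image."""
    with torch.no_grad():
        model(prep(Image.open(path).convert("RGB")).unsqueeze(0))
    return acts["late"]

# Higher similarity is expected for the contextually related pair.
related = F.cosine_similarity(embed("bicycle.jpg"), embed("helmet.jpg"))
unrelated = F.cosine_similarity(embed("bicycle.jpg"), embed("fork.jpg"))
print(float(related), float(unrelated))
```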


Author(s): N. A. Muhadi, A. F. Abdullah, S. K. Bejo, M. R. Mahadi, A. Mijic

Abstract. Floods are the most frequent type of natural disaster; they cause loss of life, damage personal property, and ultimately affect a country's economy. Researchers around the world have made significant efforts to deal with the flood issue. Computer vision is one of the common approaches employed, including the use of image segmentation techniques for image understanding and image analysis; such techniques have been used in various fields, including flood disaster applications. This paper explores the use of a hybrid segmentation technique for detecting water regions in surveillance images and introduces a flood index calculation to study water level fluctuations. The flood index was evaluated by comparing it with water levels measured by an on-site sensor. The experimental results demonstrate that the flood index reflects the trend of the river's water levels. Thus, the proposed technique can be used to detect water regions and monitor the water level fluctuations of the river.
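The abstract gives neither the hybrid segmentation method nor the flood index formula, so the sketch below assumes the simplest plausible reading: segment water pixels inside a region of interest (here via a placeholder HSV colour threshold standing in for the paper's segmentation) and report the water fraction as the index.

```python
import cv2
import numpy as np

def flood_index(frame_bgr, roi):
    """Fraction of the region of interest classified as water.

    roi is (x, y, w, h) over the monitored river bank; the colour
    threshold is a placeholder for the paper's hybrid segmentation.
    """
    x, y, w, h = roi
    patch = frame_bgr[y:y + h, x:x + w]
    hsv = cv2.cvtColor(patch, cv2.COLOR_BGR2HSV)
    water = cv2.inRange(hsv, (80, 20, 20), (140, 255, 255))  # bluish pixels
    return float(np.count_nonzero(water)) / water.size

frame = cv2.imread("river_cam.jpg")            # hypothetical surveillance frame
print(flood_index(frame, roi=(100, 200, 400, 150)))
```

Tracking this index over successive frames and comparing it against the on-site sensor readings would reproduce the evaluation the abstract describes.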


2022, Vol. 14 (1), pp. 207
Author(s): Xudong Sun, Min Xia, Tianfang Dai

High-resolution remote sensing images have been applied to remote sensing parsing. General remote sensing parsing methods based on semantic segmentation still have limitations, including frequent neglect of tiny objects, high complexity in image understanding, and sample imbalance. Therefore, a controllable fusion module (CFM) is proposed to alleviate the problem of implicitly understanding complicated categories. Moreover, an adaptive edge loss function (AEL) is proposed to alleviate the problems of tiny-object recognition and sample imbalance. Our proposed method, combining CFM and AEL, optimizes edge features and body features in a coupled mode. Verification on the Potsdam and Vaihingen datasets shows that our method can significantly improve the parsing of satellite images in terms of mIoU and MPA.
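The abstract names AEL but not its formula. As a rough illustration of what an edge-aware segmentation loss can look like in PyTorch, one common pattern up-weights the cross-entropy on label boundaries; this is a generic stand-in, not the authors' AEL.

```python
import torch
import torch.nn.functional as F

def edge_weighted_ce(logits, labels, edge_weight=4.0):
    """Cross-entropy up-weighted on label boundaries (generic stand-in
    for an adaptive edge loss). Boundary pixels are found by comparing
    each label map to dilated/eroded copies built with max-pooling."""
    lab = labels.unsqueeze(1).float()
    dilated = F.max_pool2d(lab, 3, stride=1, padding=1)
    eroded = -F.max_pool2d(-lab, 3, stride=1, padding=1)
    edges = (dilated != eroded).squeeze(1).float()     # 1 on class boundaries
    weights = 1.0 + edge_weight * edges                # emphasize edge pixels
    per_pixel = F.cross_entropy(logits, labels, reduction="none")
    return (weights * per_pixel).mean()

logits = torch.randn(2, 6, 64, 64)              # 6 ISPRS-style classes
labels = torch.randint(0, 6, (2, 64, 64))
print(edge_weighted_ce(logits, labels))
```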


2021, Vol. 2021, pp. 1-9
Author(s): Lin Feng, Jian Wang, Chao Ding

Digital image processing technology is widely used in production and daily life, and digital images play a pivotal role in ongoing technological development. Noise can affect the expression of image information. The edge reflects the main structure and contour of an image; it is also the direct interpretation of image understanding and the basis for further segmentation and recognition. Therefore, suppressing noise and improving the accuracy of edge detection are important aspects of image processing. To address these issues, this paper presents a new detection algorithm that combines information fusion with existing image edge detection techniques, studied from the two aspects of fuzzy radial basis fusion discrimination. For preprocessing, the denoising effects of mean and median filters with different template sizes were compared on paper images with added noise, and an improved median filter was selected for denoising. Edge detection with different operators was then compared, and the 3 × 3 Sobel operator was finally selected. The edge contours in the binarized image are enclosed in minimum bounding rectangles and labeled, and the original paper image is then scanned line by line to segment the edge region of the target image. The image edge detection algorithm based on the fuzzy radial basis fuser not only speeds up image preprocessing, meets real-time detection requirements, and reduces the amount of data processed by the host computer, but also accurately identifies five kinds of image edge defects, including folds and cracks, giving it good application prospects.
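The preprocessing and edge-detection pipeline the abstract describes (median filtering, 3 × 3 Sobel, minimum bounding rectangles) maps closely onto standard OpenCV calls. The sketch below covers only that pipeline, not the fuzzy radial basis fusion step, and the input file name is hypothetical.

```python
import cv2
import numpy as np

img = cv2.imread("paper_sample.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical scan

denoised = cv2.medianBlur(img, 3)                    # median filter, 3x3 window

gx = cv2.Sobel(denoised, cv2.CV_32F, 1, 0, ksize=3)  # 3x3 Sobel, x-gradient
gy = cv2.Sobel(denoised, cv2.CV_32F, 0, 1, ksize=3)  # 3x3 Sobel, y-gradient
mag = cv2.convertScaleAbs(cv2.magnitude(gx, gy))     # gradient magnitude, 8-bit

# Binarize the edge map, then label each contour's minimum bounding rectangle.
_, edges = cv2.threshold(mag, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
for c in contours:
    x, y, w, h = cv2.boundingRect(c)                 # upright bounding rectangle
    cv2.rectangle(img, (x, y), (x + w, y + h), 255, 1)
```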


2021, Vol. 13 (24), pp. 4999
Author(s): Boyong He, Xianjiang Li, Bo Huang, Enhui Gu, Weijie Guo, ...

As a data-driven approach, deep learning requires a large amount of annotated training data to obtain a sufficiently accurate and generalized model, especially in the field of computer vision. However, compared with generic object recognition datasets, aerial image datasets are more difficult to acquire and more expensive to label, so obtaining a large amount of high-quality aerial image data for object recognition and image understanding is an urgent problem. Existing studies show that synthetic data can effectively reduce the amount of training data required. Therefore, in this paper, we propose the first synthetic aerial image dataset for ship recognition, called UnityShip. This dataset contains over 100,000 synthetic images and 194,054 ship instances, including 79 different ship models in ten categories and six different large virtual scenes with different time periods, weather environments, and altitudes. The annotations include environmental information, instance-level horizontal bounding boxes, oriented bounding boxes, and the type and ID of each ship, providing the basis for object detection, oriented object detection, fine-grained recognition, and scene recognition. To investigate the applications of UnityShip, the synthetic data were validated for model pre-training and data augmentation using three different object detection algorithms and six existing real-world ship detection datasets. Our experimental results show that for small- and medium-sized real-world datasets, the synthetic data yield improvements when used for model pre-training and data augmentation, showing the value and potential of synthetic data in aerial image recognition and understanding tasks.
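A minimal sketch of the pre-train-then-fine-tune protocol the abstract evaluates, using torchvision's Faster R-CNN as one possible detector (the paper's exact algorithms are not named here). The dataset class is a random stand-in: in practice one phase would consume UnityShip images and the other a real-world ship dataset.

```python
import torch
from torch.utils.data import DataLoader, Dataset
from torchvision.models.detection import fasterrcnn_resnet50_fpn

class RandomShips(Dataset):
    """Random stand-in; swap in UnityShip (phase 1) or a real dataset (phase 2)."""
    def __len__(self):
        return 8
    def __getitem__(self, i):
        image = torch.rand(3, 256, 256)
        target = {"boxes": torch.tensor([[30.0, 40.0, 120.0, 110.0]]),
                  "labels": torch.tensor([1])}
        return image, target

def run_phase(model, dataset, epochs, lr):
    """One training phase; detection models return a dict of losses in train mode."""
    loader = DataLoader(dataset, batch_size=2,
                        collate_fn=lambda batch: tuple(zip(*batch)))
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    model.train()
    for _ in range(epochs):
        for images, targets in loader:
            loss = sum(model(list(images), list(targets)).values())
            opt.zero_grad(); loss.backward(); opt.step()

model = fasterrcnn_resnet50_fpn(weights=None, num_classes=2)  # ship + background
run_phase(model, RandomShips(), epochs=1, lr=0.01)    # phase 1: synthetic pre-training
run_phase(model, RandomShips(), epochs=1, lr=0.001)   # phase 2: real-data fine-tuning
```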


2021, Vol. 11 (21), pp. 10327
Author(s): Ali Abbas, Michael Haslgrübler, Abdul Mannan Dogar, Alois Ferscha

Deep learning has proven very useful for efficient image understanding. The assembly of complex machines is very common in industry; the assembly of automated teller machines (ATMs) is one example. Deep learning models exist that monitor and control the assembly process, but to the best of our knowledge, none target real environments where there is no control over the workers' working style or the sequence of the assembly process. In this paper, we present a modified deep learning model to control the assembly process in a real-world environment. For this study, we used a dataset generated in a real-world, uncontrolled environment; during dataset generation, we had no control over the sequence of assembly steps. We applied four different state-of-the-art deep learning models to control the assembly of ATMs and, given the nature of the uncontrolled-environment dataset, modified them to fit the task. Our proposed model not only controls the sequence but also gives feedback in case of any missing step in the required workflow. The contributions of this research are accurate anomaly detection in the assembly process in a real environment, modifications of existing deep learning models according to the nature of the data, and normalization of the uncontrolled data for training the deep learning models. The results show that we can generalize and control the sequence of assembly steps, because even in an uncontrolled environment there are specific activities that are repeated over time. If we can recognize and map the micro activities to macro activities, then we can successfully monitor and optimize the assembly process.
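The feedback mechanism the abstract describes, recognizing micro activities, mapping them to macro assembly steps, and reporting missing steps, can be illustrated with a few lines of plain Python. All step names below are hypothetical; the actual workflow labels come from the authors' dataset.

```python
# Hypothetical macro-step workflow for one ATM module.
REQUIRED_STEPS = ["mount_frame", "insert_dispenser", "connect_cables",
                  "attach_panel", "final_inspection"]

# Hypothetical mapping from recognized micro activities to macro steps.
MICRO_TO_MACRO = {
    "pick_frame": "mount_frame", "align_frame": "mount_frame",
    "slide_dispenser": "insert_dispenser",
    "plug_power": "connect_cables", "plug_data": "connect_cables",
    "screw_panel": "attach_panel",
    "visual_check": "final_inspection",
}

def check_assembly(recognized_micro_activities):
    """Map micro activities to macro steps; return any missing macro steps."""
    observed = {MICRO_TO_MACRO[m] for m in recognized_micro_activities
                if m in MICRO_TO_MACRO}
    return [step for step in REQUIRED_STEPS if step not in observed]

missing = check_assembly(["pick_frame", "slide_dispenser",
                          "plug_power", "visual_check"])
print("Feedback - missing steps:", missing)   # ['attach_panel']
```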


Author(s): A. V. N. Kameswari

Abstract: When humans see an image, their brain can easily tell what the image is about, but this does not come easily to a computer; computer vision researchers have worked on the problem extensively, and it was long considered intractable. With advances in deep learning techniques and the availability of huge datasets and computing power, we can now build models that generate captions for an image. The image caption generator is a popular research area of deep learning that deals with image understanding and a language description for that image. Generating well-formed sentences requires both syntactic and semantic understanding of the language. Describing the content of an image in accurately formed sentences is a very challenging task, but it could also have a great impact, for example by helping visually impaired people better understand the content of images. The biggest challenge is creating a description that captures not only the objects contained in an image but also how these objects relate to each other. This paper uses the Flickr_8K dataset and the Flickr8k_text folder, which contains Flickr8k.token, the main file of the dataset; it lists each image name and its respective caption, separated by a newline ("\n"). A CNN is used to extract features from the image; we use the pre-trained Xception model. An LSTM then uses the information from the CNN to generate a description of the image. The Flickr8k_text folder also contains the file Flickr_8k.trainImages.txt, which lists the names of the 6000 images used for training. After the CNN-LSTM model is defined, an image file is passed as a parameter through the command prompt to test the caption generator; it generates a caption for the image, and its accuracy is assessed by calculating the BLEU score between the generated and reference captions. Keywords: Image Caption Generator, Convolutional Neural Network, Long Short-Term Memory, BLEU score, Flickr_8K
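The BLEU evaluation step described at the end can be reproduced with NLTK; the reference captions and candidate below are hypothetical Flickr8k-style examples, not taken from the paper.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical reference captions and a model output for one test image.
references = [
    "a child in a pink dress is climbing up a set of stairs".split(),
    "a little girl climbing the stairs to her playhouse".split(),
]
candidate = "a girl is climbing up the stairs".split()

smooth = SmoothingFunction().method1   # avoids zero scores on short captions
weight_sets = [(1, 0, 0, 0), (0.5, 0.5, 0, 0),
               (1/3, 1/3, 1/3, 0), (0.25, 0.25, 0.25, 0.25)]
for n, weights in enumerate(weight_sets, start=1):
    score = sentence_bleu(references, candidate, weights=weights,
                          smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```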


Author(s): Yung-Yao Chen, Sin-Ye Jhong, Chih-Hsien Hsia, Kai-Lung Hua

Recently, the vein, as one of the most promising biometric traits, has attracted the attention of both academia and industry because it enables living-body identification and a convenient acquisition process. State-of-the-art techniques can provide relatively good performance, yet they are limited to specific light sources and still adapt poorly to multispectral images. Despite the great success achieved by convolutional neural networks (CNNs) in various image understanding tasks, they often require large training samples and high computation, which are infeasible for palm-vein identification. To address this limitation, this work proposes a palm-vein identification system based on a lightweight CNN and an adaptive multispectral method with explainable AI. A principal component analysis on symmetric discrete wavelet transform (SMDWT-PCA) technique is adopted to augment the vein images, addressing the problems of insufficient data and multispectral adaptability. Depthwise separable convolution (DSC) is applied to reduce the number of model parameters. To ensure that the experimental results are accurate and robust, multispectral palm images from a public dataset (CASIA) are also used to assess the performance of the proposed method. As a result, the palm-vein identification system provides performance superior to that of previous related approaches across different spectra.
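Depthwise separable convolution, which the abstract credits for the reduced parameter count, is straightforward to sketch in PyTorch (the framework choice and the BN/ReLU wiring here are common practice, not details given in the abstract): a per-channel spatial convolution followed by a 1x1 pointwise convolution cuts parameters roughly by a factor of the kernel area.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution: per-channel spatial filtering
    (groups=in_ch) followed by a 1x1 pointwise channel mixer."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, k, padding=k // 2,
                                   groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

x = torch.randn(1, 32, 64, 64)                   # e.g. a vein feature map
print(DepthwiseSeparableConv(32, 64)(x).shape)   # torch.Size([1, 64, 64, 64])
```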

