A Deep Structured Model for Video Captioning

Author(s):  
V. Vinodhini ◽  
B. Sathiyabhama ◽  
S. Sankar ◽  
Ramasubbareddy Somula

Video captions help viewers follow content in noisy environments or when the sound is muted, and they make content far more accessible to people with impaired hearing. Captions not only support content creators and translators but also boost search engine optimization. Advanced areas such as computer vision and human-computer interaction play a vital role here, driven by the rapid growth of deep learning techniques. Numerous surveys of deep learning models have appeared, covering different methods, architectures, and metrics. Nevertheless, generating video subtitles remains challenging, particularly with respect to activity recognition in video. This paper proposes a deep structured model that performs activity recognition, classification, and captioning within a single architecture. The first stage separates the foreground from the background by building a 3D convolutional neural network (CNN) model; a Gaussian mixture model is used to remove the backdrop. Classification is performed with long short-term memory (LSTM) networks, and a hidden Markov model (HMM) is used to generate high-quality data. Next, a nonlinear activation function performs the normalization step. Finally, video captioning is achieved using natural language generation.
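
As a rough illustration of the background-removal stage, the sketch below maintains a per-pixel running Gaussian model, a single-component simplification of the Gaussian mixture model named above; the threshold `k`, learning rate `alpha`, and frame contents are invented for the example.

```python
import numpy as np

def update_background(mean, var, frame, alpha=0.05, k=2.5):
    """Per-pixel running Gaussian background model (a single-Gaussian
    simplification of a GMM-based backdrop remover). Returns the
    foreground mask and the updated background statistics."""
    std = np.sqrt(var)
    foreground = np.abs(frame - mean) > k * std      # pixels far from the model
    bg = ~foreground                                 # update stats only where
    mean = np.where(bg, (1 - alpha) * mean + alpha * frame, mean)
    var = np.where(bg, (1 - alpha) * var + alpha * (frame - mean) ** 2, var)
    return foreground, mean, var

# Static background with one bright moving blob
mean = np.full((8, 8), 10.0)
var = np.full((8, 8), 1.0)
frame = mean.copy()
frame[2:4, 2:4] = 200.0                              # "moving object"
fg, mean, var = update_background(mean, var, frame)
print(fg[2, 2], fg[0, 0])                            # True False
```

A full mixture model would track several Gaussians per pixel and match each incoming value against all of them, but the thresholding logic is the same.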

Sensors ◽  
2019 ◽  
Vol 19 (15) ◽  
pp. 3434 ◽  
Author(s):  
Nattaya Mairittha ◽  
Tittaya Mairittha ◽  
Sozo Inoue

Labeling activity data is a central part of the design and evaluation of human activity recognition systems. The performance of such systems depends greatly on the quantity and quality of annotations; systems must therefore rely on users and keep them motivated to provide activity labels. Since mobile and embedded devices increasingly use deep learning models to infer user context, we propose to exploit on-device deep learning inference, using a long short-term memory (LSTM)-based method, to alleviate the labeling effort and ground-truth data collection in activity recognition systems based on smartphone sensors. The novel idea is that estimated activities serve as feedback to motivate users to collect accurate activity labels. To enable evaluation, we conduct experiments with two conditions: the proposed method, which shows estimated activities obtained via on-device deep learning inference, and a traditional method, which shows sentences without estimated activities through smartphone notifications. Evaluation on the gathered dataset shows that the proposed method improves both data quality (i.e., the performance of a classification model) and data quantity (i.e., the number of data points collected), indicating that it could improve activity data collection and thereby enhance human activity recognition systems. We discuss the results, limitations, challenges, and implications of on-device deep learning inference in support of activity data collection. We also publish the preliminary dataset to the research community for activity recognition.


Sensors ◽  
2020 ◽  
Vol 20 (24) ◽  
pp. 7195 ◽
Author(s):  
Yashi Nan ◽  
Nigel H. Lovell ◽  
Stephen J. Redmond ◽  
Kejia Wang ◽  
Kim Delbaere ◽  
...  

Activity recognition can provide useful information about an older individual’s activity level and encourage older people to become more active to live longer in good health. This study aimed to develop an activity recognition algorithm for smartphone accelerometry data of older people. Deep learning algorithms, including convolutional neural network (CNN) and long short-term memory (LSTM), were evaluated in this study. Smartphone accelerometry data of free-living activities, performed by 53 older people (83.8 ± 3.8 years; 38 male) under standardized circumstances, were classified into lying, sitting, standing, transition, walking, walking upstairs, and walking downstairs. A 1D CNN, a multichannel CNN, a CNN-LSTM, and a multichannel CNN-LSTM model were tested. The models were compared on accuracy and computational efficiency. Results show that the multichannel CNN-LSTM model achieved the best classification results, with an 81.1% accuracy and an acceptable model and time complexity. Specifically, the accuracy was 67.0% for lying, 70.7% for sitting, 88.4% for standing, 78.2% for transitions, 88.7% for walking, 65.7% for walking downstairs, and 68.7% for walking upstairs. The findings indicated that the multichannel CNN-LSTM model was feasible for smartphone-based activity recognition in older people.
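
A minimal sketch of the best-performing variant, the multichannel CNN-LSTM, is shown below: one small 1D-convolution branch per accelerometer axis, with the branch outputs concatenated and fed to an LSTM before a seven-class head. All layer sizes and kernel widths here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultichannelCNNLSTM(nn.Module):
    """Multichannel CNN-LSTM sketch: per-axis conv branches -> LSTM -> head
    over the seven activity classes listed above (lying, sitting, standing,
    transition, walking, and walking up/downstairs)."""
    def __init__(self, n_axes=3, n_classes=7, conv_ch=16, hidden=32):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(1, conv_ch, kernel_size=5, padding=2) for _ in range(n_axes)
        )
        self.lstm = nn.LSTM(conv_ch * n_axes, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                       # x: (batch, n_axes, time)
        feats = [torch.relu(b(x[:, i:i + 1])) for i, b in enumerate(self.branches)]
        f = torch.cat(feats, dim=1).transpose(1, 2)   # (batch, time, features)
        out, _ = self.lstm(f)
        return self.head(out[:, -1])            # classify from the last time step

model = MultichannelCNNLSTM()
logits = model(torch.randn(4, 3, 128))          # 4 windows, 3 axes, 128 samples
print(logits.shape)                             # torch.Size([4, 7])
```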


Atmosphere ◽  
2020 ◽  
Vol 11 (5) ◽  
pp. 487 ◽  
Author(s):  
Trang Thi Kieu Tran ◽  
Taesam Lee ◽  
Ju-Young Shin ◽  
Jong-Suk Kim ◽  
Mohamad Kamruzzaman

Time series forecasting of meteorological variables such as daily temperature has recently drawn considerable attention from researchers seeking to address the limitations of traditional forecasting models. However, medium-range forecasting (e.g., 5–20 days) is an extremely challenging task for which dynamical weather models struggle to produce reliable results. Developing and selecting an accurate time-series prediction model is itself difficult, because it involves training various distinct models to find the best among them, and selecting an optimal topology for the chosen model matters as well. Accurate forecasting of maximum temperature plays a vital role in human life as well as in sectors such as agriculture and industry. Rising temperatures will worsen urban heat, especially in summer, and significantly affect people’s health. We applied meta-learning principles to optimize the deep learning network structure through hyperparameter optimization. In particular, a genetic algorithm (GA) for meta-learning was used to select the optimal architecture for the network. The dataset was used to train and test three different models, namely an artificial neural network (ANN), a recurrent neural network (RNN), and long short-term memory (LSTM). Our results demonstrate that the hybrid model of an LSTM network and GA outperforms the other models at long lead times. Specifically, LSTM forecasts outperform RNN and ANN for 15-day-ahead prediction in summer, with a root mean square error (RMSE) of 2.719 °C.
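
The GA-based topology search can be sketched as below. In the real workflow the fitness function would train an LSTM with the candidate topology and return its validation RMSE; here a toy surrogate with a known optimum at 64 units and 2 layers stands in so the sketch runs instantly. Population size, mutation rate, and the candidate grid are all invented for the example.

```python
import random

random.seed(0)

def fitness(units, layers):
    """Stand-in for validation RMSE of a trained LSTM with this topology.
    Toy surrogate: minimal at (64 units, 2 layers)."""
    return abs(units - 64) / 64 + abs(layers - 2)

def genetic_search(generations=20, pop_size=8):
    # Each genome is a candidate topology: (LSTM units, number of layers)
    pop = [(random.choice([16, 32, 64, 128, 256]), random.randint(1, 4))
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda g: fitness(*g))
        parents = pop[:pop_size // 2]                 # selection (elitist)
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            child = (a[0], b[1])                      # crossover
            if random.random() < 0.3:                 # mutation
                child = (random.choice([16, 32, 64, 128, 256]),
                         random.randint(1, 4))
            children.append(child)
        pop = parents + children
    return min(pop, key=lambda g: fitness(*g))

best = genetic_search()
print(best)
```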


2020 ◽  
Vol 2020 ◽  
pp. 1-12 ◽  
Author(s):  
Huaijun Wang ◽  
Jing Zhao ◽  
Junhuai Li ◽  
Ling Tian ◽  
Pengjia Tu ◽  
...  

Human activity recognition (HAR) can be exploited to great benefit in many applications, including elder care, health care, rehabilitation, entertainment, and monitoring. Many existing techniques, such as deep learning, have been developed for recognizing specific activities, but few address the transitions between activities. This work proposes a deep learning based scheme that can recognize both specific activities and the transitions between two different activities of short duration and low frequency, for health care applications. We first build a deep convolutional neural network (CNN) to extract features from the data collected by sensors. Then, a long short-term memory (LSTM) network is used to capture long-term dependencies between two actions to further improve the HAR identification rate. By combining CNN and LSTM, a wearable-sensor-based model is proposed that can accurately recognize activities and their transitions. The experimental results show that the proposed approach achieves a recognition rate of up to 95.87% for activities and above 80% for transitions, which is better than most existing comparable models on the open HAPT dataset.
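
The CNN-then-LSTM pipeline described above can be sketched as follows. The HAPT dataset distinguishes 6 basic activities and 6 postural transitions, hence the 12-class head; the channel count, kernel sizes, and hidden size are assumptions made for the sketch rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class HARNet(nn.Module):
    """Stacked 1D convolutions extract features from raw sensor windows;
    an LSTM then models temporal dependencies before classification over
    activities plus transitions (12 HAPT classes)."""
    def __init__(self, in_channels=6, n_classes=12):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.lstm = nn.LSTM(64, 64, batch_first=True)
        self.fc = nn.Linear(64, n_classes)

    def forward(self, x):                  # x: (batch, channels, time)
        f = self.cnn(x).transpose(1, 2)    # (batch, time/2, 64)
        out, _ = self.lstm(f)
        return self.fc(out[:, -1])

net = HARNet()
y = net(torch.randn(2, 6, 128))            # 2 windows of 6-axis IMU data
print(y.shape)                             # torch.Size([2, 12])
```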


Author(s):  
Vũ Hữu Tiến ◽  
Thao Nguyen Thi Huong ◽  
San Vu Van ◽  
Xiem HoangVan

Transform-domain Wyner-Ziv video coding (TDWZ) has shown its benefits for video applications with limited resources, such as visual surveillance systems, remote sensing, and wireless sensor networks. In TDWZ, the correlation noise model (CNM) plays a vital role since it directly affects the number of bits the encoder must send, and thus the overall TDWZ compression performance. To obtain a highly accurate CNM for TDWZ, we propose in this paper a novel CNM estimation approach in which a CNM with a Laplacian distribution is adaptively estimated through a deep learning (DL) mechanism. The proposed DL-based CNM comprises two hidden layers and a linear activation function to adaptively update the Laplacian parameter. Experimental results show that the proposed TDWZ codec significantly outperforms the relevant benchmarks, notably with around 35% bitrate savings compared to the DISCOVER codec and around 22% compared to the HEVC Intra benchmark, while providing similar perceptual quality.
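
For intuition about what the network estimates, the sketch below fits the Laplacian scale parameter of synthetic correlation-noise residuals in closed form. The paper learns the parameter adaptively with a small neural network; the direct maximum-likelihood estimate shown here is a baseline stand-in, and the residual data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(42)

def laplacian_scale(residuals):
    """Maximum-likelihood estimate of the Laplacian scale b (equivalently,
    alpha = 1/b in the usual CNM parameterization) from residuals between
    the side information and the original coefficients."""
    med = np.median(residuals)
    return np.mean(np.abs(residuals - med))

# Synthetic correlation-noise residuals drawn from Laplace(0, b=3)
r = rng.laplace(loc=0.0, scale=3.0, size=100_000)
b_hat = laplacian_scale(r)
print(round(b_hat, 1))   # ≈ 3.0
```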


Sensors ◽  
2019 ◽  
Vol 19 (7) ◽  
pp. 1716 ◽  
Author(s):  
Seungeun Chung ◽  
Jiyoun Lim ◽  
Kyoung Ju Noh ◽  
Gague Kim ◽  
Hyuntae Jeong

In this paper, we perform a systematic study of on-body sensor positioning and data acquisition details for Human Activity Recognition (HAR) systems. We build a testbed that consists of eight body-worn Inertial Measurement Unit (IMU) sensors and an Android mobile device for activity data collection. We develop a Long Short-Term Memory (LSTM) network framework to support training of a deep learning model on human activity data acquired in both real-world and controlled environments. From the experiment results, we identify that activity data sampled at a rate as low as 10 Hz from four sensors, at both wrists, the right ankle, and the waist, is sufficient to recognize Activities of Daily Living (ADLs), including eating and driving. We adopt a two-level ensemble model to combine the class probabilities of multiple sensor modalities, and demonstrate that a classifier-level sensor fusion technique can improve classification performance. By analyzing the accuracy of each sensor on different types of activity, we elaborate custom weights for multimodal sensor fusion that reflect the characteristics of individual activities.
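
The classifier-level fusion step can be sketched as a weighted combination of per-sensor class probabilities followed by an argmax. The sensor names, probabilities, and weights below are made-up illustrations, not the paper's learned values.

```python
import numpy as np

def fuse(probs_by_sensor, weights):
    """Combine each sensor classifier's class probabilities using
    per-sensor, per-class weights, then renormalize."""
    stacked = np.stack(probs_by_sensor)            # (n_sensors, n_classes)
    w = np.asarray(weights)                        # (n_sensors, n_classes)
    fused = (stacked * w).sum(axis=0)
    return fused / fused.sum()

wrist = np.array([0.7, 0.2, 0.1])   # e.g. strong at detecting "eating"
ankle = np.array([0.2, 0.5, 0.3])   # e.g. strong at gait-related classes
weights = [[2.0, 1.0, 1.0],         # trust the wrist more for class 0
           [1.0, 2.0, 1.0]]         # trust the ankle more for class 1
p = fuse([wrist, ankle], weights)
print(p.argmax())   # 0
```

Weighting each class by the sensor that recognizes it best is what lets the ensemble beat any single placement.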


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Veerraju Gampala ◽  
Praful Vijay Nandankar ◽  
M. Kathiravan ◽  
S. Karunakaran ◽  
Arun Reddy Nalla ◽  
...  

Purpose The purpose of this paper is to analyze and build a deep learning model that can furnish statistics on COVID-19 and forecast the pandemic outbreak using the Kaggle open research COVID-19 data set. Because governments maintain up-to-date COVID-19 data collections, deep learning techniques can be used to predict future outbreaks of coronavirus. The existing long short-term memory (LSTM) model is fine-tuned to forecast the outbreak of COVID-19 with better accuracy, and an empirical data exploration with advanced visualization has been carried out to understand the outbreak of coronavirus. Design/methodology/approach This research presents a fine-tuned LSTM deep learning model using three hidden layers, 200 LSTM unit cells, the ReLU activation function, the Adam optimizer, a mean squared error loss function, 200 epochs, and finally one dense layer to predict one value at a time. Findings LSTM is found to be effective in forecasting future values. Hence, the fine-tuned LSTM model produces accurate predictions when applied to the COVID-19 data set. Originality/value To the authors’ knowledge, the fine-tuned LSTM model is developed and tested for the first time on a COVID-19 data set to forecast the outbreak of the pandemic.
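
The stated configuration translates fairly directly into code; the sketch below builds it but omits the 200-epoch training loop, and the input shape (univariate daily case counts in 14-day windows) is an assumption.

```python
import torch
import torch.nn as nn

class CovidLSTM(nn.Module):
    """Three stacked LSTM layers of 200 units, a ReLU activation, and one
    dense layer predicting a single value, as specified above."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=200, num_layers=3,
                            batch_first=True)
        self.relu = nn.ReLU()
        self.dense = nn.Linear(200, 1)

    def forward(self, x):                  # x: (batch, window, 1)
        out, _ = self.lstm(x)
        return self.dense(self.relu(out[:, -1]))   # one value per sequence

model = CovidLSTM()
optimizer = torch.optim.Adam(model.parameters())
loss_fn = nn.MSELoss()                     # mean squared error, as stated
pred = model(torch.randn(5, 14, 1))        # 5 windows of 14 days (assumed)
print(pred.shape)                          # torch.Size([5, 1])
```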


Water ◽  
2020 ◽  
Vol 12 (12) ◽  
pp. 3399 ◽
Author(s):  
Sang-Soo Baek ◽  
Jongcheol Pyo ◽  
Jong Ahn Chun

A combined Convolutional Neural Network (CNN)-Long Short-Term Memory (LSTM) deep learning approach was created to simulate water level and water quality, including total nitrogen, total phosphorus, and total organic carbon. Water level and water quality data for the Nakdong river basin were collected from the Water Resources Management Information System (WAMIS) and the Real-Time Water Quality Information service, respectively. Rainfall radar images and estuary barrage operation information were also collected from the Korea Meteorological Administration. In this study, the CNN was used to simulate water level and the LSTM was used for water quality. The entire simulation period, 1 January 2016–16 November 2017, was divided into two parts: (1) calibration (1 January 2016–1 March 2017); and (2) validation (2 March 2017–16 November 2017). This study revealed that the performances of both the CNN and LSTM models were in the “very good” range, with Nash–Sutcliffe efficiency values above 0.75, and that the models represented the temporal variations of the pollutants in the Nakdong river basin (NRB) well. It is concluded that the proposed approach can be useful for accurately simulating water level and water quality.
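
The Nash–Sutcliffe efficiency used to grade the models is one minus the ratio of the model's error variance to the variance of the observations; 1.0 is a perfect fit, and values above 0.75 fall in the “very good” range cited above. The sample series below is invented to illustrate the computation.

```python
import numpy as np

def nse(obs, sim):
    """Nash–Sutcliffe efficiency of simulated values against observations."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

obs = [1.0, 2.0, 3.0, 4.0, 5.0]
print(nse(obs, obs))                         # 1.0 (perfect fit)
print(nse(obs, [1.1, 2.1, 2.9, 4.2, 4.8]))   # > 0.75 ("very good")
```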


Sensors ◽  
2020 ◽  
Vol 20 (9) ◽  
pp. 2498 ◽  
Author(s):  
Robert D. Chambers ◽  
Nathanael C. Yoder

In this paper, we present and benchmark FilterNet, a flexible deep learning architecture for time series classification tasks, such as activity recognition via multichannel sensor data. It adapts popular convolutional neural network (CNN) and long short-term memory (LSTM) motifs which have excelled in activity recognition benchmarks, implementing them in a many-to-many architecture to markedly improve frame-by-frame accuracy, event segmentation accuracy, model size, and computational efficiency. We propose several model variants, evaluate them alongside other published models using the Opportunity benchmark dataset, demonstrate the effect of model ensembling and of altering key parameters, and quantify the quality of the models’ segmentation of discrete events. We also offer recommendations for use and suggest potential model extensions. FilterNet advances the state of the art in all measured accuracy and speed metrics when applied to the benchmarked dataset, and it can be extensively customized for other applications.
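
The many-to-many idea can be sketched as a network that emits one class prediction per input frame rather than one per window, which is what enables frame-by-frame accuracy and event segmentation. The layer sizes below are illustrative, not the published FilterNet configuration; the 113 sensor channels match the commonly used Opportunity setup, and an 18-class head (17 gestures plus a null class) is assumed.

```python
import torch
import torch.nn as nn

class ManyToManyNet(nn.Module):
    """CNN-LSTM sketch that preserves the time axis end to end and
    classifies every frame of the input window."""
    def __init__(self, in_channels=113, n_classes=18, hidden=64):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, hidden, kernel_size=5, padding=2)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                   # x: (batch, channels, time)
        f = torch.relu(self.conv(x)).transpose(1, 2)
        out, _ = self.lstm(f)               # (batch, time, 2*hidden)
        return self.head(out)               # one prediction per frame

net = ManyToManyNet()
y = net(torch.randn(2, 113, 100))           # 2 windows of 100 frames
print(y.shape)                              # torch.Size([2, 100, 18])
```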

