Bag of Words
Recently Published Documents

TOTAL DOCUMENTS: 648 (five years: 185)
H-INDEX: 28 (five years: 4)

2022 ◽  
Author(s):  
Muhammad Shaheer Mirza ◽  
Sheikh Muhammad Munaf ◽  
Shahid Ali ◽  
Fahad Azim ◽  
Saad Jawaid Khan

Abstract In order to perform their daily activities, a person is required to communicate with others. This can be a major obstacle for the deaf population of the world, who communicate using sign languages (SL). Pakistani Sign Language (PSL) is used by more than 250,000 deaf Pakistanis. Developing an SL recognition system would greatly facilitate these people. This study aimed to collect data on static and dynamic PSL alphabets and to develop a vision-based system for their recognition using Bag-of-Words (BoW) and Support Vector Machine (SVM) techniques. A total of 5,120 images for 36 static PSL alphabet signs and 353 videos with 45,224 frames for 3 dynamic PSL alphabet signs were collected from 10 native signers of PSL. The developed system used the collected data as input, resized the data to various scales, and converted the RGB images to grayscale. The resized grayscale images were segmented using a thresholding technique, and features were extracted using Speeded-Up Robust Features (SURF). The obtained SURF descriptors were clustered using K-means clustering. A BoW was obtained by computing the Euclidean distance between the SURF descriptors and the cluster centers. The codebooks were divided into training and testing sets using 5-fold cross-validation. The highest overall classification accuracy for static PSL signs was 97.80% at 750×750 image dimensions and 500 bags. For dynamic PSL signs, an accuracy of 96.53% was obtained at 480×270 video resolution and 200 bags.
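The pipeline described above (local descriptors → K-means codebook → nearest-center histograms → SVM) can be sketched as follows. SURF itself requires the non-free opencv-contrib build, so this minimal sketch substitutes synthetic descriptors; `fake_descriptors` and all sizes are illustrative assumptions, not the paper's setup.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stand-in for SURF: each "image" yields a set of 64-d descriptors.
def fake_descriptors(cls, n=30):
    return rng.normal(loc=cls, scale=0.5, size=(n, 64))

images = [(fake_descriptors(c), c) for c in (0, 1, 2) for _ in range(8)]

# 1. Build the codebook: K-means over all descriptors pooled together.
all_desc = np.vstack([d for d, _ in images])
kmeans = KMeans(n_clusters=16, n_init=4, random_state=0).fit(all_desc)

# 2. Encode each image as a histogram of nearest-center assignments (the BoW):
#    each descriptor is mapped to its Euclidean-nearest cluster center.
def bow_histogram(desc):
    words = kmeans.predict(desc)
    return np.bincount(words, minlength=kmeans.n_clusters).astype(float)

X = np.array([bow_histogram(d) for d, _ in images])
y = np.array([c for _, c in images])

# 3. Train an SVM on the BoW vectors.
clf = SVC(kernel="linear").fit(X, y)
print(clf.score(X, y))
```

In the paper, the codebook size (number of "bags") is a tuned hyperparameter; here it is fixed at 16 for brevity.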


Author(s):  
Manuel-Alejandro Sánchez-Fernández ◽  
Alfonso Medina-Urrea ◽  
Juan-Manuel Torres-Moreno

The present work studies the relationship between measures obtained from Latent Semantic Analysis (LSA) and a variant known as SPAN, and the activation and identifiability states (informative states) of referents in noun phrases found in news articles from Northwestern Mexican outlets written in Spanish. The aim and challenge is to find a strategy for labelling new/given information in discourse that is rooted in a theoretically linguistic stance. The new/given distinction can be defined from different perspectives, which vary in what linguistic forms are taken into account; this work focuses on full referential devices (n = 2,388). Pearson's r correlation tests, analysis of variance, graphical exploration of the clustering of labels, and a classification experiment with random forests were performed. The experiment used two label sets, noun phrases labeled with all 10 informative-state tags and a binary labelling, as well as two bags-of-words for each noun phrase: the interior and the exterior. It was found that LSA in conjunction with the inner bag of words can classify certain informative states. The same measure showed good results for the binary division, detecting which phrases introduce new referents into discourse. Previous work using a similar method on English noun phrases reached 80% accuracy (n = 478) in its classification exercise; our best test for Spanish reached 79%. No work on Spanish using this method has been done before, and this kind of experiment is important because Spanish exhibits a more complex inflectional morphology.
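As a reference point for LSA itself (not the paper's SPAN variant or its Spanish corpus), a minimal sketch: LSA is a truncated SVD applied to a term-document bag-of-words matrix, projecting each document into a low-dimensional latent semantic space. The toy documents below are invented.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy corpus standing in for the study's noun-phrase bags-of-words.
docs = [
    "a new senator arrived in the city",
    "the senator gave a speech",
    "a storm hit the coast",
    "the storm caused flooding on the coast",
]
X = CountVectorizer().fit_transform(docs)  # term-document counts

# LSA = truncated SVD over the bag-of-words matrix.
lsa = TruncatedSVD(n_components=2, random_state=0)
Z = lsa.fit_transform(X)
print(Z.shape)  # each document becomes a 2-d latent vector
```

Similarity between these latent vectors (rather than raw word overlap) is what such measures feed into downstream classifiers.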


Author(s):  
O. Harbuzenko ◽  
O. Piatykop
Keyword(s):  

This work is devoted to the effective detection of the sentiment of English-language posts from social networks, based on transforming words into vector representations using the Word2Vec method. The paper describes and analyzes existing sentiment-analysis methods, examines the Continuous Bag of Words (CBOW) and Skip-gram models that make up the Word2Vec method, and compares their properties in capturing semantic relationships between words of natural language. An experimental study of the use of these models under different training objectives is described.
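The difference between the two Word2Vec training regimes can be made concrete by listing the training pairs each derives from a sentence: CBOW predicts the center word from its context, while Skip-gram predicts each context word from the center. The helper below is a hypothetical illustration of pair extraction only, not a trainable model.

```python
def training_pairs(tokens, window=1, mode="cbow"):
    """Enumerate the (input, target) pairs seen during training."""
    pairs = []
    for i, center in enumerate(tokens):
        ctx = [tokens[j]
               for j in range(max(0, i - window), min(len(tokens), i + window + 1))
               if j != i]
        if not ctx:
            continue
        if mode == "cbow":
            pairs.append((tuple(ctx), center))      # context -> center word
        else:  # skip-gram
            pairs.extend((center, c) for c in ctx)  # center -> each context word
    return pairs

sent = "the movie was great".split()
print(training_pairs(sent, mode="cbow"))
print(training_pairs(sent, mode="skipgram"))
```

Skip-gram generates more pairs per sentence and tends to handle rare words better, while CBOW averages the context and trains faster, which is the trade-off such comparisons examine.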


Electronics ◽  
2021 ◽  
Vol 10 (24) ◽  
pp. 3106
Author(s):  
Tingting Han ◽  
Yuankai Qi ◽  
Suguo Zhu

Video compact representation aims to obtain a representation that reflects the kernel modes of video content and concisely describes the video. As most information in complex videos is either noisy or redundant, some researchers have instead focused on long-term video semantics. Recent video compact representation methods rely heavily on the segmentation accuracy of video semantics. In this paper, we propose a novel framework to address these challenges. Specifically, we design a novel continuous video semantic embedding model to learn the actual distribution of video words. First, an embedding model based on the Continuous Bag of Words method is proposed to learn the video embeddings, integrated with a well-designed discriminative negative sampling approach, which emphasizes the convincing clips in the embedding while weakening the influence of confusing ones. Second, an aggregated distribution pooling method is proposed to capture the semantic distribution of kernel modes in videos. Finally, our well-trained model can generate compact video representations by direct inference, which gives our model better generalization ability than previous methods. We performed extensive experiments on event detection and the mining of representative event parts. Experiments on the TRECVID MED11 and CCV datasets demonstrate the effectiveness of our method, which captures the semantic distribution of kernel modes in videos and shows strong potential to discover and better describe complex video patterns.


Author(s):  
Вера Аркадьевна Частикова ◽  
Константин Валерьевич Козачёк

An analysis of the main problems of email spam filtering, modern methods of filtering unwanted messages, and techniques for bypassing protection systems is presented. The concept of "legitimate spam" is introduced: a new problem faced by email users. Text representation methods are considered, bag-of-words and embedding spaces, as well as classification methods: artificial neural networks, support vector machines, and the naive Bayes classifier. The work identifies effective text-analysis methods for detecting various types of spam: typical spam (known to the system), spam composed using techniques that bypass spam-detection systems, and legitimate spam.
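As a minimal sketch of one combination the abstract names, bag-of-words features with a naive Bayes classifier, here is a spam/legitimate toy example; the corpus and labels are invented, not the paper's datasets.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented mini-corpus: 1 = spam, 0 = legitimate.
texts = [
    "win a free prize now", "cheap pills limited offer",
    "meeting moved to friday", "please review the attached report",
    "claim your free reward", "lunch at noon tomorrow",
]
labels = [1, 1, 0, 0, 1, 0]

# Bag-of-words counts feeding a multinomial naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["free prize offer"]))  # → [1]
```

Swapping `CountVectorizer` for pretrained embeddings is the other representation the paper compares against this baseline.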


2021 ◽  
Vol 2089 (1) ◽  
pp. 012049
Author(s):  
Lingala Thirupathi ◽  
Rekha gangula ◽  
Sandeep Ravikanti ◽  
Jujuroo Sowmya ◽  
S K Shruthi

Abstract In these modern times, where the internet has become widely popular and is used by almost everyone, anyone can share or upload articles without any credibility. False news refers to articles that are published with the intent of deliberately misleading readers. In recent times false news on the internet has increased, and it has become a major problem because it is difficult to differentiate between real and false news. False news and false posts have become more prevalent on social media sites such as Facebook and Twitter, from which news spreads like wildfire without any authentication. It can be used to sway election outcomes against certain candidates, for click-baiting, and to earn revenue by misleading users. In this paper we use natural language processing techniques such as bag of words and TF-IDF, together with machine learning classification algorithms such as SVM and the Passive Aggressive classifier, to train our machine to differentiate false news from real news, and we compare the accuracy of the methods used to find the most accurate model.
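A minimal sketch of one of the named combinations, TF-IDF features with a Passive Aggressive classifier, assuming a tiny invented dataset rather than the authors' corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import make_pipeline

# Invented mini-dataset; real work would use a labeled news corpus.
headlines = [
    "scientists confirm water found on mars surface",
    "government announces new budget for schools",
    "shocking miracle cure doctors do not want you to know",
    "you will not believe what this celebrity did",
]
labels = ["real", "real", "fake", "fake"]

# TF-IDF weighting followed by an online max-margin (Passive Aggressive) classifier.
model = make_pipeline(TfidfVectorizer(), PassiveAggressiveClassifier(random_state=0))
model.fit(headlines, labels)
print(model.predict(["miracle cure you will not believe"]))
```

The same pipeline shape works for the SVM variant by swapping in `sklearn.svm.LinearSVC`, which is how such method comparisons are usually run.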


2021 ◽  
Vol 26 (5) ◽  
pp. 453-460
Author(s):  
Krishna Chythanya Nagaraju ◽  
Cherku Ramesh Kumar Reddy

A reusable code component is one that can be used with little or no adaptation to fit into the application being developed. The major concern in such a process is maintaining these reusable components in one place, called a 'repository', so that the code components can be effectively identified and reused. Word embeddings allow us to represent our textual information numerically. They have become so pervasive that almost all Natural Language Processing projects make use of them. In this work, we use the Word2Vec model to find vector representations of the features of a reusable component. The features of a reusable component, in the form of a sequence of words, are input to the Word2Vec network. Our method using Word2Vec with Continuous Bag of Words outperforms existing methods. The proposed methodology has shown an accuracy of 94.8% in identifying existing reusable components.


Author(s):  
Asa Adadey ◽  
Robert Giannini ◽  
Lorraine B. Possanza

Abstract Background Patient safety event reports provide valuable insight into systemic safety issues but deriving insights from these reports requires computational tools to efficiently parse through large volumes of qualitative data. Natural language processing (NLP) combined with predictive learning provides an automated approach to evaluating these data and supporting the work of patient safety analysts. Objectives The objective of this study was to use NLP and machine learning techniques to develop a generalizable, scalable, and reliable approach to classifying event reports for the purpose of driving improvements in the safety and quality of patient care. Methods Datasets for 14 different labels (themes) were vectorized using a bag-of-words, tf-idf, or document embeddings approach and then applied to a series of classification algorithms via a hyperparameter grid search to derive an optimized model. Reports were also analyzed for terms strongly associated with each theme using an adjusted F-score calculation. Results F1 score for each optimized model ranged from 0.951 (“Fall”) to 0.544 (“Environment”). The bag-of-words approach proved optimal for 12 of 14 labels, and the naïve Bayes algorithm performed best for nine labels. Linear support vector machine was demonstrated as optimal for three labels and XGBoost for four of the 14 labels. Labels with more distinctly associated terms performed better than less distinct themes, as shown by a Pearson's correlation coefficient of 0.634. Conclusions We were able to demonstrate an analytical pipeline that broadly applies NLP and predictive modeling to categorize patient safety reports from multiple facilities. This pipeline allows analysts to more rapidly identify and structure information contained in patient safety data, which can enhance the evaluation and the use of this information over time.
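The described hyperparameter grid search over vectorizers and classifiers can be sketched with scikit-learn's Pipeline and GridSearchCV. The two-label toy data below is invented and far smaller than the study's 14 labels; only one classifier family is searched here for brevity.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented event reports with two theme labels.
docs = ["patient fell near bed", "wrong medication dose given",
        "fall in hallway reported", "medication error at pharmacy"] * 5
labels = ["fall", "medication", "fall", "medication"] * 5

# Search jointly over the vectorization step and a classifier hyperparameter.
pipe = Pipeline([("vec", CountVectorizer()), ("clf", MultinomialNB())])
grid = GridSearchCV(pipe,
                    {"vec": [CountVectorizer(), TfidfVectorizer()],
                     "clf__alpha": [0.1, 1.0]},
                    cv=2)
grid.fit(docs, labels)
print(grid.best_score_)
```

Extending the parameter grid with further `clf` candidates (e.g. a linear SVM or XGBoost, as in the study) follows the same pattern of pipeline-step substitution.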


2021 ◽  
Author(s):  
Craig Macdonald ◽  
Nicola Tonellotto ◽  
Sean MacAvaney
Keyword(s):  
