A Compression-Based Method for Detecting Anomalies in Textual Data

Nowadays, information and communications technology systems are fundamental assets of our social and economic model, and thus they should be properly protected against the malicious activity of cybercriminals. Defence mechanisms are generally articulated around tools that trace and store information in several ways, the simplest being the generation of plain text files known as security logs. Such log files are usually inspected, in a semi-automatic way, by security analysts to detect events that may affect system integrity, confidentiality and availability. On this basis, we propose a parameter-free method to detect security incidents from structured text regardless of its nature. We use the Normalized Compression Distance to obtain a set of features that can be used by a Support Vector Machine to classify events from a heterogeneous cybersecurity environment. In particular, we explore and validate the application of our method in four different cybersecurity domains: HTTP anomaly identification, spam detection, Domain Generation Algorithms tracking and sentiment analysis. The results obtained show the validity and flexibility of our approach in different security scenarios with a low configuration burden.
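As a rough illustration of the feature-extraction idea in this abstract, the sketch below computes the Normalized Compression Distance with Python's `zlib` and turns a text sample into a vector of distances to a few reference strings. The choice of `zlib` as the compressor, the reference strings, and the helper names `compressed_size`, `ncd` and `ncd_features` are assumptions made for this example and are not taken from the paper; the resulting vector is the kind of input that could then be handed to a Support Vector Machine.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (zlib is an assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def ncd_features(sample: bytes, references: list[bytes]) -> list[float]:
    """One feature per reference string: the NCD between the sample and that reference."""
    return [ncd(sample, ref) for ref in references]


# Hypothetical reference strings, invented for illustration only.
normal_ref = b"GET /index.html HTTP/1.1 " * 20
odd_ref = b"SELECT * FROM users WHERE name='' OR 1=1; -- " * 20
references = [normal_ref, odd_ref]

# A sample that resembles the first reference should be closer to it.
sample = b"GET /index.html HTTP/1.1 " * 20
features = ncd_features(sample, references)
print(features)
```

Smaller values mean the two strings compress well together, so the first feature here is expected to be lower than the second.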

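Analisis Sentimen Data Twitter Tentang Pasangan Capres-Cawapres Pemilu 2019 Dengan Metode Lexicon Based Dan Support Vector Machine [Sentiment Analysis of Twitter Data on the 2019 Election Presidential and Vice-Presidential Candidate Pairs Using Lexicon Based and Support Vector Machine Methods]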

Jurnal Ilmiah FIFO ◽

10.22441/fifo.2019.v11i2.004 ◽

2019 ◽

Vol 11 (2) ◽

pp. 144

Author(s):

Danar Wido Seno ◽

Arief Wibowo

Keyword(s):

Social Media ◽

Support Vector Machine ◽

Sentiment Analysis ◽

Vice President ◽

Training Data ◽

Support Vector ◽

New Words ◽

Textual Data ◽

Data Content ◽

Combination Of Methods

The growth of written content on social media has produced many new words and abbreviations on Twitter, which makes it increasingly difficult for sentiment analysis of Twitter text to reach high accuracy. In this study, the authors analyse sentiment towards the pairs of candidates for President and Vice President of Indonesia in the 2019 Elections. To obtain higher accuracy and accommodate the evolving vocabulary on Twitter, the authors combine an unsupervised and a supervised method, namely Lexicon Based and Support Vector Machine. The study used 800 data records collected from Twitter in October 2018 by searching for the names of each pair of candidates. With these 800 records, the best accuracy was 92.5%, obtained with 80% training data and 20% testing data, with per-class Precision between 85.7% and 97.2% and per-class Recall between 78.2% and 93.5%. Because the Lexicon Based method labels the dataset, the labeling for the Support Vector Machine is no longer done manually, and the lexicon dictionary can be extended as the content on Twitter evolves.
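The labeling step described in this abstract can be pictured with a minimal sketch: a lexicon assigns a polarity label to each text, and the labeled records are then split 80/20 before a supervised classifier is trained. The toy lexicon, the example texts, and the helper names `label_by_lexicon` and `split_records` are hypothetical and only illustrate the general idea; the Support Vector Machine training itself is not shown.

```python
import random

# Hypothetical toy lexicon: word -> polarity score (not the authors' dictionary).
LEXICON = {"good": 1, "great": 1, "honest": 1, "bad": -1, "corrupt": -1, "liar": -1}


def label_by_lexicon(text: str) -> str:
    """Sum the polarity of known words and map the total to a class label."""
    score = sum(LEXICON.get(token, 0) for token in text.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def split_records(records, train_fraction=0.8, seed=0):
    """Shuffle deterministically, then cut into training and testing parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Invented example texts; real input would be the collected tweets.
texts = [
    "a good and honest candidate", "corrupt liar", "election day",
    "great speech", "bad debate", "good program", "liar again",
    "honest answer", "campaign rally", "great and good team",
]
labelled = [(text, label_by_lexicon(text)) for text in texts]
train_part, test_part = split_records(labelled)
print(len(train_part), len(test_part))
```

New words can be supported by adding entries to the lexicon dictionary, without relabeling any data by hand.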

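SENTIMEN ANALISIS KEBIJAKAN GANJIL GENAP DI TOL BEKASI MENGGUNAKAN ALGORITMA NAIVE BAYES DENGAN OPTIMALISASI INFORMATION GAIN [Sentiment Analysis of the Odd-Even Policy on the Bekasi Toll Road Using the Naive Bayes Algorithm with Information Gain Optimization]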

Jurnal Pilar Nusa Mandiri ◽

10.33480/pilar.v15i2.705 ◽

2019 ◽

Vol 15 (2) ◽

pp. 247-254

Author(s):

Heru Sukma Utama ◽

Didi Rosiyadi ◽

Dedi Aridarma ◽

Bobby Suryo Prakoso

Keyword(s):

Social Media ◽

Opinion Mining ◽

Naive Bayes ◽

Information Gain ◽

Confusion Matrix ◽

Naïve Bayes ◽

Support Vector ◽

Toll Road ◽

Textual Data ◽

Bayes Algorithm

Sentiment analysis of the odd-even system on the Bekasi toll road using the Naïve Bayes algorithm is a process of understanding, extracting, and processing textual data automatically from social media. The purpose of this study was to determine the accuracy, recall and precision of opinion mining with the Naïve Bayes algorithm, in order to provide information on community sentiment on social media towards the effectiveness of the odd-even system on the Bekasi toll road. The research method was text mining of comments on posts about the odd-even system on the Bekasi toll road on Twitter, Instagram, YouTube and Facebook. The steps taken were preprocessing, transformation, data mining and evaluation, followed by information gain feature selection, selection by weight, and application of the NB model. The confusion matrix obtained with the NB model gave an accuracy of 79.55%, a precision of 80.51%, and a sensitivity (recall) of 80.91%. This study thus concludes that the Naïve Bayes algorithm can analyze sentiment about the odd-even system on the Bekasi toll road.
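For readers unfamiliar with how accuracy, precision and recall are read off a confusion matrix, a small sketch for the two-class case follows. The counts are made up for illustration and are not the study's data, and `confusion_metrics` is a hypothetical helper name.

```python
def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision and recall from the four cells of a binary confusion matrix."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,   # share of all predictions that are correct
        "precision": tp / (tp + fp),     # share of predicted positives that are correct
        "recall": tp / (tp + fn),        # share of actual positives that are found
    }


# Invented counts: 8 true positives, 2 false positives, 2 false negatives, 8 true negatives.
metrics = confusion_metrics(tp=8, fp=2, fn=2, tn=8)
print(metrics)
```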

Research on Digital Forensics Based on Uyghur Web Text Classification

Cyber Warfare and Terrorism ◽

10.4018/978-1-7998-2466-4.ch093 ◽

2020 ◽

pp. 1586-1597

Author(s):

Yasen Aizezi ◽

Anwar Jamal ◽

Ruxianguli Abudurexiti ◽

Mutalipu Muming

Keyword(s):

Mutual Information ◽

Text Classification ◽

Text Categorization ◽

Digital Forensics ◽

Feature Space ◽

Experimental Result ◽

Support Vector ◽

Web Documents ◽

Normalized Mutual Information ◽

Plain Text

This paper mainly discusses the use of mutual information (MI) and Support Vector Machines (SVMs) for Uyghur Web text classification and the digital forensics process of web text categorization: automatic classification and identification, and conversion and pretreatment of plain text based on the encoding features of various existing Uyghur Web documents. It also introduces the preparatory work for Uyghur Web text encoding. Focusing on filtering the non-Uyghur characters and stop words in the web texts, we put forward a Multi-feature Space Normalized Mutual Information (M-FNMI) algorithm that replaces the MI between a single feature and a category with the MI between an input feature combination and a category, so as to extract more accurate feature words; finally, we classify the features with a support vector machine (SVM) algorithm. The experimental result shows that this scheme has a high classification precision and can provide a criterion for digital forensics with a specific purpose.
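The single-feature baseline that this abstract starts from can be sketched as the usual term-category mutual information used in text categorization, log2 of P(t, c) / (P(t) P(c)), estimated from document counts. This is only the plain MI score, not the M-FNMI algorithm, whose details the abstract does not give; the toy documents and the function name `mutual_information` are assumptions for the example.

```python
import math


def mutual_information(docs, term, category):
    """Term-category MI, log2(P(t, c) / (P(t) * P(c))), estimated from document counts."""
    n = len(docs)
    n_t = sum(1 for tokens, _ in docs if term in tokens)
    n_c = sum(1 for _, label in docs if label == category)
    n_tc = sum(1 for tokens, label in docs if term in tokens and label == category)
    if n_tc == 0:
        return float("-inf")  # the term never occurs in this category
    return math.log2((n_tc * n) / (n_t * n_c))


# Invented toy corpus: (set of tokens, category label).
docs = [
    ({"a", "b"}, "x"),
    ({"a"}, "x"),
    ({"b"}, "y"),
    ({"c"}, "y"),
]
score = mutual_information(docs, "a", "x")
print(score)
```

A term that appears only in one category scores high for it, which is why such scores are commonly used to rank candidate feature words before classification.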

Research on Digital Forensics Based on Uyghur Web Text Classification

Digital Forensics and Forensic Investigations ◽

10.4018/978-1-7998-3025-2.ch032 ◽

2020 ◽

pp. 485-496

Author(s):

Yasen Aizezi ◽

Anwar Jamal ◽

Ruxianguli Abudurexiti ◽

Mutalipu Muming

Keyword(s):

Mutual Information ◽

Text Classification ◽

Text Categorization ◽

Digital Forensics ◽

Feature Space ◽

Experimental Result ◽

Support Vector ◽

Web Documents ◽

Normalized Mutual Information ◽

Plain Text

Ooredoo Rayek

International Journal of Technology Diffusion ◽

10.4018/ijtd.2020040105 ◽

2020 ◽

Vol 11 (2) ◽

pp. 66-81

Author(s):

Badia Klouche ◽

Sidi Mohamed Benslimane ◽

Sakina Rim Bennabi

Keyword(s):

Social Media ◽

Support Vector Machine ◽

Text Mining ◽

Sentiment Analysis ◽

Experimental Results ◽

Support Vector ◽

Textual Data ◽

New Strategy ◽

Set Up

Sentiment analysis is one of the recent areas of emerging research in the classification of sentiment polarity and text mining, particularly with the considerable number of opinions available on social media. The Algerian telephone operator Ooredoo, like other operators, aims in its new strategy to win new customers by exploiting their opinions through sentiment analysis. The purpose of this work is to set up a system called “Ooredoo Rayek”, whose objective is to collect, transliterate, translate and classify the textual data expressed by the Ooredoo operator's customers. This article developed a set of rules allowing the transliteration from Algerian Arabizi to Algerian dialect. Furthermore, the authors used Naïve Bayes (NB) and Support Vector Machine (SVM) classifiers to assign polarity tags to Facebook comments from the official pages of Ooredoo, written in a multilingual and multi-dialect context. Experimental results show that the system performs well, with an accuracy of 83%.

Predicting the Helpfulness of Online Restaurant Reviews Using Different Machine Learning Algorithms: A Case Study of Yelp

Sustainability ◽

10.3390/su11195254 ◽

2019 ◽

Vol 11 (19) ◽

pp. 5254 ◽

Cited By ~ 6

Author(s):

Yi Luo ◽

Xiaowei Xu

Keyword(s):

Latent Dirichlet Allocation ◽

Online Reviews ◽

Restaurant Industry ◽

Machine Learning Algorithms ◽

Support Vector ◽

Plain Text ◽

Sustainable Marketing ◽

Restaurant Reviews ◽

Negative Sentiment ◽

Sustainable Competitive Advantages

Helpful online reviews could be utilized to create sustainable marketing strategies in the restaurant industry, which contributes to national sustainable economic development. In this study, the main aspects (including food/taste, experience, location, and value) were extracted empirically from 294,034 reviews on Yelp.com using Latent Dirichlet Allocation (LDA), and positive and negative sentiments were assigned to each extracted aspect. Positive sentiments were associated with food/taste, while negative sentiments were associated with value. This study further shows that a robust classification algorithm based on Support Vector Machine (SVM) with a Fuzzy Domain Ontology (FDO) algorithm outperforms other traditional classification algorithms, such as Naïve Bayes (NB) and SVM ontology, in predicting the helpfulness of online reviews. This study enriches the literature on managerial aspects of sustainability by analyzing a large amount of plain text data that customers generated. The results of this study could be used as a sustainable marketing strategy for review website developers to design sophisticated, intelligent review systems by enabling customers to sort and filter helpful reviews based on their preferences. The extracted aspects and their assigned sentiment could also help restaurateurs better understand how to meet diverse customers’ needs and maintain sustainable competitive advantages.

Assessing Regression-Based Sentiment Analysis Techniques in Financial Texts

10.5753/eniac.2019.9329 ◽

2019 ◽

Cited By ~ 1

Author(s):

Taynan Ferreira ◽

Francisco Paiva ◽

Roberto Silva ◽

Angel Paula ◽

Anna Costa ◽

...

Keyword(s):

Sentiment Analysis ◽

Feature Representation ◽

Support Vector ◽

Data Set ◽

Feature Representations ◽

Textual Data ◽

Enormous Amount ◽

Financial Domain ◽

Classification Tasks ◽

The Impact

Sentiment analysis (SA) is growing in importance due to the enormous amount of opinionated textual data available today. Most research has investigated different models, feature representations and hyperparameters in SA classification tasks. However, few studies have evaluated the impact of these choices on regression SA tasks. In this paper, we conduct such an assessment on a financial domain data set by investigating different feature representations and hyperparameters in two important models: Support Vector Regression (SVR) and Convolutional Neural Networks (CNN). We conclude by presenting the most relevant feature representations and hyperparameters and how they impact outcomes on a regression SA task.

Analysis of Textual Data Based on Inductive Learning Techniques

International Journal of Information Retrieval Research ◽

10.4018/ijirr.2013040103 ◽

2013 ◽

Vol 3 (2) ◽

pp. 40-57

Author(s):

Shigeaki Sakurai

Keyword(s):

Classification Accuracy ◽

Inductive Learning ◽

Classification Model ◽

Support Vector ◽

Fuzzy Decision ◽

Fuzzy Decision Tree ◽

Named Entity ◽

High Classification Accuracy ◽

Learning Techniques ◽

Textual Data

This paper introduces knowledge discovery methods for textual data based on inductive learning techniques. The author discusses three methods for extracting features from the textual data: the first uses a key concept dictionary, the second uses a key phrase pattern dictionary, and the third uses a named entity extractor. These features are used to generate rules representing relationships between the features and text classes. The rules are described in the format of a fuzzy decision tree. These features are also used to acquire a classification model based on a Support Vector Machine (SVM). The model can classify new textual data into the text classes with high classification accuracy. Lastly, this paper introduces two application tasks based on these methods and verifies the effect of the methods.

Study on XML Retrieval Results Classification

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.263-266.1773 ◽

2012 ◽

Vol 263-266 ◽

pp. 1773-1777

Author(s):

Hong Yu ◽

Xiao Lei Huang ◽

Zhi Ling Wei ◽

Chen Xia Yang

Keyword(s):

Nearest Neighbor ◽

Classification Performance ◽

Feedback Mechanism ◽

Support Vector ◽

Svm Classifier ◽

K Nearest Neighbor ◽

Xml Retrieval ◽

Text Documents ◽

Plain Text ◽

Xml Documents

Mining (classifying or clustering) retrieval results to serve the relevance feedback mechanism of a search engine is an important way to improve retrieval effectiveness. Unlike plain text documents, XML documents are semi-structured data, so XML retrieval results classification should consider exploiting the structure features of XML documents, such as tag paths and edges. We propose to use a Support Vector Machine (SVM) classifier to classify XML retrieval results exploiting both their content and structure features. We implemented the classification method on XML retrieval results based on the IEEE SC corpus. Compared with k-nearest neighbor (KNN) classification on the same dataset in our application, SVM performs better. The experimental results also show that the use of structure features, especially tag paths and edges, can improve the classification performance significantly.
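To make the notion of structure features concrete, the sketch below walks an XML document with Python's standard `xml.etree.ElementTree` module and lists its tag paths (root-to-node) and edges (parent tag, child tag). The sample document and the function name `structure_features` are invented for this example, and how the authors actually encode these features for the SVM is not stated in the abstract.

```python
import xml.etree.ElementTree as ET


def structure_features(xml_text: str):
    """Return (tag paths, parent-child tag edges) in document order."""
    root = ET.fromstring(xml_text)
    paths, edges = [], []

    def walk(node, prefix):
        # Build the root-to-node path for the current element.
        path = f"{prefix}/{node.tag}" if prefix else node.tag
        paths.append(path)
        for child in node:
            edges.append((node.tag, child.tag))
            walk(child, path)

    walk(root, "")
    return paths, edges


# Invented sample document.
paths, edges = structure_features("<article><sec><p/></sec><bib/></article>")
print(paths)
print(edges)
```

Paths and edges obtained this way could be treated as extra tokens alongside the words of the document when building feature vectors.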

A Hybrid Approach for Sarcasm Detection

Technical Journal ◽

10.3126/tj.v1i1.27581 ◽

2019 ◽

Vol 1 (1) ◽

pp. 1-9

Author(s):

S. Luintel ◽

R.K. Sah ◽

B.R. Lamichhane

Keyword(s):

Random Forest ◽

Sentiment Analysis ◽

Hybrid Approach ◽

Weighted Average ◽

Text Summarization ◽

Support Vector ◽

Markup Language ◽

Textual Data ◽

Enormous Amount ◽

Hypertext Markup Language

User-generated textual data is growing rapidly as the number of internet and social media users increases, and it includes an enormous amount of sarcastic words, emoji and sentences. Sarcasm is a nuanced form of communication in which an individual states the opposite of what is implied, in order to insult someone, to show irritation, or to be funny. Sarcasm is considered one of the most difficult problems in sentiment analysis due to its ambiguous nature. Recognizing sarcasm in text can benefit many sentiment analysis and text summarization applications. To address the problem, several steps were adopted for sarcasm detection. Different preprocessing techniques, such as Hypertext Markup Language removal and stop word removal, were applied. Similarly, emoji and smileys were converted into their textual equivalents. The most frequent features were selected, and hybrid cascade and hybrid weighted average approaches, which combine the random forest, naïve Bayes and support vector machine algorithms, were used for sarcasm detection. A comparison of the two approaches on different bases showed that the cascade approach outperformed the weighted average approach. Moreover, a comparison of cascade approaches in terms of algorithm placement was also performed, in which random forest proved to be the best.
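The abstract does not say how the two hybrid schemes are wired together, so the following is only one plausible reading, sketched under stated assumptions: each of three classifiers outputs a probability that a text is sarcastic; the weighted average scheme blends the three scores, while the cascade scheme consults the classifiers in order and stops as soon as one is confident. The thresholds, weights, probabilities and the function names `weighted_average_vote` and `cascade_vote` are all hypothetical.

```python
def weighted_average_vote(probs, weights, threshold=0.5):
    """Blend per-classifier probabilities with fixed weights and threshold the result."""
    score = sum(p * w for p, w in zip(probs, weights)) / sum(weights)
    return score >= threshold


def cascade_vote(probs, low=0.3, high=0.7):
    """Consult classifiers in order; stop at the first confident one (assumed rule)."""
    for p in probs:
        if p >= high:
            return True
        if p <= low:
            return False
    # No stage was confident: fall back to the last classifier's plain decision.
    return probs[-1] >= 0.5


# Invented probabilities from three classifiers, in cascade order.
probs = [0.55, 0.2, 0.9]
blended = weighted_average_vote(probs, weights=[1, 1, 2])
cascaded = cascade_vote(probs)
print(blended, cascaded)
```

In a cascade the order of the classifiers matters, because an early confident stage hides the later ones; this is the kind of placement effect the abstract reports comparing.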
