Toxic Comment Classification on Social Media Using Support Vector Machine and Chi Square Feature Selection

Social media use continues to grow, and its ease of access and familiarity make it easier for irresponsible users to act unethically, for example by spreading hatred, defamation, radicalism, and pornography. Although regulations govern activity on social media, they are not yet working effectively. In this study, we classify toxic comments containing such unethical content using a Support Vector Machine (SVM) with TF-IDF for feature extraction and Chi Square for feature selection. In our experiments, the best performance was obtained by the SVM model with a linear kernel, without Chi Square, and with stemming and stopword removal, yielding an F1-score of 76.57%.
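
As a rough illustration of the two quantities named in this abstract, the sketch below computes a Chi Square score for one term against one class from a 2x2 count table, and an F1-score from confusion counts. The function names and the count layout are assumptions made for illustration; they are not taken from the paper.

```python
def chi_square_score(n11, n10, n01, n00):
    """Chi Square statistic for one term and one class.

    n11: toxic comments that contain the term
    n10: non-toxic comments that contain the term
    n01: toxic comments that do not contain the term
    n00: non-toxic comments that do not contain the term
    """
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    # A zero denominator means a row or column is empty; score it as 0.
    return numerator / denominator if denominator else 0.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

In a feature selection step of this kind, terms would typically be ranked by their score and only the highest-scoring ones kept before training the classifier.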

ANALISIS SENTIMEN PENGGUNA GOPAY MENGGUNAKAN METODE LEXICON BASED DAN SUPPORT VECTOR MACHINE [Sentiment Analysis of GoPay Users Using the Lexicon-Based Method and Support Vector Machine]

KOMPUTEK ◽

10.24269/jkt.v3i2.270 ◽

2019 ◽

Vol 3 (2) ◽

pp. 52

Author(s):

Rachmad Mahendrajaya ◽

Ghulam Asrofi Buntoro ◽

Moh Bhanu Setyawan

Keyword(s):

Social Media ◽

Support Vector Machine ◽

Open Access ◽

Polynomial Kernel ◽

Support Vector ◽

Linear Kernel ◽

Negative Comments

Go-Pay is part of the Gojek application and one of the most popular fintech services in Indonesia. Despite its popularity, not all users comment positively; some comments are negative. Users can now voice opinions through various media, one of which is Twitter. Twitter offers a simple display, up-to-date topics, open access to tweets, and a quick way to express opinions. Given the variety of comments on Twitter, a technique is needed to divide them into positive and negative classes. This study applies preprocessing and labels opinions as positive or negative with the lexicon-based method, then performs classification with the Support Vector Machine (SVM) method. The data consist of 1,210 opinions about Go-Pay from Twitter. Lexicon-based labeling produced 923 positive and 287 negative opinions. SVM classification yields 89.17% with the linear kernel and 84.38% with the polynomial kernel.
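
Lexicon-based labeling of the kind described here can be sketched as counting positive and negative words per opinion. The word lists below are tiny hypothetical examples, and treating a tie as positive is an assumption; the abstract specifies neither the lexicon nor the tie rule.

```python
# Hypothetical mini-lexicons, for illustration only.
POSITIVE_WORDS = {"good", "fast", "easy", "great"}
NEGATIVE_WORDS = {"bad", "slow", "failed", "error"}


def lexicon_label(text):
    """Label a text by comparing counts of positive and negative words."""
    tokens = text.lower().split()
    positives = sum(token in POSITIVE_WORDS for token in tokens)
    negatives = sum(token in NEGATIVE_WORDS for token in tokens)
    # Assumption: ties (including texts with no lexicon words) count as positive.
    return "positive" if positives >= negatives else "negative"
```

The labels produced this way would then serve as training targets for the SVM classifier.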

Analysis of Sentiment of Moving a National Capital with Feature Selection Naive Bayes Algorithm and Support Vector Machine

Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) ◽

10.29207/resti.v4i3.1942 ◽

2020 ◽

Vol 4 (3) ◽

pp. 504-512

Author(s):

Faried Zamachsari ◽

Gabriel Vangeran Saragih ◽

Susafa'ati ◽

Windu Gata

Keyword(s):

Social Media ◽

Support Vector Machine ◽

Feature Selection ◽

Public Opinion ◽

Naive Bayes ◽

Naïve Bayes ◽

Capital City ◽

Support Vector ◽

National Capital ◽

Bayes Algorithm

The decision to move Indonesia's capital city to East Kalimantan received mixed responses on social media. The still-high poverty rate and the country's difficult finances are factors in disapproval of the relocation. Twitter, as one of the popular social media platforms, is used by the public to express these opinions. The goals of this study are to determine the tendency of public responses to the relocation of the national capital and to perform sentiment analysis of public opinion on the relocation using feature selection with the Naive Bayes algorithm and Support Vector Machine to obtain the highest accuracy. The sentiment analysis data are Indonesian-language tweets collected from Twitter by crawling. The search terms used are #IbuKotaBaru and #PindahIbuKota. The research stages consist of collecting data from Twitter, polarity labeling, and preprocessing, which comprises case transformation, cleansing, tokenizing, filtering, and stemming. Feature selection is applied to increase accuracy, and the data are then split into training and testing sets at predetermined ratios. Next, the Support Vector Machine and Naive Bayes methods are compared to determine which is more accurate. In the data period studied, 24.26% of sentiment about the move to a new capital city was positive and 75.74% was negative. Using RapidMiner software, the best accuracy for Naive Bayes with feature selection was 88.24% at a 9:1 ratio, while the best accuracy for Support Vector Machine with feature selection was 78.77% at a 5:5 ratio.
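
The preprocessing stages listed in this abstract (case transformation, cleansing, tokenizing, filtering) could look roughly like the sketch below. The regular expressions and the five-word stopword list are illustrative assumptions, and stemming is left out because it needs an Indonesian stemmer that the abstract does not name.

```python
import re

# Tiny illustrative Indonesian stopword list; a real study would use a full one.
STOPWORDS = {"yang", "dan", "di", "ke", "ini"}


def preprocess(tweet):
    """Apply case transformation, cleansing, tokenizing, and filtering."""
    text = tweet.lower()                                # transform case
    text = re.sub(r"https?://\S+|[@#]\w+", " ", text)   # cleansing: URLs, mentions, hashtags
    text = re.sub(r"[^a-z\s]", " ", text)               # cleansing: punctuation and digits
    tokens = text.split()                               # tokenizing
    return [t for t in tokens if t not in STOPWORDS]    # filtering (stopword removal)
```

For example, `preprocess("Pindah ibu kota ini MAHAL! #PindahIbuKota")` drops the hashtag, the punctuation, and the stopword "ini".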

CLASSIFICATION OF HIGH-DIMENSIONAL MICROARRAY DATA WITH A TWO-STEP PROCEDURE VIA A WILCOXON CRITERION AND MULTILAYER PERCEPTRON

International Journal of Computational Intelligence and Applications ◽

10.1142/s1469026811002969 ◽

2011 ◽

Vol 10 (01) ◽

pp. 1-14

Author(s):

VLADIMIR NIKULIN ◽

TIAN-HSIANG HUANG ◽

GEOFFREY J. MCLACHLAN

Keyword(s):

Data Mining ◽

Feature Selection ◽

High Dimensional ◽

Second Step ◽

Support Vector ◽

Step Procedure ◽

Leave One Out ◽

Natural Combination ◽

Feature Selection Techniques

The method presented in this paper is novel as a natural combination of two mutually dependent steps. Feature selection is a key element (first step) in our classification system, which was employed during the 2010 International RSCTC data mining (bioinformatics) Challenge. The second step may be implemented using any suitable classifier such as linear regression, support vector machine or neural networks. We conducted leave-one-out (LOO) experiments with several feature selection techniques and classifiers. Based on the LOO evaluations, we decided to use feature selection with the separation type Wilcoxon-based criterion for all final submissions. The method presented in this paper was tested successfully during the RSCTC data mining Challenge, where we achieved the top score in the Basic track.
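
To make the first step more concrete, the sketch below ranks features by a plain Wilcoxon rank-sum separation between two classes. It is a generic stand-in: the abstract does not define the authors' "separation type" criterion, so the scoring rule, the function names, and the two-class setup are all assumptions.

```python
def rank_sum(values_a, values_b):
    """Wilcoxon rank sum of group A, with average ranks for tied values."""
    pooled = sorted([(v, 0) for v in values_a] + [(v, 1) for v in values_b])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    return sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)


def separation(values_a, values_b):
    """Distance of the rank sum from its expected value under no class difference."""
    n_a, n_b = len(values_a), len(values_b)
    expected = n_a * (n_a + n_b + 1) / 2
    return abs(rank_sum(values_a, values_b) - expected)


def top_features(samples_a, samples_b, k):
    """Return indices of the k features with the largest separation."""
    scores = []
    for f in range(len(samples_a[0])):
        col_a = [sample[f] for sample in samples_a]
        col_b = [sample[f] for sample in samples_b]
        scores.append((separation(col_a, col_b), f))
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [f for _, f in scores[:k]]
```

The selected feature indices would then be passed to whichever classifier is used in the second step.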

Multiclass classification of leukemia cancer data using Fuzzy Support Vector Machine (FSVM) with feature selection using Principal Component Analysis (PCA)

Journal of Physics Conference Series ◽

10.1088/1742-6596/1725/1/012012 ◽

2021 ◽

Vol 1725 ◽

pp. 012012

Author(s):

I R Fauzi ◽

Z Rustam ◽

A Wibowo

Keyword(s):

Principal Component Analysis ◽

Support Vector Machine ◽

Feature Selection ◽

Principal Component ◽

Component Analysis ◽

Multiclass Classification ◽

Support Vector ◽

Fuzzy Support Vector Machine ◽

Cancer Data

Penggunaan media sosial dan peran orang tua terhadap kejadian pernikahan dini [Social media use and the role of parents in the incidence of early marriage]

HOLISTIK JURNAL KESEHATAN ◽

10.33024/hjk.v14i3.2794 ◽

2020 ◽

Vol 14 (3) ◽

pp. 445-451

Author(s):

Asnuddin Asnuddin ◽

Asrini Mattrah

Keyword(s):

Social Media ◽

Cross Sectional Study ◽

Early Marriage ◽

P Value ◽

Chi Square ◽

Cross Sectional ◽

Chi Square Test ◽

Role Of Parents ◽

Use Of Social Media

Social media use: The role of parents' perceptions about social media impact in early marriage. Background: Early marriage is marriage entered into during adolescence. Its contributing factors include socio-cultural factors, economic pressure, level of education, difficulty in finding a job, social media, religion, and views and beliefs. Purpose: To determine the influence of social media and the role of parents on the incidence of early marriage in Marioriawa District, Soppeng Regency. Method: Quantitative research using a descriptive analytical method with a cross-sectional study design. The social media use variable has two outcome criteria, "active" and "inactive"; the role of parents variable has two outcome criteria, "influential" and "not influential"; and the incidence of early marriage variable has two criteria, age 14-16 years and age 17-19 years. The questionnaire used had been validated by previous researchers. The data were analyzed in SPSS using the Chi Square test. Results: The Chi Square test for the social media variable gave p = 0.001 < 0.05 (α), and the test for the role of parents variable gave p = 0.022 < 0.05 (α). Conclusion: The use of social media and the role of parents have a significant influence on the incidence of early marriage. Keywords: Social media; Parents; Early marriage
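
Because each variable in this study has two categories, the Chi Square test reduces to a 2x2 table with one degree of freedom. The sketch below computes the Pearson statistic without continuity correction and its p-value; the counts in the usage note are hypothetical, since the abstract reports only p-values.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson Chi Square test for the 2x2 table [[a, b], [c, d]].

    Returns (chi2, p_value). With one degree of freedom, the upper tail
    probability equals erfc(sqrt(chi2 / 2)). No continuity correction is applied.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

With made-up counts such as `chi_square_2x2(30, 10, 15, 25)` (rows: active and inactive social media use; columns: the two age groups), the p-value falls below 0.05, which is the decision rule the abstract applies.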

Classification of good visual acuity over time in patients with branch retinal vein occlusion with macular edema using support vector machine

Graefe's Archive for Clinical and Experimental Ophthalmology ◽

10.1007/s00417-021-05455-y ◽

2021 ◽

Author(s):

Yoshitsugu Matsui ◽

Kazuya Imamura ◽

Mihiro Ooka ◽

Shinichiro Chujo ◽

Yoko Mase ◽

...

Keyword(s):

Support Vector Machine ◽

Visual Acuity ◽

Macular Edema ◽

Retinal Vein Occlusion ◽

Branch Retinal Vein Occlusion ◽

Support Vector ◽

Good Visual Acuity ◽

Vein Occlusion ◽

Over Time

Use of Social Media for Knowledge Sharing Among Students

Asian Journal of Information Science and Technology ◽

10.51983/ajist-2018.8.2.174 ◽

2018 ◽

Vol 8 (2) ◽

pp. 65-75

Author(s):

Funmilola O. Omotayo ◽

Olugboyega M. Salami

Keyword(s):

Social Media ◽

Knowledge Sharing ◽

Significant Relationship ◽

Rank Correlation ◽

Sampling Technique ◽

Analysis Data ◽

Chi Square ◽

Spearman’s Rank Correlation ◽

Use Of Social Media ◽

Set Up

The world of research requires researchers and students to share knowledge. With the invention of social media, the knowledge sharing process has become more effective and easier. This study examined the use of social media for knowledge sharing among students of the Polytechnic Ibadan, Nigeria. A descriptive survey research design was adopted, and a stratified random sampling technique was used to select the students. A total of 434 copies of the questionnaire were administered, of which 301 were retrieved and 271 were found useful for data analysis. Data were analysed using frequencies and percentage distribution, Spearman’s rank correlation, the Kruskal-Wallis test, and Chi-Square. Findings reveal that Facebook and WhatsApp are the most widely used social media tools for knowledge sharing by the students. The study found a significant relationship between social influence and attitude towards using social media for knowledge sharing, as well as a significant relationship between attitude and use of social media for knowledge sharing. The study recommends that institutions exploit the proliferation of social media and its use to set up off-class student-student and student-lecturer discussion groups, which could help encourage and promote knowledge sharing and thereby help students achieve good academic outcomes.

Tracking self-reported symptoms and medical conditions on social media during the COVID-19 pandemic (Preprint)

10.2196/preprints.29413 ◽

2021 ◽

Author(s):

Qinglan Ding ◽

Daisy Massey ◽

Chenxi Huang ◽

Connor Grady ◽

Yuan Lu ◽

...

Keyword(s):

Mental Health ◽

Infectious Disease ◽

Social Media ◽

Positive Predictive Value ◽

Medical Condition ◽

Medical Conditions ◽

Health Related ◽

The U.S ◽

Over Time

BACKGROUND Harnessing health-related data posted on social media in real-time has the potential to offer insights into how the pandemic impacts the mental health and general well-being of individuals and populations over time. OBJECTIVE The aim of this study was to obtain information on symptoms and medical conditions self-reported by non-Twitter social media users during the coronavirus disease 2019 (COVID-19) pandemic, and to determine how discussion of these symptoms and medical conditions on social media changed over time. METHODS We used natural language processing (NLP) algorithms to identify symptom and medical condition topics being discussed on social media between June 14 and December 13, 2020. The sample social media posts were geotagged by NetBase, a third-party data provider. We calculated the positive predictive value and sensitivity to validate the classification of the posts. We also assessed the frequency of different health-related discussions on social media over time during the study period, and compared the changes in the frequency of each symptom/medical condition discussion to the fluctuation of U.S. daily new COVID-19 cases during the study period. Additionally, we compared the trends of the 5 most commonly mentioned symptoms and medical conditions from June 14 to August 31 (when the U.S. passed 6 million COVID-19 cases) to the trends observed from September 1 to December 13, 2020. RESULTS Within a total of 9,807,813 posts (nearly 70% were sourced from the U.S.), we identified discussion of 120 symptom topics and 1,542 medical condition topics. Our classification of the health-related posts had a positive predictive value of over 80% and an average classification rate of 92% sensitivity. 
The 5 most commonly mentioned symptoms on social media during the study period were: anxiety (in 201,303 posts or 12.2% of the total posts mentioning symptoms), generalized pain (189,673, 11.5%), weight loss (95,793, 5.8%), fatigue (91,252, 5.5%), and coughing (86,235, 5.2%). The 5 most discussed medical conditions were: COVID-19 (in 5,420,276 posts or 66.4% of the total posts mentioning medical conditions), unspecified infectious disease (469,356, 5.8%), influenza (270,166, 3.3%), unspecified disorders of the central nervous system (253,407, 3.1%), and depression (151,752, 1.9%). The changes in the frequency of 2 medical conditions, COVID-19 and unspecified infectious disease, were similar to the fluctuation of daily new confirmed cases of COVID-19 in the U.S. CONCLUSIONS COVID-19 and symptoms of anxiety were the two most commonly discussed health-related topics on social media from June 14 to December 13, 2020. Real-time monitoring of social media posts on symptoms and medical conditions may help assess the population's mental health status and enhance public health surveillance for infectious disease.
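
The two validation measures reported in this abstract, positive predictive value and sensitivity, are simple ratios over a manually checked sample. The sketch below shows how they are computed; the function names and the counts used in the note are illustrative assumptions, not figures from the study.

```python
def positive_predictive_value(true_positives, false_positives):
    """Share of posts flagged for a topic that really discuss it."""
    return true_positives / (true_positives + false_positives)


def sensitivity(true_positives, false_negatives):
    """Share of posts that discuss a topic and were flagged for it."""
    return true_positives / (true_positives + false_negatives)
```

For instance, if 80 of 100 flagged posts were confirmed on manual review, the positive predictive value would be 0.8.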

Ensemble-Based Feature Selection With Long Short-Term Memory for Classification of Network Intrusion

Advances in Social Networking and Online Communities - E-Collaboration Technologies and Strategies for Competitive Advantage Amid Challenging Times ◽

10.4018/978-1-7998-7764-6.ch008 ◽

2021 ◽

pp. 228-245

Author(s):

Preethi D. ◽

Neelu Khare

Keyword(s):

Feature Selection ◽

Performance Metrics ◽

Short Term Memory ◽

Short Term ◽

Chi Square ◽

Term Memory ◽

Network Intrusion ◽

Proposed Model ◽

Long Short Term Memory

This chapter presents an ensemble-based feature selection with long short-term memory (LSTM) model. A deep recurrent learning model is proposed for classifying network intrusion. This model uses ensemble-based feature selection (EFS) to select the appropriate features from the dataset and long short-term memory to classify network intrusions. The EFS combines five feature selection techniques, namely information gain, gain ratio, chi-square, correlation-based feature selection, and symmetric uncertainty-based feature selection. The experiments were conducted on the standard benchmark NSL-KDD dataset and implemented using TensorFlow and Python. The proposed model is evaluated using classification performance metrics and compared with LSTM classification on all 41 features without feature selection, as well as on the features chosen by each individual feature selection technique. The performance study showed that the proposed model performs better, with 99.8% accuracy, a higher detection rate, and a lower false alarm rate.
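
One simple way to combine several feature selection techniques is to let each one vote for its top-ranked features and keep the features with enough votes. The abstract does not say how the EFS merges its five techniques, so the voting rule, the function name, and the parameters below are assumptions for illustration only.

```python
from collections import Counter


def ensemble_select(rankings, top_k, min_votes):
    """Keep features that appear in the top_k of at least min_votes rankings.

    rankings: one list per selection technique, best feature first.
    """
    votes = Counter()
    for ranking in rankings:
        votes.update(ranking[:top_k])
    return sorted(feature for feature, count in votes.items() if count >= min_votes)
```

With five rankings, one per technique, a call such as `ensemble_select(rankings, top_k=20, min_votes=3)` would keep features that a majority of techniques place in their top 20; both thresholds here are arbitrary.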

Discrimination of Gentiana and Its Related Species Using IR Spectroscopy Combined with Feature Selection and Stacked Generalization

Molecules ◽

10.3390/molecules25061442 ◽

2020 ◽

Vol 25 (6) ◽

pp. 1442 ◽

Cited By ~ 2

Author(s):

Tao Shen ◽

Hong Yu ◽

Yuan-Zhong Wang

Keyword(s):

Genetic Algorithm ◽

Support Vector Machine ◽

Feature Selection ◽

Related Species ◽

Predictive Accuracy ◽

Classification Model ◽

Venn Diagram ◽

Support Vector ◽

Stacked Generalization ◽

Svm Model

Gentiana is one of the largest genera of Gentianoideae; most of its species have potential pharmaceutical value and are applied in local traditional medical treatment. Because of the phytochemical diversity and the differences in bioactive compounds among species, it is crucial to accurately identify authentic Gentiana species. In this paper, the feasibility of using infrared spectroscopy combined with chemometric analysis to identify Gentiana and its related species was studied. A total of 180 batches of raw spectral fingerprints were obtained from 18 species of Gentiana and Tripterospermum by near-infrared (NIR: 10,000–4000 cm⁻¹) and Fourier transform mid-infrared (MIR: 4000–600 cm⁻¹) spectroscopy. Firstly, principal component analysis (PCA) was used to explore the natural grouping of the 180 samples. Secondly, random forests (RF), support vector machine (SVM), and K-nearest neighbors (KNN) models were built using the full spectra (1487 NIR variables and 1214 FT-MIR variables, respectively). Based on the results for the calibration and prediction sets, the MIR-SVM model had a higher classification accuracy than the other models. Five feature selection strategies, VIP (variable importance in the projection), Boruta, GARF (genetic algorithm combined with random forest), GASVM (genetic algorithm combined with support vector machine), and Venn diagram calculation, were used to reduce the number of variables for modeling. As a result, 101 NIR and 73 FT-MIR bands were selected as the feature variables, respectively. Thirdly, stacking models were built on the optimal spectral dataset. Most of the stacking models performed better than the full-spectra models. RF and SVM as base learners, combined with an SVM meta-classifier, formed the optimal stacked generalization strategy. For the SG-Ven-MIR-SVM model, the accuracy (ACC) on the calibration set and the validation set was 100% in both cases. Sensitivity (SE), specificity (SP), efficiency (EFF), Matthews correlation coefficient (MCC), and Cohen’s kappa coefficient (K) were all 1, which showed that the model had the optimal authenticity identification performance. These results indicate that stacked generalization combined with feature selection is probably an important technique for improving the predictive accuracy of classification models and avoiding overfitting. The study can provide a valuable reference for the safety and effectiveness of the clinical application of medicinal Gentiana.
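
Two of the metrics reported here, the Matthews correlation coefficient and Cohen's kappa, both equal 1 when every sample is classified correctly. The sketch below computes them for a two-class confusion matrix; the study itself involves 18 species, so the binary form and the function names are simplifying assumptions.

```python
import math


def matthews_corrcoef(tp, fp, fn, tn):
    """Matthews correlation coefficient for a binary confusion matrix."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return (observed - expected) / (1 - expected) if expected != 1 else 0.0
```

A perfect split such as `tp=10, fp=0, fn=0, tn=10` gives 1.0 for both functions, matching the pattern of all-ones metrics described in the abstract.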
