Detecting and Classifying Toxic Language in Twitter using Machine Learning

Abstract: To reduce fraud from fake job postings on the internet, we use a machine learning approach to predict the likelihood that a job posting is fake, so that candidates can stay alert and make informed decisions. The model uses NLP to analyze the sentiment and patterns in the job posting and a TF-IDF vectorizer for feature extraction. We use the Synthetic Minority Oversampling Technique (SMOTE) to balance the data and a Random Forest classifier to predict the output with high accuracy; Random Forest runs efficiently even on large datasets and helps limit overfitting. The final model takes in any relevant job posting data and determines whether the job is real or fake. Keywords: Natural Language Processing (NLP), Term Frequency-Inverse Document Frequency (TF-IDF), Synthetic Minority Oversampling Technique (SMOTE), Random Forest.
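The balancing step named in the abstract can be illustrated with a minimal sketch of the core SMOTE idea: a synthetic minority sample is placed at a random position on the segment between a minority sample and one of its nearest minority-class neighbours. The function name `smote_sketch`, the parameter `k`, and the toy two-dimensional feature vectors are assumptions made for illustration; they are not taken from the paper, and library implementations of SMOTE differ in their details.

```python
import math
import random


def euclidean(p, q):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: euclidean(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy feature vectors (for example, two TF-IDF weights per posting)
fake_postings = [[0.9, 0.1], [0.8, 0.2], [1.0, 0.3]]
real_postings = [[0.1, 0.9], [0.2, 0.8], [0.0, 0.7],
                 [0.1, 0.6], [0.3, 0.9], [0.2, 0.7]]

# Generate enough synthetic fake postings to match the majority class size
new_points = smote_sketch(fake_postings, n_new=len(real_postings) - len(fake_postings))
balanced_fake = fake_postings + new_points
```

Because every synthetic point is an interpolation between two existing minority samples, the new points stay inside the region already occupied by the minority class instead of duplicating samples exactly.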


Intelligent detection of hate speech in Arabic social network: A machine learning approach

Journal of Information Science ◽

10.1177/0165551520917651 ◽

2020 ◽

pp. 016555152091765 ◽

Cited By ~ 1

Author(s):

Ibrahim Aljarah ◽

Maria Habib ◽

Neveen Hijazi ◽

Hossam Faris ◽

Raneem Qaddoura ◽

...

Keyword(s):

Machine Learning ◽

Language Processing ◽

Hate Speech ◽

Predictive Ability ◽

Support Vector ◽

Inverse Document Frequency ◽

Document Frequency ◽

Machine Learning Approach ◽

Importance Analysis ◽

Intelligent Detection

Cyber hate speech is growing and has become a serious problem worldwide, threatening the cohesion of civil societies. Hate speech involves expressions or phrases that are violent, offensive or insulting toward a person or a minority group. In the Arab region in particular, the number of social media users is growing rapidly, accompanied by a rapidly increasing rate of cyber hate speech. This motivated us to work toward healthy online environments that are free of hatred and discrimination. This article therefore aims to detect cyber hate speech in Arabic content on the Twitter platform by applying Natural Language Processing (NLP) techniques and machine learning methods. The article considers a set of tweets related to racism, journalism, sports orientation, terrorism and Islam. Several types of features and emotions are extracted and arranged in 15 different combinations of data. The processed dataset is evaluated using Support Vector Machine (SVM), Naive Bayes (NB), Decision Tree (DT) and Random Forest (RF); RF with the feature set of Term Frequency-Inverse Document Frequency (TF-IDF) and profile-related features achieves the best results. Furthermore, a feature importance analysis based on the RF classifier is conducted to quantify the predictive ability of the features with regard to the hate class.


Amrita-CEN-Senti-DB: Twitter Dataset for Sentimental Analysis and Application of Classical Machine Learning and Deep Learning

10.36227/techrxiv.12058968 ◽

2020 ◽

Author(s):

Vinayakumar R

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Language Processing ◽

Learning Algorithm ◽

Support Vector ◽

Text Data ◽

Inverse Document Frequency ◽

Logistic Regression ◽

Document Frequency

Social media platforms generate enormous amounts of text every day. The data is too large to be easily understood, which has paved the way for a new field of information technology: natural language processing. In this paper, the text data used for classification are tweets, which are classified by the sentiment they express: positive, negative or neutral. Emotions are the expression of a person's feelings and strongly influence decision-making. We propose text representations based on Term Frequency-Inverse Document Frequency (TF-IDF) and Keras embeddings, combined with machine learning and deep learning algorithms, for sentiment classification. Logistic Regression performs well when a limited number of features is used; as the number of features increases, Support Vector Machine (SVM), another machine learning algorithm, performs best, setting a benchmark accuracy of 75.8% for this dataset. The dataset has been made publicly available for research purposes.
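The abstract contrasts results with a limited number of features against results with more features, but it does not say how the feature count was limited. One common way, shown here purely as an assumption, is to keep only the most frequent terms when building the vocabulary. The helper names `build_vocabulary` and `vectorize`, the `max_features` parameter, and the toy tweets are hypothetical.

```python
from collections import Counter


def build_vocabulary(docs, max_features=None):
    """Map the most frequent terms to column indices.

    Ties in frequency are broken alphabetically so the result is deterministic.
    """
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    if max_features is not None:
        ranked = ranked[:max_features]
    # Assign column indices in alphabetical order of the kept terms
    return {term: i for i, (term, _) in enumerate(sorted(ranked))}


def vectorize(doc, vocab):
    """Count how often each vocabulary term occurs in one document."""
    vec = [0] * len(vocab)
    for tok in doc.lower().split():
        if tok in vocab:
            vec[vocab[tok]] += 1
    return vec


# Hypothetical toy tweets
tweets = ["good day good mood", "bad day", "good game"]

small_vocab = build_vocabulary(tweets, max_features=2)  # limited feature set
full_vocab = build_vocabulary(tweets)                   # every term kept

small_vector = vectorize(tweets[0], small_vocab)
full_vector = vectorize(tweets[0], full_vocab)
```

With `max_features=2` only the terms "good" and "day" survive, so each tweet becomes a two-dimensional vector; the full vocabulary yields five dimensions for the same text.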



Detection of Hate Speech and offensive Language on Sentiment Analysis using Machine Learning Techniques

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.e1985.039520 ◽

2020 ◽

Vol 9 (5) ◽

pp. 136-139

Keyword(s):

Hate Speech ◽

Machine Learning Techniques ◽

Support Vector ◽

Inverse Document Frequency ◽

Document Frequency ◽

Learning Techniques ◽

Learning Technique ◽

A New Technique ◽

Text Content ◽

Offensive Language

Toxic online content (TOC) has become a significant problem because the internet is used by people from different cultural, social, organizational and industrial backgrounds on platforms such as Twitter, Facebook, WhatsApp, Instagram and Telegram. Much work continues on single-label classification for text analysis, aiming to make it less error-prone and more efficient. In recent years, however, there has been a shift toward multi-label classification, which is applicable to both text and images. Text classification is less popular among researchers than image classification. In this work, we therefore use a dataset of short messages to train and develop a model that can assign multiple labels to a message. Hate speech and offensive language are a key challenge in the automatic detection of toxic text content. In this paper, we apply term frequency-inverse document frequency (TF-IDF) with Random Forest, Support Vector Machine (SVM) and Naïve Bayes classifiers to classify tweets automatically. After tuning, the best model achieves good accuracy on the test data. This work also moderates and encapsulates the paradigms for communication between the user and the Twitter API. Instead of traditional techniques such as bag of words or word counts, TF-IDF is used to measure similarity: the text is transformed into TF-IDF vectors, which are used together with the labels from the dataset to train the model with a supervised learning technique. The resulting model achieves good accuracy and is efficient.
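The difference between raw word counts and TF-IDF weighting can be shown with a small sketch. It uses one common smoothed form of the inverse document frequency, idf(t) = ln((1 + N) / (1 + df(t))) + 1, and cosine similarity between the resulting vectors. The exact TF-IDF variant used in the paper is not stated, so the formula, the function names `tfidf_vectors` and `cosine`, and the toy messages are all assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return the sorted vocabulary and one TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency (assumed variant)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)  # raw term counts, as in a bag of words
        vectors.append([tf[t] * idf[t] for t in vocab])
    return vocab, vectors


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical toy messages
messages = ["you are awful", "you are great", "awful awful person"]
vocab, vectors = tfidf_vectors(messages)

sim_first_second = cosine(vectors[0], vectors[1])  # share "you" and "are"
sim_second_third = cosine(vectors[1], vectors[2])  # share no terms
```

In a plain bag of words, "you" and "great" would each contribute a count of 1 to the second message. Under TF-IDF, "great" receives a larger weight because it occurs in fewer messages, so rarer and more distinctive terms dominate the similarity.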


Machine learning, waveform preprocessing and feature extraction methods for classification of acoustic startle waveforms

MethodsX ◽

10.1016/j.mex.2020.101166 ◽

2021 ◽

Vol 8 ◽

pp. 101166

Author(s):

Timothy J. Fawcett ◽

Chad S. Cooper ◽

Ryan J. Longenecker ◽

Joseph P. Walton

Keyword(s):

Machine Learning ◽

Feature Extraction ◽

Acoustic Startle ◽

Extraction Methods


Machine learning techniques for hate speech classification of twitter data: State-of-the-art, future challenges and research directions

Computer Science Review ◽

10.1016/j.cosrev.2020.100311 ◽

2020 ◽

Vol 38 ◽

pp. 100311

Author(s):

Femi Emmanuel Ayo ◽

Olusegun Folorunso ◽

Friday Thomas Ibharalu ◽

Idowu Ademola Osinuga

Keyword(s):

Machine Learning ◽

Hate Speech ◽

State Of The Art ◽

Machine Learning Techniques ◽

Research Directions ◽

Twitter Data ◽

Learning Techniques ◽

Future Challenges ◽

Speech Classification


Feature extraction and machine learning for the classification of active cropland in the Aral Sea Basin

2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS) ◽

10.1109/igarss.2017.8127326 ◽

2017 ◽

Author(s):

Dimo Dimov ◽

Fabian Low ◽

Mirzahayot Ibrakhimov ◽

Sarah Schonbrodt-Stitt ◽

Christopher Conrad

Keyword(s):

Machine Learning ◽

Feature Extraction ◽

Aral Sea ◽

Aral Sea Basin


An effective approach to feature extraction for classification of plant diseases using machine learning

Indian Journal of Science and Technology ◽

10.17485/ijst/v13i32.827 ◽

2020 ◽

Vol 13 (32) ◽

pp. 3295-3314

Author(s):

S Jeyalakshmi ◽

Keyword(s):

Machine Learning ◽

Feature Extraction ◽

Plant Diseases


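Aspect Category Classification dengan Pendekatan Machine Learning Menggunakan Dataset Bahasa Indonesia [Aspect Category Classification with a Machine Learning Approach Using an Indonesian-Language Dataset]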

Jurnal Nasional Teknik Elektro dan Teknologi Informasi (JNTETI) ◽

10.22146/jnteti.v10i3.1819 ◽

2021 ◽

Vol 10 (3) ◽

pp. 229-235

Author(s):

Syaifulloh Amien Pandega Perdana ◽

Teguh Bharata Aji ◽

Ridi Ferdiana

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Random Forest ◽

Sentiment Analysis ◽

Support Vector ◽

Term Weighting ◽

Inverse Document Frequency ◽

Term Frequency ◽

Document Frequency ◽

Bahasa Indonesia

Customer reviews are opinions about the quality of goods or services as perceived by consumers. They contain information that is useful to both consumers and providers of goods or services. The availability of large numbers of customer reviews on websites calls for a framework that extracts sentiment automatically. A single customer review often covers many aspects, so Aspect Based Sentiment Analysis (ABSA) must be used to determine the polarity of each aspect. One important task in ABSA is Aspect Category Detection. Machine learning methods for Aspect Category Detection have been widely applied in the English-language domain, but rarely in the Indonesian-language domain. This paper compares the performance of three machine learning algorithms, namely Naïve Bayes (NB), Support Vector Machine (SVM) and Random Forest (RF), on Indonesian-language customer reviews using Term Frequency-Inverse Document Frequency (TF-IDF) as term weighting. The results show that RF outperforms NB and SVM in three different domains, namely restaurants, hotels and e-commerce, with f1-scores of 84.3%, 85.7% and 89.3%, respectively.
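The f1-score reported above is the harmonic mean of precision and recall. The sketch below computes it for a single class from true and predicted labels. The function name `f1_score`, the aspect labels, and the predictions are hypothetical, and the abstract does not state how scores were averaged across categories, so only the per-class score is shown.

```python
def f1_score(y_true, y_pred, positive):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no correct positive predictions
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical aspect-category labels for five review sentences
y_true = ["food", "price", "food", "service", "food"]
y_pred = ["food", "food", "food", "service", "price"]

f1_food = f1_score(y_true, y_pred, "food")
f1_service = f1_score(y_true, y_pred, "service")
f1_price = f1_score(y_true, y_pred, "price")
```

For the "food" class there are two true positives, one false positive and one false negative, so precision and recall are both 2/3 and the f1-score is also 2/3.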
