Clickbait detection using multiple categorisation techniques

2019 ◽  
pp. 016555151987182
Author(s):  
Abinash Pujahari ◽  
Dilip Singh Sisodia

Clickbaits are online articles with deliberately misleading titles designed to lure more and more readers into opening the intended web page. Clickbaits are used to tempt visitors to click on a particular link, either to monetise the landing page or to spread false news for sensationalisation. The presence of clickbaits on any news aggregator portal may lead to an unpleasant experience for readers. Automatic detection of clickbait headlines among news headlines has been a challenging issue for the machine learning community, and many methods have been proposed for detecting clickbait articles in the recent past. However, the available clickbait detection techniques are not very robust. This article proposes a hybrid categorisation technique for separating clickbait and non-clickbait articles by integrating different features, sentence structure and clustering. During preliminary categorisation, the headlines are separated using 11 features. After that, the headlines are recategorised using sentence formality and syntactic similarity measures. In the last phase, the headlines are recategorised again by applying clustering with word vector similarity based on the t-distributed stochastic neighbour embedding (t-SNE) approach. After categorisation of these headlines, machine learning models are trained and evaluated on the resulting dataset. The experimental results indicate that, on the dataset used, the proposed hybrid model is more robust, reliable and efficient than any of the individual categorisation techniques.
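Although the abstract does not enumerate the 11 features, a first-phase split of this kind is typically driven by simple surface features of the headline. The sketch below uses a few invented examples of such features (the names and the choice of features are assumptions, not the paper's actual feature set):

```python
# Sketch of a first-phase headline split using surface features.
# The five features below are illustrative assumptions; the paper's
# actual 11 features are not enumerated in the abstract.
def headline_features(headline: str) -> dict:
    words = headline.split()
    return {
        "num_words": len(words),
        "starts_with_number": headline[:1].isdigit(),
        "has_question_mark": "?" in headline,
        "second_person": any(w.lower().strip("'!?.,") in {"you", "your"} for w in words),
        "all_caps_words": sum(1 for w in words if w.isupper() and len(w) > 1),
    }

print(headline_features("10 Things You Won't Believe About Sleep"))
```

Headlines scoring high on such features would be routed to the clickbait side of the preliminary split before the formality and clustering phases refine the result.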

2018 ◽  
Vol 18 (3-4) ◽  
pp. 623-637 ◽  
Author(s):  
ARINDAM MITRA ◽  
CHITTA BARAL

Abstract: Over the years, the Artificial Intelligence (AI) community has produced several datasets which have given machine learning algorithms the opportunity to learn various skills across various domains. However, a subclass of these machine learning algorithms, namely the Inductive Logic Programming algorithms that aim to learn logic programs, have often failed at the task due to the vastness of these datasets. This has limited the usability of knowledge representation and reasoning techniques in the development of AI systems. In this research, we address this scalability issue for algorithms that learn answer set programs. We present a sound and complete algorithm which takes its input in a slightly different manner and performs an efficient, more user-controlled search for a solution. We show via experiments that our algorithm can learn from two popular datasets from the machine learning community, namely bAbI (a question-answering dataset) and MNIST (a dataset for handwritten digit recognition), which to the best of our knowledge was not previously possible. The system is publicly available at https://goo.gl/KdWAcV.


2020 ◽  
Vol 10 (21) ◽  
pp. 7831
Author(s):  
Han Kyul Kim ◽  
Sae Won Choi ◽  
Ye Seul Bae ◽  
Jiin Choi ◽  
Hyein Kwon ◽  
...  

With growing interest in machine learning, text standardization is becoming an increasingly important aspect of data pre-processing within biomedical communities. As the performance of machine learning algorithms is affected by both the amount and the quality of their training data, effective data standardization is needed to guarantee consistent data integrity. Furthermore, biomedical organizations, depending on their geographical locations or affiliations, rely on different text standardization practices. To facilitate easier machine learning-related collaborations between these organizations, an effective yet practical text data standardization method is needed. In this paper, we introduce MARIE (a context-aware term mapping method with string matching and embedding vectors), an unsupervised learning-based tool, to find standardized clinical terminologies for queries such as a hospital’s own codes. By incorporating both string matching methods and term embedding vectors generated by BioBERT (bidirectional encoder representations from transformers for biomedical text mining), it utilizes both structural and contextual information to calculate similarity measures between source and target terms. Compared to previous term mapping methods, MARIE shows improved mapping accuracy. Furthermore, it can be easily expanded to incorporate any string matching or term embedding methods. Without requiring any additional model training, it is not only effective but also a practical term mapping method for text data standardization and pre-processing.
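A minimal sketch of such a blended score follows, assuming toy embedding vectors and an even 0.5/0.5 weighting; the real system uses BioBERT vectors and its own combination scheme, so every concrete value here is an assumption:

```python
import math
from difflib import SequenceMatcher

# Sketch of blending structural (string) and contextual (embedding)
# similarity, in the spirit of MARIE. The toy vectors and the alpha
# weighting are illustrative assumptions, not the paper's actual setup.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def combined_similarity(src, tgt, src_vec, tgt_vec, alpha=0.5):
    string_sim = SequenceMatcher(None, src.lower(), tgt.lower()).ratio()
    context_sim = cosine(src_vec, tgt_vec)
    return alpha * string_sim + (1 - alpha) * context_sim

# Toy 2-d "embeddings" stand in for BioBERT vectors here.
print(round(combined_similarity("cardiac arrest", "heart attack",
                                [0.9, 0.1], [0.8, 0.2]), 3))
```

The appeal of the design is that either component can be swapped out: any string matcher or any embedding model yields a drop-in replacement for `string_sim` or `context_sim`.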


10.28945/4319 ◽  
2019 ◽  

[This Proceedings paper was revised and published in the 2019 issue of the journal Informing Science: The International Journal of an Emerging Transdiscipline, Volume 22] Aim/Purpose: The aim of this paper is to propose an ensemble-learner-based classification model for distinguishing clickbaits from genuine article headlines. Background: Clickbaits are online articles with deliberately misleading titles designed to lure more and more readers into opening the intended web page. Clickbaits are used to tempt visitors to click on a particular link, either to monetize the landing page or to spread false news for sensationalization. The presence of clickbaits on any news aggregator portal may lead to an unpleasant experience for readers. Therefore, it is essential to distinguish clickbaits from authentic headlines to mitigate their impact on readers’ perception. Methodology: A total of one hundred thousand article headlines, consisting of clickbaits and authentic news headlines, are collected from news aggregator sites. The collected data samples are divided into five training sets of balanced and unbalanced data. Natural language processing techniques are used to extract 19 manually selected features from the article headlines. Contribution: Three ensemble learning techniques, including bagging, boosting, and random forests, are used to design a classifier model for classifying a given headline as clickbait or non-clickbait. The performance of the learners is evaluated using accuracy, precision, recall, and F-measure. Findings: It is observed that the random forest classifier detects clickbaits better than the other classifiers, with an accuracy of 91.16% and a total precision, recall, and F-measure of 91%.
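The voting step behind such ensembles can be illustrated with a majority vote over weak rules. The three rules below are invented stand-ins for trained base learners, not the paper's actual 19-feature classifiers:

```python
from collections import Counter

# Toy majority-vote ensemble in the spirit of bagging/random forests.
# Each "rule" plays the role of one trained base learner; the rules
# themselves are invented for illustration.
def rule_length(h):   return "clickbait" if len(h.split()) > 8 else "genuine"
def rule_number(h):   return "clickbait" if h[:1].isdigit() else "genuine"
def rule_pronoun(h):  return "clickbait" if "you" in h.lower().split() else "genuine"

def ensemble_vote(headline, rules=(rule_length, rule_number, rule_pronoun)):
    votes = Counter(rule(headline) for rule in rules)
    return votes.most_common(1)[0][0]

print(ensemble_vote("10 Tricks You Need To Know Before Breakfast Today"))
```

A real bagging or random forest setup would train each base learner on a bootstrap sample (and, for random forests, a random feature subset) rather than hand-write the rules, but the aggregation step is the same majority vote.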


Author(s):  
Argelia B. Urbina Nájera ◽  
Jorge De la Calleja

Abstract: In this paper, we present a method to improve academic tutoring in higher education. The method automatically identifies the main skills of tutors using decision trees, one of the most widely used algorithms in the machine learning community for solving real-world problems with high accuracy. In our study, the decision tree algorithm was able to identify skills and personal affinities between students and tutors. Experiments were carried out using a data set of 277 students and 19 tutors, selected by simple random sampling and voluntary participation, respectively. Preliminary results show that the most important attributes for tutors are communication, self-direction, and digital skills. At the same time, we introduce a tutoring process in which the tutor assignment is based on these attributes, assuming that it can help strengthen the student skills demanded by today's society. In the same way, the decision tree obtained can be used to create clusters of tutors and students based on their personal abilities and affinities using other machine learning algorithms. Applying the suggested tutoring process could set the tone for viewing tutoring individually, without linking it to academic performance or school dropout.
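The attribute ranking a decision tree performs can be illustrated by computing information gain by hand. The tutor rows, attribute values, and match labels below are invented for illustration, not the study's actual 277-student/19-tutor data:

```python
import math
from collections import Counter

# Information gain: the criterion a decision tree uses to rank attributes
# such as communication, self-direction, and digital skills.
def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    base = entropy(labels)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr_index], []).append(y)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return base - remainder

# Invented rows; columns: communication, self-direction, digital skills.
rows = [("high", "high", "low"), ("high", "low", "high"),
        ("low", "low", "high"), ("low", "high", "low")]
labels = ["good_match", "good_match", "poor_match", "poor_match"]
gains = [information_gain(rows, labels, i) for i in range(3)]
print(gains)
```

In this toy data the first attribute perfectly separates the labels (gain 1.0 bit), so a decision tree would split on it first; the study's finding that communication ranks highest reflects the same mechanism on real data.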


2021 ◽  
Vol 8 (1) ◽  
pp. 205395172110175
Author(s):  
Hendrik Heuer ◽  
Juliane Jarke ◽  
Andreas Breiter

Machine learning has become a key component of contemporary information systems. Unlike prior information systems explicitly programmed in formal languages, ML systems infer rules from data. This paper shows what this difference means for the critical analysis of socio-technical systems based on machine learning. To provide a foundation for future critical analysis of machine learning-based systems, we engage with how the term is framed and constructed in self-education resources. For this, we analyze machine learning tutorials, an important information source for self-learners and a key tool for the formation of the practices of the machine learning community. Our analysis identifies canonical examples of machine learning as well as important misconceptions and problematic framings. Our results show that machine learning is presented as being universally applicable and that the application of machine learning without special expertise is actively encouraged. Explanations of machine learning algorithms are missing or strongly limited. Meanwhile, the importance of data is vastly understated. This has implications for the manifestation of (new) social inequalities through machine learning-based systems.


Author(s):  
Dilip Singh Sisodia

Aim/Purpose: The aim of this paper is to propose an ensemble-learner-based classification model for distinguishing clickbaits from genuine article headlines. Background: Clickbaits are online articles with deliberately misleading titles designed to lure more and more readers into opening the intended web page. Clickbaits are used to tempt visitors to click on a particular link, either to monetize the landing page or to spread false news for sensationalization. The presence of clickbaits on any news aggregator portal may lead to an unpleasant experience for readers. Therefore, it is essential to distinguish clickbaits from authentic headlines to mitigate their impact on readers’ perception. Methodology: A total of one hundred thousand article headlines, consisting of clickbaits and authentic news headlines, are collected from news aggregator sites. The collected data samples are divided into five training sets of balanced and unbalanced data. Natural language processing techniques are used to extract 19 manually selected features from the article headlines. Contribution: Three ensemble learning techniques, including bagging, boosting, and random forests, are used to design a classifier model for classifying a given headline as clickbait or non-clickbait. The performance of the learners is evaluated using accuracy, precision, recall, and F-measure. Findings: It is observed that the random forest classifier detects clickbaits better than the other classifiers, with an accuracy of 91.16% and a total precision, recall, and F-measure of 91%.


2018 ◽  
Vol 2 (1) ◽  
pp. 27-36 ◽  
Author(s):  
Neil R. Smalheiser ◽  
Aaron M. Cohen

Abstract Many investigators have carried out text mining of the biomedical literature for a variety of purposes, ranging from the assignment of indexing terms to the disambiguation of author names. A common approach is to define positive and negative training examples, extract features from article metadata, and use machine learning algorithms. At present, each research group tackles each problem from scratch, in isolation of other projects, which causes redundancy and a great waste of effort. Here, we propose and describe the design of a generic platform for biomedical text mining, which can serve as a shared resource for machine learning projects and as a public repository for their outputs. We initially focus on a specific goal, namely, classifying articles according to publication type and emphasize how feature sets can be made more powerful and robust through the use of multiple, heterogeneous similarity measures as input to machine learning models. We then discuss how the generic platform can be extended to include a wide variety of other machine learning-based goals and projects and can be used as a public platform for disseminating the results of natural language processing (NLP) tools to end-users as well.
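The idea of feeding multiple heterogeneous similarity measures into a downstream learner can be sketched as follows; the two measures chosen here are illustrative, not the platform's actual feature set:

```python
from difflib import SequenceMatcher

# Sketch: turn several heterogeneous similarity measures into one feature
# vector for a machine learning model, as the proposed platform suggests.
def jaccard(a, b):
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def similarity_features(article_text, reference_text):
    return [
        jaccard(article_text, reference_text),  # token-level overlap
        SequenceMatcher(None, article_text.lower(),
                        reference_text.lower()).ratio(),  # character-level
    ]

print(similarity_features("randomized controlled trial of drug x",
                          "a randomized trial of drug x"))
```

A publication-type classifier would receive one such vector per (article, reference-profile) pair; adding a new similarity measure just appends one more dimension, which is what makes the feature set easy to extend.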


Author(s):  
Kalyani Deore ◽  
Leena Gaikwad ◽  
Rohit Dhamne ◽  
Vishal Agale ◽  
T. Bhaskar

This study helps readers understand the detection of fake news using machine learning. The main purpose of the proposed system is to build an application that identifies fake news stories within a collection of news stories, to make people aware of fake news rumours. With the help of machine learning algorithms, we can detect and separate out fake news. Nowadays, it has become harder to identify the original source of news stories, like looking for a needle in a haystack. In the modern world, news is a kind of communication that keeps us up to date on the latest events, topics, and people in the wider globe. A society relies on news for a variety of reasons, the most important of which is informing its members about events taking place in and around them that might influence them. Oral and traditional media, as well as digital communication methods, altered videos, memes, unconfirmed marketing, and social media have all contributed to the spread of rumours. As many people now use social media, they often receive wrong and misleading information and share those stories without verifying whether they are real or fake. Spreading false information on social media has become a major problem these days. That is why we need a system that can tell us whether something is fake news or not. Applications are: 1. Fake news on social media can be detected using this approach. 2. The system can help news channels broadcast only real, verified news. 3. Users can easily detect and eliminate fake articles that contain misinformation intended to mislead readers.
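As a toy illustration of such a detector, the sketch below scores a headline by word frequencies seen in labelled examples; all training headlines here are invented, and a real system would use far richer features and a trained classifier:

```python
from collections import Counter

# Toy word-frequency classifier in the spirit of a fake-news detector.
# The two training headlines per class are invented examples.
fake_docs = ["miracle cure doctors hate", "shocking secret they hide"]
real_docs = ["parliament passes budget bill", "city council approves transit plan"]

def word_counts(docs):
    counts = Counter()
    for doc in docs:
        counts.update(doc.split())
    return counts

fake_counts, real_counts = word_counts(fake_docs), word_counts(real_docs)

def classify(headline):
    words = headline.lower().split()
    fake_score = sum(fake_counts[w] for w in words)
    real_score = sum(real_counts[w] for w in words)
    return "fake" if fake_score > real_score else "real"

print(classify("shocking miracle cure"))
```

This is essentially a degenerate bag-of-words model; the machine learning algorithms mentioned in the abstract would replace the raw counts with learned weights over many thousands of labelled stories.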

