Multi-label text classification with an ensemble feature space

2021 ◽  
pp. 1-12
Author(s):  
Kushagri Tandon ◽  
Niladri Chatterjee

Multi-label text classification aims at assigning more than one class to a given text document, which makes the task both more ambiguous and more challenging. The ambiguity arises because several labels in the prescribed label set are often semantically close to each other, making clear demarcation between them difficult. As a consequence, any machine learning based approach to developing a multi-label classification scheme needs to define its feature space by choosing features beyond linguistic or semi-linguistic ones, so that the semantic closeness between the labels is also taken into account. The present work describes a feature extraction scheme in which the training document set and the prescribed label set are intertwined in a novel way to capture this ambiguity meaningfully. In particular, experiments were conducted using topic modeling and fuzzy C-means clustering, which aim at measuring the underlying uncertainty using probability-based and membership-based measures, respectively. Several nonparametric hypothesis tests establish the effectiveness of the features obtained through fuzzy C-means clustering in multi-label classification. A new algorithm is proposed for training the system for multi-label classification using the above set of features.
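To make the idea of membership-based features concrete, the following is a minimal sketch of how fuzzy C-means membership degrees could turn a document vector into a feature vector. It is not the authors' feature extraction scheme: the function name `fcm_memberships`, the two-dimensional document vector, and the three cluster centers are all hypothetical, and the fuzzifier `m = 2.0` is only a common default.

```python
import math


def fcm_memberships(x, centers, m=2.0):
    """Return the fuzzy C-means membership degree of point x in each cluster.

    Uses the standard FCM membership formula with fuzzifier m > 1.
    """
    dists = [math.dist(x, c) for c in centers]

    # A point that coincides with one or more centers belongs fully to them.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        total = sum(hits)
        return [h / total for h in hits]

    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_k / d_j) ** exponent for d_j in dists)
        for d_k in dists
    ]


# Hypothetical document embedding and cluster centers (illustration only).
doc_vector = [1.0, 0.0]
centers = [[0.0, 0.0], [2.0, 0.0], [1.0, 5.0]]

# One membership degree per cluster; the degrees sum to 1.
features = fcm_memberships(doc_vector, centers)
```

Unlike a hard cluster assignment, the resulting vector keeps the information that the document sits equally close to the first two clusters, which is the kind of ambiguity a multi-label classifier may be able to exploit.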

Electronics ◽  
2020 ◽  
Vol 9 (4) ◽  
pp. 554 ◽  
Author(s):  
Barbara Cardone ◽  
Ferdinando Di Martino

One of the main drawbacks of the well-known fuzzy C-means clustering algorithm (FCM) is the random initialization of the cluster centers, which can significantly affect the performance of the algorithm: it does not guarantee an optimal solution and can increase execution times. In this paper we propose a variation of FCM in which the initial cluster centers are obtained by running a weighted FCM algorithm whose weights are assigned by calculating a Shannon fuzzy entropy function. Comparison tests on various classification datasets from the UCI Machine Learning Repository show that our algorithm improved on the performance of FCM in all cases.
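For reference, the standard FCM formulation that such variants build on can be written as follows. This is the textbook algorithm, not the entropy-weighted variant proposed in the paper; the symbols (N data points x_i, C centers v_k, memberships u_ik, fuzzifier m > 1) are the usual notation and are assumed here rather than taken from the abstract.

```latex
% Objective minimized by standard FCM
J_m(U, V) = \sum_{i=1}^{N} \sum_{k=1}^{C} u_{ik}^{m} \, \lVert x_i - v_k \rVert^{2},
\qquad \text{subject to } \sum_{k=1}^{C} u_{ik} = 1 \ \text{for every } i .

% Alternating updates, repeated until the memberships stop changing
u_{ik} = \left[ \sum_{j=1}^{C}
  \left( \frac{\lVert x_i - v_k \rVert}{\lVert x_i - v_j \rVert} \right)^{\frac{2}{m-1}}
\right]^{-1},
\qquad
v_k = \frac{\sum_{i=1}^{N} u_{ik}^{m} \, x_i}{\sum_{i=1}^{N} u_{ik}^{m}} .
```

Because these updates only reach a local minimum of J_m, the result depends on the starting centers, which is why the choice of initialization matters.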


Algorithms ◽  
2021 ◽  
Vol 14 (9) ◽  
pp. 258
Author(s):  
Tran Dinh Khang ◽  
Manh-Kien Tran ◽  
Michael Fowler

Clustering is an unsupervised machine learning method with many practical applications that has attracted extensive research interest. It is a technique for dividing data elements into clusters such that elements in the same cluster are similar. Because clustering is unsupervised, no information about the labels of the elements is available. However, when some information about the data points is known in advance, it is beneficial to use a semi-supervised algorithm. Among the many clustering techniques available, fuzzy C-means clustering (FCM) is a common one. To make the FCM algorithm a semi-supervised method, it was proposed in the literature to use an auxiliary matrix to adjust the membership grades of the elements and force them into certain clusters during the computation. In this study, instead of using the auxiliary matrix, we propose to use multiple fuzzification coefficients to implement the semi-supervision component. After deriving the proposed semi-supervised fuzzy C-means clustering algorithm with multiple fuzzification coefficients (sSMC-FCM), we demonstrate the convergence of the algorithm and validate the efficiency of the method through a numerical example.


2022 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Krishnadas Nanath ◽  
Supriya Kaitheri ◽  
Sonia Malik ◽  
Shahid Mustafa

Purpose: The purpose of this paper is to examine the factors that significantly affect the prediction of fake news from the virality theory perspective. The paper looks at a mix of emotion-driven content, sentimental resonance, topic modeling and linguistic features of news articles to predict the probability of fake news.
Design/methodology/approach: A data set of over 12,000 articles was chosen to develop a model for fake news detection. Machine learning algorithms and natural language processing techniques were used to handle big data efficiently. Lexicon-based emotion analysis provided eight kinds of emotions used in the article text. The cluster of topics was extracted using topic modeling (five topics), while sentiment analysis provided the resonance between the title and the text. Linguistic features were added to the coding outcomes to develop a logistic regression predictive model for testing the significant variables. Other machine learning algorithms were also executed and compared.
Findings: The results revealed that positive emotions in a text lower the probability of news being fake. It was also found that sensational content, such as illegal activities and crime-related content, was associated with fake news. News whose title and text exhibit similar sentiments was found to have a lower chance of being fake. News titles with more words and content with fewer words were found to affect fake news detection significantly.
Practical implications: Several systems and social media platforms today are trying to implement fake news detection methods to filter content. This research provides parameters from a viral theory perspective that could help develop automated fake news detectors.
Originality/value: While several studies have explored fake news detection, this study uses a new perspective based on viral theory. It also introduces new parameters, such as sentimental resonance, that could help predict fake news. This study deals with an extensive data set and uses advanced natural language processing to automate the coding techniques in developing the prediction model.


2020 ◽  
Vol 2020 ◽  
pp. 1-22
Author(s):  
Yao Yang ◽  
Chengmao Wu ◽  
Yawen Li ◽  
Shaoyu Zhang

To improve the effectiveness and robustness of existing semisupervised fuzzy clustering for segmenting images corrupted by noise, this paper proposes a kernel space semisupervised fuzzy C-means clustering segmentation algorithm that combines neighborhood spatial gray information with fuzzy membership information. The mean intensity information of the neighborhood window is embedded into the objective function of the existing semisupervised fuzzy C-means clustering, and the Lagrange multiplier method is used to obtain the iterative expressions that solve the optimization problem. Meanwhile, a local Gaussian kernel function is used to map the pixel samples from the Euclidean space to a high-dimensional feature space, so that the adaptability of the clustering to different types of image segmentation is enhanced. Experimental results on different types of noisy images indicate that the proposed segmentation algorithm achieves better segmentation performance than existing typical robust fuzzy clustering algorithms and significantly enhances antinoise performance.
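Two ingredients mentioned above, the Gaussian kernel mapping and the neighborhood mean intensity, can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the paper's algorithm: the function names, the bandwidth `sigma`, and the square window of radius 1 are hypothetical choices. The distance uses the standard identity for a Gaussian kernel, where K(x, x) = 1, so the squared feature-space distance is 2(1 - K(x, v)).

```python
import math


def gaussian_kernel(x, v, sigma=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def kernel_sq_distance(x, v, sigma=1.0):
    """Squared distance between x and v in the kernel-induced feature space.

    ||phi(x) - phi(v)||^2 = K(x, x) - 2 K(x, v) + K(v, v) = 2 (1 - K(x, v)),
    because K(x, x) = 1 for the Gaussian kernel.
    """
    return 2.0 * (1.0 - gaussian_kernel(x, v, sigma))


def window_mean(image, row, col, radius=1):
    """Mean gray level of the window centered at (row, col), clipped at borders."""
    rows, cols = len(image), len(image[0])
    values = [
        image[r][c]
        for r in range(max(0, row - radius), min(rows, row + radius + 1))
        for c in range(max(0, col - radius), min(cols, col + radius + 1))
    ]
    return sum(values) / len(values)


# Hypothetical 3x3 gray image with a single bright (noisy) pixel in the middle.
image = [[0, 0, 0],
         [0, 9, 0],
         [0, 0, 0]]
center_mean = window_mean(image, 1, 1)
corner_mean = window_mean(image, 0, 0)
```

The kernel distance is bounded above by 2, so a single outlying pixel value cannot dominate the objective, and the window mean smooths an isolated noisy pixel toward its neighbors. Both effects are consistent with the robustness to noise that the abstract describes.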


Author(s):  
Duong Tran Duc ◽  
Pham Bao Son ◽  
Tan Hanh

Author profiling is the task of identifying characteristics of an author based solely on a text document. In previous work, a number of linguistic features have been exploited, such as character-based, word-based, and grammar-based features (often grouped as style-based) and content-based features (content words). Previous results showed that content-based features often achieve better results than style-based features. However, using content-based features is considered a domain-specific approach, because the content words chosen often have meanings related to the studied domain. In this work, we investigate the use of syllables and rhymes as features for author profiling of Vietnamese text. They are parts of words but carry much less meaning than words, especially the rhymes. Therefore, these features can be considered much less domain-dependent than content words. We experimented on forum post datasets using a machine learning approach. With an improvement of up to 8% over baseline results on style-based features, our method offers a promising new approach to author profiling.
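As a rough illustration of rhyme features, the sketch below splits a Vietnamese syllable into an initial consonant (onset) and the remaining rhyme, then counts rhymes in a text. It is a simplification and not the authors' feature extractor: the onset list, the treatment of "gi" and "qu" as plain onsets, and the assumption that the input is already whitespace-separated into syllables with punctuation removed are all choices made here for illustration.

```python
import unicodedata
from collections import Counter

# Initial consonants, multi-letter onsets first so "ngh" wins over "ng" and "n".
ONSETS = [
    "ngh",
    "ch", "gh", "gi", "kh", "ng", "nh", "ph", "qu", "th", "tr",
    "b", "c", "d", "đ", "g", "h", "k", "l", "m", "n", "p", "r", "s", "t", "v", "x",
]


def split_syllable(syllable):
    """Split one syllable into (onset, rhyme); the onset may be empty."""
    # Normalize so that precomposed and decomposed accents compare equal.
    s = unicodedata.normalize("NFC", syllable.lower())
    for onset in ONSETS:
        # Require a non-empty rhyme after the onset.
        if s.startswith(onset) and len(s) > len(onset):
            return onset, s[len(onset):]
    return "", s


def rhyme_counts(text):
    """Count rhymes over whitespace-separated syllables."""
    return Counter(split_syllable(token)[1] for token in text.split())


# Example syllables: "trường" (school), "nghe" (to listen), "anh" (no onset).
onset, rhyme = split_syllable("trường")
counts = rhyme_counts("nghe che anh")
```

The rhyme counts (or their relative frequencies) could then serve as a feature vector for a classifier, in the same way word counts would, while carrying far less topical content.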

