Active Learning for Constrained Document Clustering with Uncertainty Region

Constrained clustering aims to improve accuracy and personalization by using constraints expressed by an Oracle. In this paper, a new constrained clustering algorithm is proposed in which informative data pairs are selected during an iterative process. The pairs are presented to the Oracle, which answers whether their relation is “Must-link (ML)” or “Cannot-link (CL).” In each iteration, a support vector machine (SVM) is first trained on the labels produced by the current clustering. A distance matrix is created from the distance of each document to the hyperplane, and a similarity matrix is created from the cosine similarity of the word2vec vector of each document. Two types of probability (similarity and degree of similarity) are calculated and smoothed to express how strongly a document belongs to each neighborhood. A neighborhood consists of samples that the Oracle has labeled as belonging to the same cluster. Finally, at the end of each iteration, the data with the greatest uncertainty (in terms of probability) are selected for querying the Oracle. For evaluation, the proposed method is compared with well-known state-of-the-art methods on two criteria over a standard dataset. The results show increased accuracy and stability with fewer questions.
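
The selection step described above can be illustrated with a minimal sketch. It is not the authors' algorithm: it omits the SVM distance matrix and the smoothing step, and it assumes that the probability of a document belonging to a neighborhood is its normalized mean cosine similarity to that neighborhood, with entropy as the uncertainty measure. All function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def neighborhood_probs(doc, neighborhoods, docs):
    """Assumed probability model: mean similarity to each neighborhood, normalized."""
    scores = []
    for members in neighborhoods:
        sims = [max(cosine(docs[doc], docs[m]), 0.0) for m in members]
        scores.append(sum(sims) / len(sims))
    total = sum(scores)
    if total == 0:
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]

def entropy(probs):
    """Shannon entropy; higher means the document is harder to place."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def most_uncertain(docs, neighborhoods):
    """Index of the unlabeled document with the highest entropy."""
    labeled = {m for members in neighborhoods for m in members}
    candidates = [i for i in range(len(docs)) if i not in labeled]
    return max(
        candidates,
        key=lambda i: entropy(neighborhood_probs(i, neighborhoods, docs)),
    )

# Toy document vectors (stand-ins for document embeddings).
docs = [
    [1.0, 0.0],  # 0: member of neighborhood A
    [0.0, 1.0],  # 1: member of neighborhood B
    [0.9, 0.1],  # 2: clearly close to A
    [0.5, 0.5],  # 3: equally close to A and B
]
neighborhoods = [[0], [1]]
query = most_uncertain(docs, neighborhoods)
print(query)
```

In this toy setting, the document that is equally similar to both neighborhoods is the one put to the Oracle.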


A Credit Conflict Detection Model Based on Decision Distance and Probability Matrix

Wireless Communications and Mobile Computing ◽

10.1155/2022/3795183 ◽

2022 ◽

Vol 2022 ◽

pp. 1-7

Author(s):

Xiaodong Zhang ◽

Congdong Lv ◽

Zhoubao Sun

Keyword(s):

Influencing Factors ◽

Elderly Care ◽

Distance Matrix ◽

Conflict Detection ◽

Support Vector ◽

Similarity Matrix ◽

Detection Model ◽

Model Based ◽

Internet Finance ◽

The Impact

Credit index calculation differences, semantic differences, false data, and other problems between platforms such as Internet finance, e-commerce, and health and elderly care cause credit to deviate from the trusted range of credit subjects and leave related information about credit subjects missing. In this paper, we propose a cross-platform service credit conflict detection model based on decision distance to support the migration and application of cross-platform credit information transmission and integration. First, we give a scoring table of influencing factors, in which each score is the probability that the factor affects credit. From these probabilities, the distance matrix between influencing factors is generated. Second, the similarity matrix is calculated from the distance matrix. Third, the support vector is calculated from the similarity matrix. Fourth, the credit vector is calculated from the support vector. Finally, the credibility is calculated from the credit vector and the probabilities.
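
The abstract gives the order of the steps but not the formulas, so the following is only a sketch of the data flow under assumed definitions: distance as the absolute difference between scores, similarity as one minus distance, support as the sum of a factor's similarities to the other factors, the credit vector as normalized support, and credibility as the credit-weighted sum of scores. None of these definitions is confirmed by the paper, and `credit_pipeline` is a hypothetical name.

```python
def credit_pipeline(scores):
    """Scores -> distance matrix -> similarity matrix -> support -> credit -> credibility.

    Every formula here is an illustrative assumption, not the paper's definition.
    """
    n = len(scores)
    # Assumed distance: absolute difference between two factor scores.
    distance = [[abs(scores[i] - scores[j]) for j in range(n)] for i in range(n)]
    # Assumed similarity: complement of the distance.
    similarity = [[1.0 - distance[i][j] for j in range(n)] for i in range(n)]
    # Assumed support: how much the other factors agree with factor i.
    support = [sum(similarity[i][j] for j in range(n) if j != i) for i in range(n)]
    # Assumed credit vector: support normalized to sum to 1.
    total = sum(support)
    credit = [s / total for s in support]
    # Assumed credibility: credit-weighted combination of the scores.
    credibility = sum(c * p for c, p in zip(credit, scores))
    return distance, similarity, support, credit, credibility

scores = [0.8, 0.7, 0.2]  # hypothetical scoring table entries
distance, similarity, support, credit, credibility = credit_pipeline(scores)
print(credit, credibility)
```

Under these assumptions, a factor whose score conflicts with the others (here the third) receives the smallest weight in the final credibility.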


Synthesizing Conjunctive & Disjunctive Linear Invariants by K-means++ and SVM

The International Arab Journal of Information Technology ◽

10.34028/iajit/17/6/3 ◽

2020 ◽

Vol 17 (6) ◽

pp. 847-856

Author(s):

Shengbing Ren ◽

Xiang Zhang

Keyword(s):

Software Verification ◽

State Of The Art ◽

Positive Sample ◽

Machine Learning Algorithms ◽

Support Vector ◽

Hoare Logic ◽

Excellent Performance ◽

Automated Software ◽

Inductive Invariants ◽

Linear Invariants

The problem of synthesizing adequate inductive invariants lies at the heart of automated software verification. State-of-the-art machine learning algorithms for synthesizing invariants have gradually shown their excellent performance. However, synthesizing disjunctive invariants is a difficult task. In this paper, we propose k++SVM, a method that integrates k-means++ and Support Vector Machine (SVM) to synthesize conjunctive and disjunctive invariants. First, given a program, we execute it to collect program states. Next, k++SVM uses k-means++ to cluster the positive samples and then applies SVM to separate each positive sample cluster from all negative samples, which yields the candidate invariants. Finally, a set of theories founded on Hoare logic is used to check whether the candidate invariants are true invariants. If the candidate invariants fail the check, we sample more states and repeat the algorithm. The experimental results show that k++SVM is compatible with the algorithms for Intersection Of Half-space (IOH) and more efficient than the Interproc tool. Furthermore, they show that our method can synthesize conjunctive and disjunctive invariants automatically.
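
As a rough illustration of the clustering stage only, the sketch below shows standard k-means++ seeding followed by a nearest-center assignment of positive samples. The SVM separation and the Hoare-logic check are omitted, and the program states, function names, and choice of two clusters are assumptions made for the example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pp_seeds(points, k, rng):
    """k-means++ seeding: each new center is drawn with probability
    proportional to its squared distance from the nearest chosen center."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
        for p in points
    ]

# Hypothetical positive program states (x, y) forming two separate regions.
states = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
seeds = kmeans_pp_seeds(states, k=2, rng=random.Random(0))
labels = assign(states, seeds)
print(seeds, labels)
```

Each resulting cluster would then be separated from the negative samples by its own classifier, which is how a disjunction of candidate invariants can arise.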


Enhanced clustering algorithm based on fuzzy C-means and support vector machine

Journal of Computer Applications ◽

10.3724/sp.j.1087.2013.00991 ◽

2013 ◽

Vol 33 (4) ◽

pp. 991-993

Author(s):

Lei HU ◽

Qinzhou NIU ◽

Yan CHEN

Keyword(s):

Support Vector Machine ◽

Clustering Algorithm ◽

Support Vector ◽

Fuzzy C Means


Application of Machine Learning in Animal Disease Analysis and Prediction

Current Bioinformatics ◽

10.2174/1574893615999200728195613 ◽

2020 ◽

Vol 15 ◽

Author(s):

Shuwen Zhang ◽

Qiang Su ◽

Qin Chen

Keyword(s):

Machine Learning ◽

Unsupervised Learning ◽

Supervised Learning ◽

Clustering Algorithm ◽

Principal Component ◽

Support Vector ◽

Animal Disease ◽

Human Beings ◽

Animal Diseases ◽

Disease Analysis

Major animal diseases pose a great threat to animal husbandry and human beings. With the deepening of globalization and the abundance of data resources, the prediction and analysis of animal diseases using big data are becoming increasingly important. The focus of machine learning is to make computers learn from data and use the learned experience to analyze and predict. This paper first introduces the animal epidemic situation and machine learning. It then briefly introduces the application of machine learning in animal disease analysis and prediction. Machine learning is mainly divided into supervised learning and unsupervised learning. Supervised learning includes support vector machines, naive Bayes, decision trees, random forests, logistic regression, artificial neural networks, deep learning, and AdaBoost. Unsupervised learning includes the expectation-maximization algorithm, principal component analysis, hierarchical clustering, and MaxEnt. Through this discussion, readers gain a clearer concept of machine learning and understand its application prospects in animal diseases.


Segmentation of SAR Image using Fuzzy C-Means and Filters

Science & Technology Journal ◽

10.22232/stj.2020.08.01.11 ◽

2020 ◽

Vol 8 (1) ◽

pp. 84-90

Author(s):

R. Lalchhanhima ◽

Debdatta Kandar ◽

R. Chawngsangpuii ◽

Vanlalmuansangi Khenglawt ◽

...

Keyword(s):

Clustering Algorithm ◽

State Of The Art ◽

Speckle Noise ◽

Synthetic Aperture Radar Image ◽

Synthetic Aperture ◽

Sar Image ◽

Spatial Filters ◽

Fuzzy C Means ◽

Automatic Clustering ◽

Intensity Information

Fuzzy C-Means (FCM) is an unsupervised clustering algorithm for the automatic clustering of data. Synthetic Aperture Radar image segmentation has been a challenging task because of the presence of speckle noise. Therefore, the segmentation process cannot rely directly on the intensity information alone but must consider several derived features in order to obtain satisfactory segmentation results. In this paper, we attempt to use the fuzzy nature of classification for unsupervised region segmentation, in which FCM is employed. Different features are obtained by filtering the image with different spatial filters and are selected as segmentation criteria. The segmentation performance is measured by accuracy and compared with different recently proposed state-of-the-art techniques.
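
For readers unfamiliar with FCM, the following is a minimal sketch of the standard membership and center updates on one-dimensional intensity values. It is not the paper's pipeline: the spatial-filter features are left out, and the sample values, initial centers, fuzzifier m = 2, and function names are assumptions.

```python
def memberships_for(values, centers, m=2.0, eps=1e-12):
    """Standard FCM membership: u_i(x) = 1 / sum_k (d_i / d_k) ** (2 / (m - 1))."""
    power = 2.0 / (m - 1.0)
    rows = []
    for x in values:
        # eps keeps the ratio defined when a value coincides with a center.
        dists = [abs(x - c) + eps for c in centers]
        rows.append(
            [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]
        )
    return rows

def fcm_1d(values, centers, m=2.0, iters=50):
    """Alternate membership and center updates for a fixed number of iterations."""
    centers = list(centers)
    for _ in range(iters):
        u = memberships_for(values, centers, m)
        # Each center is the mean of the values weighted by membership ** m.
        centers = [
            sum(row[i] ** m * x for row, x in zip(u, values))
            / sum(row[i] ** m for row in u)
            for i in range(len(centers))
        ]
    return centers, memberships_for(values, centers, m)

# Hypothetical normalized pixel intensities: a dark region and a bright region.
pixels = [0.10, 0.20, 0.15, 0.90, 0.85, 0.95]
centers, u = fcm_1d(pixels, centers=[0.0, 1.0])
print(centers)
```

Unlike hard k-means, every pixel keeps a degree of membership in each cluster, which is the "fuzzy nature" the abstract refers to.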


Similarity detection of English text and teaching evaluation based on improved TCUSS clustering algorithm

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-189576 ◽

2020 ◽

pp. 1-11

Author(s):

Yu Wang

Keyword(s):

Language Processing ◽

Calculation Method ◽

Clustering Algorithm ◽

Research Result ◽

Structural Features ◽

English Text ◽

Support Vector ◽

Similarity Calculation ◽

Other Information ◽

Calculation Task

The semantic similarity calculation task for English text has an important influence on other fields of natural language processing and has high research value and application prospects. At present, research on the similarity calculation of short texts has achieved good results, but results on long text sets are still poor. This paper proposes a similarity calculation method that combines planar features with structured features and uses support vector regression models. Moreover, this paper uses PST and PDT to represent the syntax, semantics, and other information of the text. In addition, using two structural features suitable for text similarity calculation, this paper proposes a similarity calculation method that combines structural features with a Tree-LSTM model. Experiments show that this method provides a new idea for interest network extraction.


Pinball Loss Twin Support Vector Clustering

ACM Transactions on Multimedia Computing Communications and Applications ◽

10.1145/3409264 ◽

2021 ◽

Vol 17 (2s) ◽

pp. 1-23

Author(s):

M. Tanveer ◽

Tarun Gupta ◽

Miten Shah ◽

Keyword(s):

Loss Function ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Structural Mri ◽

Twin Support Vector Machine ◽

Support Vector ◽

Support Vector Clustering ◽

Hinge Loss ◽

Pinball Loss ◽

Vector Clustering

Twin Support Vector Clustering (TWSVC) is a clustering algorithm inspired by the principles of Twin Support Vector Machine (TWSVM). TWSVC has already outperformed other traditional plane-based clustering algorithms. However, TWSVC uses hinge loss, which maximizes the shortest distance between clusters and hence suffers from noise sensitivity and low re-sampling stability. In this article, we propose Pinball loss Twin Support Vector Clustering (pinTSVC) as a clustering algorithm. The proposed pinTSVC model incorporates the pinball loss function in the plane clustering formulation. The pinball loss function introduces favorable properties such as noise insensitivity and re-sampling stability. The time complexity of the proposed pinTSVC remains equivalent to that of TWSVC. Extensive numerical experiments on noise-corrupted benchmark UCI and artificial datasets are provided. Results of the proposed pinTSVC model are compared with TWSVC, Twin Bounded Support Vector Clustering (TBSVC), and Fuzzy c-means clustering (FCM). Detailed and exhaustive comparisons demonstrate the better performance and generalization of the proposed pinTSVC on noise-corrupted datasets. Further experiments and analysis of the above-mentioned clustering algorithms on structural MRI (sMRI) images taken from the ADNI database, face clustering, and facial expression clustering demonstrate the effectiveness and feasibility of the proposed pinTSVC model.
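
The difference between the two losses can be shown with a short sketch. It uses the form of the pinball loss common in the SVM literature, applied to a margin residual u; how pinTSVC embeds this loss in its plane clustering formulation is not given in the abstract, and the value tau = 0.5 is an arbitrary choice for the example.

```python
def hinge(u):
    """Hinge loss on a margin residual u: only positive residuals are penalized."""
    return max(0.0, u)

def pinball(u, tau=0.5):
    """Pinball loss: positive residuals cost u, negative residuals cost tau * |u|."""
    return u if u >= 0 else -tau * u

residuals = [-2.0, -0.5, 0.0, 1.0]
for u in residuals:
    print(u, hinge(u), pinball(u))
```

Because the pinball loss also penalizes negative residuals, the solution depends on the spread of all points rather than only on those nearest the boundary, which is the usual explanation for its lower sensitivity to noise.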


A Dataset of Photos and Videos for Digital Forensics Analysis Using Machine Learning Processing

Data ◽

10.3390/data6080087 ◽

2021 ◽

Vol 6 (8) ◽

pp. 87

Author(s):

Sara Ferreira ◽

Mário Antunes ◽

Manuel E. Correia

Keyword(s):

Machine Learning ◽

Digital Forensics ◽

State Of The Art ◽

Forensic Analysis ◽

Third Party ◽

Support Vector ◽

Multimedia Content ◽

Digital Forensic ◽

Video Frames ◽

Forensic Tools

Deepfake and manipulated digital photos and videos are increasingly used in a myriad of cybercrimes. Ransomware, the dissemination of fake news, and digital kidnapping-related crimes are the most recurrent, with tampered multimedia content as the primary vehicle of dissemination. Digital forensic analysis tools are widely used in criminal investigations to automate the identification of digital evidence in seized electronic equipment. The number of files to be processed and the complexity of the crimes under analysis have highlighted the need for efficient digital forensics techniques grounded in state-of-the-art technologies. Machine Learning (ML) researchers have been challenged to apply techniques and methods to improve the automatic detection of manipulated multimedia content. However, such methods have not yet been widely incorporated into digital forensic tools, mostly due to the lack of realistic and well-structured datasets of photos and videos. The diversity and richness of the datasets are crucial for benchmarking ML models and evaluating their suitability for real-world digital forensics applications. An example is the development of third-party modules for the widely used Autopsy digital forensic application. This paper presents a dataset obtained by extracting a set of simple features from genuine and manipulated photos and videos that are part of existing state-of-the-art datasets. The resulting dataset is balanced, and each entry comprises a label and a vector of numeric values corresponding to the features extracted through a Discrete Fourier Transform (DFT). The dataset is available in a GitHub repository, and the total numbers of photos and video frames are 40,588 and 12,400, respectively.
The dataset was validated and benchmarked with deep learning Convolutional Neural Network (CNN) and Support Vector Machine (SVM) methods; however, many other existing methods can be applied. In general, the results show a better F1-score for CNN than for SVM, for both photo and video processing. CNN achieved an F1-score of 0.9968 and 0.8415 for photos and videos, respectively. For SVM, the results obtained with 5-fold cross-validation are 0.9953 and 0.7955 for photo and video processing, respectively. A set of methods written in Python is available to researchers, namely to preprocess and extract the features from the original photo and video files and to build the training and testing sets. Additional methods are also available to convert the original PKL files into CSV and TXT, which gives ML researchers more flexibility to use the dataset with existing ML frameworks and tools.
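
To make the idea of DFT-derived features concrete, here is a minimal sketch that computes the magnitude spectrum of a one-dimensional signal. It is not the dataset's extraction code: the actual features come from photos and video frames, and the input signal and the function name `dft_magnitudes` are assumptions for illustration.

```python
import cmath

def dft_magnitudes(signal):
    """Magnitude of each Discrete Fourier Transform coefficient:
    X_k = sum_t x_t * exp(-2j * pi * k * t / n)."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n)
    ]

# A hypothetical row of pixel values that alternates with period 4.
features = dft_magnitudes([1.0, 0.0, -1.0, 0.0])
print([round(f, 6) for f in features])
```

A vector of such magnitudes, paired with a genuine or manipulated label, is the kind of numeric entry a classifier such as an SVM could consume.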


New Multi-View Classification Method with Uncertain Data

ACM Transactions on Knowledge Discovery from Data ◽

10.1145/3458282 ◽

2021 ◽

Vol 16 (1) ◽

pp. 1-23

Author(s):

Bo Liu ◽

Haowen Zhong ◽

Yanshan Xiao

Keyword(s):

Learning Strategy ◽

State Of The Art ◽

Uncertain Data ◽

Real Life ◽

Support Vector ◽

Classification Methods ◽

Complementary Information ◽

Novel Approach ◽

Svm Model ◽

Iterative Framework

Multi-view classification aims to design a multi-view learning strategy for training a classifier from multi-view data, which are easily collected in practice. Most existing works on multi-view classification assume that the multi-view data are collected with precise information. However, in real-life applications the collected multi-view data are often uncertain because the collection process is corrupted by noise. This article proposes a novel approach, called uncertain multi-view learning with support vector machine (UMV-SVM), to cope with the problem of multi-view learning with uncertain data. The method first enforces agreement among all the views to seek complementary information in the multi-view data and takes the uncertainty of the data into consideration by modeling the reachability area of the noise. It then proposes an iterative framework to solve the UMV-SVM model and obtain the multi-view classifier for prediction. Extensive experiments on real-life datasets show that the proposed UMV-SVM achieves better performance for uncertain multi-view classification than state-of-the-art multi-view classification methods.


Malaria parasite detection in thick blood smear microscopic images using modified YOLOV3 and YOLOV4 models

BMC Bioinformatics ◽

10.1186/s12859-021-04036-4 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Fetulhak Abdurahman ◽

Kinde Anlay Fante ◽

Mohammed Aliy

Keyword(s):

Object Detection ◽

Malaria Parasite ◽

Blood Smear ◽

Clustering Algorithm ◽

State Of The Art ◽

Detection Accuracy ◽

Small Object ◽

Thick Blood Smear ◽

Malaria Parasites ◽

Microscopic Images

Background: Manual microscopic examination of Leishman/Giemsa stained thin and thick blood smears is still the “gold standard” for malaria diagnosis. One of the drawbacks of this method is that its accuracy, consistency, and diagnosis speed depend on microscopists’ diagnostic and technical skills. It is difficult to find highly skilled microscopists in remote areas of developing countries. To alleviate this problem, in this paper, we investigate state-of-the-art one-stage and two-stage object detection algorithms for automated malaria parasite screening from microscopic images of thick blood slides.
Results: YOLOV3 and YOLOV4 models, which are state-of-the-art object detectors in accuracy and speed, are not optimized for detecting small objects such as malaria parasites in microscopic images. We modify these models by increasing the feature scale and adding more detection layers to enhance their capability of detecting small objects without notably decreasing detection speed. We propose one modified YOLOV4 model, called YOLOV4-MOD, and two modified YOLOV3 models, called YOLOV3-MOD1 and YOLOV3-MOD2. In addition, new anchor box sizes are generated using the K-means clustering algorithm to exploit the potential of these models in small object detection. The performance of the modified YOLOV3 and YOLOV4 models was evaluated on a publicly available malaria dataset. These models achieved state-of-the-art accuracy, exceeding the performance of their original versions, Faster R-CNN, and SSD in terms of mean average precision (mAP), recall, precision, F1 score, and average IOU. YOLOV4-MOD achieved the best detection accuracy among all the models, with a mAP of 96.32%. YOLOV3-MOD2 and YOLOV3-MOD1 achieved mAPs of 96.14% and 95.46%, respectively.
Conclusions: The experimental results of this study demonstrate that the performance of the modified YOLOV3 and YOLOV4 models is highly promising for detecting malaria parasites in images captured by a smartphone camera over the microscope eyepiece. The proposed system is suitable for deployment in low-resource settings.
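
The anchor box step mentioned in the abstract can be sketched as follows. The abstract says only that K-means is used; the 1 - IoU style assignment below is the variant commonly used with YOLO models and is an assumption here, as are the box sizes, the initial anchors, and the function names.

```python
def iou_wh(a, b):
    """IoU of two boxes given as (width, height), aligned at a common corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def kmeans_anchors(boxes, anchors, iters=20):
    """K-means over box sizes, assigning each box to the anchor with highest IoU."""
    anchors = list(anchors)
    for _ in range(iters):
        groups = [[] for _ in anchors]
        for box in boxes:
            best = max(range(len(anchors)), key=lambda i: iou_wh(box, anchors[i]))
            groups[best].append(box)
        # New anchor = mean width and height of its group (unchanged if empty).
        anchors = [
            (sum(w for w, _ in g) / len(g), sum(h for _, h in g) / len(g)) if g else a
            for g, a in zip(groups, anchors)
        ]
    return anchors

# Hypothetical ground-truth box sizes in pixels: small and larger objects.
boxes = [(10, 12), (12, 10), (11, 11), (40, 42), (42, 40), (41, 41)]
anchors = kmeans_anchors(boxes, anchors=[(10, 10), (40, 40)])
print(anchors)
```

Fitting anchors to the dataset's own box sizes gives the detector starting shapes that already match small objects such as parasites.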
