CLASSIFICATION ALGORITHMS FOR BIG DATA ANALYSIS, A MAP REDUCE APPROACH

Author(s):  
V. A. Ayma ◽  
R. S. Ferreira ◽  
P. Happ ◽  
D. Oliveira ◽  
R. Feitosa ◽  
...  

For many years, the scientific community has been concerned with increasing the accuracy of different classification methods, and major achievements have been made so far. Beyond this issue, the increasing amount of data generated every day by remote sensors raises further challenges to be overcome. In this work, a tool within the scope of the <i>InterIMAGE Cloud Platform (ICP)</i>, an open-source, distributed framework for automatic image interpretation, is presented. The tool, named <i>ICP: Data Mining Package</i>, is able to perform supervised classification procedures on huge amounts of data, usually referred to as <i>big data</i>, on a distributed infrastructure using Hadoop MapReduce. The tool has four classification algorithms implemented, taken from WEKA’s machine learning library, namely: Decision Trees, Naïve Bayes, Random Forest and Support Vector Machines (SVM). The results of an experimental analysis using an SVM classifier on data sets of different sizes and different cluster configurations demonstrate the potential of the tool, as well as the aspects that affect its performance.
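The map/reduce split the abstract describes can be sketched on a single machine. The following is an illustrative Python/scikit-learn pattern, not the actual ICP code (which runs WEKA classifiers inside Hadoop); the Decision Tree, the synthetic data, and the array splits standing in for HDFS blocks are all assumptions for the sketch:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a large remote-sensing feature table.
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

# Driver side: train once on a labeled sample (a Decision Tree here,
# one of the four WEKA algorithms the tool exposes).
model = DecisionTreeClassifier(random_state=0).fit(X[:200], y[:200])

def map_phase(split):
    """Each mapper classifies its own split of the data independently."""
    return model.predict(split)

def reduce_phase(partials):
    """The reducer concatenates the partial label vectors."""
    return np.concatenate(partials)

splits = np.array_split(X[200:], 4)          # stand-in for HDFS blocks
labels = reduce_phase([map_phase(s) for s in splits])
```

Because inference on each block is independent, the map phase parallelizes trivially, which is why supervised classification fits MapReduce so naturally.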

2021 ◽  
Vol 2021 ◽  
pp. 1-12
Author(s):  
Yixue Zhu ◽  
Boyue Chai

With the development of increasingly advanced information technology and electronic technology, particularly physical information systems, cloud computing systems, and social services, big data is becoming ubiquitous, creating benefits for people while also posing huge challenges. Moreover, with the advent of the era of big data, data sets are growing ever larger. Traditional data analysis methods can no longer handle large-scale data sets, and mining the hidden information behind big data, especially in the field of e-commerce, has become a key factor in competition among enterprises. We use a support vector machine (SVM) method based on parallel computing to analyze the data. First, the training samples are divided into several working subsets using the SOM self-organizing neural network classification method. SVMs are then trained on the working subsets in parallel, and the training results of each working set are finally merged, so that massive data prediction and analysis problems can be handled quickly. This paper argues that big data offers the flexibility to extend a quality assessment system, so it is meaningful to use big data to overcome the one-sidedness of quality assessment. Finally, considering the excellent performance of parallel support vector machines in data mining and analysis, we apply this method to the big data analysis of e-commerce. The research results show that parallel support vector machines can solve the problem of processing large-scale data sets, and that the handling of dirty data increased the effective rate by at least 70%.
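The partition/train-in-parallel/merge scheme described above can be sketched as follows. This is only an illustration of the pattern: KMeans stands in for the SOM partitioner, joblib's threading backend stands in for a real parallel cluster, and the data is synthetic:

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.cluster import KMeans              # stand-in for the SOM step
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=10, random_state=1)

# 1. Divide the training samples into working subsets.
km = KMeans(n_clusters=4, n_init=10, random_state=1).fit(X)
part = km.labels_

# 2. Train one SVM per working subset, in parallel.
def train(k):
    Xi, yi = X[part == k], y[part == k]
    if np.unique(yi).size < 2:                  # degenerate subset guard
        return ("const", int(yi[0]))
    return ("svm", SVC(kernel="rbf").fit(Xi, yi))

models = Parallel(n_jobs=2, backend="threading")(
    delayed(train)(k) for k in range(4))

# 3. Merge: route each new point to the model of its own subset.
def predict(Xnew):
    ks = km.predict(Xnew)
    out = np.empty(len(Xnew), dtype=int)
    for k, (kind, mdl) in enumerate(models):
        sel = ks == k
        if sel.any():
            out[sel] = mdl if kind == "const" else mdl.predict(Xnew[sel])
    return out

preds = predict(X[:100])
```

The speed-up comes from SVM training being super-linear in the number of samples: four SVMs on quarter-sized subsets cost far less than one SVM on the whole set.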


2018 ◽  
Vol 7 (2.31) ◽  
pp. 190 ◽  
Author(s):  
S Belina V.J. Sara ◽  
K Kalaiselvi

Kidney disease and kidney failure are among the most complicated and challenging human health issues. Some diseases show no symptoms and are detected only in later stages, resulting in dialysis. Advanced mining technologies can offer various possibilities to deal with this situation by determining important relations and associations when drilling down into health-related data. The prediction accuracy of classification algorithms depends upon appropriate Feature Selection (FS) algorithms, which decrease the number of features in a data set. FS is the procedure of choosing the most relevant features and removing irrelevant ones. To identify Chronic Kidney Disease (CKD), a Hybrid Wrapper and Filter based FS (HWFFS) algorithm is proposed to reduce the dimension of the CKD dataset. The filter-based FS algorithm is performed based on three major functions: Information Gain (IG), Correlation Based Feature Selection (CFS) and Consistency Based Subset Evaluation (CS). The wrapper-based FS algorithm is performed based on the Enhanced Immune Clonal Selection (EICS) algorithm, which chooses the most important features from the CKD dataset. The results from these FS algorithms are combined into the new HWFFS algorithm using a classification threshold value. Finally, a Support Vector Machine (SVM) based prediction algorithm is proposed to predict CKD and is evaluated on the MATLAB platform. The results demonstrate that the SVM classifier using the HWFFS algorithm provides a higher prediction rate in the diagnosis of CKD when compared to other classification algorithms.
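The hybrid filter/wrapper idea can be sketched in Python. The sketch makes several substitutions that should be read as assumptions: mutual information stands in for the IG/CFS/CS filter functions, a greedy forward search scored by an SVM stands in for the EICS wrapper, a simple union stands in for the threshold-based combination, and the data is synthetic rather than the CKD dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the CKD table (12 features, 4 informative).
X, y = make_classification(n_samples=300, n_features=12, n_informative=4,
                           random_state=0)

# Filter stage: rank features by information gain (mutual information).
ig = mutual_info_classif(X, y, random_state=0)
filter_keep = set(np.argsort(ig)[-6:].tolist())

# Wrapper stage: greedy forward search scored by an SVM.
selected, best = [], 0.0
for _ in range(6):
    scores = {f: cross_val_score(SVC(), X[:, selected + [f]], y, cv=3).mean()
              for f in range(X.shape[1]) if f not in selected}
    f, score = max(scores.items(), key=lambda kv: kv[1])
    if score <= best:                       # stop when accuracy stops improving
        break
    selected.append(f)
    best = score
wrapper_keep = set(selected)

# Hybrid: keep every feature endorsed by either stage.
hybrid = sorted(filter_keep | wrapper_keep)
acc = cross_val_score(SVC(), X[:, hybrid], y, cv=3).mean()
```

The design trade-off the abstract exploits is that filters are cheap but classifier-agnostic, while wrappers are expensive but tuned to the final classifier; combining them buys some of both.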


2012 ◽  
Vol 198-199 ◽  
pp. 1280-1285 ◽  
Author(s):  
Shang Fu Gong ◽  
Juan Chen

The wide use of P2P (Peer-to-Peer) technology has caused excessive resource consumption, security risks and other problems, so it is necessary to detect and control P2P traffic. After analyzing current P2P detection methods, a new method called TCBDM (Traffic Characters Based Detection Method) is put forward, which combines P2P traffic characters with a support vector machine to detect P2P traffic. By choosing P2P traffic characters that differ from other network traffic, such as Round-Trip Time (RTT), the method creates an SVM classifier and uses the LIBSVM package to classify P2P traffic in the Moore_Set data sets. The results show that TCBDM can detect P2P traffic effectively; the accuracy can reach 98%.
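The character-based classification step can be sketched as below. The flow features and their distributions are made up for illustration (not taken from Moore_Set), and scikit-learn's SVC is used in place of LIBSVM (it wraps the same library):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic flow records: [mean RTT (ms), mean packet size (B), duration (s)].
rng = np.random.default_rng(0)
p2p = np.column_stack([rng.normal(120, 20, 500),
                       rng.normal(1200, 150, 500),
                       rng.normal(300, 60, 500)])
web = np.column_stack([rng.normal(40, 10, 500),
                       rng.normal(500, 120, 500),
                       rng.normal(20, 8, 500)])
X = np.vstack([p2p, web])
y = np.array([1] * 500 + [0] * 500)          # 1 = P2P, 0 = other traffic

Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(Xtr, ytr)
acc = clf.score(Xte, yte)
```

Scaling matters here: RTT, packet size and duration live on very different numeric ranges, and an RBF SVM is sensitive to that, hence the StandardScaler in the pipeline.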


2017 ◽  
Vol 58 (3-4) ◽  
pp. 231-237
Author(s):  
CHENG WANG ◽  
FEILONG CAO

The error of a distributed algorithm for big data classification with a support vector machine (SVM) is analysed in this paper. First, the given big data sets are divided into small subsets, on which the classical SVM with Gaussian kernels is used. Then, the classification error of the SVM for each subset is analysed based on the Tsybakov exponent, geometric noise, and width of the Gaussian kernels. Finally, the whole error of the distributed algorithm is estimated in terms of the error of each subset.
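The divide-and-conquer scheme whose error the paper analyses can be sketched empirically. The majority-vote aggregation and the synthetic data below are assumptions for illustration; the paper's contribution is the theoretical error estimate, not this code:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=1200, n_features=6, random_state=2)
Xtr, ytr, Xte, yte = X[:1000], y[:1000], X[1000:], y[1000:]

# Divide the training set into m subsets; fit a Gaussian-kernel SVM on each.
m = 4
models = [SVC(kernel="rbf").fit(Xs, ys)
          for Xs, ys in zip(np.array_split(Xtr, m), np.array_split(ytr, m))]

# Combine the subset classifiers by majority vote; the paper estimates the
# error of the whole scheme in terms of the per-subset errors.
votes = np.stack([mdl.predict(Xte) for mdl in models])
pred = (votes.mean(axis=0) >= 0.5).astype(int)

subset_errs = [1.0 - mdl.score(Xte, yte) for mdl in models]
whole_err = float(np.mean(pred != yte))
```

Empirically comparing `whole_err` against `subset_errs` gives a feel for the kind of bound the paper derives via the Tsybakov exponent and the Gaussian kernel width.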


2020 ◽  
Vol 23 (65) ◽  
pp. 86-99
Author(s):  
Suresh K ◽  
Karthik S ◽  
Hanumanthappa M

With the progressions in Information and Communication Technology (ICT), innumerable electronic devices (like smart sensors) and several software applications can make notable contributions to the challenges existent in monitoring plants. In the prevailing work, the segmentation accuracy and classification accuracy of the Disease Monitoring System (DMS) are low, so the system does not properly monitor plant diseases. To overcome such drawbacks, this paper proposes an efficient monitoring system for paddy leaves based on big data mining. The proposed model comprises 5 phases: 1) image acquisition, 2) segmentation, 3) feature extraction, 4) feature selection, and 5) classification validation. Primarily, a paddy leaf image taken from the dataset is considered as the input. The image acquisition phase then executes 3 steps: i) converting the RGB image to a greyscale image, ii) normalization for high intensity, and iii) preprocessing utilizing the Alpha-Trimmed Mean Filter (ATMF), a hybrid of the mean and median filters, through which noise is eradicated. Next, the resulting image is segmented using the Fuzzy C-Means (FCM) clustering algorithm, which segments the diseased portion of the paddy leaves. In the next phase, features are extracted, and the resulting features are then selected utilizing the Multi-Verse Optimization (MVO) algorithm. After feature selection, the chosen features are classified utilizing an Adaptive Neuro-Fuzzy Inference System (ANFIS). Experimental results are contrasted with the former Support Vector Machine (SVM) classifier and prevailing methods in respect of precision, recall, F-measure, sensitivity, accuracy, and specificity. The proposed system attains an accuracy of 97.28%, whereas the prevailing techniques offer only 91.2% (SVM), 85.3% (KNN) and 88.78% (ANN). Hence, the proposed DMS has a more accurate detection and classification process, and evinces better accuracy, than the prevailing methods.
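The two preprocessing/segmentation steps at the heart of the pipeline, the alpha-trimmed mean filter and fuzzy C-means, can be sketched from their textbook definitions. The toy "leaf" image (a dark lesion on a bright background) and all parameter values are assumptions for illustration:

```python
import numpy as np

def alpha_trimmed_mean(img, size=3, alpha=2):
    """Alpha-trimmed mean filter: drop the alpha/2 smallest and largest
    values in each window and average the rest (hybrid of mean/median)."""
    pad = size // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.empty_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            win = np.sort(padded[i:i + size, j:j + size].ravel())
            out[i, j] = win[alpha // 2: win.size - alpha // 2].mean()
    return out

def fuzzy_cmeans(pixels, c=2, m=2.0, iters=30, seed=0):
    """Minimal fuzzy C-means on a 1-D vector of pixel intensities."""
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=pixels.size)       # memberships
    for _ in range(iters):
        Um = U ** m
        centers = (Um.T @ pixels) / Um.sum(axis=0)
        d = np.abs(pixels[:, None] - centers[None, :]) + 1e-9
        p = 2.0 / (m - 1.0)
        U = 1.0 / (d ** p * np.sum(d ** (-p), axis=1, keepdims=True))
    return centers, U

# Toy greyscale "leaf": dark lesion square on a bright background, plus noise.
img = np.full((32, 32), 200.0)
img[10:20, 10:20] = 60.0
img += np.random.default_rng(1).normal(0, 5, img.shape)

smooth = alpha_trimmed_mean(img)                           # denoise
centers, U = fuzzy_cmeans(smooth.ravel())                  # 2-cluster FCM
mask = (U.argmax(axis=1) == centers.argmin()).reshape(img.shape)
```

`mask` marks pixels belonging to the darker (diseased) cluster; in the real system this segmented region is what feeds the MVO feature-selection and ANFIS classification stages.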


Heart arrhythmias are the different types of heartbeats that are irregular in nature: in tachycardia the heart beats too fast, and in bradycardia it beats too slow. In the study of different cardiac conditions, automatic detection of heart arrhythmia is done by classification and feature extraction of electrocardiogram (ECG) data. Various Support Vector Machine (SVM) based methods are used to analyze and classify ECG signals for arrhythmia detection, such as one-against-all, one-against-one and the fuzzy decision function. This classification detects the existence of arrhythmia and helps physicians treat heart patients in a more accurate way. To train the SVM, the MIT-BIH Arrhythmia database is used, which covers heart disorders such as sinus bradycardia, old inferior myocardial infarction, coronary artery disease and right bundle branch block. All three methods are implemented properly, and the rate of accuracy with the SVM classifier is optimal when processed with the one-against-all method. Since ECG arrhythmia data sets are usually complex in nature, the one-against-all method has great impact on SVM-based classification and fetches better results.
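The one-against-all scheme the study favours can be sketched with scikit-learn's `OneVsRestClassifier`; the synthetic four-class data below stands in for features extracted from MIT-BIH beats, which is an assumption of the sketch:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic stand-in for extracted ECG beat features (4 rhythm classes).
X, y = make_classification(n_samples=800, n_features=12, n_informative=8,
                           n_classes=4, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

# One-against-all: one binary SVM per class; the class whose SVM gives
# the highest decision score wins.
ova = OneVsRestClassifier(SVC(kernel="rbf")).fit(Xtr, ytr)
acc = ova.score(Xte, yte)
```

One-against-all trains k binary SVMs for k classes, versus k(k-1)/2 for one-against-one, which is part of why it scales better to the complex multi-class ECG setting.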


Author(s):  
Lakshmi Sarvani Videla ◽  
M. Ashok Kumar P

The detection of person fatigue is one of the important tasks in detecting drowsiness in the domain of image processing. Though much work has been carried out in this regard, there is a void of work that shows exact correctness. In this chapter, the main objective is to present an efficient approach that combines eye state detection and yawn detection in unconstrained environments. In the first proposed method, the face region and then the eyes and mouth are detected. Histograms of Oriented Gradients (HOG) features are extracted from the detected eyes and fed to a Support Vector Machine (SVM) classifier that classifies the eye state as closed or not closed. The distance between intensity changes in the mouth map is used to detect a yawn. In the second proposed method, off-the-shelf face detectors and facial landmark detectors are used to detect the features, and a novel eye and mouth metric is proposed. In both proposed methods, the eye results obtained are checked for consistency with the yawn detection results; if either result indicates fatigue, the result is considered fatigue. The second proposed method outperforms the first on two standard data sets.
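The HOG-plus-SVM eye-state step of the first method can be sketched as follows. The `hog_like` function is a deliberately simplified, hypothetical stand-in for real HOG (no cells or block normalization), and the open/closed eye patches are synthetic:

```python
import numpy as np
from sklearn.svm import SVC

def hog_like(img, bins=9):
    """Simplified HOG: one magnitude-weighted histogram of unsigned
    gradient orientations (real HOG adds cells and block normalization)."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi
    hist, _ = np.histogram(ang, bins=bins, range=(0.0, np.pi), weights=mag)
    return hist / (hist.sum() + 1e-9)

rng = np.random.default_rng(0)

def closed_eye():                             # horizontal lid line
    img = np.full((24, 24), 180.0)
    img[11:13, 4:20] = 40.0
    return img + rng.normal(0, 8, img.shape)

def open_eye():                               # round dark iris
    img = np.full((24, 24), 180.0)
    yy, xx = np.ogrid[:24, :24]
    img[(yy - 12) ** 2 + (xx - 12) ** 2 < 36] = 40.0
    return img + rng.normal(0, 8, img.shape)

X = np.array([hog_like(closed_eye()) for _ in range(60)] +
             [hog_like(open_eye()) for _ in range(60)])
y = np.array([0] * 60 + [1] * 60)             # 0 = closed, 1 = open

idx = rng.permutation(120)                    # shuffle before splitting
X, y = X[idx], y[idx]
clf = SVC(kernel="rbf").fit(X[:90], y[:90])
acc = clf.score(X[90:], y[90:])
```

A closed eye is dominated by one horizontal edge (a single orientation bin), while an open eye's round iris spreads gradient energy across orientations, which is exactly the contrast the SVM learns.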


2018 ◽  
Vol 61 (1) ◽  
pp. 64-76 ◽  
Author(s):  
Susan (Sixue) Jia

Fitness clubs have never ceased searching for quality improvement opportunities to better serve their exercisers, and exercisers have been posting online ratings and reviews regarding fitness clubs. Studied together, the quantitative rating and qualitative review can provide a comprehensive depiction of exercisers’ perception of fitness clubs. However, the typological and dimensional discrepancies of online ratings and reviews have hindered the joint study of the two data sets to fully exploit their business value. To this end, this study bridges the gap by examining 53,979 pairs of exerciser online ratings and reviews from 100 fitness clubs in Shanghai, China. Using latent Dirichlet allocation (LDA) based text mining, we identified the 17 major topics on which the exercisers were writing. A support vector machine (SVM) classifier was then employed to establish the rating-review relations, with an accuracy rate of up to 86%. Finally, the relative impact of each topic on exerciser satisfaction was computed and compared by introducing virtual reviews. The significance of this study is that it systematically creates a standardized protocol for mining and correlating the massive structured/quantitative and unstructured/qualitative data available online, which is readily transferable to other service and product sectors.
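The LDA-to-SVM bridge described above can be sketched on toy data. The reviews, ratings, topic count and satisfaction threshold below are all invented for illustration (the study uses 53,979 real pairs and 17 topics):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Toy stand-ins for exerciser reviews and their star ratings.
reviews = ["great coach friendly staff", "dirty locker room broken shower",
           "crowded gym old equipment", "clean pool helpful trainer",
           "friendly trainer clean gym", "broken treadmill rude staff"] * 20
ratings = [5, 1, 2, 5, 4, 1] * 20

counts = CountVectorizer().fit_transform(reviews)

# LDA turns each review into a topic-proportion vector ...
lda = LatentDirichletAllocation(n_components=4, random_state=0)
topics = lda.fit_transform(counts)

# ... which an SVM maps to a coarse satisfaction label (rating >= 4).
y = (np.array(ratings) >= 4).astype(int)
clf = SVC(kernel="rbf").fit(topics[:100], y[:100])
acc = clf.score(topics[100:], y[100:])
```

The key move is dimensional alignment: LDA projects unstructured text into a fixed-length numeric space, so it can be paired with the structured rating in one supervised model.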


2016 ◽  
Vol 13 (10) ◽  
pp. 6929-6934
Author(s):  
Junting Chen ◽  
Liyun Zhong ◽  
Caiyun Cai

Word sense disambiguation (WSD) in natural language text is a fundamental semantic understanding task at the lexical level in natural language processing (NLP) applications. Kernel methods such as the support vector machine (SVM) have been successfully applied to WSD, mainly due to their relatively high classification accuracy as well as their ability to handle high-dimensional and sparse data. A significant challenge in WSD is to reduce the need for labeled training data while maintaining an acceptable performance. In this paper, we present a semi-supervised technique using the exponential kernel for WSD. Specifically, the semantic similarities between terms are first determined with both labeled and unlabeled training data by means of a diffusion process on a graph defined by lexicon and co-occurrence information, and the exponential kernel is then constructed based on the learned semantic similarity. Finally, the SVM classifier trains a model for each class during the training phase, and this model is then applied to all test examples in the test phase. The main feature of this approach is that it takes advantage of the exponential kernel to reveal the semantic similarities between terms in an unsupervised manner, which provides a kernel framework for semi-supervised learning. Experiments on several SENSEVAL benchmark data sets demonstrate that the proposed approach is sound and effective.
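The diffusion construction can be sketched concretely: the exponential kernel over a term graph with adjacency matrix A is the matrix exponential exp(βA), which sums walks of all lengths with decaying weights. The tiny co-occurrence graph, the bag-of-terms contexts and β below are all invented for illustration:

```python
import numpy as np
from scipy.linalg import expm
from sklearn.svm import SVC

# Toy term-term co-occurrence graph over 6 terms (two disconnected cliques).
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)

# Exponential (diffusion) kernel over the graph: K_terms = exp(beta * A).
beta = 0.5
K_terms = expm(beta * A)

# Contexts as bags of terms; the context kernel follows from the term kernel.
X = np.array([[1, 1, 0, 0, 0, 0],
              [0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [0, 0, 0, 1, 1, 0],
              [0, 0, 0, 0, 1, 1],
              [0, 0, 0, 1, 0, 1]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])       # two word senses

K = X @ K_terms @ X.T                  # Gram matrix between contexts
clf = SVC(kernel="precomputed").fit(K, y)
pred = clf.predict(K)                  # rows: test contexts vs training contexts
```

Because exp(βA) spreads similarity along graph paths, two terms that never co-occur directly still receive a nonzero similarity if they are connected through intermediate terms, which is what makes the approach semi-supervised in spirit.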


2010 ◽  
Vol 19 (05) ◽  
pp. 647-677 ◽  
Author(s):  
LAURA DIOŞAN ◽  
ALEXANDRINA ROGOZAN ◽  
JEAN-PIERRE PECUCHET

Classic kernel-based classifiers use only a single kernel, but real-world applications have emphasized the need to consider a combination of kernels — also known as a multiple kernel (MK) — in order to boost the classification accuracy by adapting better to the characteristics of the data. Our purpose is to automatically design a complex multiple kernel by evolutionary means. In order to achieve this purpose we propose a hybrid model that combines a Genetic Programming (GP) algorithm and a kernel-based Support Vector Machine (SVM) classifier. In our model, each GP chromosome is a tree that encodes the mathematical expression of a multiple kernel. The evolutionary search process for the optimal MK is guided by the fitness function (or efficiency) of each possible MK. The complex multiple kernels evolved in this manner (eCMKs) are compared to several classic simple kernels (SKs), to a convex linear multiple kernel (cLMK) and to an evolutionary linear multiple kernel (eLMK) on several real-world data sets from the UCI repository. The numerical experiments show that the SVMs involving the evolutionary complex multiple kernels perform better than those with the classic simple kernels. Moreover, on the considered data sets, the new multiple kernels outperform both the cLMK and eLMK linear multiple kernels. These results emphasize the fact that the SVM algorithm requires a combination of kernels more complex than a linear one in order to boost its performance.
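One point in the MK search space the GP would explore can be written down directly: a convex combination of two base kernels, used as a precomputed Gram matrix. The fixed weight, base kernels and synthetic data are assumptions for the sketch; in the paper a GP tree evolves the combining expression instead of fixing it:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=8, random_state=3)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=3)

def multiple_kernel(A, B, w=0.6):
    """One candidate multiple kernel: a convex combination of an RBF
    kernel and a degree-2 polynomial kernel (itself a valid kernel)."""
    return w * rbf_kernel(A, B) + (1 - w) * polynomial_kernel(A, B, degree=2)

# Train on the Gram matrix of the training set, then score the test set
# against the training set with the same combined kernel.
clf = SVC(kernel="precomputed").fit(multiple_kernel(Xtr, Xtr), ytr)
acc = clf.score(multiple_kernel(Xte, Xtr), yte)
```

Sums and products of positive-definite kernels remain positive definite, which is why a GP tree over such operators always encodes a legal kernel for the SVM.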

