INOD: A Graph-Based Outlier Detection Algorithm

The outlier detection is to select uncommon data from a data set, which can significantly improve the quality of results for the data mining algorithms. A typical feature of the outliers is that they are always far away from a majority of data in the data set. In this paper, we present a graph-based outlier detection algorithm named INOD, which makes use of this feature of the outlier. The DistMean-neighborhood is used to calculate the cumulative in-degree for each data. The data, whose cumulative in-degree is smaller than a threshold, is judged as an outlier candidate. A KNN-based selection algorithm is used to determine the final outlier. Experimental results show that the INOD algorithm can improve the precision 80% higher and decrease the error rate 75% lower than the classical LOF algorithm.

Download Full-text

A Survey of Feature Selection Techniques

Encyclopedia of Data Warehousing and Mining, Second Edition ◽

10.4018/978-1-60566-010-3.ch289 ◽

2011 ◽

pp. 1888-1895 ◽

Cited By ~ 17

Author(s):

Barak Chizi ◽

Lior Rokach ◽

Oded Maimon

Keyword(s):

Data Mining ◽

Feature Selection ◽

Mining Method ◽

Data Set ◽

Data Mining Method ◽

Data Mining Algorithms ◽

Wrapper Approach ◽

Computationally Intensive ◽

Filter Approach ◽

Mining Algorithms

Dimensionality (i.e., the number of data set attributes or groups of attributes) constitutes a serious obstacle to the efficiency of most data mining algorithms (Maimon and Last, 2000). The main reason for this is that data mining algorithms are computationally intensive. This obstacle is sometimes known as the “curse of dimensionality” (Bellman, 1961). The objective of Feature Selection is to identify features in the data-set as important, and discard any other feature as irrelevant and redundant information. Since Feature Selection reduces the dimensionality of the data, data mining algorithms can be operated faster and more effectively by using Feature Selection. In some cases, as a result of feature selection, the performance of the data mining method can be improved. The reason for that is mainly a more compact, easily interpreted representation of the target concept. The filter approach (Kohavi , 1995; Kohavi and John ,1996) operates independently of the data mining method employed subsequently -- undesirable features are filtered out of the data before learning begins. These algorithms use heuristics based on general characteristics of the data to evaluate the merit of feature subsets. A sub-category of filter methods that will be refer to as rankers, are methods that employ some criterion to score each feature and provide a ranking. From this ordering, several feature subsets can be chosen by manually setting There are three main approaches for feature selection: wrapper, filter and embedded. The wrapper approach (Kohavi, 1995; Kohavi and John,1996), uses an inducer as a black box along with a statistical re-sampling technique such as cross-validation to select the best feature subset according to some predictive measure. The embedded approach (see for instance Guyon and Elisseeff, 2003) is similar to the wrapper approach in the sense that the features are specifically selected for a certain inducer, but it selects the features in the process of learning.

Download Full-text

Local Outlier Detection Algorithm Based on Coefficient of Variation

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.635-637.1723 ◽

2014 ◽

Vol 635-637 ◽

pp. 1723-1728

Author(s):

Shi Bo Zhou ◽

Wei Xiang Xu

Keyword(s):

Data Mining ◽

Outlier Detection ◽

Coefficient Of Variation ◽

Detection Algorithm ◽

Experimental Results ◽

Data Set ◽

Outliers Detection ◽

Deviation Factor ◽

Local Deviation ◽

Local Outlier

Local outliers detection is an important issue in data mining. By analyzing the limitations of the existing outlier detection algorthms, a local outlier detection algorthm based on coefficient of variation is introduced. This algorthms applies K-means which is strong in outliers searching, divides data set into sections, puts outliers and their nearing clusters into a local neighbourhood, then figures out the local deviation factor of each local neighbourhood by coefficient of variation, as a result, local outliers can more likely be found.The heoretic analysis and experimental results indicate that the method is ef fective and efficient.

Download Full-text

A Comparative Evaluation of Meta Classification Algorithms with Smokers Lung Data

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i4.36.24543 ◽

2018 ◽

Vol 7 (4.36) ◽

pp. 845 ◽

Cited By ~ 1

Author(s):

K. Kavitha ◽

K. Rohini ◽

G. Suseendran

Keyword(s):

Data Mining ◽

Web Mining ◽

Lung Function Test ◽

Performance Comparison ◽

Business Organizations ◽

Science Data ◽

Data Set ◽

Data Mining Algorithms ◽

Report Data ◽

Mining Algorithms

Data mining is the course of process during which knowledge is extracted through interesting patterns recognized from large amount of data. It is one of the knowledge exploring areas which is widely used in the field of computer science. Data mining is an inter-disciplinary area which has great impact on various other fields such as data analytics in business organizations, medical forecasting and diagnosis, market analysis, statistical analysis and forecasting, predictive analysis in various other fields. Data mining has multiple forms such as text mining, web mining, visual mining, spatial mining, knowledge mining and distributed mining. In general the process of data mining has many tasks from pre-processing. The actual task of data mining starts after the preprocessing task. This work deals with the analysis and comparison of the various Data mining algorithms particularly Meta classifiers based upon performance and accuracy. This work is under medical domain, which is using the lung function test report data along with the smoking data. This medical data set has been created from the raw data obtained from the hospital. In this paper work, we have analyzed the performance of Meta classifiers for classifying the files. Initially the performances of Meta and Rule classifiers are analyzed observed and found that the Meta classifier is more efficient than the Rule classifiers in Weka tool. The implementation work then continued with the performance comparison between the different types of classification algorithm among which the Meta classifiers showed comparatively higher accuracy in the process of classification. The four Meta classifier algorithms which are widely explored using the Weka tool namely Bagging, Attribute Selected Classifier, Logit Boost and Classification via Regression are used to classify this medical dataset and the result so obtained has been evaluated and compared to recognize the best among the classifier.

Download Full-text

Determination of Voting Tendencies in Turkey through Data Mining Algorithms

International Journal of E-Adoption ◽

10.4018/ijea.2017010105 ◽

2017 ◽

Vol 9 (1) ◽

pp. 50-58

Author(s):

Ali Bayır ◽

Sebnem Ozdemir ◽

Sevinç Gülseçen

Keyword(s):

Data Mining ◽

Economic Status ◽

Educational Status ◽

Data Set ◽

Data Mining Algorithms ◽

C4.5 Decision Tree ◽

Survey Results ◽

Mining Methods ◽

Using Data ◽

Mining Algorithms

Political elections can be defined as systems that contain political tendencies and voters' perceptions and preferences. The outputs of those systems are formed by specific attributes of individuals such as age, gender, occupancy, educational status, socio-economic status, religious belief, etc. Those attributes can create a data set, which contains hidden information and undiscovered patterns that can be revealed by using data mining methods and techniques. The main purpose of this study is to define voting tendencies in politics by using some of data mining methods. According to that purpose, the survey results, which were prepared and applied before 2011 elections of Turkey by KONDA Research and Consultancy Company, were used as raw data set. After Preprocessing of data, models were generated via data mining algorithms, such as Gini, C4.5 Decision Tree, Naive Bayes and Random Forest. Because of increasing popularity and flexibility in analyzing process, R language and Rstudio environment were used.

Download Full-text

Body Weight Prediction of Thalli Sheep Reared in Southern Punjab Using Different Data Mining Algorithms

Proceedings of the Pakistan Academy of Sciences: A. Physical and Computational Sciences ◽

10.53560/ppasa(58-2)603 ◽

2021 ◽

Vol 58 (2) ◽

pp. 29-38

Author(s):

Ansar Abbas ◽

Muhammad Aman Ullah ◽

Abdul Waheed

Keyword(s):

Data Mining ◽

Body Weight ◽

Goodness Of Fit ◽

The Body ◽

Classification And Regression Tree ◽

Body Measurements ◽

Data Set ◽

Data Mining Algorithms ◽

Exhaustive Chaid ◽

Mining Algorithms

This study is conducted to predict the body weight (BW) for Thalli sheep of southern Punjab from different body measurements. In the BW prediction, several body measurements viz., withers height, body length, head length, head width, ear length, ear width, neck length, neck width, heart girth, rump length, rump width, tail length, barrel depth and sacral pelvic width are used as predictors. The data mining algorithms such as Chi-square Automatic Interaction Detector (CHAID), Exhaustive CHAID, Classification and Regression Tree (CART) and Artificial Neural Network (ANN) are used to predict the BW for a total of 85 female Thalli sheep. The data set is partitioned into training (80 %) and test (20 %) sets before the algorithms are used. The minimum number of parent (4) and child nodes (2) are set in order to ensure their predictive ability. The R2 % and RMSE values for CHAID, Exhaustive CHAID, ANN and CART algorithms are 67.38(1.003), 64.37(1.049), 61.45(1.093) and 59.02(1.125), respectively. The mostsignificant predictor is BL in the BW prediction of Thalli sheep. The heaviest BW average of 9.596 kg is obtained from the subgroup of those having BL > 25.000 inches. On behalf of the several goodness of fit criteria, we conclude that the CHAID algorithm performance is better in order to predict the BW of Thalli sheep and more suitable decision tree diagram visually. Also, the obtained CHAID results may help to determine body measurements positively associated with BW for developing better selection strategies with the scope of indirect selection criteria.

Download Full-text

A Comparative Study of Outlier Detection in Large-Scale Data Using Data Mining Algorithms

International Journal Of Data Mining And Emerging Technologies ◽

10.5958/2249-3220.2017.00002.7 ◽

2017 ◽

Vol 7 (1) ◽

pp. 10

Author(s):

V. Saranya ◽

R. Umagandhi

Keyword(s):

Data Mining ◽

Comparative Study ◽

Outlier Detection ◽

Large Scale ◽

Data Mining Algorithms ◽

Large Scale Data ◽

Using Data ◽

Mining Algorithms ◽

Scale Data

Download Full-text

Learning analytics made in France: the METAL project

International Journal of Information and Learning Technology ◽

10.1108/ijilt-02-2019-0022 ◽

2019 ◽

Vol 36 (4) ◽

pp. 299-313 ◽

Cited By ~ 2

Author(s):

Armelle Brun ◽

Geoffray Bonnin ◽

Sylvain Castagnos ◽

Azim Roussanaly ◽

Anne Boyer

Keyword(s):

Data Mining ◽

Secondary School ◽

Data Storage ◽

Learning Analytics ◽

Content Type ◽

Data Mining Algorithms ◽

School Based ◽

Source Data ◽

Mining Algorithms

Purpose The purpose of this paper is to present the METAL project, a French open learning analytics (LA) project for secondary school, that aims at improving the quality of teaching. The originality of METAL is that it relies on research through exploratory activities and focuses on all the aspects of a learning analytics environment. Design/methodology/approach This work introduces the different concerns of the project: collection and storage of multi-source data owned by a variety of stakeholders, selection and promotion of standards, design of an open-source LRS, conception of dashboards with their final users, trust, usability, design of explainable multi-source data-mining algorithms. Findings All the dimensions of METAL are presented, as well as the way they are approached: data sources, data storage, through the implementation of an LRS, design of dashboards for secondary school, based on co-design sessions data mining algorithms and experiments, in line with privacy and ethics concerns. Originality/value The issue of a global dissemination of LA at an institution level or at a broader level such as a territory or a study level is still a hot topic in the literature, and is one of the focus and originality of this paper, associated with the large spectrum of different concerns.

Download Full-text

Identifying Decision Structures Underlying Activity Patterns: An Exploration of Data Mining Algorithms

Transportation Research Record Journal of the Transportation Research Board ◽

10.3141/1718-01 ◽

2000 ◽

Vol 1718 (1) ◽

pp. 1-9 ◽

Cited By ~ 39

Author(s):

Geert Wets ◽

Koen Vanhoof ◽

Theo Arentze ◽

Harry Timmermans

Keyword(s):

Data Mining ◽

Decision Tree ◽

Logit Model ◽

Goodness Of Fit ◽

Travel Demand ◽

Activity Patterns ◽

Future Research ◽

Data Set ◽

Data Mining Algorithms ◽

Mining Algorithms

The utility-maximizing framework—in particular, the logit model—is the dominantly used framework in transportation demand modeling. Computational process modeling has been introduced as an alternative approach to deal with the complexity of activity-based models of travel demand. Current rule-based systems, however, lack a methodology to derive rules from data. The relevance and performance of data-mining algorithms that potentially can provide the required methodology are explored. In particular, the C4 algorithm is applied to derive a decision tree for transport mode choice in the context of activity scheduling from a large activity diary data set. The algorithm is compared with both an alternative method of inducing decision trees (CHAID) and a logit model on the basis of goodness-of-fit on the same data set. The ratio of correctly predicted cases of a holdout sample is almost identical for the three methods. This suggests that for data sets of comparable complexity, the accuracy of predictions does not provide grounds for either rejecting or choosing the C4 method. However, the method may have advantages related to robustness. Future research is required to determine the ability of decision tree-based models in predicting behavioral change.

Download Full-text

There and back again: Outlier detection between statistical reasoning and data mining algorithms

Wiley Interdisciplinary Reviews Data Mining and Knowledge Discovery ◽

10.1002/widm.1280 ◽

2018 ◽

Vol 8 (6) ◽

Cited By ~ 14

Author(s):

Arthur Zimek ◽

Peter Filzmoser

Keyword(s):

Data Mining ◽

Outlier Detection ◽

Statistical Reasoning ◽

Data Mining Algorithms ◽

Mining Algorithms

Download Full-text

Workload Optimization by Horizontal Aggregation in SQL for Data Mining Analysis

International Journal of Scientific Research in Computer Science Engineering and Information Technology ◽

10.32628/cseit217263 ◽

2021 ◽

pp. 304-309

Author(s):

Prasanna M. Rathod ◽

Prof. Dr. Anjali B. Raut

Keyword(s):

Data Mining ◽

Relational Algebra ◽

Data Migration ◽

Data Sets ◽

Data Set ◽

Application Performance ◽

Data Mining Algorithms ◽

Mining Project ◽

Pivot Methods ◽

Mining Algorithms

Preparing a data set for analysis is generally the most time consuming task in a data mining project, requiring many complex SQL queries, joining tables, and aggregating columns. Existing SQL aggregations have limitations to prepare data sets because they return one column per aggregated group. In general, a significant manual effort is required to build data sets, where a horizontal layout is required. We propose simple, yet powerful, methods to generate SQL code to return aggregated columns in a horizontal tabular layout, returning a set of numbers instead of one number per row. This new class of functions is called horizontal aggregations. Horizontal aggregations build data sets with a horizontal denormalized layout (e.g., point-dimension, observation variable, instance-feature), which is the standard layout required by most data mining algorithms. We propose three fundamental methods to evaluate horizontal aggregations: ? CASE: Exploiting the programming CASE construct; ? SPJ: Based on standard relational algebra operators (SPJ queries); ? PIVOT: Using the PIVOT operator, which is offered by some DBMSs. Experiments with large tables compare the proposed query evaluation methods. Our CASE method has similar speed to the PIVOT operator and it is much faster than the SPJ method. In general, the CASE and PIVOT methods exhibit linear scalability, whereas the SPJ method does not. For query optimization the distance computation and nearest cluster in the k-means are based on SQL. Workload balancing is the assignment of work to processors in a way that maximizes application performance. The process of load balancing can be generalized into four basic steps: 1. Monitoring processor load and state; 2. Exchanging workload and state information between processors; 3. Decision making; 4. Data migration. The decision phase is triggered when the load imbalance is detected to calculate optimal data redistribution. In the fourth and last phase, data migrates from overloaded processors to under-loaded ones.

Download Full-text