Identifying Decision Structures Underlying Activity Patterns: An Exploration of Data Mining Algorithms

Author(s):  
Geert Wets ◽  
Koen Vanhoof ◽  
Theo Arentze ◽  
Harry Timmermans

The utility-maximizing framework, in particular the logit model, is the dominant framework in transportation demand modeling. Computational process modeling has been introduced as an alternative approach to deal with the complexity of activity-based models of travel demand. Current rule-based systems, however, lack a methodology to derive rules from data. This study explores the relevance and performance of data-mining algorithms that could provide the required methodology. In particular, the C4 algorithm is applied to derive a decision tree for transport mode choice in the context of activity scheduling from a large activity diary data set. The algorithm is compared with both an alternative method of inducing decision trees (CHAID) and a logit model on the basis of goodness-of-fit on the same data set. The ratio of correctly predicted cases in a holdout sample is almost identical for the three methods. This suggests that, for data sets of comparable complexity, the accuracy of predictions does not provide grounds for either rejecting or choosing the C4 method. However, the method may have advantages related to robustness. Future research is required to determine the ability of decision tree-based models to predict behavioral change.

Author(s):  
Ansar Abbas ◽  
Muhammad Aman Ullah ◽  
Abdul Waheed

This study predicts the body weight (BW) of Thalli sheep of southern Punjab from different body measurements. Several body measurements, viz. withers height, body length (BL), head length, head width, ear length, ear width, neck length, neck width, heart girth, rump length, rump width, tail length, barrel depth and sacral pelvic width, are used as predictors. The data mining algorithms Chi-square Automatic Interaction Detector (CHAID), Exhaustive CHAID, Classification and Regression Tree (CART) and Artificial Neural Network (ANN) are used to predict the BW for a total of 85 female Thalli sheep. The data set is partitioned into training (80%) and test (20%) sets before the algorithms are applied. The minimum sizes of parent (4) and child (2) nodes are set in order to ensure their predictive ability. The R² (%) values, with RMSE in parentheses, for the CHAID, Exhaustive CHAID, ANN and CART algorithms are 67.38 (1.003), 64.37 (1.049), 61.45 (1.093) and 59.02 (1.125), respectively. The most significant predictor of BW in Thalli sheep is BL. The highest average BW, 9.596 kg, is obtained from the subgroup with BL > 25.000 inches. On the basis of several goodness-of-fit criteria, we conclude that the CHAID algorithm performs better in predicting the BW of Thalli sheep and yields a more visually suitable decision tree diagram. The CHAID results may also help to determine which body measurements are positively associated with BW, supporting better selection strategies based on indirect selection criteria.
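The two goodness-of-fit criteria reported above can be sketched as follows. This is a minimal illustration using the common textbook definitions of RMSE and R²; the abstract does not state which exact formulas the authors used, and the function names and the sample numbers below are made up for the example.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between observed and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical test-set body weights (kg) and model predictions;
# these are NOT values from the study.
actual = [8.0, 9.0, 10.0, 11.0]
predicted = [8.5, 9.0, 9.5, 11.0]

print(f"RMSE = {rmse(actual, predicted):.3f}")
print(f"R2 % = {100 * r_squared(actual, predicted):.2f}")
```

Computing both measures on the held-out test set, rather than on the training data, is what makes values from different algorithms comparable.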


BioResources ◽  
2021 ◽  
Vol 16 (3) ◽  
pp. 4891-4904
Author(s):  
Selahattin Bardak ◽  
Timucin Bardak ◽  
Hüseyin Peker ◽  
Eser Sözen ◽  
Yildiz Çabuk

Wood materials have been used for centuries in many products such as furniture, stairs, windows, and doors. There are differences in the methods used to adapt wood to ambient conditions. Impregnation is a widely used method of wood preservation. In terms of efficiency, it is critical to optimize the parameters for impregnation. Data mining techniques reduce most of the cost and operational challenges with accurate prediction in the wood industry. In this study, three data-mining algorithms were applied to predict bending strength in impregnated wood materials (Pinus sylvestris L. and Millettia laurentii). Models were created from real experimental data to examine the relationship between bending strength, diffusion time, vacuum duration, and wood type, based on decision trees (DT), random forest (RF), and Gaussian process (GP) algorithms. The highest bending strength was achieved with wenge (Millettia laurentii) wood under a 10 bar vacuum with a diffusion time of 25 min. The results showed that all algorithms are suitable for predicting bending strength. The goodness of fit for the testing phase was determined as 0.994, 0.986, and 0.989 in the DT, RF, and GP algorithms, respectively. Moreover, the importance of attributes was determined in the algorithms.


Author(s):  
Barak Chizi ◽  
Lior Rokach ◽  
Oded Maimon

Dimensionality (i.e., the number of data set attributes or groups of attributes) constitutes a serious obstacle to the efficiency of most data mining algorithms (Maimon and Last, 2000). The main reason for this is that data mining algorithms are computationally intensive. This obstacle is sometimes known as the “curse of dimensionality” (Bellman, 1961). The objective of Feature Selection is to identify the important features in the data set and to discard all other features as irrelevant or redundant information. Since Feature Selection reduces the dimensionality of the data, data mining algorithms can run faster and more effectively when it is used. In some cases, feature selection also improves the performance of the data mining method, mainly because it yields a more compact, easily interpreted representation of the target concept. The filter approach (Kohavi, 1995; Kohavi and John, 1996) operates independently of the data mining method employed subsequently: undesirable features are filtered out of the data before learning begins. These algorithms use heuristics based on general characteristics of the data to evaluate the merit of feature subsets. A sub-category of filter methods, referred to here as rankers, comprises methods that employ some criterion to score each feature and provide a ranking. From this ordering, several feature subsets can be chosen by manually setting … There are three main approaches for feature selection: wrapper, filter and embedded. The wrapper approach (Kohavi, 1995; Kohavi and John, 1996) uses an inducer as a black box along with a statistical re-sampling technique such as cross-validation to select the best feature subset according to some predictive measure. The embedded approach (see for instance Guyon and Elisseeff, 2003) is similar to the wrapper approach in the sense that the features are specifically selected for a certain inducer, but it selects the features in the process of learning.
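A ranker of the kind described above can be sketched in a few lines. The scoring criterion used here, the absolute Pearson correlation between each feature and the target, is only one possible choice and is an assumption of this example, not the criterion used by the cited authors; the feature names and data are invented.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def rank_features(columns, target):
    """Score every feature independently of any learner, best first."""
    scores = {name: abs(pearson(values, target))
              for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical data set: three candidate features and one target.
columns = {
    "signal":   [1.0, 2.0, 3.0, 4.0, 5.0],
    "noisy":    [2.0, 1.0, 4.0, 3.0, 5.0],
    "constant": [7.0, 7.0, 7.0, 7.0, 7.0],
}
target = [2.0, 4.0, 6.0, 8.0, 10.0]

ranking = rank_features(columns, target)
top_two = ranking[:2]  # subset chosen by manually picking a cut-off
print(ranking, top_two)
```

Because the scores never consult a learning algorithm, the same ranking can be reused with any inducer, which is the property that distinguishes filters from wrapper and embedded methods.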


2018 ◽  
Vol 7 (4.36) ◽  
pp. 845 ◽  
Author(s):  
K. Kavitha ◽  
K. Rohini ◽  
G. Suseendran

Data mining is the process of extracting knowledge from interesting patterns recognized in large amounts of data. It is one of the knowledge exploration areas widely used in computer science. Data mining is an interdisciplinary area with great impact on other fields such as data analytics in business organizations, medical forecasting and diagnosis, market analysis, statistical analysis and forecasting, and predictive analysis. Data mining takes multiple forms, such as text mining, web mining, visual mining, spatial mining, knowledge mining and distributed mining. In general, the data mining process comprises many tasks, beginning with preprocessing; the actual mining task starts after preprocessing. This work analyzes and compares various data mining algorithms, particularly Meta classifiers, on the basis of performance and accuracy. The work is in the medical domain and uses lung function test report data along with smoking data. The medical data set was created from raw data obtained from the hospital. In this paper, we analyze the performance of Meta classifiers for classifying the files. Initially, the performance of Meta and Rule classifiers was analyzed in the Weka tool, and the Meta classifiers were found to be more efficient than the Rule classifiers. The implementation work then continued with a performance comparison between different types of classification algorithm, among which the Meta classifiers showed comparatively higher accuracy. Four Meta classifier algorithms widely explored using the Weka tool, namely Bagging, Attribute Selected Classifier, Logit Boost and Classification via Regression, are used to classify this medical data set, and the results are evaluated and compared to identify the best classifier.


Author(s):  
Moloud Abdar ◽  
Sharareh R. Niakan Kalhori ◽  
Tole Sutikno ◽  
Imam Much Ibnu Subroto ◽  
Goli Arji

Heart diseases are among the nation’s leading causes of mortality and morbidity. Data mining techniques can predict the likelihood of patients developing heart disease. The purpose of this study is to compare different data mining algorithms for the prediction of heart disease. This work applied and compared data mining techniques to predict the risk of heart disease. After feature analysis, models based on five algorithms, namely decision tree (C5.0), neural network, support vector machine (SVM), logistic regression and k-nearest neighbors (KNN), were developed and validated. The C5.0 decision tree built the model with the greatest accuracy, 93.02%; the KNN, SVM and neural network models reached 88.37%, 86.05% and 80.23%, respectively. The results produced by the decision tree are simple to interpret and apply; its rules can be understood easily by different clinical practitioners.


2017 ◽  
Vol 9 (1) ◽  
pp. 50-58
Author(s):  
Ali Bayır ◽  
Sebnem Ozdemir ◽  
Sevinç Gülseçen

Political elections can be defined as systems that contain political tendencies and voters' perceptions and preferences. The outputs of those systems are formed by specific attributes of individuals such as age, gender, occupation, educational status, socio-economic status, religious belief, etc. Those attributes can form a data set that contains hidden information and undiscovered patterns, which can be revealed by using data mining methods and techniques. The main purpose of this study is to identify voting tendencies in politics by using several data mining methods. For that purpose, the results of a survey prepared and conducted before the 2011 elections in Turkey by KONDA Research and Consultancy Company were used as the raw data set. After preprocessing of the data, models were generated via data mining algorithms such as Gini, C4.5 Decision Tree, Naive Bayes and Random Forest. Because of their increasing popularity and flexibility in the analysis process, the R language and the RStudio environment were used.


2009 ◽  
Vol 131 (3) ◽  
Author(s):  
Haiyang Zheng ◽  
Andrew Kusiak

In this paper, multivariate time series models were built to predict the power ramp rates of a wind farm. The power changes were predicted at 10 min intervals. Multivariate time series models were built with data-mining algorithms. Five different data-mining algorithms were tested using data collected at a wind farm. The support vector machine regression algorithm performed best out of the five algorithms studied in this research. It provided predictions of the power ramp rate for a time horizon of 10–60 min. The boosting tree algorithm selects parameters for enhancement of the prediction accuracy of the power ramp rate. The data used in this research originated at a wind farm of 100 turbines. The test results of multivariate time series models were presented in this paper. Suggestions for future research were provided.


Author(s):  
Prasanna M. Rathod ◽  
Prof. Dr. Anjali B. Raut

Preparing a data set for analysis is generally the most time consuming task in a data mining project, requiring many complex SQL queries, joining tables, and aggregating columns. Existing SQL aggregations have limitations to prepare data sets because they return one column per aggregated group. In general, a significant manual effort is required to build data sets, where a horizontal layout is required. We propose simple, yet powerful, methods to generate SQL code to return aggregated columns in a horizontal tabular layout, returning a set of numbers instead of one number per row. This new class of functions is called horizontal aggregations. Horizontal aggregations build data sets with a horizontal denormalized layout (e.g., point-dimension, observation variable, instance-feature), which is the standard layout required by most data mining algorithms. We propose three fundamental methods to evaluate horizontal aggregations: CASE, exploiting the programming CASE construct; SPJ, based on standard relational algebra operators (SPJ queries); and PIVOT, using the PIVOT operator, which is offered by some DBMSs. Experiments with large tables compare the proposed query evaluation methods. Our CASE method has similar speed to the PIVOT operator and it is much faster than the SPJ method. In general, the CASE and PIVOT methods exhibit linear scalability, whereas the SPJ method does not. For query optimization the distance computation and nearest cluster in the k-means are based on SQL. Workload balancing is the assignment of work to processors in a way that maximizes application performance. The process of load balancing can be generalized into four basic steps: 1. Monitoring processor load and state; 2. Exchanging workload and state information between processors; 3. Decision making; 4. Data migration. The decision phase is triggered when the load imbalance is detected to calculate optimal data redistribution. In the fourth and last phase, data migrates from overloaded processors to under-loaded ones.
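The CASE method of horizontal aggregation can be illustrated with a small generated query. This is a sketch under stated assumptions, not the authors' code generator: the helper `horizontal_sum_sql`, the `sales` table, its columns, and the choice of `ELSE 0` for missing combinations are all hypothetical, and SQLite is used only because it ships with Python.

```python
import sqlite3

def horizontal_sum_sql(table, group_col, pivot_col, value_col, pivot_values):
    """Build a query with one SUM(CASE ...) column per pivot value.

    Identifiers and values are interpolated directly, so this helper is
    for illustration only and must not be fed untrusted input.
    """
    cases = ",\n  ".join(
        f"SUM(CASE WHEN {pivot_col} = '{v}' THEN {value_col} ELSE 0 END) "
        f"AS {value_col}_{v}"
        for v in pivot_values
    )
    return (
        f"SELECT {group_col},\n  {cases}\n"
        f"FROM {table}\nGROUP BY {group_col}\nORDER BY {group_col}"
    )

# Hypothetical vertical (one row per observation) table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store TEXT, quarter TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("A", "Q1", 10), ("A", "Q2", 20), ("A", "Q1", 5), ("B", "Q2", 7)],
)

sql = horizontal_sum_sql("sales", "store", "quarter", "amount", ["Q1", "Q2"])
rows = conn.execute(sql).fetchall()
print(sql)
print(rows)  # one row per store, one aggregated column per quarter
```

Each group now occupies a single row with one column per pivot value, which is the horizontal layout the abstract describes as the standard input for most data mining algorithms.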


Author(s):  
Ali H. Gazala ◽  
Waseem Ahmad

Multi-Relational Data Mining (MRDM) is a growing research area that focuses on discovering hidden patterns and useful knowledge from relational databases. While the vast majority of data mining algorithms and techniques look for patterns in a flat single-table data representation, the sub-domain of MRDM looks for patterns that involve multiple tables (relations) from a relational database. This sub-domain has received increased research attention during the last two decades due to its wide range of possible applications. As a result of that growing attention, many successful multi-relational data mining algorithms and techniques have been presented. This chapter presents a comprehensive review of multi-relational data mining. It discusses the different approaches researchers have followed to explore the relational search space, while highlighting some of the most significant challenges facing researchers working in this sub-domain. The chapter also describes a number of MRDM systems that have been developed during the last few years and discusses some future research directions in this sub-domain.

