Performance Implications of Knowledge Discovery Techniques in Databases

Author(s):  
Balaji Rajagopalan ◽  
Ravindra Krovi

This chapter introduces knowledge discovery techniques as a means of identifying critical trends and patterns for business decision support. It suggests that effective implementation of these techniques requires a careful assessment of the various data mining tools and algorithms available. Both statistical and machine-learning based algorithms have been widely applied to discover knowledge from data. In this chapter we describe some of these algorithms and investigate their relative performance for classification problems. Simulation based results support the proposition that machine-learning algorithms outperform their statistical counterparts, albeit only under certain conditions. Further, the authors hope that the discussion on performance related issues will foster a better understanding of the application and appropriateness of knowledge discovery techniques.

Author(s):  
Balaji Rajagopalan ◽  
Ravi Krovi

Data mining is the process of sifting through the mass of organizational (internal and external) data to identify patterns critical for decision support. Successful implementation of the data mining effort requires a careful assessment of the various tools and algorithms available. The basic premise of this study is that machine-learning algorithms, which are assumption free, should outperform their traditional counterparts when mining business databases. The objective of this study is to test this proposition by investigating the performance of the algorithms for several scenarios. The scenarios are based on simulations designed to reflect the extent to which typical statistical assumptions are violated in the business domain. The results of the computational experiments support the proposition that machine learning algorithms generally outperform their statistical counterparts under certain conditions. These can be used as prescriptive guidelines for the applicability of data mining techniques.


Entropy ◽  
2021 ◽  
Vol 23 (4) ◽  
pp. 485 ◽  
Author(s):  
Carlos A. Palacios ◽  
José A. Reyes-Suárez ◽  
Lorena A. Bearzotti ◽  
Víctor Leiva ◽  
Carolina Marchant

Data mining is employed to extract useful information and to detect patterns in often large data sets, and is closely related to knowledge discovery in databases and data science. In this investigation, we formulate models based on machine learning algorithms to extract relevant information for predicting student retention at various levels, using higher education data and specifying the relevant variables involved in the modeling. Then, we utilize this information to help the process of knowledge discovery. We predict student retention at each of three levels during their first, second, and third years of study, obtaining models with an accuracy that exceeds 80% in all scenarios. These models allow us to adequately predict the level at which dropout occurs. The machine learning algorithms used in this work include decision trees, k-nearest neighbors, logistic regression, naive Bayes, random forest, and support vector machines, of which the random forest technique performs the best. We detect that secondary educational score and the community poverty index are important predictive variables, which have not been previously reported in educational studies of this type. The dropout assessment at various levels reported here is valid for higher education institutions around the world with similar conditions to the Chilean case, where dropout rates affect the efficiency of such institutions. Having the ability to predict dropout based on students' data enables these institutions to take preventive measures to avoid dropouts. In the case study, balancing the majority and minority classes improves the performance of the algorithms.
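The class balancing mentioned at the end of this abstract can be illustrated with a minimal sketch. The abstract does not say which balancing method was used; random oversampling of the minority class is assumed here purely for illustration, and the function name `oversample_minority` and the toy data are hypothetical.

```python
import random
from collections import Counter


def oversample_minority(rows, labels, seed=0):
    """Randomly duplicate rows of smaller classes until every class
    has as many rows as the largest class (assumed balancing method)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Candidate rows to duplicate for this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical toy data: 6 retained students and 2 dropouts.
rows = [[5.1], [6.0], [5.8], [6.3], [5.5], [6.1], [3.9], [4.2]]
labels = ["retained"] * 6 + ["dropout"] * 2
balanced_rows, balanced_labels = oversample_minority(rows, labels)
class_counts = Counter(balanced_labels)
```

Oversampling should be applied to the training split only; duplicating rows before splitting would leak copies of the same student into the test set.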


Author(s):  
Anantvir Singh Romana

Accurate diagnosis of a patient's disease is critical: it may alter the subsequent treatment and increase the chances of survival. Machine learning techniques have been instrumental in disease detection and are currently used in various classification problems because of their accurate prediction performance. Different techniques may yield different accuracies, so it is imperative to use the method best suited to the task. This research provides a comparative analysis of Support Vector Machine, Naïve Bayes, J48 Decision Tree, and neural network classifiers on breast cancer and diabetes datasets.


Student performance management is one of the key pillars of higher education institutions, since it directly impacts students' career prospects and college rankings. This paper follows the path of learning analytics and educational data mining by applying machine learning techniques to student data to identify students who are more likely to fail university examinations, so that the needed interventions can be provided to improve student performance. The paper uses a data mining approach with 10-fold cross-validation to classify students based on predictors that are demographic and social characteristics of the students. It compares five popular machine learning algorithms (REPTree, JRip, Random Forest, Random Tree, and Naive Bayes) on overall classifier accuracy as well as class-specific indicators, i.e., precision, recall, and F-measure. The results show that the REPTree algorithm outperformed the other algorithms in classifying students who are more likely to fail the examinations.
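The class-specific indicators named in this abstract can be computed directly from predicted and true labels. The following is a minimal sketch using the standard definitions of precision, recall, and F-measure; the function name `class_metrics` and the toy labels are hypothetical and are not taken from the paper.

```python
def class_metrics(y_true, y_pred, positive):
    """Return (precision, recall, f_measure) for one class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        # F-measure is the harmonic mean of precision and recall.
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return precision, recall, f_measure


# Hypothetical predictions for eight students.
y_true = ["fail", "fail", "fail", "pass", "pass", "pass", "pass", "pass"]
y_pred = ["fail", "fail", "pass", "fail", "pass", "pass", "pass", "pass"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f_measure = class_metrics(y_true, y_pred, positive="fail")
```

In this toy case the overall accuracy is 0.75, while precision and recall for the "fail" class are both 2/3, which shows why class-specific indicators are reported alongside accuracy when the class of interest is the smaller one.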


2021 ◽  
Vol 297 ◽  
pp. 01032
Author(s):  
Harish Kumar ◽  
Anshal Prasad ◽  
Ninad Rane ◽  
Nilay Tamane ◽  
Anjali Yeole

Phishing is a common attack that tricks credulous people into disclosing their personal information. It is a type of cybercrime in which fake sites lure victims into giving up sensitive data. This paper deals with methods for detecting phishing websites by analyzing various features of URLs using machine learning techniques. It discusses detection methods based on lexical features, host properties, and page importance properties. We consider various data mining algorithms for evaluating the features in order to better understand the structure of URLs that spread phishing. To protect end users from visiting these sites, we can try to identify phishing URLs by analyzing their lexical and host-based features. A particular challenge in this domain is that criminals constantly devise new strategies to counter our defense measures. To succeed in this contest, we need machine learning algorithms that continually adapt to new examples and features of phishing URLs.
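As a rough illustration of what lexical URL features might look like, the sketch below extracts a few simple ones with Python's standard library. The abstract does not list the features the authors used; the feature set, the function name `lexical_features`, and the example URL are all assumptions chosen for illustration.

```python
import re
from urllib.parse import urlparse


def lexical_features(url):
    """Extract a small, assumed set of lexical features from a URL string."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_hyphens_in_host": host.count("-"),
        "has_at_symbol": "@" in url,
        # A raw IPv4 address in place of a domain name (pattern check only).
        "host_is_ipv4": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
        # Number of non-empty path segments.
        "path_depth": len([seg for seg in parsed.path.split("/") if seg]),
        "uses_https": parsed.scheme == "https",
    }


# Hypothetical example URL.
features = lexical_features("http://192.168.0.1/secure-login/update/account.php")
```

A dictionary like this can be turned into a numeric vector and passed to any of the classifiers compared in such studies; host-based and page importance properties would need external lookups and are not covered here.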


2021 ◽  
Vol 336 ◽  
pp. 06024
Author(s):  
Nan Liang ◽  
Qing Liang ◽  
Fenglei Ji

Traditional Chinese Medicine (TCM) has attracted increasing attention due to its remarkable effects in treating diseases, and Chinese herbal medicine (CHM), rich in natural active ingredients, is an important part of TCM. Researchers are trying multiple analytical methods to extract more valuable information about CHM and to reveal the principles of TCM, and machine learning plays an important role in these studies. Knowledge discovery of CHM using machine learning mainly covers quality control of CHM, network pharmacology in CHM, and medical prescriptions composed of CHM, with the aims of understanding TCM better, providing more efficient methods for the production of CHM, and finding novel treatments for diseases that are currently incurable. In this paper, we summarize the basic ideas of frequently used classification and clustering machine learning algorithms, introduce pre-processing algorithms commonly used to simplify and accelerate the machine learning procedure, present the current status of machine learning applications in knowledge discovery of CHM, and discuss the challenges and future trends of machine learning in CHM. We believe the paper provides valuable insight for newcomers who want to apply machine learning to the study of CHM and to catch up on the current state of related research.


Author(s):  
Alex Freitas ◽  
André C.P.L.F. de Carvalho

In machine learning and data mining, most work on classification problems deals with flat classification, where each instance is assigned to one of a set of possible classes and there is no hierarchical relationship between the classes. There are, however, more complex classification problems where the classes to be predicted are hierarchically related. This chapter presents a tutorial on the hierarchical classification techniques found in the literature. We also discuss how hierarchical classification techniques have been applied to the area of bioinformatics (particularly the prediction of protein function), where hierarchical classification problems are often found.


Author(s):  
Kağan Okatan

All these types of analytics have long been used to answer business questions as the principal methods of investigating data warehouses. Data mining and business intelligence systems in particular help decision makers reach the information they want. Many existing systems are trying to keep up with a phenomenon that has changed the rules of the game in recent years: the undeniable attraction of 'big data'. Evaluating big data, especially the data generated by social media, is among the most current issues in business analytics, and it demonstrates the importance of integrating machine learning into business analytics. This section introduces the prominent machine learning algorithms that are increasingly used for business analytics and emphasizes their application areas.

