Predictive Data Mining

Author(s):  
Sotiris Kotsiantis ◽  
Panayotis Pintelas

Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. The idea is to build computer programs that sift through databases automatically, seeking regularities or patterns. Strong patterns, if found, will likely generalize to make accurate predictions on future data. Machine learning (ML) provides the technical basis of data mining. It is used to extract information from the raw data in databases, information that is expressed in a comprehensible form and can be used for a variety of purposes. Every instance in any data set used by ML algorithms is represented using the same set of features. The features may be continuous, categorical, or binary. If instances are given with known labels (the corresponding correct outputs), the learning is called supervised, in contrast to unsupervised learning, where instances are unlabeled (Kotsiantis & Pintelas, 2004). This work is concerned with regression problems, in which the output of an instance takes real values rather than the discrete values found in classification problems.
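As a minimal sketch of the supervised regression setting described in this abstract, the following fits a straight line to labeled instances by ordinary least squares and predicts a real-valued output. The single feature, the toy data, and the function names `fit_line` and `predict` are assumptions made for illustration; they are not taken from the chapter.

```python
from statistics import mean


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(model, x):
    """Return the real-valued prediction for one feature value."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical labeled instances: one continuous feature, one real-valued label.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
model = fit_line(xs, ys)
```

A classifier would instead map each instance to one of a finite set of labels; here the output can be any real number.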

Author(s):  
Alex Freitas ◽  
André C.P.L.F. de Carvalho

In machine learning and data mining, most work on classification problems deals with flat classification, where each instance is assigned to one of a set of possible classes and there is no hierarchical relationship between the classes. There are, however, more complex classification problems where the classes to be predicted are hierarchically related. This chapter presents a tutorial on the hierarchical classification techniques found in the literature. We also discuss how hierarchical classification techniques have been applied to the area of bioinformatics (particularly the prediction of protein function), where hierarchical classification problems are often found.
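The difference between flat and hierarchically related classes can be sketched with a small class tree. The class names below are hypothetical and do not come from any real protein-function taxonomy, and `shared_depth` is only one illustrative way of giving partial credit to a prediction that is wrong at the leaf but right higher up; the chapter may use different measures.

```python
# Hypothetical class hierarchy, stored as child -> parent.
PARENT = {
    "enzyme": "protein",
    "transporter": "protein",
    "kinase": "enzyme",
    "protease": "enzyme",
}


def path_from_root(label):
    """Return the list of classes from the root down to `label`."""
    path = [label]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return list(reversed(path))


def shared_depth(predicted, actual):
    """Count how many levels, starting at the root, the two class paths share."""
    depth = 0
    for p, a in zip(path_from_root(predicted), path_from_root(actual)):
        if p != a:
            break
        depth += 1
    return depth
```

In a flat setting, predicting `kinase` for a `protease` is simply wrong; in the hierarchy above the two classes still agree on `protein` and `enzyme`.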


2020 ◽  
Author(s):  
Ben Geoffrey A S ◽  
Akhil Sanker ◽  
Host Antony Davidd ◽  
Judith Gracia

Our work consists of a Python program that automatically mines the PubChem database to collect data associated with the coronavirus drug target replicase polyprotein 1ab (UniProt identifier: P0C6X7). The resulting data set comprises active compounds, their activity values (IC50) and their chemical/molecular descriptors, on which a machine learning based AutoQSAR algorithm is run to generate anti-coronaviral drug leads. The AutoQSAR algorithm involves feature selection, QSAR modelling, validation and prediction. Because the drug leads generated on each run reflect the constantly growing PubChem database, the program facilitates fast and dynamic anti-coronaviral drug lead generation. The program prints out the top anti-coronaviral drug leads after screening the PubChem library, which contains over a billion compounds. The interaction between the top drug lead compounds generated by the program and two coronaviral drug target proteins, 3-Cysteine-like protease (3CLPro) and Papain-like protease (PLpro), was studied and analysed using molecular docking tools. The compounds generated as drug leads showed favourable interaction with the drug target proteins, and we therefore recommend the program for anti-coronaviral drug lead generation, as it helps reduce the complexity of virtual screening and ushers in an age of automatic ease in drug lead generation. The leads generated by the program can be tested further for drug potential through in silico, in vitro and in vivo testing. The program is hosted, maintained and supported at the GitHub repository https://github.com/bengeof/Drug-Discovery-P0C6X7


Author(s):  
Zeynel Abidin Çil ◽  
Abdullah Caliskan

Hospital emergency departments are busy. In recent years, patient arrivals at emergency departments have risen significantly in Turkey, as in other countries around the world. The most important features of emergency services are uninterrupted service, rapid service delivery, and priority for emergency patients. However, patients who do not need immediate treatment sometimes visit this department for reasons such as working hours and short waiting times. This can reduce the efficiency and effectiveness of emergency departments. Machine learning methods allow computers to solve complex classification problems, and they have a wide range of applications, such as computational biology and computer vision. Classifying emergency and non-emergency patients is therefore vital to increasing the productivity of the department. This chapter seeks the best classifier for detecting emergency patients using a data set.


2020 ◽  
pp. 214-244
Author(s):  
Prithish Banerjee ◽  
Mark Vere Culp ◽  
Kenneth Jospeh Ryan ◽  
George Michailidis

This chapter presents some popular graph-based semi-supervised approaches. These techniques apply to classification and regression problems and can be extended to big data problems using recently developed anchor graph enhancements. The background necessary for understanding this chapter includes linear algebra and optimization. No prior knowledge of machine learning methods is necessary. An empirical demonstration of these techniques is also provided on real benchmark data sets.


Author(s):  
Kazheen Ismael Taher ◽  
Adnan Mohsin Abdulazeez ◽  
Dilovan Asaad Zebari

Rapid changes are occurring in our global ecosystem, and stresses on human well-being, such as climate regulation and food production, are increasing. Soil is a critical component of agriculture. This project aims to use Data Mining (DM) classification techniques to predict soil type from soil data. The DM classification strategies k-Nearest-Neighbors (k-NN), Random-Forest (RF), Decision-Tree (DT) and Naïve-Bayes (NB) are used to extract information from soil data, with the main purpose of finding the best machine learning classifier for soil classification. In this paper we apply several DM and machine learning algorithms, using the Weka program, to a data set that we collected, and then compare the experimental results with those of similar published work. According to the experimental results, k-NN achieves the highest accuracy, 84%, compared with NB (69.23%) and DT and RF (53.85%), and so outperforms the other classifiers. The findings imply that k-NN could be useful for accurate soil type classification in the agricultural domain.
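The k-NN classifier and the accuracy measure reported in this abstract can be sketched in a few lines of plain Python. This is an illustrative sketch only: the paper used the Weka program, and the two-feature toy data, the class names `sandy` and `clay`, and the choice `k=3` are assumptions, not the authors' data or settings.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    neighbours = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k=3):
    """Fraction of test instances whose predicted label matches the true label."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)


# Hypothetical soil samples: (feature vector, soil type).
train = [
    ((0.1, 0.2), "sandy"), ((0.2, 0.1), "sandy"), ((0.0, 0.3), "sandy"),
    ((0.9, 0.8), "clay"), ((0.8, 0.9), "clay"), ((1.0, 0.7), "clay"),
]
test = [((0.15, 0.15), "sandy"), ((0.85, 0.85), "clay")]
```

Comparing classifiers as the paper does amounts to computing the same accuracy figure for each model on the same held-out instances.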


2019 ◽  
Vol 5 (2) ◽  
pp. 108-119
Author(s):  
Yeslam Al-Saggaf ◽  
Amanda Davies

Purpose: The purpose of this paper is to discuss the design, application and findings of a case study in which a machine learning algorithm is used to identify the grievances expressed on Twitter in an Arabian context.
Design/methodology/approach: To understand the characteristics of the Twitter users who expressed the identified grievances, data mining techniques and social network analysis were utilised. The study extracted a total of 23,363 tweets, which were stored as a data set. The machine learning algorithm was applied to this data set, followed by a data mining process to explore the characteristics of the Twitter feed users. The network of the users was mapped, and the individual level of interactivity and the network density were calculated.
Findings: The machine learning algorithm revealed 12 themes, all of which were underpinned by the blockade of Qatar by a coalition of Arab countries. The data mining analysis revealed that the tweets could be grouped into three clusters; the main cluster included users with a large number of followers and friends but who did not mention other users in their tweets. The social network analysis revealed that whilst a large proportion of users engaged in direct messages with others, the network ties between them were not registered as strong.
Practical implications: Borum (2011) notes that invoking grievances is the first step in the radicalisation process. It is hoped that by understanding these grievances, the study will shed light on what radical groups could invoke to win the sympathy of aggrieved people.
Originality/value: In combination, the machine learning algorithm offered insights into the grievances expressed within the tweets in an Arabian context. The data mining and the social network analyses revealed the characteristics of the Twitter users, which may help in identifying and managing early intervention in radicalisation.
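The network density mentioned in this abstract is commonly computed as the ratio of observed ties to possible ties. The sketch below assumes an undirected network without self-loops, for which density is 2E / (N(N - 1)); the abstract does not state which definition the authors used, and the user names are hypothetical.

```python
def network_density(nodes, edges):
    """Density of an undirected graph: observed ties / possible ties."""
    n = len(nodes)
    if n < 2:
        return 0.0
    # Treat (a, b) and (b, a) as the same tie and ignore self-loops.
    unique_ties = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    return 2 * len(unique_ties) / (n * (n - 1))


# Hypothetical users and mention/reply ties between them.
users = ["a", "b", "c", "d"]
ties = [("a", "b"), ("b", "c"), ("b", "a")]
density = network_density(users, ties)
```

A density close to 0 is consistent with the weak ties the study reports, while a value of 1 would mean every pair of users is connected.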


Author(s):  
Yan Zhao ◽  
Yiyu Yao

Classification is one of the main tasks in machine learning, data mining, and pattern recognition. Compared with the extensively studied automated approaches, interactive approaches, which are centered on human users, are less explored. This chapter studies interactive classification at three levels. At the philosophical level, the motivations and a process-based framework of interactive classification are proposed. At the technical level, a granular computing model is suggested for re-examining not only existing classification problems, but also interactive classification problems. At the application level, an interactive classification system (ICS), using a granule network as the search space, is introduced. ICS allows multiple strategies for granule tree construction, and enhances the understanding and interpretation of the classification process. Interactive classification is complementary to the existing classification methods.


Data Mining ◽  
2013 ◽  
pp. 1534-1544
Author(s):  
M. Govindarajan ◽  
RM. Chandrasekaran

Data Mining is the use of algorithms to extract the information and patterns derived by the knowledge discovery in database process. Classification is often referred to as supervised learning because the classes are determined before examining the data. In many data mining applications that address classification problems, feature and model selection are considered key tasks. That is, appropriate input features of the classifier must be selected from a given set of possible features, and structure parameters of the classifier must be adapted with respect to these features and a given data set. This paper describes simultaneous feature selection and model selection for Multilayer Perceptron (MLP) classifiers. In order to reduce the optimization effort, various techniques are integrated that accelerate and improve the classifier significantly. The feasibility and the benefits of the proposed approach are demonstrated by means of a data mining problem: Direct Marketing in Customer Relationship Management. It is shown that, compared to an earlier MLP technique, the run time of the proposed MLP classifiers is reduced on both learning data and validation data. Similarly, the error rate is relatively low on both learning data and validation data in the direct marketing data set. The algorithm is independent of specific applications, so many ideas and solutions can be transferred to other classifier paradigms.


Author(s):  
Stamatios-Aggelos N. Alexandropoulos ◽  
Sotiris B. Kotsiantis ◽  
Michael N. Vrahatis

A large variety of issues influence the success of data mining on a given problem. Two primary and important issues are the representation and the quality of the dataset. Specifically, if much redundant and unrelated or noisy and unreliable information is presented, then knowledge discovery becomes a very difficult problem. It is well known that data preparation steps require significant processing time in machine learning tasks. It would be very helpful if there were preprocessing algorithms with the same reliable and effective performance across all datasets, but this is impossible. To this end, we present the most well-known and widely used up-to-date algorithms for each step of data preprocessing in the framework of predictive data mining.


2020 ◽  
Vol 2020 ◽  
pp. 1-8
Author(s):  
Omer F. Akmese ◽  
Gul Dogan ◽  
Hakan Kor ◽  
Hasan Erbay ◽  
Emre Demir

Acute appendicitis is one of the most common emergency diseases in general surgery clinics. It is especially common between the ages of 10 and 30 years. Additionally, approximately 7% of the entire population is diagnosed with acute appendicitis at some time in their lives and requires surgery. The study aims to develop an easy, fast, and accurate estimation method for early acute appendicitis diagnosis using machine learning algorithms. Retrospective clinical records were analyzed with predictive data mining models. The predictive success of the models obtained by various machine learning algorithms was compared. A total of 595 clinical records were used in the study, including 348 males (58.49%) and 247 females (41.51%). The gradient boosted trees algorithm achieved the best result, with a prediction accuracy of 95.31%. In this study, an estimation method based on machine learning was developed to identify individuals with acute appendicitis. It is thought that this method will benefit patients with signs of appendicitis, especially in hospital emergency departments.

