Predictive Data Mining

Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. The idea is to build computer programs that sift through databases automatically, seeking regularities or patterns. Strong patterns, if found, will likely generalize to make accurate predictions on future data. Machine learning (ML) provides the technical basis of data mining. It is used to extract information from the raw data in databases, information that is expressed in a comprehensible form and can be used for a variety of purposes. Every instance in any data set used by ML algorithms is represented using the same set of features. The features may be continuous, categorical, or binary. If instances are given with known labels (the corresponding correct outputs), the learning is called supervised, in contrast to unsupervised learning, where instances are unlabeled (Kotsiantis & Pintelas, 2004). This work is concerned with regression problems, in which the output for each instance takes real values rather than the discrete values of classification problems.
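As a minimal illustration of supervised learning on a regression problem, the sketch below fits a straight line to labeled instances that have a single continuous feature, using ordinary least squares. The data and the function name fit_line are made up for illustration; the work summarized above does not say which regression method it uses.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical labeled instances: one feature value and its real-valued output.
features = [0.0, 1.0, 2.0, 3.0]
outputs = [1.0, 3.0, 5.0, 7.0]

slope, intercept = fit_line(features, outputs)
prediction = slope * 4.0 + intercept  # predicted output for an unseen instance
```

A classification problem would differ only in the kind of output: a discrete class label instead of a real number.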


A Tutorial on Hierarchical Classification with Applications in Bioinformatics

Intelligent Information Technologies ◽

10.4018/978-1-59904-941-0.ch006 ◽

2011 ◽

pp. 114-140

Author(s):

Alex Freitas ◽

André C.P.L.F. de Carvalho

Keyword(s):

Machine Learning ◽

Data Mining ◽

Protein Function ◽

Hierarchical Classification ◽

Classification Problems ◽

Classification Techniques ◽

Hierarchical Relationship

In machine learning and data mining, most work on classification problems deals with flat classification, where each instance is assigned to one of a set of possible classes and there is no hierarchical relationship between the classes. There are, however, more complex classification problems where the classes to be predicted are hierarchically related. This chapter presents a tutorial on the hierarchical classification techniques found in the literature. We also discuss how hierarchical classification techniques have been applied to the area of bioinformatics (particularly the prediction of protein function), where hierarchical classification problems are often found.
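The difference between flat and hierarchical classes can be sketched in a few lines. The hierarchy below is hypothetical (the labels are illustrative, not taken from the chapter), and the helper names path_to_root and shared_depth are assumptions. The sketch shows two consequences of a class hierarchy: predicting a class implies all of its ancestors, and a wrong prediction can still agree with the true class at the upper levels.

```python
# Hypothetical class hierarchy (child -> parent); labels are illustrative only.
PARENT = {
    "enzyme": None,
    "hydrolase": "enzyme",
    "protease": "hydrolase",
    "transferase": "enzyme",
    "kinase": "transferase",
}


def path_to_root(label):
    """Return the label followed by all of its ancestors, most specific first."""
    path = []
    while label is not None:
        path.append(label)
        label = PARENT[label]
    return path


def shared_depth(predicted, actual):
    """Count how many levels, from the root down, two labels agree on."""
    pred = path_to_root(predicted)[::-1]  # root first
    true = path_to_root(actual)[::-1]
    depth = 0
    for p, t in zip(pred, true):
        if p != t:
            break
        depth += 1
    return depth
```

In a flat setting, predicting "protease" for a true "kinase" is simply an error. With the hierarchy, shared_depth reports that the two labels still agree at the root level, which is the kind of partial agreement that a hierarchical evaluation can take into account.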


A Program for Automated Data Mining of PubChem to Screen a Billion Compounds and Generate by Machine Learning Based AutoQSAR Algorithm Anti-Corona Viral Drug Leads (Replicase Polyprotein 1ab Inhibitors) and in Silico Study of the Top Drug Lead Compounds

10.26434/chemrxiv.12423638.v1 ◽

2020 ◽

Author(s):

Ben Geoffrey A S ◽

Akhil Sanker ◽

Host Antony Davidd ◽

Judith Gracia

Keyword(s):

Machine Learning ◽

Data Mining ◽

In Silico ◽

Drug Target ◽

Data Set ◽

Lead Compounds ◽

Pubchem Database ◽

Drug Lead ◽

Lead Generation ◽

Drug Leads

Our work consists of a Python program that automatically mines the PubChem database to collect data associated with the corona virus drug target replicase polyprotein 1ab (UniProt identifier: P0C6X7). The resulting data set comprises active compounds, their activity values (IC50), and their chemical/molecular descriptors, and a machine learning based AutoQSAR algorithm is run on it to generate anti-corona viral drug leads. The AutoQSAR algorithm involves feature selection, QSAR modelling, validation, and prediction. Because the drug leads generated on each run reflect the constantly growing PubChem database, the program supports fast and dynamic anti-corona viral drug lead generation. The program prints out the top anti-corona viral drug leads after screening the PubChem library, which contains over a billion compounds. The interaction of the top drug lead compounds generated by the program with two corona viral drug target proteins, 3-Cysteine like Protease (3CLPro) and Papain like protease (PLpro), was studied and analysed using molecular docking tools. The compounds generated as drug leads by the program showed favourable interaction with the drug target proteins, and thus we recommend the program for use in anti-corona viral drug lead generation, as it helps reduce the complexity of virtual screening and ushers in an age of automatic ease in drug lead generation. The leads generated by the program can be tested further for drug potential through In Silico, In Vitro and In Vivo testing. The program is hosted, maintained and supported at the GitHub repository: https://github.com/bengeof/Drug-Discovery-P0C6X7


Machine Learning Applications for Classification Emergency and Non-Emergency Patients

Advances in Healthcare Information Systems and Administration - Computational Intelligence and Soft Computing Applications in Healthcare Management Science ◽

10.4018/978-1-7998-2581-4.ch006 ◽

2020 ◽

pp. 104-120

Author(s):

Zeynel Abidin Çil ◽

Abdullah Caliskan

Keyword(s):

Machine Learning ◽

Emergency Departments ◽

Emergency Services ◽

Classification Problems ◽

Data Set ◽

Emergency Patients ◽

Efficiency And Effectiveness ◽

Machine Learning Applications ◽

Increase Productivity ◽

Short Time

Emergency departments of hospitals are busy. In recent years, patient arrivals at emergency departments have risen significantly in Turkey, as in other countries. The most important features of emergency services are uninterrupted service, short service times, and priority for emergency patients. However, patients who do not need immediate treatment sometimes come to this department for reasons such as its working hours and short waiting times. This situation can reduce the efficiency and effectiveness of emergency departments. Computers, meanwhile, can solve complex classification problems using machine learning methods, which have a wide range of applications, such as computational biology and computer vision. Classifying patients as emergency or non-emergency is therefore vital to increasing the productivity of the department. This chapter tries to find the best classifier for detecting emergency patients by utilizing a data set.


Graph-Based Semi-Supervised Learning With Big Data

Cognitive Analytics ◽

10.4018/978-1-7998-2460-2.ch012 ◽

2020 ◽

pp. 214-244

Author(s):

Prithish Banerjee ◽

Mark Vere Culp ◽

Kenneth Joseph Ryan ◽

George Michailidis

Keyword(s):

Machine Learning ◽

Big Data ◽

Supervised Learning ◽

Prior Knowledge ◽

Linear Algebra ◽

Real Data ◽

Data Set ◽

Regression Problems ◽

Classification And Regression ◽

Empirical Demonstration

This chapter presents some popular graph-based semi-supervised approaches. These techniques apply to classification and regression problems and can be extended to big data problems using recently developed anchor graph enhancements. The background necessary for understanding this chapter includes linear algebra and optimization. No prior knowledge of machine learning methods is necessary. An empirical demonstration of these techniques on real benchmark data sets is also provided.
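The core idea behind graph-based semi-supervised learning can be sketched with a simple iterative label propagation scheme: labeled nodes keep their known values, and each unlabeled node repeatedly takes the average of its neighbors. This is a generic textbook-style sketch, not the chapter's own algorithm, and it does not include the anchor graph enhancements mentioned above. The function name propagate_labels, the unweighted graph, and the starting value of 0.5 are assumptions made for illustration.

```python
def propagate_labels(edges, labeled, n_nodes, n_iter=200):
    """Spread known node values over an undirected, unweighted graph.

    edges   -- list of (a, b) node index pairs
    labeled -- dict mapping node index to its known value (kept fixed)
    """
    neighbors = {i: [] for i in range(n_nodes)}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)

    # Unlabeled nodes start from a neutral value (an arbitrary choice).
    scores = [labeled.get(i, 0.5) for i in range(n_nodes)]
    for _ in range(n_iter):
        updated = scores[:]
        for i in range(n_nodes):
            if i in labeled or not neighbors[i]:
                continue  # keep known labels and isolated nodes unchanged
            updated[i] = sum(scores[j] for j in neighbors[i]) / len(neighbors[i])
        scores = updated
    return scores


# Hypothetical chain graph 0-1-2-3-4 with only the two end nodes labeled.
chain = [(0, 1), (1, 2), (2, 3), (3, 4)]
scores = propagate_labels(chain, labeled={0: 0.0, 4: 1.0}, n_nodes=5)
```

On this chain the scores settle close to an even interpolation between the two labeled ends. The same scores can be read as regression outputs or thresholded to obtain class labels, which matches the claim that these techniques apply to both problem types.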


Data Mining Classification Algorithms for Analyzing Soil Data

Asian Journal of Research in Computer Science ◽

10.9734/ajrcos/2021/v8i230196 ◽

2021 ◽

pp. 17-28

Author(s):

Kazheen Ismael Taher ◽

Adnan Mohsin Abdulazeez ◽

Dilovan Asaad Zebari

Keyword(s):

Machine Learning ◽

Data Mining ◽

Soil Type ◽

Soil Classification ◽

Well Being ◽

Experimental Result ◽

K Nearest Neighbors ◽

Data Set ◽

Climate Regulation ◽

Learning Classifier

Rapid changes are occurring in our global ecosystem, and stresses on human well-being, such as climate regulation and food production, are increasing. Soil is a critical component of agriculture. The project aims to use Data Mining (DM) classification techniques to predict soil data. DM classification strategies such as k-Nearest-Neighbors (k-NN), Random-Forest (RF), Decision-Tree (DT) and Naïve-Bayes (NB) are used to predict soil type. These classifier algorithms are used to extract information from soil data. The main purpose of using these classifiers is to find the optimal machine learning classifier for soil classification. In this paper, we apply several DM and machine learning algorithms, using the Weka program, to the data set that we collected, and then compare the experimental results with those of other papers that did similar work. According to the experimental results, k-NN achieves the highest accuracy, 84%, compared with NB (69.23%) and DT and RF (53.85%). As a result, it outperforms the other classifiers. The findings imply that k-NN could be useful for accurate soil type classification in the agricultural domain.
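For readers unfamiliar with k-NN, the classifier reported as most accurate above, the following is a minimal sketch of the technique in plain Python. The paper itself used Weka, so this is not the authors' setup, and the soil features, values, and labels below are made up for illustration only.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict the label of query by majority vote among its k nearest neighbors.

    train -- list of (features, label) pairs, features being numeric tuples
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k=3):
    """Fraction of test instances whose predicted label matches the true one."""
    correct = sum(knn_predict(train, x, k) == y for x, y in test)
    return correct / len(test)


# Hypothetical soil samples: (pH, sand fraction) -> soil type.
train = [
    ((6.5, 0.20), "clay"), ((6.8, 0.25), "clay"), ((6.6, 0.30), "clay"),
    ((5.0, 0.80), "sandy"), ((5.2, 0.85), "sandy"), ((4.9, 0.75), "sandy"),
]
test = [((6.7, 0.22), "clay"), ((5.1, 0.80), "sandy")]

score = accuracy(train, test, k=3)
```

Comparing classifiers as the paper does amounts to computing such an accuracy for each algorithm on the same held-out data and ranking the results.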


Understanding the expression of grievances in the Arabic Twitter-sphere using machine learning

Journal of Criminological Research Policy and Practice ◽

10.1108/jcrpp-02-2019-0009 ◽

2019 ◽

Vol 5 (2) ◽

pp. 108-119

Author(s):

Yeslam Al-Saggaf ◽

Amanda Davies

Keyword(s):

Machine Learning ◽

Data Mining ◽

Social Network ◽

Network Analysis ◽

Learning Algorithm ◽

Machine Learning Algorithm ◽

Data Set ◽

Content Type ◽

The Social ◽

Twitter Users

Purpose: The purpose of this paper is to discuss the design, application and findings of a case study in which a machine learning algorithm is utilised to identify the grievances in Twitter in an Arabian context. Design/methodology/approach: To understand the characteristics of the Twitter users who expressed the identified grievances, data mining techniques and social network analysis were utilised. The study extracted a total of 23,363 tweets and these were stored as a data set. The machine learning algorithm was applied to this data set, followed by a data mining process to explore the characteristics of the Twitter feed users. The network of the users was mapped and the individual level of interactivity and network density were calculated. Findings: The machine learning algorithm revealed 12 themes, all of which were underpinned by the coalition of Arab countries' blockade of Qatar. The data mining analysis revealed that the tweets could be grouped into three clusters; the main cluster included users with a large number of followers and friends but who did not mention other users in their tweets. The social network analysis revealed that whilst a large proportion of users engaged in direct messages with others, the network ties between them were not registered as strong. Practical implications: Borum (2011) notes that invoking grievances is the first step in the radicalisation process. It is hoped that by understanding these grievances, the study will shed light on what radical groups could invoke to win the sympathy of aggrieved people. Originality/value: In combination, the machine learning algorithm offered insights into the grievances expressed within the tweets in an Arabian context. The data mining and the social network analyses revealed the characteristics of the Twitter users highlighting identifying and managing early intervention of radicalisation.
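Network density, one of the measures mentioned above, is the number of ties present divided by the number of ties possible. The sketch below uses the standard definition; the paper does not say whether its network was treated as directed, so the directed flag, the function name network_density, and the example users are assumptions for illustration.

```python
def network_density(n_nodes, edges, directed=True):
    """Ratio of ties present to ties possible, ignoring self-loops and duplicates."""
    if n_nodes < 2:
        return 0.0
    unique = set()
    for a, b in edges:
        if a == b:
            continue  # a user mentioning themselves is not counted as a tie
        unique.add((a, b) if directed else tuple(sorted((a, b))))
    possible = n_nodes * (n_nodes - 1)
    if not directed:
        possible //= 2
    return len(unique) / possible


# Hypothetical mention network among four users; "d" has no ties.
mentions = [("a", "b"), ("b", "a"), ("a", "c")]
directed_density = network_density(4, mentions, directed=True)
undirected_density = network_density(4, mentions, directed=False)
```

A low density, as in this toy network, is consistent with the kind of finding reported above, where many users interact but the ties between them are weak or sparse.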


Interactive Classification Using a Granule Network

Novel Approaches in Cognitive Informatics and Natural Intelligence ◽

10.4018/978-1-60566-170-4.ch016 ◽

2011 ◽

pp. 235-245

Author(s):

Yan Zhao ◽

Yiyu Yao

Keyword(s):

Machine Learning ◽

Data Mining ◽

Pattern Recognition ◽

Classification System ◽

Search Space ◽

Classification Methods ◽

Classification Problems ◽

Computing Model ◽

Tree Construction ◽

Learning Data

Classification is one of the main tasks in machine learning, data mining, and pattern recognition. Compared with the extensively studied automation approaches, the interactive approaches, centered on human users, are less explored. This chapter studies interactive classification at three levels. At the philosophical level, the motivations and a process-based framework of interactive classification are proposed. At the technical level, a granular computing model is suggested for re-examining not only existing classification problems, but also interactive classification problems. At the application level, an interactive classification system (ICS), using a granule network as the search space, is introduced. ICS allows multiple strategies for granule tree construction, and enhances the understanding and interpretation of the classification process. Interactive classification is complementary to the existing classification methods.


A Hybrid Multilayer Perceptron Neural Network for Direct Marketing

Data Mining ◽

10.4018/978-1-4666-2455-9.ch080 ◽

2013 ◽

pp. 1534-1544

Author(s):

M. Govindarajan ◽

RM. Chandrasekaran

Keyword(s):

Data Mining ◽

Model Selection ◽

Multilayer Perceptron ◽

Direct Marketing ◽

Customer Relationship ◽

Classification Problems ◽

Validation Data ◽

Data Set ◽

Knowledge Discovery In Database ◽

Learning Data

Data Mining is the use of algorithms to extract the information and patterns derived by the knowledge discovery in database process. It is often referred to as supervised learning because the classes are determined before examining the data. In many data mining applications that address classification problems, feature and model selection are considered key tasks. That is, appropriate input features of the classifier must be selected from a given set of possible features, and the structure parameters of the classifier must be adapted with respect to these features and a given data set. This paper describes simultaneous feature selection and model selection for Multilayer Perceptron (MLP) classifiers. In order to reduce the optimization effort, various techniques are integrated that accelerate and improve the classifier significantly. The feasibility and the benefits of the proposed approach are demonstrated by means of a data mining problem: Direct Marketing in Customer Relationship Management. It is shown that, compared to the earlier MLP technique, the run time of the proposed MLP classifiers is reduced on both learning data and validation data. Similarly, the error rate is relatively low on both learning data and validation data in the direct marketing data set. The algorithm is independent of specific applications, so many ideas and solutions can be transferred to other classifier paradigms.


Data preprocessing in predictive data mining

The Knowledge Engineering Review ◽

10.1017/s026988891800036x ◽

2019 ◽

Vol 34 ◽

Cited By ~ 9

Author(s):

Stamatios-Aggelos N. Alexandropoulos ◽

Sotiris B. Kotsiantis ◽

Michael N. Vrahatis

Keyword(s):

Machine Learning ◽

Data Mining ◽

Processing Time ◽

Data Preprocessing ◽

Difficult Problem ◽

Data Preparation ◽

Learning Tasks ◽

Effective Performance ◽

Predictive Data Mining

A large variety of issues influence the success of data mining on a given problem. Two primary and important issues are the representation and the quality of the dataset. Specifically, if much redundant and unrelated or noisy and unreliable information is presented, then knowledge discovery becomes a very difficult problem. It is well-known that data preparation steps require significant processing time in machine learning tasks. It would be very helpful and quite useful if there were various preprocessing algorithms with the same reliable and effective performance across all datasets, but this is impossible. To this end, we present the most well-known and widely used up-to-date algorithms for each step of data preprocessing in the framework of predictive data mining.


The Use of Machine Learning Approaches for the Diagnosis of Acute Appendicitis

Emergency Medicine International ◽

10.1155/2020/7306435 ◽

2020 ◽

Vol 2020 ◽

pp. 1-8

Author(s):

Omer F. Akmese ◽

Gul Dogan ◽

Hakan Kor ◽

Hasan Erbay ◽

Emre Demir

Keyword(s):

Machine Learning ◽

Data Mining ◽

Acute Appendicitis ◽

Learning Algorithms ◽

Estimation Method ◽

Machine Learning Algorithms ◽

Accurate Estimation ◽

Learning Approaches ◽

Predictive Data Mining ◽

Clinical Records

Acute appendicitis is one of the most common emergency diseases in general surgery clinics. It is especially common between the ages of 10 and 30 years. Additionally, approximately 7% of the entire population is diagnosed with acute appendicitis at some time in their lives and requires surgery. The study aims to develop an easy, fast, and accurate estimation method for early acute appendicitis diagnosis using machine learning algorithms. Retrospective clinical records were analyzed with predictive data mining models. The predictive success of the models obtained by various machine learning algorithms was compared. A total of 595 clinical records were used in the study, including 348 males (58.49%) and 247 females (41.51%). The gradient boosted trees algorithm was found to achieve the best result, with a prediction accuracy of 95.31%. In this study, an estimation method based on machine learning was developed to identify individuals with acute appendicitis. It is thought that this method will benefit patients with signs of appendicitis, especially in emergency departments in hospitals.
