Data Mining Classification Algorithms for Analyzing Soil Data

Rapid changes are occurring in the global ecosystem, and the stresses on services that support human well-being, such as climate regulation and food production, are increasing. Soil is a critical component of agriculture. This project uses data mining (DM) classification techniques to predict soil type from soil data. Four classifiers are evaluated: k-Nearest Neighbors (k-NN), Random Forest (RF), Decision Tree (DT) and Naïve Bayes (NB). The main purpose is to find the best machine learning classifier for soil classification. In this paper we apply these DM and machine learning algorithms, using the Weka program, to a data set that we collected, and then compare the experimental results with those of similar published work. According to the experimental results, k-NN achieves the highest accuracy at 84%, compared with NB (69.23%) and DT and RF (53.85%), and so outperforms the other classifiers. The findings imply that k-NN could be useful for accurate soil type classification in the agricultural domain.
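
To make the k-NN comparison concrete, here is a minimal sketch of k-NN classification and accuracy scoring in plain Python. It is not the authors' Weka workflow: the two features (pH and sand fraction), the sample values and the names `knn_predict`, `accuracy`, `train` and `held_out` are all hypothetical, chosen only for illustration.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    # train is a list of (feature_vector, label) pairs; distance is Euclidean.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def accuracy(train, held_out, k=3):
    """Fraction of held-out samples whose predicted label matches the true label."""
    hits = sum(knn_predict(train, x, k) == y for x, y in held_out)
    return hits / len(held_out)

# Hypothetical samples: (pH, sand fraction) -> soil type. Values are invented.
train = [
    ((5.5, 0.80), "sandy"), ((5.8, 0.75), "sandy"), ((6.0, 0.85), "sandy"),
    ((7.2, 0.20), "clay"), ((7.5, 0.25), "clay"), ((7.0, 0.15), "clay"),
]
held_out = [((5.7, 0.78), "sandy"), ((7.3, 0.22), "clay")]

print(accuracy(train, held_out, k=3))
```

In practice the features would usually be rescaled first, since k-NN distances are dominated by whichever feature has the largest numeric range.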

Early prediction of chronic disease using an efficient machine learning algorithm through adaptive probabilistic divergence based feature selection approach

International Journal of Pervasive Computing and Communications ◽

10.1108/ijpcc-04-2020-0018 ◽

2020 ◽

Vol ahead-of-print (ahead-of-print) ◽

Cited By ~ 2

Author(s):

Sandeepkumar Hegde ◽

Monica R. Mundada

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Chronic Disease ◽

Learning Algorithm ◽

Predictive Ability ◽

Experimental Result ◽

Data Set ◽

Content Type ◽

Learning Classifier ◽

Feature Selection Approach

Purpose According to the World Health Organization, the contribution of chronic disease is expected to rise to 73% of all deaths by 2025, and it accounts for 60% of the global burden of disease. These diseases persist for a long time, are almost incurable and can only be controlled. Cardiovascular disease, chronic kidney disease (CKD) and diabetes mellitus are considered the three major chronic diseases, and the risk they pose to adults increases with age. CKD is considered a major disease among these: overall, 10% of the world's population is affected by CKD, and this is likely to double by 2030. The paper aims to propose a novel feature selection approach which, in combination with a machine learning algorithm, can predict chronic disease early and with high accuracy. Hence, a novel adaptive probabilistic divergence-based feature selection (APDFS) algorithm is proposed in combination with the hyper-parameterized logistic regression model (HLRM) for the early prediction of chronic disease. Design/methodology/approach The APDFS algorithm explicitly handles the features associated with the class label through relevance and redundancy analysis. It applies statistical divergence-based information theory to identify the relationship between the distant features of the chronic disease data set. The data set used in the experiments was obtained from several medical labs and hospitals in Karkala taluk, India. The HLRM is used as the machine learning classifier. The predictive ability of the framework is compared with various algorithms and across various chronic disease data sets. The experimental results illustrate that the proposed framework is efficient and achieves competitive results compared with existing work in most cases.
Findings The performance of the proposed framework is validated using metrics such as recall, precision, F1 measure and ROC. Its predictive performance is analyzed on data sets for various chronic diseases, such as CKD, diabetes and heart disease. The diagnostic ability of the proposed approach is demonstrated by comparing its results with existing algorithms. The experimental figures show that the proposed framework performed exceptionally well in the early prediction of CKD, with an accuracy of 91.6. Originality/value The capability of machine learning algorithms depends on feature selection (FS) algorithms identifying the relevant traits in the data set, which affects the predictive result. FS is the process of choosing the relevant features from the data set by removing redundant and irrelevant ones. Although many approaches have already been proposed toward this objective, they are computationally complex because they select features in a one-step scheme. The proposed APDFS algorithm instead handles feature selection in two separate indices, which reduces its computational complexity to O(nk+1).
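
The idea of scoring a feature's relevance with a statistical divergence can be sketched as follows. This is not the APDFS algorithm, whose details the abstract does not give: it is an assumed, simplified stand-in that ranks discrete features by the Jensen-Shannon divergence between their class-conditional distributions and does no redundancy analysis. The feature names and records are hypothetical.

```python
import math
from collections import Counter

def distribution(values, support):
    """Empirical distribution over `support`, with add-one smoothing."""
    counts = Counter(values)
    total = len(values) + len(support)
    return [(counts[v] + 1) / total for v in support]

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; q must be non-zero wherever p is."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and 0 for identical distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def rank_features(rows, labels):
    """Rank features by how strongly their distribution differs between classes."""
    scores = {}
    for name in rows[0]:
        support = sorted({r[name] for r in rows})
        pos = [r[name] for r, y in zip(rows, labels) if y == 1]
        neg = [r[name] for r, y in zip(rows, labels) if y == 0]
        scores[name] = js_divergence(distribution(pos, support),
                                     distribution(neg, support))
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical records: "albumin" tracks the label, "eye_colour" does not.
rows = [
    {"albumin": "high", "eye_colour": "brown"},
    {"albumin": "high", "eye_colour": "blue"},
    {"albumin": "high", "eye_colour": "brown"},
    {"albumin": "high", "eye_colour": "blue"},
    {"albumin": "low", "eye_colour": "brown"},
    {"albumin": "low", "eye_colour": "blue"},
    {"albumin": "low", "eye_colour": "brown"},
    {"albumin": "low", "eye_colour": "blue"},
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

print(rank_features(rows, labels))
```

A feature whose distribution is the same in both classes scores zero and falls to the bottom of the ranking; a fuller method would also discard features that duplicate one already selected.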

A Comparative Analysis and Predicting for Breast Cancer Detection Based on Data Mining Models

Asian Journal of Research in Computer Science ◽

10.9734/ajrcos/2021/v8i430209 ◽

2021 ◽

pp. 45-59

Author(s):

Shler Farhad Khorshid ◽

Adnan Mohsin Abdulazeez ◽

Amira Bibo Sallow

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Data Mining ◽

Nearest Neighbors ◽

Performance Comparison ◽

Machine Learning Algorithms ◽

Support Vector ◽

K Nearest Neighbors ◽

Data Set ◽

Wide Range

Breast cancer is one of the most common diseases among women, accounting for many deaths each year. Even though cancer can be treated and cured in its early stages, many patients are diagnosed at a late stage. Data mining is the process of finding or extracting information from massive databases or data sets, and it is a field of computer science with considerable potential. It covers a wide range of areas, one of which is classification, which can itself be accomplished using a variety of methods or algorithms. Using MATLAB, five classification algorithms were compared. This paper presents a performance comparison among the classifiers Support Vector Machine (SVM), Logistic Regression (LR), K-Nearest Neighbors (K-NN), Weighted K-Nearest Neighbors (Weighted K-NN) and Gaussian Naïve Bayes (Gaussian NB). The data set was taken from the UCI Machine Learning Repository. The main objective of this study is to classify breast cancer cases using machine learning algorithms and to compare the algorithms by accuracy. The results show that Weighted K-NN has the highest accuracy (96.7%) among all the classifiers.
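
The difference between plain and weighted K-NN can be illustrated with a short sketch. The abstract does not say which weighting scheme was used, so inverse-distance weighting is assumed here as one common choice; the feature vectors and the name `weighted_knn_predict` are hypothetical.

```python
import math
from collections import defaultdict

def weighted_knn_predict(train, query, k=3, eps=1e-9):
    """Distance-weighted k-NN: each of the k nearest neighbours votes with weight 1/distance."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = defaultdict(float)
    for features, label in nearest:
        # eps avoids division by zero when the query coincides with a training point.
        votes[label] += 1.0 / (math.dist(features, query) + eps)
    return max(votes, key=votes.get)

# Hypothetical 2-D samples: one very close neighbour and two distant ones.
train = [
    ((0.1, 0.0), "malignant"),
    ((1.0, 0.0), "benign"),
    ((1.2, 0.0), "benign"),
]

print(weighted_knn_predict(train, (0.0, 0.0), k=3))
```

With k = 3 an unweighted majority vote would return "benign" (two votes to one), whereas the weighted vote favours the single, much closer "malignant" neighbour.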

A Program for Automated Data Mining of PubChem to Screen a Billion Compounds and Generate by Machine Learning Based AutoQSAR Algorithm Anti-Corona Viral Drug Leads (Replicase Polyprotein 1ab Inhibitors) and in Silico Study of the Top Drug Lead Compounds

10.26434/chemrxiv.12423638.v1 ◽

2020 ◽

Author(s):

Ben Geoffrey A S ◽

Akhil Sanker ◽

Host Antony Davidd ◽

Judith Gracia

Keyword(s):

Machine Learning ◽

Data Mining ◽

In Silico ◽

Drug Target ◽

Data Set ◽

Lead Compounds ◽

Pubchem Database ◽

Drug Lead ◽

Lead Generation ◽

Drug Leads

Our work is composed of a Python program for automatic data mining of the PubChem database. The program collects data associated with the corona virus drug target replicase polyprotein 1ab (UniProt identifier: P0C6X7), namely a data set of active compounds, their activity values (IC50) and their chemical/molecular descriptors, and runs a machine learning based AutoQSAR algorithm on that data set to generate anti-corona viral drug leads. The AutoQSAR algorithm involves feature selection, QSAR modelling, validation and prediction. Because the drug leads generated each time the program is run reflect the constantly growing PubChem database, the program facilitates fast and dynamic anti-corona viral drug lead generation. The program prints out the top anti-corona viral drug leads after screening the PubChem library, which contains over a billion compounds. The interaction of the top drug lead compounds generated by the program with two corona viral drug target proteins, 3-Cysteine like Protease (3CLPro) and Papain like protease (PLpro), was studied and analysed using molecular docking tools. The compounds generated as drug leads showed favourable interaction with the drug target proteins, and we therefore recommend the program for anti-corona viral drug lead generation, as it helps reduce the complexity of virtual screening and ushers in an age of automatic ease in drug lead generation. The leads generated by the program can be tested further for drug potential through in silico, in vitro and in vivo testing. The program is hosted, maintained and supported at the GitHub repository https://github.com/bengeof/Drug-Discovery-P0C6X7
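
QSAR models are commonly fitted to pIC50 rather than raw IC50, where pIC50 = -log10(IC50 in mol/L), so that more potent compounds get larger values. Whether the authors' program performs this conversion is not stated in the abstract; the sketch below only illustrates the transformation, and the compound names and IC50 values are invented rather than taken from PubChem.

```python
import math

def pic50_from_nanomolar(ic50_nm):
    """Convert an IC50 given in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_nm * 1e-9)

# Hypothetical compounds with IC50 values in nM.
compounds = {"compound_a": 50.0, "compound_b": 1000.0, "compound_c": 5.0}

# Rank from most to least potent (highest pIC50 first).
ranked = sorted(compounds,
                key=lambda name: pic50_from_nanomolar(compounds[name]),
                reverse=True)
print(ranked)
```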

Learning Classifier Systems: The Rise of Genetics-Based Machine Learning in Biomedical Data Mining

Methods in Biomedical Informatics ◽

10.1016/b978-0-12-401678-1.00009-9 ◽

2014 ◽

pp. 265-311 ◽

Cited By ~ 1

Author(s):

Ryan J. Urbanowicz ◽

Jason H. Moore

Keyword(s):

Machine Learning ◽

Data Mining ◽

Learning Classifier Systems ◽

Biomedical Data ◽

Classifier Systems ◽

Learning Classifier

Learning classifier systems for data mining: a comparison of XCS with other classifiers for the Forest Cover data set

Proceedings of the International Joint Conference on Neural Networks, 2003. ◽

10.1109/ijcnn.2003.1223681 ◽

2004 ◽

Cited By ~ 6

Author(s):

A.J. Bagnall ◽

G.C. Cawley

Keyword(s):

Data Mining ◽

Forest Cover ◽

Learning Classifier Systems ◽

Classifier Systems ◽

Data Set ◽

Learning Classifier

An Integration of Cardiovascular Event Data and Machine Learning Models for Cardiac Arrest Predictions

International Journal of Health Sciences and Pharmacy ◽

10.47992/ijhsp.2581.6411.0061 ◽

2021 ◽

pp. 55-71

Author(s):

Krishna Prasad K ◽

Aithal P. S. ◽

Navin N. Bappalige ◽

Soumya S

Keyword(s):

Machine Learning ◽

Cardiac Arrest ◽

Area Under The Curve ◽

Computer Applications ◽

Data Sets ◽

Cardiovascular Risks ◽

Data Set ◽

Average Area ◽

Learning Classifier ◽

Tree Classifier

Purpose: Predicting and then preventing cardiac arrest in an ICU patient is highly challenging, even for the most skilled professional. The data collected for a patient in the ICU are huge, and selecting the portion of data needed to prevent cardiac arrest within a limited time is decisive; analysing and predicting from such large data requires an effective system. An effective integration of computer applications and cardiovascular data is necessary to predict cardiovascular risks, and machine learning is a suitable technique for managing patients at risk of cardiac arrest. Methodology: In this work we collected and merged three data sets: the Cleveland data set of US patients with 303 records, the Statlog data set of UK patients with 270 records, and the Hungarian data set of Hungary, Switzerland with 617 records. The combined data set consists of 11 common features and 1190 records. Findings/Results: The feature extraction phase extracts 7 features that contribute to the event. The extracted features are used to train the selected machine learning classifier models, and the results are then evaluated on test data. The Extra Tree Classifier has the highest average area under the curve (AUC), 0.957. Originality: The originality lies in the analysis of this combined data set with machine learning classifier models, in which the Extra Tree Classifier achieves the highest average AUC of 0.957. Paper Type: Experimental Research. Keywords: Cardiac, Machine Learning, Random Forest, XBOOST, ROC AUC, ST Slope.
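
The AUC metric reported above can be computed directly from its pairwise definition: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below is a generic illustration, not the authors' evaluation code; the labels and scores are invented.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # A positive scored above a negative counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical test labels (1 = event) and classifier scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]

print(roc_auc(labels, scores))
```

An "average AUC" such as the one quoted would then be the mean of this value over several folds or runs; the abstract does not say which.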

Soil Classification and Harvest Proposal Implemented using Machine Learning Techniques

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.h6110.1091220 ◽

2020 ◽

Vol 9 (12) ◽

pp. 19-22

Keyword(s):

Machine Learning ◽

Soil Type ◽

Soil Classification ◽

Machine Learning Algorithms ◽

Machine Learning Techniques ◽

Support Vector ◽

Nearest Neighbour ◽

The People ◽

Learning Techniques ◽

Crop Type

Agriculture is the major source of livelihood for the people of India and is an important part of the country's economy. India is one of the countries that suffer from natural calamities such as drought and flood, which can destroy crops and cause heavy losses for farmers. Predicting the crop type can help farmers cultivate a crop suited to their particular soil type. Soil is a major factor in agriculture, and several types of soil are found in the country. To classify the soil type, we need to understand the characteristics of the soil. Data mining and machine learning are emerging technologies in agriculture and horticulture. Classifying the soil type and suggesting fertilizers that can improve the growth of the crop cultivated in that soil play a major role in agriculture. To that end, several machine learning algorithms, namely support vector machine (SVM), k-Nearest Neighbour (k-NN) and logistic regression, are explored to classify the soil type.

Data mining methods of healthy indoor climate coefficients for comfortable well-being

Ochrona Srodowiska i Zasobów Naturalnych ◽

10.2478/oszn-2018-0013 ◽

2018 ◽

Vol 29 (3) ◽

pp. 7-12

Author(s):

Grit Behrens ◽

Klaus Schlender ◽

Florian Fehring

Keyword(s):

Machine Learning ◽

Data Mining ◽

Indoor Air ◽

Well Being ◽

Machine Learning Algorithms ◽

Indoor Climate ◽

Measurement And Analysis ◽

Scientific Project ◽

Mining Methods ◽

Analysis System

This article describes ‘Smart Monitoring’, a measurement and analysis system currently under development that is used in a scientific project on healthy indoor air coefficients, and the processing of the collected data for machine learning algorithms. The aim is to reduce CO2 emissions caused by poor ventilation habits in the building sector after the renovation of older buildings.

Symptoms Based Multiple Disease Prediction Model using Machine Learning Approach

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.i9364.0710921 ◽

2021 ◽

Vol 10 (9) ◽

pp. 67-72

Author(s):

Talasila Bhanuteja ◽

Kilaru Venkata Narendra Kumar ◽

Kolli Sai Poornachand ◽

Chennupati Ashish ◽

...

Keyword(s):

Machine Learning ◽

Well Being ◽

Local Area ◽

Learning Approach ◽

Disease Prediction ◽

Decision Tree Classifier ◽

Data Set ◽

Tree Classifier ◽

Machine Learning Approach ◽

Disease Forecast

The development and exploitation of several prominent data mining techniques in various real-world application areas (for example trade, medical management and natural science) has led to the use of such techniques in machine learning (ML) settings to extract useful pieces of information from the specified data in healthcare communities, biomedical fields and so on. Accurate analysis of clinical data sets benefits early disease prediction, patient care and community services. ML has been used effectively in a variety of technologies, including disease prediction. The objective of building a classifier system using ML models is to help address health-related issues by assisting physicians in predicting and diagnosing diseases at an early stage. Sample data of 4920 patient records diagnosed with 41 diseases was selected for analysis. The dependent variable was composed of the 41 diseases. Of 132 independent variables (symptoms), 95 closely related to the diseases were selected and optimized. This work presents the disease prediction system developed using machine learning algorithms such as Random Forest, Decision Tree Classifier and LightGBM. The paper presents a comparative study of the results of the above-mentioned algorithms.
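
A decision tree over binary symptoms chooses, at each node, the symptom whose yes/no split best separates the diseases. The sketch below shows that single step using Gini impurity, a common split criterion; the abstract does not say which criterion the authors used, and the records, symptom names and function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_symptom(records, symptoms):
    """Return the symptom whose present/absent split has the lowest weighted Gini impurity."""
    def split_impurity(symptom):
        present = [disease for found, disease in records if symptom in found]
        absent = [disease for found, disease in records if symptom not in found]
        n = len(records)
        return len(present) / n * gini(present) + len(absent) / n * gini(absent)
    return min(symptoms, key=split_impurity)

# Hypothetical records: (set of observed symptoms, diagnosed disease).
records = [
    ({"fever", "cough"}, "flu"),
    ({"fever", "cough"}, "flu"),
    ({"fever", "rash"}, "measles"),
    ({"fever", "rash"}, "measles"),
]

print(best_symptom(records, ["fever", "cough"]))
```

Here "fever" appears in every record and so separates nothing, while "cough" splits the records into two pure groups and is chosen.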

Understanding the expression of grievances in the Arabic Twitter-sphere using machine learning

Journal of Criminological Research Policy and Practice ◽

10.1108/jcrpp-02-2019-0009 ◽

2019 ◽

Vol 5 (2) ◽

pp. 108-119

Author(s):

Yeslam Al-Saggaf ◽

Amanda Davies

Keyword(s):

Machine Learning ◽

Data Mining ◽

Social Network ◽

Network Analysis ◽

Learning Algorithm ◽

Machine Learning Algorithm ◽

Data Set ◽

Content Type ◽

The Social ◽

Twitter Users

Purpose The purpose of this paper is to discuss the design, application and findings of a case study in which a machine learning algorithm is used to identify grievances on Twitter in an Arabian context. Design/methodology/approach To understand the characteristics of the Twitter users who expressed the identified grievances, data mining techniques and social network analysis were utilised. The study extracted a total of 23,363 tweets, which were stored as a data set. The machine learning algorithm was applied to this data set, followed by a data mining process to explore the characteristics of the Twitter users. The network of the users was mapped, and the individual level of interactivity and the network density were calculated. Findings The machine learning algorithm revealed 12 themes, all of which were underpinned by the blockade of Qatar by a coalition of Arab countries. The data mining analysis revealed that the tweets could be grouped into three clusters; the main cluster included users with a large number of followers and friends but who did not mention other users in their tweets. The social network analysis revealed that whilst a large proportion of users engaged in direct messages with others, the network ties between them were not registered as strong. Practical implications Borum (2011) notes that invoking grievances is the first step in the radicalisation process. It is hoped that by understanding these grievances, the study will shed light on what radical groups could invoke to win the sympathy of aggrieved people. Originality/value In combination, the machine learning algorithm offered insights into the grievances expressed within the tweets in an Arabian context. The data mining and social network analyses revealed the characteristics of the Twitter users, which is relevant to identifying and managing early intervention in radicalisation.
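
Network density and a per-user interactivity measure can be sketched as follows. The abstract does not define how either was computed, so this assumes a directed mention graph, the standard density formula E / (N(N - 1)) for directed graphs without self-loops, and in/out degree as the interactivity measure. The user names and edges are invented.

```python
def directed_density(edges, users):
    """Density of a directed graph without self-loops: distinct edges / (n * (n - 1))."""
    n = len(users)
    if n < 2:
        return 0.0
    unique_edges = {(a, b) for a, b in edges if a != b}
    return len(unique_edges) / (n * (n - 1))

def interactivity(edges, users):
    """Per user: (number of distinct users they mention, number of distinct users mentioning them)."""
    mentions_out = {u: set() for u in users}
    mentions_in = {u: set() for u in users}
    for a, b in edges:
        if a != b:
            mentions_out[a].add(b)
            mentions_in[b].add(a)
    return {u: (len(mentions_out[u]), len(mentions_in[u])) for u in users}

# Hypothetical mention edges: (author, mentioned user).
users = ["u1", "u2", "u3", "u4"]
mentions = [("u1", "u2"), ("u2", "u1"), ("u3", "u1")]

print(directed_density(mentions, users))
print(interactivity(mentions, users))
```

A density near 0 indicates that few of the possible ties between users are present, which is consistent with the weak ties reported in the findings.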
