Price Prediction for Pre-Owned Cars Using Ensemble Machine Learning Techniques

Pre-owned (used) cars have large markets across the globe. Before acquiring a used car, the buyer should be able to decide whether the asking price is fair. Several factors, including mileage, year, model, make, and run, among others, need to be considered before buying any pre-owned car. Both the seller and the buyer should get a fair deal. This paper presents a system implemented to predict a fair price for any pre-owned car. The system predicts used-car prices well for the Mumbai region. Two ensemble machine learning techniques, the Random Forest algorithm and eXtreme Gradient Boosting (XGBoost), are used to develop models that predict an appropriate price for used cars. The techniques are compared to determine the better one. Both methods provided comparable performance, although XGBoost outperformed the random forest algorithm: the Root Mean Squared Error was 3.44 for random forest and 0.53 for XGBoost.
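
The comparison above rests on Root Mean Squared Error (RMSE). As a minimal sketch of how that metric is computed, the following uses invented prices and two hypothetical sets of model predictions; none of these numbers come from the paper, and the paper's own units and data are not given in the abstract.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical prices in arbitrary units, invented for illustration only.
actual_prices = [4.5, 6.0, 3.2, 8.1]
model_a = [4.0, 6.5, 3.0, 9.0]  # stand-in for a weaker model
model_b = [4.4, 6.1, 3.2, 8.0]  # stand-in for a stronger model

print(rmse(actual_prices, model_a), rmse(actual_prices, model_b))
```

A lower RMSE means the predictions sit closer to the observed prices on average, with large errors penalized more heavily than small ones.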

Learning from Imbalanced Educational Data Using Ensemble Machine Learning Algorithms

Webology ◽

10.14704/web/v18si01/web18053 ◽

2021 ◽

Vol 18 (Special Issue 01) ◽

pp. 183-195

Author(s):

Thingbaijam Lenin ◽

N. Chandrasekaran

Keyword(s):

Machine Learning ◽

Random Forest ◽

Missing Values ◽

Machine Learning Techniques ◽

Gradient Boosting ◽

Adaptive Boosting ◽

Stochastic Gradient Boosting ◽

Ensemble Machine Learning ◽

Learning Techniques ◽

Student’s Performance

Student academic performance is one of the most important parameters for evaluating the standard of any institute. It has become of paramount importance for any institute to identify students at risk of underperforming, failing, or even dropping out of the course. Machine learning techniques may be used to develop a model for predicting a student’s performance as early as the time of admission. The task is challenging, however, because the educational data available for modelling are usually imbalanced. We explore ensemble machine learning techniques, namely a bagging algorithm, random forest (rf), and boosting algorithms, adaptive boosting (adaboost), stochastic gradient boosting (gbm), and extreme gradient boosting (xgbTree), in an attempt to develop a model for predicting student performance at a private university in Meghalaya using three categories of data: demographic, prior academic record, and personality. The collected data are found to be highly imbalanced and also contain missing values. We employ the k-nearest neighbor (knn) data imputation technique to handle the missing values. The models are developed on the imputed data with 10-fold cross-validation and are evaluated using precision, specificity, recall, and kappa metrics. As the data are imbalanced, we avoid using accuracy as the evaluation metric and instead use balanced accuracy and F-score. We compare the ensemble techniques with the single classifier C4.5. The best result is provided by random forest and adaboost, with an F-score of 66.67%, balanced accuracy of 75%, and accuracy of 96.94%.
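
To illustrate why accuracy is avoided on imbalanced data, the sketch below computes accuracy, balanced accuracy, and F-score from a confusion matrix. The counts are hypothetical: they were chosen only because they happen to reproduce the three figures quoted above, and the abstract does not give the study's actual confusion matrix.

```python
def imbalance_metrics(tp, fn, fp, tn):
    """Return accuracy, balanced accuracy, and F-score for a binary task.

    Assumes every denominator is non-zero (no guard for empty classes).
    """
    recall = tp / (tp + fn)        # sensitivity on the minority class
    specificity = tn / (tn + fp)   # recall on the majority class
    precision = tp / (tp + fp)
    f_score = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "balanced_accuracy": (recall + specificity) / 2,
        "f_score": f_score,
    }


# Hypothetical counts: 6 minority-class students, 92 majority-class students.
metrics = imbalance_metrics(tp=3, fn=3, fp=0, tn=92)
for name, value in metrics.items():
    print(f"{name}: {value * 100:.2f}%")
```

With these counts the classifier misses half of the minority class, yet accuracy stays near 97% because the majority class dominates; balanced accuracy and F-score expose the gap.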

PCirc: random forest-based plant circRNA identification software

BMC Bioinformatics ◽

10.1186/s12859-020-03944-1 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Shuwei Yin ◽

Xiao Tian ◽

Jingjing Zhang ◽

Peisen Sun ◽

Guanglin Li

Keyword(s):

Machine Learning ◽

Random Forest ◽

Circular RNA ◽

Learning Model ◽

Open Reading Frames ◽

Machine Learning Techniques ◽

RNA Seq ◽

Random Forest Algorithm ◽

Machine Learning Model ◽

Identification Software

Background: Circular RNA (circRNA) is a novel type of RNA with a closed-loop structure. Increasing numbers of circRNAs are being identified in plants and animals, and recent studies have shown that circRNAs play an important role in gene regulation. Identifying circRNAs from increasing amounts of RNA-seq data is therefore very important. However, traditional circRNA recognition methods have limitations. In recent years, emerging machine learning techniques have provided a good approach for identifying circRNAs in animals. Applying these animal-derived features to plant circRNAs is infeasible, however, because the characteristics of plant circRNA sequences differ from those of animal circRNAs. For example, plants are extremely rich in splicing signals and transposable elements, and their sequence conservation, in rice for example, is far lower than that in mammals. To solve these problems and better identify circRNAs in plants, there is an urgent need for circRNA recognition software that uses machine learning based on the characteristics of plant circRNAs. Results: In this study, we built a software program named PCirc that uses a machine learning method to predict plant circRNAs from RNA-seq data. First, we extracted different features, including open reading frames, numbers of k-mers, and splicing junction sequence coding, from rice circRNA and lncRNA data. Second, we trained a machine learning model with the random forest algorithm, using tenfold cross-validation on the training set. Third, we evaluated our classification according to accuracy, precision, and F1 score, and all scores on the model test data were above 0.99. Fourth, we tested our model on data from other plants and obtained good results, with accuracy scores above 0.8. Finally, we packaged the machine learning model and the programming scripts into locally run circular RNA prediction software, PCirc (https://github.com/Lilab-SNNU/Pcirc).
Conclusion: Based on rice circRNA and lncRNA data, a machine learning model for plant circRNA recognition was constructed in this study using the random forest algorithm, and the model can also be applied to circRNA recognition in other plants such as Arabidopsis thaliana and maize. The machine learning model and the programming scripts used in this study are packaged into locally run circRNA prediction software, PCirc, which is convenient for plant circRNA researchers to use.
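
One of the feature types named above is the number of k-mers. The following is a minimal sketch of overlapping k-mer counting that yields a fixed-length feature vector. It assumes a DNA-style alphabet (A, C, G, T) and a hypothetical function name, `kmer_counts`; the actual feature encoding used by the tool may differ.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k):
    """Count overlapping k-mers and return them in a fixed lexicographic order.

    Assumption: the sequence uses the alphabet A, C, G, T. K-mers containing
    any other character are counted internally but not reported.
    """
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return [counts[kmer] for kmer in all_kmers]


# Toy sequence invented for illustration; yields 4**2 = 16 features for k = 2.
features = kmer_counts("ACGTACGT", 2)
print(features)
```

Because every sequence maps to a vector of the same length (4 to the power k), vectors like this can be fed directly to a tree-based classifier such as a random forest.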

Improving Heart Disease Prediction Using Random Forest and AdaBoost Algorithms

International Journal of Online and Biomedical Engineering (iJOE) ◽

10.3991/ijoe.v17i11.24781 ◽

2021 ◽

Vol 17 (11) ◽

pp. 60

Author(s):

Halima EL Hamdaoui ◽

Said Boujraf ◽

Nour El Houda Chaoui ◽

Badr Alami ◽

Mustapha Maaroufi

Keyword(s):

Machine Learning ◽

Heart Disease ◽

Decision Support ◽

Random Forest ◽

Clinical Decision Support ◽

Clinical Decision ◽

Machine Learning Algorithms ◽

Machine Learning Techniques ◽

Random Forest Algorithm ◽

Adaboost Algorithm

Heart disease is a major cause of death worldwide, so its diagnosis and prediction remain essential. Clinical decision support systems based on machine learning techniques have become the primary tool to assist clinicians and contribute to automated diagnosis. This paper aims to predict heart disease using the Random Forest algorithm enhanced with the boosting algorithm AdaBoost. The model is trained and tested on the University of California Irvine (UCI) Cleveland and Statlog heart disease datasets using the 14 most relevant attributes. The results show that the Random Forest algorithm combined with the AdaBoost algorithm achieved higher accuracy than the Random Forest algorithm alone: 96.16% and 95.98%, respectively. We compare our proposed model with reported machine learning classifiers, and the results support the efficiency and validity of our model. In addition, the proposed model achieved high accuracy compared with existing studies in the literature, which confirms that a clinical decision support system based on machine learning algorithms could be used to predict heart disease.

ANALYSIS OF SINGLE AND ENSEMBLE MACHINE LEARNING CLASSIFIERS FOR PHISHING ATTACKS DETECTION

International Journal of Computer Systems & Software Engineering ◽

10.15282/ijsecs.7.2.2021.5.0088 ◽

2021 ◽

Vol 7 (2) ◽

pp. 44-49

Author(s):

Oyelakin A. M ◽

Alimi O. M ◽

Mustapha I. O ◽

Ajiboye I. K

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Random Forest ◽

Decision Trees ◽

Random Forest Algorithm ◽

Ensemble Techniques ◽

Learning Classifiers ◽

Phishing Attacks ◽

Ensemble Machine Learning

Phishing attacks have been used in different ways to harvest the confidential information of unsuspecting internet users. To stem the tide of phishing-based attacks, several machine learning techniques have been proposed in the past. However, few studies have investigated both single and ensemble machine learning-based models for the classification of phishing attacks. This study carried out a performance analysis of selected single and ensemble machine learning (ML) classifiers in phishing classification. The focus is to investigate how these algorithms behave in the classification of phishing attacks in the chosen dataset. Logistic Regression and Decision Trees were chosen as single learning classifiers, while a simple voting technique and Random Forest were used as the ensemble machine learning algorithms. Accuracy, precision, recall, and F1-score were used as performance metrics. The Logistic Regression algorithm recorded an accuracy of 0.86, precision of 0.89, recall of 0.87, and F1-score of 0.81. Similarly, the Decision Trees classifier achieved an accuracy of 0.87, precision of 0.83, recall of 0.88, and F1-score of 0.81. The voting ensemble achieved an accuracy of 0.92, precision of 0.90, recall of 0.92, and F1-score of 0.92. The Random Forest algorithm recorded 0.98, 0.97, 0.98, and 0.97 for accuracy, precision, recall, and F1-score, respectively. In the experimental analyses, the Random Forest algorithm outperformed the simple voting ensemble and the two single algorithms used for phishing URL detection. The study established that the ensemble techniques used in the experiments are more effective for phishing URL identification than the single classifiers.
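
The abstract does not say how its simple voting ensemble combines the base classifiers. One common reading is hard majority voting, sketched below under that assumption; the three prediction lists are invented and stand in for the outputs of three hypothetical base models.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per sample (hard voting).

    Ties are broken in favor of the label that appears first among the votes,
    which follows from Counter.most_common preserving insertion order.
    """
    combined = []
    for sample_votes in zip(*predictions_per_model):
        label, _ = Counter(sample_votes).most_common(1)[0]
        combined.append(label)
    return combined


# Invented predictions for four URLs (1 = phishing, 0 = legitimate).
model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]

ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble)
```

Using an odd number of binary classifiers, as here, avoids ties altogether.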

Comparations of Supervised Machine Learning Techniques in Predicting the Classification of the Household’s Welfare Status

Journal Pekommas ◽

10.30818/jpkm.2019.2040105 ◽

2019 ◽

Vol 4 (1) ◽

pp. 43

Author(s):

Nfn Nofriani

Keyword(s):

Machine Learning ◽

Random Forest ◽

Social Assistance ◽

Supervised Machine Learning ◽

Machine Learning Techniques ◽

Support Vector ◽

Random Forest Algorithm ◽

K Nearest Neighbor ◽

Learning Techniques

Poverty has been a major problem for most countries around the world, including Indonesia. One approach to eradicating poverty is the equitable distribution of social assistance to target households based on the Integrated Database of social assistance. This study compared several well-known supervised machine learning techniques, namely the Naïve Bayes Classifier, Support Vector Machines, K-Nearest Neighbor Classification, the C4.5 Algorithm, and the Random Forest Algorithm, for predicting household welfare status classification, using the Integrated Database as a case study. The main objective of this study was to choose the best supervised machine learning approach for predicting the classification of a household’s welfare status based on attributes in the Integrated Database. The results showed that the Random Forest Algorithm performed best.

Ensemble Machine Learning Assisted Reservoir Characterization Using Field Production Data–An Offshore Field Case Study

Energies ◽

10.3390/en14041052 ◽

2021 ◽

Vol 14 (4) ◽

pp. 1052

Author(s):

Baozhong Wang ◽

Jyotsna Sharma ◽

Jianhua Chen ◽

Patricia Persaud

Keyword(s):

Machine Learning ◽

Random Forest ◽

Reservoir Characterization ◽

Time Lapse ◽

Production Data ◽

Oil Saturation ◽

Ensemble Machine Learning ◽

Input Parameters ◽

Saturation Profiles ◽

Field Production

Estimation of fluid saturation is an important step in dynamic reservoir characterization. Machine learning techniques have been increasingly used in recent years for reservoir saturation prediction workflows. However, most of these studies require input parameters derived from cores, petrophysical logs, or seismic data, which may not always be readily available. Additionally, very few studies incorporate the production data, which is an important reflection of the dynamic reservoir properties and also typically the most frequently and reliably measured quantity throughout the life of a field. In this research, the random forest ensemble machine learning algorithm is implemented that uses the field-wide production and injection data (both measured at the surface) as the only input parameters to predict the time-lapse oil saturation profiles at well locations. The algorithm is optimized using feature selection based on feature importance score and Pearson correlation coefficient, in combination with geophysical domain-knowledge. The workflow is demonstrated using the actual field data from a structurally complex, heterogeneous, and heavily faulted offshore reservoir. The random forest model captures the trends from three and a half years of historical field production, injection, and simulated saturation data to predict future time-lapse oil saturation profiles at four deviated well locations with over 90% R-square, less than 6% Root Mean Square Error, and less than 7% Mean Absolute Percentage Error, in each case.
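
The model above is scored with R-square and Mean Absolute Percentage Error (MAPE), among other metrics. As a minimal sketch, the functions below use the standard textbook definitions of both; whether the study computed them in exactly this way is an assumption, and the saturation values are invented.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100 * sum(ratios) / len(ratios)


# Invented oil saturation fractions at four time steps, for illustration only.
observed = [0.60, 0.55, 0.50, 0.45]
forecast = [0.58, 0.56, 0.49, 0.46]

print(r_squared(observed, forecast), mape(observed, forecast))
```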

Classification and photometric redshift estimation of quasars in photometric surveys

Proceedings of the International Astronomical Union ◽

10.1017/s1743921320001829 ◽

2020 ◽

Vol 15 (S359) ◽

pp. 40-41

Author(s):

L. M. Izuti Nakazono ◽

C. Mendes de Oliveira ◽

N. S. T. Hirata ◽

S. Jeram ◽

A. Gonzalez ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Nearest Neighbour ◽

Random Forest Algorithm ◽

Photometric Redshift ◽

Using Data

We present a machine learning methodology to separate quasars from galaxies and stars using data from S-PLUS in the Stripe-82 region. In terms of quasar classification, we achieved 95.49% for precision and 95.26% for recall using a Random Forest algorithm. For photometric redshift estimation, we obtained a precision of 6% using k-Nearest Neighbour.

Cocrystal Prediction Using Machine Learning Models and Descriptors

Applied Sciences ◽

10.3390/app11031323 ◽

2021 ◽

Vol 11 (3) ◽

pp. 1323

Author(s):

Medard Edmund Mswahili ◽

Min-Jeong Lee ◽

Gati Lother Martin ◽

Junghyun Kim ◽

Paul Kim ◽

...

Keyword(s):

Machine Learning ◽

Academic Research ◽

Pharmaceutical Research ◽

Machine Learning Techniques ◽

Learning Models ◽

Pharmaceutical Ingredients ◽

Learning Techniques ◽

Comparable Performance ◽

Selection Algorithms ◽

Machine Learning Models

Cocrystals are of much interest in industrial applications as well as academic research, and screening suitable coformers for active pharmaceutical ingredients is the most crucial and challenging step in cocrystal development. Recently, machine learning techniques have been attracting researchers in many fields, including pharmaceutical research areas such as quantitative structure-activity/property relationships. In this paper, we develop machine learning models to predict cocrystal formation. We extract descriptor values from the simplified molecular-input line-entry system (SMILES) representations of compounds and compare the machine learning models in experiments on our collected data of 1476 instances. We found that the artificial neural network shows great potential, as it has the best accuracy, sensitivity, and F1 score. We also found that the model achieved comparable performance with about half of the descriptors, chosen by feature selection algorithms. We believe that this will contribute to faster and more accurate cocrystal development.

Prediction of Short-Distance Aerial Movement of Phakopsora pachyrhizi Urediniospores Using Machine Learning

Phytopathology ◽

10.1094/phyto-04-17-0138-fi ◽

2017 ◽

Vol 107 (10) ◽

pp. 1187-1198 ◽

Cited By ~ 7

Author(s):

L. Wen ◽

C. R. Bowen ◽

G. L. Hartman

Keyword(s):

Machine Learning ◽

Random Forest ◽

Short Distance ◽

Soybean Rust ◽

Machine Learning Techniques ◽

Phakopsora Pachyrhizi ◽

Primary Means ◽

Soybean Plants ◽

Selection Operator ◽

Active Trap

Dispersal of urediniospores by wind is the primary means of spread for Phakopsora pachyrhizi, the cause of soybean rust. Our research focused on the short-distance movement of urediniospores from within the soybean canopy and up to 61 m from field-grown rust-infected soybean plants. Environmental variables were used to develop and compare models, including least absolute shrinkage and selection operator regression, zero-inflated Poisson/regular Poisson regression, random forest, and neural network, to describe deposition of urediniospores collected in passive and active traps. All four models identified distance of trap from source, humidity, temperature, wind direction, and wind speed as the five most important variables influencing short-distance movement of urediniospores. The random forest model provided the best predictions, explaining 76.1 and 86.8% of the total variation in the passive- and active-trap datasets, respectively. The prediction accuracy, based on the correlation coefficient (r) between predicted and true values, was 0.83 (P < 0.0001) and 0.94 (P < 0.0001) for the passive- and active-trap datasets, respectively. Overall, multiple machine learning techniques identified the most important variables for making the most accurate predictions of short-distance movement of P. pachyrhizi urediniospores.
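
Prediction accuracy above is reported as the correlation coefficient (r) between predicted and true values. A minimal sketch of the Pearson correlation follows; the spore counts are invented for illustration, and it is an assumption that the study used the Pearson form of r.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences.

    Assumes both sequences have non-zero variance.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(var_x * var_y)


# Invented trap counts: observed versus model-predicted urediniospores.
observed_counts = [12, 30, 7, 55, 21]
predicted_counts = [15, 27, 9, 50, 25]

print(pearson_r(observed_counts, predicted_counts))
```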

Effects of air quality on the health of Mediterranean forests

10.5194/egusphere-egu21-16171 ◽

2021 ◽

Author(s):

Adrián García Bruzón ◽

Patricia Arrogante Funes ◽

Laura Muñoz Moral

Keyword(s):

Climate Change ◽

Machine Learning ◽

Random Forest ◽

Aridity Index ◽

Plant Health ◽

Mediterranean Forests ◽

Random Forest Algorithm ◽

The Mediterranean ◽

Heterogeneous Variables ◽

Peninsular Spain

Climate change has become a determining factor in the development of forests in Spain. Production systems have emitted polluting gases and other particles into the atmosphere, for which some plants have not yet developed adaptation systems. Among the pollutants most harmful to the environment are nitrogen oxides, ozone, and particulate matter. However, conditions are not the same across Peninsular Spain and the Balearic Islands, since plant composition and bioclimatic, topographic, and anthropic characteristics differ across the territory. Monitoring the vegetation with sufficient spatial and temporal resolution and studying the variables that condition plant health is a challenge because of the nature of the variables and the amount of data to be handled. The Mediterranean forest is one of the ecosystems most affected by climate change because it usually experiences long periods of drought that, in combination with increased temperatures, can drastically reduce the photosynthetic activity of trees and therefore the biomass of forests. This motivates the application of environmental technologies based on Remote Sensing (which provides plant health indices from passive sensors on satellite platforms, along with other variables of interest), Geographic Information Systems (to integrate, process, and analyze spatial and temporal data), and machine learning models (which facilitate the extraction of relationships between variables and conditioning factors and the prediction of patterns). In this regard, the objective of this work is to evaluate the possible effect that different pollutants have on the health of the vegetation, measured from the annual values of the Normalized Difference Vegetation Index (NDVI), in the Mediterranean forests of Peninsular Spain. To achieve this, we used machine learning techniques based on the Random Forest algorithm.
The study also included various climatic, topographic, and anthropic variables that characterize the forest. The results showed that certain variables, such as the aridity index, drove the NDVI values and therefore plant development, while others, such as the concentration of certain pollutants, acted as limiting factors; they also showed a direct relationship between particulates and NOx. This study shows that the Random Forest algorithm offers reliable results, even when working with heterogeneous variables.
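
The target variable in this work is the Normalized Difference Vegetation Index. As a minimal sketch, NDVI is computed per pixel from near-infrared and red reflectance as (NIR - Red) / (NIR + Red); the reflectance values below are invented, and returning 0.0 for a pixel with no signal is a choice made for this illustration, not something taken from the study.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red).

    For non-negative reflectances the result lies in [-1, 1].
    """
    total = nir + red
    if total == 0:
        return 0.0  # illustrative choice for a pixel with no signal
    return (nir - red) / total


# Invented reflectance pairs: dense vegetation, sparse vegetation, bare surface.
for nir, red in [(0.5, 0.1), (0.3, 0.2), (0.2, 0.2)]:
    print(nir, red, round(ndvi(nir, red), 3))
```

Healthy vegetation reflects strongly in the near-infrared and absorbs red light, so higher NDVI values generally indicate denser or more vigorous vegetation.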
