Data mining based investigation of the impact of imbalanced dataset over fractured zone detection

2021 ◽  
Vol 10 (2) ◽  
pp. 116
Author(s):  
Haleh Azizi ◽  
Hassan Reza

Several studies have been conducted in recent years to discriminate between fractured zones (FZs) and non-fractured zones (NFZs) in oil wells. These studies have applied data mining techniques to petrophysical logs (PLs) with generally valuable results; however, identifying fractured and non-fractured zones is difficult because imbalanced data are typically not balanced before analysis. We studied the importance of using balanced data to detect fractured zones using PLs. We used Random Forest and Support Vector Machine classifiers on eight oil wells drilled into a fractured carbonate reservoir to study PLs with imbalanced and balanced datasets, then validated our results with image logs. A significant difference between accuracy and precision indicates imbalanced data, with fractured zones forming the minority class. The results indicated that the accuracy of imbalanced and balanced datasets is similar, but precision is significantly improved by balancing, regardless of how low or high the calculated indices might be.
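
The gap between accuracy and precision described above can be illustrated with a small sketch. The labels below are hypothetical toy values (100 depth samples, 5 of them fractured), not data from the study, and the helper names are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def precision(y_true, y_pred):
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    # Precision is undefined when nothing is predicted positive; 0.0 is an assumption.
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical imbalanced toy data: 5 fractured (1) and 95 non-fractured (0) samples.
y_true = [1] * 5 + [0] * 95
# A hypothetical classifier that finds 2 of the 5 fractured samples
# and raises 6 false alarms among the non-fractured ones.
y_pred = [1, 1, 0, 0, 0] + [1] * 6 + [0] * 89

print(f"accuracy  = {accuracy(y_true, y_pred):.2f}")
print(f"precision = {precision(y_true, y_pred):.2f}")
```

Here accuracy is 0.91 while precision is only 0.25: the many correctly rejected majority-class samples dominate accuracy and hide poor performance on the minority class.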

Author(s):  
Ghulam Fatima ◽  
Sana Saeed

In the data mining community, data sets with imbalanced class distributions have received growing attention. The evolving field of data mining and knowledge discovery seeks to establish precise and effective computational tools for analyzing such data sets and extracting new information from them. Sampling methods re-balance imbalanced data sets and consequently improve the performance of classifiers. When classifying imbalanced data sets, over-fitting and under-fitting are the two most prominent problems. In this study, a novel weighted ensemble method is proposed to reduce the influence of over-fitting and under-fitting when classifying such data sets. Forty imbalanced data sets with varying imbalance ratios are used in a comparative study. The performance of the proposed method is compared with four conventional classifiers: decision tree (DT), k-nearest neighbor (KNN), support vector machine (SVM), and neural network (NN). The evaluation is carried out with two over-sampling procedures, the adaptive synthetic sampling approach (ADASYN) and the synthetic minority over-sampling technique (SMOTE). The proposed scheme was effective in reducing the impact of over-fitting and under-fitting on the classification of these data sets.
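
The over-sampling idea mentioned above can be sketched as follows. This is a simplified, SMOTE-style interpolation written for illustration only; it is not the authors' weighted ensemble method or a full SMOTE or ADASYN implementation, and the function name, parameters, and toy points are assumptions.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority points by interpolating between a
    randomly chosen minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base` (squared Euclidean distance),
        # excluding the base point itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Hypothetical 2-D minority-class samples.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(minority, n_new=6)
print(len(minority) + len(new_points), "minority samples after over-sampling")
```

Each synthetic point lies on the segment between two existing minority points, so the minority class grows without exact duplication of samples.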


2021 ◽  
Vol 1 (2) ◽  
pp. 19-24
Author(s):  
Halbast Rashid Ismael ◽  
Adnan Mohsin Abdulazeez ◽  
Dathar A. Hasan

The importance of agriculture is not restricted to our daily life; it is also a field that drives economic growth in any country. Therefore, improving the quality of crop yields using recent technologies is crucial to obtaining competitive crops. Nowadays, data mining is an emerging research field in agriculture, especially in the prediction and analysis of crop yield. This paper focuses on using various data mining classification algorithms to predict the impact of parameters such as area, season and production on crop yield quality. The performance of the decision tree, naive Bayes, random forest, support vector machine and K-nearest neighbour classifiers is measured and compared. The comparison involves measuring error values and accuracy. The SVM algorithm achieved the highest accuracy, 76.82%, while the KNN algorithm achieved the lowest, 35.76%. The highest error value was 111.8855, for KNN. The predictions can also help farmers increase and improve their income.


2011 ◽  
Vol 84-85 ◽  
pp. 405-409
Author(s):  
Wei He ◽  
Jie Xiong

Potential knowledge useful for traffic management optimization is hidden in a huge amount of data. Previous works use prior data pattern labels to train an artificial neural network and obtain intelligent data mining models. The performance of these models is limited by its dependence on the experts’ experience. To relieve the impact of the human factor, a new hybrid intelligent data mining model is proposed in this work based on the self-organizing map (SOM) and the support vector machine (SVM). The SOM was first used to capture the clustering information of the database in an unsupervised manner. The identified samples were then used as input to train the SVM. To optimize the SVM model, the particle swarm optimization (PSO) algorithm was employed to tune the SVM parameters, yielding a satisfactory SVM data mining model. 2000 practical data sets from the Intelligent Transportation Systems (ITS) were used to validate the proposed mining model. The analysis results show that the proposed method can extract the underlying rules of the testing data and can predict the future traffic state with an accuracy above 97%. Hence, the new SOM-PSO-SVM data mining model can provide practical application for the ITS.


2016 ◽  
Vol 26 (09n10) ◽  
pp. 1571-1580 ◽  
Author(s):  
Ming Cheng ◽  
Guoqing Wu ◽  
Hongyan Wan ◽  
Guoan You ◽  
Mengting Yuan ◽  
...  

Cross-project defect prediction trains a prediction model using historical data from source projects and applies the model to target projects. Most previous efforts assumed that the cross-project data have the same metric set, meaning that the metrics used and the size of the metric set are the same. However, this assumption may not hold in practical scenarios. In addition, software defect datasets suffer from the class-imbalance problem, which makes it harder for the learner to predict defects. In this paper, we advance canonical correlation analysis by deriving a joint feature space for associating cross-project data. We also propose a novel support vector machine algorithm which incorporates the correlation transfer information into classifier design for cross-project prediction. Moreover, we take different misclassification costs into consideration to bias the classifier toward labeling a module as defective, alleviating the impact of imbalanced data. The experimental results show that our method is more effective than state-of-the-art methods.
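
The effect of unequal misclassification costs can be sketched with a generic cost-sensitive decision rule. This is not the authors' SVM formulation; it is the standard expected-cost threshold applied to a predicted defect probability, and the cost values and function names are assumptions chosen for illustration.

```python
def cost_sensitive_threshold(cost_fp, cost_fn):
    """Probability threshold that minimizes expected cost.

    Predicting "defective" is cheaper than predicting "clean" when
    (1 - p) * cost_fp <= p * cost_fn, i.e. p >= cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)


def classify(prob_defective, cost_fp=1.0, cost_fn=1.0):
    """Label a module as defective (True) or clean (False)."""
    return prob_defective >= cost_sensitive_threshold(cost_fp, cost_fn)


# Hypothetical predicted defect probabilities for three modules.
probs = [0.10, 0.25, 0.60]

equal_costs = [classify(p) for p in probs]
# Assumption: missing a defect is four times as costly as a false alarm.
skewed_costs = [classify(p, cost_fp=1.0, cost_fn=4.0) for p in probs]

print(equal_costs)
print(skewed_costs)
```

With equal costs the threshold is 0.5 and only the third module is flagged. Raising the cost of a missed defect lowers the threshold to 0.2, so the second module is flagged as well, which is the sense in which cost-sensitive learning inclines the classifier toward the defective class.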


2021 ◽  
Author(s):  
Lei Feng ◽  
Xiangni Tian ◽  
Yousry A. El-Kassaby ◽  
Jian Qiu ◽  
Ze Feng ◽  
...  

Abstract Background: Melia azedarach L. is a globally distributed tree species of economic importance; however, it is unclear how the species distribution will respond to future climate changes. Methods: We aimed to select the most accurate one among seven data mining models to predict the species suitable contemporary and future habitats. These models include: maximum entropy (MaxEnt), support vector machine (SVM), generalized linear model (GLM), random forest (RF), naive bayesian model (NBM), extreme gradient boosting (XGBoost), and gradient boosting machine (GBM). A total of 906 M. azedarach locations were identified, and sixteen climate predictors were used for model building. The models’ validity was assessed using three measures (Area Under the Curves (AUC), kappa, and accuracy). Results: We found that the RF provided the most outstanding performance in prediction power and generalization capacity. The top climate factors affecting the species distribution were mean coldest month temperature (MCMT), followed by the number of frost-free days (NFFD), degree-days above 18°C (DD>18), temperature difference between MWMT and MCMT, or continentality (TD), mean annual precipitation (MAP), and degree-days below 18°C (DD<18). We projected that future suitable habitat of this species would increase under both the RCP4.5 and RCP8.5 scenarios for the 2020s, 2050s, and 2080s. Conclusion: Our findings are expected to assist in better understanding the impact of climate change on the species and provide scientific basis for its planting and conservation.
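
Two of the three validity measures named above, accuracy and Cohen's kappa, can be computed as in the following sketch. The presence/absence labels are hypothetical toy values, not results from the study, and AUC is left out because it needs predicted scores rather than hard labels.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the observed labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement corrected for agreement expected by chance."""
    n = len(y_true)
    p_observed = accuracy(y_true, y_pred)
    labels = set(y_true) | set(y_pred)
    # Chance agreement from the marginal label frequencies.
    p_expected = sum(
        (y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels
    )
    if p_expected == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical presence (1) / absence (0) observations and model predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

print(f"accuracy = {accuracy(y_true, y_pred):.2f}")
print(f"kappa    = {cohen_kappa(y_true, y_pred):.2f}")
```

In this toy case accuracy is 0.80 but kappa is about 0.58, because kappa discounts the 0.52 agreement that the label frequencies alone would produce by chance.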


2020 ◽  
Vol 12 (14) ◽  
pp. 2319 ◽  
Author(s):  
Joana Cardoso-Fernandes ◽  
Ana C. Teodoro ◽  
Alexandre Lima ◽  
Encarnación Roda-Robles

Machine learning (ML) algorithms have shown great performance in geological remote sensing applications. The study area of this work was the Fregeneda–Almendra region (Spain–Portugal) where the support vector machine (SVM) was employed. Lithium (Li)-pegmatite exploration using satellite data presents some challenges since pegmatites are, by nature, small, narrow bodies. Consequently, the following objectives were defined: (i) train several SVMs on Sentinel-2 images with different parameters to find the optimal model; (ii) assess the impact of imbalanced data; (iii) develop a successful methodological approach to delineate target areas for Li-exploration. Parameter optimization and model evaluation were accomplished by a two-staged grid-search with cross-validation. Several new methodological advances were proposed, including a region of interest (ROI)-based splitting strategy to create the training and test subsets, a semi-automatization of the classification process, and the application of a more innovative and adequate metric score to choose the best model. The proposed methodology obtained good results, identifying known Li-pegmatite occurrences as well as other target areas for Li-exploration. Also, the results showed that the class imbalance had a negative impact on the SVM performance since known Li-pegmatite occurrences were not identified. The potentials and limitations of the methodology proposed are highlighted and its applicability to other case studies is discussed.
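
One plausible reading of an ROI-based splitting strategy is that all pixels from a given region of interest go to either the training or the test subset, never both. The sketch below illustrates that idea only; the abstract does not give the authors' exact procedure, so the function, its parameters, and the toy samples are assumptions.

```python
import random


def roi_based_split(samples, test_fraction=0.3, seed=0):
    """Split (roi_id, features, label) samples so that each ROI ends up
    entirely in the training subset or entirely in the test subset."""
    roi_ids = sorted({roi for roi, _, _ in samples})
    rng = random.Random(seed)
    rng.shuffle(roi_ids)
    # Hold out at least one ROI for testing.
    n_test = max(1, round(len(roi_ids) * test_fraction))
    test_rois = set(roi_ids[:n_test])
    train = [s for s in samples if s[0] not in test_rois]
    test = [s for s in samples if s[0] in test_rois]
    return train, test


# Hypothetical pixels: (ROI identifier, two band values, class label).
samples = [
    ("roi_a", (0.21, 0.35), 1), ("roi_a", (0.22, 0.33), 1),
    ("roi_b", (0.60, 0.10), 0), ("roi_b", (0.58, 0.12), 0),
    ("roi_c", (0.25, 0.30), 1), ("roi_c", (0.24, 0.31), 1),
    ("roi_d", (0.55, 0.15), 0), ("roi_d", (0.57, 0.14), 0),
]

train, test = roi_based_split(samples)
print(len(train), "training pixels,", len(test), "test pixels")
```

Splitting by ROI rather than by pixel avoids placing neighbouring, highly correlated pixels on both sides of the split, which would otherwise inflate the test scores.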


2020 ◽  
Author(s):  
Atbin Mahabbati ◽  
Jason Beringer ◽  
Matthias Leopold ◽  
Ian McHugh ◽  
James Cleverly ◽  
...  

Abstract. The errors and uncertainties associated with gap-filling algorithms for water, carbon and energy flux data have always been one of the prominent challenges of the global network of microclimatological tower sites that use the eddy covariance (EC) technique. To address this concern and find more efficient gap-filling algorithms, we reviewed eight algorithms to estimate missing values of environmental drivers and, separately, three major fluxes in EC time series. We then examined the performance of these algorithms for different gap-filling scenarios using data from five OzFlux Network towers during 2013. The objectives of this research were (a) to evaluate the impact of training and testing window lengths on the performance of each algorithm; (b) to compare the performance of traditional and new gap-filling techniques for the EC data, for fluxes and their corresponding meteorological drivers. The performance of the algorithms was evaluated by generating nine different training-testing window lengths, ranging from a day to 365 days. In each scenario, the gaps covered the data for the entirety of 2013 by consecutively repeating them, where, in each step, values were modelled by using earlier window data. After running each scenario, a variety of statistical metrics was used to evaluate the performance of the algorithms. The algorithms showed different levels of sensitivity to training-testing windows; the Prophet Forecast Model (FBP) showed the most sensitivity, whilst the performance of artificial neural networks (ANNs), for instance, did not vary considerably with window length. The performance of the algorithms generally decreased with increasing training-testing window length, yet the differences were not considerable for windows shorter than 60 days.
Gap-filling of the environmental drivers showed no significant difference amongst the algorithms; the linear algorithms showed slight superiority over the machine learning (ML) algorithms, except for the random forest algorithm estimating the ground heat flux (RMSEs of 30.17 and 34.93 for RF and CLR respectively). For the major fluxes, though, ML algorithms showed superiority (9 % less RMSE on average), except for Support Vector Regression (SVR), which showed significant bias in its estimations. Even though ANNs, random forest (RF) and extreme gradient boost (XGB) showed close performance in gap-filling of the major fluxes, RF provided more consistent results with relatively less bias. The results indicated that no single algorithm outperforms the others in all situations, but RF is a potential alternative to ANNs for flux gap-filling.


2019 ◽  
Vol 2 (1) ◽  
pp. 60-79 ◽  
Author(s):  
Jessica Cook ◽  
Cuixian Chen ◽  
Angelia Griffin

In a society where first-hand work experience is greatly valued, many universities and institutions of higher education have designed their Quality Enhancement Plan (QEP) to address student applied learning. This paper presents the results of a university’s QEP, called Experiencing Transformative Education Through Applied Learning (ETEAL). It highlights the research conducted using text mining and data mining techniques to analyze a dataset of 672 student evaluations collected from 40 different applied learning courses from fall 2013 to spring 2015, in order to evaluate the impact on instructional practice and student learning. Text mining techniques are applied through the NVivo text mining software to find the 100 most frequent terms and create a document-term matrix in Excel. Then, the document-term matrix is merged with the manual interpretation scores to create the applied learning assessment data. Lastly, data mining techniques are applied and their performance evaluated, including Random Forest, K-nearest neighbors, and Support Vector Machines (with linear and radial kernels), using 5-fold cross-validation. Our results show that the proposed text mining and data mining approach can provide prediction rates of around 67% to 85%, while the decision fusion approach can improve them to around 69% to 86%. Our study demonstrates that automatic quantitative analysis of student evaluations can be an effective approach to applied learning assessment.
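
The step of turning the most frequent terms into a document-term matrix can be sketched in a few lines. The study used NVivo and Excel; the Python version below is only an illustrative equivalent, and the tokenizer, function names, and example evaluations are assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def document_term_matrix(docs, top_n=100):
    """Return (vocab, rows): the top_n most frequent terms across all
    documents and, for each document, the count of every vocabulary term."""
    tokenized = [tokenize(doc) for doc in docs]
    totals = Counter(token for doc in tokenized for token in doc)
    vocab = [term for term, _ in totals.most_common(top_n)]
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append([counts[term] for term in vocab])
    return vocab, rows


# Hypothetical student evaluations.
docs = [
    "Applied learning helped my learning",
    "The project was applied learning work",
]

vocab, rows = document_term_matrix(docs, top_n=2)
print(vocab)
for row in rows:
    print(row)
```

Each row can then be joined with that evaluation's manually assigned score to form the labeled data set on which the classifiers are trained.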


Author(s):  
Acharoui Zakia ◽  
Ettaki Badia ◽  
Zerouaoui Jamal

People spend more time on social media, for either personal or social interest, which generates an expanding amount of data. This paper is written for researchers seeking an overview of the technical methods used for political purposes, principally Data Mining and Social Network Analysis. The first part introduces the impact of social media on politics for different aims such as communicating with voters, promoting participation, and predicting election results; the two main methods used to achieve political purposes are then presented. Data mining approaches are used in a political context to classify citizens’ opinions or predict results, using methods such as term occurrence, mentions, Support Vector Machine, Machine Learning, and Artificial Neural Networks. Social Network Analysis approaches are used to retrieve data about influencers, their role during a period, and the nature of the information shared.


2021 ◽  
Vol 39 (11) ◽  
pp. 1331-1340
Author(s):  
Janaína Lopes Dias ◽  
Michele Kremer Sott ◽  
Caroline Cipolatto Ferrão ◽  
João Carlos Furtado ◽  
Jorge André Ribas Moraes

The processes related to solid waste management (SWM) are being revised as new technologies emerge and are applied in the area to achieve greater environmental, social and economic sustainability for society. To achieve our goal, two robust review protocols (Population, Intervention, Comparison, Outcome, and Context (PICOC) and Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA)) were used to systematically analyze 62 documents extracted from the Web of Science database to identify the main techniques and tools for Knowledge Discovery in Databases (KDD) and Data Mining (DM) as applied to SWM and explore the technological potential to optimize the stages of collecting and transporting waste. Moreover, it was possible to analyze the main challenges and opportunities of KDD and DM for SWM. The results show that the most used tools for SWM are MATLAB (29.7%) and GIS (13.5%), whereas the most used techniques are Artificial Neural Networks (35.8%), Linear Regression (16.0%) and Support Vector Machine (12.3%). In addition, 15.3% of the studies were conducted with data from China, 11.1% from India and 9.7% of the studies analyzed and compared data from several other countries. Furthermore, the research showed that the main challenges in the field of study are related to the collection and treatment of data, whereas the opportunities appear to be linked mainly to the impact on the pillars of sustainable development. Thus, this study portrays important issues associated with the use of KDD and DM for optimal SWM and has the potential to assist and direct researchers and field professionals in future studies.

