A gradient boosting decision tree algorithm combining synthetic minority oversampling technique for lithology identification

Geophysics ◽  
2020 ◽  
Vol 85 (4) ◽  
pp. WA147-WA158
Author(s):  
Kaibo Zhou ◽  
Jianyu Zhang ◽  
Yusong Ren ◽  
Zhen Huang ◽  
Luanxiao Zhao

Lithology identification based on conventional well-logging data is of great importance for characterizing geologic features and evaluating reservoir quality in the exploration and production development of petroleum reservoirs. However, the traditional lithology identification process has some limitations: (1) building a model is so time consuming that real-time lithology identification during well drilling is not possible, (2) the model must be built by experienced geologists, which consumes a lot of manpower and material resources, and (3) the imbalance of labeled data in well-log data may reduce the classification performance of the model. We have developed a gradient boosting decision tree (GBDT) algorithm combined with the synthetic minority oversampling technique (SMOTE) to realize fast and automatic lithology identification. First, the raw well-log data are normalized with min-max normalization. Then, SMOTE is adopted to balance the number of samples in each class in the training process. Next, a lithology identification model is built by fitting a GBDT to the preprocessed training data set. Finally, the built model is verified with the testing data set. The experimental results indicate that the proposed approach improves the lithology identification performance compared with other machine-learning approaches.
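
The first two preprocessing steps named in this abstract can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the neighbour count `k`, and the sample log values are hypothetical, and the GBDT fitting step is left out because the abstract gives no details about it.

```python
import random

def min_max_normalize(rows):
    """Scale each column of `rows` to [0, 1]; constant columns map to 0.0."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Other minority samples, sorted by squared distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Hypothetical two-log example; the values are made up for illustration.
logs = [[60.0, 2.35], [80.0, 2.45], [120.0, 2.60], [100.0, 2.55]]
scaled = min_max_normalize(logs)
minority = scaled[:3]  # pretend the first three rows belong to the rare class
new_points = smote_like_oversample(minority, n_new=4)
```

Each synthetic point inherits the minority class label, so the class counts can be evened out before a classifier is trained. Oversampling is applied to the training data only, as the abstract states.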

2016 ◽  
Vol 2016 ◽  
pp. 1-9 ◽  
Author(s):  
Abbas Akkasi ◽  
Ekrem Varoğlu ◽  
Nazife Dimililer

Named Entity Recognition (NER) from text constitutes the first step in many text mining applications. The most important preliminary step for NER systems using machine learning approaches is tokenization, where raw text is segmented into tokens. This study proposes an enhanced rule-based tokenizer, ChemTok, which utilizes rules extracted mainly from the training data set. The main novelty of ChemTok is the use of the extracted rules to merge the tokens split in the previous steps, thus producing longer and more discriminative tokens. ChemTok is compared to the tokenization methods utilized by ChemSpot and tmChem. Support Vector Machines and Conditional Random Fields are employed as the learning algorithms. The experimental results show that the classifiers trained on the output of ChemTok outperform all classifiers trained on the output of the other two tokenizers in terms of classification performance and the number of incorrectly segmented entities.
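
The split-then-merge idea can be illustrated with a small sketch. It is not ChemTok: the single merge rule below is hand-written and hypothetical, whereas the abstract says ChemTok extracts its rules mainly from training data. The function names and the `joiners` parameter are also assumptions.

```python
import re

def split_tokens(text):
    """Aggressive first pass: alphanumeric runs or single symbols, with offsets."""
    pattern = r"[A-Za-z0-9]+|[^\sA-Za-z0-9]"
    return [(m.group(), m.start(), m.end()) for m in re.finditer(pattern, text)]

def merge_tokens(tokens, joiners=("-", ",")):
    """Hypothetical rule: <token><joiner><token> written without any
    whitespace in between is merged back into one longer token."""
    merged = []
    i = 0
    while i < len(tokens):
        tok, start, end = tokens[i]
        while (i + 2 < len(tokens)
               and tokens[i + 1][0] in joiners
               and tokens[i + 1][1] == end               # joiner touches left token
               and tokens[i + 2][1] == tokens[i + 1][2]  # right token touches joiner
               and tokens[i + 2][0].isalnum()):
            tok += tokens[i + 1][0] + tokens[i + 2][0]
            end = tokens[i + 2][2]
            i += 2
        merged.append((tok, start, end))
        i += 1
    return merged

text = "1,2-dichloroethane binds ( weakly )"
tokens = [tok for tok, _, _ in merge_tokens(split_tokens(text))]
```

Keeping character offsets lets the merge step tell a chemical name such as `1,2-dichloroethane` apart from ordinary punctuation followed by a space, which stays split.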


PeerJ ◽  
2017 ◽  
Vol 5 ◽  
pp. e3092 ◽  
Author(s):  
Shih-Hsiung Liang ◽  
Bruno Andreas Walther ◽  
Bao-Sen Shieh

Background: Biological invasions have become a major threat to biodiversity, and identifying determinants underlying success at different stages of the invasion process is essential for both prevention management and testing ecological theories. To investigate variables associated with different stages of the invasion process in a local region such as Taiwan, potential problems using traditional parametric analyses include too many variables of different data types (nominal, ordinal, and interval) and a relatively small data set with too many missing values.

Methods: We therefore used five decision tree models instead and compared their performance. Our dataset contains 283 exotic bird species which were transported to Taiwan; of these 283 species, 95 species escaped to the field successfully (introduction success); of these 95 introduced species, 36 species reproduced in the field of Taiwan successfully (establishment success). For each species, we collected 22 variables associated with human selectivity and species traits which may determine success during the introduction stage and establishment stage. For each decision tree model, we performed three variable treatments: (I) including all 22 variables, (II) excluding nominal variables, and (III) excluding nominal variables and replacing ordinal values with binary ones. Five performance measures were used to compare models, namely, area under the receiver operating characteristic curve (AUROC), specificity, precision, recall, and accuracy.

Results: The gradient boosting models performed best overall among the five decision tree models for both introduction and establishment success and across variable treatments. The most important variables for predicting introduction success were the bird family, the number of invaded countries, and variables associated with environmental adaptation, whereas the most important variables for predicting establishment success were the number of invaded countries and variables associated with reproduction.

Discussion: Our final optimal models achieved relatively high performance values, and we discuss differences in performance with regard to sample size and variable treatments. Our results showed that, for both the establishment model and introduction model, the number of invaded countries was the most important or second most important determinant, respectively. Therefore, we suggest that future success for introduction and establishment of exotic birds may be gauged by simply looking at previous success in invading other countries. Finally, we found that species traits related to reproduction were more important in establishment models than in introduction models; importantly, these determinants were not averaged but either minimum or maximum values of species traits. Therefore, we suggest that in addition to averaged values, reproductive potential represented by minimum and maximum values of species traits should be considered in invasion studies.
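
The five performance measures listed in the methods have standard definitions, sketched below for a binary outcome. The function names, labels, and scores are hypothetical illustration data, not values from the study. AUROC is computed here by its rank interpretation: the probability that a randomly chosen positive case scores higher than a randomly chosen negative one, with ties counted as half.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def classification_measures(y_true, y_pred):
    """Accuracy, precision, recall and specificity from the confusion counts."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }

def auroc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up example: 1 = success, 0 = failure.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 cut-off is an assumption
measures = classification_measures(y_true, y_pred)
area = auroc(y_true, scores)
```

Reporting specificity and precision alongside accuracy matters when classes are unequal, as with 36 established species out of 95 introduced ones.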


Author(s):  
S. Prasanthi ◽  
S. Durga Bhavani ◽  
T. Sobha Rani ◽  
Raju S. Bapi

The vast majority of successful drugs or inhibitors achieve their activity by binding to, and modifying the activity of, a protein, leading to the concept of druggability. A target protein is druggable if it has the potential to bind drug-like molecules. Hence kinase inhibitors need to be studied to understand the specificity of a kinase inhibitor in choosing a particular kinase target. In this paper we focus on human kinase drug target sequences, since kinases are known to be potential drug targets. We also carry out a preliminary analysis of kinase inhibitors in order to study the problem in the protein-ligand space in the future. The identification of druggable kinases is treated as a classification problem in which druggable kinases are taken as the positive data set and non-druggable kinases are chosen as the negative data set. The classification problem is addressed using machine learning techniques such as support vector machine (SVM) and decision tree (DT) with sequence-specific features. One of the challenges of this classification problem is the unbalanced data, with only 48 druggable kinases available against 509 non-druggable kinases present in UniProt. The accuracy obtained by the decision tree classifier is 57.65, which is not satisfactory. A two-tier architecture of decision trees is carefully designed such that recognition on the non-druggable data set also improves. The overall model is thus shown to achieve a final accuracy of 88.37. To the best of our knowledge, kinase druggability prediction using machine learning approaches has not been reported in the literature.


2020 ◽  
pp. 865-874
Author(s):  
Enrico Santus ◽  
Tal Schuster ◽  
Amir M. Tahmasebi ◽  
Clara Li ◽  
Adam Yala ◽  
...  

PURPOSE: Literature on clinical note mining has highlighted the superiority of machine learning (ML) over hand-crafted rules. Nevertheless, most studies assume the availability of large training sets, which is rarely the case. For this reason, in the clinical setting, rules are still common. We suggest 2 methods to leverage the knowledge encoded in pre-existing rules to inform ML decisions and obtain high performance, even with scarce annotations.

METHODS: We collected 501 prostate pathology reports from 6 American hospitals. Reports were split into 2,711 core segments, annotated with 20 attributes describing the histology, grade, extension, and location of tumors. The data set was split by institutions to generate a cross-institutional evaluation setting. We assessed 4 systems, namely a rule-based approach, an ML model, and 2 hybrid systems integrating the previous methods: a Rule as Feature model and a Classifier Confidence model. Several ML algorithms were tested, including logistic regression (LR), support vector machine (SVM), and eXtreme gradient boosting (XGB).

RESULTS: When training on data from a single institution, LR lags behind the rules by 3.5% (F1 score: 92.2% v 95.7%). Hybrid models, instead, obtain competitive results, with Classifier Confidence outperforming the rules by +0.5% (96.2%). When a larger amount of data from multiple institutions is used, LR improves by +1.5% over the rules (97.2%), whereas hybrid systems obtain +2.2% for Rule as Feature (97.7%) and +2.6% for Classifier Confidence (98.3%). Replacing LR with SVM or XGB yielded similar performance gains.

CONCLUSION: We developed methods to use pre-existing handcrafted rules to inform ML algorithms. These hybrid systems obtain better performance than either rules or ML models alone, even when training data are limited.
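
The abstract names the two hybrid systems without describing their mechanics, so the sketch below shows only one plausible reading of each name. Everything in it is an assumption: the keyword rule, the 0.8 confidence threshold, the feature values, and the made-up report segment. In this reading, Rule as Feature appends the rule's output to the input of the ML model, and Classifier Confidence keeps the ML prediction only when the model is confident, falling back to the rule otherwise.

```python
def keyword_rule(text):
    """Hypothetical hand-crafted rule: 1 if the segment mentions adenocarcinoma."""
    return 1 if "adenocarcinoma" in text.lower() else 0

def rule_as_feature(features, rule_output):
    """Assumed 'Rule as Feature': extend the ML feature vector with the rule output."""
    return list(features) + [float(rule_output)]

def classifier_confidence(ml_label, ml_probability, rule_label, threshold=0.8):
    """Assumed 'Classifier Confidence': trust the ML label only above a threshold."""
    return ml_label if ml_probability >= threshold else rule_label

# Made-up segment and made-up model outputs, for illustration only.
segment = "Prostatic adenocarcinoma, Gleason score 3+4."
rule_label = keyword_rule(segment)
augmented = rule_as_feature([0.2, 0.7], rule_label)
final_label = classifier_confidence(ml_label=0, ml_probability=0.55,
                                    rule_label=rule_label)
```

Both variants let a model trained on few annotations lean on knowledge already encoded in the rules, which is the motivation the abstract gives for hybrid systems.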


2021 ◽  
Author(s):  
Antoine Bouziat ◽  
Sylvain Desroziers ◽  
Abdoulaye Koroko ◽  
Antoine Lechevallier ◽  
Mathieu Feraille ◽  
...  

Automation and robotics are attracting growing interest in the mining industry. If not already a reality, it is no longer science fiction to imagine autonomous robots routinely participating in the exploration and extraction of mineral raw materials in the near future. Among the various scientific and technical issues to be addressed towards this objective, this study focuses on automating the real-time characterisation of rock images captured in the field, either to discriminate rock types and mineral species or to detect small elements such as mineral grains or metallic nuggets. To do so, we investigate the potential of methods from the Computer Vision community, a subfield of Artificial Intelligence dedicated to image processing. In particular, we aim to assess the potential of Deep Learning approaches and convolutional neural networks (CNN) for the analysis of pictures of field samples, highlighting key challenges before industrial use in operational contexts.

In a first initiative, we appraise Deep Learning methods to classify photographs of macroscopic rock samples into 12 lithological families. Using the architecture of a reference CNN and a collection of 2700 images, we achieve a prediction accuracy above 90% for new pictures of good photographic quality. Nonetheless, we then seek to improve the robustness of the method for on-the-fly field photographs. To do so, we train an additional CNN to automatically separate the rock sample from the background, with a detection algorithm. We also introduce a more sophisticated classification method combining a set of several CNN with a decision tree. The CNN are specifically trained to recognise petrological features such as textures, structures or mineral species, while the decision tree mimics the naturalist methodology for lithological identification.

In a second initiative, we evaluate Deep Learning techniques to spot and delineate specific elements in finer-scale images. We use a data set of carbonate thin sections with various species of microfossils. The data come from a sedimentology study, but analogies can be drawn with igneous geology use cases. We train four state-of-the-art Deep Learning methods for object detection with a limited data set of 15 annotated images. The results on 130 other thin section images are then qualitatively assessed by expert geologists, and precision and inference times are quantitatively measured. The four models show good capabilities in detecting and categorising the microfossils. However, differences in accuracy and performance are underlined, leading to recommendations for comparable projects in a mining context.

Altogether, this study illustrates the power of Computer Vision and Deep Learning approaches to automate rock image analysis. However, to make the most of these technologies in mining activities, stimulating research opportunities lie in adapting the algorithms to the geological use cases, embedding as much geological knowledge as possible in the statistical models, and reducing the amount of training data that must be manually interpreted beforehand.


2021 ◽  
Author(s):  
Changming Zhao ◽  
Dongrui Wu ◽  
Jian Huang ◽  
Ye Yuan ◽  
Hai-Tao Zhang ◽  
...  

Bootstrap aggregating (Bagging) and boosting are two popular ensemble learning approaches, which combine multiple base learners to generate a composite model for more accurate and more reliable performance. They have been widely used in biology, engineering, healthcare, etc. This article proposes BoostForest, which is an ensemble learning approach using BoostTree as base learners and can be used for both classification and regression. BoostTree constructs a tree model by gradient boosting. It achieves high randomness (diversity) by sampling its parameters randomly from a parameter pool, and selecting a subset of features randomly at node splitting. BoostForest further increases the randomness by bootstrapping the training data in constructing different BoostTrees. BoostForest outperformed four classical ensemble learning approaches (Random Forest, Extra-Trees, XGBoost and LightGBM) on 34 classification and regression datasets. Remarkably, BoostForest has only one hyper-parameter (the number of BoostTrees), which can be easily specified. Our code is publicly available, and the proposed ensemble learning framework can also be used to combine many other base learners.
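
Two sources of randomness described here, bootstrapping the training data and drawing each learner's parameters from a pool, can be sketched independently of the base learner. The code below is not BoostForest: the base learner is a stand-in one-split regression stump rather than a BoostTree, random feature selection at node splits is omitted, and the pool contents, function names, and toy data are hypothetical.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement (the bootstrap step of bagging)."""
    return [rng.choice(data) for _ in data]

def sample_parameters(pool, rng):
    """Pick one candidate value per parameter at random from the pool."""
    return {name: rng.choice(values) for name, values in pool.items()}

def train_stump(sample, params):
    """Stand-in base learner (NOT a BoostTree): a one-split regression stump."""
    t = params["threshold"]
    overall = sum(y for _, y in sample) / len(sample)
    left = [y for x, y in sample if x <= t]
    right = [y for x, y in sample if x > t]
    left_mean = sum(left) / len(left) if left else overall
    right_mean = sum(right) / len(right) if right else overall
    return lambda x: left_mean if x <= t else right_mean

def build_ensemble(data, pool, n_learners, train_fn, seed=0):
    """Train each learner on its own bootstrap sample and parameter draw."""
    rng = random.Random(seed)
    return [train_fn(bootstrap_sample(data, rng), sample_parameters(pool, rng))
            for _ in range(n_learners)]

def predict_average(models, x):
    """Combine the learners by averaging their predictions (regression case)."""
    return sum(m(x) for m in models) / len(models)

# Toy step function: y = 0 for x < 5 and y = 10 otherwise.
data = [(float(x), 0.0 if x < 5 else 10.0) for x in range(10)]
pool = {"threshold": [3.5, 4.5, 5.5]}  # hypothetical parameter pool
models = build_ensemble(data, pool, n_learners=25, train_fn=train_stump)
low, high = predict_average(models, 0.0), predict_average(models, 9.0)
```

In this sketch the number of learners is the only setting the caller must choose, which mirrors the single hyper-parameter the abstract highlights.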


Author(s):  
C. Zhang ◽  
X. Pan ◽  
S. Q. Zhang ◽  
H. P. Li ◽  
P. M. Atkinson

Recent advances in remote sensing have produced a great number of very high resolution (VHR) images acquired at sub-metre spatial resolution. These VHR remotely sensed data have posed enormous challenges for effective processing, analysis and classification due to their high spatial complexity and heterogeneity. Although many computer-aided classification methods based on machine learning approaches have been developed over the past decades, most of them, e.g. the Multi-Layer Perceptron (MLP), are designed for pixel-level spectral differentiation and are unable to exploit the abundant spatial details within VHR images.

This paper introduced a rough set model as a general framework to objectively characterize the uncertainty in CNN classification results, and to further partition them into correct and incorrect regions on the map. The correct classification regions of CNN were trusted and maintained, whereas the misclassification areas were reclassified using a decision tree with both CNN and MLP. The effectiveness of the proposed rough set decision tree based MLP-CNN was tested using an urban area at Bournemouth, United Kingdom. The MLP-CNN, which captures the complementarity between CNN and MLP well through the rough set based decision tree, achieved the best classification performance both visually and numerically. Therefore, this research paves the way to achieve fully automatic and effective VHR image classification.


Author(s):  
Gebreab K. Zewdie ◽  
David J. Lary ◽  
Estelle Levetin ◽  
Gemechu F. Garuma

Allergies to airborne pollen are a significant issue affecting millions of Americans. Consequently, accurately predicting the daily concentration of airborne pollen is of significant public benefit in providing timely alerts. This study presents a method for the robust estimation of the concentration of airborne Ambrosia pollen using a suite of machine learning approaches including deep learning and ensemble learners. Each of these machine learning approaches utilizes data from the European Centre for Medium-Range Weather Forecasts (ECMWF) atmospheric weather and land surface reanalysis. The machine learning approaches used to develop the suite of empirical predictive models are deep neural networks, extreme gradient boosting, random forests and Bayesian ridge regression. The training data, comprising twenty-four years of daily pollen concentration measurements together with ECMWF weather and land surface reanalysis data from 1987 to 2011, are used to develop the machine learning predictive models. The last six years of the dataset, from 2012 to 2017, are used to independently test the performance of the machine learning models. The correlation coefficients between the estimated and actual pollen abundance for the independent validation datasets for the deep neural networks, random forest, extreme gradient boosting and Bayesian ridge were 0.82, 0.81, 0.81 and 0.75 respectively, showing that machine learning can be used to effectively forecast the concentrations of airborne pollen.
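
The evaluation protocol described here, training on earlier years and testing on later ones, then scoring with a correlation coefficient, can be sketched as follows. The abstract does not say which correlation coefficient was used, so Pearson's r is an assumption. The record layout, function names, and pollen values are hypothetical.

```python
import math

def split_by_year(records, last_training_year):
    """Chronological split: years up to the cut-off train, later years test."""
    train = [r for r in records if r["year"] <= last_training_year]
    test = [r for r in records if r["year"] > last_training_year]
    return train, test

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Made-up records; only the 2011/2012 boundary comes from the abstract.
records = [
    {"year": 1987, "pollen": 12.0},
    {"year": 1999, "pollen": 30.0},
    {"year": 2011, "pollen": 18.0},
    {"year": 2012, "pollen": 25.0},
    {"year": 2017, "pollen": 9.0},
]
train, test = split_by_year(records, last_training_year=2011)

observed = [10.0, 20.0, 30.0, 40.0]   # hypothetical measurements
estimated = [12.0, 18.0, 33.0, 39.0]  # hypothetical model output
r = pearson_r(observed, estimated)
```

Splitting by year rather than at random keeps future observations out of the training set, so the test score reflects forecasting on unseen seasons.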


2020 ◽  
Vol 8 (4) ◽  
pp. T1057-T1069
Author(s):  
Ritesh Kumar Sharma ◽  
Satinder Chopra ◽  
Larry Lines

The discrimination of fluid content and lithology in a reservoir is important because it has a bearing on reservoir development and its management. Among other things, rock-physics analysis is usually carried out to distinguish between the lithology and fluid components of a reservoir by way of estimating the volume of clay, water saturation, and porosity using seismic data. Although these rock-physics parameters are easy to compute for conventional plays, there are many uncertainties in their estimation for unconventional plays, especially where multiple zones need to be characterized simultaneously. We have evaluated such uncertainties with reference to a data set from the Delaware Basin where the Bone Spring, Wolfcamp, Barnett, and Mississippian Formations are the prospective zones. Seismic reservoir characterization of these formations was begun in Part 1 of this paper, which covered the geologic background of the area of study, the preconditioning of prestack seismic data, well-log correlation, accounting for the temporal and lateral variation in the seismic wavelets, and building a robust low-frequency model for prestack simultaneous impedance inversion. We identify the challenges and the uncertainty in the characterization of the Bone Spring, Wolfcamp, Barnett, and Mississippian sections and explain how we overcame them. In the light of these uncertainties, we conclude that any deterministic approach for characterization of the target formations of interest may not be appropriate and we build a case for adopting a robust statistical approach. Making use of neutron porosity and density porosity well-log data in the formations of interest, we show how the type of shale, volume of shale, effective porosity, and lithoclassification can be determined. Using the available log data, multimineral analysis was also carried out using a nonlinear optimization approach, which lent support to our facies classification. 
We then extend this exercise to derived seismic attributes for determination of the lithofacies volumes and their probabilities, together with their correlations with the facies information derived from mud log data.

