Improving Logging Prediction on Imbalanced Datasets

Sangeeta Lal; Neetu Sardana; Ashish Sureka

doi:10.4018/ijossp.2016040103

Improving Logging Prediction on Imbalanced Datasets

Cognitive Analytics ◽

10.4018/978-1-7998-2460-2.ch039 ◽

2020 ◽

pp. 740-772

Author(s):

Sangeeta Lal ◽

Neetu Sardana ◽

Ashish Sureka

Keyword(s):

Machine Learning ◽

Open Source ◽

Class Imbalance ◽

Learning Model ◽

Learning Models ◽

Class Imbalance Problem ◽

Imbalanced Datasets ◽

Imbalance Problem ◽

Machine Learning Model ◽

Machine Learning Models

Logging is an important yet difficult decision for OSS developers. Machine-learning models are useful in improving several steps of OSS development, including logging. Several recent studies propose machine-learning models to predict logged code constructs. The prediction performance of these models is limited by the class-imbalance problem, since the number of logged code constructs is small compared to the number of non-logged code constructs. No previous study analyzes the class-imbalance problem for logged code construct prediction. First, the authors analyze the performance of J48, RF, and SVM classifiers in predicting logged catch-blocks and if-blocks on imbalanced datasets. Second, the authors propose LogIm, an ensemble and threshold-based machine-learning model. Third, the authors evaluate the performance of LogIm on three open-source projects. On average, the LogIm model improves the performance of the baseline classifiers J48, RF, and SVM by 7.38%, 9.24%, and 4.6% for catch-block logging prediction, and by 12.11%, 14.95%, and 19.13% for if-block logging prediction.
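
The abstract names two ingredients, an ensemble and a decision threshold, without giving details. The sketch below is not the authors' LogIm. It is a minimal illustration, under assumed toy data, of how averaging the class probabilities of several classifiers and then tuning the decision threshold for F1 can help on an imbalanced dataset. The function names, probabilities, and candidate thresholds are all hypothetical.

```python
def f1_score(y_true, y_pred):
    """F1 of the positive (logged) class, computed as 2TP / (2TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def ensemble_proba(per_model_probs):
    """Average the positive-class probabilities of several models, per instance."""
    n_models = len(per_model_probs)
    return [sum(col) / n_models for col in zip(*per_model_probs)]


def predict(proba, threshold):
    """Label an instance as logged (1) when its probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in proba]


def best_threshold(y_true, proba, candidates):
    """Return the first candidate threshold with the highest F1 on validation data."""
    best_t, best_f1 = None, -1.0
    for t in candidates:
        score = f1_score(y_true, predict(proba, t))
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1


# Hypothetical validation set: 6 non-logged blocks, 2 logged blocks.
Y_VAL = [0, 0, 0, 0, 0, 0, 1, 1]
# Hypothetical positive-class probabilities from three base classifiers.
MODEL_A = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.40, 0.6]
MODEL_B = [0.2, 0.1, 0.3, 0.2, 0.1, 0.2, 0.45, 0.7]
MODEL_C = [0.1, 0.3, 0.2, 0.1, 0.3, 0.1, 0.35, 0.5]

AVG = ensemble_proba([MODEL_A, MODEL_B, MODEL_C])
DEFAULT_F1 = f1_score(Y_VAL, predict(AVG, 0.5))
TUNED_T, TUNED_F1 = best_threshold(Y_VAL, AVG, [0.25, 0.35, 0.5])
```

With these made-up numbers the default threshold of 0.5 misses one of the two logged blocks, while a lower threshold recovers it. In practice the threshold would be chosen on a held-out validation split so that the reported improvement is not an artifact of tuning on the test data.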


TOPICAL ISSUES OF APPLICATION OF MACHINE LEARNING METHODS IN ECONOMY

Innovative Aspects of the Development of Science and Technology. Proceedings of the VIII International Scientific and Practical Conference: collection of articles [online electronic edition] / Ed. N.V. Emelyanov. Moscow: "KDU", "Dobrosvet", 2021. 149 p. ◽

10.31453/kdu.ru.978-5-7913-1176-4-2021-28-33 ◽

2021 ◽

Author(s):

Natalia Pavlovna Persteneva ◽


Darya Dmitrievn Skryleva ◽

Keyword(s):

Machine Learning ◽

Unsupervised Learning ◽

Supervised Learning ◽

Learning Model ◽

Learning Models ◽

Learning Methods ◽

Machine Learning Methods ◽

Machine Learning Model ◽

Popular Classes ◽

Machine Learning Models

The article discusses machine learning methods using the example of two popular classes: supervised learning and unsupervised learning. Variants of the main types of machine learning models for each class are presented. A generalized algorithm for building any machine learning model is formulated.


Technical note: how to rationally compare the performances of different machine learning models?

10.7287/peerj.preprints.26714 ◽

2018 ◽

Cited By ~ 1

Author(s):

Terazima Maeda

Keyword(s):

Machine Learning ◽

Predictive Accuracy ◽

Learning Model ◽

Technical Note ◽

Learning Models ◽

Specific Data ◽

Machine Learning Model ◽

Specific Prediction ◽

Data Group ◽

Machine Learning Models

Nowadays, a large number of machine learning models are available for use in various areas. However, different research targets are usually sensitive to the type of model. For a specific prediction target, the predictive accuracy of a machine learning model always depends on the data features, the data size, and the intrinsic relationship between inputs and outputs. Therefore, for a specific data group and a fixed prediction task, how to rationally compare the predictive accuracy of different machine learning models is a big question. In this brief note, we show how to compare the performance of different machine learning models using some typical examples.


Development and Validation of a Quick Sepsis-Related Organ Failure Assessment-Based Machine-Learning Model for Mortality Prediction in Patients with Suspected Infection in the Emergency Department

Journal of Clinical Medicine ◽

10.3390/jcm9030875 ◽

2020 ◽

Vol 9 (3) ◽

pp. 875

Author(s):

Young Suk Kwon ◽

Moon Seong Baek

Keyword(s):

Machine Learning ◽

Emergency Department ◽

Learning Model ◽

Gradient Boosting ◽

Learning Models ◽

Suspected Infection ◽

Machine Learning Model ◽

Failure Assessment ◽

Qsofa Score ◽

Machine Learning Models

The quick sepsis-related organ failure assessment (qSOFA) score has been introduced to predict the likelihood of organ dysfunction in patients with suspected infection. We hypothesized that machine-learning models using qSOFA variables for predicting three-day mortality would provide better accuracy than the qSOFA score in the emergency department (ED). Between January 2016 and December 2018, the medical records of patients aged over 18 years with suspected infection were retrospectively obtained from four EDs in Korea. Data from three hospitals (n = 19,353) were used as training-validation datasets and data from one hospital (n = 4234) as the test dataset. Machine-learning algorithms including extreme gradient boosting, light gradient boosting machine, and random forest were used. We assessed the prediction ability of machine-learning models using the area under the receiver operating characteristic (AUROC) curve, and DeLong's test was used to compare AUROCs between the qSOFA scores and qSOFA-based machine-learning models. A total of 447,926 patients visited EDs during the study period. We analyzed 23,587 patients with suspected infection who were admitted to the EDs. The median age of the patients was 63 years (interquartile range: 43–78 years) and in-hospital mortality was 4.0% (n = 941). For predicting three-day mortality among patients with suspected infection in the ED, the AUROC of the qSOFA-based machine-learning model (0.86 [95% CI 0.85–0.87]) was higher than that of the qSOFA scores (0.78 [95% CI 0.77–0.79], p < 0.001). The qSOFA-based machine-learning model was thus found to be superior to the conventional qSOFA scores for this task.
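
The comparison above rests on the AUROC, which can be read as the probability that a randomly chosen patient who died receives a higher score than a randomly chosen survivor. The sketch below computes it directly from that definition. The labels, qSOFA scores, and model risks are invented for illustration and are not the study's data. DeLong's test, which the study uses to compare two AUROCs, is not implemented here.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy cohort: 1 = died within three days, 0 = survived.
DIED = [0, 0, 1, 0, 1, 1, 0, 0]
# Hypothetical integer qSOFA scores (0 to 3) for the same patients.
QSOFA = [0, 1, 1, 0, 2, 3, 1, 2]
# Hypothetical risk estimates from some machine-learning model.
MODEL_RISK = [0.05, 0.20, 0.40, 0.10, 0.70, 0.90, 0.15, 0.45]

AUROC_QSOFA = auroc(DIED, QSOFA)
AUROC_MODEL = auroc(DIED, MODEL_RISK)
```

The pairwise loop is quadratic in the number of patients, which is fine for a sketch. For a cohort of the size reported above, a rank-based computation would be the more practical choice. A coarse integer score such as qSOFA produces many ties, which is one reason a continuous model output can rank patients more finely.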


Machine learning models for Hg prospecting in the Almadén mining district

10.5194/egusphere-egu21-7339 ◽

2021 ◽

Author(s):

Julio Alberto López-Gómez ◽

Daniel Carrasco Pardo ◽

Pablo Higueras ◽

Jose María Esbrí ◽

Saturnino Lorenzo

Keyword(s):

Machine Learning ◽

Learning Model ◽

Mining District ◽

Learning Models ◽

Geological Features ◽

Machine Learning Model ◽

Mineral Prospectivity ◽

Data Point ◽

The One ◽

Machine Learning Models

Traditionally, prospectivity models were designed using approaches based mainly on expert judgement. These models have been widely applied and are known as knowledge-driven prospectivity models (see Harris et al., 2015). Currently, artificial intelligence approaches, especially machine learning models, are being applied to build prospectivity models, since they have proven successful in many other domains (see Sun et al., 2019 and Guerra Prado et al., 2020). These are known as data-driven prospectivity models. Machine learning models learn from data repositories in order to extract and detect relationships in the data and predict new instances.

In this work, a geological dataset was collected by a team of expert geologists. The data include the geographical coordinates as well as several geological features of points belonging to seventy-seven different mercury deposits in the Almadén mining district. The resulting dataset comprises a total of 24798 points with 24 attributes each. In particular, we have collected geological and mining-related data regarding the Almadén mercury (Hg) mining district; these data include the location of the several Hg mineralizations, including their typology, size, mineralogy, and stratigraphic position, as well as other information associated with the metallogenetic model set up by Hernández et al. (1999).

Next, a few machine learning models are built in order to select the one that offers the best results. The aim of this work is twofold: on the one hand, to build a machine learning model capable of determining, given the geological features of a data point, the mercury deposit to which it belongs; on the other hand, to build a machine learning model capable of identifying, given the geological features of a data point, the kind of deposit to which it belongs. The experiments conducted in this work have been properly designed, and the results obtained were validated using statistical techniques.

Finally, the models built in this work will make it possible to generate mercury prospectivity maps. The final aim of this process is to obtain and train a system able to perform antimony prospection in the nearby Guadalmez syncline.

This work was funded by the ANR (ANR-19-MIN2-0002-01), the AEI (MICIU/AEI/REF.: PCI2019-103779) and the authors' institutions in the framework of the ERA-MIN2 AUREOLE project.

References

Guerra Prado, E.M.; de Souza Filho, C.R.; Carranza, E.M.; Motta, J.G. (2020). Modeling of Cu-Au prospectivity in the Carajás mineral province (Brasil) through machine learning: Dealing with imbalanced training data.

Harris, J.R.; Grunsky, E.; Corrigan, D. (2015). Data- and knowledge-driven mineral prospectivity maps for Canada's North.

Hernández, A.; Jébrak, M.; Higueras, P.; Oyarzun, R.; Morata, D.; Munhá, J. (1999). The Almadén mercury mining district, Spain. Mineralium Deposita, 34: 539-548.

Sun, T.; Chen, F.; Zhong, L.; Liu, W.; Wang, Y. (2019). GIS-based mineral prospectivity mapping using machine learning methods: A case study from Tongling ore district, eastern China.


Deep Learning Applications in Medical Imaging

Deep Learning Applications in Medical Imaging - Advances in Medical Technologies and Clinical Practice ◽

10.4018/978-1-7998-5071-7.ch008 ◽

2021 ◽

pp. 178-208

Author(s):

S. Sasikala ◽

S. J. Subhashini ◽

P. Alli ◽

J. Jane Rubel Angelina

Keyword(s):

Artificial Intelligence ◽

Machine Learning ◽

Deep Learning ◽

Medical Imaging ◽

Learning Model ◽

Learning Models ◽

Driving System ◽

Machine Learning Model ◽

Car Driving ◽

Machine Learning Models

Machine learning is a technique for parsing data, learning from that data, and then applying what has been learned to make informed decisions. Deep learning is a subset of machine learning: it is technically machine learning and functions in the same way, but it has different capabilities. The main difference is that machine learning models improve progressively but still need some guidance. If a machine learning model returns an inaccurate prediction, the programmer needs to fix that problem explicitly, whereas in the case of deep learning the model does so by itself. An automatic car driving system is a good example of deep learning. Artificial intelligence (AI), on the other hand, is a broader concept than machine learning and deep learning: both are subsets of AI.


Hybrid Modeling In Unconventional Reservoirs To Forecast Estimated Ultimate Recovery

10.4043/31010-ms ◽

2021 ◽

Author(s):

Cenk Temizel ◽

Celal Hakan Canbaz ◽

Karthik Balaji ◽

Ahsen Ozesen ◽

Kirill Yanidis ◽

...

Keyword(s):

Machine Learning ◽

Integrated Approach ◽

Learning Model ◽

Hybrid Modeling ◽

Unconventional Reservoirs ◽

Learning Models ◽

Production Forecasting ◽

Machine Learning Model ◽

Machine Learning Models ◽

Estimated Ultimate Recovery

Machine learning models have worked as a robust tool in forecasting and optimization processes for wells in conventional, data-rich reservoirs. In unconventional reservoirs, however, given the large ranges of uncertainty, purely data-driven machine learning models have not yet proven to be repeatable and scalable. In such cases, integrating physics-based reservoir simulation methods with machine learning techniques can alleviate these limitations. The objective of this study is to provide an overview, along with examples, of implementing this integrated approach for forecasting Estimated Ultimate Recovery (EUR) in shale reservoirs. This study is based solely on synthetic data. To generate data for one section of a reservoir, a full-physics reservoir simulator has been used. Simulated data from this section is used to train a machine learning model, which provides EUR as the output. Production from another section of the field with a different range of reservoir properties is then forecasted using a physics-based model. Using the earlier trained model, production forecasting for this section of the reservoir is then carried out to illustrate the integrated approach to EUR forecasting for a section of the reservoir that is not data rich. Production forecasts obtained with the integrated approach, or hybrid modeling, are illustrated for different data-starved sections of the reservoir. Using the physics-based model, the uncertainty in the EUR predictions made by the machine learning model has been reduced and more accurate forecasting has been attained. This method is primarily applicable in reservoirs, such as unconventionals, where one developed section of the field has a substantial amount of data whereas the other section is data starved. The hybrid model was consistently able to forecast EUR at an acceptable level of accuracy, highlighting the benefits of this type of integrated approach. This study advances the application of repeatable and scalable hybrid models in unconventional reservoirs and highlights their benefits compared to using either physics-based or machine-learning-based models separately.


Building malware classificators usable by State security agencies

ITECKNE Innovación e Investigación en Ingeniería ◽

10.15332/iteckne.v15i2.2072 ◽

2018 ◽

Vol 15 (2) ◽

pp. 107-121

Author(s):

David Esteban Useche-Peláez ◽

Daniel Orlando Díaz-López ◽

Daniela Sepúlveda-Alzate ◽

Diego Edison Cabuya-Padilla

Keyword(s):

Machine Learning ◽

Learning Model ◽

Malware Analysis ◽

Learning Models ◽

Rigorous Analysis ◽

Powerful Technique ◽

State Security ◽

Machine Learning Model ◽

Machine Learning Models

Sandboxing has been used regularly to analyze software samples and determine whether they contain suspicious properties or behaviors. Although sandboxing is a powerful technique for malware analysis, it requires a malware analyst to perform a rigorous analysis of the results to determine the nature of the sample: goodware or malware. This paper proposes two machine learning models able to classify samples based on signatures and permissions obtained through Cuckoo sandbox, Androguard and VirusTotal. The developed models are also tested, obtaining an acceptable percentage of correctly classified samples, which makes them useful tools for a malware analyst. A proposed architecture for an IoT sentinel that uses one of the developed machine learning models is also shown. Finally, different approaches, perspectives, and challenges regarding the use of sandboxing and machine learning by security teams in State security agencies are shared.


Multitask machine learning models for predicting lipophilicity (logP) in the SAMPL7 challenge

Journal of Computer-Aided Molecular Design ◽

10.1007/s10822-021-00405-6 ◽

2021 ◽

Author(s):

Eelke B. Lenselink ◽

Pieter F. W. Stouten

Keyword(s):

Machine Learning ◽

Neural Networks ◽

Drug Discovery ◽

Message Passing ◽

Learning Model ◽

Molecular Structures ◽

Learning Models ◽

Final Model ◽

Machine Learning Model ◽

Machine Learning Models

Accurate prediction of lipophilicity (logP) based on molecular structures is a well-established field. Predictions of logP are often used to drive forward drug discovery projects. Driven by the SAMPL7 challenge, in this manuscript we describe the steps that were taken to construct a novel machine learning model that can predict and generalize well. This model is based on the recently described Directed-Message Passing Neural Networks (D-MPNNs). Further enhancements included both the inclusion of additional datasets from ChEMBL (RMSE improvement of 0.03) and the addition of helper tasks (RMSE improvement of 0.04). To the best of our knowledge, the concept of adding predictions from other models (Simulations Plus logP and [email protected], respectively) as helper tasks is novel and could be applied in a broader context. The final model that we constructed and used to participate in the challenge ranked 2nd out of 17 ranked submissions with an RMSE of 0.66 and an MAE of 0.48 (submission: Chemprop). The model also works well on other datasets, especially when retrospectively applied to the SAMPL6 challenge, where it would have ranked number one out of all submissions (RMSE of 0.35). Although our model works well, we conclude with suggestions that are expected to improve it even further.
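
The results above are reported as RMSE and MAE over predicted versus measured logP values. As a reminder of how the two metrics differ, the following minimal sketch computes both. The example values are invented and do not come from the SAMPL7 data.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large individual errors more heavily."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error: the average size of the errors, in logP units."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical measured and predicted logP values for three molecules.
MEASURED = [1.0, 2.0, 3.0]
PREDICTED = [1.0, 2.0, 5.0]

EXAMPLE_RMSE = rmse(MEASURED, PREDICTED)
EXAMPLE_MAE = mae(MEASURED, PREDICTED)
```

RMSE is never smaller than MAE on the same predictions, and the gap widens when a few molecules are predicted badly. That is consistent with the reported challenge result having a higher RMSE (0.66) than MAE (0.48).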


Interpretation of Maturity-Onset Diabetes of the Young Genetic Variants Based on American College of Medical Genetics and Genomics Criteria: Machine-Learning Model Development

JMIR Biomedical Engineering ◽

10.2196/20506 ◽

2020 ◽

Vol 5 (1) ◽

pp. e20506

Author(s):

Yichuan Liu ◽

Hui-Qi Qu ◽

Adam S Wenocur ◽

Jingchun Qu ◽

Xiao Chang ◽

...

Keyword(s):

Machine Learning ◽

Learning Model ◽

Medical Genetics ◽

Maturity Onset Diabetes ◽

Learning Models ◽

Onset Diabetes ◽

Machine Learning Model ◽

Dna Variants ◽

Genetics And Genomics ◽

Machine Learning Models

Background: Maturity-onset diabetes of the young (MODY) is a group of dominantly inherited monogenic diabetes, with HNF4A-MODY, GCK-MODY, and HNF1A-MODY as the three most common forms based on the causal genes. Molecular diagnosis of MODY is important for precise treatment. Although a DNA variant causing MODY can be assessed based on the criteria of the American College of Medical Genetics and Genomics (ACMG) guidelines, gene-specific assessment of disease-causing mutations is important to differentiate among MODY subtypes. As the ACMG criteria were not originally designed for machine-learning algorithms, they are not true independent variables. Objective: The aim of this study was to develop machine-learning models for interpretation of DNA variants and MODY diagnosis using the ACMG criteria. Methods: We applied machine-learning models for interpretation of DNA variants in MODY genes defined by the ACMG criteria based on the Human Gene Mutation Database (HGMD) and ClinVar database. Results: With a machine-learning procedure, we found that the weight matrix of the ACMG criteria was significantly different between the three MODY genes HNF1A, HNF4A, and GCK. The models showed high predictive abilities with accuracy over 95%. Conclusions: Our results highlight the need for applying different weights of the ACMG criteria in relation to different MODY genes for accurate functional classification. As proof of principle, we applied the ACMG criteria as feature vectors in a machine-learning model and obtained a precision-based result.
