dataset size Latest Research Papers

Supervised Learning Predictive Models for Automated Fracturing Treatment Design: A Workflow Based on Algorithm Comparison and Multiphysics Model Validation

10.2118/205310-ms ◽

2022 ◽

Author(s):

Abdul Muqtadir Khan ◽

Abdullah BinZiad ◽

Abdullah Al Subaii ◽

Turki Alqarni ◽

Mohamed Yassine Jelassi ◽

...

Keyword(s):

Machine Learning ◽

Supervised Learning ◽

Model Building ◽

Confusion Matrix ◽

Correlation Study ◽

Support Vector ◽

Fracture Geometry ◽

Algorithm Comparison ◽

Recent Developments ◽

Dataset Size

Abstract Diagnostic pumping techniques are used routinely in proppant fracturing design. The pumping process can be time consuming; however, it yields technical confidence in treatment and productivity optimization. Recent developments in data analytics and machine learning can aid in shortening operational workflows and enhance project economics. Supervised learning was applied to an existing database to streamline the process and affect the design framework. Five classification algorithms were used for this study. The database was constructed through heterogeneous reservoir plays from the injection/falloff outputs. The algorithms used were support vector machine, decision tree, random forest, multinomial, and XGBoost. The number of classes was sensitized to establish a balance between model accuracy and prediction granularity. Fifteen cases were developed for a comprehensive comparison. A complete machine learning framework was constructed to work through each case set along with hyperparameter tuning to maximize accuracy. After the model was finalized, an extensive field validation workflow was deployed. The target outputs selected for the model were crosslinked fluid efficiency, total proppant mass, and maximum proppant concentration. The unsupervised clustering technique with t-SNE algorithm that was used first lacked accuracy. Supervised classification models showed better predictions. Cross-validation techniques showed an increasing trend of prediction accuracy. Feature selection was done using one-variable-at-a-time (OVAT) and a simple feature correlation study. Because the number of features and the dataset size were small, no features were eliminated from the final model building. Accuracy and F1 score calculations were used from the confusion matrix for evaluation, XGBoost showed excellent results with an accuracy of 74 to 95% for the output parameters. Fluid efficiency was categorized into three classes and yielded an accuracy of 96%. Proppant concentration and proppant mass predictions showed 77% and 86% accuracy, respectively, for the six-class case. The combination of high accuracy and fine granularity confirmed the potential application of machine learning models. The ratio of training to testing (holdout) across all cases ranged from 80:20 to 70:30. Model validations were done through an inverse problem of predicting and matching the fracture geometry and treatment pressures from the machine learning model design and the actual net pressure match. The simulations were conducted using advanced multiphysics simulations. The advantages of this innovative design approach showed four areas of improvement: reduction in polymer consumption by 30%, reduction of the flowback time by 25%, reduction of water usage by 30%, and enhanced operational efficiency by 60 to 65%.

The Effect of Chemical Representation on Active Machine Learning Towards Closed-Loop Optimization

10.26434/chemrxiv-2022-htmn0 ◽

2022 ◽

Author(s):

Alexander Pomberger ◽

Antonio Pedrina McCarthy ◽

Ahmad Khan ◽

Simon Sung ◽

Connor Taylor ◽

...

Keyword(s):

Machine Learning ◽

Chemical Reaction ◽

Closed Loop ◽

Reaction Yield ◽

Catalytic Systems ◽

Loop Optimization ◽

Reaction Optimization ◽

Initial Dataset ◽

Dataset Size ◽

The Impact

Multivariate chemical reaction optimization involving catalytic systems is a non-trivial task due to the high number of tuneable parameters and discrete choices. Closed-loop optimization featuring active Machine Learning (ML) represents a powerful strategy for automating reaction optimization. However, the translation of chemical reaction conditions into a machine-readable format comes with the challenge of finding highly informative features which accurately capture the factors for reaction success and allow the model to learn efficiently. Herein, we compare the efficacy of different calculated chemical descriptors for a high throughput generated dataset to determine the impact on a supervised ML model when predicting reaction yield. Then, the effect of featurization and size of the initial dataset within a closed-loop reaction optimization was examined. Finally, the balance between descriptor complexity and dataset size was considered. Ultimately, tailored descriptors did not outperform simple generic representations, however, a larger initial dataset accelerated reaction optimization.

Childhood hearing impairment and fertility in Norway

Scientific Reports ◽

10.1038/s41598-021-04195-7 ◽

2022 ◽

Vol 12 (1) ◽

Author(s):

Vegard Skirbekk ◽

Éric Bonsang ◽

Bo Engdahl

Keyword(s):

Hearing Impairment ◽

Fixed Effects ◽

Unobserved Heterogeneity ◽

Hearing Threshold ◽

Pure Tone Audiometry ◽

Population Level ◽

Later Life ◽

Birth Cohorts ◽

Number Of Children ◽

Dataset Size

AbstractThere is a lack of studies assessing how hearing impairment relates to reproductive outcomes. We examined whether childhood hearing impairment (HI) affects reproductive patterns based on longitudinal Norwegian population level data for birth cohorts 1940–1980. We used Poisson regression to estimate the association between the number of children ever born and HI. The association with childlessness is estimated by a logit model. As a robustness check, we also estimated family fixed effects Poisson and logit models. Hearing was assessed at ages 7, 10 and 13, and reproduction was observed at adult ages until 2014. Air conduction hearing threshold levels were obtained by pure-tone audiometry at eight frequencies from 0.25 to 8 kHz. Fertility data were collected from Norwegian administrative registers. The combined dataset size was N = 50,022. Our analyses reveal that HI in childhood is associated with lower fertility in adulthood, especially for men. The proportion of childless individuals among those with childhood HI was almost twice as large as that of individuals with normal childhood hearing (20.8% vs. 10.7%). The negative association is robust to the inclusion of family fixed effects in the model that allow to control for the unobserved heterogeneity that are shared between siblings, including factors related to the upbringing and parent characteristics. Less family support in later life could add to the health challenges faced by those with HI. More attention should be given to how fertility relates to HI.

Capacity estimation of batteries: Influence of training dataset size and diversity on data driven prognostic models

Reliability Engineering & System Safety ◽

10.1016/j.ress.2021.108048 ◽

2021 ◽

Vol 216 ◽

pp. 108048

Author(s):

Vijay Mohan Nagulapati ◽

Hyunjun Lee ◽

DaWoon Jung ◽

Boris Brigljevic ◽

Yunseok Choi ◽

...

Keyword(s):

Data Driven ◽

Prognostic Models ◽

Training Dataset ◽

Capacity Estimation ◽

Dataset Size

The Effect of the Dataset Size on the Accuracy of Software Defect Prediction Models: An Empirical Study

INTELIGENCIA ARTIFICIAL ◽

10.4114/intartif.vol24iss68pp72-88 ◽

2021 ◽

Vol 24 (68) ◽

pp. 72-88

Author(s):

Mohammad Alshayeb ◽

Mashaan A. Alshammari

Keyword(s):

Feature Selection ◽

Prediction Model ◽

Prediction Models ◽

Fault Prediction ◽

Defect Prediction ◽

Software Defect Prediction ◽

Software Defect ◽

Dataset Size ◽

Defect Prediction Models ◽

Selection Algorithms

The ongoing development of computer systems requires massive software projects. Running the components of these huge projects for testing purposes might be a costly process; therefore, parameter estimation can be used instead. Software defect prediction models are crucial for software quality assurance. This study investigates the impact of dataset size and feature selection algorithms on software defect prediction models. We use two approaches to build software defect prediction models: a statistical approach and a machine learning approach with support vector machines (SVMs). The fault prediction model was built based on four datasets of different sizes. Additionally, four feature selection algorithms were used. We found that applying the SVM defect prediction model on datasets with a reduced number of measures as features may enhance the accuracy of the fault prediction model. Also, it directs the test effort to maintain the most influential set of metrics. We also found that the running time of the SVM fault prediction model is not consistent with dataset size. Therefore, having fewer metrics does not guarantee a shorter execution time. From the experiments, we found that dataset size has a direct influence on the SVM fault prediction model. However, reduced datasets performed the same or slightly lower than the original datasets.

NAToRA, a relatedness-pruning method to minimize the loss of dataset size in genetic and omics analyses

10.1101/2021.10.21.465343 ◽

2021 ◽

Author(s):

Thiago Peixoto Leal ◽

Vinicius C Furlan ◽

Mateus Henrique Gouveia ◽

Julia Maria Saraiva Duarte ◽

Pablo AS Fonseca ◽

...

Keyword(s):

Large Datasets ◽

Node Degree ◽

Degree Centrality ◽

Web Tool ◽

Pruning Method ◽

Dataset Size ◽

Independent Observations ◽

Related Individuals ◽

Minimal Reduction ◽

Number Of Individuals

Genetic and omics analyses frequently require independent observations, which is not guaranteed in real datasets. When relatedness can not be accounted for, solutions involve removing related individuals (or observations) and, consequently, a reduction of available data. We developed a network-based relatedness-pruning method that minimizes dataset reduction while removing unwanted relationships in a dataset. It uses node degree centrality metric to identify highly connected nodes (or individuals) and implements heuristics that approximate the minimal reduction of a dataset to allow its application to large datasets. NAToRA outperformed two popular methodologies (implemented in software PLINK and KING) by showing the best combination of effective relatedness-pruning, removing all relatives while keeping the largest possible number of individuals in all datasets tested and also, with similar or lesser reduction in genetic diversity. NAToRA is freely available, both as a standalone tool that can be easily incorporated as part of a pipeline, and as a graphical web tool that allows visualization of the relatedness networks. NAToRA also accepts a variety of relationship metrics as input, which facilitates its use. We also present a genealogies simulator software used for different tests performed in the manuscript.

COVID-19 Pneumonia Detection Using Optimized Deep Learning Techniques

Diagnostics ◽

10.3390/diagnostics11111972 ◽

2021 ◽

Vol 11 (11) ◽

pp. 1972

Author(s):

Abul Bashar ◽

Ghazanfar Latif ◽

Ghassen Ben Brahim ◽

Nazeeruddin Mohammad ◽

Jaafar Alghazo

Keyword(s):

Deep Learning ◽

Transfer Learning ◽

Data Augmentation ◽

Learning Algorithm ◽

The Body ◽

X Rays ◽

X Ray ◽

Clinical Observations ◽

Dataset Size ◽

Chest X Ray

It became apparent that mankind has to learn to live with and adapt to COVID-19, especially because the developed vaccines thus far do not prevent the infection but rather just reduce the severity of the symptoms. The manual classification and diagnosis of COVID-19 pneumonia requires specialized personnel and is time consuming and very costly. On the other hand, automatic diagnosis would allow for real-time diagnosis without human intervention resulting in reduced costs. Therefore, the objective of this research is to propose a novel optimized Deep Learning (DL) approach for the automatic classification and diagnosis of COVID-19 pneumonia using X-ray images. For this purpose, a publicly available dataset of chest X-rays on Kaggle was used in this study. The dataset was developed over three stages in a quest to have a unified COVID-19 entities dataset available for researchers. The dataset consists of 21,165 anterior-to-posterior and posterior-to-anterior chest X-ray images classified as: Normal (48%), COVID-19 (17%), Lung Opacity (28%) and Viral Pneumonia (6%). Data Augmentation was also applied to increase the dataset size to enhance the reliability of results by preventing overfitting. An optimized DL approach is implemented in which chest X-ray images go through a three-stage process. Image Enhancement is performed in the first stage, followed by Data Augmentation stage and in the final stage the results are fed to the Transfer Learning algorithms (AlexNet, GoogleNet, VGG16, VGG19, and DenseNet) where the images are classified and diagnosed. Extensive experiments were performed under various scenarios, which led to achieving the highest classification accuracy of 95.63% through the application of VGG16 transfer learning algorithm on the augmented enhanced dataset with freeze weights. This accuracy was found to be better as compared to the results reported by other methods in the recent literature. Thus, the proposed approach proved superior in performance as compared with that of other similar approaches in the extant literature, and it made a valuable contribution to the body of knowledge. Although the results achieved so far are promising, further work is planned to correlate the results of the proposed approach with clinical observations to further enhance the efficiency and accuracy of COVID-19 diagnosis.

Author Reputation Measurement on Question and Answer Sites by the Classification of Author-Generated Content

International Journal of Software Engineering and Knowledge Engineering ◽

10.1142/s0218194021500479 ◽

2021 ◽

Vol 31 (10) ◽

pp. 1421-1445

Author(s):

Erhan Sezerer ◽

Samet Tenekeci ◽

Ali Acar ◽

Bora Baloğlu ◽

Selma Tekir

Keyword(s):

Software Engineering ◽

Design Patterns ◽

Binary Classification ◽

Grey Literature ◽

Language Models ◽

Superior Performance ◽

Reputation Measurement ◽

Objective Quality ◽

Question And Answer ◽

Dataset Size

In the field of software engineering, practitioners’ share in the constructed knowledge cannot be underestimated and is mostly in the form of grey literature (GL). GL is a valuable resource though it is subjective and lacks an objective quality assurance methodology. In this paper, a quality assessment scheme is proposed for question and answer (Q&A) sites. In particular, we target stack overflow (SO) and stack exchange (SE) sites. We model the problem of author reputation measurement as a classification task on the author-provided answers. The authors’ mean, median, and total answer scores are used as inputs for class labeling. State-of-the-art language models (BERT and DistilBERT) with a softmax layer on top are utilized as classifiers and compared to SVM and random baselines. Our best model achieves [Formula: see text] accuracy in binary classification in SO design patterns tag and [Formula: see text] accuracy in SE software engineering category. Superior performance in SE software engineering can be explained by its larger dataset size. In addition to quantitative evaluation, we provide qualitative evidence, which supports that the system’s predicted reputation labels match the quality of provided answers.

Detection of Pathology in Panoramic Radiographs via Machine Learning Using Neural Networks for Dataset Size Augmentation

Journal of Oral and Maxillofacial Surgery ◽

10.1016/j.joms.2021.08.019 ◽

2021 ◽

Vol 79 (10) ◽

pp. e6-e7

Author(s):

N. Mehandru ◽

W.L. Hicks ◽

A.K. Singh ◽

L. Hsu ◽

M.R. Markiewicz ◽

...

Keyword(s):

Machine Learning ◽

Neural Networks ◽

Panoramic Radiographs ◽

Dataset Size

Effects of Dataset Size and Interactions on the Prediction Performance of Logistic Regression and Deep Learning Models

Computer Methods and Programs in Biomedicine ◽

10.1016/j.cmpb.2021.106504 ◽

2021 ◽

pp. 106504

Author(s):

Alexandre Bailly ◽

Corentin Blanc ◽

Élie Francis ◽

Thierry Guillotin ◽

Fadi Jamal ◽

...

Keyword(s):

Logistic Regression ◽

Deep Learning ◽

Prediction Performance ◽

Learning Models ◽

Dataset Size

dataset size
Recently Published Documents

TOTAL DOCUMENTS

H-INDEX

Supervised Learning Predictive Models for Automated Fracturing Treatment Design: A Workflow Based on Algorithm Comparison and Multiphysics Model Validation

The Effect of Chemical Representation on Active Machine Learning Towards Closed-Loop Optimization

Childhood hearing impairment and fertility in Norway

Capacity estimation of batteries: Influence of training dataset size and diversity on data driven prognostic models

The Effect of the Dataset Size on the Accuracy of Software Defect Prediction Models: An Empirical Study

NAToRA, a relatedness-pruning method to minimize the loss of dataset size in genetic and omics analyses

COVID-19 Pneumonia Detection Using Optimized Deep Learning Techniques

Author Reputation Measurement on Question and Answer Sites by the Classification of Author-Generated Content

Detection of Pathology in Panoramic Radiographs via Machine Learning Using Neural Networks for Dataset Size Augmentation

Effects of Dataset Size and Interactions on the Prediction Performance of Logistic Regression and Deep Learning Models

Export Citation Format

dataset sizeRecently Published Documents

TOTAL DOCUMENTS

H-INDEX

Supervised Learning Predictive Models for Automated Fracturing Treatment Design: A Workflow Based on Algorithm Comparison and Multiphysics Model Validation

The Effect of Chemical Representation on Active Machine Learning Towards Closed-Loop Optimization

Childhood hearing impairment and fertility in Norway

Capacity estimation of batteries: Influence of training dataset size and diversity on data driven prognostic models

The Effect of the Dataset Size on the Accuracy of Software Defect Prediction Models: An Empirical Study

NAToRA, a relatedness-pruning method to minimize the loss of dataset size in genetic and omics analyses

COVID-19 Pneumonia Detection Using Optimized Deep Learning Techniques

Author Reputation Measurement on Question and Answer Sites by the Classification of Author-Generated Content

Detection of Pathology in Panoramic Radiographs via Machine Learning Using Neural Networks for Dataset Size Augmentation

Effects of Dataset Size and Interactions on the Prediction Performance of Logistic Regression and Deep Learning Models

dataset size
Recently Published Documents