Entropy Ensemble Filter: Does information content assessment of bootstrapped training datasets before model training lead to better trade-off between ensemble size and predictive performance?

Mapping Intimacies ◽ 10.5194/egusphere-egu2020-1963 ◽ 2020

Author(s): Hossein Foroozand ◽ Steven V. Weijs

Keyword(s): Machine Learning ◽ Computational Cost ◽ Predictive Performance ◽ Original Data ◽ Training Data ◽ Computational Time ◽ Limiting Factor ◽ Ensemble Size ◽ Content Assessment ◽ Model Training

Machine learning is a fast-growing branch of data-driven modelling whose main objective is to use computational methods to become more accurate in predicting outcomes without being explicitly programmed. One way to improve model predictions is to use a large collection of models (called an ensemble) instead of a single one. Each model is trained on a slightly different sample of the original data, and their predictions are averaged. This is called bootstrap aggregating, or bagging, and is widely applied. A recurring question in previous work was how to choose the ensemble size of training data sets for tuning the weights in machine learning. The computational cost of ensemble-based methods scales with the size of the ensemble, but excessively reducing the ensemble size comes at the cost of reduced predictive performance. The choice of ensemble size was often determined by the size of the input data and the available computational power, which can become a limiting factor for larger datasets and the training of complex models. In this research, we hypothesize that if an ensemble of artificial neural network (ANN) models, or any other machine learning technique, uses only the most informative ensemble members for training rather than all bootstrapped ensemble members, it could reduce the computational time substantially without negatively affecting the performance of the simulation.
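The idea of keeping only the most informative bootstrap resamples can be illustrated with a minimal Python sketch. The abstract does not say how information content is measured, so the histogram-based Shannon entropy, the assumption that values lie in [0, 1], and the function names below are all illustrative assumptions rather than the authors' method.

```python
import math
import random
from collections import Counter


def shannon_entropy(sample, n_bins=10, lo=0.0, hi=1.0):
    """Shannon entropy (bits) of a sample after histogram binning on [lo, hi]."""
    width = (hi - lo) / n_bins
    # Clamp the top edge into the last bin.
    counts = Counter(min(int((x - lo) / width), n_bins - 1) for x in sample)
    n = len(sample)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def bootstrap(data, rng):
    """Draw one resample of the same size as `data`, with replacement."""
    return [rng.choice(data) for _ in data]


def entropy_filtered_ensemble(data, n_boot=20, keep=5, seed=0):
    """Split n_boot resamples into the `keep` highest-entropy ones and the rest."""
    rng = random.Random(seed)
    resamples = [bootstrap(data, rng) for _ in range(n_boot)]
    ranked = sorted(resamples, key=shannon_entropy, reverse=True)
    return ranked[:keep], ranked[keep:]


# Hypothetical data: 50 values in [0, 1).
data_rng = random.Random(1)
data = [data_rng.random() for _ in range(50)]
kept, dropped = entropy_filtered_ensemble(data)
print(len(kept), "resamples kept for training,", len(dropped), "skipped")
```

Only the kept resamples would then be used to train ensemble members, which is where the hoped-for saving in computational time would come from.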

Concrete Crack Detection Based on Well-Known Feature Extractor Model and the YOLO_v2 Network

Applied Sciences ◽ 10.3390/app11020813 ◽ 2021 ◽ Vol 11 (2) ◽ pp. 813

Author(s): Shuai Teng ◽ Zongchao Liu ◽ Gongfa Chen ◽ Li Cheng

Keyword(s): Feature Extraction ◽ Crack Detection ◽ Computational Cost ◽ Concrete Structures ◽ Detection Algorithm ◽ Computational Time ◽ Image Size ◽ Important Indicator ◽ Feature Extractor ◽ Model Training

This paper compares the crack detection performance (in terms of precision and computational cost) of YOLO_v2 with 11 feature extractors, providing a basis for fast and accurate crack detection on concrete structures. Cracks in concrete structures are an important indicator for assessing durability and safety, and real-time crack detection is an essential task in structural maintenance. Object detection algorithms, especially the YOLO series of networks, have significant potential in crack detection, and the feature extractor is the most important component of YOLO_v2. Hence, this paper employs 11 well-known CNN models as the feature extractor of YOLO_v2 for crack detection. The results confirm that different feature extractor models lead to different detection results: the AP value is 0.89, 0, and 0 for ‘resnet18’, ‘alexnet’, and ‘vgg16’, respectively, while ‘googlenet’ (AP = 0.84) and ‘mobilenetv2’ (AP = 0.87) demonstrate comparable AP values. In terms of computing speed, ‘alexnet’ takes the least computational time, with ‘squeezenet’ and ‘resnet18’ ranked second and third, respectively; therefore, ‘resnet18’ is the best feature extractor model in terms of precision and computational cost. Additionally, a parametric study (the influence on detection results of the training epoch, feature extraction layer, and testing image size) shows that the associated parameters do affect the detection results. Overall, excellent crack detection results can be achieved by the YOLO_v2 detector when an appropriate feature extractor model, training epoch, feature extraction layer, and testing image size are chosen.

A review: preprocessing techniques and data augmentation for sentiment analysis

Computational Social Networks ◽ 10.1186/s40649-020-00080-x ◽ 2021 ◽ Vol 8 (1)

Author(s): Huu-Thanh Duong ◽ Tram-Anh Nguyen-Thi

Keyword(s): Machine Learning ◽ Sentiment Analysis ◽ Supervised Learning ◽ Data Augmentation ◽ Original Data ◽ Training Data ◽ Unseen Data ◽ Augmentation Techniques ◽ User Intervention

In the literature, machine learning-based studies of sentiment analysis usually rely on supervised learning, which requires pre-labeled datasets that are large enough for the domain in question. Such datasets are tedious, expensive and time-consuming to build, and the resulting models handle unseen data poorly. This paper applies semi-supervised learning to Vietnamese sentiment analysis, for which datasets are limited. We summarize many preprocessing techniques that were used to clean and normalize the data, together with negation handling and intensification handling, to improve performance. Moreover, data augmentation techniques, which generate new data from the original data to enrich the training data without user intervention, are also presented. In experiments, we evaluated various aspects and obtained competitive results which may motivate future proposals.

A machine learning-based predictor for the identification of the recurrence of patients with gastric cancer after operation

Scientific Reports ◽ 10.1038/s41598-021-81188-6 ◽ 2021 ◽ Vol 11 (1)

Author(s): Chengmao Zhou ◽ Junhong Hu ◽ Ying Wang ◽ Mu-Huo Ji ◽ Jianhua Tong ◽ ...

Keyword(s): Machine Learning ◽ Gastric Cancer ◽ Learning Algorithms ◽ Test Group ◽ Operation Time ◽ Predictive Performance ◽ Original Data ◽ Postoperative Recurrence ◽ Machine Learning Algorithms ◽ Gastric Cancer Patients

This study explores the predictive performance of machine learning for the recurrence of gastric cancer in patients after operation. The available data is divided into two parts: the first part is used as a training set (such as 80% of the original data), and the second part is used as a test set (the remaining 20% of the data). We also use fivefold cross-validation. The weights of the recurrence factors show that the top four factors are BMI, operation time, WGT and age, in that order. In the training group, among the five machine learning models, the accuracy of gbm was 0.891, followed by the gbm algorithm at 0.876. The AUC values of the five machine learning algorithms were, from high to low, forest (0.962), gbm (0.922), GradientBoosting (0.898), DecisionTree (0.790) and Logistic (0.748). The precision of forest was the highest (0.957), followed by the GradientBoosting algorithm (0.878). In the test group, Logistic had the highest accuracy (0.801), followed by forest and gbm; the AUC values of the five algorithms were, from high to low, forest (0.795), GradientBoosting (0.774), DecisionTree (0.773), Logistic (0.771) and gbm (0.771). Among the five machine learning algorithms, Logistic had the highest precision (1.000), followed by gbm (0.487). Machine learning can predict the recurrence of gastric cancer in patients after an operation. In addition, the top four factors affecting postoperative recurrence of gastric cancer were BMI, operation time, WGT and age.
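The evaluation protocol described above (a hold-out split plus fivefold cross-validation) can be sketched in plain Python. This is only an illustration of the general technique: the function names, the random seed and the choice to cross-validate inside the training portion are assumptions, not details taken from the paper.

```python
import random


def train_test_split(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into train and test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold(indices, k=5):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, validation


# Hypothetical dataset of 100 patients: 80 for training, 20 held out.
train_idx, test_idx = train_test_split(100)
for fold_train, fold_val in k_fold(train_idx, k=5):
    # A model would be fitted on fold_train and scored on fold_val here.
    pass
```

Each sample in the training portion is used for validation exactly once, while the held-out test set is touched only for the final performance figures.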

Nowcasting heavy precipitation over the Netherlands using a 13-year radar archive: a machine learning approach

10.5194/egusphere-egu21-12814 ◽ 2021

Author(s): Eva van der Kooij ◽ Marc Schleiss ◽ Riccardo Taormina ◽ Francesco Fioranelli ◽ Dorien Lugt ◽ ...

Keyword(s): Machine Learning ◽ The Netherlands ◽ Heavy Rainfall ◽ Predictive Performance ◽ Heavy Precipitation ◽ Early Warning Systems ◽ Training Data ◽ Short Term ◽ Data Set ◽ Radar Images

Accurate short-term forecasts, also known as nowcasts, of heavy precipitation are desirable for creating early warning systems for extreme weather and its consequences, e.g. urban flooding. In this research, we explore the use of machine learning for short-term prediction of heavy rainfall showers in the Netherlands.

We assess the performance of a recurrent, convolutional neural network (TrajGRU) with lead times of 0 to 2 hours. The network is trained on a 13-year archive of radar images with 5-min temporal and 1-km spatial resolution from the precipitation radars of the Royal Netherlands Meteorological Institute (KNMI). We aim to train the model to predict the formation and dissipation of dynamic, heavy, localized rain events, a task for which traditional Lagrangian nowcasting methods still come up short.

We report on different ways to optimize predictive performance for heavy rainfall intensities through several experiments. The large dataset available provides many possible configurations for training. To focus on heavy rainfall intensities, we use different subsets of this dataset by applying different conditions for event selection and varying the ratio of light and heavy precipitation events in the training data set, and we change the loss function used to train the model.

To assess the performance of the model, we compare our method to a current state-of-the-art Lagrangian nowcasting system from the pySTEPS library, such as S-PROG, a deterministic approximation of an ensemble mean forecast. The results of the experiments are used to discuss the pros and cons of machine learning-based methods for precipitation nowcasting and possible ways to further increase performance.

Broadening volcanic eruption forecasting using transfer machine learning

10.5194/egusphere-egu21-970 ◽ 2021

Author(s): David Dempsey ◽ Shane Cronin ◽ Andreas Kempa-Liehr ◽ Martin Letourneur

Keyword(s): Machine Learning ◽ Seismic Station ◽ Feature Space ◽ Forecast Model ◽ Linear Interpolation ◽ Lessons Learned ◽ Training Data ◽ Single Station ◽ Data Driven Approach ◽ Model Training

Sudden steam-driven eruptions at tourist volcanoes were the cause of 63 deaths at Mt Ontake (Japan) in 2014, and 22 deaths at Whakaari (New Zealand) in 2019. Warning systems that can anticipate these eruptions could provide crucial hours for evacuation or sheltering, but these require reliable forecasting. Recently, machine learning has been used to extract eruption precursors from observational data and train forecasting models. However, a weakness of this data-driven approach is its reliance on long observational records that span multiple eruptions. As many volcano datasets may only record one or no eruptions, there is a need to extend these techniques to data-poor locales.

Transfer machine learning is one approach for generalising lessons learned at data-rich volcanoes and applying them to data-poor ones. Here, we tackle two problems: (1) generalising time series features between seismic stations at Whakaari to address recording gaps, and (2) training a forecasting model for Mt Ruapehu augmented using data from Whakaari. This required that we standardise data records at different stations for direct comparisons, devise an interpolation scheme to fill in missing eruption data, and combine volcano-specific feature matrices prior to model training.

We trained a forecast model for Whakaari using tremor data from three eruptions recorded at one seismic station (WSRZ) and augmented by data from two other eruptions recorded at a second station (WIZ). First, the training data from both stations were standardised to a unit normal distribution in log space. Then, linear interpolation in feature space was used to infer missing eruption features at WSRZ. Under pseudo-prospective testing, the augmented model had similar forecasting skill to one trained using all five eruptions recorded at a single station (WIZ). However, extending this approach to Ruapehu, we saw reduced performance, indicating that more work is needed in standardisation and feature selection.
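The standardisation step mentioned above, mapping each station's record to zero mean and unit variance in log space, can be sketched as follows. This is a minimal illustration only: the function name, the choice of base-10 logarithm and the sample amplitudes are assumptions, and the authors' actual pipeline is not described in the abstract.

```python
import math
import statistics


def standardise_log(values):
    """Z-score a positive-valued record after taking log10 of each value."""
    logs = [math.log10(v) for v in values]
    mean = statistics.fmean(logs)
    std = statistics.pstdev(logs)
    return [(x - mean) / std for x in logs]


# Hypothetical tremor amplitudes from two stations with different gain.
station_a = [120.0, 300.0, 95.0, 1500.0, 410.0]
station_b = [1.2, 3.0, 0.95, 15.0, 4.1]

z_a = standardise_log(station_a)
z_b = standardise_log(station_b)
# After standardisation both records are on a common, unitless scale.
```

Because the two hypothetical records differ only by a constant scale factor, their standardised values coincide, which is the property that makes records from different stations directly comparable.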

Machine Learning Application to CO2 Foam Rheology

10.2118/208016-ms ◽ 2021

Author(s): Javad Iskandarov ◽ George Fanourgakis ◽ Waleed Alameri ◽ George Froudakis ◽ Georgios Karanikolos

Keyword(s): Machine Learning ◽ Oil Recovery ◽ Experimental Studies ◽ Training Data ◽ Computational Time ◽ Gradient Boosting ◽ Operational Conditions ◽ Co2 Foam ◽ Modelling Techniques ◽ Foam Rheology

Conventional foam modelling techniques require tuning of too many parameters and long computational time in order to provide accurate predictions. Therefore, there is a need for alternative methodologies for the efficient and reliable prediction of the foams’ performance. Foams are susceptible to various operational conditions and reservoir parameters. This research aims to apply machine learning (ML) algorithms to experimental data in order to correlate important affecting parameters to foam rheology. In this way, optimum operational conditions for CO2 foam enhanced oil recovery (EOR) can be determined. In order to achieve that, five different ML algorithms were applied to experimental rheology data from various experimental studies. It was concluded that the Gradient Boosting (GB) algorithm could successfully fit the training data and give the most accurate predictions for unknown cases.

Machine Learning Techniques for Network Intrusion Detection

Dynamic and Advanced Data Mining for Progressing Technological Development ◽ 10.4018/978-1-60566-908-3.ch012 ◽ 2010 ◽ pp. 273-299 ◽ Cited By ~ 1

Author(s): Tich Phuoc Tran ◽ Pohsiang Tsai ◽ Tony Jan ◽ Xiangjian He

Keyword(s): Machine Learning ◽ Network Security ◽ Intrusion Detection ◽ Computer Systems ◽ Computational Cost ◽ Training Data ◽ Machine Learning Techniques ◽ Complex Nature ◽ Processing Power ◽ Linear Relationships

Most of the currently available network security techniques are not able to cope with the dynamic and increasingly complex nature of cyber attacks on distributed computer systems. Therefore, an automated and adaptive defensive tool is imperative for computer networks. Alongside existing prevention techniques such as encryption and firewalls, the Intrusion Detection System (IDS) has established itself as an emerging technology that is able to detect unauthorized access and abuse of computer systems by both internal users and external offenders. Most of the novel approaches in this field have adopted Artificial Intelligence (AI) technologies such as Artificial Neural Networks (ANN) to improve the performance as well as the robustness of IDS. The true power and advantages of ANN lie in its ability to represent both linear and non-linear relationships and to learn these relationships directly from the data being modeled. However, ANN is computationally expensive due to its demanding processing power, and this leads to an overfitting problem, i.e. the network is unable to extrapolate accurately once the input is outside of the training data range. These limitations burden IDS with a low detection rate, a high false alarm rate and excessive computation cost. This chapter proposes a novel Machine Learning (ML) algorithm to alleviate those difficulties of existing AI techniques in the area of computer network security. The Intrusion Detection dataset provided by Knowledge Discovery and Data Mining (KDD-99) is used as a benchmark to compare our model with other existing techniques. Extensive empirical analysis suggests that the proposed method outperforms other state-of-the-art learning algorithms in terms of learning bias, generalization variance and computational cost. It is also reported to significantly improve the overall detection capability for difficult-to-detect novel attacks which are unseen or occur irregularly in the training phase.

Machine Learning Models of Survival Prediction in Trauma Patients

Journal of Clinical Medicine ◽ 10.3390/jcm8060799 ◽ 2019 ◽ Vol 8 (6) ◽ pp. 799 ◽ Cited By ~ 7

Author(s): Cheng-Shyuan Rau ◽ Shao-Chun Wu ◽ Jung-Fang Chuang ◽ Chun-Ying Huang ◽ Hang-Tsung Liu ◽ ...

Keyword(s): Neural Network ◽ Machine Learning ◽ Predictive Performance ◽ Original Data ◽ High Accuracy ◽ Validation Dataset ◽ Survival Prediction ◽ Trauma Patients ◽ Data Set ◽ Test Dataset

Background: We aimed to build a model using machine learning for the prediction of survival in trauma patients and compared these model predictions to those predicted by the most commonly used algorithm, the Trauma and Injury Severity Score (TRISS). Methods: Enrolled hospitalized trauma patients from 2009 to 2016 were divided into a training dataset (70% of the original data set) for generation of a plausible model under supervised classification, and a test dataset (30% of the original data set) to test the performance of the model. The training and test datasets comprised 13,208 (12,871 survival and 337 mortality) and 5603 (5473 survival and 130 mortality) patients, respectively. With the provision of additional information such as pre-existing comorbidity status or laboratory data, logistic regression (LR), support vector machine (SVM), and neural network (NN) (with the Stuttgart Neural Network Simulator (RSNNS)) were used to build models of survival prediction and compared to the predictive performance of TRISS. Predictive performance was evaluated by accuracy, sensitivity, and specificity, as well as by area under the curve (AUC) measures of receiver operating characteristic curves. Results: In the validation dataset, NN and the TRISS presented the highest score (82.0%) for balanced accuracy, followed by SVM (75.2%) and LR (71.8%) models. In the test dataset, NN had the highest balanced accuracy (75.1%), followed by the TRISS (70.2%), SVM (70.6%), and LR (68.9%) models. All four models (LR, SVM, NN, and TRISS) exhibited a high accuracy of more than 97.5% and a sensitivity of more than 98.6%. However, NN exhibited the highest specificity (51.5%), followed by the TRISS (41.5%), SVM (40.8%), and LR (38.5%) models. Conclusions: These four models (LR, SVM, NN, and TRISS) exhibited a similar high accuracy and sensitivity in predicting the survival of the trauma patients. In the test dataset, the NN model had the highest balanced accuracy and predictive specificity.
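The gap between the very high accuracy and the much lower balanced accuracy reported above follows from class imbalance, which a short Python sketch makes concrete. It assumes that survival is the positive class and, for illustration only, takes the sensitivity to be exactly the 98.6% lower bound quoted in the abstract; the function names are hypothetical.

```python
def balanced_accuracy(sensitivity, specificity):
    """Mean of the true-positive rate and the true-negative rate."""
    return (sensitivity + specificity) / 2


def accuracy(sensitivity, specificity, n_positive, n_negative):
    """Overall accuracy implied by the two rates and the class counts."""
    correct = sensitivity * n_positive + specificity * n_negative
    return correct / (n_positive + n_negative)


# Test dataset from the abstract: 5473 survivors and 130 deaths.
# Specificity of the NN model is 51.5%; the sensitivity is an assumed value.
sens, spec = 0.986, 0.515
print(f"balanced accuracy: {balanced_accuracy(sens, spec):.3f}")
print(f"accuracy:          {accuracy(sens, spec, 5473, 130):.3f}")
```

With survivors outnumbering deaths by roughly 42 to 1, overall accuracy is dominated by sensitivity, whereas balanced accuracy weights both classes equally and so exposes the weak specificity.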

Latent Feature Representations for Human Gene Expression Data Improve Phenotypic Predictions

10.1101/2020.10.15.340802 ◽ 2020

Author(s): Yannis Pantazis ◽ Christos Tselas ◽ Kleanthi Lakiotaki ◽ Vincenzo Lagani ◽ Ioannis Tsamardinos

Keyword(s): Principal Component ◽ Predictive Performance ◽ Original Data ◽ Relevant Information ◽ Computational Time ◽ Additive Interaction ◽ Human Transcriptome ◽ Feature Spaces ◽ Reconstruction Performance ◽ Low Dimensional

High-throughput technologies such as microarrays and RNA-sequencing (RNA-seq) make it possible to precisely quantify transcriptomic profiles, generating datasets that are inevitably high-dimensional. In this work, we investigate whether the whole human transcriptome can be represented in a compressed, low-dimensional latent space without losing relevant information. We thus constructed low-dimensional latent feature spaces of the human genome by utilizing three dimensionality reduction approaches and a diverse set of curated datasets. We applied standard Principal Component Analysis (PCA), kernel PCA and Autoencoder Neural Networks on 1360 datasets from four different measurement technologies. The latent feature spaces are tested for their ability to (a) reconstruct the original data and (b) improve predictive performance on validation datasets not used during the creation of the feature space. While linear techniques show better reconstruction performance, nonlinear approaches, particularly neural-based models, seem to be able to capture non-additive interaction effects and thus enjoy stronger predictive capabilities. Our results show that low-dimensional representations of the human transcriptome can be achieved by integrating hundreds of datasets, despite the limited sample size of each dataset and the biological and technological heterogeneity across studies. The created space is two to three orders of magnitude smaller than the raw data, capturing a large portion of the original data variability and ultimately reducing computational time for downstream analyses.

Speeding Up Discovery of Auxetic Zeolite Frameworks by Machine Learning

10.26434/chemrxiv.11796150 ◽ 2020

Author(s): Romain Gaillac ◽ Siwar Chibani ◽ François-Xavier Coudert

Keyword(s): Machine Learning ◽ Mechanical Properties ◽ Dft Calculations ◽ Force Field ◽ Computational Cost ◽ Training Data ◽ Elastic Response ◽ Data Set ◽ Classical Level

The characterization of the mechanical properties of crystalline materials is nowadays considered a routine computational task in DFT calculations. However, its high computational cost still prevents it from being used in high-throughput screening methodologies, where a cheaper estimate of the elastic properties of a material is required. In this work, we have investigated the accuracy of force field calculations for the prediction of mechanical properties, and in particular for the characterization of the directional Poisson’s ratio. We analyze the behavior of about 600,000 hypothetical zeolitic structures at the classical level (a scale three orders of magnitude larger than previous studies), to highlight generic trends between mechanical properties and energetic stability. By comparing these results with DFT calculations on 991 zeolitic frameworks, we highlight the limitations of force field predictions, in particular for predicting auxeticity. We then used this reference DFT data as a training set for a machine learning algorithm, showing that it offers a way to build fast and reliable predictive models for anisotropic properties. The accuracies obtained are, in particular, much better than the current “cheap” approach for screening, which is the use of force fields. These results are a significant improvement over the previous work, due to the more difficult nature of the properties studied, namely the anisotropic elastic response. It is also the first time such a large training data set is used for zeolitic materials.
