Subset selection of training data for machine learning: a situational awareness system case study

Author(s):  
M. McKenzie ◽  
S. C. Wong
2020 ◽  
Vol 30 (Supplement_5) ◽  
Author(s):  
R Haneef ◽  
S Fuentes ◽  
R Hrzic ◽  
S Fosse-Edorh ◽  
S Kab ◽  
...  

Abstract Background Artificial intelligence is increasingly used to estimate and predict health outcomes from large data sets. The main objectives were to develop two algorithms using machine learning techniques to identify new cases of diabetes (case study I) and to classify type 1 and type 2 diabetes (case study II) in France. Methods We selected the training data set from a cohort study linked with the French national health database (SNDS). Two final datasets were used, one for each objective. A supervised machine learning method comprising the following eight steps was developed: selection of the data set, case definition, coding and standardization of variables, splitting of the data into training and test sets, variable selection, training, validation, and selection of the model. We planned to apply the trained models to the SNDS to estimate the incidence of diabetes and the prevalence of type 1/type 2 diabetes. Results For case study I, 23 of 3,468 SNDS variables were selected, and for case study II, 14 of 3,481, based on an optimal balance of explained variance using the ReliefExp algorithm. We trained four models using different classification algorithms on the training data set. The Linear Discriminant Analysis model performed best in both case studies. The models were assessed on the test datasets and achieved a specificity of 67% and a sensitivity of 62% in case study I, and a specificity of 97% and a sensitivity of 100% in case study II. The case study II model was applied to the SNDS and estimated the 2016 prevalence of type 1 diabetes in France at 0.3% and of type 2 diabetes at 4.4%. The case study I model was not applied to the SNDS. Conclusions The case study II model for estimating the prevalence of type 1/type 2 diabetes performs well and will be used in routine surveillance. The case study I model for identifying new cases of diabetes performed poorly, owing to missing information on determinants of diabetes, and will need to be improved in further research.
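
A minimal sketch of this kind of pipeline, using scikit-learn on synthetic data: a generic mutual-information ranking stands in for the ReliefExp selection step, and the variable counts and data are illustrative only, not the SNDS features used in the study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the claims-based feature matrix (200 candidate variables
# here, purely for speed; the study screened several thousand).
X, y = make_classification(n_samples=2000, n_features=200, n_informative=25, random_state=0)

# Split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

# Variable selection: a mutual-information ranking stands in for the ReliefExp-style
# ranking used in the study; keep the top 23 variables.
selector = SelectKBest(mutual_info_classif, k=23).fit(X_train, y_train)
X_train_sel, X_test_sel = selector.transform(X_train), selector.transform(X_test)

# Train and assess a Linear Discriminant Analysis model on the selected variables.
model = LinearDiscriminantAnalysis().fit(X_train_sel, y_train)
tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test_sel)).ravel()
print(f"sensitivity={tp / (tp + fn):.2f}  specificity={tn / (tn + fp):.2f}")
```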


2021 ◽  
Author(s):  
Octavian Dumitru ◽  
Gottfried Schwarz ◽  
Mihai Datcu ◽  
Dongyang Ao ◽  
Zhongling Huang ◽  
...  

During the last years, much progress has been made with machine learning algorithms. Typical application fields of machine learning include many technical and commercial applications as well as Earth science analyses, where indirect and distorted detector data most often have to be converted to well-calibrated scientific data, a prerequisite for a correct understanding of the desired physical quantities and their relationships.

However, the provision of sufficient calibrated data is not enough for the testing, training, and routine processing of most machine learning applications. In principle, one also needs a clear strategy for the selection of necessary and useful training data and an easily understandable quality control of the finally desired parameters.

At first glance, one could guess that this problem can be solved by a careful selection of representative test data covering many typical cases as well as some counterexamples, which are then used to train the internal parameters of a machine learning application. At second glance, however, many researchers have found that simply stacking up plain examples is not the best choice for many scientific applications.

To obtain improved machine learning results, we concentrated on the analysis of satellite images depicting the Earth's surface under various conditions, such as the selected instrument type, spectral bands, and spatial resolution. In our case, such data are routinely provided by the freely accessible European Sentinel satellite products (e.g., Sentinel-1 and Sentinel-2). Our basic work then included investigations of how additional processing steps, linked with the selected training data, can provide better machine learning results.

To this end, we analysed and compared three different approaches to machine learning strategies for the joint selection and processing of training data for our Earth observation images:

- One can optimize the training data selection by adapting it to the specific instrument, target, and application characteristics [1].
- As an alternative, one can dynamically generate new training parameters with Generative Adversarial Networks, comparable to the role of a sparring partner in boxing [2].
- One can also use a hybrid semi-supervised approach for Synthetic Aperture Radar images with limited labelled data. The method is split into polarimetric scattering classification, topic modelling for scattering labels, unsupervised constraint learning, and supervised label prediction with constraints [3].

We applied these strategies in the ExtremeEarth sea-ice monitoring project (http://earthanalytics.eu/). As a result, we can demonstrate for which application cases these three strategies provide a promising alternative to a simple conventional selection of available training data.

[1] C.O. Dumitru et al., "Understanding Satellite Images: A Data Mining Module for Sentinel Images", Big Earth Data, 2020, 4(4), pp. 367-408.

[2] D. Ao et al., "Dialectical GAN for SAR Image Translation: From Sentinel-1 to TerraSAR-X", Remote Sensing, 2018, 10(10), pp. 1-23.

[3] Z. Huang et al., "HDEC-TFA: An Unsupervised Learning Approach for Discovering Physical Scattering Properties of Single-Polarized SAR Images", IEEE Transactions on Geoscience and Remote Sensing, 2020, pp. 1-18.
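
As a purely illustrative sketch of training-data selection (not the adaptive, GAN-based, or semi-supervised methods of [1]-[3]), the following snippet picks a diverse subset of image patches by clustering hypothetical per-patch feature vectors and keeping the patch nearest each cluster centre, instead of simply stacking up all available examples.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(5000, 64))  # hypothetical per-patch feature vectors

# Cluster the candidate pool and keep the patch closest to each centroid, so the
# selected subset spans the feature space instead of repeating the easy cases.
k = 200
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
selected = [
    int(np.argmin(np.linalg.norm(features - c, axis=1))) for c in km.cluster_centers_
]
print(f"selected {len(set(selected))} diverse training patches out of {len(features)}")
```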


2017 ◽  
Author(s):  
Reuben Binns ◽  
Michael Veale ◽  
Max Van Kleek ◽  
Nigel Shadbolt

The internet has become a central medium through which 'networked publics' express their opinions and engage in debate. Offensive comments and personal attacks can inhibit participation in these spaces. Automated content moderation aims to overcome this problem using machine learning classifiers trained on large corpora of texts manually annotated for offence. While such systems could help encourage more civil debate, they must navigate inherently normatively contestable boundaries, and are subject to the idiosyncratic norms of the human raters who provide the training data. An important objective for platforms implementing such measures might be to ensure that they are not unduly biased towards or against particular norms of offence. This paper provides some exploratory methods by which the normative biases of algorithmic content moderation systems can be measured, by way of a case study using an existing dataset of comments labelled for offence. We train classifiers on comments labelled by different demographic subsets (men and women) to understand how differences in conceptions of offence between these groups might affect the performance of the resulting models on various test sets. We conclude by discussing some of the ethical choices facing the implementers of algorithmic moderation systems, given various desired levels of diversity of viewpoints amongst discussion participants.
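
A minimal sketch of the cross-group evaluation design described above, with invented toy comments standing in for the labelled corpus: one classifier is trained per annotator group, and each model is scored on every group's held-out test set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical per-group data: (comment text, offensive label as judged by that group).
groups = {
    "men": (["you idiot", "nice point", "shut up", "thanks for sharing"] * 50,
            [1, 0, 1, 0] * 50),
    "women": (["you idiot", "nice point", "get lost", "well argued"] * 50,
              [1, 0, 1, 0] * 50),
}

splits = {g: train_test_split(X, y, test_size=0.3, random_state=0) for g, (X, y) in groups.items()}

for train_group, (X_tr, _, y_tr, _) in splits.items():
    # Train on one group's labels, then evaluate on each group's test set.
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000)).fit(X_tr, y_tr)
    for test_group, (_, X_te, _, y_te) in splits.items():
        acc = accuracy_score(y_te, model.predict(X_te))
        print(f"trained on {train_group}, tested on {test_group}: accuracy={acc:.2f}")
```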


2020 ◽  
Vol 44 (7-8) ◽  
pp. 499-514
Author(s):  
Yi Zheng ◽  
Hyunjung Cheon ◽  
Charles M. Katz

This study explores advanced techniques in machine learning to develop a short tree-based adaptive classification test based on an existing lengthy instrument. A case study was carried out for an assessment of risk for juvenile delinquency. Two distinctive features of this case are that (a) the items in the original instrument measure a large number of distinct constructs, and (b) the target outcomes are of low prevalence, which results in imbalanced training data. Due to the high dimensionality of the items, traditional item response theory (IRT)-based adaptive testing approaches may not work well, whereas decision trees, developed in the machine learning discipline, offer a promising alternative for adaptive tests. A cross-validation study was carried out to compare eight tree-based adaptive test constructions with five benchmark methods using data from a sample of 3,975 subjects. The findings reveal that the best-performing tree-based adaptive tests yielded better classification accuracy than the benchmark method of IRT scoring with optimal cutpoints, and comparable or better classification accuracy than the best benchmark method, random forest with balanced sampling. The competitive classification accuracy of the tree-based adaptive tests also comes with an over 30-fold reduction in the length of the instrument, administering only 3 to 6 items to any individual. This study suggests that tree-based adaptive tests have enormous potential when used to shorten instruments that measure a large variety of constructs.
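
A minimal sketch of the tree-based idea on synthetic data: in a shallow decision tree, each split corresponds to administering one item, so max_depth caps the number of items any respondent sees, and class weighting stands in for the balanced sampling used against the low-prevalence outcome. This is only an illustration, not the study's tree construction.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic items and a rare outcome (about 5% prevalence) stand in for the instrument data.
X, y = make_classification(n_samples=4000, n_features=50, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# max_depth=6 means no respondent is routed through more than six items.
tree = DecisionTreeClassifier(max_depth=6, class_weight="balanced", random_state=0)
tree.fit(X_train, y_train)

print(balanced_accuracy_score(y_test, tree.predict(X_test)))
print(export_text(tree, max_depth=2))  # each level corresponds to one adaptively chosen item
```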


2021 ◽  
Author(s):  
Thomas Stanley ◽  
Dalia Kirschbaum ◽  
Robert Emberson

The Landslide Hazard Assessment for Situational Awareness (LHASA) system gives a global view of landslide hazard in nearly real time. It is currently being upgraded from version 1 to version 2, which brings improvements along several dimensions, including the incorporation of new predictors, machine learning, and new event-based landslide inventories. As a result, LHASA version 2 substantially improves on the prior performance and introduces a probabilistic element to the global landslide nowcast.

Data from the Soil Moisture Active Passive (SMAP) satellite have been assimilated into a globally consistent data product with a latency of less than 3 days, known as SMAP Level 4. In LHASA, these data represent the antecedent conditions prior to landslide-triggering rainfall; in some cases, soil moisture may have accumulated over a period of many months. The model behind SMAP Level 4 also estimates the amount of snow on the ground, an important factor in some landslide events, and LHASA incorporates this information as an antecedent condition that modulates the response to rainfall. Slope, lithology, and active faults are also used as predictor variables; these factors can strongly influence where landslides initiate. LHASA relies on precipitation estimates from the Global Precipitation Measurement mission to identify the locations where landslides are most probable. The low latency and consistent global coverage of these data make them ideal for real-time applications at continental to global scales. LHASA relies primarily on rainfall from the last 24 hours, rescaled by the local 99th percentile rainfall, to spot hazardous sites. However, the multi-day latency of SMAP requires a 2-day antecedent rainfall variable to represent the rain that accumulates between the antecedent soil moisture estimate and the current rainfall.

LHASA merges these predictors with XGBoost, a commonly used machine learning tool, relying on historical landslide inventories to learn the relationship between landslide occurrence and the various risk factors. The resulting model relies heavily on current daily rainfall, but other factors also play an important role. LHASA outputs the probability of landslide occurrence on a grid of roughly one kilometer over all continents from 60° North to 60° South latitude. Evaluation over the period 2019-2020 shows that LHASA version 2 doubles the accuracy of the global landslide nowcast without increasing the global false alarm rate.

LHASA also identifies the areas where human exposure to landslide hazard is most intense. Landslide hazard is divided into four levels: minimal, low, moderate, and high. The number of persons and the length of major roads (primary and secondary roads) within each of these areas are then calculated for every second-level administrative district (county). These results can be viewed through a web portal hosted at the Goddard Space Flight Center, and users can download daily hazard and exposure data.

LHASA version 2 uses machine learning and satellite data to identify areas of probable landslide hazard within hours of heavy rainfall. Its global maps are significantly more accurate, and it now includes rapid estimates of exposed populations and infrastructure. In addition, a forecast mode will be implemented soon.
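
A minimal sketch of the kind of gridded XGBoost classification described above, on synthetic data: the feature names mirror the predictors listed in the abstract (soil moisture, snow, slope, lithology, fault distance, rescaled daily rainfall, 2-day antecedent rainfall), but the values, label rule, and hyperparameters are illustrative assumptions, not the operational LHASA inputs.

```python
import numpy as np
import pandas as pd
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
n = 20000
X = pd.DataFrame({
    "soil_moisture": rng.uniform(0, 1, n),          # SMAP Level 4 antecedent moisture
    "snow_mass": rng.uniform(0, 50, n),              # antecedent snow
    "slope": rng.uniform(0, 60, n),
    "lithology_class": rng.integers(0, 8, n),
    "distance_to_fault_km": rng.uniform(0, 100, n),
    "rain_24h_rescaled": rng.uniform(0, 2, n),       # daily rain / local 99th percentile
    "rain_2day_antecedent": rng.uniform(0, 200, n),
})
# Synthetic label: landslides more likely where rescaled rain, moisture, and slope are high.
p = 1 / (1 + np.exp(-(3 * X.rain_24h_rescaled + 2 * X.soil_moisture + 0.05 * X.slope - 6)))
y = rng.binomial(1, p)

model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X, y)
print(model.predict_proba(X.iloc[:5])[:, 1])  # per-cell landslide probability
```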


Author(s):  
Carlos Sáez ◽  
Nekane Romero ◽  
J Alberto Conejero ◽  
Juan M García-Gómez

Abstract Objective The lack of representative coronavirus disease 2019 (COVID-19) data is a bottleneck for reliable and generalizable machine learning. Data sharing is insufficient without data quality, in which source variability plays an important role. We showcase and discuss potential biases from data source variability for COVID-19 machine learning. Materials and Methods We used the publicly available nCov2019 dataset, which includes patient-level data from several countries. We aimed to discover and classify severity subgroups using symptoms and comorbidities. Results Cases from the 2 countries with the highest prevalence were divided into separate subgroups with distinct severity manifestations. This variability can reduce the representativeness of training data with respect to the models' target populations and increase model complexity, at risk of overfitting. Conclusions Data source variability is a potential contributor to bias in distributed research networks. We call for systematic assessment and reporting of data source variability and data quality in COVID-19 data sharing, as key information for reliable and generalizable machine learning.
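
A minimal sketch of how source variability can surface during subgroup discovery, on synthetic data rather than nCov2019: patients from two hypothetical sources are clustered on the same symptom/comorbidity features, and the cross-tabulation shows whether the discovered subgroups track the data source rather than clinical severity.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
frames = []
for country, shift in [("A", 0.0), ("B", 1.5)]:  # two sources with shifted feature profiles
    X = rng.normal(loc=shift, size=(500, 6))      # synthetic symptom/comorbidity indicators
    severity = rng.binomial(1, 0.2 + 0.1 * (country == "B"), size=500)
    df = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(6)])
    df["severe"], df["source"] = severity, country
    frames.append(df)
data = pd.concat(frames, ignore_index=True)

features = [c for c in data.columns if c.startswith("feat_")]
data["subgroup"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data[features])

# If subgroups align with sources rather than with severity, pooled training data may
# not represent the target population of the deployed model.
print(pd.crosstab(data["source"], data["subgroup"]))
print(data.groupby(["source", "subgroup"])["severe"].mean())
```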


2020 ◽  
pp. 1-12
Author(s):  
Yu Guangxu

The 21st century is an era of rapid Internet development, and Internet technology is now used in virtually every field. As networks have grown, the importance of network information security has grown with them, and traditional security techniques have struggled to protect network information. We therefore study the application of machine learning feature extraction to situational awareness systems, and propose a machine-learning-based feature selection method to extract situational features. Drawing on the security status of network information, the current state of research at home and abroad, and trends in Internet development, this paper evaluates the practical application of machine learning feature extraction in a specific awareness system. The accuracy and timeliness of situational awareness detection are seriously degraded by the high dimensionality, noise, and redundant features of massive network traffic data, so further study of network intrusion detection built on machine learning is of considerable value.
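
A minimal sketch of machine-learning-based feature selection for high-dimensional, noisy traffic data, using scikit-learn on synthetic features (not the paper's method or data): a random-forest importance ranking keeps a small subset of situational features before a downstream detector is trained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for high-dimensional network traffic features with few informative ones.
X, y = make_classification(n_samples=5000, n_features=300, n_informative=20, random_state=0)

# Rank candidate situational features by importance learned from the data.
ranker = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top = np.argsort(ranker.feature_importances_)[::-1][:20]

# A downstream detector trained on the reduced feature set is faster and less noise-prone.
score_full = cross_val_score(LogisticRegression(max_iter=2000), X, y, cv=3).mean()
score_sel = cross_val_score(LogisticRegression(max_iter=2000), X[:, top], y, cv=3).mean()
print(f"full features: {score_full:.3f}   selected 20 features: {score_sel:.3f}")
```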


Geosciences ◽  
2022 ◽  
Vol 12 (1) ◽  
pp. 27
Author(s):  
Talha Siddique ◽  
Md Mahmud ◽  
Amy Keesee ◽  
Chigomezyo Ngwira ◽  
Hyunju Connor

With the availability of data and computational technologies in the modern world, machine learning (ML) has emerged as a preferred methodology for data analysis and prediction. While ML holds great promise, the results from such models are not fully reliable because of the challenges introduced by uncertainty. An ML model generates an optimal solution based on its training data; however, if the uncertainty in the data and the model parameters is not considered, such optimal solutions carry a high risk of failure in real-world deployment. This paper surveys the different approaches used in ML to quantify uncertainty. It also illustrates the implications of quantifying uncertainty through two case studies in space physics. The first case study classifies auroral images into predefined labels. In the second case study, the horizontal component of the perturbed magnetic field measured at the Earth's surface was predicted for the study of Geomagnetically Induced Currents (GICs) by training the model on time series data. In both cases, a Bayesian Neural Network (BNN) was trained to generate predictions along with epistemic and aleatoric uncertainties. Finally, the pros and cons of Gaussian Process Regression (GPR) models and Bayesian Deep Learning (DL) are weighed, and recommendations are provided on which models merit further exploration for space weather prediction.
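
A minimal sketch of the Gaussian Process Regression side of this comparison, using scikit-learn on a toy sine-wave series standing in for the magnetic-field data: the model returns a predictive mean and a standard deviation per point, and the standard deviation grows outside the training range, flagging less reliable forecasts.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(60, 1))
y_train = np.sin(X_train).ravel() + rng.normal(scale=0.1, size=60)

# RBF captures the smooth signal; WhiteKernel absorbs observation noise (the aleatoric part).
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gpr = GaussianProcessRegressor(kernel=kernel, random_state=0).fit(X_train, y_train)

X_test = np.linspace(0, 12, 60).reshape(-1, 1)
mean, std = gpr.predict(X_test, return_std=True)
# Uncertainty is larger beyond the training range (roughly x > 10).
print(f"mean std inside training range: {std[:50].mean():.3f}   beyond it: {std[50:].mean():.3f}")
```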


Information ◽  
2020 ◽  
Vol 11 (3) ◽  
pp. 154 ◽  
Author(s):  
Ricardo Resende de Mendonça ◽  
Daniel Felix de Brito ◽  
Ferrucio de Franco Rosa ◽  
Júlio Cesar dos Reis ◽  
Rodrigo Bonacin

Criminals use online social networks for various activities, including communication, planning, and execution of criminal acts. They often employ ciphered posts using slang expressions, which are restricted to specific groups. Although the literature shows advances in the analysis of natural-language posts, such as hate speech, threats, and, most notably, sentiment analysis, research enabling intention analysis of posts that use slang expressions is still underexplored. We propose a framework and construct software prototypes for the selection of social network posts with criminal slang expressions and the automatic classification of these posts into illocutionary classes. The developed framework explores computational ontologies and machine learning (ML) techniques. Our Ontology of Criminal Expressions represents crime concepts in a formal and flexible model and associates them with criminal slang expressions; this ontology is used to select suspicious posts and decipher them. In our solution, the criminal intention in written posts is classified automatically using models learned from existing posts. We carried out a case study to evaluate the framework with 8,835,290 tweets. The results show its viability by demonstrating the benefits of deciphering posts and the effectiveness of ML-based detection of users' intention in written criminal posts.
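
A minimal sketch of the two stages described, with invented posts, slang terms, and labels: a plain dictionary stands in for the Ontology of Criminal Expressions to select posts containing known slang, and a simple text classifier then assigns illocutionary classes to the selected posts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

slang_terms = {"merch", "drop spot"}  # hypothetical slang entries from the ontology

# Invented toy posts with illocutionary labels (repeated so the model has data to fit).
posts = [
    ("i will move the merch tonight", "commissive"),
    ("take it to the drop spot now", "directive"),
    ("the merch arrived yesterday", "assertive"),
    ("bring the merch to the drop spot", "directive"),
] * 25

# Stage 1: keep only posts that mention at least one known slang expression.
selected = [(text, label) for text, label in posts if any(term in text for term in slang_terms)]

# Stage 2: train an illocutionary-class classifier on the selected posts.
texts, classes = zip(*selected)
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000)).fit(texts, classes)
print(clf.predict(["move the merch to the drop spot"]))
```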

