Deal gently with the bird you are trying to catch: small scale CD control with machine learning

Author(s): Christian Bürgel ◽ Martin Sczyrba ◽ Clemens S. Utzny

2021 ◽ Vol 11 (2) ◽ pp. 472
Author(s): Hyeongmin Cho ◽ Sangkyun Lee

Machine learning has proven effective in various application areas, such as object and speech recognition on mobile systems. Since the availability of large training data is critical to machine-learning success, many datasets are disclosed and published online. From a data consumer's or manager's point of view, measuring data quality is an important first step in the learning process: we need to determine which datasets to use, update, and maintain. However, few practical ways to measure data quality are available today, especially for large-scale high-dimensional data such as images and videos. This paper proposes two data quality measures that compute class separability and in-class variability, two important aspects of data quality, for a given dataset. Classical data quality measures tend to focus only on class separability; we suggest that in-class variability is another important data quality factor. We provide efficient algorithms to compute our quality measures based on random projections and bootstrapping, with statistical benefits on large-scale high-dimensional data. In experiments, we show that our measures are compatible with classical measures on small-scale data and can be computed much more efficiently on large-scale high-dimensional datasets.
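The two measures can be made concrete with a small NumPy sketch. This is our own simplified variant (function names and constants are ours, and binary 0/1 labels are assumed), not the paper's exact algorithm: class separability as a Fisher-style ratio averaged over random 1-D projections, and in-class variability as a bootstrapped mean pairwise distance.

```python
import numpy as np

rng = np.random.default_rng(0)

def separability(X, y, n_proj=50):
    """Average, over random 1-D projections, of a Fisher-style ratio:
    squared distance between class means relative to within-class
    variance. Assumes binary labels 0/1."""
    scores = []
    for _ in range(n_proj):
        w = rng.normal(size=X.shape[1])      # random projection direction
        z = X @ w                            # project to 1-D
        m0, m1 = z[y == 0].mean(), z[y == 1].mean()
        v0, v1 = z[y == 0].var(), z[y == 1].var()
        scores.append((m0 - m1) ** 2 / (v0 + v1 + 1e-12))
    return float(np.mean(scores))

def in_class_variability(X, y, n_boot=20, m=50):
    """Mean pairwise distance within each class, estimated on small
    bootstrap subsamples to keep the cost low on large data."""
    vals = []
    for c in np.unique(y):
        Xc = X[y == c]
        for _ in range(n_boot):
            S = Xc[rng.integers(0, len(Xc), size=min(m, len(Xc)))]
            d = np.linalg.norm(S[:, None] - S[None, :], axis=-1)
            vals.append(d.mean())
    return float(np.mean(vals))
```

On well-separated classes the first score is large; shuffling the labels collapses it toward zero, which is one quick sanity check of the measure.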


2021 ◽
Author(s): Aleksandar Kovačević ◽ Jelena Slivka ◽ Dragan Vidaković ◽ Katarina-Glorija Grujić ◽ Nikola Luburić ◽ ...

Code smells are structures in code that often have a negative impact on its quality. Manually detecting code smells is challenging, so researchers have proposed many automatic code smell detectors. Most studies propose detectors based on code metrics and heuristics. However, these studies have several limitations, including evaluation on small-scale case studies and inconsistent experimental settings. Furthermore, heuristic-based detectors suffer from limitations that hinder their adoption in practice. Thus, researchers have recently started experimenting with machine learning (ML) based code smell detection.

This paper compares the performance of multiple ML-based code smell detection models against multiple traditionally employed metric-based heuristics for the detection of God Class and Long Method code smells. We evaluate the effectiveness of different source code representations for machine learning: traditionally used code metrics and code embeddings (code2vec, code2seq, and CuBERT).

We perform our experiments on the large-scale, manually labeled MLCQ dataset. We consider the binary classification problem: we classify code samples as smelly or non-smelly and use the F1-measure of the minority (smell) class as the measure of performance. In our experiments, the ML classifier trained on CuBERT source code embeddings achieved the best performance for both God Class (F-measure of 0.53) and Long Method detection (F-measure of 0.75). With the help of a domain expert, we perform an error analysis to discuss the advantages of the CuBERT approach.

To the best of our knowledge, this study is the first to evaluate the effectiveness of pre-trained neural source code embeddings for code smell detection. A secondary contribution is the systematic evaluation of multiple heuristic-based approaches on the same large-scale, manually labeled MLCQ dataset.
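The evaluation metric is easy to state precisely. A minimal sketch of the minority-class F1 described above (the function name is ours; the study itself does not prescribe an implementation):

```python
import numpy as np

def minority_f1(y_true, y_pred, smell_label=1):
    """F1-measure of the (minority) smell class: harmonic mean of
    precision and recall computed with the smell class as positive."""
    tp = np.sum((y_pred == smell_label) & (y_true == smell_label))
    fp = np.sum((y_pred == smell_label) & (y_true != smell_label))
    fn = np.sum((y_pred != smell_label) & (y_true == smell_label))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Using the smell class as the positive class keeps the metric honest on imbalanced data, where overall accuracy would reward always predicting "non-smelly".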


2019 ◽ Vol 10 (1) ◽
Author(s): Primož Godec ◽ Matjaž Pančur ◽ Nejc Ilenič ◽ Andrej Čopar ◽ Martin Stražar ◽ ...

Abstract: Analysis of biomedical images requires computational expertise that is uncommon among biomedical scientists. Deep learning approaches for image analysis provide an opportunity to develop user-friendly tools for exploratory data analysis. Here, we use the visual programming toolbox Orange (http://orange.biolab.si) to simplify image analysis by integrating deep-learning embedding, machine learning procedures, and data visualization. Orange supports the construction of data analysis workflows by assembling components for data preprocessing, visualization, and modeling. We equipped Orange with components that use pre-trained deep convolutional networks to profile images with vectors of features. These vectors are used in image clustering and classification in a framework that enables mining of image sets for both novice and experienced users. We demonstrate the utility of the tool in image analysis of progenitor cells in mouse bone healing, identification of developmental competence in mouse oocytes, subcellular protein localization in yeast, and developmental morphology of social amoebae.
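The embedding-based mining that the toolbox enables ultimately reduces to similarity queries over feature vectors. A minimal illustration of that core operation (our own sketch, not Orange's API; the embeddings would come from a pre-trained CNN):

```python
import numpy as np

def nearest_images(embeddings, query_idx, k=3):
    """Given one feature vector per image (e.g., produced by a
    pre-trained deep convolutional network), return the indices of
    the k images most similar to the query by cosine similarity."""
    # L2-normalize rows so the dot product equals cosine similarity
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = E @ E[query_idx]
    order = np.argsort(-sims)                 # most similar first
    return [i for i in order if i != query_idx][:k]
```

Clustering and classification in such a framework operate on the same vectors, so any downstream model sees images through these learned features rather than raw pixels.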


2020 ◽ Vol 9 (4) ◽ pp. 230 ◽
Author(s): Izabela Karsznia ◽ Karolina Sielicka

Effective settlement generalization for small-scale maps is a complex and challenging task. Developing a consistent methodology for generalizing small-scale maps has not gained enough attention, as most research conducted so far has concerned large scales. In the study reported here, we aim to fill this gap by exploring settlement characteristics, termed variables, that can be decisive in settlement selection for small-scale maps. We propose 33 variables, both thematic and topological, which may be of importance in the selection process. To find the essential variables and assess their weights and correlations, we use machine learning (ML) models, specifically decision trees (DT) and decision trees supported by genetic algorithms (DT-GA). With the ML models, we automatically classify settlements as selected or omitted. As a result, in each tested case we achieve an automatic settlement selection that improves on the selection based on official national mapping agency (NMA) guidelines and comes closer to the results obtained in manual map generalization conducted by experienced cartographers.
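How a decision-tree criterion ranks candidate variables can be sketched in a few lines. This is a one-level proxy (ours, far simpler than the DT and DT-GA models used in the study): each variable is scored by the Gini impurity reduction of its best single threshold split on the selected/omitted labels.

```python
import numpy as np

def gini(y):
    """Gini impurity of a binary 0/1 label vector."""
    p = np.mean(y)
    return 2 * p * (1 - p)

def best_variable(X, y):
    """Score each column of X by the impurity reduction of its best
    single-threshold split; return the index of the most decisive
    variable and all scores (a stand-in for DT variable importance)."""
    base = gini(y)
    scores = []
    for j in range(X.shape[1]):
        best = 0.0
        for t in np.unique(X[:, j])[:-1]:     # candidate thresholds
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            w = len(left) / len(y)
            gain = base - (w * gini(left) + (1 - w) * gini(right))
            best = max(best, gain)
        scores.append(best)
    return int(np.argmax(scores)), scores
```

A full decision tree repeats this scoring recursively at every node, which is what allows correlated and weakly informative variables among the 33 to be weighed against each other.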


2020 ◽
Author(s): Tom Rowan ◽ Adrian Butler

In order to enable community groups and other interested parties to evaluate the effects of flood management, water conservation and other hydrological issues, better localised mapping is required. Although some maps are publicly available, many are behind paywalls, especially those with three-dimensional features. In this study, London is used as a test case to evaluate machine learning and rules-based approaches with open-source maps and LiDAR data to create more accurate representations (LOD2) of small-scale areas. Machine learning is particularly well suited to the recognition of local repetitive features such as building roofs and trees, while roads can be identified and mapped best using a faster rules-based approach.

In order to create a useful LOD2 representation, a user interface, processing-rules manipulation and an assumption editor have all been incorporated. Features like randomly assigning sub-terrain features (basements) using Monte-Carlo methods, and artificial sewage representation, enable the user to grow these models from open-source data into useful model inputs. This project is aimed at local-scale hydrological modelling, rainfall-runoff analysis and other local planning applications.

The goal is to provide turn-key data processing for small-scale modelling, which should help advance the installation of SuDS and other water management solutions, as well as having broader uses. The method is designed to enable fast and accurate representations of small-scale features (1 hectare to 1 km²), with larger-scale applications planned for future work. This work forms part of the CAMELLIA project (Community Water Management for a Liveable London) and aims to provide useful tools for local-scale modellers and possibly larger-scale industry and scientific users.
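The Monte-Carlo step of randomly assigning basements can be sketched simply. The function name, the basement probability, and the flat per-building probability model are all our assumptions for illustration; the study does not specify them.

```python
import random

def assign_basements(buildings, p_basement=0.4, seed=42):
    """Illustrative Monte-Carlo assignment: flag each building as
    having a basement with probability p_basement (assumed value).
    A fixed seed keeps the generated model reproducible."""
    rng = random.Random(seed)
    return {b: rng.random() < p_basement for b in buildings}
```

In practice the probability could be conditioned on building age or type; the point is that unknowable sub-terrain features are sampled rather than omitted, so the grown model remains usable as hydrological input.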


2020 ◽
Author(s): Sarah Schönbrodt-Stitt ◽ Paolo Nasta ◽ Nima Ahmadian ◽ Markus Kurtenbach ◽ Christopher Conrad ◽ ...

Mapping near-surface soil moisture (θ) is of tremendous relevance for a broad range of environment-related disciplines and meteorological, ecological, hydrological and agricultural applications. Globally available products offer the opportunity to address θ in large-scale modelling with coarse spatial resolution, such as at the landscape level. However, θ estimation at higher spatial resolution is of vital importance for many small-scale applications. Therefore, we focus our study on a small-scale catchment (MFC2) belonging to the "Alento" hydrological observatory, located in southern Italy (Campania Region). The goal of this study is to develop new machine-learning approaches to estimate high grid-resolution (about 17 m cell size) θ maps, mainly from backscatter measurements retrieved from C-band Synthetic Aperture Radar (SAR) based on Sentinel-1 (S1) images and from gridded terrain attributes. Thus, a workflow comprising a total of 48 SAR-based θ patterns, estimated for 24 satellite overpass dates (revisit time of 6 days) each with ascending and descending orbits, will be presented. To enable the mapping, SAR-based θ data were calibrated with in-situ measurements carried out with a portable device during eight measurement campaigns at times of satellite overpasses (four overpass days in total, with one ascending and one descending overpass per day, in November 2018). After the calibration procedure, data validation was executed from November 10, 2018 until March 28, 2019 using two stationary sensors monitoring θ at high temporal resolution (1-min recording interval).

The specific sensor locations reflected two contrasting field conditions: one bare-soil plot (frequently kept clear, without disturbance of vegetation cover) and one non-bare-soil plot (real-world condition). Point-scale ground observations of θ were compared to pixel-scale (17 m × 17 m) SAR-based θ estimated for those pixels corresponding to the positions of the stationary sensors. Mapping performance was estimated through the root mean squared error (RMSE). For a short-term time series of θ (Nov 2018) integrating 136 in-situ, sensor-based θ (θ_insitu) and 74 gravimetric-based θ (θ_gravimetric) measurements during a total of eight S1 overpasses, mapping performance already proved to be satisfactory, with RMSE = 0.039 m³ m⁻³ and R² = 0.92, and RMSE = 0.041 m³ m⁻³ and R² = 0.91, respectively. First results further reveal that the estimated satellite-based θ patterns respond to the evolution of rainfall. With our workflow and results, we intend to contribute to improved environmental risk assessment by assimilating the results into hydrological models (e.g., HydroGeoSphere), and to support future studies on combined ground-based and SAR-based θ retrieval for forested land (future missions operating at larger wavelengths, e.g., NISAR L-band and Biomass P-band sensors).
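The validation metric used above is standard and worth stating concretely; a minimal sketch (function name ours) of the RMSE between ground observations and SAR-based estimates:

```python
import numpy as np

def rmse(theta_obs, theta_sar):
    """Root mean squared error between in-situ soil moisture
    observations and SAR-based estimates, in the same units
    (here m³ m⁻³), paired by overpass date and pixel."""
    theta_obs = np.asarray(theta_obs, dtype=float)
    theta_sar = np.asarray(theta_sar, dtype=float)
    return float(np.sqrt(np.mean((theta_obs - theta_sar) ** 2)))
```

Because RMSE is in the units of θ itself, a value like 0.039 m³ m⁻³ can be read directly against the physical range of volumetric soil moisture.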


2018 ◽
Author(s): Iason-Zois Gazis ◽ Timm Schoening ◽ Evangelos Alevizos ◽ Jens Greinert

Abstract. In this study, high-resolution bathymetric multibeam and optical image data, both obtained within the Belgian manganese (Mn) nodule mining license area by the autonomous underwater vehicle (AUV) Abyss, were combined in order to create a predictive Random Forests (RF) machine learning model. AUV bathymetry reveals small-scale terrain variations, allowing the calculation of bathymetric derivatives such as slope, curvature, and ruggedness. Optical AUV imagery provides quantitative information regarding the distribution (number and median size) of Mn-nodules. Within the area considered in this study, Mn-nodules show a heterogeneous and spatially clustered pattern, and their number per square meter is negatively correlated with their median size. A prediction of the number of Mn-nodules was achieved by combining information derived from the acoustic and optical data using a RF model. This model was tuned by examining the influence of the training set size, the number of growing trees (ntree) and the number of predictor variables to be randomly selected at each RF node (mtry) on the RF prediction accuracy. The use of larger training data sets with higher ntree and mtry values increases the accuracy. To estimate the Mn-nodule abundance, these predictions were linked to ground truth data acquired by box coring. Linking optical and hydro-acoustic data revealed a non-linear relationship between the Mn-nodule distribution and topographic characteristics. This highlights the importance of a detailed terrain reconstruction for predictive modelling of Mn-nodule abundance. In addition, this study underlines the necessity of a sufficient spatial distribution of the optical data to provide reliable modelling input for the RF.
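The tuning described above amounts to a grid search over ntree and mtry. A minimal sketch (ours): `fit_score` stands in for training an RF with the given hyperparameters and evaluating its prediction accuracy on held-out data, which we do not reproduce here.

```python
import itertools

def tune_rf(fit_score, ntrees=(100, 300, 500), mtrys=(2, 4, 6)):
    """Grid search over the two RF hyperparameters examined in the
    study: number of trees (ntree) and predictors sampled at each
    node (mtry). fit_score(ntree, mtry) is assumed to return a
    validation accuracy; the best (ntree, mtry) pair is returned."""
    return max(itertools.product(ntrees, mtrys),
               key=lambda params: fit_score(*params))
```

The candidate grids here are illustrative; the study's finding that larger training sets with higher ntree and mtry improve accuracy corresponds to the score surface rising toward those corners of the grid.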


10.29007/ntlb ◽ 2018 ◽
Author(s): Thibault Gauthier ◽ Cezary Kaliszyk ◽ Josef Urban

Techniques combining machine learning with translation to automated reasoning have recently become an important component of formal proof assistants. Such “hammer” techniques complement traditional proof assistant automation as implemented by tactics and decision procedures. In this paper we present a unified proof assistant automation approach which attempts to automate the selection of appropriate tactics and tactic-sequences combined with an optimized small-scale hammering approach. We implement the technique as a tactic-level automation for HOL4: TacticToe. It implements a modified A*-algorithm directly in HOL4 that explores different tactic-level proof paths, guiding their selection by learning from a large number of previous tactic-level proofs. Unlike the existing hammer methods, TacticToe avoids translation to FOL, working directly on the HOL level. By combining tactic prediction and premise selection, TacticToe is able to re-prove 39% of 7902 HOL4 theorems in 5 seconds whereas the best single HOL(y)Hammer strategy solves 32% in the same amount of time.
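The guided tactic-level search can be caricatured as best-first search over proof states. The toy sketch below is our own (names, goal representation, and scoring are all ours, not the HOL4 implementation): each tactic maps a goal to a new goal or fails, and `score` stands in for the learned predictor that ranks which open state to expand next.

```python
import heapq

def tactic_search(start, tactics, score, solved, max_steps=100):
    """Best-first search over proof states. tactics is a list of
    (name, fn) pairs where fn(state) returns a successor state or
    None on failure; score(state) ranks open states (lower = more
    promising). Returns the tactic sequence reaching a solved state,
    or None if the step budget runs out."""
    frontier = [(score(start), start, [])]
    seen = {start}
    for _ in range(max_steps):
        if not frontier:
            return None
        _, state, path = heapq.heappop(frontier)
        if solved(state):
            return path
        for name, tactic in tactics:
            nxt = tactic(state)
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (score(nxt), nxt, path + [name]))
    return None
```

The real system explores HOL4 goal states and learns its ranking from a large corpus of prior tactic-level proofs; the skeleton of guided expansion is the same.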


Author(s): Dixian Zhu ◽ Changjie Cai ◽ Tianbao Yang ◽ Xun Zhou

In this paper, we tackle air quality forecasting by using machine learning approaches to predict the hourly concentration of air pollutants (e.g., ozone, PM2.5, and sulfur dioxide). Machine learning, one of the most popular techniques, can efficiently train a model on big data using large-scale optimization algorithms. Although some work exists on applying machine learning to air quality prediction, most prior studies are restricted to small-scale data and simply train standard regression models (linear or non-linear) to predict the hourly air pollution concentration. In this work, we propose refined models that predict the hourly air pollution concentration from meteorological data of previous days by formulating the prediction of the 24 hours as a multi-task learning problem. This enables us to select a good model with different regularization techniques. We propose a useful regularization that enforces the prediction models of consecutive hours to be close to each other, and compare it with several typical regularizations for multi-task learning, including standard Frobenius norm regularization, nuclear norm regularization, and ℓ2,1-norm regularization. Our experiments show that the proposed formulations and regularization achieve better performance than existing standard regression models and existing regularizations.
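The proposed regularizer is simple to state. A sketch (symbols and function name ours) where the weight matrix W holds one weight column per forecast hour, so penalizing differences between adjacent columns keeps the models for consecutive hours close:

```python
import numpy as np

def smoothness_penalty(W, lam=1.0):
    """Regularizer encouraging consecutive-hour models to agree:
    W has shape (n_features, 24), one weight column per forecast
    hour; the penalty is lam * sum of squared differences between
    the weight vectors of adjacent hours."""
    diffs = W[:, 1:] - W[:, :-1]
    return lam * np.sum(diffs ** 2)
```

This term would be added to the multi-task training loss alongside the data-fitting term, playing the role that Frobenius, nuclear, or ℓ2,1 norms play in the baseline regularizations it is compared against.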

