Unsupervised entropy-based selection of data sets for improved model fitting

Author(s):  
Pedro M. Ferreira


1995
Vol 31 (2)
pp. 193-204
Author(s):  
Koen Grijspeerdt ◽  
Peter Vanrolleghem ◽  
Willy Verstraete

A comparative study of several recently proposed one-dimensional sedimentation models has been made. This has been achieved by fitting these models to steady-state and dynamic concentration profiles obtained in a down-scaled secondary decanter. The models were evaluated with several a posteriori model selection criteria. Since the purpose of the modelling task is to do on-line simulations, the calculation time was used as one of the selection criteria. Finally, the practical identifiability of the models for the available data sets was also investigated. It could be concluded that the model of Takács et al. (1991) gave the most reliable results.
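
As an illustration of the kind of comparison described (fitting competing settling-velocity models to concentration data and ranking them by an a posteriori criterion together with calculation time), the following hedged sketch uses synthetic data and standard model forms (Takács double-exponential, Vesilind exponential). It is not the authors' code, and AIC is used here only as one possible a posteriori criterion.

```python
# Illustrative sketch, not the authors' code: rank candidate settling-velocity
# models by an a posteriori criterion (AIC) and by fitting time.
import time
import numpy as np
from scipy.optimize import curve_fit

def takacs(x, v0, rh, rp):
    # Double-exponential settling velocity (Takacs et al., 1991)
    return v0 * (np.exp(-rh * x) - np.exp(-rp * x))

def vesilind(x, v0, n):
    # Single-exponential settling velocity (Vesilind form)
    return v0 * np.exp(-n * x)

rng = np.random.default_rng(0)
conc = np.linspace(0.1, 8.0, 40)                                     # concentration (kg/m3), synthetic
obs = takacs(conc, 7.0, 0.4, 5.0) + rng.normal(0, 0.05, conc.size)   # synthetic settling profile

for name, model, p0 in [("Takacs", takacs, (5, 0.5, 3)), ("Vesilind", vesilind, (5, 0.5))]:
    t0 = time.perf_counter()
    popt, _ = curve_fit(model, conc, obs, p0=p0, maxfev=10000)
    dt = time.perf_counter() - t0
    rss = np.sum((obs - model(conc, *popt)) ** 2)
    aic = obs.size * np.log(rss / obs.size) + 2 * len(popt)          # least-squares AIC
    print(f"{name}: AIC = {aic:.1f}, fit time = {dt * 1e3:.1f} ms")
```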


Author(s):  
Christian Luksch ◽  
Lukas Prost ◽  
Michael Wimmer

We present a real-time rendering technique for photometric polygonal lights. Our method uses a numerical integration technique based on a triangulation to calculate noise-free diffuse shading. We include a dynamic point in the triangulation that provides a continuous near-field illumination resembling the shape of the light emitter and its characteristics. We evaluate the accuracy of our approach with a diverse selection of photometric measurement data sets in a comprehensive benchmark framework. Furthermore, we provide an extension for specular reflection on surfaces with arbitrary roughness that facilitates the use of existing real-time shading techniques. Our technique is easy to integrate into real-time rendering systems and extends the range of possible applications with photometric area lights.
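
To make the triangulation-based integration concrete, the sketch below (our illustration, not the paper's implementation) fans a polygonal light around an interior "dynamic" point and accumulates per-triangle diffuse irradiance with the classical Lambert edge formula. Horizon clipping and the per-triangle photometric weighting are omitted or stubbed, and all geometry is a placeholder.

```python
# Illustrative fan-triangulation of a polygonal light around an interior "anchor"
# point, with per-triangle diffuse irradiance from the classical Lambert edge
# formula. Not the paper's implementation; horizon clipping and the photometric
# (luminaire intensity) weighting per triangle are deliberately omitted.
import numpy as np

def triangle_term(p, n, tri):
    """Signed Lambert edge sum for one triangle as seen from point p with normal n."""
    u = [(v - p) / np.linalg.norm(v - p) for v in tri]  # vertices projected on unit sphere
    s = 0.0
    for i in range(3):
        a, b = u[i], u[(i + 1) % 3]
        theta = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))  # arc length of the edge
        g = np.cross(a, b)
        gn = np.linalg.norm(g)
        if gn > 1e-9:
            s += theta * np.dot(g / gn, n)
    return s

def polygon_irradiance(p, n, verts, anchor, radiance=1.0):
    """Fan the polygon around 'anchor' (the dynamic point) and sum triangle terms.
    The sign depends on vertex winding, so the absolute value is taken; a photometric
    weight per triangle could be sampled here (hypothetical extension)."""
    total = sum(triangle_term(p, n, [anchor, verts[i], verts[(i + 1) % len(verts)]])
                for i in range(len(verts)))
    return 0.5 * radiance * abs(total)

quad = [np.array(v, float) for v in [(0, 0, 2), (1, 0, 2), (1, 1, 2), (0, 1, 2)]]
print(polygon_irradiance(np.zeros(3), np.array([0.0, 0.0, 1.0]), quad,
                         anchor=np.array([0.5, 0.5, 2.0])))
```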


2014
Vol 2014
pp. 1-7
Author(s):  
Dinesh Verma ◽  
Shishir Kumar

Nowadays, software developers face the challenge of minimizing the number of defects introduced during software development. Using the defect density parameter, developers can identify opportunities for improvement in the product. Since the total number of defects depends on module size, the optimal module size that minimizes defect density needs to be determined. In this paper, an improved model is formulated that describes the relationship between defect density and module size. This relationship can be used to optimize overall defect density through an effective distribution of module sizes. Three available data sets related to this aspect have been examined with the proposed model, taking distinct values of the variables and placing constraints on the parameters. A curve-fitting method has been used to obtain the module size with minimum defect density, and goodness-of-fit measures have been computed to validate the proposed model on the data sets. The overall defect density can be optimized by an effective distribution of module sizes: larger modules can be broken into smaller ones, and smaller modules can be merged.
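
As a sketch of the curve-fitting step, assuming a simple hypothetical defect-density form D(s) = a/s + b·s + c (the paper's actual model may differ) and synthetic data:

```python
# Minimal sketch, not the paper's exact model: fit a hypothetical defect-density
# curve D(s) = a/s + b*s + c to module-level data and locate the size minimizing it.
import numpy as np
from scipy.optimize import curve_fit

def defect_density(size, a, b, c):
    # Small modules suffer interface overhead (a/size), large ones complexity (b*size).
    return a / size + b * size + c

sizes = np.array([50, 100, 200, 400, 800, 1600], dtype=float)   # LOC per module (synthetic)
densities = np.array([9.1, 5.2, 3.4, 3.0, 3.9, 6.2])            # defects/KLOC (synthetic)

(a, b, c), _ = curve_fit(defect_density, sizes, densities, p0=(100, 0.01, 1))
optimal_size = np.sqrt(a / b)        # d/ds (a/s + b*s + c) = 0  =>  s* = sqrt(a/b)
ss_res = np.sum((densities - defect_density(sizes, a, b, c)) ** 2)
ss_tot = np.sum((densities - densities.mean()) ** 2)
print(f"optimal module size ~ {optimal_size:.0f} LOC, R^2 = {1 - ss_res / ss_tot:.3f}")
```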


2017
Vol 21 (9)
pp. 4747-4765
Author(s):  
Clara Linés ◽  
Micha Werner ◽  
Wim Bastiaanssen

Abstract. The implementation of drought management plans contributes to reducing the wide range of adverse impacts caused by water shortage. A crucial element in the development of drought management plans is the selection of appropriate indicators and their associated thresholds to detect drought events and monitor their evolution. Drought indicators should be able to detect emerging drought processes that will lead to impacts, with sufficient anticipation to allow measures to be undertaken effectively. However, in the selection of appropriate drought indicators, the connection to the final impacts is often disregarded. This paper explores the utility of remotely sensed data sets to detect early stages of drought at the river basin scale and to determine how much time can be gained to inform operational land and water management practices. Six different remote sensing data sets with different spectral origins and measurement frequencies are considered, complemented by a group of classical in situ hydrological indicators. Their predictive power to detect past drought events is tested in the Ebro Basin. Qualitative (binary information based on media records) and quantitative (crop yields) data on drought events and impacts spanning a period of 12 years are used as a benchmark in the analysis. Results show that early signs of drought impacts can be detected up to 6 months before impacts are reported in newspapers, with the best correlation–anticipation relationships for the standardised precipitation index (SPI), the normalised difference vegetation index (NDVI) and evapotranspiration (ET). Soil moisture (SM) and land surface temperature (LST) also offer good anticipation, but with weaker correlations, while gross primary production (GPP) presents moderate positive correlations only for some of the rain-fed areas. Although classical hydrological information from water levels and water flows provided better anticipation than the remote sensing indicators in most of the areas, its correlations were weaker. The indicators show a consistent behaviour with respect to the different levels of crop yield in rain-fed areas among the analysed years, with SPI, NDVI and ET again providing the strongest correlations. Overall, the results confirm the ability of remote sensing products to anticipate reported drought impacts, and they therefore appear to be a useful source of information to support drought management decisions.
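
A hedged sketch of the kind of lagged-correlation analysis that such anticipation results suggest (synthetic monthly series, not the study's data or code):

```python
# Estimate how many months an indicator such as SPI or NDVI anticipates reported
# impacts by scanning lagged correlations against a monthly impact series.
import numpy as np

rng = np.random.default_rng(1)
months = 144                                   # 12 years of monthly values
indicator = rng.normal(size=months).cumsum()   # synthetic standardised indicator
impacts = np.roll(indicator, 4) + rng.normal(scale=0.5, size=months)  # impacts lag ~4 months

best_lag, best_r = 0, -np.inf
for lag in range(0, 7):                        # test anticipation of 0-6 months
    r = np.corrcoef(indicator[:months - lag], impacts[lag:])[0, 1]
    if r > best_r:
        best_lag, best_r = lag, r
print(f"best anticipation: {best_lag} months (r = {best_r:.2f})")
```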


Author(s):  
Anastasiia Ivanitska ◽  
Dmytro Ivanov ◽  
Ludmila Zubik

An analysis of available methods and models for generating recommendations for potential buyers in networked information systems is carried out, with the aim of developing effective advertising-selection modules. The effectiveness of using machine learning technologies to analyse user preferences, based on processing data on purchases made by users with a similar profile, is substantiated. A recommendation-formation model based on machine learning technology is proposed, its operation is tested on test data sets, and the adequacy of the model is assessed using RMSE. Keywords: behavior prediction; advertising based on similarity; collaborative filtering; matrix factorization; big data; machine learning
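
A minimal sketch of the collaborative-filtering core described (matrix factorization trained by stochastic gradient descent and evaluated with RMSE); the rating matrix, latent dimension and learning parameters are illustrative placeholders, not the paper's setup:

```python
# Matrix factorization for collaborative filtering, evaluated with RMSE.
import numpy as np

rng = np.random.default_rng(0)
R = np.array([[5, 3, 0, 1],          # user-item ratings, 0 = unknown (synthetic)
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
mask = R > 0
k, lr, reg = 2, 0.01, 0.02           # latent factors, learning rate, regularisation
P = rng.normal(scale=0.1, size=(R.shape[0], k))
Q = rng.normal(scale=0.1, size=(R.shape[1], k))

for _ in range(5000):                # SGD over the observed entries
    for u, i in zip(*np.nonzero(mask)):
        err = R[u, i] - P[u] @ Q[i]
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * P[u] - reg * Q[i])

rmse = np.sqrt(np.mean((R[mask] - (P @ Q.T)[mask]) ** 2))
print(f"training RMSE = {rmse:.3f}")
```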


Complexity
2018
Vol 2018
pp. 1-16
Author(s):  
Yiwen Zhang ◽  
Yuanyuan Zhou ◽  
Xing Guo ◽  
Jintao Wu ◽  
Qiang He ◽  
...  

The K-means algorithm is one of the ten classic algorithms in the area of data mining and has long been studied by researchers in numerous fields. However, the value of the cluster number k in the K-means algorithm is not always easy to determine, and the selection of the initial centers is vulnerable to outliers. This paper proposes an improved K-means clustering algorithm called the covering K-means algorithm (C-K-means). The C-K-means algorithm not only acquires efficient and accurate clustering results but also self-adaptively provides a reasonable number of clusters based on the data features. It includes two phases: the initialization of the covering algorithm (CA) and the Lloyd iteration of K-means. The first phase executes the CA, which self-organizes and recognizes the number of clusters k based on the similarities in the data; it requires neither the number of clusters to be prespecified nor the initial centers to be manually selected. Therefore, it has a “blind” feature, that is, k is not preselected. The second phase performs the Lloyd iteration based on the results of the first phase. The C-K-means algorithm combines the advantages of CA and K-means. Experiments carried out on the Spark platform verify the good scalability of the C-K-means algorithm, which can effectively solve the problem of large-scale data clustering. Extensive experiments on real data sets show that the accuracy and efficiency of the C-K-means algorithm outperform those of existing algorithms under both sequential and parallel conditions.
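
The two-phase structure can be sketched as follows; the greedy covering pass below is a simplified stand-in for the covering algorithm (CA), with a hypothetical radius parameter, and is not the authors' implementation:

```python
# Phase 1: a greedy covering pass chooses k and the initial centers from the data.
# Phase 2: Lloyd iterations (standard K-means) refine those centers.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def covering_init(X, radius):
    """Greedy covering: any point farther than `radius` from all existing centers
    starts a new cluster, so k emerges from the data rather than being prespecified."""
    centers = [X[0]]
    for x in X[1:]:
        if min(np.linalg.norm(x - c) for c in centers) > radius:
            centers.append(x)
    return np.array(centers)

X, _ = make_blobs(n_samples=600, centers=4, cluster_std=0.6, random_state=0)
init_centers = covering_init(X, radius=2.5)                  # radius is a hypothetical parameter
km = KMeans(n_clusters=len(init_centers), init=init_centers, n_init=1).fit(X)
print(f"clusters found: {len(init_centers)}, inertia: {km.inertia_:.1f}")
```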


Author(s):  
Xinshui Yu ◽  
Zhaohui Yang ◽  
Kunling Song ◽  
Tianxiang Yu ◽  
Bozhi Guo

The distributions and parameters of the random variables are an essential input to conventional reliability analysis methods such as the Monte Carlo method; they must be known before these methods can be used, but they are often hard or impossible to obtain. The model-free sampling technique provides a way to estimate the distribution of the random variables, but the accuracy of the extended sample it generates is not sufficient. This paper presents an improved model-free sampling technique, based on Bootstrap methods, that increases the accuracy of the extended sample and decreases the number of iterations. In this improved model-free sampling technique, the selection of the initial sample points and the generation of the iterative sample are improved. In addition, a center-distance criterion, which considers the local characteristics of the extended sample, is added to the generating criterion of the dissimilarity measure. The effectiveness of the improved method is illustrated through several numerical examples.
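
As a rough illustration of how Bootstrap ideas can extend a small sample without assuming a distribution, here is a smoothed bootstrap with a standard bandwidth rule; this is not the authors' procedure, which adds its own initial-point selection and center-distance criteria:

```python
# Smoothed bootstrap: resample with replacement and add a kernel perturbation.
import numpy as np

def smoothed_bootstrap_extend(sample, n_new, rng=np.random.default_rng(0)):
    sample = np.asarray(sample, dtype=float)
    h = 1.06 * sample.std(ddof=1) * sample.size ** (-1 / 5)   # Silverman's rule (assumed bandwidth)
    picks = rng.choice(sample, size=n_new, replace=True)      # bootstrap resample
    return picks + rng.normal(scale=h, size=n_new)            # kernel perturbation

initial = np.array([2.1, 2.4, 2.6, 2.9, 3.0, 3.3, 3.8])       # small initial sample (synthetic)
extended = smoothed_bootstrap_extend(initial, 500)
print(extended.mean(), extended.std(ddof=1))
```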


Geophysics
2016
Vol 81 (2)
pp. V141-V150
Author(s):  
Emanuele Forte ◽  
Matteo Dossi ◽  
Michele Pipan ◽  
Anna Del Ben

We have applied an attribute-based autopicking algorithm to reflection seismic data with the aim of reducing the influence of the user’s subjectivity on the picking results and making the interpretation faster with respect to manual and semiautomated techniques. Our picking procedure uses the cosine of the instantaneous phase to automatically detect and mark as a horizon any recorded event characterized by lateral phase continuity. A patching procedure, which exploits horizon parallelism, can be used to connect consecutive horizons that mark the same event but are separated by noise-related gaps. The picking process marks all coherent events regardless of their reflection strength; therefore, a large number of independent horizons can be constructed. To facilitate interpretation, horizons marking different phases of the same reflection can be automatically grouped together, and specific horizons from each reflection can be selected using different possible methods. In the phase method, the algorithm reconstructs the reflected wavelets by averaging the cosine of the instantaneous phase along each horizon. The resulting wavelets are then locally analyzed and compared through crosscorrelation, allowing the recognition and selection of specific reflection phases. In cases where the reflected wavelets cannot be recovered due to shape-altering processing or a low signal-to-noise ratio, the energy method uses the reflection strength to group together subparallel horizons within the same energy package and to select those satisfying either energy or arrival-time criteria. These methods can be applied automatically to all the picked horizons or to horizons individually selected by the interpreter for specific analysis. We show examples of application to 2D reflection seismic data sets in complex geologic and stratigraphic conditions, critically reviewing the performance of the whole process.
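
An illustrative sketch of the picking attribute and a very simplified horizon-following step (synthetic data; the fixed search window below is a hypothetical simplification of the published lateral-continuity criterion, not the authors' code):

```python
# Compute the cosine of the instantaneous phase with the Hilbert transform and
# follow one horizon across traces by linking, trace to trace, the nearest sample
# where that attribute peaks.
import numpy as np
from scipy.signal import hilbert

n_traces, n_samples = 60, 400
t = np.arange(n_samples)
data = np.zeros((n_traces, n_samples))
for i in range(n_traces):
    arrival = 150 + int(0.5 * i)                       # gently dipping synthetic reflector
    data[i] = np.exp(-((t - arrival) / 6.0) ** 2) * np.cos(0.6 * (t - arrival))
data += 0.05 * np.random.default_rng(0).normal(size=data.shape)

cos_phase = np.cos(np.angle(hilbert(data, axis=1)))    # attribute used for picking

horizon = [int(np.argmax(cos_phase[0, 140:180])) + 140]   # seed pick on the first trace
for i in range(1, n_traces):
    lo, hi = horizon[-1] - 5, horizon[-1] + 6              # lateral-continuity search window
    horizon.append(lo + int(np.argmax(cos_phase[i, lo:hi])))
print(horizon[:10])
```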


2018
Vol 11 (11)
pp. 6203-6230
Author(s):  
Simon Ruske ◽  
David O. Topping ◽  
Virginia E. Foot ◽  
Andrew P. Morse ◽  
Martin W. Gallagher

Abstract. Primary biological aerosol, including bacteria, fungal spores and pollen, has important implications for public health and the environment. Such particles may have different concentrations of chemical fluorophores and will respond differently in the presence of ultraviolet light, potentially allowing different types of biological aerosol to be discriminated. The development of ultraviolet light-induced fluorescence (UV-LIF) instruments such as the Wideband Integrated Bioaerosol Sensor (WIBS) has allowed size, morphology and fluorescence measurements to be collected in real time. However, without studying instrument responses in the laboratory, it is unclear to what extent different types of particles can be discriminated. Collection of laboratory data is vital to validate any approach used to analyse the data and to ensure that the available data are utilized as effectively as possible. In this paper a variety of methodologies are tested on a range of particles collected in the laboratory. Hierarchical agglomerative clustering (HAC) has previously been applied to UV-LIF data in a number of studies and is tested alongside other algorithms that could be used to solve the classification problem: density-based spatial clustering of applications with noise (DBSCAN), k-means and gradient boosting. Whilst HAC was able to effectively discriminate between reference narrow-size-distribution PSL particles, yielding a classification error of only 1.8 %, similar results were not obtained when testing on laboratory-generated aerosol, where the classification error was found to be between 11.5 % and 24.2 %. Furthermore, there is a large uncertainty in this approach in terms of the data preparation and the cluster index used, and we were unable to attain consistent results across the different sets of laboratory-generated aerosol tested. The lowest classification errors were obtained using gradient boosting, where the misclassification rate was between 4.38 % and 5.42 %. The largest contribution to the error, in the case of the higher misclassification rate, came from the pollen samples, where 28.5 % of the samples were incorrectly classified as fungal spores. The technique was robust to changes in data preparation provided a fluorescence threshold was applied to the data. In the event that laboratory training data are unavailable, DBSCAN was found to be a potential alternative to HAC. In the case of one of the data sets, where 22.9 % of the data were left unclassified, we were able to produce three distinct clusters, obtaining a classification error of only 1.42 % on the classified data. These results could not be replicated for the other data set, where 26.8 % of the data were not classified and a classification error of 13.8 % was obtained. This method, like HAC, also appeared to be heavily dependent on data preparation, requiring a different selection of parameters depending on the preparation used. Further analysis will also be required to confirm our selection of the parameters when using this method on ambient data. There is a clear need for the collection of additional laboratory-generated aerosol to improve the interpretation of current databases and to aid in the analysis of data collected from the ambient environment. New instruments with greater resolution are likely to improve current discrimination between pollen, bacteria and fungal spores, and even between different species; however, the need for extensive laboratory data sets will grow as a result.
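
A minimal sketch of the best-performing supervised approach reported (gradient boosting after a fluorescence threshold); the features, threshold value and class labels below are placeholders, not the WIBS laboratory data:

```python
# Train a gradient boosting classifier on thresholded particle features and report
# the misclassification rate.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1500
X = np.column_stack([rng.gamma(2.0, 2.0, n),      # optical size (um), hypothetical
                     rng.normal(1.2, 0.4, n),     # asphericity, hypothetical
                     rng.normal(50, 20, n),       # fluorescence channel FL1, hypothetical
                     rng.normal(30, 15, n)])      # fluorescence channel FL2, hypothetical
# Dummy 3-class labels (e.g. bacteria / fungal spore / pollen) derived from the features.
y = np.digitize(X[:, 2] + 0.5 * X[:, 0], bins=[45, 70])

keep = X[:, 2] > 20                                # fluorescence threshold (hypothetical value)
X_tr, X_te, y_tr, y_te = train_test_split(X[keep], y[keep], test_size=0.3, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
print(f"misclassification rate: {1 - clf.score(X_te, y_te):.3f}")
```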


2016
Author(s):  
Andreas Ostler ◽  
Ralf Sussmann ◽  
Prabir K. Patra ◽  
Sander Houweling ◽  
Marko De Bruine ◽  
...  

Abstract. The distribution of methane (CH4) in the stratosphere can be a major driver of spatial variability in the dry-air column-averaged CH4 mixing ratio (XCH4), which is being measured increasingly for the assessment of CH4 surface emissions. Chemistry-transport models (CTMs) therefore need to simulate the tropospheric and stratospheric fractional columns of XCH4 accurately to estimate surface emissions from XCH4. Simulations from three CTMs are tested against XCH4 observations from the Total Carbon Column Observing Network (TCCON). We analyze how the model-TCCON agreement in XCH4 depends on the model representation of stratospheric CH4 distributions. Model equivalents of TCCON XCH4 are computed with stratospheric CH4 fields taken from the model simulations themselves, from satellite-based CH4 distributions from MIPAS (Michelson Interferometer for Passive Atmospheric Sounding), and from MIPAS CH4 fields adjusted to ACE-FTS (Atmospheric Chemistry Experiment Fourier Transform Spectrometer) observations. In comparison to the simulated model fields, we find an improved model-TCCON XCH4 agreement for all models when MIPAS-based stratospheric CH4 fields are used. For the Atmospheric Chemistry Transport Model (ACTM) the average XCH4 bias is significantly reduced from 38.1 ppb to 13.7 ppb, whereas smaller improvements are found for the models TM5 (Transport Model, version 5; from 8.7 ppb to 4.3 ppb) and LMDz (Laboratoire de Météorologie Dynamique model with Zooming capability; from 6.8 ppb to 4.3 ppb). MIPAS stratospheric CH4 fields adjusted to ACE-FTS reduce the average XCH4 bias for ACTM (3.3 ppb), but increase the average XCH4 bias for TM5 (10.8 ppb) and LMDz (20.0 ppb). These findings imply that the range of satellite-based stratospheric CH4 is insufficient to resolve a possible stratospheric contribution to the differences in total column CH4 between TCCON and TM5 or LMDz. Applying transport diagnostics to the models indicates that model-to-model differences in the simulation of stratospheric transport, notably in the age of stratospheric air, can largely explain the inter-model spread in stratospheric CH4 and, hence, its contribution to XCH4. This implies a need to better understand the impact of individual model transport components (e.g., physical parameterization, meteorological data sets, model horizontal/vertical resolution) on modeled stratospheric CH4.
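
A schematic illustration of why the stratospheric partial column matters for XCH4: the column average is the sum of the tropospheric and stratospheric partial columns divided by the dry-air column, so replacing the modelled stratospheric field with a satellite-based one shifts XCH4 directly. All numbers below are round placeholders, not values from the study:

```python
# Column-averaged dry-air mole fraction from partial columns (schematic values).
tropo_ch4 = 3.30e19       # tropospheric CH4 partial column (molecules cm-2), placeholder
strat_model = 0.45e19     # stratospheric partial column from the CTM, placeholder
strat_mipas = 0.43e19     # stratospheric partial column from MIPAS-based fields, placeholder
dry_air_column = 2.10e25  # total dry-air column (molecules cm-2), placeholder

for label, strat in [("model stratosphere", strat_model), ("MIPAS stratosphere", strat_mipas)]:
    xch4_ppb = (tropo_ch4 + strat) / dry_air_column * 1e9
    print(f"{label}: XCH4 = {xch4_ppb:.1f} ppb")
```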

