The Gamma Test

Author(s):  
Antonia J. Jones ◽  
Dafydd Evans ◽  
Steve Margetts ◽  
Peter J. Durrant

The Gamma Test is a non-linear modelling analysis tool that allows us to quantify the extent to which a numerical input/output data set can be expressed as a smooth relationship. In essence, it allows us to efficiently calculate that part of the variance of the output that cannot be accounted for by the existence of any smooth model based on the inputs, even though this model is unknown. A key aspect of this tool is its speed: the Gamma Test has time complexity O(M log M), where M is the number of datapoints. For data sets consisting of a few thousand points and a reasonable number of attributes, a single run of the Gamma Test typically takes a few seconds. In this chapter we will show how the Gamma Test can be used in the construction of predictive models and classifiers for numerical data. In doing so, we will demonstrate the use of this technique for feature selection, and for the selection of embedding dimension when dealing with a time series.
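
As an illustration of the idea rather than a reproduction of the authors' implementation, the sketch below assumes the standard nearest-neighbour formulation of the Gamma Test: mean squared distances to the k nearest neighbours in input space (delta) are regressed against half the mean squared differences of the corresponding outputs (gamma), and the regression intercept estimates the noise variance. The function name, the choice of p = 10 neighbours and the synthetic data are illustrative.

```python
# Minimal Gamma Test sketch: nearest-neighbour statistics in input space
# regressed against output differences; the intercept estimates Var(noise).
import numpy as np
from scipy.spatial import cKDTree

def gamma_test(X, y, p=10):
    """Return (Gamma, slope): the intercept estimates the noise variance."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    tree = cKDTree(X)                           # O(M log M) neighbour queries
    dist, idx = tree.query(X, k=p + 1)          # k+1 because each point matches itself
    dist, idx = dist[:, 1:], idx[:, 1:]         # drop the self-match
    delta = np.mean(dist ** 2, axis=0)                        # delta_M(k), k = 1..p
    gamma = np.mean((y[idx] - y[:, None]) ** 2, axis=0) / 2   # gamma_M(k)
    slope, intercept = np.polyfit(delta, gamma, 1)            # regress gamma on delta
    return intercept, slope

# Usage: for y = f(x) + noise, the intercept should approach Var(noise).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 3))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.1, 2000)
print(gamma_test(X, y))   # intercept should be close to 0.01 (= 0.1**2)
```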

2006 ◽  
Vol 39 (2) ◽  
pp. 262-266 ◽  
Author(s):  
R. J. Davies

Synchrotron sources offer high-brilliance X-ray beams which are ideal for spatially and time-resolved studies. Large amounts of wide- and small-angle X-ray scattering data can now be generated rapidly, for example, during routine scanning experiments. Consequently, the analysis of the large data sets produced has become a complex and pressing issue. Even relatively simple analyses become difficult when a single data set can contain many thousands of individual diffraction patterns. This article reports on a new software application for the automated analysis of scattering intensity profiles. It is capable of batch-processing thousands of individual data files without user intervention. Diffraction data can be fitted using a combination of background functions and non-linear peak functions. To complement the batch-wise operation mode, the software includes several specialist algorithms to ensure that the results obtained are reliable. These include peak-tracking, artefact removal, function elimination and spread-estimate fitting. Furthermore, as well as non-linear fitting, the software can calculate integrated intensities and selected orientation parameters.
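
The software itself is not reproduced here; the hedged sketch below only illustrates the general pattern of batch-fitting a background function plus a non-linear peak function to many intensity profiles with scipy. The file pattern, the two-column data layout and the single Gaussian-plus-linear-background model are assumptions.

```python
# Illustrative batch fit: linear background plus one Gaussian peak per profile.
import glob
import numpy as np
from scipy.optimize import curve_fit

def model(q, bg0, bg1, amp, centre, width):
    """Linear background plus a single Gaussian peak."""
    return bg0 + bg1 * q + amp * np.exp(-0.5 * ((q - centre) / width) ** 2)

results = {}
for path in sorted(glob.glob("profiles/*.dat")):   # hypothetical layout: two columns, q and I(q)
    q, intensity = np.loadtxt(path, unpack=True)
    p0 = [intensity.min(), 0.0, intensity.max(), q[np.argmax(intensity)], 0.05]
    try:
        popt, _ = curve_fit(model, q, intensity, p0=p0, maxfev=10000)
        results[path] = dict(zip(["bg0", "bg1", "amp", "centre", "width"], popt))
    except RuntimeError:
        results[path] = None   # crude stand-in for artefact removal / function elimination
```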


2018 ◽  
Vol 11 (2) ◽  
pp. 53-67
Author(s):  
Ajay Kumar ◽  
Shishir Kumar

Several initial center selection algorithms have been proposed in the literature for numerical data, but the values of categorical data are unordered, so these methods are not applicable to categorical data sets. This article investigates the initial center selection process for categorical data and then presents a new support-based initial center selection algorithm. The proposed algorithm measures the weight of the unique data points of an attribute with the help of support and then integrates these weights along the rows to obtain the support of every row. A data object having the largest support is chosen as the initial center, and further centers are then found at the greatest distance from the initially selected center. The quality of the proposed algorithm is compared with the random initial center selection method, Cao's method, Wu's method and the method introduced by Khan and Ahmad. Experimental analysis on real data sets shows the effectiveness of the proposed algorithm.
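
The following is a rough sketch of the support-based idea described above, under stated assumptions: "support" is taken as the relative frequency of an attribute value, a row's support is the sum of its values' supports, and the distance between categorical rows is measured by simple matching (Hamming) distance, since the abstract does not fix these details.

```python
# Sketch of support-based initial center selection for categorical data.
import pandas as pd

def select_initial_centers(df: pd.DataFrame, k: int) -> list:
    # support of each attribute value = relative frequency within the attribute
    supports = {col: df[col].value_counts(normalize=True) for col in df.columns}
    # row support = sum of the supports of its attribute values
    row_support = sum(df[col].map(supports[col]) for col in df.columns)
    centers = [row_support.idxmax()]                 # first center: largest support

    def hamming(i, j):
        return int((df.loc[i] != df.loc[j]).sum())   # simple matching distance

    while len(centers) < k:
        # next center: the row with the greatest total distance to chosen centers
        dist = {i: sum(hamming(i, c) for c in centers)
                for i in df.index if i not in centers}
        centers.append(max(dist, key=dist.get))
    return centers

# Usage with a toy categorical data set
df = pd.DataFrame({"colour": ["red", "red", "blue", "green"],
                   "shape":  ["box", "box", "ball", "ball"]})
print(select_initial_centers(df, 2))
```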


1988 ◽  
Vol 254 (1) ◽  
pp. E104-E112
Author(s):  
B. Candas ◽  
J. Lalonde ◽  
M. Normand

The aim of this study is the selection of the number of compartments required for a model to represent the distribution and metabolism of corticotropin-releasing factor (CRF) in rats. The dynamics of labeled rat CRF were measured in plasma for seven rats after a rapid injection. The sampling schedule resulted from the combination of the two D-optimal sampling sets of times corresponding to the two rival models. This protocol improved the numerical identifiability of the parameters and consequently facilitated the selection of the relevant model. A three-compartment model fits the seven individual dynamics adequately and represents four of them better than the lower-order model does. It was demonstrated, using simulations in which the measurement errors and the interindividual variability of the parameters are included, that this four-to-seven ratio of data sets is consistent with the relevance of the three-compartment model for every individual kinetic data set. Kinetic and metabolic parameters were then derived for each individual rat, their values being consistent with the prolonged effects of CRF on pituitary-adrenocortical secretion.
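
For readers unfamiliar with compartmental modelling, a minimal sketch of how rival model orders could be compared is given below, assuming that the plasma disappearance curve of an n-compartment model after a bolus injection is a sum of n exponentials and that the fits are compared with an information criterion; this is not the authors' estimation procedure.

```python
# Comparing two- vs three-compartment (bi- vs tri-exponential) plasma fits by AIC.
import numpy as np
from scipy.optimize import curve_fit

def bi_exp(t, a1, k1, a2, k2):
    return a1 * np.exp(-k1 * t) + a2 * np.exp(-k2 * t)

def tri_exp(t, a1, k1, a2, k2, a3, k3):
    return a1 * np.exp(-k1 * t) + a2 * np.exp(-k2 * t) + a3 * np.exp(-k3 * t)

def aic(y, yhat, n_params):
    rss = np.sum((y - yhat) ** 2)
    n = len(y)
    return n * np.log(rss / n) + 2 * n_params

def compare_models(t, c):
    """t, c are hypothetical sampling times and measured plasma concentrations."""
    p2, _ = curve_fit(bi_exp,  t, c, p0=[1, 1, 0.5, 0.1], maxfev=20000)
    p3, _ = curve_fit(tri_exp, t, c, p0=[1, 5, 0.5, 0.5, 0.2, 0.05], maxfev=20000)
    return {"two-compartment":   aic(c, bi_exp(t, *p2), 4),
            "three-compartment": aic(c, tri_exp(t, *p3), 6)}
```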


2019 ◽  
Vol 16 (9) ◽  
pp. 4008-4014
Author(s):  
Savita Wadhawan ◽  
Gautam Kumar ◽  
Vivek Bhatnagar

This paper presents an analysis of different population-based algorithms for rulebase generation from numerical data sets, fuzzy rulebase generation being one of the key issues in fuzzy modeling. The algorithms are applied to a rapid Ni–Cd battery charger data set. We compare the efficiency of the different algorithms and conclude that SCA algorithms with local search give remarkable efficiency compared to SCA algorithms alone. We also found that the efficiency of SCA with local search is comparable to that of memetic algorithms.
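
The abstract does not spell out SCA, so the sketch below assumes it refers to the Sine Cosine Algorithm and adds a simple hill-climbing local search around the best solution in each generation; the fitness function, which would measure rulebase error on the Ni–Cd charger data, is left abstract and replaced by a toy objective.

```python
# Sine Cosine Algorithm (assumed meaning of SCA) with a simple local search step.
import numpy as np

def sca_with_local_search(fitness, dim, bounds, pop=30, iters=200, a=2.0, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    X = rng.uniform(lo, hi, size=(pop, dim))
    best = min(X, key=fitness).copy()
    for t in range(iters):
        r1 = a - t * (a / iters)                  # linearly decreasing step amplitude
        for i in range(pop):
            r2 = rng.uniform(0, 2 * np.pi, dim)
            r3 = rng.uniform(0, 2, dim)
            step = r1 * np.where(rng.random(dim) < 0.5, np.sin(r2), np.cos(r2))
            X[i] = np.clip(X[i] + step * np.abs(r3 * best - X[i]), lo, hi)
            if fitness(X[i]) < fitness(best):
                best = X[i].copy()
        # local search: small random perturbations around the current best
        for _ in range(10):
            cand = np.clip(best + rng.normal(0, 0.01 * (hi - lo), dim), lo, hi)
            if fitness(cand) < fitness(best):
                best = cand
    return best

# Toy usage: a quadratic stands in for rulebase-error minimisation.
print(sca_with_local_search(lambda x: np.sum(x ** 2), dim=5, bounds=(-1.0, 1.0)))
```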


SPE Journal ◽  
2017 ◽  
Vol 23 (03) ◽  
pp. 719-736 ◽  
Author(s):  
Quan Cai ◽  
Wei Yu ◽  
Hwa Chi Liang ◽  
Jenn-Tai Liang ◽  
Suojin Wang ◽  
...  

Summary The oil-and-gas industry is entering an era of “big data” because of the huge number of wells drilled with the rapid development of unconventional oil-and-gas reservoirs during the past decade. The massive amount of data generated presents a great opportunity for the industry to use data-analysis tools to help make informed decisions. The main challenge is the lack of the application of effective and efficient data-analysis tools to analyze and extract useful information for the decision-making process from the enormous amount of data available. In developing tight shale reservoirs, it is critical to have an optimal drilling strategy, thereby minimizing the risk of drilling in areas that would result in low-yield wells. The objective of this study is to develop an effective data-analysis tool capable of dealing with big and complicated data sets to identify hot zones in tight shale reservoirs with the potential to yield highly productive wells. The proposed tool is developed on the basis of nonparametric smoothing models, which are superior to the traditional multiple-linear-regression (MLR) models in both the predictive power and the ability to deal with nonlinear, higher-order variable interactions. This data-analysis tool is capable of handling one response variable and multiple predictor variables. To validate our tool, we used two real data sets—one with 249 tight oil horizontal wells from the Middle Bakken and the other with 2,064 shale gas horizontal wells from the Marcellus Shale. Results from the two case studies revealed that our tool not only can achieve much better predictive power than the traditional MLR models in identifying hot zones in the tight shale reservoirs but also can provide guidance on developing optimal drilling and completion strategies (e.g., well length and depth, amount of proppant and water injected). By comparing results from the two data sets, we found that with the big data set (2,064 Marcellus wells) and only four predictor variables, our tool can achieve model performance similar to that obtained with the small data set (249 Bakken wells) and six predictor variables. This implies that, for big data sets, even with a limited number of available predictor variables, our tool can still be very effective in identifying hot zones that would yield highly productive wells. The data sets that we have access to in this study contain very limited completion, geological, and petrophysical information. Results from this study clearly demonstrated that the data-analysis tool is powerful and flexible enough to take advantage of any additional engineering and geology data, allowing operators to gain insights into the impact of these factors on well performance.
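
The proprietary tool is not reproduced here; as a hedged illustration of why a nonparametric smoothing model can outperform MLR in the presence of nonlinear interactions, the sketch below compares ordinary multiple linear regression with a simple Nadaraya-Watson kernel smoother on synthetic data. The variable names and the synthetic relationship are assumptions.

```python
# MLR vs a basic nonparametric smoother (Nadaraya-Watson kernel regression).
import numpy as np
from sklearn.linear_model import LinearRegression

def kernel_smoother(X_train, y_train, X_query, bandwidth=0.3):
    """Nadaraya-Watson estimate with a Gaussian kernel."""
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2 * bandwidth ** 2))
    return (w @ y_train) / w.sum(axis=1)

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(500, 2))                            # e.g. scaled well length, proppant volume
y = np.sin(4 * X[:, 0]) * X[:, 1] + rng.normal(0, 0.05, 500)    # nonlinear interaction MLR cannot capture
Xtr, ytr, Xte, yte = X[:400], y[:400], X[400:], y[400:]

mlr = LinearRegression().fit(Xtr, ytr)
rmse = lambda a, b: np.sqrt(np.mean((a - b) ** 2))
print("MLR RMSE:     ", rmse(yte, mlr.predict(Xte)))
print("Smoother RMSE:", rmse(yte, kernel_smoother(Xtr, ytr, Xte)))
```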


2021 ◽  
Vol 79 (1) ◽  
Author(s):  
Romana Haneef ◽  
Sofiane Kab ◽  
Rok Hrzic ◽  
Sonsoles Fuentes ◽  
Sandrine Fosse-Edorh ◽  
...  

Abstract Background The use of machine learning techniques is increasing in healthcare, making it possible to estimate and predict health outcomes from large administrative data sets more efficiently. The main objective of this study was to develop a generic machine learning (ML) algorithm to estimate the incidence of diabetes based on the number of reimbursements over the last 2 years. Methods We selected a final data set from a population-based epidemiological cohort (i.e., CONSTANCES) linked with the French National Health Database (i.e., SNDS). To develop this algorithm, we adopted a supervised ML approach. The following steps were performed: i. selection of the final data set, ii. target definition, iii. coding of variables for a given window of time, iv. splitting of the final data into training and test data sets, v. variable selection, vi. model training, vii. validation of the model with the test data set, and viii. selection of the model. We used the area under the receiver operating characteristic curve (AUC) to select the best algorithm. Results The final data set used to develop the algorithm included 44,659 participants from CONSTANCES. Out of the 3468 coded variables from the SNDS linked to the CONSTANCES cohort, 23 variables were selected to train different algorithms. The final algorithm to estimate the incidence of diabetes was a Linear Discriminant Analysis model based on the number of reimbursements of selected variables related to biological tests, drugs, medical acts and hospitalization without a procedure over the last 2 years. This algorithm has a sensitivity of 62%, a specificity of 67% and an accuracy of 67% [95% CI: 0.66–0.68]. Conclusions Supervised ML is an innovative tool for the development of new methods to exploit large health administrative databases. In the context of the InfAct project, we have developed and applied, for the first time, a generic ML algorithm to estimate the incidence of diabetes for public health surveillance. The ML algorithm we have developed has moderate performance. The next step is to apply this algorithm to the SNDS to estimate the incidence of type 2 diabetes cases. More research is needed to apply various machine learning techniques to estimate the incidence of various health conditions.
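
A minimal sketch of steps iv-viii of the workflow is given below, with synthetic data standing in for the CONSTANCES/SNDS extraction; the placeholder variables are not the study's actual reimbursement codes, and the choice of a Linear Discriminant Analysis model selected by AUC follows the abstract.

```python
# Train/test split, LDA training, and AUC-based evaluation on synthetic counts.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score, accuracy_score

rng = np.random.default_rng(42)
X = rng.poisson(2, size=(5000, 23)).astype(float)                  # reimbursement counts for 23 placeholder variables
y = (X[:, 0] + X[:, 1] + rng.normal(0, 2, 5000) > 6).astype(int)   # synthetic incident-diabetes label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
proba = lda.predict_proba(X_test)[:, 1]
print("AUC:      ", roc_auc_score(y_test, proba))        # criterion used to pick the best candidate model
print("Accuracy: ", accuracy_score(y_test, lda.predict(X_test)))
```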


2019 ◽  
Vol 5 (10) ◽  
pp. 2120-2130 ◽  
Author(s):  
Suraj Kumar ◽  
Thendiyath Roshni ◽  
Dar Himayoun

A reliable method of rainfall-runoff modelling is a prerequisite for the proper management and mitigation of extreme events such as floods. The objective of this paper is to contrast the hydrological performance of an Emotional Neural Network (ENN) and an Artificial Neural Network (ANN) for modelling rainfall-runoff in the Sone Command, Bihar, as this area experiences floods due to heavy rainfall. The ENN is a modified version of the ANN, as it includes neural parameters which enhance the network learning process. Selection of inputs is a crucial task for a rainfall-runoff model. This paper utilizes cross-correlation analysis for the selection of potential predictors. Three sets of input data, Set 1, Set 2 and Set 3, have been prepared using weather and discharge data of 2 raingauge stations and 1 discharge station located in the command for the period 1986-2014. Principal Component Analysis (PCA) has then been performed on the selected data sets to select the data sets showing the principal tendencies. The data sets obtained after PCA have then been used in the development of the ENN and ANN models. Performance indices were computed for the developed models for the three data sets. The results obtained from Set 2 showed that the ENN, with R = 0.933, R2 = 0.870, Nash-Sutcliffe = 0.8689, RMSE = 276.1359 and Relative Peak Error = 0.00879, outperforms the ANN in simulating the discharge. Therefore, the ENN model is suggested as the better model for rainfall-runoff discharge simulation in the Sone Command, Bihar.
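
The following sketch illustrates the input-selection chain described above (cross-correlation screening of lagged predictors, PCA, then a neural network), with synthetic series standing in for the Sone Command records; a standard MLP is used in place of the ANN/ENN models, whose emotional parameters are not reproduced here.

```python
# Cross-correlation screening of lagged predictors, PCA, then an MLP regressor.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 2000
rain1, rain2 = rng.gamma(2, 5, n), rng.gamma(2, 5, n)
discharge = 0.6 * np.roll(rain1, 1) + 0.3 * np.roll(rain2, 2) + rng.normal(0, 1, n)

# cross-correlation analysis: keep lagged predictors well correlated with discharge
candidates = {"rain1_lag1": np.roll(rain1, 1), "rain1_lag2": np.roll(rain1, 2),
              "rain2_lag1": np.roll(rain2, 1), "rain2_lag2": np.roll(rain2, 2)}
selected = [v for v in candidates.values()
            if abs(np.corrcoef(v, discharge)[0, 1]) > 0.2]
X = np.column_stack(selected)

# PCA retains the components carrying the principal tendencies of the inputs
X_pc = PCA(n_components=0.95).fit_transform(X)

Xtr, Xte, ytr, yte = train_test_split(X_pc, discharge, test_size=0.3, random_state=0)
ann = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000, random_state=0).fit(Xtr, ytr)
print("R2 on test set:", ann.score(Xte, yte))
```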


Entropy ◽  
2019 ◽  
Vol 21 (11) ◽  
pp. 1051
Author(s):  
Jerzy W. Grzymala-Busse ◽  
Zdzislaw S. Hippe ◽  
Teresa Mroczek

Results of experiments on numerical data sets discretized using two methods, global versions of Equal Frequency per Interval and Equal Interval Width, are presented. Globalization of both methods is based on entropy. For the discretized data sets, left and right reducts were computed. For each discretized data set, and for the two data sets based respectively on its left and right reducts, we applied ten-fold cross-validation using the C4.5 decision tree generation system. Our main objective was to compare the quality of all three types of data sets in terms of error rate. Additionally, we compared the complexity of the generated decision trees. We show that reduction of data sets may only increase the error rate and that the decision trees generated from the reduced data sets are not simpler than the decision trees generated from the non-reduced data sets.
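
As a point of reference, the sketch below shows simple local (per-attribute) versions of the two discretization ideas named above; the paper's methods are global and entropy-driven, which this illustration does not attempt to reproduce.

```python
# Equal Interval Width vs Equal Frequency per Interval binning of one attribute.
import numpy as np
import pandas as pd

values = pd.Series(np.random.default_rng(3).normal(0, 1, 1000))

equal_width = pd.cut(values, bins=4)        # Equal Interval Width: bins of equal length
equal_freq  = pd.qcut(values, q=4)          # Equal Frequency per Interval: bins with equal counts

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```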


2012 ◽  
Vol 52 (No. 4) ◽  
pp. 188-196 ◽  
Author(s):  
Y. Lei ◽  
S. Y Zhang

Forest modellers have long faced the problem of selecting an appropriate mathematical model to describe tree ontogenetic or size-shape empirical relationships for tree species. A common practice is to develop many models (or a model pool) that include different functional forms, and then to select the most appropriate one for a given data set. However, this process may impose subjective restrictions on the functional form. In this process, little attention is paid to the features of the different functional forms (e.g. asymptotic versus nonasymptotic behaviour, or the presence of an inflection point) and to the intrinsic curve of a given data set. In order to find a better way of comparing and selecting growth models, this paper describes and analyses the characteristics of the Schnute model. This model has a flexibility and versatility that have not been exploited in forestry. In this study, the Schnute model was applied to different data sets of selected forest species to determine their functional forms. The results indicate that the model shows some desirable properties for the examined data sets, and allows for discerning the different intrinsic curve shapes such as sigmoid, concave and other curve shapes. Since no suitable functional form for a given data set is usually known prior to the comparison of candidate models, it is recommended that the Schnute model be used as a first step to determine an appropriate functional form for the data set under investigation, in order to avoid imposing a functional form a priori.
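
For reference, the general case of the Schnute (1981) model (a ≠ 0, b ≠ 0) is usually written as below, where τ1 and τ2 are the first and last ages, y1 and y2 the sizes at those ages, and a and b shape parameters whose signs govern whether the curve is sigmoid, concave or otherwise; limiting cases of a or b recover several familiar growth forms.

```latex
% General (a \neq 0, b \neq 0) case of the Schnute (1981) growth model.
\[
  Y(t) = \left[ y_1^{\,b} + \left( y_2^{\,b} - y_1^{\,b} \right)
         \frac{1 - e^{-a (t - \tau_1)}}{1 - e^{-a (\tau_2 - \tau_1)}} \right]^{1/b}
\]
```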


2015 ◽  
Vol 8 (8) ◽  
pp. 2569-2586 ◽  
Author(s):  
M. Sprenger ◽  
H. Wernli

Abstract. Lagrangian trajectories are widely used in the atmospheric sciences, for instance to identify flow structures in extratropical cyclones (e.g., warm conveyor belts) and long-range transport pathways of moisture and trace substances. Here a new version of the Lagrangian analysis tool LAGRANTO (Wernli and Davies, 1997) is introduced, which offers considerably enhanced functionalities. Trajectory starting positions can be defined easily and flexibly based on different geometrical and/or meteorological conditions, e.g., equidistantly spaced within a prescribed region and on a stack of pressure (or isentropic) levels. After the computation of the trajectories, a versatile selection of trajectories is offered based on single or combined criteria. These criteria are passed to LAGRANTO with a simple command language (e.g., "GT:PV:2" readily translates into a selection of all trajectories with potential vorticity, PV, greater than 2 PVU; 1 PVU = 10⁻⁶ K m² kg⁻¹ s⁻¹). Full versions of the new LAGRANTO are available for global ECMWF and regional COSMO data, and core functionality is provided for the regional WRF and MetUM models and the global 20th Century Reanalysis data set. The paper first presents the intuitive application of LAGRANTO for the identification of a warm conveyor belt in the North Atlantic. A further case study then shows how LAGRANTO can be used to quasi-operationally diagnose stratosphere–troposphere exchange events. Whereas these examples rely on the ECMWF version, the COSMO version and input fields with 7 km horizontal resolution serve to resolve the rather complex flow structure associated with orographic blocking due to the Alps, as shown in a third example. A final example illustrates the tool's application in source–receptor analysis studies. The new distribution of LAGRANTO is publicly available and includes auxiliary tools, e.g., to visualize trajectories. A detailed user guide describes all LAGRANTO capabilities.

