GENERATING BALANCED LEARNING AND TEST SETS FOR FUNCTION APPROXIMATION PROBLEMS

2011 ◽  
Vol 21 (03) ◽  
pp. 247-263 ◽  
Author(s):  
J. P. FLORIDO ◽  
H. POMARES ◽  
I. ROJAS

In function approximation problems, one of the most common ways to evaluate a learning algorithm consists of partitioning the original data set (input/output data) into two sets: a learning set, used for building models, and a test set, used for genuine out-of-sample evaluation. When the partition into learning and test sets does not take into account the variability and geometry of the original data, it may lead to unbalanced and unrepresentative learning and test sets and, thus, to wrong conclusions about the accuracy of the learning algorithm. How the partitioning is made is therefore a key issue, and it becomes even more important when the data set is small, due to the need to reduce the pessimistic effects caused by the removal of instances from the original data set. Thus, in this work, we propose a deterministic data mining approach for distributing a data set (input/output data) into two representative and balanced sets of roughly equal size, taking the variability of the data set into consideration, with the aim of both allowing a fair evaluation of a learning algorithm's accuracy and making machine learning experiments, which are usually based on random splits, reproducible. The sets are generated using a combination of a clustering procedure, especially suited for function approximation problems, and a distribution algorithm which distributes the data set into two sets within each cluster based on a nearest-neighbor approach. In the experiments section, the performance of the proposed methodology is reported in a variety of situations through an ANOVA-based statistical study of the results.
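As a minimal illustration of this cluster-then-distribute idea (not the authors' exact procedure), the sketch below clusters the joint input/output data and, within each cluster, repeatedly pairs a point with its nearest unassigned neighbor, sending one point of each pair to each set. It assumes k-means and Euclidean distance as stand-ins; the paper's own clustering procedure is specialized for function approximation problems.

```python
# Hedged sketch of a deterministic cluster-then-distribute split; k-means is
# a stand-in for the paper's function-approximation-specific clustering.
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist

def balanced_split(X, y, n_clusters=10, seed=0):
    """Deterministically split (X, y) into two roughly equal, balanced sets."""
    Z = np.column_stack([X, y])           # cluster in joint input/output space
    labels = KMeans(n_clusters=n_clusters, random_state=seed,
                    n_init=10).fit_predict(Z)
    set_a, set_b = [], []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0].tolist()
        while len(idx) >= 2:
            i = idx.pop(0)                # take the first unassigned point...
            d = cdist(Z[[i]], Z[idx])[0]
            j = idx.pop(int(d.argmin()))  # ...and its nearest unassigned neighbor
            set_a.append(i)               # send one to each set, so both sets
            set_b.append(j)               # stay representative of the cluster
        set_a.extend(idx)                 # leftover odd point, if any
    return np.array(set_a), np.array(set_b)
```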

Entropy ◽  
2021 ◽  
Vol 23 (1) ◽  
pp. 126
Author(s):  
Sharu Theresa Jose ◽  
Osvaldo Simeone

Meta-learning, or “learning to learn”, refers to techniques that infer an inductive bias from data corresponding to multiple related tasks with the goal of improving the sample efficiency for new, previously unobserved, tasks. A key performance measure for meta-learning is the meta-generalization gap, that is, the difference between the average loss measured on the meta-training data and on a new, randomly selected task. This paper presents novel information-theoretic upper bounds on the meta-generalization gap. Two broad classes of meta-learning algorithms are considered that use either separate within-task training and test sets, like model-agnostic meta-learning (MAML), or joint within-task training and test sets, like Reptile. Extending the existing work on conventional learning, an upper bound on the meta-generalization gap is derived for the former class that depends on the mutual information (MI) between the output of the meta-learning algorithm and its input meta-training data. For the latter, the derived bound includes an additional MI between the output of the per-task learning procedure and the corresponding data set to capture within-task uncertainty. Tighter bounds are then developed for the two classes via novel individual task MI (ITMI) bounds. Applications of the derived bounds are finally discussed, including a broad class of noisy iterative algorithms for meta-learning.
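For orientation, the conventional bound being extended here is the mutual-information generalization bound of Xu and Raginsky (2017): for a σ-subgaussian loss, an algorithm with output W trained on a data set S of n samples satisfies

```latex
% Conventional MI generalization bound (Xu & Raginsky, 2017):
\[
\Bigl| \mathbb{E}\bigl[ L_{\mu}(W) - L_{S}(W) \bigr] \Bigr|
\;\le\; \sqrt{\frac{2\sigma^{2}}{n}\, I(W; S)} .
\]
```

The paper's bounds take this form at the meta level, with W replaced by the output of the meta-learner and S by the meta-training data of multiple tasks, plus, for the joint within-task train/test case, an additional per-task MI term; the exact statements and constants are as given in the paper.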


2020 ◽  
Vol 10 (1) ◽  
pp. 1-11
Author(s):  
Arvind Shrivastava ◽  
Nitin Kumar ◽  
Kuldeep Kumar ◽  
Sanjeev Gupta

The paper applies Random Forest, a popular machine learning classification algorithm, to predict bankruptcy (distress) for Indian firms. Random Forest orders firms according to their propensity to default or their likelihood to become distressed. This is also useful for explaining the association between the tendency of firm failure and its features. The results are analyzed vis-à-vis TreeNet. Both in-sample and out-of-sample estimations have been performed to compare Random Forest with TreeNet, a cutting-edge data mining tool known to provide satisfactory estimation results. An exhaustive data set comprising companies from varied sectors has been included in the analysis. It is found that the TreeNet procedure consistently provides improved classification and predictive performance vis-à-vis the Random Forest methodology, results that may be utilized further by industry analysts and researchers alike for predictive purposes.
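As a hedged sketch of this kind of analysis (not the paper's exact pipeline; the data file and column names below are hypothetical), a random forest can be fit and its predicted probabilities used to order firms by distress propensity:

```python
# Hedged sketch: ranking firms by distress propensity with a random forest.
# "firms.csv" and the "distressed" label are illustrative, not from the paper.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("firms.csv")                      # hypothetical data file
X, y = df.drop(columns="distressed"), df["distressed"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)
scores = rf.predict_proba(X_te)[:, 1]              # propensity to default
ranking = X_te.assign(score=scores).sort_values("score", ascending=False)
print(rf.feature_importances_)   # association between features and failure
```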


2017 ◽  
Author(s):  
Bernardo A. Mello ◽  
Yuhai Tu

Deciphering molecular mechanisms in biological systems from system-level input-output data is challenging, especially for complex processes that involve interactions among multiple components. Here, we study regulation of the multi-domain (P1-5) histidine kinase CheA by the MCP chemoreceptors. We develop a network model to describe the dynamics of the system, treating the receptor complex with CheW and the P3P4P5 domains of CheA as a regulated enzyme with two substrates, P1 and ATP. The model enables us to search the hypothesis space systematically for the simplest possible regulation mechanism consistent with the available data. Our analysis reveals a novel dual-regulation mechanism wherein, besides regulating ATP binding, the receptor activity has to regulate one other key reaction: either P1 binding or phosphotransfer between P1 and ATP. Furthermore, our study shows that the receptors only control kinetic rates of the enzyme without changing its equilibrium properties. Predictions are made for future experiments to distinguish between the two remaining dual-regulation mechanisms. This systems-biology approach of combining modeling with a large input-output data set should be applicable to studying other complex biological processes.
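Purely for intuition, and emphatically not the authors' model, a toy regulated two-substrate rate law can illustrate the qualitative finding that receptor activity modulates kinetic rates; all functional forms and parameter values below are illustrative assumptions:

```python
# Toy (hedged) version of a regulated two-substrate enzyme: receptor activity
# `a` scales only kinetic parameters, echoing the paper's finding that
# equilibrium properties are unchanged; the forms below are illustrative.
def phosphorylation_rate(P1, ATP, a, E=1.0,
                         kcat=10.0, K_P1=5.0, K_ATP=100.0):
    """Steady-state rate with activity-gated ATP binding and phosphotransfer."""
    k_eff = kcat * a                      # activity scales phosphotransfer rate
    K_ATP_eff = K_ATP / max(a, 1e-9)      # and the effective ATP-binding kinetics
    return E * k_eff * (P1 / (K_P1 + P1)) * (ATP / (K_ATP_eff + ATP))

for a in (0.1, 0.5, 1.0):                 # rate rises with receptor activity
    print(a, phosphorylation_rate(P1=2.0, ATP=1000.0, a=a))
```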


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Sarah Simmons ◽  
Grady Wier ◽  
Antonio Pedraza ◽  
Mark Stibich

Abstract Background The role of the environment in hospital-acquired infections is well established. We examined the impact of an environmental hygiene intervention using a pulsed xenon ultraviolet (PX-UV) disinfection system on the rate of hospital-onset Clostridioides difficile infection (HO-CDI) in 48 hospitals over a 5-year period. Methods Utilization data were collected directly from the automated PX-UV system and uploaded in real time to a database. HO-CDI data were provided by each facility. Data were analyzed at the unit level to determine compliance with disinfection protocols. The final data set included 5 years of data aggregated to the facility level, resulting in a data set of 48 hospitals covering January 2015–December 2019. Negative binomial regression with an offset on patient days, which converts infection counts into rates, was used to assess HO-CDI rates vs. intervention compliance rate, total successful disinfection cycles, and total rooms disinfected. The K-Nearest Neighbor (KNN) machine learning algorithm was used to compare intervention compliance and total intervention cycles to the presence of infection. Results All regression models depict a statistically significant inverse association between the intervention and HO-CDI rates. The KNN model predicts the presence or absence of infection with greater than 98% accuracy when considering both intervention compliance and total intervention cycles. Conclusions The findings of this study indicate a strong inverse relationship between utilization of the pulsed xenon intervention and HO-CDI rates.
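A hedged sketch of the two analyses, with a hypothetical data file and column names, might look like the following; the offset on log(patient days) is what turns infection counts into rates:

```python
# Hedged sketch: negative binomial regression of HO-CDI counts with a
# patient-days offset, plus a KNN classifier for presence of infection.
# "hospitals.csv" and all column names are illustrative assumptions.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv("hospitals.csv")                    # hypothetical data file

# Offset on log(patient days) converts infection counts into rates.
X = sm.add_constant(df[["compliance_rate", "total_cycles", "rooms_disinfected"]])
nb = sm.GLM(df["hocdi_count"], X,
            family=sm.families.NegativeBinomial(),
            offset=np.log(df["patient_days"])).fit()
print(nb.summary())                   # inverse association would appear here

# KNN on compliance and total cycles vs. presence/absence of infection.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(df[["compliance_rate", "total_cycles"]], df["hocdi_count"] > 0)
```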


2020 ◽  
Author(s):  
Ayesha Sania ◽  
Nicolò Pini ◽  
Morgan E. Nelson ◽  
Michael M. Myers ◽  
Lauren C. Shuffrey ◽  
...  

Abstract Background — Missing data are a source of bias in many epidemiologic studies. This is problematic in alcohol research, where data missingness may not be random, as it depends on patterns of drinking behavior. Methods — The Safe Passage Study was a prospective investigation of prenatal alcohol consumption and fetal/infant outcomes (n=11,083). Daily alcohol consumption for the last reported drinking day and the 30 days prior was recorded using the Timeline Followback method. Of 3.2 million person-days, data were missing for 0.36 million. We imputed missing exposure data using a machine learning algorithm, K-Nearest Neighbor (K-NN), which imputes missing values for a participant using data from the participants closest to it. Since participants with no missing days may not be comparable to those with missing data, segments from participants with complete and incomplete data were included as a reference. Imputed values were weighted by the distances from nearest neighbors and matched for day of week. We validated our approach by randomly deleting non-missing data for 5-15 consecutive days. Results — We found that data from 5 nearest neighbors (i.e., K=5) and segments of 55 days provided imputed values with the least imputation error. After deleting data segments from a first-trimester data set with no missing days, there was no difference between actual and predicted values for 64% of deleted segments. For 31% of the segments, imputed data were within +/-1 drink/day of the actual values. Conclusions — K-NN can be used to impute missing data in longitudinal studies of alcohol use during pregnancy with high accuracy.
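A minimal sketch of distance-weighted K-NN imputation using scikit-learn's generic imputer follows; the study's own procedure additionally matches on day of week and uses reference segments, which this sketch omits, and the toy data below are illustrative:

```python
# Hedged sketch of distance-weighted K-NN imputation on day-level drinking
# data; sklearn's generic KNNImputer stands in for the study's custom,
# day-of-week-matched procedure.
import numpy as np
from sklearn.impute import KNNImputer

# Rows = participants, columns = drinks/day over a 30-day recall window (toy).
drinks = np.array([[0, 2, 1, np.nan, 3, 0, 1] * 4 + [0, 1],
                   [1, 1, np.nan, np.nan, 2, 1, 0] * 4 + [1, 0],
                   [0, 0, 0, 1, 1, 0, 0] * 4 + [0, 0]], dtype=float)

imputer = KNNImputer(n_neighbors=2, weights="distance")  # distance-weighted
completed = imputer.fit_transform(drinks)

# Validation in the spirit of the paper: delete known values, impute, compare.
masked = drinks.copy()
masked[2, 3:6] = np.nan
error = np.abs(imputer.fit_transform(masked)[2, 3:6] - drinks[2, 3:6])
```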


2020 ◽  
Vol 9 (2) ◽  
pp. e188922128
Author(s):  
Fábio Nogueira da Silva ◽  
João Viana Fonseca Neto

A heuristic for parameter tuning, together with a convergence analysis, is presented for a reinforcement learning control algorithm with output feedback that uses only input/output data generated by a model. To support the convergence analysis, the parameters of the algorithms used for data generation must be adjusted and the control problem solved iteratively. A heuristic is proposed to adjust the data-generator parameters, creating surfaces that assist in the convergence and robustness analysis of the online optimal control methodology. The algorithm tested is the discrete linear quadratic regulator (DLQR) with output feedback, based on reinforcement learning through temporal difference learning in a policy iteration scheme, which determines the optimal policy using input/output data only. In the policy iteration algorithm, recursive least squares (RLS) is used to estimate online the parameters associated with the output-feedback DLQR. After applying the proposed tuning heuristics, the influence of the parameters can be seen clearly, and the convergence analysis is facilitated.
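The RLS estimator at the core of the policy iteration loop can be sketched as follows; the construction of the regressor vector from input/output data is problem-specific and omitted here:

```python
# Hedged sketch of the recursive least squares (RLS) update used inside
# policy iteration to estimate output-feedback DLQR parameters online.
import numpy as np

class RLS:
    def __init__(self, dim, lam=0.99, delta=1e3):
        self.theta = np.zeros(dim)        # parameter estimate
        self.P = delta * np.eye(dim)      # inverse correlation matrix
        self.lam = lam                    # forgetting factor (a tuning knob)

    def update(self, x, y):
        """One RLS step on regressor x and scalar target y."""
        Px = self.P @ x
        k = Px / (self.lam + x @ Px)      # gain vector
        self.theta += k * (y - x @ self.theta)
        self.P = (self.P - np.outer(k, Px)) / self.lam
        return self.theta
```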


Geophysics ◽  
2005 ◽  
Vol 70 (1) ◽  
pp. S1-S17 ◽  
Author(s):  
Alison E. Malcolm ◽  
Maarten V. de Hoop ◽  
Jérôme H. Le Rousseau

Reflection seismic data continuation is the computation of data at source and receiver locations that differ from those in the original data, using whatever data are available. We develop a general theory of data continuation in the presence of caustics and illustrate it with three examples: dip moveout (DMO), azimuth moveout (AMO), and offset continuation. This theory does not require knowledge of the reflector positions. We construct the output data set from the input through the composition of three operators: an imaging operator, a modeling operator, and a restriction operator. This results in a single operator that maps directly from the input data to the desired output data. We use the calculus of Fourier integral operators to develop this theory in the presence of caustics. For both DMO and AMO, we compute impulse responses in a constant-velocity model and in a more complicated model in which caustics arise. This analysis reveals errors that can be introduced by assuming, for example, a model with a constant vertical velocity gradient when the true model is laterally heterogeneous. Data continuation uses as input a subset (common offset, common angle) of the available data, which may introduce artifacts in the continued data. One could suppress these artifacts by stacking over a neighborhood of input data (using a small range of offsets or angles, for example). We test data continuation on synthetic data from a model known to generate imaging artifacts. We show that stacking over input scattering angles suppresses artifacts in the continued data.
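Schematically, the composed operator described above can be written as

```latex
% Continuation as a composition of imaging, modeling, and restriction:
\[
d_{\mathrm{out}} \;=\; \bigl( R \circ M \circ I \bigr)\, d_{\mathrm{in}},
\]
```

where \(I\) is the imaging operator, \(M\) the modeling operator, and \(R\) the restriction to the output acquisition geometry; the paper composes these into a single Fourier integral operator mapping input data directly to output data.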


2012 ◽  
Vol 220-223 ◽  
pp. 2264-2268 ◽  
Author(s):  
Dong Dong Wang ◽  
You Jun Chen ◽  
Hai Jie Pang

To solve function approximation problems, a mathematical model of Rational Function Functional Networks (RFFN) was proposed, together with a learning algorithm for function approximation. The algorithm follows the least-squares approach: an auxiliary function is constructed via the Lagrange multiplier method, and the parameters of the rational function functional network are determined by solving a system of linear equations. Results illustrate the effectiveness of rational function functional networks in solving approximation problems for functions with poles.
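In the same spirit, though not the paper's exact algorithm, multiplying through by the denominator turns a rational-function fit into a linear system solvable by least squares; the sketch below assumes polynomial numerator and denominator with the denominator's constant term fixed to 1:

```python
# Hedged sketch of fitting a rational function p(x)/q(x) by linearized least
# squares: the residual p(x_i) - y_i * q(x_i) is linear in the coefficients.
import numpy as np

def fit_rational(x, y, deg_p=2, deg_q=2):
    """Fit y ~ p(x)/q(x) with q's constant term fixed to 1."""
    A_p = np.vander(x, deg_p + 1, increasing=True)
    A_q = np.vander(x, deg_q + 1, increasing=True)[:, 1:]  # drop q0 (fixed = 1)
    A = np.hstack([A_p, -y[:, None] * A_q])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)            # linear system
    return coef[: deg_p + 1], np.concatenate([[1.0], coef[deg_p + 1:]])

x = np.linspace(0.1, 2.0, 50)
p, q = fit_rational(x, 1.0 / (x + 0.05))   # target with a pole at x = -0.05
```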


1997 ◽  
Vol 9 (6) ◽  
pp. 1381-1402 ◽  
Author(s):  
Kwabena Agyepong ◽  
Ravi Kothari

We investigate the effects of including selected lateral interconnections in a feedforward neural network. In a network with one hidden layer consisting of m hidden neurons labeled 1, 2, …, m, hidden neuron j is connected fully to the inputs, the outputs, and hidden neuron j + 1. As a consequence of the lateral connections, each hidden neuron receives two error signals: one from the output layer and one through the lateral interconnection. We show that the use of these lateral interconnections among the hidden-layer neurons facilitates controlled assignment of role and specialization of the hidden-layer neurons. In particular, we show that as training progresses, hidden neurons become progressively specialized, starting from the fringes (i.e., the lower- and higher-numbered hidden neurons, e.g., 1, 2, m − 1, m) and leaving the neurons in the center of the hidden layer (i.e., hidden-layer neurons numbered close to m/2) unspecialized or functionally identical. Consequently, the network behaves like network-growing algorithms, without the explicit need to add hidden units, and like soft weight sharing, due to the functionally identical neurons in the center of the hidden layer. Experimental results from one classification and one function approximation problem are presented to illustrate selective specialization of the hidden-layer neurons. In addition, the improved generalization that results from a decrease in the effective number of free parameters is illustrated through a simple function approximation example and with a real-world data set. Besides the reduction in the number of free parameters, the localization of weight sharing may also allow for procedural determination of the number of hidden-layer neurons required for a given learning task.
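A minimal forward-pass sketch of such a laterally connected hidden layer follows; the direction of the chain and all weights here are illustrative assumptions, and during backpropagation each hidden neuron would receive one error signal from the outputs and one through the lateral link:

```python
# Hedged sketch of a hidden layer with a lateral chain: each neuron sees the
# inputs plus its neighbor's activation (chain direction chosen for a simple
# left-to-right pass; the paper connects neuron j to neuron j + 1).
import numpy as np

def forward(x, W_in, w_lat, W_out):
    """W_in: (m, d) input weights, w_lat: (m,) lateral weights, W_out: (k, m)."""
    m = W_in.shape[0]
    h = np.zeros(m)
    prev = 0.0
    for j in range(m):                    # chain: each neuron also receives
        h[j] = np.tanh(W_in[j] @ x + w_lat[j] * prev)   # a neighbor's output
        prev = h[j]
    return W_out @ h                      # every hidden neuron feeds the outputs

rng = np.random.default_rng(0)
y = forward(rng.normal(size=4),
            rng.normal(size=(8, 4)), rng.normal(size=8),
            rng.normal(size=(2, 8)))
```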


2021 ◽  
Vol 2078 (1) ◽  
pp. 012034
Author(s):  
Xuemei Hou ◽  
Fei Gao ◽  
Jianping Wu ◽  
Minghui Wu

Abstract The traditional pathological grading of hepatocellular carcinoma (HCC) depends on biopsy, which causes damage to the patient's body and is not suitable for every patient. The purpose of this paper is to study the pathological grading of liver tumors on MRI images using a deep learning algorithm, so as to further improve the accuracy of HCC pathological grading. An improved network model based on SE-DenseNet is proposed. The nonlinear mapping relationship between feature channels is modeled and recalibrated using an attention mechanism, and rich deep features are extracted, so as to improve the feature expression ability of the network. The proposed method is verified on a data set of 197 patients, split into a training set of 130 patients and a test set of 67 patients. The experimental results are evaluated using the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC). The improved SE-DenseNet achieves good results, obtaining an AUC of 0.802 on the test set. The experimental results show that the proposed method can predict the pathological grade of HCC well.
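The recalibration step is the squeeze-and-excitation (SE) mechanism; a minimal sketch of an SE block of the kind SE-DenseNet interleaves with dense blocks is given below, with the reduction ratio and placement as assumptions rather than the paper's exact configuration:

```python
# Hedged sketch of a squeeze-and-excitation (SE) block: channel-wise
# attention models relationships between feature channels and recalibrates
# them; reduction ratio 16 is the common default, not the paper's setting.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # squeeze to bottleneck
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # excite back up
            nn.Sigmoid(),                                # per-channel weights
        )

    def forward(self, x):                                # x: (N, C, H, W)
        s = x.mean(dim=(2, 3))                           # global average pool
        w = self.fc(s).unsqueeze(-1).unsqueeze(-1)       # (N, C, 1, 1)
        return x * w                                     # recalibrate channels

feat = torch.randn(2, 64, 32, 32)
out = SEBlock(64)(feat)
```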

