Functional random forests for curve response

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Guifang Fu ◽  
Xiaotian Dai ◽  
Yeheng Liang

The rapid growth of functional data in many application fields has increased the demand for statistical approaches that can accommodate complex structures and nonlinear associations. In this article, we propose a novel functional random forests (FunFor) approach for modelling a functional response that is densely and regularly measured, extending the landmark work of Breiman, who introduced traditional random forests for a univariate response. FunFor predicts curve responses for new observations and selects important variables from a large set of scalar predictors. It inherits the efficiency of the traditional random forest approach in detecting complex relationships, including nonlinear and high-order interactions. It is also nonparametric, imposing no parametric or distributional assumptions. Eight simulation settings and one real-data analysis consistently demonstrate the excellent performance of FunFor across scenarios. In particular, FunFor ranks the true predictors as the most important variables, while achieving the most robust variable selection and the smallest prediction errors when compared with three other relevant approaches. Although motivated by a biological leaf shape data analysis, FunFor has great potential for wide application because it requires minimal tuning and is distribution-free and model-free. An R package named 'FunFor', implementing the approach, is available on GitHub.
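To make the idea of a tree split on a curve response concrete, here is a minimal sketch in Python. It is not the FunFor split rule, which the abstract does not spell out. It swaps in a generic criterion: for one scalar predictor, choose the threshold that minimises the summed squared deviation of each child's curves from that child's mean curve, with curves sampled on a common regular grid. The names `mean_curve`, `within_node_sse`, and `best_split` are hypothetical.

```python
def mean_curve(curves):
    """Pointwise mean of equal-length curves sampled on a common grid."""
    n = len(curves)
    return [sum(values) / n for values in zip(*curves)]


def within_node_sse(curves):
    """Sum over curves and grid points of squared deviation from the node mean curve."""
    if not curves:
        return 0.0
    mu = mean_curve(curves)
    return sum((y - m) ** 2 for curve in curves for y, m in zip(curve, mu))


def best_split(x, curves):
    """Return (cost, threshold) for the best split on one scalar predictor.

    x      : list of scalar predictor values, one per observation
    curves : list of curve responses (lists of equal length)
    Assumption: a generic squared-error criterion, not the FunFor criterion.
    """
    order = sorted(range(len(x)), key=lambda i: x[i])
    best = (float("inf"), None)
    for k in range(1, len(order)):
        lo, hi = x[order[k - 1]], x[order[k]]
        if lo == hi:
            continue  # no threshold separates tied predictor values
        left = [curves[i] for i in order[:k]]
        right = [curves[i] for i in order[k:]]
        cost = within_node_sse(left) + within_node_sse(right)
        if cost < best[0]:
            best = (cost, (lo + hi) / 2)
    return best


if __name__ == "__main__":
    x = [0.1, 0.2, 0.8, 0.9]
    curves = [[0, 0, 0], [0, 0, 0], [1, 1, 1], [1, 1, 1]]
    print(best_split(x, curves))
```

A forest would repeat this search over bootstrap samples and random subsets of predictors, then average the predicted curves. That part is omitted here.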

2018 ◽  
Vol 2018 ◽  
pp. 1-13 ◽  
Author(s):  
Laura Millán-Roures ◽  
Irene Epifanio ◽  
Vicente Martínez

A functional data analysis (FDA) based methodology for detecting anomalous flows in urban water networks is introduced. Primary hydraulic variables are recorded in real time by telecontrol systems, so they are functional data (FD). In the first stage, the data are validated (false data are detected) and reconstructed, since the data may be not only false but also missing or noisy. FDA tools such as tolerance bands for FD and smoothing for dense and sparse FD are used. In the second stage, functional outlier detection tools are used in two phases. In Phase I, the data are cleared of anomalies to ensure that they are representative of the in-control system. The objective of Phase II is system monitoring. A new functional outlier detection method based on archetypal analysis is also proposed. The methodology is applied and illustrated with real data. A simulation study is also carried out to assess the performance of the outlier detection techniques, including our proposal. The results are very promising.


Author(s):  
Pedro M. Esperança ◽  
Dari F. Da ◽  
Ben Lambert ◽  
Roch K. Dabiré ◽  
Thomas S. Churcher

Near infrared spectroscopy is increasingly being used as an economical method to monitor mosquito vector populations in support of disease control. Despite this rise in popularity, strong geographical variation in spectra has made it difficult to generalise predictions from one location to another. Here, we use a functional data analysis approach, which models spectra as smooth curves rather than as a discrete set of points, to develop a method that is robust to geographic heterogeneity. Specifically, we use a penalised generalised linear modelling framework which includes efficient functional representation of spectra, spectral smoothing and regularisation. To ensure better generalisation of model predictions from one training set to another, we use cross-validation procedures favouring smoother representation of spectra. To illustrate the performance of our approach, we collected spectra for field-caught specimens of Anopheles gambiae complex mosquitoes (the most epidemiologically important vector species on the planet) at two sites in Burkina Faso. Using these spectra, we show how models trained on data from one site can successfully classify morphologically identical sibling species at another site, over 250 km away. Whilst we apply our framework to species prediction, our unified statistical framework can, alternatively, handle regression analysis (for example, to determine mosquito age) and other types of multinomial classification (for example, to determine infection status). To make our methods readily available to field entomologists, we have created an open-source R package, mlevcm. All data used are also publicly available.


Mathematics ◽  
2021 ◽  
Vol 9 (23) ◽  
pp. 3074
Author(s):  
Cristian Preda ◽  
Quentin Grimonprez ◽  
Vincent Vandewalle

Categorical functional data represented by paths of a stochastic jump process with continuous time and a finite set of states are considered. As an extension of the multiple correspondence analysis to an infinite set of variables, optimal encodings of states over time are approximated using an arbitrary finite basis of functions. This allows dimension reduction, optimal representation, and visualisation of data in lower dimensional spaces. The methodology is implemented in the cfda R package and is illustrated using a real data set in the clustering framework.
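As a small illustration of the data object described above, the following Python sketch stores one path of a continuous-time jump process as sorted jump times plus the state entered at each jump. It then evaluates the path at a time point and sums the time spent in each state. It does not implement the optimal-encoding method and does not use the cfda package; the function names `state_at` and `time_in_state` are hypothetical.

```python
from bisect import bisect_right


def state_at(times, states, t):
    """State occupied at time t.

    times  : sorted jump times; times[0] is the start of observation
    states : states[i] is the state entered at times[i]
    """
    i = bisect_right(times, t) - 1
    if i < 0:
        raise ValueError("t is before the start of the path")
    return states[i]


def time_in_state(times, states, t_end):
    """Total time spent in each state over [times[0], t_end]."""
    totals = {}
    for i, (start, state) in enumerate(zip(times, states)):
        stop = times[i + 1] if i + 1 < len(times) else t_end
        stop = min(stop, t_end)  # truncate the sojourn at the end of observation
        if stop > start:
            totals[state] = totals.get(state, 0.0) + (stop - start)
    return totals


if __name__ == "__main__":
    times = [0.0, 2.0, 5.0]
    states = ["A", "B", "A"]
    print(state_at(times, states, 3.0))
    print(time_in_state(times, states, 10.0))
```

Per-state sojourn times like these are a simple summary of a path. The encodings discussed in the abstract go further, assigning each state a time-varying numeric value expanded in a finite basis.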


Author(s):  
Timothy McMurry ◽  
Dimitris Politis

This article examines the current state of methodological and practical developments for resampling inference techniques in functional data analysis, paying special attention to situations where the data, the parameters being estimated, or both take values in a space of functions. It first provides the basic background and notation before discussing bootstrap results from nonparametric smoothing, taking into account confidence bands in density estimation as well as confidence bands in nonparametric regression and autoregression. It then considers the major results in subsampling and what is known about bootstraps, along with a few recent real-data applications of bootstrapping with functional data. Finally, it highlights possible directions for further research and exploration.


2020 ◽  
Vol 45 (6) ◽  
pp. 719-749
Author(s):  
Eduardo Doval ◽  
Pedro Delicado

We propose new methods for identifying and classifying aberrant response patterns (ARPs) by means of functional data analysis. These methods take the person response function (PRF) of an individual and compare it with the pattern that would correspond to a generic individual of the same ability according to the item-person response surface. ARPs correspond to atypical difference functions. The ARP classification is done with functional data clustering applied to the PRFs identified as ARPs. We apply these methods to two sets of simulated data (the first is used to illustrate the ARP identification methods and the second demonstrates classification of the response patterns flagged as ARPs) and a real data set (a Grade 12 science assessment test, SAT, with 32 items answered by 600 examinees). For comparative purposes, ARPs are also identified with three nonparametric person-fit indices (Ht, Modified Caution Index, and ZU3). Our results indicate that the ARP detection ability of one of our proposed methods is comparable to that of person-fit indices. Moreover, the proposed classification methods enable ARPs associated with either spuriously low or spuriously high scores to be distinguished.


Methodology ◽  
2021 ◽  
Vol 17 (2) ◽  
pp. 149-167
Author(s):  
Mark Stemmler ◽  
Jörg-Henrik Heine ◽  
Susanne Wallner

Configural Frequency Analysis (CFA) is a useful statistical method for the analysis of multiway contingency tables and an appropriate tool for person-oriented or person-centered methods. In complex contingency tables, patterns or configurations are analyzed by comparing observed cell frequencies with expected frequencies. Significant differences between observed and expected frequencies lead to the emergence of Types and Antitypes. Types are patterns or configurations that are observed significantly more often than expected; Antitypes are configurations that are observed significantly less often than expected. The R package confreq is easy-to-use software for conducting CFAs; another useful shareware program for running CFAs was developed by Alexander von Eye. Here, CFA is presented based on the log-linear modeling approach. CFA may be used together with interval-level variables, which can be added as covariates to the design matrix. In this article, a real data example and the use of confreq are presented. In sum, the use of a covariate may bring the estimated cell frequencies closer to the observed cell frequencies. In those cases, the number of Types or Antitypes may decrease. However, in rare cases, the Type-Antitype pattern can change, with new Types or Antitypes emerging.
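The comparison of observed and expected cell frequencies can be sketched in a few lines of Python. This is only an illustration under stated assumptions and does not reproduce the confreq API: expected frequencies come from the main-effects (independence) model, each cell is tested with a standardized residual z = (o - e) / sqrt(e), and a Bonferroni-adjusted two-sided normal critical value is used. The function name `cfa_first_order` and the example table are hypothetical.

```python
from itertools import product
from math import sqrt
from statistics import NormalDist


def cfa_first_order(observed, alpha=0.05):
    """Label each configuration as Type, Antitype, or neither.

    observed : dict mapping configuration tuples (one level per variable)
               to observed counts; missing configurations count as 0.
    Assumptions: independence base model, z-test per cell,
    Bonferroni adjustment across all cells.
    """
    n = sum(observed.values())
    dims = len(next(iter(observed)))

    # Marginal totals for every level of every variable.
    margins = [dict() for _ in range(dims)]
    for config, count in observed.items():
        for d, level in enumerate(config):
            margins[d][level] = margins[d].get(level, 0) + count

    cells = list(product(*(sorted(m) for m in margins)))
    z_crit = NormalDist().inv_cdf(1 - alpha / (2 * len(cells)))

    results = {}
    for config in cells:
        # Expected count under independence: n times the product of marginal proportions.
        expected = n
        for d, level in enumerate(config):
            expected *= margins[d][level] / n
        o = observed.get(config, 0)
        z = (o - expected) / sqrt(expected)
        if z > z_crit:
            label = "Type"
        elif z < -z_crit:
            label = "Antitype"
        else:
            label = "-"
        results[config] = (o, expected, z, label)
    return results


if __name__ == "__main__":
    table = {("a", "x"): 40, ("a", "y"): 10, ("b", "x"): 10, ("b", "y"): 40}
    for config, row in cfa_first_order(table).items():
        print(config, row)
```

Adding a covariate, as the abstract describes, would change the expected frequencies by extending the design matrix of the log-linear model. That requires fitting the model iteratively and is outside this sketch.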


Geophysics ◽  
2021 ◽  
pp. 1-48
Author(s):  
Leonardo Azevedo

In subsurface modelling and characterization, predicting the spatial distribution of subsurface elastic properties is commonly achieved by seismic inversion. Stochastic seismic inversion methods, such as iterative geostatistical seismic inversion, are widely applied to this end. Global iterative geostatistical seismic inversion methods are computationally expensive because, at each iteration, they require multiple stochastic sequential simulations of the entire inversion grid. Functional data analysis is a well-established statistical method suited to modelling long and noisy time series. It allows spatiotemporal series to be summarized by a set of analytical functions with a low-dimensional representation. Functional data analysis has recently been extended to problems in the geosciences, but its application to geophysics is still limited. We propose the use of functional data analysis as a model reduction technique during the model perturbation step in global iterative geostatistical seismic inversion. Functional data analysis is used to collapse the vertical dimension of the inversion grid. We illustrate the proposed hybrid inversion method with its application to three-dimensional synthetic and real data sets. The results show that the proposed inversion methodology can predict smooth inverted subsurface models that match the observed data with convergence similar to that of a global iterative geostatistical seismic inversion, but at a considerably lower computational cost. While the resolution of the inverted models might not be sufficient for detailed subsurface characterization, the inverted models can be used as a starting point for global iterative geostatistical seismic inversion to speed up the inversion, or to test alternative geological scenarios by changing the inversion parameterization and obtaining inverted models in a relatively short time.


2020 ◽  
Vol 36 (14) ◽  
pp. 4222-4224
Author(s):  
Zhong Wang ◽  
Nating Wang ◽  
Zilu Wang ◽  
Libo Jiang ◽  
Yaqun Wang ◽  
...  

Summary: Genome-wide association studies (GWAS), particularly those designed with thousands and thousands of single-nucleotide polymorphisms (SNPs) (big p) genotyped on tens of thousands of subjects (small n), face the major challenge of p ≫ n. Although the integration of longitudinal information can significantly enhance the power of a GWAS to comprehend the genetic architecture of complex traits and diseases, an additional challenge is generated by an autocorrelative process. We have developed several statistical models for addressing these two challenges by implementing dimension reduction methods and longitudinal data analysis. To make these models computationally accessible to applied geneticists, we wrote an R package, HiGwas, designed to analyze longitudinal GWAS datasets. Functions in the package encompass single SNP analyses, significance-level adjustment, preconditioning and model selection for a high-dimensional set of SNPs. HiGwas provides the estimates of genetic parameters and the confidence intervals of these estimates. We demonstrate the features of HiGwas through real data analysis and a vignette document in the package. Availability and implementation: https://github.com/wzhy2000/higwas. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

