Unique variable analysis: A novel approach for detecting redundant variables in multivariate data

2020 ◽  
Author(s):  
Alexander P. Christensen ◽  
Luis Eduardo Garrido ◽  
Hudson Golino

One common approach for constructing tests that measure a single attribute is the semantic similarity approach, where items vary slightly in their wording and content. Despite being an effective strategy for ensuring high internal consistency, the information in tests may become redundant or, worse, confound the interpretation of the test scores. With the advent of network models, where tests represent a complex system and components (usually items) represent causally autonomous features, redundant variables may have inadvertent effects on the interpretation of their metrics. These issues motivated the development of a novel approach called Unique Variable Analysis (UVA), which detects redundant variables in multivariate data. The goal of UVA is to statistically identify potential redundancies in multivariate data so that researchers can make decisions about how best to handle them. Using a Monte Carlo simulation approach, we generated multivariate data with redundancies that were based on examples of known real-world redundancies. We then demonstrated the effects that redundancy can have on the accurate estimation of dimensions. Next, we evaluated UVA’s ability to detect redundant variables in the simulated data. Based on these results, we provide a tutorial for how to apply UVA to real-world data. Our example data demonstrate that redundant variables create inaccurate estimates of dimensional structure, but after UVA is applied the expected structure can be recovered. In sum, our study suggests that redundancy can have substantial effects on validity if left unchecked and that redundancy assessment should be integrated into standard validation practices.
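To make the idea of statistically screening for redundant items concrete, the sketch below flags item pairs whose responses are almost perfectly correlated. This is a deliberately simple stand-in and not the UVA statistic itself: the function names `pearson` and `flag_redundant`, the cutoff of 0.9, and the toy responses are all assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length score lists.

    Assumes both lists have non-zero variance (otherwise this divides by zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def flag_redundant(columns, cutoff=0.9):
    """Return (item_a, item_b, r) for every pair with |r| >= cutoff.

    columns: dict mapping an item name to its list of responses.
    cutoff: illustrative threshold, not a value taken from the paper.
    """
    flagged = []
    for (name_a, xs), (name_b, ys) in combinations(columns.items(), 2):
        r = pearson(xs, ys)
        if abs(r) >= cutoff:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Toy responses: item2 is a near-duplicate of item1, item3 is unrelated.
items = {
    "item1": [1, 2, 3, 4, 5, 2, 4],
    "item2": [1, 2, 3, 4, 5, 2, 5],
    "item3": [3, 1, 4, 1, 5, 2, 2],
}
print(flag_redundant(items))
```

A flagged pair is only a candidate for redundancy; deciding whether to drop or combine the items remains a judgment for the researcher, as the abstract stresses.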

Author(s):  
Marcelo N. de Sousa ◽  
Ricardo Sant’Ana ◽  
Rigel P. Fernandes ◽  
Julio Cesar Duarte ◽  
José A. Apolinário ◽  
...  

In outdoor RF localization systems, particularly where line of sight cannot be guaranteed or where multipath effects are severe, information about the terrain may improve the position estimate. Given the difficulties in obtaining real data, a ray-tracing fingerprint is a viable option. Nevertheless, despite good simulation results, systems trained only on simulated features suffer degraded performance when processing real-life data. This work aims to improve localization accuracy using ray-tracing fingerprints and a small amount of field data obtained from an adverse environment where collecting a large number of measurements is not an option. We employ a machine learning (ML) algorithm to explore the multipath information. We selected the random forest and gradient boosting algorithms, both considered efficient tools in the literature. In a strict simulation scenario (simulated data for training, validating, and testing), we obtained the same good results found in the literature (error around 2 m). In a real-world system (simulated data for training, real data for validating and testing), both ML algorithms resulted in a mean positioning error around 100 m. We have also obtained experimental results for noisy (artificially added Gaussian noise) and mismatched (with a null subset of) features. The simulations carried out in this work revealed that augmenting the ML model with a small amount of real-world data improves overall localization performance. Of the ML algorithms employed herein, we also observed that, under noisy conditions, the random forest algorithm achieved a slightly better result than the gradient boosting algorithm. However, they achieved similar results in a mismatch experiment.
This work’s practical implication is that multipath information, once rejected in old localization techniques, now represents a significant source of information whenever we have prior knowledge to train the ML algorithm.


2020 ◽  
Vol 19 (2) ◽  
pp. 21-35
Author(s):  
Ryan Beal ◽  
Timothy J. Norman ◽  
Sarvapali D. Ramchurn

This paper outlines a novel approach to optimising teams for Daily Fantasy Sports (DFS) contests. To this end, we propose a number of new models and algorithms to solve the team formation problems posed by DFS. Specifically, we focus on the National Football League (NFL) and predict the performance of real-world players to form the optimal fantasy team using mixed-integer programming. We test our solutions using real-world data-sets from across four seasons (2014-2017). We highlight the advantage that can be gained from using our machine-based methods and show that our solutions outperform existing benchmarks, turning a profit in up to 81.3% of DFS game-weeks over a season.
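The team formation problem can be read as a constrained selection: pick players that fill fixed position slots, stay under a salary cap, and maximise total predicted points. The sketch below solves a toy instance by exhaustive enumeration rather than with a mixed-integer programming solver, so it only illustrates the constraints and objective. The function `best_team`, the player names, salaries, projections, and the cap are all hypothetical.

```python
from itertools import combinations


def best_team(players, cap, slots):
    """Exhaustively search for the feasible team with the most projected points.

    players: list of (name, position, salary, projected_points) tuples.
    cap: maximum total salary allowed.
    slots: dict mapping a position to the number of players required there.
    Returns (total_points, team), or None if no team satisfies the constraints.
    """
    size = sum(slots.values())
    best = None
    for team in combinations(players, size):
        # Position constraint: the team must fill every slot exactly.
        counts = {}
        for _, position, _, _ in team:
            counts[position] = counts.get(position, 0) + 1
        if counts != slots:
            continue
        # Budget constraint: total salary must not exceed the cap.
        if sum(p[2] for p in team) > cap:
            continue
        points = sum(p[3] for p in team)
        if best is None or points > best[0]:
            best = (points, team)
    return best


# Hypothetical player pool: (name, position, salary, projected points).
pool = [
    ("QB_A", "QB", 8000, 22.0),
    ("QB_B", "QB", 6000, 18.0),
    ("RB_A", "RB", 7000, 17.0),
    ("RB_B", "RB", 5000, 12.0),
    ("WR_A", "WR", 7500, 16.0),
    ("WR_B", "WR", 4500, 11.0),
]
points, team = best_team(pool, cap=20000, slots={"QB": 1, "RB": 1, "WR": 1})
print(points, [p[0] for p in team])  # 50.0 ['QB_A', 'RB_A', 'WR_B']
```

Enumeration grows combinatorially with the pool size, which is why a real contest with hundreds of players calls for an integer programming formulation of the same constraints.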


2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Charles Marks ◽  
Arash Jahangiri ◽  
Sahar Ghanipoor Machiani

Every year, over 50 million people are injured and 1.35 million die in traffic accidents. Risky driving behaviors are responsible for over half of all fatal vehicle accidents. Identifying risky driving behaviors within real-world driving (RWD) datasets is a promising avenue to reduce the mortality burden associated with these unsafe behaviors, but numerous technical hurdles must be overcome to do so. Herein, we describe the implementation of a multistage process for classifying unlabeled RWD data as potentially risky or not. In the first stage, data are reformatted and reduced in preparation for classification. In the second stage, subsets of the reformatted data are labeled as potentially risky (or not) using the Iterative-DBSCAN method. In the third stage, the labeled subsets are used to fit random forest (RF) classification models; RF models were chosen after they were found to perform better than logistic regression and artificial neural network models. In the final stage, the RF models are used predictively to label the remaining RWD data as potentially risky (or not). The implementation of each stage is described and analyzed for the classification of RWD data from vehicles on public roads in Ann Arbor, Michigan. Overall, we identified 22.7 million observations of potentially risky driving out of 268.2 million observations. This study provides a novel approach for identifying potentially risky driving behaviors within RWD datasets. As such, this study represents an important step in the implementation of protocols designed to address and prevent the harms associated with risky driving.


Author(s):  
Juheng Zhang ◽  
Xiaoping Liu ◽  
Xiao-Bai Li

We study strategically missing data problems in predictive analytics with regression. In many real-world situations, such as financial reporting, college admission, job application, and marketing advertisement, data providers often conceal certain information on purpose in order to gain a favorable outcome. It is important for the decision-maker to have a mechanism to deal with such strategic behaviors. We propose a novel approach to handle strategically missing data in regression prediction. The proposed method derives imputation values of strategically missing data based on the Support Vector Regression models. It provides incentives for the data providers to disclose their true information. We show that with the proposed method imputation errors for the missing values are minimized under some reasonable conditions. An experimental study on real-world data demonstrates the effectiveness of the proposed approach.


2020 ◽  
Author(s):  
Leonardo Andrade Ribeiro ◽  
Felipe Ferreira Borges ◽  
Diego Junior do Carmo Oliveira

Set similarity join, which finds all pairs of similar sets in a collection, plays an important role in data cleaning and integration. Many algorithms have been proposed to efficiently answer set similarity join on single-attribute data. However, real-world data often contain multiple attributes. In this paper, we propose a framework to enhance existing algorithms with additional filters for dealing with multi-attribute data. We then present a simple, yet effective filter based on lightweight indexes, for which exact and probabilistic implementation alternatives are evaluated. Finally, we devise a cost model to identify the best attribute ordering to reduce processing time. Our experimental results show that our approach is effective and significantly outperforms previous work.
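As a point of reference for what a multi-attribute set similarity join computes, the sketch below is a naive nested-loop baseline: a pair of records qualifies only if every attribute meets its own Jaccard threshold, and a cheap size check lets the loop skip some intersections. It does not implement the lightweight-index filter or the cost model described above. The function names, the thresholds, and the sample records are assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def multi_attribute_join(records, thresholds):
    """Return index pairs (i, j) that are similar on every attribute.

    records: list of tuples of frozensets, one frozenset per attribute.
    thresholds: one Jaccard threshold per attribute (illustrative values).
    """
    results = []
    for (i, r), (j, s) in combinations(enumerate(records), 2):
        for a, b, t in zip(r, s, thresholds):
            # Size filter: Jaccard(a, b) <= min(|a|, |b|) / max(|a|, |b|),
            # so a large size gap rules the pair out without intersecting.
            small, large = sorted((len(a), len(b)))
            if large and small / large < t:
                break
            if jaccard(a, b) < t:
                break
        else:
            # No attribute failed, so the pair joins.
            results.append((i, j))
    return results


# Each record has two set-valued attributes: title tokens and tag tokens.
records = [
    (frozenset({"data", "cleaning", "join"}), frozenset({"db", "2020"})),
    (frozenset({"data", "cleaning", "joins"}), frozenset({"db", "2020"})),
    (frozenset({"data", "cleaning", "join"}), frozenset({"1999"})),
]
print(multi_attribute_join(records, thresholds=(0.5, 0.5)))  # [(0, 1)]
```

Because the loop stops at the first failing attribute, checking the most selective attribute first avoids work on the others, which is the intuition behind choosing an attribute ordering.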


2020 ◽  
Vol 7 (1) ◽  
Author(s):  
Sean Deering ◽  
Abhishek Pratap ◽  
Christine Suver ◽  
A. Joseph Borelli ◽  
Adam Amdur ◽  
...  

Conducting biomedical research using smartphones is a novel approach to studying health and disease that is only beginning to be meaningfully explored. Gathering large-scale, real-world data to track disease manifestation and long-term trajectory in this manner is quite practical and largely untapped. Researchers can assess large study cohorts using surveys and sensor-based activities that can be interspersed with participants’ daily routines. In addition, this approach offers a medium for researchers to collect contextual and environmental data via device-based sensors, data aggregator frameworks, and connected wearable devices. The main aim of the SleepHealth Mobile App Study (SHMAS) was to gain a better understanding of the relationship between sleep habits and daytime functioning utilizing a novel digital health approach. Secondary goals included assessing the feasibility of a fully-remote approach to obtaining clinical characteristics of participants, evaluating data validity, and examining user retention patterns and data-sharing preferences. Here, we provide a description of data collected from 7,250 participants living in the United States who chose to share their data broadly with the study team and qualified researchers worldwide.


2020 ◽  
pp. 001316442092656
Author(s):  
Yutian T. Thompson ◽  
Hairong Song ◽  
Dexin Shi ◽  
Zhengkui Liu

Conventional approaches for selecting a reference indicator (RI) could lead to misleading results in testing for measurement invariance (MI). Several newer quantitative methods have become available for more rigorous RI selection. However, it is still unknown how well these methods perform in terms of correctly identifying a truly invariant item to be an RI. Thus, Study 1 was designed to address this issue in various conditions using simulated data. As a follow-up, Study 2 further investigated the advantages and disadvantages of using RI-based approaches for MI testing in comparison with non-RI-based approaches. Altogether, the two studies provided a solid examination of how RI matters in MI tests. In addition, a large sample of real-world data was used to empirically compare the uses of the RI selection methods as well as the RI-based and non-RI-based approaches for MI testing. In the end, we offered a discussion on all these methods, followed by suggestions and recommendations for applied researchers.


Author(s):  
Arjan Voogt ◽  
Harish Pillai ◽  
Robert Seah

Due to the resonance behavior of roll motions, roll damping is an important consideration for vessel motions and associated extreme and fatigue loading on the hull, topsides and risers of an FPSO. In many cases radiation damping is limited and passive damping devices such as bilge keels are installed to spur viscous eddies and hence limit the roll motions. This contributes nonlinear damping to an already complex problem. Designers often rely on model tests to assess this damping. Based on test results, empirical and semi-empirical estimation models have been developed for different ship types and are available in current literature, but examples of benchmark validation with real-world data are limited. These benchmarks are often hindered by uncertainty in the observed weather conditions, vessel loading conditions and vessel heading with respect to the waves. This paper discusses these challenges and introduces a novel approach used to characterize the actual roll damping for an FPSO under real-world conditions. The assumptions, methodology and results will be discussed in this paper. In this study, five years of hindcast weather data are examined along with FPSO heading and roll motion measurements. The roll damping characteristics of this FPSO were expected to change over the course of the measurements, and the study documents the actual variation of roll damping under various conditions over this period.

