scholarly journals Wrangling categorical data in R

Author(s):  
Amelia McNamara ◽  
Nicholas J Horton

Data wrangling is a critical foundation of data science, and wrangling of categorical data is an important component of this process. However, categorical data can introduce unique issues in data wrangling, particularly in real-world settings with collaborators and periodically-updated dynamic data. This paper discusses common problems arising from categorical variable transformations in R, demonstrates the use of factors, and suggests approaches to address data wrangling challenges. For each problem, we present at least two strategies for management, one in base R and the other from the ‘tidyverse.’ We consider several motivating examples, suggest defensive coding strategies, and outline principles for data wrangling to help ensure data quality and sound analysis.

2017 ◽  
Author(s):  
Amelia McNamara ◽  
Nicholas J Horton

Data wrangling is a critical foundation of data science, and wrangling of categorical data is an important component of this process. However, categorical data can introduce unique issues in data wrangling, particularly in real-world settings with collaborators and periodically-updated dynamic data. This paper discusses common problems arising from categorical variable transformations in R, demonstrates the use of factors, and suggests approaches to address data wrangling challenges. For each problem, we present at least two strategies for management, one in base R and the other from the ‘tidyverse.’ We consider several motivating examples, suggest defensive coding strategies, and outline principles for data wrangling to help ensure data quality and sound analysis.


2017 ◽  
Author(s):  
Amelia McNamara ◽  
Nicholas J Horton

Data wrangling is a critical foundation of data science, and wrangling of categorical data is an important component of this process. However, categorical data can introduce unique issues in data wrangling, particularly in real-world settings with collaborators and periodically-updated dynamic data. This paper discusses common problems arising from categorical variable transformations in R, demonstrates the use of factors, and suggests approaches to address data wrangling challenges. For each problem, we present at least two strategies for management, one in base R and the other from the ‘tidyverse.’ We consider several motivating examples, suggest defensive coding strategies, and outline principles for data wrangling to help ensure data quality and sound analysis.


2021 ◽  
pp. 1-27 ◽  
Author(s):  
Brandon de la Cuesta ◽  
Naoki Egami ◽  
Kosuke Imai

Abstract Conjoint analysis has become popular among social scientists for measuring multidimensional preferences. When analyzing such experiments, researchers often focus on the average marginal component effect (AMCE), which represents the causal effect of a single profile attribute while averaging over the remaining attributes. What has been overlooked, however, is the fact that the AMCE critically relies upon the distribution of the other attributes used for the averaging. Although most experiments employ the uniform distribution, which equally weights each profile, both the actual distribution of profiles in the real world and the distribution of theoretical interest are often far from uniform. This mismatch can severely compromise the external validity of conjoint analysis. We empirically demonstrate that estimates of the AMCE can be substantially different when averaging over the target profile distribution instead of uniform. We propose new experimental designs and estimation methods that incorporate substantive knowledge about the profile distribution. We illustrate our methodology through two empirical applications, one using a real-world distribution and the other based on a counterfactual distribution motivated by a theoretical consideration. The proposed methodology is implemented through an open-source software package.


2021 ◽  
Vol 22 (S6) ◽  
Author(s):  
Yasmine Mansour ◽  
Annie Chateau ◽  
Anna-Sophie Fiston-Lavier

Abstract Background Meiotic recombination is a vital biological process playing an essential role in genome's structural and functional dynamics. Genomes exhibit highly various recombination profiles along chromosomes associated with several chromatin states. However, eu-heterochromatin boundaries are not available nor easily provided for non-model organisms, especially for newly sequenced ones. Hence, we miss accurate local recombination rates necessary to address evolutionary questions. Results Here, we propose an automated computational tool, based on the Marey maps method, allowing to identify heterochromatin boundaries along chromosomes and estimating local recombination rates. Our method, called BREC (heterochromatin Boundaries and RECombination rate estimates) is non-genome-specific, running even on non-model genomes as long as genetic and physical maps are available. BREC is based on pure statistics and is data-driven, implying that good input data quality remains a strong requirement. Therefore, a data pre-processing module (data quality control and cleaning) is provided. Experiments show that BREC handles different markers' density and distribution issues. Conclusions BREC's heterochromatin boundaries have been validated with cytological equivalents experimentally generated on the fruit fly Drosophila melanogaster genome, for which BREC returns congruent corresponding values. Also, BREC's recombination rates have been compared with previously reported estimates. Based on the promising results, we believe our tool has the potential to help bring data science into the service of genome biology and evolution. We introduce BREC within an R-package and a Shiny web-based user-friendly application yielding a fast, easy-to-use, and broadly accessible resource. The BREC R-package is available at the GitHub repository https://github.com/GenomeStructureOrganization.


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Yiqing Zhao ◽  
Saravut J. Weroha ◽  
Ellen L. Goode ◽  
Hongfang Liu ◽  
Chen Wang

Abstract Background Next-generation sequencing provides comprehensive information about individuals’ genetic makeup and is commonplace in oncology clinical practice. However, the utility of genetic information in the clinical decision-making process has not been examined extensively from a real-world, data-driven perspective. Through mining real-world data (RWD) from clinical notes, we could extract patients’ genetic information and further associate treatment decisions with genetic information. Methods We proposed a real-world evidence (RWE) study framework that incorporates context-based natural language processing (NLP) methods and data quality examination before final association analysis. The framework was demonstrated in a Foundation-tested women cancer cohort (N = 196). Upon retrieval of patients’ genetic information using NLP system, we assessed the completeness of genetic data captured in unstructured clinical notes according to a genetic data-model. We examined the distribution of different topics regarding BRCA1/2 throughout patients’ treatment process, and then analyzed the association between BRCA1/2 mutation status and the discussion/prescription of targeted therapy. Results We identified seven topics in the clinical context of genetic mentions including: Information, Evaluation, Insurance, Order, Negative, Positive, and Variants of unknown significance. Our rule-based system achieved a precision of 0.87, recall of 0.93 and F-measure of 0.91. Our machine learning system achieved a precision of 0.901, recall of 0.899 and F-measure of 0.9 for four-topic classification and a precision of 0.833, recall of 0.823 and F-measure of 0.82 for seven-topic classification. We found in result-containing sentences, the capture of BRCA1/2 mutation information was 75%, but detailed variant information (e.g. variant types) is largely missing. Using cleaned RWD, significant associations were found between BRCA1/2 positive mutation and targeted therapies. Conclusions In conclusion, we demonstrated a framework to generate RWE using RWD from different clinical sources. Rule-based NLP system achieved the best performance for resolving contextual variability when extracting RWD from unstructured clinical notes. Data quality issues such as incompleteness and discrepancies exist thus manual data cleaning is needed before further analysis can be performed. Finally, we were able to use cleaned RWD to evaluate the real-world utility of genetic information to initiate a prescription of targeted therapy.


Robotics ◽  
2021 ◽  
Vol 10 (2) ◽  
pp. 68
Author(s):  
Lei Shi ◽  
Cosmin Copot ◽  
Steve Vanlanduit

In gaze-based Human-Robot Interaction (HRI), it is important to determine human visual intention for interacting with robots. One typical HRI interaction scenario is that a human selects an object by gaze and a robotic manipulator will pick up the object. In this work, we propose an approach, GazeEMD, that can be used to detect whether a human is looking at an object for HRI application. We use Earth Mover’s Distance (EMD) to measure the similarity between the hypothetical gazes at objects and the actual gazes. Then, the similarity score is used to determine if the human visual intention is on the object. We compare our approach with a fixation-based method and HitScan with a run length in the scenario of selecting daily objects by gaze. Our experimental results indicate that the GazeEMD approach has higher accuracy and is more robust to noises than the other approaches. Hence, the users can lessen cognitive load by using our approach in the real-world HRI scenario.


2010 ◽  
Vol 72 (1) ◽  
pp. 24-29 ◽  
Author(s):  
Hui-Min Chung ◽  
Kristina Jackson Behan

Authentic assessment exercises are similar to real-world tasks that would be expected by a professional. An authentic assessment in combination with an inquiry-based learning activity enhances students' learning and rehearses them for their future roles, whether as scientists or as informed citizens. Over a period of 2 years, we experimented with two inquiry-based projects; one had traditional scientific inquiry characteristics, and the other used popular culture as the medium of inquiry. We found that activities that incorporated group learning motivated students and sharpened their abilities to apply and communicate their knowledge of science. We also discovered that incorporating popular culture provided ““Millennial”” students with a refreshing view of science learning and increased their appetites to explore and elaborate on science.


2001 ◽  
Vol 33 (4) ◽  
pp. 657-660
Author(s):  
Frank Tachau

Based on Professor Özbudun's lectures at Bilkent University, this book is at once compact, highly readable, and very insightful. Unlike much current literature on Turkey, the analysis is set in a broad and informed comparative context. Turkey, the author points out, has been left out of comparative political studies, particularly those encompassing the Middle East and southern Europe, in which arguably it could (or should) have been included. Scholarly neglect thus reflects real-world politics: Turkey falls between two worlds, one of which it once largely controlled, the other to which it currently aspires to belong. This lacuna is one of the factors that persuaded Özbudun to publish this volume.


Algorithms ◽  
2021 ◽  
Vol 14 (7) ◽  
pp. 197
Author(s):  
Ali Seman ◽  
Azizian Mohd Sapawi

In the conventional k-means framework, seeding is the first step toward optimization before the objects are clustered. In random seeding, two main issues arise: the clustering results may be less than optimal and different clustering results may be obtained for every run. In real-world applications, optimal and stable clustering is highly desirable. This report introduces a new clustering algorithm called the zero k-approximate modal haplotype (Zk-AMH) algorithm that uses a simple and novel seeding mechanism known as zero-point multidimensional spaces. The Zk-AMH provides cluster optimality and stability, therefore resolving the aforementioned issues. Notably, the Zk-AMH algorithm yielded identical mean scores to maximum, and minimum scores in 100 runs, producing zero standard deviation to show its stability. Additionally, when the Zk-AMH algorithm was applied to eight datasets, it achieved the highest mean scores for four datasets, produced an approximately equal score for one dataset, and yielded marginally lower scores for the other three datasets. With its optimality and stability, the Zk-AMH algorithm could be a suitable alternative for developing future clustering tools.


Sign in / Sign up

Export Citation Format

Share Document