Research directions in data wrangling: Visualizations and transformations for usable and credible data

2011
Vol 10 (4)
pp. 271-288
Author(s):
Sean Kandel
Jeffrey Heer
Catherine Plaisant
Jessie Kennedy
Frank van Ham
...  

In spite of advances in technologies for working with data, analysts still spend an inordinate amount of time diagnosing data quality issues and manipulating data into a usable form. This process of ‘data wrangling’ often constitutes the most tedious and time-consuming aspect of analysis. Though data cleaning and integration are longstanding issues in the database community, relatively little research has explored how interactive visualization can advance the state of the art. In this article, we review the challenges and opportunities associated with addressing data quality issues. We argue that analysts might more effectively wrangle data through new interactive systems that integrate data verification, transformation, and visualization. We identify a number of outstanding research questions, including how appropriate visual encodings can facilitate apprehension of missing data, discrepant values, and uncertainty; how interactive visualizations might facilitate data transform specification; and how recorded provenance and social interaction might enable wider reuse, verification, and modification of data transformations.
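
As a rough illustration of the kind of data-quality diagnosis the authors argue should be surfaced interactively, the following sketch profiles missing and discrepant values in a tabular dataset. The columns, thresholds, and pandas-based approach are assumptions for illustration, not the systems proposed in the article.

```python
# Minimal data-quality diagnostic sketch (hypothetical columns and thresholds).
import pandas as pd

def profile_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize missing values and suspicious outliers per column."""
    report = []
    for col in df.columns:
        missing = df[col].isna().mean()
        outliers = 0.0
        if pd.api.types.is_numeric_dtype(df[col]):
            s = df[col].dropna()
            if len(s) > 1 and s.std() > 0:
                z = (s - s.mean()).abs() / s.std()
                outliers = (z > 3).mean()  # crude discrepant-value flag
        report.append({"column": col, "missing_rate": missing, "outlier_rate": outliers})
    return pd.DataFrame(report)

if __name__ == "__main__":
    # Tiny made-up table with a missing age, a missing city, and an implausible age.
    df = pd.DataFrame({"age": [34, 29, None, 410], "city": ["Oslo", "Oslo", None, "Bergen"]})
    print(profile_quality(df))
```

Such a per-column report is the sort of summary that the interactive systems envisioned in the article would render as visual encodings rather than as a printed table.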

Author(s):  
Justin Leiby ◽  
Kristina M. Rennekamp ◽  
Ken T. Trotman

We survey experienced experimental researchers to understand their beliefs about the biggest challenges facing audit JDM research. By far, the biggest challenge identified by respondents is access to experienced participants. This creates a major problem, as examining important research questions often requires hard-to-access professionals, and the availability of these participants has decreased over time. Other important challenges to audit JDM research include the publication process (including demands for multiple experiments in a single study involving experienced participants) and demonstrating practical contributions. We also compare responses about the challenges facing financial and managerial accounting researchers in order to better understand the problems that are unique to audit researchers. We discuss how the challenges identified might be either mitigated or exacerbated by the use of various online platforms. We discuss data quality issues and potential solutions, provide suggestions on potential new sources of participants, and outline possible ways forward for audit JDM research.


10.2196/18366
2020
Vol 9 (10)
pp. e18366
Author(s):
Maryam Zolnoori
Mark D Williams
William B Leasure
Kurt B Angstman
Che Ngufor

Background: Patient-centered registries are essential in population-based clinical care for patient identification and monitoring of outcomes. Although registry data may be used in real time for patient care, the same data may also be used for secondary analyses to assess disease burden, evaluate disease management and health care services, and support research. The design of a registry has major implications for the ability to effectively use these clinical data in research. Objective: This study aims to develop a systematic framework to address the data and methodological issues involved in analyzing data in clinically designed patient-centered registries. Methods: The systematic framework was composed of 3 major components: visualizing the multifaceted and heterogeneous patient-centered registries using a data flow diagram, assessing and managing data quality issues, and identifying patient cohorts for addressing specific research questions. Results: Using a clinical registry designed as part of a collaborative care program for adults with depression at Mayo Clinic, we demonstrate the impact of the proposed framework on data integrity. By following the data cleaning and refining procedures of the framework, we were able to generate high-quality data that were available for research questions about the coordination and management of depression in a primary care setting. We describe the steps involved in converting clinically collected data into a viable research data set, using registry cohorts of depressed adults to assess the impact on high-cost service use. Conclusions: The systematic framework discussed in this study sheds light on the existing inconsistency and data quality issues in patient-centered registries. This study provides a step-by-step procedure for addressing these challenges and for generating high-quality data for both quality improvement and research that may enhance care and outcomes for patients. International Registered Report Identifier (IRRID): DERR1-10.2196/18366
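
The abstract does not specify the registry's schema, so the following is only a hypothetical sketch of the "data cleaning and cohort identification" steps the framework describes; the column names (patient_id, enrollment_date, phq9_baseline) and the inclusion criteria are illustrative assumptions.

```python
# Hypothetical registry-cleaning and cohort-selection sketch; not the study's actual schema.
import pandas as pd

def build_research_cohort(registry: pd.DataFrame) -> pd.DataFrame:
    """Clean clinically collected registry records and select an analysis cohort."""
    df = registry.copy()
    # Resolve duplicate enrollments: keep the earliest record per patient.
    df = df.sort_values("enrollment_date").drop_duplicates("patient_id", keep="first")
    # Drop records missing the variables the research question depends on.
    df = df.dropna(subset=["phq9_baseline", "enrollment_date"])
    # Example inclusion criterion: adults with at least moderate baseline depression.
    return df[(df["age"] >= 18) & (df["phq9_baseline"] >= 10)]

if __name__ == "__main__":
    registry = pd.DataFrame({
        "patient_id": [1, 1, 2, 3],
        "age": [45, 45, 17, 60],
        "enrollment_date": pd.to_datetime(["2019-01-05", "2019-06-01", "2019-02-10", "2019-03-03"]),
        "phq9_baseline": [14, 14, 12, None],
    })
    print(build_research_cohort(registry))
```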


2021
Vol 11 (21)
pp. 9884
Author(s):
Ahmad Mel
Bo Kang
Jefrey Lijffijt
Tijl De Bie

Data often have a relational nature that is most easily expressed in a network form, whose main components are nodes that represent real objects and links that signify the relations between these objects. Modeling networks is useful for many purposes, but the efficacy of downstream tasks is often hampered by data quality issues related to their construction. In many constructed networks, ambiguity may arise when a node corresponds to multiple concepts. Similarly, a single entity can be mistakenly represented by several different nodes. In this paper, we formalize both the node disambiguation (NDA) and node deduplication (NDD) tasks to resolve these data quality issues. We then introduce FONDUE, a framework for utilizing network embedding methods for data-driven disambiguation and deduplication of nodes. Given an undirected and unweighted network, FONDUE-NDA identifies nodes that appear to correspond to multiple entities and suggests how to split them (node disambiguation), whereas FONDUE-NDD identifies nodes that appear to correspond to the same entity and should be merged (node deduplication), using only the network topology. In controlled experiments on benchmark networks, we find that FONDUE-NDA is substantially and consistently more accurate at identifying ambiguous nodes, at a lower computational cost, and that FONDUE-NDD is a competitive alternative for node deduplication, when compared to state-of-the-art alternatives.
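
FONDUE itself relies on network embeddings, which the abstract does not detail; as a simpler, hedged illustration of topology-only node deduplication, the sketch below flags candidate duplicate nodes by neighborhood (Jaccard) similarity using networkx. It is a baseline for the task, not the FONDUE method.

```python
# Not the FONDUE algorithm; a simple topology-only baseline that flags
# node-deduplication candidates via neighborhood (Jaccard) similarity.
import itertools
import networkx as nx

def deduplication_candidates(g: nx.Graph, threshold: float = 0.8):
    """Return node pairs whose neighborhoods overlap enough to suggest one real entity."""
    candidates = []
    for u, v in itertools.combinations(g.nodes, 2):
        nu, nv = set(g[u]) - {v}, set(g[v]) - {u}
        if not nu or not nv:
            continue
        jaccard = len(nu & nv) / len(nu | nv)
        if jaccard >= threshold:
            candidates.append((u, v, jaccard))
    return sorted(candidates, key=lambda t: -t[2])

if __name__ == "__main__":
    # 'a' and 'b' share all neighbors, so they likely represent one entity split in two.
    g = nx.Graph([("a", "x"), ("a", "y"), ("a", "z"), ("b", "x"), ("b", "y"), ("b", "z")])
    print(deduplication_candidates(g))  # [('a', 'b', 1.0)]
```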


2020
Vol 14 (4)
pp. 668-681
Author(s):
Wissam Mammar Kouadri
Mourad Ouziri
Salima Benbernou
Karima Echihabi
Themis Palpanas
...  

In this paper, we present a comprehensive study that evaluates six state-of-the-art sentiment analysis tools on five public datasets, based on the quality of predictive results in the presence of semantically equivalent documents, i.e., how consistent existing tools are in predicting the polarity of documents based on paraphrased text. We observe that sentiment analysis tools exhibit intra-tool inconsistency, which is the prediction of different polarity for semantically equivalent documents by the same tool, and inter-tool inconsistency, which is the prediction of different polarity for semantically equivalent documents across different tools. We introduce a heuristic to assess the data quality of an augmented dataset and a new set of metrics to evaluate tool inconsistencies. Our results indicate that tool inconsistency is still an open problem, and they point towards promising research directions and accuracy improvements that can be obtained if such inconsistencies are resolved.
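
The paper's exact metrics are not given in the abstract; the sketch below illustrates one plausible intra-tool inconsistency measure, namely the fraction of paraphrase groups to which a single tool assigns more than one polarity. The group structure and labels are assumptions for illustration.

```python
# A hedged illustration of one possible intra-tool inconsistency measure;
# the paper's own metrics may be defined differently.
from collections import defaultdict

def intra_tool_inconsistency(predictions):
    """predictions: list of (group_id, polarity) pairs for one tool,
    where a group contains semantically equivalent (paraphrased) documents."""
    polarities_per_group = defaultdict(set)
    for group_id, polarity in predictions:
        polarities_per_group[group_id].add(polarity)
    inconsistent = sum(1 for labels in polarities_per_group.values() if len(labels) > 1)
    return inconsistent / len(polarities_per_group)

if __name__ == "__main__":
    preds = [(0, "pos"), (0, "pos"), (1, "pos"), (1, "neg"), (2, "neu")]
    print(intra_tool_inconsistency(preds))  # 1 of 3 paraphrase groups is inconsistent -> 0.333...
```

An inter-tool variant would compare the label sets that different tools assign to the same group instead of the labels one tool assigns within a group.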


2014
Vol 34
pp. 1-14
Author(s):  
Laura Sabourin

In this article, I review the use of the functional magnetic resonance imaging (fMRI) technique to investigate the bilingual brain. Specifically, this review will discuss the types of research questions that can be (and have been) answered using this methodology, as well as questions this technique cannot answer. The review will then provide a recent overview of fMRI studies of the bilingual mental lexicon, bilingual sentence processing, and the bilingual advantage in cognitive control. The pros and cons of this technique will be discussed in detail. The review will end with a discussion of the state of the art in the field of bilingual brain research and will suggest avenues for future research on the bilingual brain.


Author(s):  
Jacques Thomassen ◽  
Carolien van Ham

This chapter presents the research questions and outline of the book, provides a brief review of the state of the art of legitimacy research in established democracies, and discusses the recurring theme of crisis throughout this literature since the 1960s. It includes a discussion of the conceptualization and measurement of legitimacy, seeking to relate legitimacy to political support and reflecting on how to evaluate empirical indicators: what symptoms indicate crisis? This chapter further explains the structure of the three main parts of the book. Part I evaluates in a systematic fashion the empirical evidence for legitimacy decline in established democracies; Part II reappraises the validity of theories of legitimacy decline; and Part III investigates what (new) explanations can account for differences in legitimacy between established democracies. The chapter concludes with a short description of the chapters included in the volume.


Author(s):  
Akrati Saxena ◽  
Harita Reddy

Online informal learning and knowledge-sharing platforms, such as Stack Exchange, Reddit, and Wikipedia, have been a great source of learning. Millions of people access these websites to ask questions, answer questions, view answers, or check facts. However, one question that has long attracted researchers is whether all users contribute equally on these portals and, if not, how contributions vary across users and how they are distributed. Do different users focus on different kinds of activities and play specific roles? In this work, we present a survey of users' social roles that have been identified on online discussion and Q&A platforms, including Usenet newsgroups, Reddit, Stack Exchange, and MOOC forums, as well as on crowdsourced encyclopedias such as Wikipedia and Baidu Baike, where users interact with each other through talk pages. We discuss the state of the art in capturing the variety of user roles through different methods, including the construction of user networks, analysis of content posted by users, temporal analysis of user activity, posting frequency, and so on. We also discuss the available datasets and APIs for collecting data from these platforms for further research. The survey concludes with open research questions.
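
As a hedged illustration of the simplest signal surveyed above (posting frequency and activity mix), the sketch below assigns a rough role label from a user's event counts; the event names and thresholds are illustrative assumptions, not taken from any of the surveyed studies.

```python
# Illustrative posting-frequency role heuristic; thresholds and event names are assumptions.
from collections import Counter

def rough_role(user_events):
    """user_events: list of event types for one user, e.g. 'question', 'answer', 'comment'."""
    counts = Counter(user_events)
    questions, answers = counts["question"], counts["answer"]
    total = sum(counts.values())
    if total == 0:
        return "lurker"
    if answers >= 3 * max(questions, 1):
        return "answerer"
    if questions >= 3 * max(answers, 1):
        return "question-asker"
    return "mixed contributor"

if __name__ == "__main__":
    print(rough_role(["answer"] * 40 + ["question"] * 2))   # answerer
    print(rough_role(["question"] * 5 + ["comment"] * 3))   # question-asker
```

Richer approaches surveyed in the paper combine such activity counts with user-network structure and content analysis rather than relying on frequency alone.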


2021
pp. 100619
Author(s):
Jacek Rak
Rita Girão-Silva
Teresa Gomes
Georgios Ellinas
Burak Kantarci
...  

Energies
2021
Vol 14 (13)
pp. 3800
Author(s):
Sebastian Krapf
Nils Kemmerzell
Syed Khawaja Haseeb Uddin
Manuel Hack Vázquez
Fabian Netzler
...  

Roof-mounted photovoltaic systems play a critical role in the global transition to renewable energy generation. An analysis of roof photovoltaic potential is an important tool for supporting decision-making and for accelerating new installations. The state of the art uses 3D data to conduct potential analyses with high spatial resolution, limiting the study area to places with available 3D data. Recent advances in deep learning allow the required roof information to be extracted from aerial images. Furthermore, most publications consider the technical photovoltaic potential, and only a few determine the economic photovoltaic potential. Therefore, this paper extends the state of the art by proposing and applying a methodology for scalable economic photovoltaic potential analysis using aerial images and deep learning. Two convolutional neural networks are trained for semantic segmentation of roof segments and superstructures and achieve Intersection over Union values of 0.84 and 0.64, respectively. We calculated the internal rate of return of each roof segment for 71 buildings in a small study area. A comparison of this paper's methodology with a 3D-based analysis discusses its benefits and disadvantages. The proposed methodology uses only publicly available data and is potentially scalable to the global level. However, this poses a variety of research challenges and opportunities, which are summarized with a focus on the application of deep learning, economic photovoltaic potential analysis, and energy system analysis.
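
The paper's economic model is not spelled out in the abstract; the sketch below shows a generic internal-rate-of-return computation for a single roof segment under purely illustrative cash-flow assumptions (investment cost, annual revenue, lifetime), not the parameters used in the study.

```python
# Generic IRR computation for one roof segment; cash-flow figures are illustrative assumptions.
def npv(rate, cash_flows):
    """Net present value of yearly cash flows, where cash_flows[0] is the upfront investment."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))

def irr(cash_flows, lo=-0.99, hi=1.0, tol=1e-6):
    """Internal rate of return via bisection on NPV (assumes exactly one sign change)."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if npv(mid, cash_flows) > 0:
            lo = mid  # NPV still positive: the root lies at a higher rate
        else:
            hi = mid
        if hi - lo < tol:
            break
    return (lo + hi) / 2

if __name__ == "__main__":
    # Hypothetical segment: 7,000 EUR investment, ~1,350 EUR net revenue per year for 25 years.
    cash_flows = [-7000] + [1350] * 25
    print(f"IRR ~ {irr(cash_flows):.1%}")
```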

