Research directions in data wrangling: Visualizations and transformations for usable and credible data

In spite of advances in technologies for working with data, analysts still spend an inordinate amount of time diagnosing data quality issues and manipulating data into a usable form. This process of ‘data wrangling’ often constitutes the most tedious and time-consuming aspect of analysis. Though data cleaning and integration arelongstanding issues in the database community, relatively little research has explored how interactive visualization can advance the state of the art. In this article, we review the challenges and opportunities associated with addressing data quality issues. We argue that analysts might more effectively wrangle data through new interactive systems that integrate data verification, transformation, and visualization. We identify a number of outstanding research questions, including how appropriate visual encodings can facilitate apprehension of missing data, discrepant values, and uncertainty; how interactive visualizations might facilitate data transform specification; and how recorded provenance and social interaction might enable wider reuse, verification, and modification of data transformations.

Download Full-text

Challenges to Experimental Audit JDM Research and the Role of Online Platforms in Resolving these Challenges

Auditing A Journal of Practice & Theory ◽

10.2308/ajpt-19-105 ◽

2021 ◽

Author(s):

Justin Leiby ◽

Kristina M. Rennekamp ◽

Ken T. Trotman

Keyword(s):

Data Quality ◽

Important Research ◽

Managerial Accounting ◽

Publication Process ◽

Online Platforms ◽

Single Study ◽

Research Questions ◽

Quality Issues ◽

Over Time

We survey experienced experimental researchers to understand their beliefs about the biggest challenges facing audit JDM research. By far, the biggest challenge identified by respondents is access to experienced participants. This creates a major problem as examining important research questions often requires hard-to-access professionals and the availability of these participants has decreased over time. Other important challenges to audit JDM research include the publication process (including demands for multiple experiments in a single study involving experienced participants) and demonstrating practical contributions. We also compare responses about the challenges facing financial and managerial accounting researchers, in order to better understand the problems that are unique to audit researchers. We discuss how the challenges identified might be either mitigated or exacerbated by the use of various online platforms. We discuss data quality issues and potential solutions, provide suggestions on potential new sources of participants and possible ways forward for audit JDM research.

Download Full-text

A Systematic Framework for Analyzing Observation Data in Patient-Centered Registries: Case Study for Patients With Depression (Preprint)

10.2196/preprints.18366 ◽

2020 ◽

Author(s):

Maryam Zolnoori ◽

Mark D Williams ◽

William B Leasure ◽

Kurt B Angstman ◽

Che Ngufor

Keyword(s):

Data Quality ◽

Quality Data ◽

Patient Centered ◽

High Quality ◽

Data Set ◽

High Quality Data ◽

Research Questions ◽

Quality Issues ◽

The Impact ◽

Systematic Framework

BACKGROUND Patient-centered registries are essential in population-based clinical care for patient identification and monitoring of outcomes. Although registry data may be used in real time for patient care, the same data may further be used for secondary analysis to assess disease burden, evaluation of disease management and health care services, and research. The design of a registry has major implications for the ability to effectively use these clinical data in research. OBJECTIVE This study aims to develop a systematic framework to address the data and methodological issues involved in analyzing data in clinically designed patient-centered registries. METHODS The systematic framework was composed of 3 major components: visualizing the multifaceted and heterogeneous patient-centered registries using a data flow diagram, assessing and managing data quality issues, and identifying patient cohorts for addressing specific research questions. RESULTS Using a clinical registry designed as a part of a collaborative care program for adults with depression at Mayo Clinic, we were able to demonstrate the impact of the proposed framework on data integrity. By following the data cleaning and refining procedures of the framework, we were able to generate high-quality data that were available for research questions about the coordination and management of depression in a primary care setting. We describe the steps involved in converting clinically collected data into a viable research data set using registry cohorts of depressed adults to assess the impact on high-cost service use. CONCLUSIONS The systematic framework discussed in this study sheds light on the existing inconsistency and data quality issues in patient-centered registries. This study provided a step-by-step procedure for addressing these challenges and for generating high-quality data for both quality improvement and research that may enhance care and outcomes for patients. INTERNATIONAL REGISTERED REPORT DERR1-10.2196/18366

Download Full-text

A Systematic Framework for Analyzing Observation Data in Patient-Centered Registries: Case Study for Patients With Depression

JMIR Research Protocols ◽

10.2196/18366 ◽

2020 ◽

Vol 9 (10) ◽

pp. e18366

Author(s):

Maryam Zolnoori ◽

Mark D Williams ◽

William B Leasure ◽

Kurt B Angstman ◽

Che Ngufor

Keyword(s):

Data Quality ◽

Quality Data ◽

Patient Centered ◽

High Quality ◽

Data Set ◽

High Quality Data ◽

Research Questions ◽

Quality Issues ◽

The Impact ◽

Systematic Framework

Background Patient-centered registries are essential in population-based clinical care for patient identification and monitoring of outcomes. Although registry data may be used in real time for patient care, the same data may further be used for secondary analysis to assess disease burden, evaluation of disease management and health care services, and research. The design of a registry has major implications for the ability to effectively use these clinical data in research. Objective This study aims to develop a systematic framework to address the data and methodological issues involved in analyzing data in clinically designed patient-centered registries. Methods The systematic framework was composed of 3 major components: visualizing the multifaceted and heterogeneous patient-centered registries using a data flow diagram, assessing and managing data quality issues, and identifying patient cohorts for addressing specific research questions. Results Using a clinical registry designed as a part of a collaborative care program for adults with depression at Mayo Clinic, we were able to demonstrate the impact of the proposed framework on data integrity. By following the data cleaning and refining procedures of the framework, we were able to generate high-quality data that were available for research questions about the coordination and management of depression in a primary care setting. We describe the steps involved in converting clinically collected data into a viable research data set using registry cohorts of depressed adults to assess the impact on high-cost service use. Conclusions The systematic framework discussed in this study sheds light on the existing inconsistency and data quality issues in patient-centered registries. This study provided a step-by-step procedure for addressing these challenges and for generating high-quality data for both quality improvement and research that may enhance care and outcomes for patients. International Registered Report Identifier (IRRID) DERR1-10.2196/18366

Download Full-text

FONDUE: A Framework for Node Disambiguation and Deduplication Using Network Embeddings

Applied Sciences ◽

10.3390/app11219884 ◽

2021 ◽

Vol 11 (21) ◽

pp. 9884

Author(s):

Ahmad Mel ◽

Bo Kang ◽

Jefrey Lijffijt ◽

Tijl De Bie

Keyword(s):

Data Quality ◽

State Of The Art ◽

Computational Cost ◽

Real Objects ◽

Unweighted Network ◽

Quality Issues ◽

Main Components ◽

Subsequent Splitting ◽

Network Form ◽

Embedding Methods

Data often have a relational nature that is most easily expressed in a network form, with its main components consisting of nodes that represent real objects and links that signify the relations between these objects. Modeling networks is useful for many purposes, but the efficacy of downstream tasks is often hampered by data quality issues related to their construction. In many constructed networks, ambiguity may arise when a node corresponds to multiple concepts. Similarly, a single entity can be mistakenly represented by several different nodes. In this paper, we formalize both the node disambiguation (NDA) and node deduplication (NDD) tasks to resolve these data quality issues. We then introduce FONDUE, a framework for utilizing network embedding methods for data-driven disambiguation and deduplication of nodes. Given an undirected and unweighted network, FONDUE-NDA identifies nodes that appear to correspond to multiple entities for subsequent splitting and suggests how to split them (node disambiguation), whereas FONDUE-NDD identifies nodes that appear to correspond to same entity for merging (node deduplication), using only the network topology. From controlled experiments on benchmark networks, we find that FONDUE-NDA is substantially and consistently more accurate with lower computational cost in identifying ambiguous nodes, and that FONDUE-NDD is a competitive alternative for node deduplication, when compared to state-of-the-art alternatives.

Download Full-text

Quality of sentiment analysis tools

Proceedings of the VLDB Endowment ◽

10.14778/3436905.3436924 ◽

2020 ◽

Vol 14 (4) ◽

pp. 668-681

Author(s):

Wissam Mammar Kouadri ◽

Mourad Ouziri ◽

Salima Benbernou ◽

Karima Echihabi ◽

Themis Palpanas ◽

...

Keyword(s):

Data Quality ◽

Sentiment Analysis ◽

Open Problem ◽

State Of The Art ◽

Research Directions ◽

Analysis Tools ◽

Public Datasets ◽

Comprehensive Study

In this paper, we present a comprehensive study that evaluates six state-of-the-art sentiment analysis tools on five public datasets, based on the quality of predictive results in the presence of semantically equivalent documents, i.e., how consistent existing tools are in predicting the polarity of documents based on paraphrased text. We observe that sentiment analysis tools exhibit intra-tool inconsistency , which is the prediction of different polarity for semantically equivalent documents by the same tool, and inter-tool inconsistency , which is the prediction of different polarity for semantically equivalent documents across different tools. We introduce a heuristic to assess the data quality of an augmented dataset and a new set of metrics to evaluate tool inconsistencies. Our results indicate that tool inconsistencies is still an open problem, and they point towards promising research directions and accuracy improvements that can be obtained if such inconsistencies are resolved.

Download Full-text

fMRI Research on the Bilingual Brain

Annual Review of Applied Linguistics ◽

10.1017/s0267190514000038 ◽

2014 ◽

Vol 34 ◽

pp. 1-14 ◽

Cited By ~ 2

Author(s):

Laura Sabourin

Keyword(s):

Sentence Processing ◽

Brain Research ◽

State Of The Art ◽

Future Research ◽

Functional Magnetic Resonance ◽

Research Directions ◽

Bilingual Advantage ◽

Pros And Cons ◽

Future Research Directions ◽

Research Questions

In this article, I review the use of the functional magnetic resonance imaging (fMRI) technique to investigate the bilingual brain. Specifically, this review will discuss the types of research questions that can be (and have been) answered using this specific methodology, as well as questions this technique cannot answer. The review will then focus on providing a recent overview of fMRI studies of the bilingual mental lexicon, bilingual sentence processing, and the bilingual advantage in cognitive control. The pros and cons of this technique will be discussed in detail. This review will end with a discussion of the state of the art in the field of bilingual brain research and will provide avenues for future research directions to continue investigating the bilingual brain.

Download Full-text

A Legitimacy Crisis of Representative Democracy?

10.1093/oso/9780198793717.003.0001 ◽

2017 ◽

Cited By ~ 1

Author(s):

Jacques Thomassen ◽

Carolien van Ham

Keyword(s):

Empirical Evidence ◽

State Of The Art ◽

Short Description ◽

The State ◽

Political Support ◽

Systematic Fashion ◽

Research Questions ◽

The 1960S ◽

Established Democracies ◽

Legitimacy Crisis

This chapter presents the research questions and outline of the book, providing a brief review of the state of the art of legitimacy research in established democracies, and discusses the recurring theme of crisis throughout this literature since the 1960s. It includes a discussion of the conceptualization and measurement of legitimacy, seeking to relate legitimacy to political support, and reflecting on how to evaluate empirical indicators: what symptoms indicate crisis? This chapter further explains the structure of the three main parts of the book. Part I evaluates in a systematic fashion the empirical evidence for legitimacy decline in established democracies; Part II reappraises the validity of theories of legitimacy decline; and Part II investigates what (new) explanations can account for differences in legitimacy between established democracies. The chapter concludes with a short description of the chapters included in the volume.

Download Full-text

Users roles identification on online crowdsourced Q&A platforms and encyclopedias: a survey

Journal of Computational Social Science ◽

10.1007/s42001-021-00125-9 ◽

2021 ◽

Author(s):

Akrati Saxena ◽

Harita Reddy

Keyword(s):

Network Analysis ◽

Knowledge Sharing ◽

Informal Learning ◽

Social Roles ◽

Online Discussion ◽

State Of The Art ◽

Temporal Analysis ◽

The State ◽

Open Research ◽

Research Questions

AbstractOnline informal learning and knowledge-sharing platforms, such as Stack Exchange, Reddit, and Wikipedia have been a great source of learning. Millions of people access these websites to ask questions, answer the questions, view answers, or check facts. However, one interesting question that has always attracted the researchers is if all the users share equally on these portals, and if not then how the contribution varies across users, and how it is distributed? Do different users focus on different kinds of activities and play specific roles? In this work, we present a survey of users’ social roles that have been identified on online discussion and Q&A platforms including Usenet newsgroups, Reddit, Stack Exchange, and MOOC forums, as well as on crowdsourced encyclopedias, such as Wikipedia, and Baidu Baike, where users interact with each other through talk pages. We discuss the state of the art on capturing the variety of users roles through different methods including the construction of user network, analysis of content posted by users, temporal analysis of user activity, posting frequency, and so on. We also discuss the available datasets and APIs to collect the data from these platforms for further research. The survey is concluded with open research questions.

Download Full-text

Disaster Resilience of Optical Networks: State of the Art, Challenges, and Opportunities

Optical Switching and Networking ◽

10.1016/j.osn.2021.100619 ◽

2021 ◽

pp. 100619

Author(s):

Jacek Rak ◽

Rita Girão-Silva ◽

Teresa Gomes ◽

Georgios Ellinas ◽

Burak Kantarci ◽

...

Keyword(s):

Optical Networks ◽

State Of The Art ◽

Disaster Resilience ◽

Challenges And Opportunities

Download Full-text

Towards Scalable Economic Photovoltaic Potential Analysis Using Aerial Images and Deep Learning

Energies ◽

10.3390/en14133800 ◽

2021 ◽

Vol 14 (13) ◽

pp. 3800

Author(s):

Sebastian Krapf ◽

Nils Kemmerzell ◽

Syed Khawaja Haseeb Khawaja Haseeb Uddin ◽

Manuel Hack Hack Vázquez ◽

Fabian Netzler ◽

...

Keyword(s):

Deep Learning ◽

System Analysis ◽

State Of The Art ◽

Critical Role ◽

Semantic Segmentation ◽

Energy System ◽

Aerial Images ◽

Potential Analysis ◽

3D Data ◽

Challenges And Opportunities

Roof-mounted photovoltaic systems play a critical role in the global transition to renewable energy generation. An analysis of roof photovoltaic potential is an important tool for supporting decision-making and for accelerating new installations. State of the art uses 3D data to conduct potential analyses with high spatial resolution, limiting the study area to places with available 3D data. Recent advances in deep learning allow the required roof information from aerial images to be extracted. Furthermore, most publications consider the technical photovoltaic potential, and only a few publications determine the photovoltaic economic potential. Therefore, this paper extends state of the art by proposing and applying a methodology for scalable economic photovoltaic potential analysis using aerial images and deep learning. Two convolutional neural networks are trained for semantic segmentation of roof segments and superstructures and achieve an Intersection over Union values of 0.84 and 0.64, respectively. We calculated the internal rate of return of each roof segment for 71 buildings in a small study area. A comparison of this paper’s methodology with a 3D-based analysis discusses its benefits and disadvantages. The proposed methodology uses only publicly available data and is potentially scalable to the global level. However, this poses a variety of research challenges and opportunities, which are summarized with a focus on the application of deep learning, economic photovoltaic potential analysis, and energy system analysis.

Download Full-text