Collecting and visualizing data lineage of Spark jobs

Author(s):  
Alexander Schoenenwald ◽  
Simon Kern ◽  
Josef Viehhauser ◽  
Johannes Schildgen

Abstract: Metadata management constitutes a key prerequisite for enterprises as they engage in data analytics and governance. Today, however, the context of data is often documented only manually by subject matter experts, and it lacks completeness and reliability due to the complex nature of data pipelines. Thus, collecting data lineage (describing the origin, structure, and dependencies of data) in an automated fashion increases the quality of the provided metadata and reduces manual effort, making it critical for the development and operation of data pipelines. In our practice report, we propose an end-to-end solution that digests lineage via (Py‑)Spark execution plans. We build upon the open-source component Spline, allowing us to reliably consume lineage metadata and identify interdependencies. We map the digested data into an expandable data model, enabling us to extract graph structures for both coarse- and fine-grained data lineage. Lastly, our solution visualizes the extracted data lineage via a modern web app, and integrates with BMW Group’s soon-to-be open-sourced Cloud Data Hub.
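
The distinction between coarse- and fine-grained lineage drawn in the abstract can be illustrated with a minimal Python sketch. It does not reproduce the authors' data model or Spline's output format: the record layout (`job`, `reads`, `writes`, `columns`), the dataset names, and the helper functions are all hypothetical, chosen only to show how dataset-level and column-level graphs might be derived from the same collected metadata.

```python
from collections import defaultdict

# Hypothetical, simplified lineage records (NOT the Spline format):
# each job reads datasets, writes one dataset, and maps every output
# column to the input columns it was derived from.
jobs = [
    {
        "job": "clean_orders",
        "reads": ["raw.orders"],
        "writes": "stage.orders",
        "columns": {
            "order_id": ["raw.orders.id"],
            "total": ["raw.orders.net", "raw.orders.tax"],
        },
    },
    {
        "job": "build_report",
        "reads": ["stage.orders", "raw.customers"],
        "writes": "mart.report",
        "columns": {
            "revenue": ["stage.orders.total"],
            "customer": ["raw.customers.name"],
        },
    },
]


def coarse_lineage(jobs):
    """Dataset-level edges: (source dataset, target dataset)."""
    return {(src, job["writes"]) for job in jobs for src in job["reads"]}


def fine_lineage(jobs):
    """Column-level edges: (source column, target column)."""
    edges = set()
    for job in jobs:
        for column, sources in job["columns"].items():
            target = f'{job["writes"]}.{column}'
            for source in sources:
                edges.add((source, target))
    return edges


def upstream(node, edges):
    """All transitive predecessors of a node in an edge set."""
    parents = defaultdict(set)
    for source, target in edges:
        parents[target].add(source)
    seen, stack = set(), [node]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


coarse = coarse_lineage(jobs)
fine = fine_lineage(jobs)
```

Under these assumptions, `upstream("mart.report", coarse)` answers which datasets a report depends on, while `upstream("mart.report.revenue", fine)` narrows the answer to the individual source columns.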

2018 ◽  
Vol 10 (1) ◽  
Author(s):  
Diep T Vu ◽  
Duc H Bui ◽  
Giang T Le ◽  
Hai K Nguyen ◽  
Duong C Thanh ◽  
...  

Objective: To use Epi Info Cloud Data Analytics (ECDA) to improve the management, quality and utilization of the Vietnam National HIV Surveillance data. Introduction: HIV surveillance in Vietnam comprises different surveillance systems, including the HIV sentinel surveillance (HSS). The HSS is an annual, multi-site survey to monitor HIV sero-prevalence and risk behaviors among key populations. In 2015, the Vietnam Administration on HIV/AIDS Control (VAAC) installed the Epi Info Cloud Data Analytics (ECDA), a free web-based analytical and visualization program developed by the Centers for Disease Control and Prevention (CDC) (1), to serve as an information management system for HIV surveillance. Until 2016, provincial surveys, recorded on paper, were computerized and submitted to VAAC, which was responsible for merging individual provincial datasets to form a national HSS dataset. Feedback on HSS issues was provided to provinces 3 to 6 months after survey conclusion. With the use of tablets for field data collection in 2017, provincial survey data were recorded electronically and transferred to VAAC at the end of each survey day, thus enabling the national 2017 HSS dataset to be updated on a daily basis. Once the national HSS dataset was available on VAAC’s server, ECDA enabled wider access and prompt analysis for staff at all levels (figure 1). This abstract describes the use of ECDA, together with tablet-based data collection, to improve management, quality and use of surveillance data. Methods: After the installation of the ECDA on VAAC’s server in 2015, investments were made at all levels of the surveillance systems to build the capacity to operate and maintain the ECDA. 
These included trainings on programming, administration, and utilization of ECDA at the central level; creating a centralized database through abstracting and linking different surveillance datasets; developing analysis templates to assist province-specific reports; and trainings on access and use of the ECDA for provincial staff. One hundred and eighty-five ECDA analyst accounts, authorized for submission, viewing and analysis of data, were created for surveillance staff in 63 provinces and 7 agencies. Six administrator accounts, created for users at the central and regional levels, were authorized for editing data and managing user accounts. In 2017, more ECDA activities were conducted to: (i) develop analysis dashboards to track progress and data quality of HSS provincial surveys; (ii) facilitate frequent data reviews at central and regional levels; and (iii) provide feedback to provinces on survey issues, including sample selection. Results: Since 2015, separate national datasets, including the HSS, HIV case reports, and HIV routine program reports, were systematically cleaned and merged to form a centralized national database, which was then centrally stored and regularly backed up. Access to the national database was granted to surveillance staff in all 63 provinces through 185 designated ECDA accounts. During the 2017 HSS surveys, 70 ECDA users in 20 HSS provinces actively managed and used the HSS data. Twelve weekly reviews of HSS provincial data were conducted at the national level throughout the 2017 HSS survey. Ninety percent of provinces received feedback on their survey data as early as the first week of field data collection. The national 2017 HSS dataset and its analysis were available immediately after the completion of the last provincial survey, about 3 to 6 months earlier than the reports of previous years. 
More importantly, the fresh results of the 2017 HSS survey were available and used for the 2018 Vietnam HIV national planning cycle (table 1). Conclusions: ECDA is a quick, relevant, free program to improve the management and analysis of HIV surveillance data. Using ECDA, it is easy to generate and modify analysis dashboards that enhance utilization of surveillance data. Successful administration and use of the ECDA during the 2017 HSS survey is positive evidence for the Ministry of Health to consider institutionalizing the program in Vietnam's surveillance systems.


Author(s):  
Jike Ge ◽  
Wenbo He ◽  
Zuqin Chen ◽  
Can Liu ◽  
Jun Peng ◽  
...  

This article describes how stateful data analytic frameworks have emerged to provide fresh and low-latency results for big data processing. At present, a fine-grained data model is desirable in the Spark data processing framework. However, because Spark adopts a coarse-grained data model in order to facilitate parallelization, handling the fine-grained data access required by stateful data analytics is challenging. In this paper, the authors introduce a fine-grained stateful data component, Resilient State Table (RST), to the Spark framework. To fill the gap between the coarse-grained data model in Spark and the fine-grained data access requirements of stateful data analytics, they devise a programming model for RST that interacts seamlessly with Spark's coarse-grained memory representation and enables users to query and update state entries at fine granularity through Spark-like programming interfaces. Performance evaluation experiments in various application fields demonstrate that the proposed solution achieves improvements in latency, fault tolerance, and scalability.


2021 ◽  
Vol 13 (3) ◽  
pp. 1582
Author(s):  
Joe Butchers ◽  
Sam Williamson ◽  
Julian Booker

Evaluating the sustainable operation of community-owned and community-operated renewable energy projects is complex. The development of a project often depends on the actions of diverse stakeholders, including the government, industry and communities. Throughout the project cycle, these interrelated actions impact the sustainability of the project. In this paper, the typical project cycle of a micro-hydropower plant in Nepal is used to demonstrate that key events throughout the project cycle affect a plant’s ability to operate sustainably. Through a critical analysis of the available literature, policy and project documentation and interviews with manufacturers, drivers that affect the sustainability of plants are found. Examples include weak specification of civil components during tendering, quality control issues during manufacture, poor quality of construction and trained operators leaving their position. Opportunities to minimise both the occurrence and the severity of threats to sustainability are identified. For the micro-hydropower industry in Nepal, recommendations are made for specific actions by the relevant stakeholders at appropriate moments in the project cycle. More broadly, the findings demonstrate that the complex nature of developing community energy projects requires a holistic consideration of the complete project process.


Author(s):  
Steffen Kläbe ◽  
Kai-Uwe Sattler ◽  
Stephan Baumann

Abstract: Cloud data warehouse systems lower the barrier to access data analytics. These applications often lack a database administrator and integrate data from various sources, potentially leading to data not satisfying strict constraints. Automatic schema optimization in self-managing databases is difficult in these environments without prior data cleaning steps. In this paper, we focus on constraint discovery as a subtask of schema optimization. Perfect constraints might not exist in these unclean datasets due to a small set of values violating the constraints. Therefore, we introduce the concept of a generic PatchIndex structure, which handles exceptions to given constraints and enables database systems to define these approximate constraints. We apply the concept to the environment of distributed databases, providing parallel index creation approaches and optimization techniques for parallel queries using PatchIndexes. Furthermore, we describe heuristics for automatic discovery of PatchIndex candidate columns and prove the performance benefit of using PatchIndexes in our evaluation.
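
The idea of an approximate constraint with recorded exceptions can be sketched in a few lines of Python. This is only an illustration of the general concept for a nearly unique column, not the PatchIndex implementation from the paper; the function names and the sample column are hypothetical.

```python
from collections import Counter


def build_unique_patches(values):
    """Return the row ids that violate uniqueness (the exceptions).

    Every row whose value occurs more than once is recorded, so all
    remaining rows are guaranteed to hold pairwise distinct values.
    """
    counts = Counter(values)
    return {row for row, value in enumerate(values) if counts[value] > 1}


def distinct_with_patches(values, patches):
    """Compute DISTINCT using the patch set.

    Rows outside the patch set are unique by construction and pass
    through untouched; only the patched rows need deduplication.
    """
    clean = [value for row, value in enumerate(values) if row not in patches]
    patched = {values[row] for row in patches}
    return clean + sorted(patched)


# Hypothetical column that is unique except for one duplicated value.
column = [10, 20, 30, 20, 40, 50]
patches = build_unique_patches(column)
result = distinct_with_patches(column, patches)
```

In this sketch the expensive duplicate elimination touches only the two patched rows, which is the kind of saving an approximate uniqueness constraint is meant to enable when exceptions are rare.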


2016 ◽  
Vol 7 (1) ◽  
pp. 4-35 ◽  
Author(s):  
Carlos-Maria Alcover ◽  
Ramón Rico ◽  
William H. Turnley ◽  
Mark C. Bolino

In recent years, scholars have increasingly recognized that the theoretical underpinnings of employee-organization relationships (EOR) are in need of further extension in light of recent organizational changes. In prior research, the study of EOR has been based on social exchange theory, and the psychological contract (PC) has played a central role in understanding this crucial aspect of organizational life. The main objective of this paper is to provide an integration of the existing literature by adopting a multiple-foci exchange relationships approach. Specifically, we looked at identification; the quality of relationships and exchanges with the leader, coworkers, and other organizational agents; justice perceptions involving several organizational sources; and perceived organizational, leader, and coworker support to expand our understanding of the PC. Overall, we advocate a multiple-foci exchange relationships approach that will ultimately enable us to develop a more comprehensive understanding of the complex nature of PCs in 21st century organizations.


Author(s):  
Badr O. Johar ◽  
Surendra M. Gupta

Reverse logistics is a critical topic that has captured the attention of government, private entities and researchers in recent years. This increased concern has been driven by the current set of government regulations, growing public awareness, and attractive economic opportunities. Also, environmentalists have long demanded that Original Equipment Manufacturers (OEMs) be more involved in, and responsible for, their products at the end of their life cycle. However, uncertainty in the quality and quantity of returned items discourages OEMs from participating in such programs. Because of the unique problems associated with reverse logistics activities and their complex nature, numerous studies have been carried out in this field. One crucial area is inventory management of End-of-Life (EOL) products. A take-back program could place a financial burden on the OEM if it is not managed well. Thus, an efficient yet cost-effective system should be implemented to appropriately manage the overwhelming number of returns. Previously, we analyzed the problem under the assumption that the numbers of core products returned and of disassembled parts and subassemblies are known in advance. In this paper, we introduce a probabilistic approach that considers different quality levels for every disassembled component, along with the probabilities of these quality levels given the quality of the returned product. The model uses multi-period stochastic dynamic programming in a disassembly line context to solve the problem and generate the option that maximizes the total profit of the system. A numerical example is given to illustrate the approach. Finally, directions for future research are suggested.


2013 ◽  
Vol 303-306 ◽  
pp. 2203-2206
Author(s):  
Cai Li Fang ◽  
Lin Wu

Test paper analysis is an important part of teaching, but the traditional analysis algorithm is complicated and not comprehensive enough. Based on the indicators of educational measurement, this paper improves the traditional calculation models, and the system assesses the refined calculation model. Teachers can thus obtain accurate and rapid feedback on students' knowledge and on the quality of the test paper. The method thereby provides a convincing basis for improving future teaching and increasing teaching effectiveness.


2021 ◽  
pp. 088626052110642
Author(s):  
Nkiru Nnawulezi ◽  
Jasmine Engleton ◽  
Selima Jumarali ◽  
Samantha Royson ◽  
Christopher Murphy

As formal crisis responders, police are trained in de-escalation tactics that are expected to mitigate intimate partner violence and promote survivor safety. However, the alignment between expected and actual practice of police intervention varies, especially when the survivor does not initiate the call, or when police treat the survivor poorly or provide an undesirable arrest outcome. At best, unsuccessful interventions do not change survivors’ risk level, and at worst, they elevate their risk of experiencing harm. The purpose of this qualitative study was to explore survivors’ perspectives on the process of police intervention, specifically how variations in initiation, quality of engagement, and arrest influence survivors’ safety. Twenty-four women whose partners were in a relationship violence intervention program were recruited to participate in the study. Results showed that many survivors described a range of ongoing, strategic violence perpetrated by their partners that required intervention; yet the complex nature of the violence often extended beyond police capacity. Either survivors called the police themselves, or the calls were initiated externally by neighbors or strangers; some survivors had dual initiations. Whether survivors reported that police used safety practices during the intervention was related to who initiated the police contact. Arrests of abusive partners were inconsistent, and they varied based on the number of previous calls to the police and visible signs of injury. Survivors of color, specifically Black women, self-initiated at higher rates, experienced fewer safety strategies used by police, and had fewer arrests. No matter the outcomes of police intervention, survivors actively engaged in strategies outside of formal systems to protect themselves and their families. Study results imply that police intervention may be ill-suited to support survivors’ safety goals and highlight a need for alternative interventions focused on de-escalation and prevention.

