KLL±: approximate quantile sketches over dynamic datasets

2021 ◽  
Vol 14 (7) ◽  
pp. 1215-1227
Author(s):  
Fuheng Zhao ◽  
Sujaya Maiyya ◽  
Ryan Wiener ◽  
Divyakant Agrawal ◽  
Amr El Abbadi

Recently, the long-standing problem of optimal construction of quantile sketches was resolved by Karnin, Lang, and Liberty using the KLL sketch (FOCS 2016). The KLL algorithm is restricted to online insert operations and supports no delete operations. For many real-world applications, it is necessary to support delete operations: when the data set is updated dynamically, i.e., when data elements are inserted and deleted, the quantile sketch should reflect the changes. In this paper, we propose KLL±, the first quantile approximation algorithm to operate in the bounded deletion model, accounting for both inserts and deletes in a given data stream. KLL± extends the functionality of KLL sketches to support arbitrary updates with small space overhead. The space bound for KLL± is [EQUATION], where ε and δ are constants that determine precision and failure probability, and α bounds the number of deletions with respect to insert operations. The experimental evaluation of KLL± highlights that, with minimal space overhead, KLL± achieves accuracy in quantile approximation comparable to KLL.
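
For intuition about the rank machinery such sketches maintain, the following is a minimal Python illustration of a KLL-style compactor hierarchy, with deletions handled naively by a second sketch for deleted items and ranks estimated by subtraction. It is a sketch under simplifying assumptions, not the KLL± construction from the paper, and it omits the error and space analysis.

```python
import random


class SimpleKLL:
    """Minimal KLL-style quantile sketch (illustrative, not the paper's exact algorithm)."""

    def __init__(self, k=128):
        self.k = k                  # size budget per compactor level
        self.compactors = [[]]      # level 0 holds raw items

    def insert(self, item):
        self.compactors[0].append(item)
        self._compress()

    def _compress(self):
        for level, buf in enumerate(self.compactors):
            if len(buf) > self.k:
                buf.sort()
                # keep a randomly offset half of the sorted buffer and promote it one level up
                offset = random.randint(0, 1)
                promoted = buf[offset::2]
                if level + 1 == len(self.compactors):
                    self.compactors.append([])
                self.compactors[level + 1].extend(promoted)
                self.compactors[level] = []

    def rank(self, x):
        # an item stored at level L stands in for roughly 2**L original items
        r = 0
        for level, buf in enumerate(self.compactors):
            weight = 2 ** level
            r += weight * sum(1 for v in buf if v <= x)
        return r


class TwoSidedSketch:
    """Naive bounded-deletion workaround: one sketch for inserts, one for deletes
    (illustration only; KLL± itself is more space-efficient than this)."""

    def __init__(self, k=128):
        self.ins = SimpleKLL(k)
        self.dels = SimpleKLL(k)

    def insert(self, x):
        self.ins.insert(x)

    def delete(self, x):
        self.dels.insert(x)

    def rank(self, x):
        return self.ins.rank(x) - self.dels.rank(x)


sk = TwoSidedSketch(k=64)
for v in range(10000):
    sk.insert(v)
for v in range(0, 10000, 10):
    sk.delete(v)
print(sk.rank(5000))   # approximate rank of 5000 among surviving items (about 4500)
```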

2016 ◽  
Vol 13 (10) ◽  
pp. 7467-7474
Author(s):  
Venu Madhav Kuthadi ◽  
Rajalakshmi Selvaraj

A data stream is a continuous sequence of data elements generated from a specified source. Mining frequent itemsets in dynamic databases and data streams poses challenges that make the mining task harder than in static databases. Many research works have addressed frequent itemset mining, but these methods share the familiar problems of memory usage and processing time, because data elements in a stream arrive at a rapid rate and the incoming data is unbounded and potentially infinite. Due to the high speed and large volume of incoming data, a frequent itemset mining algorithm must operate within limited memory and processing time. To address these drawbacks of existing methods, a new algorithm named CFIM is proposed in this paper for mining closed frequent itemsets from data streams based on their utility and consistency. During closed frequent itemset mining, a hash table is maintained to check whether a given itemset is closed or not. Computing closed frequent itemsets from the data stream minimizes memory usage and processing time. The performance of the proposed technique is analyzed using a synthetic data set and compared with existing mining techniques.
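
The abstract does not detail CFIM's data structures, but the role of a hash table in the closedness check can be illustrated in a non-streaming setting: an itemset is closed only if no proper superset has the same support, so grouping candidate itemsets by support makes the check cheap. A minimal Python sketch (brute-force enumeration, illustrative only):

```python
from itertools import combinations
from collections import defaultdict


def frequent_itemsets(transactions, min_support):
    """Enumerate frequent itemsets by brute force (small data only)."""
    counts = defaultdict(int)
    for t in transactions:
        t = frozenset(t)
        for r in range(1, len(t) + 1):
            for combo in combinations(sorted(t), r):
                counts[frozenset(combo)] += 1
    return {iset: c for iset, c in counts.items() if c >= min_support}


def closed_frequent_itemsets(frequent):
    """Keep only closed itemsets, using a support-keyed hash table for the closure check."""
    by_support = defaultdict(list)
    for iset, sup in frequent.items():
        by_support[sup].append(iset)
    closed = {}
    for iset, sup in frequent.items():
        # an itemset is closed unless some proper superset has identical support
        if not any(iset < other for other in by_support[sup] if other != iset):
            closed[iset] = sup
    return closed


transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a", "b", "c"}]
freq = frequent_itemsets(transactions, min_support=2)
print(closed_frequent_itemsets(freq))
```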


Author(s):  
Eugenia Rinaldi ◽  
Sylvia Thun

HiGHmed is a German consortium in which eight university hospitals have agreed to cross-institutional data exchange through novel medical informatics solutions. The HiGHmed Use Case Infection Control group has modelled a set of infection-related data in the openEHR format. In order to establish interoperability with the other German consortia belonging to the same national initiative, we mapped the openEHR information to the Fast Healthcare Interoperability Resources (FHIR) format recommended within the initiative. FHIR enables fast exchange of data thanks to the discrete and independent data elements into which information is organized. Furthermore, to explore the possibility of maximizing analysis capabilities for our data set, we subsequently mapped the FHIR elements to the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM). The OMOP data model is designed to support research that identifies and evaluates associations between interventions and the outcomes caused by these interventions. Mapping across standards allows the particular strengths of each to be exploited while establishing and/or maintaining interoperability. This article provides an overview of our experience in mapping infection control related data across three different standards: openEHR, FHIR and OMOP CDM.
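
As a toy illustration of the kind of FHIR-to-OMOP element mapping described here, the snippet below converts a simplified FHIR Observation into a record shaped like the OMOP CDM MEASUREMENT table. The concept lookup is a placeholder; in practice concept IDs come from the OMOP standardized vocabularies, and the resource structure shown is heavily abbreviated.

```python
# Placeholder LOINC-to-OMOP lookup; real concept_ids come from the OMOP vocabulary tables.
LOINC_TO_OMOP_CONCEPT = {"6690-2": 3000905}   # illustrative entry only


def fhir_observation_to_omop_measurement(obs: dict) -> dict:
    """Map a (simplified) FHIR Observation to an OMOP CDM MEASUREMENT-like record."""
    coding = obs["code"]["coding"][0]
    return {
        "person_id": int(obs["subject"]["reference"].split("/")[-1]),
        "measurement_concept_id": LOINC_TO_OMOP_CONCEPT.get(coding["code"], 0),
        "measurement_source_value": coding["code"],
        "measurement_datetime": obs["effectiveDateTime"],
        "value_as_number": obs["valueQuantity"]["value"],
        "unit_source_value": obs["valueQuantity"]["unit"],
    }


observation = {
    "resourceType": "Observation",
    "subject": {"reference": "Patient/42"},
    "code": {"coding": [{"system": "http://loinc.org", "code": "6690-2"}]},
    "effectiveDateTime": "2021-05-01T10:00:00Z",
    "valueQuantity": {"value": 7.2, "unit": "10*3/uL"},
}
print(fhir_observation_to_omop_measurement(observation))
```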


2014 ◽  
Vol 22 (2) ◽  
pp. 409-416 ◽  
Author(s):  
Andy Amster ◽  
Joseph Jentzsch ◽  
Ham Pasupuleti ◽  
K G Subramanian

Abstract Objective To analyze the completeness, computability, and accuracy of specifications for five National Quality Forum-specified (NQF) eMeasures spanning ambulatory, post-discharge, and emergency care within a comprehensive, integrated electronic health record (EHR) environment. Materials and methods To evaluate completeness, we assessed eMeasure logic, data elements, and value sets. To evaluate computability, we assessed the translation of eMeasure algorithms to programmable logic constructs and the availability of EHR data elements to implement specified data criteria, using a de-identified clinical data set from Kaiser Permanente Northwest. To assess accuracy, we compared eMeasure results with those obtained independently by existing audited chart abstraction methods used for external and internal reporting. Results One measure specification was incomplete; missing applicable LOINC codes rendered it non-computable. For three of four computable measures, data availability issues occurred; the literal specification guidance for a data element differed from the physical implementation of the data element in the EHR. In two cases, cross-referencing specified data elements to EHR equivalents allowed variably accurate measure computation. Substantial data availability issues occurred for one of the four computable measures, producing highly inaccurate results. Discussion Existing clinical workflows, documentation, and coding in the EHR were significant barriers to implementing eMeasures as specified. Implementation requires redesigning business or clinical practices and, for one measure, systemic EHR modifications, including clinical text search capabilities. Conclusions Five NQF eMeasures fell short of being machine-consumable specifications. Both clinical domain and technological expertise are required to implement manually intensive steps from data mapping to text mining to EHR-specific eMeasure implementation.
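
A simple way to picture the completeness and computability assessment is a check of each eMeasure data criterion's value set against the codes actually available in the EHR extract. The sketch below is illustrative only; the criterion codes and classification labels are hypothetical and are not taken from the NQF specifications.

```python
def assess_criterion(value_set, ehr_codes):
    """Classify a single eMeasure data criterion against an EHR extract (illustration)."""
    if not value_set:
        return "non-computable: value set missing or empty"
    matched = value_set & ehr_codes
    if not matched:
        return "computable, but no matching data elements found in the EHR"
    return f"computable: {len(matched)} of {len(value_set)} codes present"


# Hypothetical criterion (HbA1c LOINC codes) and a hypothetical EHR extract
criterion_loinc = {"4548-4", "17856-6"}
ehr_extract = {"4548-4", "2345-7"}
print(assess_criterion(criterion_loinc, ehr_extract))
```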


2017 ◽  
Author(s):  
Stefan Nowak ◽  
Johannes Neidhart ◽  
Jonas Rzezonka ◽  
Ivan G. Szendro ◽  
Rahul Marathe ◽  
...  

A long-standing problem in ageing research is to understand how different factors contributing to longevity should be expected to act in combination under the assumption that they are independent. Standard epistasis analysis compares the extension of mean lifespan achieved by a combination of interventions to the prediction under an additive or multiplicative null model, but neither model is fundamentally justified. Moreover, the target of longevity interventions is not mean lifespan but the entire survival curve. Here we formulate superposition principles that predict the survival curve resulting from a combination of two interventions based on the survival curves of the individual treatments, and quantify epistasis as the deviation from this prediction. We test the method on a published data set comprising survival curves for all combinations of four different longevity interventions in Caenorhabditis elegans. We find that epistasis is generally weak even when the standard analysis indicates otherwise.
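
The deviation-from-prediction idea can be illustrated with a multiplicative independence null applied pointwise to survival fractions, S_AB/S_0 = (S_A/S_0)(S_B/S_0). This is an analogue of the multiplicative model mentioned above, not the superposition principle derived in the paper; curves and numbers below are invented for the example.

```python
import numpy as np


def multiplicative_null(s_control, s_a, s_b):
    """Predict the combined-treatment survival curve under a multiplicative null
    on survival fractions: S_AB/S_0 = (S_A/S_0) * (S_B/S_0).
    All inputs are survival probabilities evaluated on a common time grid."""
    s_control = np.clip(np.asarray(s_control, dtype=float), 1e-12, None)
    prediction = np.asarray(s_a) * np.asarray(s_b) / s_control
    return np.clip(prediction, 0.0, 1.0)


def epistasis(s_observed_ab, s_predicted_ab):
    """Quantify epistasis as the maximum absolute deviation between the observed and
    predicted combined survival curves."""
    return float(np.max(np.abs(np.asarray(s_observed_ab) - np.asarray(s_predicted_ab))))


# Toy exponential survival curves on a shared time grid
t = np.linspace(0, 40, 81)
s0, sa, sb = np.exp(-t / 15), np.exp(-t / 20), np.exp(-t / 22)
s_ab_observed = np.exp(-t / 28)
print(epistasis(s_ab_observed, multiplicative_null(s0, sa, sb)))
```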


Author(s):  
Joseph Travers ◽  
Crystal Campitelli ◽  
Richard Light ◽  
Eric De Sa ◽  
Julie Stabile ◽  
...  

Introduction The professional regulation sector is moving toward risk-informed approaches that require high quality data. A key component of a corporate 2017 Data Strategy is the implementation of a data inventory and mapping project to catalogue, centralize, document and govern data assets that support regulatory decisions, programs and operations. Objectives and Approach In a data-rich organization, the goals of the data inventory are to: enhance authoritative data that support programs; identify data duplications and gaps; identify data sources, owners and users; and apply consistent data management and standards organizationally. Routinely used data assets outside the large enterprise workflow system (Excel/Word files, databases, paper collections) were catalogued. Using data governance principles and a facilitated questionnaire, departmental data stewards were interviewed about the data they generate. Questions covered data purpose, sources, types, formats and owners, retention rates, analytical products, gaps and visions for a desired data state. A data mapping methodology highlighted data set and variable connections within and across departments. Results To date, over 40 staff members in 10 departments have been identified as data content experts. In addition to data in the corporate enterprise system, over 80 unique datasets were identified. In one large department, over 2,000 data elements across 26 datasets were inventoried. Data mapping analysis revealed thematic data domains, including member demographics, outcomes, certifications, tracking and financial data, collected and held in multiple formats (Microsoft Access, Excel, Word, SPSS, PDF, e-mails and paper documents). While 72% of the data elements were formatted numerically, approximately 8% were free text. Significant data redundancies across staff members and departments were revealed, as well as unstandardized variable naming conventions. Gap analysis highlighted the need for standardized, electronic data where not available, and for data management training. Conclusion/Implications Customized data mapping reports to data users will facilitate the development of standardized departmental data hubs that link to a central data repository, enabling seamless organization-wide analytics, improvements in current data management practices and greater data collaboration, with the ultimate goal of supporting risk-informed approaches.
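
As an illustration of what such an inventory makes possible, the short Python sketch below stores hypothetical questionnaire results as inventory records and surfaces data elements that recur across departments, i.e. candidate redundancies. All department, dataset and element names are invented for the example.

```python
from collections import defaultdict

# Hypothetical inventory records produced by the facilitated questionnaire
inventory = [
    {"department": "Registration", "dataset": "member_demographics.xlsx",
     "element": "member_id", "format": "numeric"},
    {"department": "Registration", "dataset": "member_demographics.xlsx",
     "element": "date_of_birth", "format": "date"},
    {"department": "Quality Assurance", "dataset": "audit_outcomes.accdb",
     "element": "member_id", "format": "free text"},
]


def find_redundancies(records):
    """Group inventoried elements by name to surface duplication across departments."""
    by_element = defaultdict(list)
    for rec in records:
        by_element[rec["element"]].append((rec["department"], rec["dataset"], rec["format"]))
    return {name: locations for name, locations in by_element.items() if len(locations) > 1}


for element, locations in find_redundancies(inventory).items():
    print(element, "->", locations)
```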


Author(s):  
Jia-Ling Koh ◽  
Shu-Ning Shin ◽  
Yuan-Bin Don

Recently, the data stream, an unbounded sequence of data elements generated at a rapid rate, has provided a dynamic environment for collecting data sources. The knowledge embedded in a data stream is likely to change quickly as time goes by. Therefore, catching the recent trend of the data is an important issue when mining frequent itemsets over data streams. Although the sliding window model offers a good solution to this problem, in the traditional approach the occurrence information of patterns within a sliding window has to be maintained completely. For estimating the approximate supports of patterns within a sliding window, the frequency changing point (FCP) method is proposed for monitoring the recent occurrences of itemsets over a data stream. In addition to a basic design proposed under the assumption that exactly one transaction arrives at each time point, the FCP method is extended to maintain recent patterns over a data stream where a block of a variable number of transactions (including zero or more transactions) is input within a fixed time unit. Accordingly, the recently frequent itemsets or representative patterns are discovered approximately from the maintained structure. Experimental studies demonstrate that the proposed algorithms achieve high true positive rates and guarantee no false dismissals in the results yielded; a theoretical analysis is provided for this guarantee. In addition, the authors' approach outperforms the previously proposed method by significantly reducing run-time memory usage.
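
The FCP structure itself is not described in this abstract. As a simplified point of reference, the sketch below maintains exact per-item counts over the most recent W transactions of a stream with one transaction per time point, which corresponds to the complete-maintenance baseline that approximate methods aim to beat on memory.

```python
from collections import Counter, deque


class SlidingWindowItemCounts:
    """Exact per-item counts over the most recent `window_size` transactions
    (a simplified stand-in for an approximate sliding-window structure)."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.window = deque()
        self.counts = Counter()

    def add_transaction(self, transaction):
        transaction = frozenset(transaction)
        self.window.append(transaction)
        self.counts.update(transaction)
        if len(self.window) > self.window_size:
            expired = self.window.popleft()      # oldest transaction leaves the window
            self.counts.subtract(expired)

    def frequent_items(self, min_support):
        return {item: c for item, c in self.counts.items() if c >= min_support}


sw = SlidingWindowItemCounts(window_size=3)
for t in [{"a", "b"}, {"b", "c"}, {"a", "b", "c"}, {"c"}]:
    sw.add_transaction(t)
print(sw.frequent_items(min_support=2))   # counts reflect only the last 3 transactions
```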


2021 ◽  
Vol 15 (02) ◽  
pp. 33-41
Author(s):  
Wendy Osborn

In this paper, the problem of query processing in spatial data streams is explored, with a focus on the spatial join operation. Although the spatial join has been utilized in many proposed centralized and distributed query processing strategies, its application to spatial data streams has received very little attention. One identified limitation of existing strategies is that a bounded region of space (i.e., spatial extent) from which the spatial objects are generated needs to be known in advance; however, this information may not be available. Therefore, two strategies for spatial data stream join processing are proposed in which the spatial extent of the spatial object stream does not need to be known in advance. Both strategies estimate the common region shared by two or more spatial data streams in order to process the spatial join. An evaluation of both strategies includes a comparison with a recently proposed approach in which the spatial extent of the data set is known. Experimental results show that one of the strategies performs very well at estimating the common region of space using only the incoming objects on the spatial data streams. Other limitations of this work are also identified.
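
A minimal version of the "estimate the common region from incoming objects only" idea: grow a bounding rectangle per stream as objects arrive, intersect the two rectangles, and join only objects falling inside that intersection. The Python sketch below uses points and a simple distance predicate; it illustrates the general strategy, not the algorithms evaluated in the paper.

```python
def update_mbr(mbr, point):
    """Grow a minimum bounding rectangle (xmin, ymin, xmax, ymax) to include a point."""
    x, y = point
    if mbr is None:
        return (x, y, x, y)
    xmin, ymin, xmax, ymax = mbr
    return (min(xmin, x), min(ymin, y), max(xmax, x), max(ymax, y))


def intersect(a, b):
    """Intersection of two MBRs, or None if either is undefined or they are disjoint."""
    if a is None or b is None:
        return None
    xmin, ymin = max(a[0], b[0]), max(a[1], b[1])
    xmax, ymax = min(a[2], b[2]), min(a[3], b[3])
    return (xmin, ymin, xmax, ymax) if xmin <= xmax and ymin <= ymax else None


def contains(mbr, point):
    return mbr is not None and mbr[0] <= point[0] <= mbr[2] and mbr[1] <= point[1] <= mbr[3]


def near(p, q, d):
    return abs(p[0] - q[0]) <= d and abs(p[1] - q[1]) <= d


def stream_join(stream_a, stream_b, distance=1.0):
    """Join two interleaved point streams, restricting work to the estimated common region."""
    mbr_a = mbr_b = None
    buf_a, buf_b, results = [], [], []
    for pa, pb in zip(stream_a, stream_b):
        mbr_a, mbr_b = update_mbr(mbr_a, pa), update_mbr(mbr_b, pb)
        common = intersect(mbr_a, mbr_b)       # current estimate of the shared region
        for qb in buf_b:                       # new a-object against buffered b-objects
            if contains(common, pa) and contains(common, qb) and near(pa, qb, distance):
                results.append((pa, qb))
        buf_a.append(pa)
        for qa in buf_a:                       # new b-object against buffered a-objects
            if contains(common, qa) and contains(common, pb) and near(qa, pb, distance):
                results.append((qa, pb))
        buf_b.append(pb)
    return results


a_stream = [(1, 1), (2, 2), (6, 6)]
b_stream = [(2, 1), (8, 8), (2.5, 2.2)]
print(stream_join(a_stream, b_stream, distance=1.0))
```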


Author(s):  
Piotr Kulczycki ◽  
Małgorzata Charytanowicz

A complete gradient clustering algorithm formed with kernel estimators

The aim of this paper is to provide a gradient clustering algorithm in its complete form, suitable for direct use without requiring deeper statistical knowledge. The values of all parameters are effectively calculated using optimizing procedures. Moreover, an illustrative analysis of the meaning of particular parameters is shown, followed by the effects resulting from possible modifications with respect to their primarily assigned optimal values. The proposed algorithm does not demand strict assumptions regarding the desired number of clusters, which allows the obtained number to be better suited to the real data structure. Moreover, a feature specific to it is the possibility to influence the proportion between the number of clusters in areas where data elements are dense as opposed to their sparse regions. Finally, by detecting one-element clusters, the algorithm allows atypical elements to be identified, enabling their elimination or possible assignment to bigger clusters, thus increasing the homogeneity of the data set.
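
For readers unfamiliar with gradient (mean-shift style) clustering with kernel estimators, the sketch below shows the basic mechanism with a fixed Gaussian bandwidth: each point climbs the estimated density, points converging to the same mode form one cluster, and one-element clusters flag atypical elements. The paper's procedures for effectively calculating all parameters are not reproduced here; the bandwidth and tolerances are arbitrary example values.

```python
import numpy as np


def gradient_cluster(data, bandwidth=0.5, steps=50, merge_tol=1e-2):
    """Mean-shift style gradient clustering with a Gaussian kernel estimator."""
    data = np.asarray(data, dtype=float)
    points = data.copy()
    for _ in range(steps):
        # squared distances from current positions to all data points
        d2 = ((points[:, None, :] - data[None, :, :]) ** 2).sum(-1)
        w = np.exp(-d2 / (2 * bandwidth ** 2))
        points = (w @ data) / w.sum(axis=1, keepdims=True)   # kernel-weighted mean (shift step)
    # merge converged positions that ended up close together into cluster labels
    labels = -np.ones(len(points), dtype=int)
    modes = []
    for i, p in enumerate(points):
        for m, mode in enumerate(modes):
            if np.linalg.norm(p - mode) < merge_tol:
                labels[i] = m
                break
        else:
            modes.append(p)
            labels[i] = len(modes) - 1
    return labels


data = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [9.0, 0.0]]
print(gradient_cluster(data))   # the isolated point at (9, 0) forms a one-element cluster
```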


CJEM ◽  
2001 ◽  
Vol 3 (04) ◽  
pp. 277-283 ◽  
Author(s):  
Grant Innes ◽  
Michael Murray ◽  
Eric Grafstein

Abstract Canadian hospitals gather few emergency department (ED) data, and most cannot track their case mix, care processes, utilization or outcomes. A standard national ED data set would enhance clinical care, quality improvement and research at a local, regional and national level. The Canadian Association of Emergency Physicians, the National Emergency Nurses Affiliation and l’Association des médecins d’urgence du Québec established a joint working group whose objective was to develop a standard national ED data set that meets the information needs of Canadian EDs. The working group reviewed data elements derived from Australia’s Victorian Emergency Minimum Dataset, the US Data Elements for Emergency Department Systems document, the Ontario Hospital Emergency Department Working Group data set and the Canadian Institute for Health Information’s National Ambulatory Care Reporting System data set. By consensus, the group defined each element as mandatory, preferred or optional, and modified data definitions to increase their relevance to the ED context. The working group identified 69 mandatory elements, 5 preferred elements and 29 optional elements representing demographic, process, clinical and utilization measures. The Canadian Emergency Department Information System data set is a feasible, relevant ED data set developed by emergency physicians and nurses and tailored to the needs of Canadian EDs. If widely adopted, it represents an important step toward a national ED information system that will enable regional, provincial and national comparisons and enhance clinical care, quality improvement and research applications in both rural and urban settings.
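
As a small illustration of how mandatory, preferred and optional requirement levels might be applied in practice, the sketch below checks an ED visit record for missing mandatory elements. The element names and levels shown are invented for the example and are not the actual CEDIS definitions.

```python
# Hypothetical subset of ED data elements with requirement levels (illustrative names only)
DATA_ELEMENTS = {
    "triage_level": "mandatory",
    "arrival_datetime": "mandatory",
    "chief_complaint": "mandatory",
    "referral_source": "preferred",
    "ambulance_service_id": "optional",
}


def missing_mandatory(record):
    """Return the mandatory elements absent (or empty) in an ED visit record."""
    return [name for name, level in DATA_ELEMENTS.items()
            if level == "mandatory" and not record.get(name)]


visit = {"triage_level": 3, "arrival_datetime": "2001-04-12T14:32", "referral_source": "self"}
print(missing_mandatory(visit))   # -> ['chief_complaint']
```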


2015 ◽  
Vol 54 (05) ◽  
pp. 455-460 ◽  
Author(s):  
M. Ganzinger ◽  
T. Muley ◽  
M. Thomas ◽  
P. Knaup ◽  
D. Firnkorn

Summary Objective: Joint data analysis is a key requirement in medical research networks. Data are available in heterogeneous formats at each network partner and their harmonization is often rather complex. The objective of our paper is to provide a generic approach for the harmonization process in research networks. We applied the process when harmonizing data from three sites for the Lung Cancer Phenotype Database within the German Center for Lung Research. Methods: We developed a spreadsheet-based solution as a tool to support the harmonization process for lung cancer data and a data integration procedure based on Talend Open Studio. Results: The harmonization process consists of eight steps describing a systematic approach for defining and reviewing source data elements and standardizing common data elements. The steps for defining common data elements and harmonizing them with local data definitions are repeated until consensus is reached. Application of this process for building the phenotype database led to a common basic data set on lung cancer with 285 structured parameters. The Lung Cancer Phenotype Database was realized as an i2b2 research data warehouse. Conclusion: Data harmonization is a challenging task requiring informatics skills as well as domain knowledge. Our approach facilitates data harmonization by providing guidance through a uniform process that can be applied in a wide range of projects.
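
The mapping step at the heart of such harmonization can be pictured as a table that translates each site's local element names and units into the common data elements, as in the Python sketch below. The sites, element names and conversion factors are invented for the example and are not taken from the Lung Cancer Phenotype Database.

```python
# Hypothetical mapping from site-specific element names/units to common data elements:
# local name -> (common name, factor converting the local unit to the common unit)
MAPPING = {
    "site_a": {"tumour_size_mm": ("tumor_size", 1.0)},      # already millimetres
    "site_b": {"tumor_diameter_cm": ("tumor_size", 10.0)},   # centimetres -> millimetres
}


def harmonize(site, record):
    """Translate one site's record into the common basic data set (illustration only)."""
    harmonized = {}
    for local_name, value in record.items():
        if local_name in MAPPING[site]:
            common_name, factor = MAPPING[site][local_name]
            harmonized[common_name] = value * factor
    return harmonized


print(harmonize("site_a", {"tumour_size_mm": 25}))
print(harmonize("site_b", {"tumor_diameter_cm": 2.5}))   # both yield tumor_size = 25.0 mm
```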

