Integrating data from disparate data systems for improved HIV reporting: Lessons learned

Objective: To assess the integration process of HIV data from disparate sources for reporting HIV prevention metrics in Scott County, Indiana.

Introduction: In 2015, the Indiana State Department of Health (ISDH) responded to a large HIV outbreak among persons who inject drugs (PWID) in Scott County [1]. Information to manage the public health response to this event and its aftermath included data from multiple sources such as surveillance, HIV testing, contact tracing, medical care, and HIV prevention activities. Each dataset was managed separately and had been tailored to the relevant HIV program area's needs, which is a typical practice for health departments. Currently, these disparate data sources are integrated manually, which makes the resulting dataset susceptible to inconsistent and redundant data. During the outbreak investigation, local and state health department staff had difficulty obtaining timely and accurate data to monitor and report progress. ISDH initiated efforts to integrate these disparate HIV data sources to better track HIV prevention metrics statewide, to support decision making and policies, and to facilitate a more rapid response to future HIV-related investigations. The Centers for Disease Control and Prevention (CDC), through its Info-Aid mechanism, is providing technical assistance to support assessment of the ISDH data integration process. The project is expected to lead to the development of a dashboard prototype that will aggregate and improve critical data reporting to monitor the status of HIV prevention in Scott County.

Methods: We assessed six different HIV-related datasets in addition to the state-level integrated HIV dataset developed to report HIV monitoring and prevention metrics. We conducted site visits to the ISDH and Scott County to assess the integration process. We also conducted key informant interviews and focus group discussions with data managers, analysts, program managers, and epidemiologists using HIV data systems at ISDH, Scott County, and CDC. In addition, we reviewed documentation, including summary reports of the HIV outbreak, workflows, a business process analysis, and information gathered during the site visits on the operations, processes, and attributes of HIV data sources. We then summarized the information flow, including the data collection process, reporting, and analysis at federal, state, and county levels.

Results: We have developed a list of lessons learned that can be translated for use in any state-level jurisdiction engaged in HIV prevention monitoring and reporting:

Standardization of data formats: The disparate data sources storing HIV-related information were developed independently on different platforms using different architectures; they were not necessarily designed to link and exchange data. Hence, these systems could not seamlessly interact with each other, posing challenges when rapid data linkage was needed. To better manage unstructured data coming from disparate data sources and improve data integration efforts, we recommend standardization of data formats, unique identifiers for registered individuals, and coding across data systems. Use of standard operating procedures can streamline data flow and facilitate automated creation of integrated datasets. This approach may be helpful for future integration efforts in other healthcare domains.

Data integration process: Manually integrating data is time intensive, increases workload, and poses significant risk of human error in data compilation. Hence, it may compromise data quality and the accuracy of HIV prevention metrics used by decision-makers. We propose an automated integration process using an extract, transform, and load (ETL) method that extracts HIV-related data from disparate data sources, transforms it to fit the prevention metrics reporting needs, and loads it into a state-level integrated HIV dataset or database. This approach can drastically decrease dependency on manual methods and help avoid data compilation errors.

Dashboard development: Major challenges in the process of integrating HIV-related data included disparate data sources, compromised data quality, and the lack of standard metrics for some of the HIV-related metrics of interest. Despite these challenges to data integration, creation of a dashboard to track HIV prevention metrics is feasible. Integrating data is a critical part of developing an HIV dashboard that can generate real-time metrics without creating additional burden for health department staff, provided that manual integration is no longer needed.

Stakeholder participation: Due to the immediate need for outbreak response, involvement of stakeholders at all levels was limited. Active stakeholder engagement in this process is essential. The stakeholders' interest and participation can be improved by helping them understand the value of each other's data and by providing regular feedback about their data and its best use in public health interventions.

Conclusions: This assessment highlighted the importance of standardizing data formats, coding across systems for HIV data, and the use of unique identifiers to store individuals' information across data systems. Promoting stakeholder understanding of the value and best use of their data is also essential in improving data integration efforts. The results of this assessment offer an opportunity to learn and apply these lessons to improve future public health informatics initiatives, including but not limited to HIV, at any state-level jurisdiction.
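The automated extract, transform, and load (ETL) process proposed in the abstract above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two source layouts, the field names, the sample records, and the `events` table are hypothetical and are not taken from the ISDH systems. It shows only the general idea of mapping differently formatted sources onto one schema with a standardized identifier and date format, then loading them so that repeated runs do not create redundant rows.

```python
import sqlite3
from datetime import datetime

# Hypothetical extracts from two source systems with different layouts.
SURVEILLANCE = [
    {"person_id": "A001", "dx_date": "2015-03-02", "county": "Scott"},
    {"person_id": "A002", "dx_date": "2015-04-11", "county": "scott "},
]
TESTING = [
    {"ClientID": "a001", "TestDate": "03/15/2015", "County": "SCOTT"},
    {"ClientID": "A003", "TestDate": "05/01/2015", "County": "Scott"},
]


def transform(record, id_key, date_key, date_format, county_key, source):
    """Map one source record onto the shared (assumed) schema."""
    return (
        record[id_key].strip().upper(),  # standardized unique identifier
        datetime.strptime(record[date_key], date_format).date().isoformat(),
        record[county_key].strip().title(),  # standardized county coding
        source,
    )


def run_etl(conn):
    """Extract from both sources, transform, and load into one table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "person_id TEXT, event_date TEXT, county TEXT, source TEXT, "
        "UNIQUE (person_id, event_date, source))"
    )
    rows = [
        transform(r, "person_id", "dx_date", "%Y-%m-%d", "county", "surveillance")
        for r in SURVEILLANCE
    ]
    rows += [
        transform(r, "ClientID", "TestDate", "%m/%d/%Y", "County", "testing")
        for r in TESTING
    ]
    # INSERT OR IGNORE skips rows that violate the UNIQUE constraint,
    # so re-running the load does not duplicate records.
    conn.executemany("INSERT OR IGNORE INTO events VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
run_etl(conn)
run_etl(conn)  # second run adds nothing
event_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
person_count = conn.execute(
    "SELECT COUNT(DISTINCT person_id) FROM events"
).fetchone()[0]
print(event_count, person_count)
```

In this sketch the shared identifier is what lets a surveillance record and a testing record be attributed to the same person, which mirrors the abstract's recommendation of unique identifiers and common coding across systems. A production pipeline would also need record linkage for people without a shared identifier, which is not shown here.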


Integration of Data Sources through Data Mining

Encyclopedia of Data Warehousing and Mining ◽

10.4018/978-1-59140-557-3.ch118 ◽

2011 ◽

pp. 625-629

Author(s):

Andreas Koeller

Keyword(s):

Data Mining ◽

Data Integration ◽

Data Transformation ◽

Data Sources ◽

Integration Process

Integration of data sources refers to the task of developing a common schema as well as data transformation solutions for a number of data sources with related content. The large number and size of modern data sources make manual approaches to integration increasingly impractical. Data mining can help to partially or fully automate the data integration process.


Identifying At-Risk Communities and Key Vulnerability Indicators in the COVID-19 Pandemic

10.1101/2021.09.19.21263805 ◽

2021 ◽

Author(s):

Savannah Thais ◽

Shaine Leibowitz ◽

Alejandra Rios Gutierrez ◽

Alexandra Passarelli ◽

Stephanie Santo ◽

...

Keyword(s):

Public Health ◽

Health Inequities ◽

Data Sources ◽

Risk Assessments ◽

The Public ◽

Community Needs ◽

Vulnerability Indicators ◽

Disparate Data ◽

Risk Communities ◽

Robust Integration

Abstract Throughout the COVID-19 pandemic, certain communities have been disproportionately exposed to detrimental health outcomes and socioeconomic injuries. Quantifying community needs is crucial for identifying testing and service deserts, effectively allocating resources, and informing funding and decision making. We have constructed research-driven metrics measuring the public health and economic impacts of COVID-19 on vulnerable populations. In this work we further examine and validate these indices by training supervised models to predict proxy outcomes and analyzing the feature importances to identify gaps in our original metric design. The indices analyzed in this work are unique among COVID-19 risk assessments due to their robust integration of disparate data sources. Together, they enable more effective responses to COVID-19-driven health inequities.


An ontology-based documentation of data discovery and integration process in cancer outcomes research

BMC Medical Informatics and Decision Making ◽

10.1186/s12911-020-01270-3 ◽

2020 ◽

Vol 20 (S4) ◽

Cited By ~ 1

Author(s):

Hansi Zhang ◽

Yi Guo ◽

Mattia Prosperi ◽

Jiang Bian

Keyword(s):

Risk Factors ◽

Data Integration ◽

Outcomes Research ◽

Reporting Guideline ◽

Data Sources ◽

Integration Process ◽

Cancer Outcomes ◽

Source Selection ◽

Data Source ◽

Multi Level

Abstract Background To reduce cancer mortality and improve cancer outcomes, it is critical to understand the various cancer risk factors (RFs) across different domains (e.g., genetic, environmental, and behavioral risk factors) and levels (e.g., individual, interpersonal, and community levels). However, prior research on RFs of cancer outcomes has primarily focused on individual-level RFs due to the lack of integrated datasets that contain multi-level, multi-domain RFs. Further, the lack of consensus and proper guidance on systematically identifying RFs also increases the difficulty of RF selection from heterogeneous data sources in a multi-level integrative data analysis (mIDA) study. More importantly, as mIDA studies require integrating heterogeneous data sources, the data integration processes in the limited number of existing mIDA studies are inconsistently performed and poorly documented, thus threatening transparency and reproducibility. Methods Informed by the National Institute on Minority Health and Health Disparities (NIMHD) research framework, we (1) reviewed existing reporting guidelines from the Enhancing the QUAlity and Transparency Of health Research (EQUATOR) network and (2) developed a theory-driven reporting guideline to guide the RF variable selection, data source selection, and data integration process. Then, we developed an ontology to standardize the documentation of the RF selection and data integration process in mIDA studies. Results We summarized the review results and created a reporting guideline, ATTEST, for reporting the variable selection and data source selection and integration process. We provided an ATTEST checklist to help researchers annotate and clearly document each step of their mIDA studies to ensure transparency and reproducibility. We used ATTEST to report two mIDA case studies and further transformed the annotation results into semantic triples, so that the relationships among variables, data sources, and integration processes are explicitly standardized and modeled using the classes and properties from OD-ATTEST. Conclusion Our ontology-based reporting guideline solves some key challenges in current mIDA studies for cancer outcomes research by providing (1) theory-driven guidance for multi-level and multi-domain RF variable and data source selection; and (2) a standardized documentation of the data selection and integration processes powered by an ontology, and thus a way to enable sharing of mIDA study reports among researchers.


An Ontology-based Approach to Guide and Document Variable and Data Source Selection and Data Integration Process to Support Integrative Data Analysis in Cancer Outcomes Research

10.1101/2020.05.28.20115907 ◽

2020 ◽

Cited By ~ 1

Author(s):

Hansi Zhang ◽

Yi Guo ◽

Jiang Bian

Keyword(s):

Risk Factors ◽

Data Analysis ◽

Data Integration ◽

Reporting Guideline ◽

Data Sources ◽

Integration Process ◽

Cancer Outcomes ◽

Source Selection ◽

Data Source ◽

Multi Level

Abstract Background To reduce cancer mortality and improve cancer outcomes, it is critical to understand the various cancer risk factors (RFs) across different domains (e.g., genetic, environmental, and behavioral risk factors) and levels (e.g., individual, interpersonal, and community levels). However, prior research on RFs of cancer outcomes has primarily focused on individual-level RFs due to the lack of integrated datasets that contain multi-level, multi-domain RFs. Further, the lack of consensus and proper guidance on systematically identifying RFs also increases the difficulty of RF selection from heterogeneous data sources in a multi-level integrative data analysis (mIDA) study. More importantly, as mIDA studies require integrating heterogeneous data sources, the data integration processes in the limited number of existing mIDA studies are inconsistently performed and poorly documented, thus threatening transparency and reproducibility. Methods Informed by the National Institute on Minority Health and Health Disparities (NIMHD) research framework, we (1) reviewed existing reporting guidelines from the Enhancing the QUAlity and Transparency Of health Research (EQUATOR) network and (2) developed a theory-driven reporting guideline to guide the RF variable selection, data source selection, and data integration process. Then, we developed an ontology to standardize the documentation of the RF selection and data integration process in mIDA studies. Results We summarized the review results and created a reporting guideline, ATTEST, for reporting the variable selection and data source selection and integration process. We provided an ATTEST checklist to help researchers annotate and clearly document each step of their mIDA studies to ensure transparency and reproducibility. We used ATTEST to report two mIDA case studies and further transformed the annotation results into semantic triples, so that the relationships among variables, data sources, and integration processes are explicitly standardized and modeled using the classes and properties from OD-ATTEST. Conclusion Our ontology-based reporting guideline solves some key challenges in current mIDA studies for cancer outcomes research by providing (1) theory-driven guidance for multi-level and multi-domain RF variable and data source selection; and (2) a standardized documentation of the data selection and integration processes powered by an ontology, and thus a way to enable sharing of mIDA study reports among researchers.


Integration of Data Sources through Data Mining

Data Warehousing and Mining ◽

10.4018/978-1-59904-951-9.ch023 ◽

2008 ◽

pp. 350-355

Author(s):

Andreas Koeller

Keyword(s):

Data Mining ◽

Data Integration ◽

Data Transformation ◽

Data Sources ◽

Integration Process


Integration of Data Sources through Data Mining

Encyclopedia of Data Warehousing and Mining, Second Edition ◽

10.4018/978-1-60566-010-3.ch163 ◽

2011 ◽

pp. 1053-1057

Author(s):

Andreas Koeller

Keyword(s):

Data Mining ◽

Data Integration ◽

Data Transformation ◽

Data Sources ◽

Integration Process


Preparing Public Health for New HIV Prevention Technologies: A Road Map for Comprehensive Action in Canada

PsycEXTRA Dataset ◽

10.1037/e507502013-115 ◽

2012 ◽

Author(s):

Marnie Davidson

Keyword(s):

Public Health ◽

HIV Prevention ◽

Road Map


Evaluating Public Health Data Systems: A Practical Approach

PsycEXTRA Dataset ◽

10.1037/e583982012-001 ◽

1995 ◽

Author(s):

Phyllis Blood

Keyword(s):

Public Health ◽

Health Data ◽

Practical Approach ◽

Data Systems ◽

Public Health Data


Can increases in Twitter posts predict increases in cumulative incidence of COVID-19 in the United States? Evidence that social media can inform epidemic surveillance. (Preprint)

10.2196/preprints.21132 ◽

2020 ◽

Author(s):

Ruoyan Sun ◽

Henna Budhwani

Keyword(s):

Public Health ◽

United States ◽

Cumulative Incidence ◽

Hot Spots ◽

Adult Population ◽

State Level ◽

The United States ◽

Education Attainment ◽

Public Health Systems ◽

Per Capita

BACKGROUND Though public health systems are responding rapidly to the COVID-19 pandemic, outcomes from publicly available, crowd-sourced big data may help to identify hot spots and prioritize equipment allocation and staffing, while also informing health policy related to “shelter in place” and social distancing recommendations. OBJECTIVE To assess whether the rising state-level prevalence of COVID-19-related posts on Twitter (tweets) is predictive of state-level cumulative COVID-19 incidence after controlling for socio-economic characteristics. METHODS We identified and extracted COVID-19-related tweets from January 21st to March 7th (2020) across all 50 states (N = 7,427,057). Tweets were combined with state-level characteristics and confirmed COVID-19 cases to determine the association between public commentary and cumulative incidence. RESULTS The cumulative incidence of COVID-19 cases varied significantly across states. Ratio of tweet increase (p=0.03), number of physicians per 1,000 population (p=0.01), education attainment (p=0.006), income per capita (p=0.002), and percentage of adult population (p=0.003) were positively associated with cumulative incidence. Ratio of tweet increase was significantly associated with the logarithm of cumulative incidence (p=0.06) with a coefficient of 0.26. CONCLUSIONS An increase in the prevalence of state-level tweets was predictive of an increase in COVID-19 diagnoses, providing evidence that Twitter can be a valuable surveillance tool for public health.


Methodology of Big Data Integration from A Priori Unknown Heterogeneous Data Sources

Proceedings of the 2018 2nd International Conference on Computer Science and Artificial Intelligence - CSAI '18 ◽

10.1145/3297156.3297249 ◽

2018 ◽

Author(s):

Alexey Samoylov ◽

Nikolay Sergeev ◽

Margarita Kucherova ◽

Boris Denisov

Keyword(s):

Big Data ◽

Data Integration ◽

A Priori ◽

Heterogeneous Data ◽

Data Sources ◽

Heterogeneous Data Sources
