A Generic Data Harmonization Process for Cross-linked Research and Network Interaction

2015 ◽  
Vol 54 (05) ◽  
pp. 455-460 ◽  
Author(s):  
M. Ganzinger ◽  
T. Muley ◽  
M. Thomas ◽  
P. Knaup ◽  
D. Firnkorn

Summary Objective: Joint data analysis is a key requirement in medical research networks. Data are available in heterogeneous formats at each network partner and their harmonization is often rather complex. The objective of our paper is to provide a generic approach for the harmonization process in research networks. We applied the process when harmonizing data from three sites for the Lung Cancer Phenotype Database within the German Center for Lung Research. Methods: We developed a spreadsheet-based solution as a tool to support the harmonization process for lung cancer data, together with a data integration procedure based on Talend Open Studio. Results: The harmonization process consists of eight steps describing a systematic approach for defining and reviewing source data elements and standardizing common data elements. The steps for defining common data elements and harmonizing them with local data definitions are repeated until consensus is reached. Applying this process to build the phenotype database led to a common basic data set on lung cancer with 285 structured parameters. The Lung Cancer Phenotype Database was realized as an i2b2 research data warehouse. Conclusion: Data harmonization is a challenging task requiring informatics skills as well as domain knowledge. Our approach facilitates data harmonization by providing guidance through a uniform process that can be applied in a wide range of projects.
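The core of such a process is the repeated mapping of each site's local data elements onto the agreed common data elements. Below is a minimal Python sketch of that mapping step; it is not the authors' spreadsheet tool or Talend procedure, and all column names, codes, and value maps are hypothetical illustrations.

```python
# A minimal sketch of mapping local data elements onto common data
# elements (CDEs). Element names and value codes are hypothetical.
import pandas as pd

# Hypothetical per-site mapping: local column name -> CDE name,
# plus a recoding of local value codes onto the common code list.
SITE_A_COLUMNS = {"tum_stage": "tnm_t_stage", "histo": "histology"}
SITE_A_VALUES = {"histology": {"1": "adenocarcinoma", "2": "squamous"}}

def harmonize(site_df: pd.DataFrame, column_map: dict, value_maps: dict) -> pd.DataFrame:
    """Rename local elements to CDE names and recode their values."""
    df = site_df.rename(columns=column_map)
    for cde, mapping in value_maps.items():
        df[cde] = df[cde].map(mapping)
    return df[list(column_map.values())]  # keep only harmonized CDEs

site_a = pd.DataFrame({"tum_stage": ["T1", "T2"], "histo": ["1", "2"]})
print(harmonize(site_a, SITE_A_COLUMNS, SITE_A_VALUES))
```

In the described process, such a mapping table would be revised per site at each review round until all partners agree on the common definitions.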

Thorax ◽  
2017 ◽  
Vol 73 (4) ◽  
pp. 339-349 ◽  
Author(s):  
Margreet Lüchtenborg ◽  
Eva J A Morris ◽  
Daniela Tataru ◽  
Victoria H Coupland ◽  
Andrew Smith ◽  
...  

Introduction: The International Cancer Benchmarking Partnership (ICBP) identified significant international differences in lung cancer survival. Differing levels of comorbid disease across ICBP countries have been suggested as a potential explanation of this variation but, to date, no studies have quantified its impact. This study investigated whether comparable, robust comorbidity scores can be derived from the different routine population-based cancer data sets available in the ICBP jurisdictions and, if so, to use them to quantify international variation in comorbidity and determine its influence on outcome. Methods: Linked population-based lung cancer registry and hospital discharge data sets were acquired from nine ICBP jurisdictions in Australia, Canada, Norway and the UK, providing a study population of 233 981 individuals. For each person in this cohort, Charlson, Elixhauser and inpatient bed day comorbidity scores were derived for the 4–36 months prior to their lung cancer diagnosis. The scores were then compared to assess their validity and feasibility of use in international survival comparisons. Results: It was feasible to generate the three comorbidity scores for each jurisdiction, and these were found to have good content, face and concurrent validity. Predictive validity was limited and there was evidence that reliability was questionable. Conclusion: The results presented here indicate that interjurisdictional comparability of recorded comorbidity was limited, probably due to differences in coding and hospital admission practices in each area. Before the contribution of comorbidity to international differences in cancer survival can be investigated, an internationally harmonised comorbidity index is required.
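To illustrate the kind of derivation the study performs, here is a minimal Python sketch of computing a Charlson-style score from discharge codes in the 4–36 month pre-diagnosis window. The ICD-10 prefixes and weights shown are a small illustrative subset, not the full index or coding rules used in the paper.

```python
# Sketch: Charlson-style comorbidity score from hospital discharge codes
# recorded 4-36 months before cancer diagnosis. Illustrative subset only.
from datetime import date

CHARLSON = {  # condition: (ICD-10 prefixes, weight)
    "mi": (("I21", "I22"), 1),
    "chf": (("I50",), 1),
    "diabetes_compl": (("E102", "E112"), 2),
    "metastatic": (("C77", "C78", "C79"), 6),
}

def charlson_score(admissions, diagnosis_date: date) -> int:
    """Sum weights over conditions coded 4-36 months before diagnosis."""
    seen = set()
    for adm_date, icd_code in admissions:
        months_before = (diagnosis_date - adm_date).days / 30.44
        if not 4 <= months_before <= 36:
            continue  # outside the lookback window
        for cond, (prefixes, weight) in CHARLSON.items():
            if icd_code.startswith(prefixes):
                seen.add((cond, weight))  # count each condition once
    return sum(w for _, w in seen)

admissions = [(date(2013, 5, 1), "I500"), (date(2014, 9, 1), "C780")]
print(charlson_score(admissions, date(2015, 3, 1)))  # -> 7 (CHF + metastatic)
```

Differences in how jurisdictions code and admit patients directly change which codes appear in such a lookback window, which is the comparability problem the paper identifies.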


2019 ◽  
Vol 21 (3) ◽  
pp. 851-862 ◽  
Author(s):  
Charalampos Papachristou ◽  
Swati Biswas

Abstract Dissecting the genetic mechanism underlying a complex disease hinges on discovering gene–environment interactions (GXE). However, detecting GXE is a challenging problem, especially when the genetic variants under study are rare. Haplotype-based tests have several advantages over the so-called collapsing tests for detecting rare variants, as highlighted in recent literature. Thus, it is of practical interest to compare haplotype-based tests for detecting GXE, including recent ones developed specifically for rare haplotypes. We compare the following methods: haplo.glm, hapassoc, HapReg, Bayesian hierarchical generalized linear model (BhGLM) and logistic Bayesian LASSO (LBL). We simulate data under different types of association scenarios and levels of gene–environment dependence. We find that when the type I error rates are controlled to be the same for all methods, LBL is the most powerful method for detecting GXE. We applied the methods to a lung cancer data set, focusing on region 15q25.1, as the literature suggests that it interacts with smoking to affect lung cancer susceptibility and that it is associated with smoking behavior. LBL and BhGLM were able to detect a rare haplotype–smoking interaction in this region. We also analyzed the sequence data from the Dallas Heart Study, a population-based multi-ethnic study. Specifically, we considered haplotype blocks in the gene ANGPTL4 for association with serum triglyceride levels and used ethnicity as a covariate. Only LBL found interactions of haplotypes with race (Hispanic). Thus, in general, LBL seems to be the best method for detecting GXE among the ones we studied here. Nonetheless, it requires the most computation time.
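The common core of these methods is a regression of disease status on haplotype, environment, and their product term. A plain logistic-GLM sketch of that model in Python is given below; it is an illustration of the interaction term being tested, not an implementation of LBL or BhGLM, and the data and variable names are synthetic.

```python
# Sketch: logistic regression of disease on rare-haplotype dosage,
# an environment indicator, and their interaction (the GXE term).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "hap": rng.binomial(2, 0.02, n),    # rare-haplotype dosage (0/1/2)
    "smoke": rng.binomial(1, 0.4, n),   # environment indicator
})
# Simulate disease with a true haplotype-by-smoking interaction effect.
logit = -2 + 0.1 * df.hap + 0.3 * df.smoke + 1.2 * df.hap * df.smoke
df["case"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

fit = smf.logit("case ~ hap * smoke", data=df).fit(disp=0)
print(fit.summary().tables[1])  # the 'hap:smoke' row is the GXE term
```

The rare-haplotype methods compared in the paper differ mainly in how they stabilize this coefficient when the haplotype dosage column is almost entirely zeros, e.g. via Bayesian shrinkage priors in LBL and BhGLM.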


2020 ◽  
Vol 8 (6) ◽  
pp. 1623-1630

As huge amounts of data accumulate, methods are needed to extract the required information from what is available. Machine learning contributes to many fields, and the fast-growing population has brought with it a wide range of diseases; this in turn creates the need for machine learning models built on patients' datasets. Analyses of datasets from different sources show that cancer is among the most hazardous diseases and can cause the death of the affected person. Surveys indicate that cancer can often be cured when detected in its initial stages, while in later stages it is frequently fatal. Lung cancer is one of the major types of cancer, and its early detection depends heavily on past data. The proposed work uses a machine learning algorithm to group individual records into categories and predict, at an early stage, whether a person is likely to develop cancer. A random forest algorithm is implemented and achieves an accuracy of 97%, higher than KNN and Naive Bayes: the KNN algorithm does not build a model from the training data but uses it directly at classification time, and Naive Bayes yields less accurate predictions. The proposed system predicts the chance of lung cancer on three levels, namely low, medium, and high. Thus, mortality rates can be reduced significantly.
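A minimal Python sketch of the described setup, a random forest assigning patients to low/medium/high risk levels, follows. The synthetic features and labels are placeholders standing in for the paper's patient dataset.

```python
# Sketch: random forest classifying patients into three lung-cancer
# risk levels (0=low, 1=medium, 2=high). Data here is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))  # placeholder features (age, smoking, ...)
# Risk level driven by feature 1 plus noise, cut into three bands.
y = np.digitize(X[:, 1] + rng.normal(scale=0.5, size=1000), [-0.5, 0.5])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
print("risk level:", ["low", "medium", "high"][clf.predict(X_te[:1])[0]])
```

The contrast drawn in the abstract holds in general: a forest averages many decorrelated trees at training time, whereas KNN defers all work to prediction time by comparing each query against the stored training set.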


Author(s):  
Kamala Adhikari Dahal ◽  
Scott Patten ◽  
Tyler Williamson ◽  
Alka Patel ◽  
Shahirose Premji ◽  
...  

Introduction: Pooling data from cohort studies can be used to increase sample size. However, individual datasets may contain variables that measure the same construct differently, posing challenges to the usefulness of combined datasets. Variable harmonization (an effort that provides a comparable view of data from different studies) may address this issue. Objectives and Approach: This study harmonized existing datasets from two prospective pregnancy cohort studies in Alberta, Canada (All Our Families (n=3,351) and Alberta Pregnancy Outcome and Nutrition (n=2,187)). Given the comparability of the characteristics of the two cohorts and the similarities of the core data elements of interest, data harmonization was justifiable. Data harmonization was performed considering multiple factors, such as complete or partial variable matching regarding the question asked/responded to, the response coded (value level, value definition, data type), the frequency of measurement, the pregnancy time-period of measurement, and missing values. Multiple imputation was used to address missing data resulting from the data harmonization process. Results: Several variables, such as ethnicity, income, parity, gestational age, anxiety, and depression, were harmonized using different procedures. If the question asked/answered and the response recorded were the same in both datasets, no variable manipulation was done. If the response recorded differed, the response was re-categorized/re-organized to optimize comparability of data from both datasets. Missing values were created for each resulting unmatched variable and were replaced using multiple imputation if the same construct was measured in both datasets but in different ways/scales; a scale that was used in both datasets was identified as a reference standard. If variables were measured multiple times and/or in different time-periods, they were synchronized using pregnancy trimester data. Finally, the harmonized datasets were combined/pooled into a single dataset (n=5,588). Conclusion/Implications: Variable harmonization is an important aspect of conducting research using multiple datasets. It provides an opportunity to increase study power by maximizing sample size, permits more sophisticated statistical analyses, and allows novel research questions to be answered that could not be addressed using a single study.
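The recode-then-pool step described in the Results can be illustrated with a short pandas sketch: one variable measured with different response codes in the two cohorts is recoded to the coarser common definition, then the datasets are stacked. The category codes here are hypothetical, not the cohorts' actual codings.

```python
# Sketch: harmonize one variable to a common definition, then pool.
import pandas as pd

aof = pd.DataFrame({"parity": [0, 1, 3]})                   # recorded as counts
apron = pd.DataFrame({"parity_cat": ["none", "1+", "1+"]})  # recorded as categories

# Recode the count variable to the coarser common definition: none vs. one-or-more.
aof["parity_cat"] = pd.cut(aof.parity, [-1, 0, 99], labels=["none", "1+"])
aof = aof.drop(columns="parity")

# Pool the harmonized datasets, tagging each row with its source cohort.
pooled = pd.concat([aof.assign(cohort="AOF"),
                    apron.assign(cohort="APrON")], ignore_index=True)
print(pooled)
```

Variables present in only one cohort would instead be set to missing in the other and, where the same construct was measured on a different scale, filled by multiple imputation as the abstract describes.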


2012 ◽  
Vol 30 (34_suppl) ◽  
pp. 293-293
Author(s):  
Roy B. Jones ◽  
Charles Martinez ◽  
J. Douglas Rizzo ◽  
Dianne Reeves

Background: All U.S. transplant centers must report comprehensive SCT outcome data to a federal registry. The current electronic data capture method requires manual entry of 700+ unique data elements into an internet application, FormsNet (Center for International Blood and Marrow Transplant Research). Data are routinely copied from program databases or the EMR by manual transcription, an inefficient, inaccurate and expensive process. A method was needed to allow electronic transmission of outcome data directly from these databases to the mandated SCT Outcomes Database (SCTOD). Methods: We designed an interface engine (IE) to transmit structured data through a caGRID subnet (AGNIS) directly to the SCTOD from a proprietary MDACC SCT database using a secure, auditable method. To make this method applicable to other centers, we collaborated with the NCI and others to expand the Biomedical Research Informatics Domain Group (BRIDG) standard data model to support a full set of granular data elements (>1,900) required to describe SCT outcomes. The IE was modified to transmit SCT data from the expanded BRIDG database to the SCTOD. The IE and expanded BRIDG database model will be made available to all centers without charge. In this way, centers interfacing data to this new structure can transmit data to the SCTOD without transcription. Results: The BRIDG oversight committee has approved the extended model and made its structure and content publicly available. The IE has been used to transmit >4,000 data forms from MDACC to the SCTOD. The full set of SCT common data elements (CDE) has been published in the Cancer Data Standards Repository of the NCI. The American Society of Blood and Marrow Transplantation is publishing an RFA to identify 1-3 vendors qualified to interface data from center-specific systems to the BRIDG database. Conclusions: Comprehensive and direct electronic data transmission to the SCTOD is feasible and can be done without modifying individual centers' legacy applications. The plan will make appropriate tools available to all transplant centers. This paradigm should be applicable to other areas of oncology.
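The interface-engine pattern itself is simple to sketch: pull a record from the local database, project it onto the common data elements, and serialize it for transmission. The Python illustration below shows only that pattern; the table, CDE names, and payload format are invented and do not represent the AGNIS/FormsNet interface or the BRIDG model.

```python
# Hypothetical sketch of the interface-engine pattern: local record ->
# common data elements -> serialized payload for registry transmission.
import json
import sqlite3

CDE_MAP = {"dob": "patient_birth_date", "tx_date": "transplant_date"}

# Stand-in for a center's local SCT database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sct (dob TEXT, tx_date TEXT)")
conn.execute("INSERT INTO sct VALUES ('1960-01-02', '2011-06-15')")

row = conn.execute("SELECT dob, tx_date FROM sct").fetchone()
payload = {CDE_MAP[col]: val
           for col, val in zip(("dob", "tx_date"), row)}
print(json.dumps(payload))  # would be transmitted to the registry endpoint
```

The point of the BRIDG extension is that every center maps to the same target elements once, instead of each center transcribing 700+ fields by hand.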


Author(s):  
I.A. Borisova ◽  
O.A. Kutnenko

Outlier detection is one of the important problems in data mining of biomedical datasets, particularly when misclassified objects may be present, caused by diagnostic pitfalls at the data collection stage. The occurrence of such objects complicates and slows down dataset processing, distorts and corrupts detected regularities, and reduces their accuracy. We propose a censoring algorithm that detects misclassified objects, after which they are either removed from the dataset or their class attribute is corrected. The correction procedure keeps the volume of the analyzed dataset as large as possible. This quality is very useful for small dataset analysis, when every bit of information can be important. The basic concept in the presented work is a measure of the similarity of an object with its surroundings. To evaluate the local similarity of an object with its closest neighbors, a ternary relative measure called the function of rival similarity (FRiS-function) is used. The mean of the similarity values of all objects in the dataset gives a notion of a class's separability: how close objects from the same class are to each other and how far they are from objects of different classes (with different diagnoses) in the attribute space. Misclassified objects are supposed to be more similar to objects from rival classes than to their own class, so their elimination from the dataset, or the correction of their target attribute, should increase the data separability value. The procedure of filtering and correcting misclassified objects is based on observing changes in the data separability estimate calculated before and after making corrections to the dataset. The censoring process continues until the inflection point of the separability function is reached. The proposed algorithm was tested on a wide range of model tasks of different complexity. It was also tested on biomedical tasks such as the Pima Indians Diabetes, Breast Cancer, and Parkinson data sets. On these tasks the censoring algorithm showed high sensitivity to misclassified objects. The increase in accuracy and the preservation of data set volume after the censoring procedure confirmed our basic assumptions and the effectiveness of the algorithm.
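A simplified sketch of the similarity measure follows: for each object, compare the distance to its nearest same-class neighbor with the distance to its nearest rival-class neighbor; objects with strongly negative values look misclassified. This is only an illustration of a FRiS-style rival-similarity score, not the paper's full censoring algorithm with its separability-based stopping rule.

```python
# Sketch: FRiS-style rival similarity F = (r_rival - r_own) / (r_rival + r_own),
# where r_own / r_rival are distances to the nearest same-class / rival-class
# neighbors. F near +1: deep inside its class; F < 0: looks misclassified.
import numpy as np

def fris_values(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # exclude self-distance
    fris = np.empty(len(X))
    for i in range(len(X)):
        r_own = d[i, y == y[i]].min()
        r_rival = d[i, y != y[i]].min()
        fris[i] = (r_rival - r_own) / (r_rival + r_own)
    return fris

X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [0.15]])
y = np.array([0, 0, 0, 1, 1, 1])       # last object looks mislabeled
print(fris_values(X, y).round(2))      # its F value is strongly negative
```

In the censoring procedure, objects flagged this way would be relabeled or removed, and the mean F over the dataset (the separability estimate) monitored before and after each correction.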


2018 ◽  
Author(s):  
Paul F. Harrison ◽  
Andrew D. Pattison ◽  
David R. Powell ◽  
Traude H. Beilharz

Abstract Background: A differential gene expression analysis may produce a set of significantly differentially expressed genes that is too large to investigate easily, so a means of ranking genes by their level of biological interest is desirable. The life sciences have grappled with the abuse of p-values to rank genes for this purpose. As an alternative, a lower confidence bound on the magnitude of Log Fold Change (LFC) could be used to rank genes, but it has been unclear how to reconcile this with the need to perform False Discovery Rate (FDR) correction. The TREAT test of McCarthy and Smyth is a step in this direction, finding genes significantly exceeding a specified LFC threshold. Here we describe the use of test inversion on TREAT to present genes ranked by a confidence bound on the LFC, while still controlling FDR. Results: Testing the Topconfects R package with simulated gene expression data shows the method outperforming current statistical approaches across a wide range of experiment sizes in the identification of genes with the largest LFCs. Applying the method to a TCGA breast cancer data set shows that it ranks some genes with large LFC higher than traditional ranking by p-value would. Importantly, these two ranking methods lead to a different biological emphasis, in terms both of specific highly ranked genes and of gene-set enrichment. Conclusions: The choice of ranking method in differential expression analysis can affect the biological interpretation. The common default of ranking by p-value implicitly ranks by an effect size in which each gene is standardized to its own variability, rather than comparing genes on a common scale, which may not be appropriate. The Topconfects approach of presenting genes ranked by confident LFC effect size is a variation on the TREAT method with improved usability, removing the need to fine-tune a threshold parameter and removing the temptation to abuse p-values as a de facto effect size.
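The contrast between the two rankings can be shown with a much-simplified Python sketch: order genes by a lower confidence bound on |LFC| rather than by p-value. Topconfects is an R package and additionally inverts the moderated TREAT test under FDR control; here a plain normal-theory bound stands in, and the LFC/SE values are invented.

```python
# Sketch: rank genes by a confident lower bound on |log fold change|
# instead of by p-value. Simplified stand-in for test inversion on TREAT.
import numpy as np
from scipy import stats

lfc = np.array([3.0, 0.3, 1.5, 2.0])   # estimated log2 fold changes
se = np.array([1.5, 0.05, 0.2, 0.8])   # their standard errors

z = stats.norm.ppf(0.975)               # normal-theory 95% bound
lower = np.maximum(np.abs(lfc) - z * se, 0)  # confident |LFC| at least this

for i in np.argsort(-lower):            # descending by confident effect size
    print(f"gene{i}: LFC={lfc[i]:+.2f}, confident |LFC| >= {lower[i]:.2f}")
```

Note how gene1 (tiny LFC, tiny SE) would top a p-value ranking (largest |LFC|/SE) but falls near the bottom here, which is exactly the change of biological emphasis the abstract describes.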


2014 ◽  
Vol 22 (1) ◽  
pp. 76-85 ◽  
Author(s):  
Rémy Choquet ◽  
Meriem Maaroufi ◽  
Albane de Carrara ◽  
Claude Messiaen ◽  
Emmanuel Luigi ◽  
...  

Abstract Background Although rare disease patients make up approximately 6–8% of all patients in Europe, it is often difficult to find the necessary expertise for diagnosis and care and the patient numbers needed for rare disease research. The second French National Plan for Rare Diseases highlighted the necessity for better care coordination and epidemiology for rare diseases. A clinical data standard for normalization and exchange of rare disease patient data was proposed. The original methodology used to build the French national minimum data set (F-MDS-RD) common to the 131 expert rare disease centers is presented. Methods To encourage consensus at a national level for homogeneous data collection at the point of care for rare disease patients, we first identified four national expert groups. We reviewed the scientific literature for rare disease common data elements (CDEs) in order to build the first version of the F-MDS-RD. The French rare disease expert centers validated the data elements (DEs). The resulting F-MDS-RD was reviewed and approved by the National Plan Strategic Committee. It was then represented in an HL7 electronic format to maximize interoperability with electronic health records. Results The F-MDS-RD is composed of 58 DEs in six categories: patient, family history, encounter, condition, medication, and questionnaire. It is HL7 compatible and can use various ontologies for diagnosis or sign encoding. The F-MDS-RD was aligned with other CDE initiatives for rare diseases, thus facilitating potential interconnections between rare disease registries. Conclusions The French F-MDS-RD was defined through national consensus. It can foster better care coordination and facilitate determining rare disease patients’ eligibility for research studies, trials, or cohorts. Since other countries will need to develop their own standards for rare disease data collection, they might benefit from the methods presented here.
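A minimal sketch of how a record organized in the six F-MDS-RD categories could be represented is shown below. The individual fields are illustrative placeholders, not the actual 58 data elements, and the structure is an assumption rather than the HL7 representation the paper uses.

```python
# Sketch: a minimum-data-set record grouped into the six F-MDS-RD
# categories. Field names inside each category are placeholders.
from dataclasses import dataclass, field

@dataclass
class RareDiseaseMDS:
    patient: dict = field(default_factory=dict)         # e.g. sex, birth year
    family_history: dict = field(default_factory=dict)
    encounter: dict = field(default_factory=dict)
    condition: dict = field(default_factory=dict)        # ontology-coded diagnosis
    medication: dict = field(default_factory=dict)
    questionnaire: dict = field(default_factory=dict)

record = RareDiseaseMDS(
    patient={"sex": "F", "birth_year": 1998},
    condition={"code_system": "Orphanet", "code": "ORPHA:558"},
)
print(record.condition)
```

Keeping the diagnosis as a (code system, code) pair is what lets the data set "use various ontologies for diagnosis or sign encoding", as the abstract notes.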


2021 ◽  
Vol 16 (1) ◽  
Author(s):  
Thomas Opladen ◽  
Florian Gleich ◽  
Viktor Kozich ◽  
Maurizio Scarpa ◽  
Diego Martinelli ◽  
...  

Abstract Background Following the broad application of new analytical methods, more and more pathophysiological processes in previously unknown diseases have been elucidated. The spectrum of clinical presentation of rare inherited metabolic diseases (IMDs) is broad and ranges from single organ involvement to multisystemic diseases. With the aim of overcoming the limited knowledge about the natural course and about current diagnostic and therapeutic approaches, the project has established the first unified patient registry for IMDs that fully meets the requirements of the European Infrastructure for Rare Diseases (ERDRI). Results In collaboration with the European Reference Network for Rare Hereditary Metabolic Disorders (MetabERN), the Unified European registry for Inherited Metabolic Diseases (U-IMD) was established to collect patient data as an observational, non-interventional natural history study. Following the recommendations of the ERDRI, the U-IMD registry uses common data elements to define the IMDs, report the clinical phenotype, describe the biochemical markers, and capture the drug treatment. To date, more than 1100 IMD patients have been registered. Conclusion The U-IMD registry is the first observational, non-interventional patient registry that encompasses all known IMDs. Full semantic interoperability with other registries has been achieved, as demonstrated by the use of a minimum common core data set for equivalent description of metabolic patients in U-IMD and in the patient registry of the European Rare Kidney Disease Reference Network (ERKNet). In conclusion, the U-IMD registry will contribute to a better understanding of the long-term course of IMDs and to improved patient care, by elucidating the natural disease course and enabling optimization of diagnostic and therapeutic strategies.
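The interoperability point can be sketched briefly: two registries can exchange patient descriptions when both can project their records onto the same minimum common core data set. The Python illustration below shows that projection only; the field names and mapping are invented, not the U-IMD or ERKNet schemas.

```python
# Hypothetical sketch: project a registry-specific record onto a shared
# minimum common core data set so another registry can interpret it.
def to_common_core(record: dict, mapping: dict) -> dict:
    """Keep only the shared core elements, renamed to the common names."""
    return {core: record[local] for core, local in mapping.items()
            if local in record}

uimd_record = {"dx_code": "E72.2", "dx_onset": "2018-04", "treatment": "diet"}
UIMD_TO_CORE = {"diagnosis_code": "dx_code", "onset_date": "dx_onset"}

print(to_common_core(uimd_record, UIMD_TO_CORE))
# -> {'diagnosis_code': 'E72.2', 'onset_date': '2018-04'}
```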

