Validation of ethnicity in administrative hospital data in women giving birth in England: cohort study

BMJ Open ◽  
2021 ◽  
Vol 11 (8) ◽  
pp. e051977
Author(s):  
Jennifer Elizabeth Jardine ◽  
Alissa Frémeaux ◽  
Megan Coe ◽  
Ipek Gurol Urganci ◽  
Dharmintra Pasupathy ◽  
...  

Objective: To describe the accuracy of coding of ethnicity in National Health Service (NHS) administrative hospital records compared with self-declared records in maternity booking systems, and to assess the potential impact of misclassification bias. Design: Secondary analysis of data from records of women giving birth in England (2015–2017). Setting: NHS Trusts in England participating in a national audit programme. Participants: 1 237 213 women who gave birth between 1 April 2015 and 31 March 2017. Primary and secondary outcome measures: (1) Proportion of women with complete ethnicity; (2) agreement on coded ethnicity between maternity (maternity information systems (MIS)) and administrative hospital (Hospital Episode Statistics (HES)) records; (3) rates of caesarean section and obstetric anal sphincter injury by ethnic group in MIS and HES. Results: 91.3% of women had complete information regarding ethnicity in HES. Overall agreement between data sets was 90.4% (κ=0.83); 94.4% when collapsed into aggregate groups of white/South Asian/black/mixed/other (κ=0.86). Most disagreement was seen in women coded as mixed in either data set. Rates of obstetric events and complications by ethnicity were similar regardless of the data set used, with the largest differences seen in women coded as mixed. Conclusions: Levels of accuracy in ethnicity coding in administrative hospital records support the use of ethnicity collapsed into aggregate groups (white/South Asian/black/mixed/other), but findings for mixed and other groups, and for more granular classifications, should be treated with caution. The robustness of analyses of associations with ethnicity can be improved by using additional primary data sources.
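The agreement statistics reported above can, in principle, be reproduced by linking the two sources on a common identifier and comparing the coded ethnicity woman by woman. The sketch below is a minimal, hypothetical illustration of that comparison in Python; the column names, category labels and toy records are invented, not taken from the MIS or HES data dictionaries.

```python
import pandas as pd
from sklearn.metrics import cohen_kappa_score

# Toy linked records: one row per woman, with the aggregate ethnic group
# recorded in the maternity system (MIS) and in Hospital Episode Statistics (HES).
records = pd.DataFrame({
    "ethnicity_mis": ["white", "south_asian", "black", "mixed", "white", "other"],
    "ethnicity_hes": ["white", "south_asian", "black", "white", "white", "other"],
})

# Simple percentage agreement between the two codings.
agreement = (records["ethnicity_mis"] == records["ethnicity_hes"]).mean()

# Cohen's kappa adjusts the agreement for what would be expected by chance.
kappa = cohen_kappa_score(records["ethnicity_mis"], records["ethnicity_hes"])

print(f"agreement = {agreement:.1%}, kappa = {kappa:.2f}")
```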

Thorax ◽  
2017 ◽  
Vol 73 (4) ◽  
pp. 339-349 ◽  
Author(s):  
Margreet Lüchtenborg ◽  
Eva J A Morris ◽  
Daniela Tataru ◽  
Victoria H Coupland ◽  
Andrew Smith ◽  
...  

Introduction: The International Cancer Benchmarking Partnership (ICBP) identified significant international differences in lung cancer survival. Differing levels of comorbid disease across ICBP countries have been suggested as a potential explanation for this variation but, to date, no studies have quantified its impact. This study investigated whether comparable, robust comorbidity scores can be derived from the different routine population-based cancer data sets available in the ICBP jurisdictions and, if so, to use them to quantify international variation in comorbidity and determine its influence on outcome. Methods: Linked population-based lung cancer registry and hospital discharge data sets were acquired from nine ICBP jurisdictions in Australia, Canada, Norway and the UK, providing a study population of 233 981 individuals. For each person in this cohort, Charlson, Elixhauser and inpatient bed day comorbidity scores were derived for the 4–36 months prior to their lung cancer diagnosis. The scores were then compared to assess their validity and the feasibility of their use in international survival comparisons. Results: It was feasible to generate the three comorbidity scores for each jurisdiction, and these were found to have good content, face and concurrent validity. Predictive validity was limited and there was evidence that reliability was questionable. Conclusion: The results presented here indicate that interjurisdictional comparability of recorded comorbidity was limited, probably because of differences in coding and hospital admission practices in each area. Before the contribution of comorbidity to international differences in cancer survival can be investigated, an internationally harmonised comorbidity index is required.
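Charlson-type scores such as those derived here are typically built by scanning the diagnosis codes recorded during a look-back window and summing weights over the distinct conditions found. The Python sketch below illustrates that general idea only; the condition groups and weights are a small invented subset and do not reproduce the coding algorithms actually used in the study.

```python
# Hypothetical mapping from ICD-10 code prefixes to Charlson-style conditions and weights.
CHARLSON_WEIGHTS = {
    "I21": ("myocardial infarction", 1),
    "I50": ("congestive heart failure", 1),
    "J44": ("chronic pulmonary disease", 1),
    "E11": ("diabetes without complications", 1),
    "C34": ("malignancy", 2),
}

def charlson_score(icd10_codes):
    """Sum weights over the distinct conditions found in a patient's discharge codes."""
    found = {}
    for code in icd10_codes:
        for prefix, (condition, weight) in CHARLSON_WEIGHTS.items():
            if code.startswith(prefix):
                found[condition] = weight
    return sum(found.values())

# A patient with heart failure and lung cancer codes in the 4-36 month look-back window.
print(charlson_score(["I50.0", "C34.1", "I50.9"]))  # -> 3
```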


2019 ◽  
Author(s):  
Pavlin G. Poličar ◽  
Martin Stražar ◽  
Blaž Zupan

Dimensionality reduction techniques, such as t-SNE, can construct informative visualizations of high-dimensional data. When working with multiple data sets, a straightforward application of these methods often fails; instead of revealing underlying classes, the resulting visualizations expose data set-specific clusters. To circumvent these batch effects, we propose an embedding procedure that takes a t-SNE visualization constructed on a reference data set and uses it as a scaffold for embedding new data. The new, secondary data are embedded one data point at a time. This prevents any interactions between instances in the secondary data and implicitly mitigates batch effects. We demonstrate the utility of this approach with an analysis of six recently published single-cell gene expression data sets containing up to tens of thousands of cells and thousands of genes. In these data sets, the batch effects are particularly strong because the data come from different institutions and were obtained using different experimental protocols. The visualizations constructed by our proposed approach are cleared of batch effects, and the cells from secondary data sets correctly co-cluster with cells from the primary data sharing the same cell type.
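A workflow of this kind can be sketched with the openTSNE library, which is maintained by the same group and exposes a fit-on-reference, transform-new-points interface. Whether the snippet below reproduces the paper's exact procedure and parameters is an assumption, and the random matrices stand in for real expression data.

```python
import numpy as np
from openTSNE import TSNE

rng = np.random.default_rng(0)
reference = rng.normal(size=(1000, 50))   # stands in for the primary expression matrix
secondary = rng.normal(size=(200, 50))    # new cells to place onto the scaffold

# Build the t-SNE embedding of the reference data once.
embedding = TSNE(perplexity=30, random_state=0).fit(reference)

# Embed the secondary cells into the fixed reference map; the reference
# coordinates do not move, so structure specific to the new batch cannot
# reshape the visualization.
secondary_coords = embedding.transform(secondary)
print(secondary_coords.shape)  # (200, 2)
```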


Author(s):  
Alan J. Silman ◽  
Gary J. Macfarlane ◽  
Tatiana Macfarlane

Although epidemiological studies are increasingly based on the analysis of existing data sets (including linked data sets), many studies still require primary data collection. Such data may come from patient questionnaires, interviews, abstraction from records, and/or the results of tests and measures such as weight or blood test results. The next stage is to analyse the data gathered from individual subjects to provide the answers required. Before commencing with the statistical analysis of any data set, the data themselves must be prepared in a format so that the detailed statistical analysis can achieve its goals. Items to be considered include the format the data are initially collected in and how they are transferred to an appropriate electronic form. This chapter explores how errors are minimized and the quality of the data set ensured. These tasks are not trivial and need to be planned as part of a detailed study methodology.
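As a small, hypothetical illustration of the kinds of checks discussed in this chapter, the Python sketch below runs simple duplicate, range and missingness checks on a toy data set after transfer to electronic form; the field names and plausibility limits are invented for the example.

```python
import pandas as pd

# Toy data set as it might look after data entry.
subjects = pd.DataFrame({
    "subject_id": [1, 2, 2, 3],
    "age_years": [34, 221, 45, 29],         # 221 is an implausible data-entry error
    "weight_kg": [70.2, 64.5, 64.5, None],  # one missing measurement
})

# Duplicate identifiers often indicate double entry or mislabelled forms.
duplicates = subjects[subjects["subject_id"].duplicated(keep=False)]

# Range checks flag values outside plausible limits for review against the source records.
implausible_age = subjects[~subjects["age_years"].between(0, 120)]

# Missing values are listed explicitly rather than silently dropped.
missing = subjects[subjects.isna().any(axis=1)]

print(duplicates, implausible_age, missing, sep="\n\n")
```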


2020 ◽  
Vol 36 (1) ◽  
pp. 89-115 ◽  
Author(s):  
Harvey Goldstein ◽  
Natalie Shlomo

The requirement to anonymise data sets that are to be released for secondary analysis should be balanced by the need to allow their analysis to provide efficient and consistent parameter estimates. The proposal in this article is to integrate the process of anonymisation and data analysis. The first stage uses the addition of random noise with known distributional properties to some or all variables in a released (already pseudonymised) data set, in which the values of some identifying and sensitive variables for data subjects of interest are also available to an external ‘attacker’ who wishes to identify those data subjects in order to interrogate their records in the data set. The second stage of the analysis consists of specifying the model of interest so that parameter estimation accounts for the added noise. Where the characteristics of the noise are made available to the analyst by the data provider, we propose a new method that allows a valid analysis. This is formally a measurement error model and we describe a Bayesian MCMC algorithm that recovers consistent estimates of the true model parameters. A new method for handling categorical data is presented. The article shows how an appropriate noise distribution can be determined.
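A much-simplified numerical sketch of the two-stage idea follows: the provider adds Gaussian noise of known variance to a released variable, and the analyst, told that variance, undoes the attenuation it induces in a regression slope. This is a basic method-of-moments correction for a single predictor, shown only to illustrate why knowing the noise distribution matters; the article itself develops a Bayesian MCMC measurement error model, which is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
x_true = rng.normal(0.0, 1.0, n)
y = 2.0 * x_true + rng.normal(0.0, 1.0, n)   # true slope is 2.0

sigma_noise = 0.5                             # noise SD, known to provider and analyst
x_released = x_true + rng.normal(0.0, sigma_noise, n)

# Naive slope estimated from the noisy release is attenuated towards zero.
naive_slope = np.cov(x_released, y, ddof=0)[0, 1] / np.var(x_released)

# The known noise variance gives the reliability ratio needed to undo the attenuation.
reliability = (np.var(x_released) - sigma_noise**2) / np.var(x_released)
corrected_slope = naive_slope / reliability

print(f"naive = {naive_slope:.2f}, corrected = {corrected_slope:.2f}")
```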


2017 ◽  
Vol 1 (4) ◽  
pp. 115-116
Author(s):  
Masoud Sotoudehfar ◽  
Zahra Mazloum Khorasani ◽  
Zahra Ebnehoseini ◽  
Kobra Etminani ◽  
Mahmoud Tara ◽  
...  

Introduction: The number of people with diabetes is increasing. More than 220 million people have diabetes, and more than 70% of them live in middle- and lower-income countries. Many innovations around the world aim to improve the managed care of diabetes; diabetes registries are one of them. In Iran, the development and evaluation of diabetes information systems is among the leading research priorities, since defining health regulations and evaluating diabetes prevention programmes depend on a robust information system, yet complete information about the incidence and prevalence of diabetes in Iran does not exist. Determining standard data elements (DEs) and designing a diabetes registry are therefore important national requirements, and investigating this subject is the main purpose of this study. Methods: This is a descriptive-analytic study. Resources related to diabetes DEs were collected from selected minimum data sets. The diabetes DEs derived from these minimum data sets were then examined in focus group sessions with specialists in endocrinology, health informatics and health information management. Duplicate DEs were removed and similar DEs were combined. Seven endocrine specialists then evaluated the diabetes DEs set, rating the value of each DE using the Delphi technique (scores ranging from 0 to 5). DEs for which more than 75% of the scores were 4 or 5 remained in the study. Following the expert opinion, the final version of the diabetes DEs set was designed. Results: The literature review yielded 455 DEs for inclusion in the study; after the Delphi sessions, 293 data elements remained. The main categories of DEs are: 1) patient demographic characteristics (12 DEs); 2) patient referral (5 DEs); 3) diabetes care follow-up (15 DEs); 4) physical exam, chief complaint and assessment (40 DEs); 5) history (such as individual, developmental, family and drug abuse history) (10 DEs); 6) pregnancy management (13 DEs); 7) screening (10 DEs); 8) specialty evaluations, including cardiovascular (18 DEs), neuropathy (16 DEs), nephropathy (7 DEs), teeth and mouth (3 DEs), eyes (14 DEs), psychological status (2 DEs) and sexual function (1 DE); 9) laboratory exams (33 DEs); 10) drugs, including oral antidiabetic drugs (14 DEs), injectable antidiabetics (7 DEs), lipid-lowering drugs (11 DEs), antihypertensives (20 DEs), antiplatelets (2 DEs), cardiac drugs (3 DEs) and insulin preparation method (5 DEs); 11) physical activity (4 DEs); 12) diet (12 DEs); 13) education and self-care (13 DEs). Conclusion: In this study a diabetes DEs set was determined that provides an appropriate basis for gathering and recording all the information required for diabetes care. Because diabetes is a chronic disease that patients live with for years, implementing the diabetes DEs set can improve documentation and improve diabetes care.
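The Delphi filtering step described in the Methods reduces to a simple rule: keep a candidate DE only if more than 75% of the panel scored it 4 or 5. A minimal Python sketch of that rule is shown below; the candidate elements and scores are invented for illustration.

```python
# Hypothetical scores (0-5) from a seven-member expert panel for three candidate DEs.
candidate_scores = {
    "hba1c_result":   [5, 5, 4, 4, 5, 4, 5],
    "blood_pressure": [4, 5, 3, 4, 5, 5, 4],
    "shoe_size":      [1, 2, 0, 3, 2, 1, 2],
}

def retained(scores, threshold=0.75):
    """Keep a data element if the share of scores of 4 or 5 exceeds the threshold."""
    high = sum(1 for s in scores if s >= 4)
    return high / len(scores) > threshold

final_des = [name for name, scores in candidate_scores.items() if retained(scores)]
print(final_des)  # ['hba1c_result', 'blood_pressure']
```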


2018 ◽  
Vol 2 ◽  
pp. 31 ◽  
Author(s):  
Greg Finak ◽  
Bryan Mayer ◽  
William Fulp ◽  
Paul Obrecht ◽  
Alicia Sato ◽  
...  

A central tenet of reproducible research is that scientific results are published along with the underlying data and software code necessary to reproduce and verify the findings. A host of tools and software have been released that facilitate such work-flows and scientific journals have increasingly demanded that code and primary data be made available with publications. There has been little practical advice on implementing reproducible research work-flows for large ’omics’ or systems biology data sets used by teams of analysts working in collaboration. In such instances it is important to ensure all analysts use the same version of a data set for their analyses. Yet, instantiating relational databases and standard operating procedures can be unwieldy, with high "startup" costs and poor adherence to procedures when they deviate substantially from an analyst’s usual work-flow. Ideally a reproducible research work-flow should fit naturally into an individual’s existing work-flow, with minimal disruption. Here, we provide an overview of how we have leveraged popular open source tools, including Bioconductor, Rmarkdown, git version control, R, and specifically R’s package system combined with a new tool DataPackageR, to implement a lightweight reproducible research work-flow for preprocessing large data sets, suitable for sharing among small-to-medium sized teams of computational scientists. Our primary contribution is the DataPackageR tool, which decouples time-consuming data processing from data analysis while leaving a traceable record of how raw data is processed into analysis-ready data sets. The software ensures packaged data objects are properly documented and performs checksum verification of these along with basic package version management, and importantly, leaves a record of data processing code in the form of package vignettes. Our group has implemented this work-flow to manage, analyze and report on pre-clinical immunological trial data from multi-center, multi-assay studies for the past three years.
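DataPackageR itself is an R tool, so the sketch below is only a language-neutral illustration of the checksum idea the abstract mentions: record digests of the analysis-ready data files once after preprocessing, then verify them on every analysis run so all analysts can confirm they are working from the same version of the data. The file names and manifest format are invented for the example.

```python
import hashlib
import json
from pathlib import Path

def file_digest(path: Path) -> str:
    """Return the SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def write_manifest(data_dir: Path, manifest_path: Path) -> None:
    """Record a digest for every CSV produced by the preprocessing step."""
    manifest = {p.name: file_digest(p) for p in sorted(data_dir.glob("*.csv"))}
    manifest_path.write_text(json.dumps(manifest, indent=2))

def verify_manifest(data_dir: Path, manifest_path: Path) -> bool:
    """Compare recorded digests against the files currently on disk."""
    manifest = json.loads(manifest_path.read_text())
    ok = True
    for name, expected in manifest.items():
        if file_digest(data_dir / name) != expected:
            print(f"checksum mismatch: {name}")
            ok = False
    return ok
```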


2018 ◽  
Author(s):  
Greg Finak ◽  
Bryan T. Mayer ◽  
William Fulp ◽  
Paul Obrecht ◽  
Alicia Sato ◽  
...  

A central tenet of reproducible research is that scientific results are published along with the underlying data and software code necessary to reproduce and verify the findings. A host of tools and software have been released that facilitate such work-flows and scientific journals have increasingly demanded that code and primary data be made available with publications. There has been little practical advice on implementing reproducible research work-flows for large ‘omics’ or systems biology data sets used by teams of analysts working in collaboration. In such instances it is important to ensure all analysts use the same version of a data set for their analyses. Yet, instantiating relational databases and standard operating procedures can be unwieldy, with high “startup” costs and poor adherence to procedures when they deviate substantially from an analyst’s usual work-flow. Ideally a reproducible research work-flow should fit naturally into an individual’s existing work-flow, with minimal disruption. Here, we provide an overview of how we have leveraged popular open source tools, including Bioconductor, Rmarkdown, git version control, R, and specifically R’s package system combined with a new tool DataPackageR, to implement a lightweight reproducible research work-flow for preprocessing large data sets, suitable for sharing among small-to-medium sized teams of computational scientists. Our primary contribution is the DataPackageR tool, which decouples time-consuming data processing from data analysis while leaving a traceable record of how raw data is processed into analysis-ready data sets. The software ensures packaged data objects are properly documented and performs checksum verification of these along with basic package version management, and importantly, leaves a record of data processing code in the form of package vignettes. Our group has implemented this work-flow to manage, analyze and report on pre-clinical immunological trial data from multi-center, multi-assay studies for the past three years.


2001 ◽  
Vol 16 (7) ◽  
pp. 400-405 ◽  
Author(s):  
A.P. Boardman ◽  
D. Healy

Background: The lifetime risk of suicide in affective disorders is commonly quoted as 15%. This figure stems from hospital populations of affective disorders. Aims: To model the lifetime prevalence of suicide using data on completed suicides from one English Health District and community-based prevalence rates of affective disorders. Methods: A secondary analysis of a primary data set based on 212 suicides in North Staffordshire was undertaken. Population rates of psychiatric morbidity were obtained from the National Comorbidity Survey. Results: The model suggests a lifetime prevalence rate of suicide for any affective disorder of 2.4%, with a rate for cases uncomplicated by substance abuse, personality disorder or non-affective psychosis of 2.4%, and a rate for uncomplicated cases who had no mental health service contact of 1.1%. Conclusions: Lifetime prevalence rates of suicide in subgroups of affective disorders may be lower than the traditional rates cited for hospital depression. This has implications for primary care projects designed to investigate the occurrence and prevention of suicide.


2021 ◽  
Author(s):  
Pavlin G. Poličar ◽  
Martin Stražar ◽  
Blaž Zupan

Dimensionality reduction techniques, such as t-SNE, can construct informative visualizations of high-dimensional data. When jointly visualising multiple data sets, a straightforward application of these methods often fails; instead of revealing underlying classes, the resulting visualizations expose dataset-specific clusters. To circumvent these batch effects, we propose an embedding procedure that uses a t-SNE visualization constructed on a reference data set as a scaffold for embedding new data points. Each data instance from a new, unseen, secondary data set is embedded independently and does not change the reference embedding. This prevents any interactions between instances in the secondary data and implicitly mitigates batch effects. We demonstrate the utility of this approach by analyzing six recently published single-cell gene expression data sets with up to tens of thousands of cells and thousands of genes. The batch effects in our studies are particularly strong because the data come from different institutions using different experimental protocols. The visualizations constructed by our proposed approach are clear of batch effects, and the cells from secondary data sets correctly co-cluster with cells of the same type from the primary data. We also show that the predictive power of our simple, visual classification approach in t-SNE space matches the accuracy of specialized machine learning techniques that consider the entire compendium of features that profile single cells.

