The Problem of Correction Diagnostic Errors in the Target Attribute With the Function of Rival Similarity

Author(s):  
I.A. Borisova ◽  
O.A. Kutnenko

Outlier detection is one of the important problems in data mining of biomedical datasets, particularly when misclassified objects may be present, caused by diagnostic pitfalls at the data-collection stage. Such objects complicate and slow down dataset processing, distort and corrupt the detected regularities, and reduce their accuracy. We propose a censoring algorithm that detects misclassified objects, which are then either removed from the dataset or have their class attribute corrected. The correction procedure keeps the volume of the analyzed dataset as large as possible, a property that is especially useful when analyzing small datasets, where every bit of information can matter. The basic concept in the presented work is a measure of the similarity of an object to its surroundings. To evaluate the local similarity of an object to its closest neighbors, a ternary relative measure called the function of rival similarity (FRiS-function) is used. The mean of the similarity values over all objects in the dataset gives a notion of class separability: how close objects from the same class are to each other and how far they are from objects of different classes (with different diagnoses) in the attribute space. Misclassified objects are assumed to be more similar to objects from rival classes than to their own class, so their elimination from the dataset, or correction of their target attribute, should increase the data-separability value. The filtering-correcting procedure for misclassified objects is based on observing changes in the data-separability estimate calculated before and after corrections are made to the dataset. The censoring process continues until the inflection point of the separability function is reached. The proposed algorithm was tested on a wide range of model tasks of varying complexity, as well as on biomedical tasks such as the Pima Indians Diabetes, Breast Cancer, and Parkinson data sets. On these tasks the censoring algorithm showed high sensitivity to misclassification. The increase in accuracy and the preservation of dataset volume after the censoring procedure support our basic assumptions and the effectiveness of the algorithm.
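The FRiS-function itself is commonly defined as F = (r2 − r1)/(r2 + r1), where r1 is the distance from an object to its nearest same-class neighbor and r2 the distance to its nearest rival-class neighbor. The following is a minimal Python sketch of a FRiS-based separability score in that spirit, not the authors' code; it assumes Euclidean distance, leave-one-out neighbors, and at least two objects per class:

```python
import numpy as np

def fris_separability(X, y):
    """Mean FRiS value over a labeled dataset (illustrative sketch only).

    For each object, r1 is the distance to its nearest same-class neighbor
    and r2 the distance to its nearest rival-class neighbor; the FRiS value
    (r2 - r1) / (r2 + r1) ranges from -1 (looks like a rival) to +1.
    Assumes every class contains at least two objects.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)          # exclude the object itself
    scores = []
    for i in range(len(X)):
        r1 = D[i, y == y[i]].min()       # nearest own-class neighbor
        r2 = D[i, y != y[i]].min()       # nearest rival-class neighbor
        scores.append((r2 - r1) / (r2 + r1))
    return float(np.mean(scores))        # higher = better class separability
```

An object with a negative FRiS value is closer to a rival class than to its own, which is exactly the kind of censoring candidate the abstract describes.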

2017 ◽  
Vol 78 (6) ◽  
pp. 1021-1055 ◽  
Author(s):  
Wendy Johnson ◽  
Ian J. Deary ◽  
Thomas J. Bouchard

Most study samples show less variability in key variables than their source populations, most often due to indirect selection into study participation associated with a wide range of personal and circumstantial characteristics. Formulas exist to correct the resulting distortions in population-level correlations. Their accuracy has been tested using simulated, normally distributed data, but empirical data are rarely available for testing. We did so in a rare data set in which it was possible: the 6-Day Sample, a representative subsample of 1,208 from the Scottish Mental Survey 1947 of cognitive ability in 1936-born Scottish schoolchildren (70,805). 6-Day Sample participants completed a follow-up assessment in childhood and were re-recruited for study at age 77 years. We compared full 6-Day Sample correlations of early-life variables with the range-restricted correlations in the later-participating subsample, before and after adjustment for direct and indirect range restriction. Results differed, especially for two highly correlated cognitive tests; neither adjustment reproduced the full-sample correlations well, due to small deviations from normality in skew and kurtosis. Maximum likelihood estimates did little better. To assess these results' typicality, we simulated sample selection and made similar comparisons using the 42 cognitive ability tests administered in the Minnesota Study of Twins Reared Apart, with very similar results. We discuss problems in developing further adjustments to offset range-restriction distortions and possible approaches to solutions.
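For context (not taken from the paper itself), the classic adjustment for direct range restriction is Thorndike's Case II formula: given the restricted correlation r and the ratio u = S/s of unrestricted to restricted standard deviations of the selection variable, the corrected correlation is r·u / sqrt(1 + r²(u² − 1)). A minimal sketch:

```python
import math

def thorndike_case2(r_restricted, sd_unrestricted, sd_restricted):
    """Correct a correlation for direct range restriction (Thorndike Case II).

    Assumes selection occurred directly on one of the two variables and that
    the underlying bivariate distribution is normal -- the very assumption
    the paper shows can fail with real, skewed/kurtotic data.
    """
    u = sd_unrestricted / sd_restricted
    r = r_restricted
    return r * u / math.sqrt(1.0 + r * r * (u * u - 1.0))
```

For example, thorndike_case2(0.35, 15.0, 10.0) returns about 0.49, illustrating how much restriction can attenuate an observed correlation.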


Author(s):  
I.A. Borisova ◽  
O.A. Kutnenko

The paper proposes a new approach to data censoring that allows diagnostic errors to be corrected in data sets described in high-dimensional feature spaces. Treating this case as a separate task is justified by the fact that in high-dimensional spaces most methods of outlier detection and data filtering, both statistical and metric, stop working. At the same time, for medical diagnostics tasks, given the complexity of the objects and phenomena studied, a large number of descriptive characteristics is the norm rather than the exception. To solve this problem, an approach is proposed that focuses on the local similarity between objects belonging to the same class and uses the function of rival similarity (FRiS-function) as the similarity measure. In this approach, to clean the data of misclassified objects efficiently, the most informative and relevant low-dimensional feature subspace is selected, in which the separability of the classes after correction is maximal. Class separability here means the similarity of objects of one class to each other and their dissimilarity to objects of other classes. Cleaning the data of class errors can consist both of correcting class labels and of removing outlier objects from the data set. The described method was implemented as the FRiS-LCFS algorithm (FRiS Local Censoring with Feature Selection) and tested on model and real biomedical problems, including the problem of diagnosing prostate cancer based on DNA microarray analysis. The developed algorithm proved competitive with standard methods for filtering data in high-dimensional spaces.
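The paper does not publish FRiS-LCFS source code; the following is a toy sketch of the greedy forward-selection idea it describes, searching for a low-dimensional subspace that maximizes a separability score such as the FRiS-based one sketched earlier (passed in here as a callback, so the block stands alone):

```python
import numpy as np

def greedy_subspace(X, y, separability, max_dim=3):
    """Greedy forward selection of a low-dimensional feature subspace (sketch).

    `separability` is any function (X_sub, y) -> float, e.g. a FRiS-based
    mean-similarity score; the actual FRiS-LCFS algorithm may differ in detail.
    """
    remaining = list(range(X.shape[1]))
    chosen, best_score = [], -np.inf
    while remaining and len(chosen) < max_dim:
        # Score every candidate feature added to the current subspace.
        score, j = max((separability(X[:, chosen + [j]], y), j)
                       for j in remaining)
        if score <= best_score:          # no improvement: stop growing
            break
        best_score = score
        chosen.append(j)
        remaining.remove(j)
    return chosen, best_score
```

Censoring would then be run inside the selected subspace, where the local-similarity measure remains meaningful even when the full description is very high-dimensional.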


2015 ◽  
Vol 54 (05) ◽  
pp. 455-460 ◽  
Author(s):  
M. Ganzinger ◽  
T. Muley ◽  
M. Thomas ◽  
P. Knaup ◽  
D. Firnkorn

Objective: Joint data analysis is a key requirement in medical research networks. Data are available in heterogeneous formats at each network partner, and their harmonization is often rather complex. The objective of our paper is to provide a generic approach to the harmonization process in research networks. We applied the process when harmonizing data from three sites for the Lung Cancer Phenotype Database within the German Center for Lung Research. Methods: We developed a spreadsheet-based solution as a tool to support the harmonization process for lung cancer data, and a data integration procedure based on Talend Open Studio. Results: The harmonization process consists of eight steps describing a systematic approach for defining and reviewing source data elements and standardizing common data elements. The steps for defining common data elements and harmonizing them with local data definitions are repeated until consensus is reached. Application of this process in building the phenotype database led to a common basic data set on lung cancer with 285 structured parameters. The Lung Cancer Phenotype Database was realized as an i2b2 research data warehouse. Conclusion: Data harmonization is a challenging task requiring informatics skills as well as domain knowledge. Our approach facilitates data harmonization by providing guidance through a uniform process that can be applied in a wide range of projects.
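To make the harmonization step concrete, here is a small, entirely hypothetical sketch of mapping site-local codes onto one agreed common data element; the element name, value set and site codes below are invented for illustration and do not come from the paper:

```python
from typing import Any, Dict

# Hypothetical common data element (CDE) agreed on by the network partners.
COMMON_ELEMENT = {
    "name": "smoking_status",
    "allowed": {"never", "former", "current", "unknown"},
}

# Hypothetical site-local codings mapped onto the CDE's value set.
SITE_MAPPINGS: Dict[str, Dict[str, str]] = {
    "site_a": {"0": "never", "1": "former", "2": "current"},
    "site_b": {"NIE": "never", "FRUEHER": "former", "AKTIV": "current"},
}

def harmonize(site: str, raw_value: Any) -> str:
    """Translate a site-local code into the common data element's value set."""
    value = SITE_MAPPINGS[site].get(str(raw_value), "unknown")
    assert value in COMMON_ELEMENT["allowed"]
    return value
```

In the paper's process, such mappings live in the shared spreadsheet and are iterated until all sites reach consensus; the Talend Open Studio jobs then apply them during data integration.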


2018 ◽  
Author(s):  
Paul F. Harrison ◽  
Andrew D. Pattison ◽  
David R. Powell ◽  
Traude H. Beilharz

Background: A differential gene expression analysis may produce a set of significantly differentially expressed genes too large to investigate easily, so a means of ranking genes by their level of biological interest is desirable. The life sciences have grappled with the abuse of p-values to rank genes for this purpose. As an alternative, a lower confidence bound on the magnitude of the log fold change (LFC) could be used to rank genes, but it has been unclear how to reconcile this with the need to perform False Discovery Rate (FDR) correction. The TREAT test of McCarthy and Smyth is a step in this direction, finding genes that significantly exceed a specified LFC threshold. Here we describe the use of test inversion on TREAT to present genes ranked by a confidence bound on the LFC, while still controlling FDR. Results: Testing the Topconfects R package with simulated gene expression data shows the method outperforming current statistical approaches across a wide range of experiment sizes in identifying the genes with the largest LFCs. Applying the method to a TCGA breast cancer data set shows that it ranks some genes with large LFCs higher than traditional ranking by p-value would. Importantly, these two ranking methods lead to a different biological emphasis, in terms both of specific highly ranked genes and of gene-set enrichment. Conclusions: The choice of ranking method in differential expression analysis can affect the biological interpretation. The common default of ranking by p-value implicitly ranks by an effect size in which each gene is standardized to its own variability, rather than comparing genes on a common scale, which may not be appropriate. The Topconfects approach of presenting genes ranked by a confident LFC effect size is a variation on the TREAT method with improved usability, removing the need to fine-tune a threshold parameter and removing the temptation to abuse p-values as a de facto effect size.
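Topconfects is an R package; purely to illustrate the test-inversion idea (using a t approximation, not the package's actual algorithm), one can scan a grid of LFC thresholds, run a TREAT-style test plus Benjamini-Hochberg correction at each, and report for each gene the largest threshold at which it remains significant:

```python
import numpy as np
from scipy import stats

def treat_pvalues(lfc, se, df, tau):
    """TREAT-style p-values for H0: |true LFC| <= tau (t approximation)."""
    t = np.abs(lfc) / se
    d = tau / se
    return stats.t.sf(t - d, df) + stats.t.sf(t + d, df)

def bh_reject(p, fdr):
    """Benjamini-Hochberg step-up: boolean mask of rejected hypotheses."""
    m = len(p)
    order = np.argsort(p)
    passed = p[order] <= fdr * np.arange(1, m + 1) / m
    k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

def confects(lfc, se, df, fdr=0.05, taus=np.linspace(0.0, 5.0, 101)):
    """Per gene, the largest tau at which it stays significant under BH --
    a conservative lower confidence bound on |LFC| (illustrative sketch)."""
    bound = np.full(len(lfc), np.nan)
    for tau in taus:                       # taus scanned in increasing order
        rej = bh_reject(treat_pvalues(lfc, se, df, tau), fdr)
        bound[rej] = tau                   # still significant at this tau
    return bound                           # NaN = not significant even at tau=0
```

Sorting genes by this bound in decreasing order gives the confident-LFC ranking the abstract contrasts with ranking by p-value.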


2021 ◽  
Vol 108 (Supplement_7) ◽  
Author(s):  
Fatima Rahman ◽  
Ellen Copson ◽  
Alan Hales ◽  
David Rew

Background: Breast neoplasia displays complex patterns of whole-of-life disease progression, which are difficult to study using legacy data systems. Our timeline- and episode-structured breast cancer data set of 20,000 records allows direct visualisation of the entire documentary record of every patient. The embedded data-mining module permits research into a wide range of patient cohorts by pathology, treatment and outcome. Methods: We selected the cohort of patients aged between 15 and 75 with HER2−ve or HER2+ve breast cancer who were treated with neoadjuvant chemotherapy (NAC), with or without anti-HER2 therapy, between 2002 and 2019. We also studied the patterns and time intervals (in months) of disease progression and response to treatment from primary diagnosis, through loco-regional recurrence and distant metastasis, to final outcome. Results: Of the 301 women with confirmed early-stage breast cancer treated with NAC over that period, 186 had HER2− and 115 had HER2+ tumours. The patterns and intervals of disease progression, as displayed on the Master Lifetrack, were mapped and measured for every patient. The proportions of patients with HER2+ve tumours receiving trastuzumab and analogues, and the tumour responses to treatment, were audited. The underlying data set was validated by review of the original records. Conclusions: The whole-of-life, timeline-structured cancer data system introduces a new direction for clinical data visualisation, record management and user utility in surgical practice. This study validates the model as a tool for better understanding treatment effects and longitudinal behaviours in any selected range of cancer phenotypes.


2021 ◽  
Vol 11 ◽  
Author(s):  
Maoqing Lu ◽  
Sheng Qiu ◽  
Xianyao Jiang ◽  
Diguang Wen ◽  
Ronggui Zhang ◽  
...  

Background: Increasing evidence has indicated that abnormal epigenetic factors such as RNA m6A modification, histone modification, DNA methylation, RNA-binding proteins and transcription factors are correlated with hepatocarcinogenesis. However, it is unknown how epigenetic modification-associated genes contribute to the occurrence and clinical outcome of hepatocellular carcinoma (HCC). We therefore constructed epigenetic modification-associated models that may enhance the diagnosis and prognosis of HCC. Methods: In this study, we focused on the clinical value of epigenetic modification-associated genes for HCC. Gene expression data were collected from TCGA and from HCC data sets in the GEO database to ensure the reliability of the data. Their functions were analyzed by bioinformatics methods. We used lasso regression, support vector machine (SVM), logistic regression and Cox regression to construct the diagnostic and prognostic models, and we constructed a nomogram to assess the practicability of the prognostic model. The above results were verified in an independent liver cancer data set from the ICGC database and in clinical samples. Furthermore, we carried out a pan-cancer analysis to verify the specificity of the above models and screened a wide range of drug candidates. Results: Many epigenetic modification-associated genes differed significantly between HCC and normal liver tissues. The gene signatures showed good ability to predict the occurrence and survival of HCC patients, as verified by DCA and ROC curve analysis. Conclusion: Gene signatures based on epigenetic modification-associated genes can be used to identify the occurrence and prognosis of liver cancer.
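As a generic, hedged sketch of the kind of diagnostic-signature fitting described (the paper's exact pipeline and gene list are not reproduced here), an L1-penalized logistic model for tumour-versus-normal classification might look like this, with synthetic stand-in data where the TCGA/GEO expression matrices would be loaded:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X: samples x genes expression matrix; y: 1 = HCC tumour, 0 = normal liver.
# Random stand-in data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))
y = rng.integers(0, 2, size=200)

# Lasso-style (L1) logistic regression shrinks most gene coefficients to
# zero, leaving a sparse diagnostic signature.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1),
)
auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
print(f"cross-validated AUC: {auc:.2f}")   # ~0.5 on this random stand-in
```

The prognostic side would analogously use Cox regression on survival times, with ROC and decision-curve analysis (DCA) used, as in the abstract, to judge the fitted signatures.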


2021 ◽  
Author(s):  
Hitesh Mistry

The radiotherapy community has been striving to find markers of radiotherapy sensitivity for decades. In recent years it has spent significant resources exploring a wide range of omics data sets in search of that elusive perfect biomarker. One such candidate, termed the Radiosensitivity Index (RSI), has been heavily publicized as a marker suitable for making dose adjustments in the clinical setting. However, none of the analyses conducted thus far has assessed whether RSI explains enough of the outcome variance to elucidate a dose-response empirically. Here we re-analyze a pan-cancer data set and find that RSI is no better than random chance at explaining outcome variance (overall survival times). For completeness, we then assessed whether RSI captured a sufficient amount of outcome variance to elucidate a dose-response; it did not. These results suggest that, as with the initial in-vitro analysis 12 years earlier, RSI is not a marker of radiotherapy sensitivity and is thus not fit to be used in any dose-adjustment algorithm.
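As a hedged illustration of that kind of check (not the paper's actual code or data), one can compare how much a candidate marker improves a survival model over pure noise, for example via the concordance index of Cox models fit with lifelines on synthetic stand-in data:

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "time": rng.exponential(24.0, n),    # synthetic survival times (months)
    "event": rng.integers(0, 2, n),      # 1 = death observed, 0 = censored
    "marker": rng.normal(size=n),        # stand-in for an RSI-like score
    "noise": rng.normal(size=n),         # random covariate as the baseline
})

for cov in ("marker", "noise"):
    cph = CoxPHFitter().fit(df[["time", "event", cov]],
                            duration_col="time", event_col="event")
    print(cov, f"c-index = {cph.concordance_index_:.3f}")
# A marker that explains no outcome variance sits near c-index 0.5, i.e.
# indistinguishable from the random covariate -- the failure mode the
# abstract reports for RSI (criterion paraphrased loosely here).
```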


GIS Business ◽  
2019 ◽  
Vol 14 (6) ◽  
pp. 96-104
Author(s):  
P. Sakthivel ◽  
S. Rajaswaminathan ◽  
R. Renuka ◽  
N. R.Vembu

This paper empirically examines the inter-linkages between stock and crude oil prices before and after the 2008 subprime financial crisis, using Johansen co-integration and Granger causality techniques to explore both long- and short-run relationships. The data set of the Nifty index, Nifty energy index, BSE Sensex, BSE energy index and crude oil prices is divided into two periods: before the crisis (February 15, 2005 to December 31, 2007) and after the crisis (January 1, 2008 to December 31, 2018). The results reveal a one-way causal relationship running from crude oil prices to the Nifty index, Nifty energy index, BSE Sensex and BSE energy index, but not the other way around, in both periods. However, a bidirectional causal relationship exists between the BSE energy index and crude oil prices after the 2008 subprime financial crisis. The co-integration results suggest the absence of a long-run relationship between crude oil prices and the market indices (BSE Sensex, BSE energy index, Nifty index and Nifty energy index) both before and after the 2008 subprime financial crisis.
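For readers unfamiliar with the two techniques, here is a minimal sketch of both tests with statsmodels, on synthetic stand-in series rather than the study's actual index and oil-price data:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests
from statsmodels.tsa.vector_ar.vecm import coint_johansen

# Synthetic stand-ins; the study used daily Nifty/BSE indices and oil prices.
rng = np.random.default_rng(7)
oil = rng.normal(size=1000).cumsum()
nifty = 0.4 * np.roll(oil, 1) + rng.normal(size=1000).cumsum()
levels = pd.DataFrame({"nifty": nifty, "oil": oil})

# Short-run: does oil Granger-cause nifty? Test on differenced (return) data;
# the second column is tested as a predictor of the first.
returns = levels.diff().dropna()
res = grangercausalitytests(returns[["nifty", "oil"]], maxlag=5, verbose=False)
print("lag-1 F-test p-value:", res[1][0]["ssr_ftest"][1])

# Long-run: Johansen co-integration test on the price levels.
jres = coint_johansen(levels, det_order=0, k_ar_diff=1)
print("trace statistics:", jres.lr1)
print("95% critical values:", jres.cvt[:, 1])  # trace stat > critical value
                                               # rejects "no co-integration"
```

Running each test separately on the pre-crisis and post-crisis sub-periods, as the paper does, is then just a matter of slicing the frame by date before testing.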


2019 ◽  
Vol 16 (7) ◽  
pp. 808-817 ◽  
Author(s):  
Laxmi Banjare ◽  
Sant Kumar Verma ◽  
Akhlesh Kumar Jain ◽  
Suresh Thareja

Background: In spite of the availability of various treatment approaches, including surgery, radiotherapy, and hormonal therapy, steroidal aromatase inhibitors (SAIs) play a significant role as chemotherapeutic agents in the treatment of estrogen-dependent breast cancer, with the benefit of a reduced risk of recurrence. However, due to the greater toxicity and side effects associated with currently available anti-breast-cancer agents, there is an urgent need to develop target-specific AIs with a safer anti-breast-cancer profile. Methods: Designing target-specific and less toxic SAIs is a challenging task, though molecular modeling tools, viz. molecular docking simulations and QSAR, have been used for more than two decades for the fast and efficient design of novel, selective, potent and safe molecules against various biological targets. In order to design novel and selective SAIs, structure-guided, molecular-docking-assisted, alignment-dependent 3D-QSAR studies were performed on a data set comprising 22 molecules bearing a steroidal scaffold with a wide range of aromatase inhibitory activity. Results: The 3D-QSAR model developed using the molecular-weighted (MW) extent alignment approach showed better statistical quality and predictive ability than the model developed using the moments-of-inertia (MI) alignment approach. Conclusion: The explored binding interactions and the generated pharmacophoric features (steric and electrostatic) of the steroidal molecules could be exploited for the further design, direct synthesis and development of new, potentially safer SAIs, which could help reduce the mortality and morbidity associated with breast cancer.
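Alignment-dependent 3D-QSAR models of this kind are conventionally fit by partial least squares (PLS) and judged by the leave-one-out cross-validated q². The sketch below shows that generic validation loop with scikit-learn on random stand-in descriptors (the study's actual field descriptors, alignments and activities are not reproduced here):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# X: 22 molecules x field descriptors; y: aromatase inhibitory activity
# (e.g. pIC50). Random stand-ins for illustration only.
rng = np.random.default_rng(3)
X = rng.normal(size=(22, 150))
y = rng.normal(size=22)

pls = PLSRegression(n_components=3)
y_cv = cross_val_predict(pls, X, y, cv=LeaveOneOut()).ravel()

# Leave-one-out q2: 1 - PRESS / total sum of squares.
q2 = 1 - ((y - y_cv) ** 2).sum() / ((y - y.mean()) ** 2).sum()
print(f"leave-one-out q2: {q2:.2f}")   # q2 > 0.5 is the usual QSAR bar
```

Comparing q² (and conventional r²) between models built on the MW-extent and MI alignments is the kind of statistical comparison the abstract summarizes.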


Author(s):  
Eun-Young Mun ◽  
Anne E. Ray

Integrative data analysis (IDA) is a promising new approach in psychological research and has been well received in the field of alcohol research. This chapter provides a larger unifying research-synthesis framework for IDA. Major advantages of IDA of individual participant-level data include better and more flexible ways to examine subgroups, model complex relationships, deal with methodological and clinical heterogeneity, and examine infrequently occurring behaviors. However, between-study heterogeneity in measures, designs, and samples, as well as systematic study-level missing data, are significant barriers to IDA and, more broadly, to large-scale research synthesis. Drawing on the authors' experience with the Project INTEGRATE data set, which combined individual participant-level data from 24 independent college brief-alcohol-intervention studies, the chapter also recognizes that IDA investigations require a wide range of expertise and considerable resources, and that some minimum standards for reporting IDA studies may be needed to improve the transparency and quality of evidence.

