Pattern Discovery from Biological Data

2012 ◽  
pp. 724-768
Author(s):  
Jesmin Nahar ◽  
Kevin S. Tickle ◽  
A. B.M. Shawkat Ali

Extracting useful information from structured and unstructured biological data is crucial in the health industry. For example, medical practitioners need to identify breast cancer patients at an early stage, estimate the survival time of a heart disease patient, or recognize uncommon disease characteristics that appear suddenly. The volume of biological data available in databases is currently exploding, but information extraction and true open access to data require time to resolve issues such as ethical clearance. The emergence of novel IT technologies allows health practitioners to carry out comprehensive analyses of medical images, genomes, transcriptomes, and proteomes in health and disease. The information extracted with such technologies may soon dramatically change the pace of medical research and considerably affect the care of patients. This research reviews the existing technologies used in heart and cancer research and then proposes some possible solutions to overcome their limitations. In summary, the primary objective of this research is to investigate how existing machine learning techniques, with their strengths and limitations, are being used in the identification of heartbeat-related disease and the early detection of cancer in patients. After an extensive literature review, the following objectives were chosen: to develop a new approach to find the association between heartbeat and diseases such as high blood pressure and stroke; to propose an improved feature selection method to analyze huge image and microarray databases for machine learning algorithms in cancer research; to find an automatic distance function selection method for clustering tasks; to discover the most significant risk factors for specific cancers; and to determine the preventive factors for specific cancers that are aligned with the most significant risk factors.
Therefore, we propose a research plan to attain these objectives within this chapter. The possible solutions for the above objectives are as follows: new heartbeat identification techniques that show a promising association between heartbeat patterns and diseases; sensitivity-based feature selection methods applied to early cancer patient classification; meta-learning approaches adopted in clustering algorithms to select a distance function automatically; and the Apriori algorithm applied to discover the significant risk and preventive factors for specific cancers. We expect this research to make significant contributions to the medical profession by enabling more accurate diagnosis and better patient care. It will also contribute to other areas such as biomedical modeling, medical image analysis, and early disease warning.
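As an illustration of how the Apriori algorithm mentioned above finds co-occurring factors, the following is a minimal sketch in plain Python. It is not the authors' implementation: the records, the item names (factor_a, factor_b, factor_c, disease), and the 0.5 support threshold are made-up toy values chosen only to show level-wise frequent itemset mining and a rule confidence.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of records that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


def confidence(frequent, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent from stored supports."""
    a = frozenset(antecedent)
    return frequent[a | frozenset(consequent)] / frequent[a]


# Hypothetical toy records; each set lists the factors observed for one patient.
records = [
    {"factor_a", "factor_b", "disease"},
    {"factor_a", "disease"},
    {"factor_a", "factor_b", "disease"},
    {"factor_c", "factor_b"},
    {"factor_c"},
    {"factor_a", "factor_c", "disease"},
]

freq = apriori(records, min_support=0.5)
rule_conf = confidence(freq, {"factor_a"}, {"disease"})
print(sorted((sorted(s), round(v, 3)) for s, v in freq.items()))
print("confidence(factor_a -> disease) =", rule_conf)
```

In this toy data the only frequent pair is {factor_a, disease}; a real study would also need a principled choice of support and confidence thresholds.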


2021 ◽  
Vol 17 ◽  
Author(s):  
Md. Merajul Islam ◽  
Md. Jahanur Rahman ◽  
Dulal Chandra Roy ◽  
Md. Moidul Islam ◽  
Most. Tawabunnahar ◽  
...  

Background: Anemia is a major public health problem whose prevalence is rising worldwide, including in Bangladesh. Objectives: To identify the risk factors of anemia among women in Bangladesh and to predict anemia using machine learning (ML) based techniques. Methods: The anemia dataset, comprising 3,020 respondents, was extracted from the Bangladesh Demographic and Health Survey (BDHS). Two feature selection techniques, logistic regression (LR) and random forest (RF), were used to determine the risk factors of anemia. In addition, eight ML-based techniques, namely LR, linear discriminant analysis (LDA), k-nearest neighbors (KNN), support vector machine (SVM), quadratic discriminant analysis (QDA), neural network (NN), classification and regression tree (CART), and RF, were used to predict anemia among women in Bangladesh. Classification accuracy and area under the curve (AUC) were used to evaluate the performance of these classifiers. Results: The LR- and RF-based feature selection results indicate that, out of 15 factors, 13 (LR) and 14 (RF) appear to be significant risk factors for anemia among women. All predictive models achieve their highest classification accuracy and AUC under the RF-selected features, ranging from 74.10% to 81.29% and from 0.744 to 0.819, respectively. The combination of RF-based feature selection with an RF-based classifier gives the highest classification accuracy (81.29%) and AUC (0.819). Conclusion: Of the eight predictive models, the RF-RF combination shows the best performance for the prediction of anemia. This study suggests that policymakers could use this combination to make appropriate decisions for controlling anemia among Bangladeshi women while saving time and reducing cost.
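The two evaluation measures named above, classification accuracy and AUC, can be sketched in a few lines of plain Python. This is an illustrative sketch only: the labels and scores are invented toy values, not data from the study, and the 0.5 decision threshold is an assumption.

```python
def accuracy(y_true, y_score, threshold=0.5):
    """Fraction of cases whose thresholded score matches the true label."""
    preds = [1 if s >= threshold else 0 for s in y_score]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)


def roc_auc(y_true, y_score):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = anemic, 0 = not anemic) and classifier scores.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.7, 0.1]

acc = accuracy(y_true, y_score)
auc = roc_auc(y_true, y_score)
print("accuracy =", acc)  # 6 of 8 correct -> 0.75
print("AUC =", auc)       # 15 of 16 positive/negative pairs ranked correctly -> 0.9375
```

Accuracy depends on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is one reason both are often reported together.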


Mathematics ◽  
2021 ◽  
Vol 9 (11) ◽  
pp. 1226
Author(s):  
Saeed Najafi-Zangeneh ◽  
Naser Shams-Gharneh ◽  
Ali Arjomandi-Nezhad ◽  
Sarfaraz Hashemkhani Zolfani

Companies always seek ways to make their professional employees stay with them to reduce extra recruiting and training costs. Predicting whether a particular employee may leave or not will help the company to make preventive decisions. Unlike physical systems, human resource problems cannot be described by a scientific-analytical formula. Therefore, machine learning approaches are the best tools for this aim. This paper presents a three-stage (pre-processing, processing, post-processing) framework for attrition prediction. An IBM HR dataset is chosen as the case study. Since there are several features in the dataset, the “max-out” feature selection method is proposed for dimension reduction in the pre-processing stage. This method is implemented for the IBM HR dataset. The coefficient of each feature in the logistic regression model shows the importance of the feature in attrition prediction. The results show improvement in the F1-score performance measure due to the “max-out” feature selection method. Finally, the validity of parameters is checked by training the model for multiple bootstrap datasets. Then, the average and standard deviation of parameters are analyzed to check the confidence value of the model’s parameters and their stability. The small standard deviation of parameters indicates that the model is stable and is more likely to generalize well.
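The bootstrap stability check described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's pipeline: it uses synthetic data with a single feature, a hand-written gradient-descent logistic regression, and hypothetical settings (200 samples, 20 bootstrap resamples, learning rate 0.5). It does not implement the "max-out" feature selection method.

```python
import math
import random
import statistics


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, lr=0.5, steps=200):
    """Fit p(y=1|x) = sigmoid(w*x + b) by batch gradient descent on the mean log-loss."""
    w = b = 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def bootstrap_slopes(xs, ys, n_boot=20, seed=0):
    """Refit the model on n_boot resamples (drawn with replacement) and collect the slope."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        w, _ = fit_logistic([xs[i] for i in idx], [ys[i] for i in idx])
        slopes.append(w)
    return slopes


# Synthetic data (assumption): one feature, true slope 2.0, no intercept.
data_rng = random.Random(42)
xs = [data_rng.gauss(0.0, 1.0) for _ in range(200)]
ys = [1 if data_rng.random() < sigmoid(2.0 * x) else 0 for x in xs]

slopes = bootstrap_slopes(xs, ys)
mean_w = statistics.mean(slopes)
std_w = statistics.stdev(slopes)
print(f"slope: mean={mean_w:.3f}, std={std_w:.3f}")
```

A standard deviation that is small relative to the mean coefficient is the kind of evidence the abstract refers to when it calls the model's parameters stable.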


2020 ◽  
Vol 22 (Supplement_2) ◽  
pp. ii158-ii158
Author(s):  
Nicholas Nuechterlein ◽  
Beibin Li ◽  
James Fink ◽  
David Haynor ◽  
Eric Holland ◽  
...  

Abstract BACKGROUND Previously, we have shown that combined whole-exome sequencing (WES) and genome-wide somatic copy number alteration (SCNA) information can separate IDH1/2-wildtype glioblastoma into two prognostic molecular subtypes (Group 1 and Group 2) and that these subtypes cannot be distinguished by epigenetic or clinical features. However, the potential for radiographic features to discriminate between these molecular subtypes has not been established. METHODS Radiogenomic features (n=35,400) were extracted from 46 multiparametric, pre-operative magnetic resonance imaging (MRI) scans of IDH1/2-wildtype glioblastoma patients from The Cancer Imaging Archive, all of whom have corresponding WES and SCNA data in The Cancer Genome Atlas. We developed a novel feature selection method that leverages the structure of extracted radiogenomic MRI features to mitigate the dimensionality challenge posed by the disparity between the number of features and patients in our cohort. Seven traditional machine learning classifiers were trained to distinguish Group 1 versus Group 2 using our feature selection method. Our feature selection was compared to lasso feature selection, recursive feature elimination, and variance thresholding. RESULTS We were able to classify Group 1 versus Group 2 glioblastomas with a cross-validated area under the curve (AUC) score of 0.82 using ridge logistic regression and our proposed feature selection method, which reduces the size of our feature set from 35,400 to 288. An interrogation of the selected features suggests that features describing contours in the T2 abnormality region on the FLAIR MRI modality may best distinguish these two groups from one another. CONCLUSIONS We successfully trained a machine learning model that allows for relevant targeted feature extraction from standard MRI to accurately predict molecularly-defined risk-stratifying IDH1/2-wildtype glioblastoma patient groups. This algorithm may be applied to future prospective studies to assess the utility of MRI as a surrogate for costly prognostic genomic studies.


2021 ◽  
Vol 3 (1) ◽  
Author(s):  
Nicholas Nuechterlein ◽  
Beibin Li ◽  
Abdullah Feroze ◽  
Eric C Holland ◽  
Linda Shapiro ◽  
...  

Abstract Background Combined whole-exome sequencing (WES) and somatic copy number alteration (SCNA) information can separate isocitrate dehydrogenase (IDH)1/2-wildtype glioblastoma into two prognostic molecular subtypes, which cannot be distinguished by epigenetic or clinical features. The potential for radiographic features to discriminate between these molecular subtypes has yet to be established. Methods Radiologic features (n = 35 340) were extracted from 46 multisequence, pre-operative magnetic resonance imaging (MRI) scans of IDH1/2-wildtype glioblastoma patients from The Cancer Imaging Archive (TCIA), all of whom have corresponding WES/SCNA data. We developed a novel feature selection method that leverages the structure of extracted MRI features to mitigate the dimensionality challenge posed by the disparity between a large number of features and the limited patients in our cohort. Six traditional machine learning classifiers were trained to distinguish molecular subtypes using our feature selection method, which was compared to least absolute shrinkage and selection operator (LASSO) feature selection, recursive feature elimination, and variance thresholding. Results We were able to classify glioblastomas into two prognostic subgroups with a cross-validated area under the curve score of 0.80 (±0.03) using ridge logistic regression on the 15-dimensional principal component analysis (PCA) embedding of the features selected by our novel feature selection method. An interrogation of the selected features suggested that features describing contours in the T2 signal abnormality region on the T2-weighted fluid-attenuated inversion recovery (FLAIR) MRI sequence may best distinguish these two groups from one another. Conclusions We successfully trained a machine learning model that allows for relevant targeted feature extraction from standard MRI to accurately predict molecularly-defined risk-stratifying IDH1/2-wildtype glioblastoma patient groups.
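Of the baseline methods the abstract compares against, variance thresholding is the simplest to illustrate. The sketch below is a plain-Python illustration of that baseline only; it is not the authors' novel feature selection method, and the feature matrix and the 0.01 threshold are made-up toy values.

```python
import statistics


def variance_threshold(rows, threshold=0.0):
    """Return the indices of columns whose population variance exceeds threshold."""
    columns = list(zip(*rows))
    return [
        j for j, col in enumerate(columns)
        if statistics.pvariance(col) > threshold
    ]


def select_columns(rows, keep):
    """Keep only the listed column indices in every row."""
    return [[row[j] for j in keep] for row in rows]


# Hypothetical feature matrix: rows are patients, columns are extracted features.
# Column 0 is constant, column 1 varies clearly, column 2 varies only slightly.
features = [
    [1.0, 0.0, 3.1],
    [1.0, 1.0, 2.9],
    [1.0, 0.0, 3.0],
    [1.0, 1.0, 3.0],
]

kept_default = variance_threshold(features)       # drops only the constant column
kept_strict = variance_threshold(features, 0.01)  # also drops the near-constant column
reduced = select_columns(features, kept_strict)
print(kept_default, kept_strict, reduced)
```

Variance thresholding ignores the class labels entirely, which may help explain why it serves as a baseline rather than a method tailored to a setting with far more features than patients.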


Blood ◽  
2005 ◽  
Vol 106 (11) ◽  
pp. 2338-2338
Author(s):  
Lena Coïc ◽  
Suzanne Verlhac ◽  
Emmanuelle Lesprit ◽  
Emmanuelle Fleurence ◽  
Francoise Bernaudin

Abstract Abnormal TCD, defined as high mean maximum velocities > 200 cm/sec, is highly predictive of stroke risk and justifies a long-term transfusion program. The outcome and risk factors of conditional TCD, defined as velocities of 170–200 cm/sec, remain to be described. Patients and methods: Since 1992, 371 pediatric SCD patients (303 SS, 44 SC, 18 Sß+, 6 Sß0) have been systematically explored once a year by TCD. The newborn-screened cohort (n=174) had its first TCD exploration between 12 and 18 months of age. TCD was performed with a real-time imaging unit, using a 2 MHz sector transducer with color Doppler capabilities. Biological data were assessed at baseline, after the age of 1.5 years and at a distance from any transfusion or VOC. We report the characteristics and outcome of patients (n=43) with a history of conditional TCD, defined by mean maximum velocities ranging between 170 and 200 cm/s in the MCA, the ACA or the ICA. Results: The mean follow-up of TCD monitoring was 5.5 years (0–11.8 y). All patients with a history of conditional Doppler were SS/Sß0 (n=43). The mean (SD) age of patients at the time of their first conditional TCD was 4.3 years (2.2), whereas in our series the mean age at occurrence of abnormal TCD (> 200 cm/sec) was 6.6 years (3.2). Comparison of baseline parameters showed highly significant differences between patients with conditional TCD and those with normal TCD: Hb 7.4 vs 8.5 g/dl (p<0.001), MCV 82.8 vs 79 (p=0.047). We had also found such differences between patients with normal and those with abnormal TCD (Hb and MCV p< 0.001). Two patients were lost to follow-up. Two patients died during a trip to Africa. Conditional TCD became abnormal in 11/43 patients and justified a transfusion program. The mean (SD) conversion delay was 1.8 (2.0) years (range 0.5–7 y). No stroke occurred. Sixteen patients required treatment intensification for other indications (frequent VOC/ACS, splenic sequestrations): 6 were transplanted and 10 received HU or TP. Significant risk factors (Pearson) for conversion to abnormal TCD were age < 3 y at the time of conditional TCD occurrence (p<0.001), baseline Hb < 7 g/dl (p=0.02) and MCV > 80 (p=0.04). MRI/MRA was performed in 31/43 patients and showed ischemic lesions in 5 of them at a mean (SD) age of 7.1 y (1.8) (range 4.5–8.9): no significant difference was observed in the occurrence of lesions between the 2 groups. Conclusions: This study confirms the importance of age as a predictive factor of conversion from conditional to abnormal TCD, with a risk of 64% when the first conditional TCD occurred before the age of 3 years. TCD must be checked frequently during the first 5 years of life.

