Monte Carlo Feature Selection
Recently Published Documents


TOTAL DOCUMENTS

20
(FIVE YEARS 9)

H-INDEX

7
(FIVE YEARS 2)

2021 ◽  
Vol 22 (21) ◽  
pp. 11591
Author(s):  
Teresa Szczepińska ◽  
Ayatullah Faruk Mollah ◽  
Dariusz Plewczynski

The nature of genome organization into two basic structural compartments is not yet understood. However, it has been indicated to be a mechanism of gene expression regulation. Using a classification approach, we ranked genomic marks that hint at compartmentalization. We considered a broad range of marks, including GC content, histone modifications, DNA-binding proteins, open chromatin, transcription and genome regulatory segmentation in GM12878 cells. Genomic marks were defined over CTCF or RNAPII loops, which are basic elements of genome 3D structure, and over 100 kb genomic windows. Experiments were carried out to empirically assess the whole set of features, as well as the individual features, in the classification of loops/windows into compartment A or B. Using Monte Carlo Feature Selection and Analysis of Variance, we constructed a ranking of feature importance for classification. The best simple indicator of compartmentalization is the DNase-seq open chromatin measurement for CTCF loops, H3K4me1 for RNAPII loops and H3K79me2 for genomic windows. Among DNA-binding proteins, the best indicator is the RUNX3 transcription factor for loops and RNAPII for genomic windows. Chromatin state prediction methods that indicate active elements such as promoters and enhancers, or heterochromatin, enhance the prediction of loop segregation into compartments. However, the H3K9me3, H4K20me1 and H3K27me3 histone modifications and GC content are poor indicators of compartments.
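
The Analysis of Variance ranking mentioned above can be illustrated with a minimal sketch. It computes a one-way ANOVA F statistic per feature and sorts features by it. The feature names (`mark_1`, `mark_2`) and all values are hypothetical toy data, not measurements from the study, and the study's actual procedure may differ.

```python
def anova_f(values, labels):
    """One-way ANOVA F statistic for one feature across labelled groups."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    grand_mean = sum(values) / len(values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups.values())
    # Variation of the samples around their own group mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2
                    for g in groups.values() for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    # Assumes ss_within > 0, i.e. the feature varies inside at least one group.
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical toy data: six loops/windows labelled with compartment A or B.
compartments = ["A", "A", "A", "B", "B", "B"]
features = {
    "mark_1": [5.0, 6.0, 7.0, 1.0, 2.0, 3.0],  # separates A from B well
    "mark_2": [1.0, 2.0, 3.0, 2.0, 3.0, 4.0],  # overlaps heavily
}

f_scores = {name: anova_f(vals, compartments) for name, vals in features.items()}
ranking = sorted(f_scores, key=f_scores.get, reverse=True)
print(ranking, f_scores)
```

A larger F value means the feature's group means differ more relative to the spread within each group, so such a feature would rank higher as a simple compartment indicator.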


Author(s):  
Lei Chen ◽  
Xianchao Zhou ◽  
Tao Zeng ◽  
Xiaoyong Pan ◽  
Yu-Hang Zhang ◽  
...  

Cancer has been generally defined as a cluster of systematic malignant pathogenesis involving abnormal cell growth. Genetic mutations derived from environmental factors and inherited genetics trigger the initiation and progression of cancers. Although several well-known factors affect cancer, the mutation features and rules that affect cancers are relatively unknown because related studies are limited. In this study, we performed a computational investigation of the mutation profiles of cancer samples from 27 cancer types. These profiles were first analyzed by the Monte Carlo Feature Selection (MCFS) method, which produced a ranked feature list. The incremental feature selection (IFS) method then used this list to extract essential mutation features related to the 27 cancer types, derive 207 mutation rules and construct efficient classifiers. The top 37 mutation features corresponding to different cancer types were discussed. All the qualitatively analyzed gene mutation features contribute to distinguishing different types of cancer, and most of the mutation rules are supported by recent literature. Therefore, our computational investigation could identify potential biomarkers and prediction rules for cancers at the mutation signature level.
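
The idea behind an MCFS-style ranking can be sketched as follows. This is a deliberately simplified illustration, not the published MCFS algorithm: it uses one-split decision stumps instead of full decision trees, assumes binary 0/1 labels, and credits the chosen feature with plain held-out accuracy rather than the relative importance formula of MCFS. All function names and the synthetic data are assumptions made for the example.

```python
import random


def fit_stump(X, y, features):
    """Best single-threshold rule over the given features (binary 0/1 labels)."""
    best = None
    for f in features:
        values = sorted({row[f] for row in X})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            for left, right in ((0, 1), (1, 0)):
                hits = sum((left if row[f] <= threshold else right) == label
                           for row, label in zip(X, y))
                if best is None or hits > best[0]:
                    best = (hits, f, threshold, left, right)
    return best  # None if no feature has two distinct values


def mcfs_style_ranking(X, y, n_subsets=200, subset_size=3,
                       test_fraction=0.3, seed=0):
    """Score features over many random feature subsets and random splits."""
    rng = random.Random(seed)
    n_features = len(X[0])
    scores = [0.0] * n_features
    indices = list(range(len(X)))
    for _ in range(n_subsets):
        subset = rng.sample(range(n_features), subset_size)
        rng.shuffle(indices)
        cut = int(len(indices) * test_fraction)
        test, train = indices[:cut], indices[cut:]
        stump = fit_stump([X[i] for i in train], [y[i] for i in train], subset)
        if stump is None:
            continue
        _, f, threshold, left, right = stump
        accuracy = sum((left if X[i][f] <= threshold else right) == y[i]
                       for i in test) / len(test)
        scores[f] += accuracy  # credit the feature the stump split on
    ranking = sorted(range(n_features), key=lambda f: -scores[f])
    return ranking, scores


# Synthetic data: only feature 0 determines the label, the rest are noise.
data_rng = random.Random(1)
X = [[data_rng.random() for _ in range(6)] for _ in range(60)]
y = [1 if row[0] > 0.5 else 0 for row in X]

ranking, scores = mcfs_style_ranking(X, y)
print(ranking)
```

Because the informative feature is selected whenever it appears in a sampled subset, it accumulates far more credit than any noise feature and should appear first in the ranking.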


Author(s):  
Jin-Fan Li ◽  
Xiao-Jing Ma ◽  
Lin-Lin Ying ◽  
Ying-hui Tong ◽  
Xue-ping Xiang

Acute lymphoblastic leukemia (ALL), a common cancer, is a heterogeneous disease that is mainly divided into BCP-ALL and T-ALL, which account for 80–85% and 15–20% of cases, respectively. There are many differences between BCP-ALL and T-ALL, including prognosis, treatment, drug screening and gene research. In this study, starting with methylation and gene expression data, we analyzed the molecular differences between BCP-ALL and T-ALL and identified multi-omics signatures using the Boruta and Monte Carlo feature selection methods. There were 7 expression signature genes (CD3D, VPREB3, HLA-DRA, PAX5, BLNK, GALNT6, SLC4A8) and 168 methylation sites corresponding to 175 methylation signature genes. The overall accuracy, the accuracy for BCP-ALL and the accuracy for T-ALL of the RIPPER (Repeated Incremental Pruning to Produce Error Reduction) classifier using these signatures, evaluated with 10-fold cross-validation repeated 3 times, were 0.973, 0.990, and 0.933, respectively. The two genes shared between the 175 methylation signature genes and the 7 expression signature genes were CD3D and VPREB3. Network analysis of the methylation and expression signature genes suggested that their common gene, CD3D, not only differed at both the methylation and expression levels, but also played a key regulatory role as a hub in the network. Our results provided insights into the underlying molecular mechanisms of ALL and could facilitate more precise diagnosis and treatment of ALL.
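
The evaluation protocol described above (10-fold cross-validation repeated 3 times, reporting overall and per-class accuracy) can be sketched as below. This is only an illustration of the protocol: a nearest-centroid classifier stands in for RIPPER, the one-dimensional data are synthetic and perfectly separable, and every function name is an assumption. It does not reproduce the study's results.

```python
import random
from collections import defaultdict


def stratified_folds(y, k, rng):
    """Split sample indices into k folds with similar class proportions."""
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, i in enumerate(indices):
            folds[position % k].append(i)
    return folds


def fit_centroids(X, y):
    """Mean vector per class (stand-in classifier, not RIPPER)."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()}


def predict(centroids, row):
    """Label of the closest class centroid (squared Euclidean distance)."""
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])))


def repeated_cv(X, y, k=10, repeats=3, seed=0):
    """Overall and per-class accuracy pooled over repeated stratified k-fold CV."""
    rng = random.Random(seed)
    hits, totals = defaultdict(int), defaultdict(int)
    for _ in range(repeats):
        for fold in stratified_folds(y, k, rng):
            held_out = set(fold)
            train = [i for i in range(len(X)) if i not in held_out]
            centroids = fit_centroids([X[i] for i in train], [y[i] for i in train])
            for i in fold:
                totals[y[i]] += 1
                hits[y[i]] += predict(centroids, X[i]) == y[i]
    per_class = {label: hits[label] / totals[label] for label in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return overall, per_class


# Synthetic, imbalanced toy data: 80 samples of one class, 20 of the other.
data_rng = random.Random(42)
X = ([[data_rng.uniform(-1, 1)] for _ in range(80)]
     + [[5 + data_rng.uniform(-1, 1)] for _ in range(20)])
y = ["BCP-ALL"] * 80 + ["T-ALL"] * 20

overall, per_class = repeated_cv(X, y)
print(overall, per_class)
```

Reporting accuracy per class alongside the overall figure matters for imbalanced data such as this, because a classifier can score well overall while performing poorly on the smaller class.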


2020 ◽  
Vol 11 ◽  
Author(s):  
Yu-Hang Zhang ◽  
Zhandong Li ◽  
Tao Zeng ◽  
Xiaoyong Pan ◽  
Lei Chen ◽  
...  

Glioblastoma, also called glioblastoma multiforme (GBM), is the most aggressive cancer that initiates within the brain. GBM is produced in the central nervous system. Cancer cells in GBM are similar to stem cells. Several different schemes for GBM stratification exist. These schemes are based on intertumoral molecular heterogeneity, preoperative images, and integrated tumor characteristics. Although the formation of glioblastoma is remarkably related to gene methylation, GBM has been poorly classified by epigenetics. To classify glioblastoma subtypes on the basis of different degrees of gene methylation, we adopted several powerful machine learning algorithms to identify numerous methylation features (sites) associated with the classification of GBM. The features were first analyzed by a feature selection method, Monte Carlo feature selection (MCFS), resulting in a ranked feature list. This list was then fed into incremental feature selection (IFS), incorporating one classification algorithm, to extract essential sites. These sites can be annotated onto coding genes, such as CXCR4, TBX18, SP5, and TMEM22, and are enriched in relevant biological functions related to GBM classification (e.g., subtype-specific functions). Representative functions, such as nervous system development, intrinsic plasma membrane component, calcium ion binding, systemic lupus erythematosus, and alcoholism, are potential pathogenic functions that participate in the initiation and progression of glioblastoma and its subtypes. With these sites, an efficient model can be built to classify the subtypes of glioblastoma.
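
The step of feeding a ranked feature list into IFS can be sketched as follows. Given a ranking, the sketch evaluates a classifier on the top 1, top 2, ... features and keeps the smallest prefix with the best score. The classifier here is a 1-nearest-neighbour model scored by leave-one-out accuracy, chosen only to keep the example short; the abstract does not name the classifier used. The data are a hand-built toy set, and all names are assumptions.

```python
def loo_1nn_accuracy(X, y, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    correct = 0
    for i, row in enumerate(X):
        others = [j for j in range(len(X)) if j != i]
        # Ties in distance are broken by the lowest sample index.
        nearest = min(others,
                      key=lambda j: sum((row[f] - X[j][f]) ** 2 for f in features))
        correct += y[nearest] == y[i]
    return correct / len(X)


def incremental_feature_selection(X, y, ranked_features):
    """Score every top-k prefix of the ranking; return the best k and the curve."""
    curve = [loo_1nn_accuracy(X, y, ranked_features[:k])
             for k in range(1, len(ranked_features) + 1)]
    # max() returns the first maximum, so ties favour the smaller feature set.
    best_k = max(range(len(curve)), key=curve.__getitem__) + 1
    return best_k, curve


# Toy data: the label is the XOR of two well-separated features (columns 0, 1);
# column 2 is small deterministic jitter with no relation to the label.
centers = [((0, 0), 0), ((0, 10), 1), ((10, 0), 1), ((10, 10), 0)]
X, y = [], []
for (a, b), label in centers:
    for _ in range(3):
        X.append([a, b, (len(X) * 7 % 5) / 10])
        y.append(label)

ranked_features = [0, 1, 2]  # assumed output of an earlier ranking step
best_k, curve = incremental_feature_selection(X, y, ranked_features)
print(best_k, curve)
```

In this toy set neither of the first two features is sufficient alone, so the accuracy curve rises when the second feature is added and stays flat afterwards, which is the kind of pattern IFS uses to decide where to cut the ranked list.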


2020 ◽  
Vol 2020 ◽  
pp. 1-6
Author(s):  
Bing Hu ◽  
Yun Li ◽  
Guilian Wang ◽  
Yanqing Zhang

Kawasaki disease (KD) is an acute vasculitis, accompanied by coronary artery aneurysm, coronary artery dilatation, arrhythmia, and other serious cardiovascular diseases. The etiology of KD remains unclear, so it is necessary to study the molecular mechanism and related factors of KD. In this study, we analyzed the expression profiles of 75 DB (identifying bacteria), 122 DV (identifying virus), 71 HC (healthy control), and 311 KD (Kawasaki disease) samples. A total of 332 key genes related to KD and pathogen infections were identified using a combination of advanced feature selection methods: (1) Boruta, (2) Monte-Carlo Feature Selection (MCFS), and (3) Incremental Feature Selection (IFS). The number of signature genes was narrowed down step by step. Subsequently, their functions were revealed by KEGG and GO enrichment analyses. Our results provided clues to the potential molecular mechanisms of KD and were helpful for KD detection and treatment.


2020 ◽  
Vol 2020 ◽  
pp. 1-13
Author(s):  
Rizwan Niaz ◽  
Ibrahim M. Almanjahie ◽  
Zulfiqar Ali ◽  
Muhammad Faisal ◽  
Ijaz Hussain

The spatial distribution of meteorological stations has a significant role in hydrological research. Meteorological data play a significant role in drought monitoring; in this regard, accurate and suitable provision of meteorological stations is becoming crucial to improve and strengthen the skill of drought prediction. From this perspective, the choice of meteorological stations in a specific region has substantial importance for accurate estimation and continuous monitoring of drought hazards at the regional level. However, installing and mining data from a large number of meteorological stations is costly and resource-intensive. Therefore, it is necessary to rank and find dependencies among existing meteorological stations in a particular region for further climatological analysis and reanalysis of databases. In this paper, a framework based on the Monte Carlo feature selection and interdependency discovery (MCFS-ID) algorithm is proposed to identify the important meteorological stations in a particular region. We applied the proposed framework to 12 meteorological stations situated in varying climatological regions of Punjab (Pakistan). We employed the drought index SPTI on 1-, 3-, 6-, 9-, 12-, 24-, and 48-month time-scale data to find the interdependencies among meteorological stations at various locations. We found that Sialkot has significant regional importance for studying the SPTI-3, SPTI-6, and SPTI-48 indices. This regional importance is based on scores of relative importance (RI); for example, the RI values for the SPTI-3, SPTI-6, and SPTI-48 indices are 0.1570, 0.1080, and 0.0270, respectively. Furthermore, the Jhelum station has higher relative importance (RI = 0.1410 and 0.1030) for the SPTI-1 and SPTI-9 indices, while varying concentration behaviour is observed in the remaining time scales.

