oCEM: Automatic detection and analysis of overlapping co-expressed gene modules

BMC Genomics ◽  
2022 ◽  
Vol 23 (1) ◽  
Author(s):  
Quang-Huy Nguyen ◽  
Duc-Hau Le

Abstract Background Co-expressed gene module detection faces two typical challenges: overlap between identified modules, and local co-expression that appears only in a subset of biological samples. Module detection relies on unsupervised clustering approaches and algorithms. These methods are well developed, but a clustering method must be chosen separately for the sample-clustering and gene-clustering tasks, and the latter is often the more complicated. Results This study presents an R package, Overlapping CoExpressed gene Module (oCEM), which uses decomposition methods to address the challenges above. We also developed a novel auxiliary statistical approach that selects the optimal number of principal components using a permutation procedure. We further showed that oCEM outperformed state-of-the-art techniques in its ability to detect biologically relevant modules. Conclusions oCEM helps non-technical users easily perform complicated statistical analyses and obtain robust results. oCEM and its applications, along with example data, are freely available at https://github.com/huynguyen250896/oCEM.
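The abstract does not spell out how the permutation procedure works, so the following is only a minimal sketch of one common permutation scheme for choosing the number of principal components, not oCEM's actual implementation. The function name `n_components_by_permutation`, its parameters, and the synthetic data are all assumptions made for illustration. The idea: shuffle each gene independently across samples to break gene-gene correlation, then keep the leading components whose explained variance exceeds what the shuffled data produce.

```python
import numpy as np


def n_components_by_permutation(X, n_perm=200, alpha=0.05, seed=0):
    """Count leading principal components that beat a permutation null.

    Hypothetical helper for illustration; not part of oCEM.
    X has samples in rows and genes in columns.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Standardize each gene so that no single gene dominates the variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    def component_variances(M):
        # Squared singular values of the centered matrix, divided by n - 1,
        # are the variances explained by the principal components.
        s = np.linalg.svd(M - M.mean(axis=0), compute_uv=False)
        return s ** 2 / (M.shape[0] - 1)

    observed = component_variances(Z)
    null = np.empty((n_perm, observed.size))
    for b in range(n_perm):
        # Shuffle every gene independently across samples: each gene keeps
        # its own distribution, but gene-gene correlation is destroyed.
        shuffled = np.column_stack(
            [rng.permutation(Z[:, j]) for j in range(Z.shape[1])]
        )
        null[b] = component_variances(shuffled)

    # Permutation p-value per component (with the usual +1 correction).
    p_values = (1 + (null >= observed).sum(axis=0)) / (n_perm + 1)

    # Keep components in order until the first non-significant one.
    n_keep = 0
    for p in p_values:
        if p > alpha:
            break
        n_keep += 1
    return n_keep, p_values


# Synthetic data (assumption): 60 samples, 20 genes, two latent factors plus noise.
rng = np.random.default_rng(1)
factors = rng.normal(size=(60, 2))
loadings = np.zeros((2, 20))
loadings[0, :10] = 3.0   # genes 0-9 follow the first factor
loadings[1, 10:] = 3.0   # genes 10-19 follow the second factor
expression = factors @ loadings + rng.normal(size=(60, 20))

n_keep, p_values = n_components_by_permutation(expression)
print(n_keep, p_values[:4].round(3))
```

Stopping at the first non-significant component is a design choice of this sketch; other rules, such as multiple-testing correction across components, are equally plausible.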

2021 ◽  
Author(s):  
Quang-Huy Nguyen ◽  
Duc-Hau Le

Co-expressed gene module detection faces two typical challenges: overlap between identified modules, and local co-expression that appears only in a subset of biological samples. A recent study has reported that decomposition methods are the most appropriate ones for solving these challenges. In this study, we present an R tool, termed overlapping co-expressed gene module (overlappingCGM), which provides those methods within a wholly automatic analysis framework to help non-technical users easily perform complicated statistical analyses and obtain robust results. We also develop a novel auxiliary statistical approach that selects the optimal number of principal components using a permutation procedure. Two example datasets, related to human breast cancer and mouse metabolic syndrome, are used to illustrate the straightforward use of the tool. Computational experiment results show that overlappingCGM outperforms state-of-the-art techniques. The R scripts used in the study, including all information on the tool and its usage, are made publicly available at https://github.com/huynguyen250896/overlappingCGM.


Author(s):  
Samarendra Das ◽  
Shesh N. Rai

Selection of biologically relevant genes from high dimensional expression data is a key research problem in gene expression genomics. Most of the available gene selection methods are based on either relevancy or redundancy measures, which are usually judged through post-selection classification accuracy. With these methods, genes are ranked on a single high-dimensional expression dataset, which leads to the selection of spuriously associated and redundant genes. Hence, we developed a statistical approach that combines Support Vector Machine with Maximum Relevance and Minimum Redundancy under a sound statistical setup for the selection of biologically relevant genes. Here, the genes are selected through statistical significance values computed using a non-parametric test statistic under a bootstrap based subject sampling model. Further, a systematic and rigorous evaluation of the proposed approach against nine existing competitive methods was carried out on six different real crop gene expression datasets. This performance analysis was carried out under three comparison settings, i.e. subject classification, biologically relevant criteria based on quantitative trait loci, and gene ontology. Our analytical results showed that the proposed approach selects genes that are more biologically relevant than those selected by the existing methods. Moreover, the proposed approach was also found to perform better than the competing existing methods. The proposed statistical approach provides a framework for combining filter and wrapper methods of gene selection.


Entropy ◽  
2020 ◽  
Vol 22 (11) ◽  
pp. 1205
Author(s):  
Samarendra Das ◽  
Shesh N. Rai

Selection of biologically relevant genes from high-dimensional expression data is a key research problem in gene expression genomics. Most of the available gene selection methods are based on either relevancy or redundancy measures, which are usually judged through post-selection classification accuracy. With these methods, genes were ranked on a single high-dimensional expression dataset, which led to the selection of spuriously associated and redundant genes. Hence, we developed a statistical approach that combines a support vector machine with Maximum Relevance and Minimum Redundancy under a sound statistical setup for the selection of biologically relevant genes. Here, the genes were selected through statistical significance values computed using a nonparametric test statistic under a bootstrap-based subject sampling model. Further, a systematic and rigorous evaluation of the proposed approach against nine existing competitive methods was carried out on six different real crop gene expression datasets. This performance analysis was carried out under three comparison settings, i.e., subject classification, biologically relevant criteria based on quantitative trait loci, and gene ontology. Our analytical results showed that the proposed approach selects genes that are more biologically relevant than those selected by the existing methods. Moreover, the proposed approach was also found to perform better than the competing existing methods. The proposed statistical approach provides a framework for combining filter and wrapper methods of gene selection.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Gulden Olgun ◽  
Afshan Nabi ◽  
Oznur Tastan

Abstract Background While some non-coding RNAs (ncRNAs) are assigned critical regulatory roles, most remain functionally uncharacterized. This presents a challenge whenever an interesting set of ncRNAs needs to be analyzed in a functional context. Transcripts located close-by on the genome are often regulated together. This genomic proximity on the sequence can hint at a functional association. Results We present a tool, NoRCE, that performs cis enrichment analysis for a given set of ncRNAs. Enrichment is carried out using the functional annotations of the coding genes located proximal to the input ncRNAs. Other biologically relevant information such as topologically associating domain (TAD) boundaries, co-expression patterns, and miRNA target prediction information can be incorporated to conduct a richer enrichment analysis. To this end, NoRCE includes several relevant datasets as part of its data repository, including cell-line specific TAD boundaries, functional gene sets, and expression data for coding & ncRNAs specific to cancer. Additionally, the users can utilize custom data files in their investigation. Enrichment results can be retrieved in a tabular format or visualized in several different ways. NoRCE is currently available for the following species: human, mouse, rat, zebrafish, fruit fly, worm, and yeast. Conclusions NoRCE is a platform-independent, user-friendly, comprehensive R package that can be used to gain insight into the functional importance of a list of ncRNAs of any type. The tool offers flexibility to conduct the users’ preferred set of analyses by designing their own pipeline of analysis. NoRCE is available in Bioconductor and https://github.com/guldenolgun/NoRCE.
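The abstract does not state which statistical test underlies the enrichment analysis, so the following is a hedged sketch of one widely used option, a one-sided hypergeometric test, rather than NoRCE's own code. The function name `hypergeom_enrichment_p` and all gene counts are made up for illustration. It asks how surprising it is that a given number of the coding genes near the input ncRNAs carry a particular functional annotation.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_annotated, n_selected, n_overlap):
    """One-sided hypergeometric p-value, P(overlap >= n_overlap).

    Hypothetical helper for illustration; not taken from NoRCE.
    n_background: all coding genes considered
    n_annotated:  background genes carrying the annotation of interest
    n_selected:   coding genes found near the input ncRNAs
    n_overlap:    selected genes that carry the annotation
    """
    upper = min(n_annotated, n_selected)
    total = comb(n_background, n_selected)
    # Sum the probabilities of every overlap at least as large as observed.
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Made-up counts: 20000 background genes, 200 annotated, 50 neighbours, 8 overlap.
p_demo = hypergeom_enrichment_p(20000, 200, 50, 8)
print(f"{p_demo:.3g}")
```

With these counts only about 0.5 annotated genes would be expected among the 50 neighbours by chance, so an overlap of 8 gives a very small p-value. In practice such p-values are usually corrected for the number of annotation terms tested.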


Sensors ◽  
2021 ◽  
Vol 21 (10) ◽  
pp. 3311
Author(s):  
Riccardo Ballarini ◽  
Marco Ghislieri ◽  
Marco Knaflitz ◽  
Valentina Agostini

In motor control studies, the 90% thresholding of variance accounted for (VAF) is the classical way of selecting the number of muscle synergies expressed during a motor task. However, the adoption of an arbitrary cut-off has evident drawbacks. The aim of this work is to describe and validate an algorithm for choosing the optimal number of muscle synergies (ChoOSyn), which can overcome the limitations of VAF-based methods. The proposed algorithm is built on the following principles: (1) muscle synergies should be highly consistent across the various motor task epochs (i.e., remain stable in time), and (2) muscle synergies should constitute a base with low intra-level similarity (i.e., yield information-rich synergies and avoid redundancy). The algorithm's performance was evaluated against traditional approaches (threshold-VAF at 90% and 95%, elbow-VAF and plateau-VAF), using both a simulated dataset and a real dataset of 20 subjects. The performance evaluation was carried out by analyzing muscle synergies extracted from surface electromyographic (sEMG) signals collected during walking tasks lasting 5 min. On the simulated dataset, ChoOSyn performed comparably to VAF-based methods, while on the real dataset it clearly outperformed the other methods in terms of the fraction of correct classifications, mean error (ME), and root mean square error (RMSE). The proposed approach may help standardize the selection of the number of muscle synergies across research laboratories, independent of arbitrary thresholds.
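To make the drawback of a fixed cut-off concrete, here is a minimal sketch of the classical threshold-VAF rule (the baseline, not the ChoOSyn algorithm itself). The uncentered VAF definition, the function names, and the VAF curve values are assumptions chosen for illustration.

```python
def vaf(original, reconstructed):
    """Uncentered variance accounted for: 1 - SSE / SST.

    Assumption: the uncentered form is used; some studies center the signal first.
    """
    sse = sum((x - y) ** 2 for x, y in zip(original, reconstructed))
    sst = sum(x ** 2 for x in original)
    return 1.0 - sse / sst


def select_by_vaf_threshold(vaf_by_k, threshold=0.90):
    """Return the smallest number of synergies whose VAF meets the threshold."""
    for k in sorted(vaf_by_k):
        if vaf_by_k[k] >= threshold:
            return k
    return None  # no candidate reaches the threshold


# Made-up VAF curve: number of synergies -> VAF of the reconstruction.
vaf_curve = {1: 0.62, 2: 0.81, 3: 0.91, 4: 0.95, 5: 0.97}

k90 = select_by_vaf_threshold(vaf_curve, 0.90)
k95 = select_by_vaf_threshold(vaf_curve, 0.95)
print(k90, k95)
```

On this made-up curve the two conventional thresholds select different numbers of synergies from the same data, which is the kind of arbitrariness the abstract refers to.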


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Xinyu Li ◽  
Wei Zhang ◽  
Jianming Zhang ◽  
Guang Li

Abstract Background Given expression data, gene regulatory network (GRN) inference approaches try to determine regulatory relations. However, current inference methods ignore the inherent topological characteristics of GRNs to some extent, leading to structures that lack a clear biological explanation. To increase the biophysical meaning of inferred networks, this study performed data-driven module detection before network inference. Gene modules were identified by decomposition-based methods. Results ICA-decomposition based module detection methods have been used to detect functional modules directly from transcriptomic data. Experiments on time-series expression, curated and scRNA-seq datasets suggested advantages of the proposed ModularBoost method over established methods, especially in efficiency and accuracy. For scRNA-seq datasets, the ModularBoost method outperformed other candidate inference algorithms. Conclusions As a complicated task, GRN inference can be decomposed into several tasks of reduced complexity. Using identified gene modules as topological constraints, the initial inference problem can be solved by inferring intra-modular and inter-modular interactions separately. Experimental outcomes suggest that the proposed ModularBoost method can improve the accuracy and efficiency of inference algorithms by introducing topological constraints.


2021 ◽  
Vol 263 (1) ◽  
pp. 4955-4961
Author(s):  
Mathieu Gontier ◽  
Barbara Romeyns

In industry segments such as automotive and industrial equipment, the use of compressed porous materials is well known to improve the global acoustic performance of the complete system. Such porous materials should be designed specifically to reach a significant acoustic sealing performance at different compression rates. Unfortunately, there are neither standard measurement procedures nor predefined material characteristics that allow the selection of the right material with the optimal acoustic performance. The main goal of this research is to link the acoustic performance of compressed porous materials with intrinsic material characteristics using statistical techniques.


2015 ◽  
Vol 1 (311) ◽  
Author(s):  
Piotr Tarka

Abstract: The objective of the article is a comparative analysis of Likert rating scales with 5, 7, 9 and 11 response categories in the context of the factor extraction process in exploratory factor analysis (EFA). The problem addressed in the article is primarily methodological: selecting the optimal number of response categories for the measured items (constituting the Likert scale), and identifying possible changes, differences or similarities (resulting from the four types of scales) in the extraction and determination of the appropriate number of factors in the EFA model.

Keywords: Exploratory factor analysis, Likert scale, experiment research, marketing


2021 ◽  
pp. 1-16
Author(s):  
Aikaterini Karanikola ◽  
Charalampos M. Liapis ◽  
Sotiris Kotsiantis

In short, clustering is the process of partitioning a given set of objects into groups containing highly related instances. This relation is determined by a specific distance metric with which the intra-cluster similarity is estimated. Finding an optimal number of such partitions is usually the key step in the entire process, yet a rather difficult one, not least because the term “optimal” is quite ambiguous. Selecting an unsuitable number of clusters might lead to incorrect conclusions and, consequently, to wrong decisions. Furthermore, various inherent characteristics of the datasets, such as clusters that overlap or clusters containing subclusters, will most often increase the difficulty of the task. Thus, the methods used to detect similarities and the parameter selection of the partition algorithm have a major impact on the quality of the groups and the identification of their optimal number. Given that each dataset constitutes a rather distinct case, validity indices are indicators introduced to address the problem of selecting such an optimal number of clusters. In this work, an extensive set of well-known validity indices, based on the approach of the so-called relative criteria, is examined comparatively. A total of 26 cluster validation measures were investigated in two distinct case studies: one on real-world data and one on artificially generated data. To ensure a certain degree of difficulty, both real-world and generated data were selected to exhibit variations and inhomogeneity. Each of the indices is deployed under the schemes of 9 different clustering methods, which incorporate 5 different distance metrics. All results are presented in various explanatory forms.
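The abstract does not list the 26 indices it compares, so the following sketch uses the silhouette coefficient purely as a familiar example of a relative validity index; whether it is among the measures studied is an assumption. The function name, the Euclidean distance choice, and the toy points are likewise illustrative. The index compares candidate partitions of the same data, and a higher mean value suggests a better-fitting number of clusters.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient with Euclidean distance (illustrative helper)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention: singleton clusters score 0
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data: two tight, well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
two_clusters = [0, 0, 0, 1, 1, 1]
three_clusters = [0, 0, 1, 2, 2, 2]  # needlessly splits the first group

score_two = silhouette_score(points, two_clusters)
score_three = silhouette_score(points, three_clusters)
print(round(score_two, 3), round(score_three, 3))
```

Here the two-cluster labelling scores higher than the three-cluster one, matching the structure of the toy data. On overlapping or nested clusters, different indices can disagree, which is the situation the comparison above is concerned with.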

