HextractoR: an R package for automatic extraction of hairpins from genome-wide data

Abstract Extracting stem-loop sequences (hairpins) from genome-wide data is an important preprocessing step for several data mining tasks in bioinformatics, because it strongly influences the later steps and the final results. For example, for novel miRNA prediction, all well-known hairpins must be properly located. Although there are some scripts that can be adapted and put together to achieve this task, they are outdated, none of them guarantees finding correspondence to well-known structures in the genome under analysis, and they do not take advantage of the latest advances in secondary structure prediction. We present here an R package for automatic extraction of hairpins from genome-wide data (HextractoR). HextractoR performs an exhaustive analysis of the genome in order to obtain a high-quality set of short sequences for further processing. Moreover, genomes can be processed in parallel and with low memory requirements. Results showed that HextractoR outperformed other methods. HextractoR is freely available at CRAN and Sourceforge.
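To make the idea of a hairpin candidate concrete, here is a minimal Python sketch. It is not HextractoR's algorithm, which relies on secondary structure prediction; it is a deliberately naive scan that reports positions where a short stem is followed, after a loop of bounded length, by its exact reverse complement. The function names and the default stem and loop lengths are illustrative assumptions.

```python
# Naive stem-loop candidate scan (illustrative only, not HextractoR's method).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def find_hairpin_candidates(seq, stem_len=4, min_loop=3, max_loop=8):
    """Return (start, end, loop_len) for every perfect stem-loop match.

    start is the first index of the 5' stem arm and end is one past the
    last index of the 3' stem arm (half-open interval, 0-based).
    """
    hits = []
    for i in range(len(seq) - stem_len + 1):
        target = reverse_complement(seq[i:i + stem_len])
        for loop_len in range(min_loop, max_loop + 1):
            j = i + stem_len + loop_len          # start of the 3' arm
            if seq[j:j + stem_len] == target:    # short slices never match
                hits.append((i, j + stem_len, loop_len))
    return hits


if __name__ == "__main__":
    # 5' arm GGAC, loop AAAAA, 3' arm GTCC (reverse complement of GGAC)
    print(find_hairpin_candidates("TTGGACAAAAAGTCCTT"))
```

A real pipeline would score candidates by predicted folding energy rather than exact base pairing, which is where the secondary structure prediction mentioned in the abstract comes in.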

Data-adaptive multi-locus association testing in subjects with arbitrary genealogical relationships

Statistical Applications in Genetics and Molecular Biology ◽

10.1515/sagmb-2018-0030 ◽

2019 ◽

Vol 18 (3) ◽

Cited By ~ 1

Author(s):

Gail Gong ◽

Wei Wang ◽

Chih-Lin Hsieh ◽

David J. Van Den Berg ◽

Christopher Haiman ◽

...

Keyword(s):

Prostate Cancer ◽

R Package ◽

Suppressor Gene ◽

Test Statistic ◽

Specific Data ◽

Association Tests ◽

Association Testing ◽

Genome Wide ◽

Genome Wide Data ◽

Data Adaptive

Abstract Genome-wide sequencing enables evaluation of associations between traits and combinations of variants in genes and pathways. But such evaluation requires multi-locus association tests with good power, regardless of the variant and trait characteristics. And since analyzing families may yield more power than analyzing unrelated individuals, we need multi-locus tests applicable to both related and unrelated individuals. Here we describe such tests, and we introduce SKAT-X, a new test statistic that uses genome-wide data obtained from related or unrelated subjects to optimize power for the specific data at hand. Simulations show that: a) SKAT-X performs well regardless of variant and trait characteristics; and b) for binary traits, analyzing affected relatives brings more power than analyzing unrelated individuals, consistent with previous findings for single-locus tests. We illustrate the methods by application to rare unclassified missense variants in the tumor suppressor gene BRCA2, as applied to combined data from prostate cancer families and unrelated prostate cancer cases and controls in the Multi-ethnic Cohort (MEC). The methods can be implemented using open-source code for public use as the R-package GATARS (Genetic Association Tests for Arbitrarily Related Subjects) <https://gailg.github.io/gatars/>.

Qtlizer: comprehensive QTL annotation of GWAS results

Scientific Reports ◽

10.1038/s41598-020-75770-7 ◽

2020 ◽

Vol 10 (1) ◽

Author(s):

Matthias Munz ◽

Inken Wohlers ◽

Eric Simon ◽

Tobias Reinberger ◽

Hauke Busch ◽

...

Keyword(s):

Association Studies ◽

Housekeeping Genes ◽

R Package ◽

Genome Wide Association Studies ◽

Protein Abundance ◽

Base Pairs ◽

Link Type ◽

Genome Wide ◽

Wide Range ◽

Distance Limit

Abstract Exploration of genetic variant-to-gene relationships by quantitative trait loci such as expression QTLs is a frequently used tool in genome-wide association studies. However, the wide range of public QTL databases and the lack of batch annotation features complicate a comprehensive annotation of GWAS results. In this work, we introduce the tool “Qtlizer” for annotating lists of human variants with associated changes in gene expression and protein abundance using an integrated database of published QTLs. Features include incorporation of variants in linkage disequilibrium and reverse search by gene names. Analyzing the database for base pair distances between best significant eQTLs and their affected genes suggests that the commonly used cis-distance limit of 1,000,000 base pairs might be too restrictive, implying that a substantial number of eQTLs are wrongly assigned or remain undetected. We also ranked genes with respect to the maximum number of tissue-specific eQTL studies in which a most significant eQTL signal was consistent. For the top 100 genes we observed the strongest enrichment with housekeeping genes (P = 2 × 10⁻⁶) and with the 10% highest expressed genes (P = 0.005) after grouping eQTLs by r² > 0.95, underlining the relevance of LD information in eQTL analyses. Qtlizer can be accessed via https://genehopper.de/qtlizer or by using the respective Bioconductor R-package (https://doi.org/10.18129/B9.bioc.Qtlizer).
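The cis-distance limit discussed in this abstract can be illustrated with a small Python sketch. The distance definition used here (zero inside the gene body, otherwise the distance to the nearest gene boundary) is an assumption made for illustration and may differ from the definition used in the Qtlizer analysis; the function names are hypothetical.

```python
# Illustrative check of a variant-gene pair against a cis-distance limit.
CIS_LIMIT_BP = 1_000_000  # the commonly used limit cited in the abstract


def distance_to_gene(variant_pos, gene_start, gene_end):
    """Distance in bp from a variant to a gene on the same chromosome.

    Assumed definition: 0 if the variant lies within the gene body,
    otherwise the distance to the nearest gene boundary.
    """
    if gene_start <= variant_pos <= gene_end:
        return 0
    return min(abs(variant_pos - gene_start), abs(variant_pos - gene_end))


def classify(variant_pos, gene_start, gene_end, limit=CIS_LIMIT_BP):
    """Label a pair 'cis' if it falls within the limit, else 'beyond-limit'."""
    distance = distance_to_gene(variant_pos, gene_start, gene_end)
    return "cis" if distance <= limit else "beyond-limit"


if __name__ == "__main__":
    # A variant 1.5 Mb downstream of the gene is excluded by the 1 Mb limit.
    print(classify(3_500_000, 1_000_000, 2_000_000))
```

A pair labelled "beyond-limit" here is exactly the kind of association that a fixed 1,000,000 bp window would never test.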

Applications of Multifactor Dimensionality Reduction to Genome-Wide Data Using the R Package ‘MDR’

Methods in Molecular Biology - Genome-Wide Association Studies and Genomic Prediction ◽

10.1007/978-1-62703-447-0_23 ◽

2013 ◽

pp. 479-498 ◽

Cited By ~ 1

Author(s):

Stacey Winham

Keyword(s):

Dimensionality Reduction ◽

Multifactor Dimensionality Reduction ◽

R Package ◽

Genome Wide ◽

Genome Wide Data

reactIDR: Evaluation of the statistical reproducibility of high-throughput structural analyses for a robust RNA reactivity classification

10.1101/275016 ◽

2018 ◽

Author(s):

Risa Kawaguchi ◽

Hisanori Kiryu ◽

Junichi Iwakiri ◽

Jun Sese

Keyword(s):

Experimental Data ◽

High Throughput ◽

Structure Prediction ◽

Secondary Structure Prediction ◽

Classification Problem ◽

Supplementary Information ◽

Dimensional Structure ◽

Data Generation ◽

Multiple Sources ◽

Stem Loop

Abstract Motivation: Recently, next-generation sequencing techniques have been applied to the detection of RNA secondary structures, an approach called high-throughput RNA structural (HTS) analysis, and dozens of different protocols have been used to detect comprehensive RNA structures at single-nucleotide resolution. However, the existing computational analyses depend heavily on the experimental data generation methodology, which makes it difficult to compare or combine the results obtained using different HTS methods in a statistically sound way. Results: Here, we introduce a statistical framework, reactIDR, which is applicable to the experimental data obtained using multiple HTS methodologies and classifies the nucleotides into three structural categories: stem, loop, and unmapped. reactIDR uses the irreproducible discovery rate (IDR) with a hidden Markov model (HMM) to discriminate accurately between the true and spurious signals obtained in the replicated HTS experiments. In reactIDR, IDR and HMM parameters are efficiently optimized by using an expectation-maximization algorithm. Furthermore, if known reference structures are given, supervised learning can be applied in a semi-supervised manner. The results of our analyses for real HTS data showed that reactIDR achieved the highest accuracy in the classification problem of stem/loop structures of rRNA using both individual and integrated HTS datasets, as well as the best correspondence with the three-dimensional structure. Because reactIDR is the first method to compare HTS datasets obtained from multiple sources in a single unified model, it has great potential to increase the accuracy of RNA secondary structure prediction at the transcriptome-wide level as further experiments are performed. Availability: reactIDR is implemented in Python. Source code is publicly available at https://github.com/carushi/reactIDR. Contact: [email protected] Supplementary information: Supplementary data are available online.

gwasurvivr: an R package for genome wide survival analysis

10.1101/326033 ◽

2018 ◽

Author(s):

Abbas A Rizvi ◽

Ezgi Karaesmen ◽

Martin Morgan ◽

Leah Preus ◽

Junke Wang ◽

...

Keyword(s):

Survival Analysis ◽

Cox Model ◽

R Package ◽

Supplementary Information ◽

Parameter Estimates ◽

Survival Analyses ◽

Link Type ◽

Genome Wide ◽

Size Number ◽

Simple Interface

Abstract Summary: To address the limited software options for performing survival analyses with millions of SNPs, we developed gwasurvivr, an R/Bioconductor package with a simple interface for conducting genome-wide survival analyses using VCF (output from the Michigan or Sanger imputation servers), IMPUTE2 or PLINK files. To decrease the number of iterations needed for convergence when optimizing the parameter estimates in the Cox model, we modified the R package survival; covariates in the model are first fit without the SNP, and those parameter estimates are used as initial points. We benchmarked gwasurvivr against other software capable of conducting genome-wide survival analysis (genipe, SurvivalGWAS_SV, and GWASTools). gwasurvivr is significantly faster and shows better scalability as sample size, number of SNPs and number of covariates increase. Availability and implementation: gwasurvivr, including source code, documentation, and vignette, is available at: http://bioconductor.org/packages/gwasurvivr Contact: Abbas Rizvi, [email protected]; Lara E Sucheston-Campbell, [email protected] Supplementary information: Supplementary data are available at https://github.com/suchestoncampbelllab/gwasurvivr_manuscript
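The warm-start idea described in this abstract can be sketched in plain Python. This is not gwasurvivr's code or the modified survival package: it is a minimal Newton-Raphson fit of the Cox partial likelihood (Breslow form, no tie correction) on simulated data, written to show how the covariate-only estimates can seed the full model. All function names, the simulation settings and the single covariate `age` are assumptions for illustration.

```python
import math
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def cox_newton(times, events, rows, beta0, tol=1e-8, max_iter=50):
    """Maximise the Cox partial likelihood; return (beta, iterations used)."""
    p = len(beta0)
    beta = list(beta0)
    for iteration in range(1, max_iter + 1):
        grad = [0.0] * p
        info = [[0.0] * p for _ in range(p)]
        for i, (t_i, had_event) in enumerate(zip(times, events)):
            if not had_event:
                continue
            s0, s1 = 0.0, [0.0] * p
            s2 = [[0.0] * p for _ in range(p)]
            for t_j, x_j in zip(times, rows):
                if t_j >= t_i:  # subject j is still at risk at time t_i
                    w = math.exp(sum(b * x for b, x in zip(beta, x_j)))
                    s0 += w
                    for a in range(p):
                        s1[a] += w * x_j[a]
                        for c in range(p):
                            s2[a][c] += w * x_j[a] * x_j[c]
            for a in range(p):
                grad[a] += rows[i][a] - s1[a] / s0
                for c in range(p):
                    info[a][c] += s2[a][c] / s0 - (s1[a] / s0) * (s1[c] / s0)
        step = solve(info, grad)  # Newton step: information^-1 * score
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            return beta, iteration
    return beta, max_iter


def simulate(n=150, seed=1):
    """Simulate censored survival data with one covariate and one SNP dosage."""
    rng = random.Random(seed)
    times, events, rows = [], [], []
    for _ in range(n):
        age = rng.gauss(0.0, 1.0)
        snp = (rng.random() < 0.3) + (rng.random() < 0.3)  # dosage 0, 1 or 2
        event_time = rng.expovariate(math.exp(0.5 * age + 0.4 * snp))
        censor_time = rng.expovariate(0.5)
        times.append(min(event_time, censor_time))
        events.append(event_time <= censor_time)
        rows.append([age, float(snp)])
    return times, events, rows


if __name__ == "__main__":
    times, events, rows = simulate()
    # Step 1: fit the covariate-only model once (shared across all SNPs).
    cov_only, _ = cox_newton(times, events, [[r[0]] for r in rows], [0.0])
    # Step 2: fit covariate + SNP, starting from the covariate-only estimate.
    warm, warm_iters = cox_newton(times, events, rows, [cov_only[0], 0.0])
    cold, cold_iters = cox_newton(times, events, rows, [0.0, 0.0])
    print("warm start:", warm, "iterations:", warm_iters)
    print("cold start:", cold, "iterations:", cold_iters)
```

Both starts should reach the same estimate. The potential saving comes from the covariate-only fit being done once and reused for every SNP, so each per-SNP fit may begin closer to its optimum.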

fluff: exploratory analysis and visualization of high-throughput sequencing data

PeerJ ◽

10.7717/peerj.2209 ◽

2016 ◽

Vol 4 ◽

pp. e2209 ◽

Cited By ~ 28

Author(s):

Georgios Georgiou ◽

Simon J. van Heeringen

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Developmental Stages ◽

Command Line ◽

Clustering Methods ◽

Sequencing Data ◽

Link Type ◽

High Throughput Sequencing Data ◽

Genome Wide ◽

Genome Wide Data

Summary. In this article we describe fluff, a software package that allows for simple exploration, clustering and visualization of high-throughput sequencing data mapped to a reference genome. The package contains three command-line tools to generate publication-quality figures in an uncomplicated manner using sensible defaults. Genome-wide data can be aggregated, clustered and visualized in a heatmap, according to different clustering methods. This includes a predefined setting to identify dynamic clusters between different conditions or developmental stages. Alternatively, clustered data can be visualized in a bandplot. Finally, fluff includes a tool to generate genomic profiles. As command-line tools, the fluff programs can easily be integrated into standard analysis pipelines. The installation is straightforward and documentation is available at http://fluff.readthedocs.org. Availability. fluff is implemented in Python and runs on Linux. The source code is freely available for download at https://github.com/simonvh/fluff.

RIdeogram: drawing SVG graphics to visualize and map genome-wide data on the idiograms

PeerJ Computer Science ◽

10.7717/peerj-cs.251 ◽

2020 ◽

Vol 6 ◽

pp. e251 ◽

Cited By ~ 17

Author(s):

Zhaodong Hao ◽

Dekang Lv ◽

Ying Ge ◽

Jisen Shi ◽

Dolf Weijers ◽

...

Keyword(s):

Gc Content ◽

R Package ◽

Whole Genome ◽

Data Mapping ◽

Data Types ◽

Model Species ◽

Chromosomal Distribution ◽

Whole Genome Analysis ◽

Genome Wide ◽

Genome Wide Data

Background Owing to rapid advances in DNA sequencing technologies, whole genomes from more and more species are becoming available at an increasing pace. For whole-genome analysis, idiograms provide a very popular, intuitive and effective way to map and visualize genome-wide information, such as GC content, gene and repeat density, DNA methylation distribution, genomic synteny, etc. However, most available software programs and web servers support only a few model species, such as human, mouse and fly, or have limited application scenarios. As more and more non-model species are sequenced and chromosome-level assemblies become available, tools that can generate idiograms for a broad range of species and visualize more data types are needed to help better understand fundamental genome characteristics. Results The R package RIdeogram allows users to build high-quality idiograms of any species of interest. It can map continuous and discrete genome-wide data on the idiograms and visualize them in a heat map and track labels, respectively. Conclusion The visualization of genome-wide data mapping and comparison allows users to quickly establish a clear impression of the chromosomal distribution pattern, thus making RIdeogram a useful tool for any researchers working with omics.
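As a rough illustration of what mapping continuous data onto an idiogram involves, the following Python sketch writes a bare-bones SVG by hand. It does not use or reproduce RIdeogram (an R package); the function names, the layout constants and the white-to-red colour scale are all assumptions chosen for brevity.

```python
# Minimal hand-written SVG idiogram with a heat map overlay (illustrative only).

def value_to_color(value, vmin, vmax):
    """Map a value linearly onto a white-to-red hex colour."""
    frac = 0.0 if vmax == vmin else (value - vmin) / (vmax - vmin)
    frac = min(1.0, max(0.0, frac))
    level = round(255 * (1.0 - frac))
    return f"#ff{level:02x}{level:02x}"


def draw_idiogram(chrom_lengths, windows, scale=1e-6, width=20, gap=40):
    """Return an SVG string.

    chrom_lengths: dict of chromosome name -> length in bp
    windows: list of (chrom, start_bp, end_bp, value) tuples, e.g. gene density
    scale: pixels per base pair (1e-6 means 1 px per Mb)
    """
    values = [w[3] for w in windows] or [0.0]
    vmin, vmax = min(values), max(values)
    x_of = {name: gap + i * (width + gap) for i, name in enumerate(chrom_lengths)}
    total_w = gap + len(chrom_lengths) * (width + gap)
    total_h = max(chrom_lengths.values()) * scale + 50
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" '
             f'width="{total_w}" height="{total_h:.0f}">']
    # Heat map: one filled rectangle per genomic window.
    for chrom, start, end, value in windows:
        parts.append(
            f'<rect x="{x_of[chrom]}" y="{20 + start * scale:.2f}" '
            f'width="{width}" height="{(end - start) * scale:.2f}" '
            f'fill="{value_to_color(value, vmin, vmax)}"/>')
    # Chromosome outlines with rounded ends, drawn on top, plus labels.
    for name, length in chrom_lengths.items():
        parts.append(
            f'<rect x="{x_of[name]}" y="20" width="{width}" '
            f'height="{length * scale:.2f}" rx="{width / 2}" '
            f'fill="none" stroke="black"/>')
        parts.append(
            f'<text x="{x_of[name] + width / 2}" y="{length * scale + 40:.2f}" '
            f'text-anchor="middle" font-size="10">{name}</text>')
    parts.append("</svg>")
    return "\n".join(parts)


if __name__ == "__main__":
    chroms = {"chr1": 200_000_000, "chr2": 150_000_000}
    density = [("chr1", 0, 50_000_000, 12), ("chr1", 50_000_000, 100_000_000, 30),
               ("chr2", 0, 75_000_000, 5)]
    with open("idiogram.svg", "w") as handle:
        handle.write(draw_idiogram(chroms, density))
```

Discrete data (the "track labels" in the abstract) would need a separate marker layer, which this sketch omits.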

Sequence Analysis Primer

10.1093/oso/9780195098747.001.0001 ◽

1995 ◽

Keyword(s):

Sequence Analysis ◽

Structure Prediction ◽

Secondary Structure Prediction ◽

Protein Secondary Structure ◽

Formal Training ◽

Stem Loop ◽

Data Manipulation ◽

Hands On ◽

New Gene ◽

Fine Tune

Computerized sequence analysis is an integral part of biotechnological research, yet many biologists have received no formal training in this important technology. Sequence Analysis Primer offers the beginner the necessary background to enter this vital field and helps more seasoned researchers to fine-tune their approach. It covers basic data manipulation such as homology searches, stem-loop identification, and protein secondary structure prediction, and is compatible with most sequence analysis programs. A detailed example giving steps for characterizing a new gene sequence provides users with hands-on experience when combined with their current software. The book will be invaluable to researchers and students in molecular biology, genetics, biochemistry, microbiology, and biotechnology.

Correlation AnalyzeR: functional predictions from gene co-expression correlations

BMC Bioinformatics ◽

10.1186/s12859-021-04130-7 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Henry E. Miller ◽

Alexander J. R. Bishop

Keyword(s):

Web Application ◽

Bone Cancer ◽

R Package ◽

Limited Range ◽

Web Interface ◽

Generating Functional ◽

Link Type ◽

Genome Wide ◽

User Friendly

Abstract Background Co-expression correlations provide the ability to predict gene functionality within specific biological contexts, such as different tissue and disease conditions. However, current gene co-expression databases generally do not consider biological context. In addition, these tools often implement a limited range of unsophisticated analysis approaches, diminishing their utility for exploring gene functionality and gene relationships. Furthermore, they typically do not provide the summary visualizations necessary to communicate these results, posing a significant barrier to their utilization by biologists without computational skills. Results We present Correlation AnalyzeR, a user-friendly web interface for exploring co-expression correlations and predicting gene functions, gene–gene relationships, and gene set topology. Correlation AnalyzeR provides flexible access to its database of tissue and disease-specific (cancer vs normal) genome-wide co-expression correlations, and it also implements a suite of sophisticated computational tools for generating functional predictions with user-friendly visualizations. In the usage example provided here, we explore the role of BRCA1-NRF2 interplay in the context of bone cancer, demonstrating how Correlation AnalyzeR can be effectively implemented to generate and support novel hypotheses. Conclusions Correlation AnalyzeR facilitates the exploration of poorly characterized genes and gene relationships to reveal novel biological insights. The database and all analysis methods can be accessed as a web application at https://gccri.bishop-lab.uthscsa.edu/correlation-analyzer/ and as a standalone R package at https://github.com/Bishop-Laboratory/correlationAnalyzeR.

RIdeogram: drawing SVG graphics to visualize and map genome-wide data on the idiograms

10.7287/peerj.preprints.27928v1 ◽

2019 ◽

Cited By ~ 1

Author(s):

Zhaodong Hao ◽

Dekang Lv ◽

Ying Ge ◽

Jisen Shi ◽

Dolf Weijers ◽

...

Keyword(s):

Gc Content ◽

R Package ◽

Whole Genome ◽

Data Mapping ◽

Model Species ◽

Chromosomal Distribution ◽

Whole Genome Analysis ◽

Sequencing Technologies ◽

Genome Wide ◽

Genome Wide Data

Background: Owing to rapid advances in DNA sequencing technologies, whole genomes from more and more species are becoming available at an increasing pace. For whole-genome analysis, idiograms provide a very popular, intuitive and effective way to map and visualize genome-wide information, such as GC content, gene and repeat density, DNA methylation distribution, etc. However, most available software programs and web servers support only a few model species, such as human, mouse and fly. As boundaries between model and non-model species are shifting, tools that generate idiograms for a broad range of species are urgently needed to help better understand fundamental genome characteristics. Results: The R package RIdeogram allows users to build high-quality idiograms of any species of interest. It can map continuous and discrete genome-wide data on the idiograms and visualize them in a heat map and track labels, respectively. Conclusion: The visualization of genome-wide data mapping and comparison allows users to quickly establish a clear impression of the chromosomal distribution pattern, thus making RIdeogram a useful tool for any researchers working with omics.
