ReGSP: a visualized application for homology-based gene searching and plotting using multiple reference sequences

The massively parallel nature of next-generation sequencing technologies has contributed to the generation of massive sequence data in the last two decades. Deciphering the meaning of each generated sequence requires multiple analysis tools, at all stages of analysis, from the reads stage all the way up to the whole-genome level. Homology-based approaches based on related reference sequences are usually the preferred option for gene and transcript prediction in newly sequenced genomes, resulting in the popularity of a variety of BLAST and BLAST-based tools. For organelle genomes, a single-reference–based gene finding tool that uses grouping parameters for BLAST results has been implemented in the Genome Search Plotter (GSP). However, this tool does not accept multiple and user-customized reference sequences required for a broad homology search. Here, we present multiple Reference–based Gene Search and Plot (ReGSP), a simple and convenient web tool that accepts multiple reference sequences for homology-based gene search. The tool incorporates cPlot, a novel dot plot tool, for illustrating nucleotide sequence similarity between the query and the reference sequences. ReGSP has an easy-to-use web interface and is freely accessible at https://ds.mju.ac.kr/regsp.
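The dot plot idea mentioned in this abstract can be illustrated with a minimal sketch. This is not cPlot or ReGSP code; the function names, the exact k-mer matching rule, and the default `k` are assumptions chosen for illustration only. It records the positions where a query and each of several reference sequences share an identical k-mer, which are the points a dot plot would draw.

```python
def kmer_dot_plot(query: str, reference: str, k: int = 4) -> list[tuple[int, int]]:
    """Return (query_pos, reference_pos) pairs where identical k-mers start."""
    # Index every k-mer of the reference by its start positions.
    index: dict[str, list[int]] = {}
    for j in range(len(reference) - k + 1):
        index.setdefault(reference[j:j + k], []).append(j)

    # Look up each query k-mer in the index; each hit is one dot.
    dots: list[tuple[int, int]] = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            dots.append((i, j))
    return dots


def dot_plots_for_references(
    query: str, references: dict[str, str], k: int = 4
) -> dict[str, list[tuple[int, int]]]:
    """Compute one set of dots per named reference sequence."""
    return {name: kmer_dot_plot(query, seq, k) for name, seq in references.items()}


if __name__ == "__main__":
    # Hypothetical toy sequences, not real data.
    refs = {"ref1": "TTACGTAA", "ref2": "GGGG"}
    print(dot_plots_for_references("ACGTAC", refs, k=4))
```

Runs of dots along a diagonal indicate a stretch of similarity between the query and that reference; a homology search tool would normally rely on BLAST-style alignment rather than exact k-mer matches.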


On the optimal trimming of high-throughput mRNA sequence data

10.1101/000422 ◽

2013 ◽

Cited By ~ 1

Author(s):

Matthew D. MacManes

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Sequence Data ◽

Deep Understanding ◽

Sequencing Technologies ◽

Software Packages ◽

Sequencing Quality ◽

Functional Biology ◽

Genome Level ◽

Optimal Strength

The widespread and rapid adoption of high-throughput sequencing technologies has afforded researchers the opportunity to gain a deep understanding of genome level processes that underlie evolutionary change, and perhaps more importantly, the links between genotype and phenotype. In particular, researchers interested in functional biology and adaptation have used these technologies to sequence mRNA transcriptomes of specific tissues, which in turn are often compared to other tissues, or other individuals with different phenotypes. While these techniques are extremely powerful, careful attention to data quality is required. In particular, because high-throughput sequencing is more error-prone than traditional Sanger sequencing, quality trimming of sequence reads should be an important step in all data processing pipelines. While several software packages for quality trimming exist, no general guidelines for the specifics of trimming have been developed. Here, using empirically derived sequence data, I provide general recommendations regarding the optimal strength of trimming, specifically in mRNA-Seq studies. Although very aggressive quality trimming is common, this study suggests that gentler trimming, specifically of those nucleotides whose Phred score is <2 or <5, is optimal for most studies across a wide variety of metrics.
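As a rough illustration of what gentle quality trimming means in practice, the sketch below removes bases from both ends of a read while their Phred score is below a threshold. It is not the software evaluated in the study; the function names and the end-trimming rule are assumptions, and the quality string is assumed to use the common Phred+33 ASCII encoding.

```python
def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Decode an ASCII quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def trim_ends(seq: str, quality: str, threshold: int = 5, offset: int = 33) -> tuple[str, str]:
    """Trim bases from both read ends while their Phred score is below threshold."""
    scores = phred_scores(quality, offset)
    start, end = 0, len(seq)
    # Advance past low-quality bases at the 5' end.
    while start < end and scores[start] < threshold:
        start += 1
    # Step back over low-quality bases at the 3' end.
    while end > start and scores[end - 1] < threshold:
        end -= 1
    return seq[start:end], quality[start:end]


if __name__ == "__main__":
    # Hypothetical read: '#' encodes Phred 2, '$' Phred 3, 'I' Phred 40.
    print(trim_ends("ACGTACGT", "#IIIII$#", threshold=5))
    print(trim_ends("ACGTACGT", "#IIIII$#", threshold=2))
```

With a threshold of 5 the example read loses one base at the start and two at the end; with a threshold of 2 it is left intact, because no base scores below 2. Dedicated trimming tools typically apply more elaborate rules than this end-only scan.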


Predicting bacteriophage hosts based on sequences of annotated receptor-binding proteins

Scientific Reports ◽

10.1038/s41598-021-81063-4 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Dimitri Boeckaerts ◽

Michiel Stock ◽

Bjorn Criel ◽

Hans Gerstmans ◽

Bernard De Baets ◽

...

Keyword(s):

Machine Learning ◽

Predictive Model ◽

Receptor Binding ◽

Bacterial Infections ◽

Sequence Data ◽

Sequence Similarity ◽

Area Under The Curve ◽

Local Alignment ◽

Search Tool ◽

Different Levels

Nowadays, bacteriophages are increasingly considered as an alternative treatment for a variety of bacterial infections in cases where classical antibiotics have become ineffective. However, characterizing the host specificity of phages remains a labor- and time-intensive process. In order to alleviate this burden, we have developed a new machine-learning-based pipeline to predict bacteriophage hosts based on annotated receptor-binding protein (RBP) sequence data. We focus on predicting bacterial hosts from the ESKAPE group, Escherichia coli, Salmonella enterica and Clostridium difficile. We compare the performance of our predictive model with that of the widely used Basic Local Alignment Search Tool (BLAST). Our best-performing predictive model reaches Precision-Recall Area Under the Curve (PR-AUC) scores between 73.6 and 93.8% for different levels of sequence similarity in the collected data. Our model reaches a performance comparable to that of BLASTp when sequence similarity in the data is high and starts outperforming BLASTp when sequence similarity drops below 75%. Therefore, our machine learning methods can be especially useful in settings in which sequence similarity to other known sequences is low. Predicting the hosts of novel metagenomic RBP sequences could extend our toolbox to tune the host spectrum of phages or phage tail-like bacteriocins by swapping RBPs.
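The PR-AUC metric reported above can be approximated in several ways. The abstract does not say which estimator the authors used, so the sketch below shows one common choice, average precision, for a single binary host label. The function name and the toy labels and scores are hypothetical.

```python
def average_precision(labels: list[int], scores: list[float]) -> float:
    """Estimate the area under the precision-recall curve as average precision.

    labels: 1 for a true host, 0 otherwise; scores: model confidence per sample.
    Tied scores are ranked in input order (a simplification).
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one positive label is required")

    # Rank samples from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    true_pos = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            true_pos += 1
            # Precision at the rank where this positive is retrieved.
            precision_sum += true_pos / rank
    return precision_sum / total_pos


if __name__ == "__main__":
    # Hypothetical predictions for four phage-host pairs.
    print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```

A perfect ranking, with every true host scored above every non-host, gives a value of 1.0; the metric suits imbalanced data because it ignores true negatives.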


DNA Methylation in Solid Tumors: Functions and Methods of Detection

International Journal of Molecular Sciences ◽

10.3390/ijms22084247 ◽

2021 ◽

Vol 22 (8) ◽

pp. 4247

Author(s):

Andrea Martisova ◽

Jitka Holcakova ◽

Nasim Izadi ◽

Ravery Sebuyoya ◽

Roman Hrstka ◽

...

Keyword(s):

Dna Methylation ◽

Solid Tumors ◽

Epigenetic Modification ◽

Single Gene ◽

Restriction Enzymes ◽

Methylation Analysis ◽

Cpg Dinucleotides ◽

Sequencing Technologies ◽

Genome Level ◽

Dna Methylation Analysis

DNA methylation, i.e., the addition of a methyl group to the 5-carbon of cytosine residues in CpG dinucleotides, is an important epigenetic modification regulating gene expression, and is thus implicated in many cellular processes. Deregulation of DNA methylation is strongly associated with the onset of various diseases, including cancer. Here, we review how DNA methylation affects carcinogenesis and give examples of solid tumors where aberrant DNA methylation is often present. We explain the principles of methods developed for DNA methylation analysis at both the single gene and whole genome level, based on (i) sodium bisulfite conversion, (ii) methylation-sensitive restriction enzymes, and (iii) interactions of 5-methylcytosine (5mC) with methyl-binding proteins or antibodies against 5mC. In addition to standard methods, we describe recent advances in next generation sequencing technologies applied to DNA methylation analysis, as well as in the development of biosensors that represent cheaper and faster alternatives. Most importantly, we highlight not only the advantages, but also the disadvantages and challenges of each method.


Sequencing and Computational Approaches to Identification and Characterization of Microbial Organisms

Biomedical Engineering and Computational Biology ◽

10.4137/becb.s10886 ◽

2013 ◽

Vol 5 ◽

pp. BECB.S10886 ◽

Cited By ~ 2

Author(s):

Brijesh Singh Yadav ◽

Venkateswarlu Ronda ◽

Dinesh P. Vashista ◽

Bhaskar Sharma

Keyword(s):

Sequence Data ◽

Microbial Interactions ◽

Microbial Pathogens ◽

Nucleotide Sequence Data ◽

Computational Approaches ◽

Microbial Detection ◽

Sequencing Technologies ◽

Sequencing Platforms ◽

Identification And Characterization

Recent advances in sequencing technologies and computational approaches are propelling scientists ever closer towards a complete understanding of human-microbial interactions. Powerful sequencing platforms are rapidly producing huge amounts of nucleotide sequence data, which are compiled into huge databases. These sequence data can be retrieved, assembled, and analyzed for the identification of microbial pathogens and the diagnosis of diseases. In this article, we present a commentary on how metagenomics, combined with microarray and new sequencing techniques, is helping microbial detection and characterization.


148 Multiple Dysregulated Novel Pathways and Genes in Aleutian Mink Disease Revealed by Selection Signatures and Gene Network Analyses Using Whole-genome Sequence Data

Journal of Animal Science ◽

10.1093/jas/skab235.137 ◽

2021 ◽

Vol 99 (Supplement_3) ◽

pp. 76-76

Author(s):

Seyed Milad Vahedi ◽

Karim Karimi ◽

Siavash Salek Ardestani ◽

Younes Miar

Keyword(s):

Sequence Data ◽

American Mink ◽

Enrichment Analysis ◽

Whole Genome Sequence ◽

Fixation Index ◽

Pathway Enrichment Analysis ◽

Whole Genome ◽

Nucleotide Polymorphisms ◽

Network Analyses ◽

Genome Level

Aleutian disease (AD) is a chronic persistent infection in domestic mink caused by Aleutian mink disease virus (AMDV). Reduced female fertility and pelt quality are the main reasons for AD’s negative economic impact on the mink industry. A total of 79 American mink from the Canadian Center for Fur Animal Research at Dalhousie University (Truro, NS, Canada) were classified, based on the results of counter immunoelectrophoresis (CIEP) tests, into positive (n = 48) and negative (n = 31) groups. Whole-genome sequences comprising 4,176 scaffolds and 8,039,737 single nucleotide polymorphisms (SNPs) were used to trace the selection footprints for response to AMDV infection at the genome level. Window-based fixation index (Fst) and nucleotide diversity (θπ) statistics were estimated to compare the genomes of positive and negative animals. Genomic windows falling in the top 1% for both statistics were considered potential regions under selection pressure. A total of 98 genomic regions harboring 33 candidate genes were detected as selective signals. Most of the identified genes were involved in the development and functions of the immune system (PPP3CA, SMAP2, TNFRSF21, SKIL, and AKIRIN2), musculoskeletal system (COL9A2, PPP1R9A, ANK2, AKAP9, and STRIT1), nervous system (ASCL1, ZFP69B, SLC25A27, MCF2, and SLC7A14), reproductive system (CAMK2D, GJB7, SSMEM1, C6orf163), liver (PAH and DPYD), and lung (SLC35A1). Gene-expression network analysis showed interactions among 27 of the identified genes. Moreover, pathway enrichment analysis of the constructed gene network revealed significant oxytocin (KEGG: hsa04921) and GnRH signaling (KEGG: hsa04912) pathways, which are likely to be impaired by AMDV, leading to reduced fecundity in dams. These results provide a perspective on the genetic architecture of the response to AD in American mink and novel insight into the pathogenesis of AMDV.
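The window-selection step described above, keeping windows that rank in the top 1% for both statistics, can be sketched as a set intersection. This is an illustrative assumption about the procedure, not the authors' code: the function names are hypothetical, and each window is assumed to carry a single score per statistic in which larger values indicate stronger signal (the abstract does not say how θπ is combined between groups).

```python
def top_fraction(values: dict[str, float], fraction: float = 0.01) -> set[str]:
    """Return the window IDs whose values fall in the top `fraction`."""
    n_top = max(1, round(len(values) * fraction))
    ranked = sorted(values, key=values.get, reverse=True)
    return set(ranked[:n_top])


def candidate_windows(
    fst: dict[str, float], pi_stat: dict[str, float], fraction: float = 0.01
) -> set[str]:
    """Windows that are top-ranked for both the Fst and the diversity statistic."""
    return top_fraction(fst, fraction) & top_fraction(pi_stat, fraction)


if __name__ == "__main__":
    # Hypothetical scores for 200 windows; w199 is extreme for both statistics.
    fst = {f"w{i}": float(i) for i in range(200)}
    pi_stat = dict(fst)
    pi_stat["w5"] = 999.0
    print(candidate_windows(fst, pi_stat))
```

Requiring agreement between two statistics is a conservative filter: a window that is extreme for only one of them, such as `w5` or `w198` in the toy data, is not reported.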


A New Paralog Removal Pipeline Resolves Conflict between RAD-seq and Enrichment

10.1101/2020.10.26.355248 ◽

2020 ◽

Author(s):

Wenbin Zhou ◽

John Soghigian ◽

Qiu-yun (Jenny) Xiang

Keyword(s):

High Throughput Sequencing ◽

Sequence Similarity ◽

Phylogenetic Analyses ◽

Disjunct Distribution ◽

Divergence Times ◽

Target Enrichment ◽

Sequencing Technologies ◽

Duplication Events ◽

The Witch ◽

Phylogenomic Analyses

Target enrichment and RAD-seq are well-established high throughput sequencing technologies that have been increasingly used for phylogenomic studies, and the choice between methods is a practical issue for plant systematists studying the evolutionary histories of biodiversity of relatively recent origins. However, few studies have compared the congruence and conflict between results from the two methods within the same group of organisms, especially in plants, where extensive genome duplication events may complicate phylogenomic analyses. Unfortunately, currently widely used pipelines for target enrichment data analysis do not have a rigorous procedure for removing paralogs in Hyb-Seq data. In this study, we employed RAD-seq and Hyb-Seq of Angiosperm 353 genes in phylogenomic and biogeographic studies of Hamamelis (the witch-hazels) and Castanea (chestnuts), two classic examples exhibiting the well-known eastern Asian-eastern North American disjunct distribution. We compared these two methods side by side and developed a new pipeline (PPD) with a more rigorous removal of putative paralogs from Hyb-Seq data. The new pipeline considers both sequence similarity and heterozygous sites at each locus in the identification of paralogs. We used our pipeline to construct robust datasets for comparison between methods and downstream analyses on the two genera. Our results demonstrated that PPD identified many more putative paralogs than the popular method HybPiper. Comparisons of tree topologies and divergence times showed significant differences between data from HybPiper and data from our new PPD pipeline, likely due to the error signals from the paralogous genes undetected by HybPiper, but trimmed by PPD. We found that phylogenies and divergence times estimated from our RAD-seq and Hyb-Seq-PPD were largely congruent. We highlight the importance of removing paralogs in enrichment data, and discuss the merits of RAD-seq and Hyb-Seq. Finally, phylogenetic analyses of RAD-seq and Hyb-Seq resulted in well-resolved species relationships, and revealed ancient introgression in both genera. Biogeographic analyses including fossil data revealed a complicated history of each genus involving multiple intercontinental dispersals and local extinctions in areas outside of the taxa’s modern ranges in both the Paleogene and Neogene. Our study demonstrates the value of additional steps for filtering paralogous gene content from Angiosperm 353 data, such as our new PPD pipeline described in this study. [RAD-seq, Hyb-Seq, paralogs, Castanea, Hamamelis, eastern Asia-eastern North America disjunction, biogeography, ancient introgression]


Within-Arctic horizontal gene transfer as a driver of convergent evolution in distantly related microalgae

10.1101/2021.07.31.454568 ◽

2021 ◽

Author(s):

Richard G Dorrell ◽

Alan Kuo ◽

Zoltan Fussy ◽

Elisabeth H Richardson ◽

Asaf Salamov ◽

...

Keyword(s):

Sequence Data ◽

Sequence Similarity ◽

Algal Species ◽

Gene Families ◽

The Arctic ◽

Arctic Water ◽

Diverse Range ◽

Binding Domains ◽

Ice Binding ◽

Ice Conditions

The Arctic Ocean is being impacted by warming temperatures, increasing freshwater and highly variable ice conditions. The microalgal communities underpinning Arctic marine food webs, once thought to be dominated by diatoms, include a phylogenetically diverse range of small algal species, whose biology remains poorly understood. Here, we present genome sequences of a cryptomonad, a haptophyte, a chrysophyte, and a pelagophyte, isolated from the Arctic water column and ice. Comparing protein family distributions and sequence similarity across a densely-sampled set of algal genomes and transcriptomes, we note striking convergences in the biology of distantly related small Arctic algae, compared to non-Arctic relatives, although this convergence is largely exclusive of Arctic diatoms. Using high-throughput phylogenetic approaches, incorporating environmental sequence data from Tara Oceans, we demonstrate that this convergence is partly explained by horizontal gene transfers (HGT) between Arctic species, most notably in ice-binding domains (IBD) and in at least 30 other discrete gene families. These Arctic-specific genes have been repeatedly transferred between Arctic algae, and are independent of equivalent HGTs in the Antarctic Southern Ocean. Our data provide insights into the specialised Arctic marine microbiome, and underline the role of geographically-limited HGT as a driver of environmental adaptation in eukaryotic algae.


Mining the capacity of human-associated microorganisms to trigger rheumatoid arthritis—A systematic immunoinformatics analysis of T cell epitopes

PLoS ONE ◽

10.1371/journal.pone.0253918 ◽

2021 ◽

Vol 16 (6) ◽

pp. e0253918

Author(s):

Jelena Repac ◽

Marija Mandić ◽

Tanja Lunić ◽

Bojan Božić ◽

Biljana Božić Nedeljković

Keyword(s):

Rheumatoid Arthritis ◽

T Cell ◽

De Novo ◽

Molecular Mimicry ◽

Sequence Similarity ◽

Cell Epitope ◽

Homology Search ◽

T Cell Epitope ◽

T Cell Epitopes ◽

Painful Condition

Autoimmune diseases, often triggered by infection, affect ~5% of the worldwide population. Rheumatoid Arthritis (RA), a painful condition characterized by the chronic inflammation of joints, comprises up to 20% of known autoimmune pathologies, with a tendency towards increasing prevalence. Molecular mimicry is recognized as the leading mechanism underlying infection-mediated autoimmunity; it assumes that sequence similarity between microbial and self-peptides drives the activation of autoreactive lymphocytes. T lymphocytes are the leading immune cells in RA development. Therefore, a deeper understanding of the capacity of microorganisms (both pathogens and commensals) to trigger autoreactive T cells is needed, calling for more systematic approaches. In the present study, we address this problem through a comprehensive immunoinformatics analysis of experimentally determined RA-related T cell epitopes against the proteomes of Bacteria, Fungi, and Viruses, to identify the scope of organisms providing homologous antigenic peptide determinants. The initial homology screening was complemented with de novo T cell epitope prediction and another round of homology search, to enable: i) the confirmation of homologous microbial peptides as T cell epitopes based on the predicted binding affinity to RA-related HLA polymorphisms; ii) sequence similarity inference for top de novo T cell epitope predictions to the RA-related autoantigens to reveal the robustness of RA-triggering capacity for identified (micro/myco)organisms. Our study reveals a much larger repertoire of candidate RA-triggering organisms than previously recognized, providing insights into the underestimated role of Fungi in autoimmunity and the possibility of a more direct involvement of bacterial commensals in RA pathology. Finally, our study pinpoints Endoplasmic reticulum chaperone BiP as the most potent (most likely mimicked) RA-related autoantigen, opening an avenue for identifying the most potent autoantigens in a variety of different autoimmune pathologies, with possible implications in the design of next-generation therapeutics aiming to induce self-tolerance by affecting highly reactive autoantigens.


Bridging the TB data gap: in silico extraction of rifampicin-resistant tuberculosis diagnostic test results from whole genome sequence data

10.1101/628099 ◽

2019 ◽

Author(s):

Kamela Charmaine S. Ng ◽

Jean Claude S. Ngabonziza ◽

Pauline Lempens ◽

Bouke Catherine de Jong ◽

Frank van Leth ◽

...

Keyword(s):

Sequence Data ◽

Whole Genome Sequence ◽

Whole Genome ◽

Data Generation ◽

Continuous Analysis ◽

Middle Income ◽

Tb Control ◽

Sequencing Technologies ◽

Surveillance Programs ◽

Low And Middle Income

Background: Mycobacterium tuberculosis rapid diagnostic tests (RDTs) are widely employed in routine laboratories and national surveys for detection of rifampicin-resistant (RR)-TB. However, as next generation sequencing technologies have become more commonplace in research and surveillance programs, RDTs are being increasingly complemented by whole genome sequencing (WGS). While comparison between RDTs is difficult, all RDT results can be derived from WGS data. This can facilitate continuous analysis of RR-TB burden regardless of the data generation technology employed. By converting WGS to RDT results, we enable comparison of data with different formats and sources, particularly for low and middle income high TB burden countries that employ different diagnostic algorithms for drug resistance surveys. This allows national TB control programs (NTPs) and epidemiologists to utilize all available data in the setting for improved RR-TB surveillance. Methods: We developed the Python-based MTB Genome to Test (MTBGT) tool that transforms WGS-derived data into laboratory-validated results of the primary RDTs: Xpert MTB/RIF, Xpert MTB/RIF Ultra, GenoType MDRTBplus v2.0, and Genoscholar NTM+MDRTB II. The tool was validated through RDT results of RR-TB strains with diverse resistance patterns and geographic origins and applied on routine-derived WGS data. Results: The MTBGT tool correctly transformed the SNP data into the RDT results and generated tabulated frequencies of the RDT probes as well as rifampicin susceptible cases. The tool supplemented the RDT probe reactions output with the RR-conferring mutation based on identified SNPs. The MTBGT tool facilitated continuous analysis of RR-TB and Xpert probe reactions from different platforms and collection periods in Rwanda. Conclusion: Overall, the MTBGT tool allows low and middle income countries to make sense of the increasingly generated WGS data in light of the readily available RDT results, and to assess whether currently implemented RDTs adequately detect RR-TB in their setting. With its feature to transform WGS to RDT results and facilitate continuous RR-TB data analysis, the MTBGT tool may bridge the gap between and among data from periodic surveys, continuous surveillance, research, and routine tests, and may be integrated within the existing national connectivity platform for use by the NTP and epidemiologists to improve setting-specific RR-TB control. The MTBGT source code and accompanying documentation are available at https://github.com/KamelaNg/MTBGT.


VGEA: an RNA viral assembly toolkit

PeerJ ◽

10.7717/peerj.12129 ◽

2021 ◽

Vol 9 ◽

pp. e12129

Author(s):

Paul E. Oluniyi ◽

Fehintola Ajogbasile ◽

Judith Oguzie ◽

Jessica Uwanibe ◽

Adeyemi Kayode ◽

...

Keyword(s):

De Novo ◽

Sequence Data ◽

Workflow Management ◽

Viral Population ◽

Lassa Virus ◽

Viral Genomes ◽

Bioinformatics Tools ◽

Reference Sequences ◽

Genome Assemblies

Next generation sequencing (NGS)-based studies have vastly increased our understanding of viral diversity. Viral sequence data obtained from NGS experiments are a rich source of information; these data can be used to study viral epidemiology, evolution, and transmission patterns, and can also inform drug and vaccine design. Viral genomes, however, represent a great challenge to bioinformatics due to their high mutation rate and their tendency to form quasispecies in the same infected host, bringing about the need to implement advanced bioinformatics tools to assemble consensus genomes well-representative of the viral population circulating in individual patients. Many tools have been developed to preprocess sequencing reads, carry out de novo or reference-assisted assembly of viral genomes and assess the quality of the genomes obtained. Most of these tools, however, exist as standalone workflows and usually require huge computational resources. Here we present VGEA (Viral Genomes Easily Analyzed), a Snakemake workflow for analyzing RNA viral genomes. VGEA enables users to map sequencing reads to the human genome to remove human contaminants, split bam files into forward and reverse reads, carry out de novo assembly of forward and reverse reads to generate contigs, pre-process reads for quality and contamination, map reads to a reference tailored to the sample using corrected contigs supplemented by the user’s choice of reference sequences and evaluate/compare genome assemblies. We designed a project with the aim of creating a flexible, easy-to-use and all-in-one pipeline from existing/stand-alone bioinformatics tools for viral genome analysis that can be deployed on a personal computer. 
VGEA was built on the Snakemake workflow management system and utilizes existing tools for each step: fastp (Chen et al., 2018) for read trimming and read-level quality control, BWA (Li & Durbin, 2009) for mapping sequencing reads to the human reference genome, SAMtools (Li et al., 2009) for extracting unmapped reads and also for splitting bam files into fastq files, IVA (Hunt et al., 2015) for de novo assembly to generate contigs, shiver (Wymant et al., 2018) to pre-process reads for quality and contamination, then map to a reference tailored to the sample using corrected contigs supplemented with the user’s choice of existing reference sequences, SeqKit (Shen et al., 2016) for cleaning shiver assembly for QUAST, QUAST (Gurevich et al., 2013) to evaluate/assess the quality of genome assemblies and MultiQC (Ewels et al., 2016) for aggregation of the results from fastp, BWA and QUAST. Our pipeline was successfully tested and validated with SARS-CoV-2 (n = 20), HIV-1 (n = 20) and Lassa Virus (n = 20) datasets all of which have been made publicly available. VGEA is freely available on GitHub at: https://github.com/pauloluniyi/VGEA under the GNU General Public License.
