alignment score
Recently Published Documents


TOTAL DOCUMENTS

26
(FIVE YEARS 1)

H-INDEX

8
(FIVE YEARS 0)

2021 ◽  
Author(s):  
Kaitlin M. Carey ◽  
Robert Hubley ◽  
George T. Lesica ◽  
Daniel Olson ◽  
Jack W. Roddy ◽  
...  

Abstract: Annotation of a biological sequence is usually performed by aligning that sequence to a database of known sequence elements. When that database contains elements that are highly similar to each other, the proper annotation may be ambiguous, because several entries in the database produce high-scoring alignments. Typical annotation methods assign a label based on the candidate annotation with the highest alignment score; this can overstate annotation certainty, mislabel boundaries, and fail to identify large-scale rearrangements or insertions within the annotated sequence. Here, we present a new software tool, PolyA, that adjudicates between competing alignment-based annotations by computing estimates of annotation confidence, identifying a trace with maximal confidence, and recursively splicing/stitching inserted elements. PolyA communicates annotation certainty, identifies large-scale rearrangements, and detects boundaries between neighboring elements.



2020 ◽  
Author(s):  
Wenfa Ng

Abstract: Understanding how one sequence relates to another at the nucleotide or amino acid level allows the derivation of new knowledge regarding the provenance of a particular sequence, as well as the determination of consensus sequence motifs that inform biological conservation at the sequence level. To this end, local and multiple sequence alignment tools in bioinformatics have been developed to automatically compare two or more nucleotide or amino acid sequences in search of matching stretches that yield an alignment. While alignment score is a common metric for assessing alignment quality, the relative difference between alignment scores does not readily correlate with concrete measures such as the number of mismatches and the length of the longest match in the alignment. Thus, using the swalign local sequence alignment function in MATLAB on 200 alignments between RNA-seq sequence reads and the reference Escherichia coli K-12 MG1655 genome sequence in the sense and antisense directions, this work sought to shed some light on how the alignment score from swalign correlates with the number of mismatches and the length of the longest match. Results revealed that the number of mismatches negatively correlates with alignment score, thereby validating theoretical predictions that a larger number of mismatches would result in a poorer alignment and a lower alignment score. However, the dependence of alignment score on other factors, such as the length of the longest match and the gap penalty from opening an alignment gap, prevents a linear relationship from being obtained between number of mismatches and alignment score. On the other hand, the length of the longest match was found to positively correlate with alignment score, as predicted from theoretical understanding. But the data revealed that data points cluster in two regions of the scatter plot: short matches with low alignment scores, and long matches with high alignment scores.
Such clustering, and the sparseness of data points between the two clusters, precludes the elucidation of a linear quantitative relationship between length of longest match and alignment score. Overall, the dependence of the swalign alignment score on the number of mismatches and the length of the longest match agrees with theoretical predictions, thereby validating the utility of alignment score as a qualitative indicator of alignment quality. However, given that alignment score inherently depends on a multitude of factors, users cannot easily discern the quantitative difference in mismatches and length of longest match from relative differences between two alignment scores. Such problems are unlikely to be resolved, given the near impossibility of obtaining a quantitative linear relationship correlating either number of mismatches or length of longest match with the alignment score of a sequence alignment tool.

Highlights:
- Number of mismatches in alignment negatively correlates with alignment score.
- Length of longest match positively correlates with alignment score.
- A quantitative linear relationship could not be obtained for alignment score with either number of mismatches or length of longest match.
- Results validate that the swalign tool in MATLAB can detect differences in alignment quality and express them using alignment score.
- But the relative alignment score of two alignments remains a nebulous concept with regard to differences in number of mismatches and length of longest match.
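To make the relationship between mismatches and local alignment score concrete, here is a minimal Python sketch of Smith-Waterman scoring. It is not the MATLAB swalign function used in the study: the match, mismatch, and gap values are assumed for illustration, and a simple linear gap penalty stands in for whatever scoring matrix and gap-opening penalty swalign applies.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Scoring values are illustrative assumptions, not swalign defaults.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the diagonal with a match or a mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


identical = smith_waterman_score("ACGT", "ACGT")     # four matches
one_mismatch = smith_waterman_score("ACGT", "ACCT")  # three matches, one mismatch
unrelated = smith_waterman_score("AAAA", "TTTT")     # nothing aligns
```

Under these assumed values a single mismatch lowers the score, but by an amount that depends on the scoring parameters rather than on the mismatch count alone, which is consistent with the abstract's point that score differences are hard to read quantitatively.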



2020 ◽  
Author(s):  
Akshay Yadav ◽  
David Fernández-Baca ◽  
Steven B. Cannon

Abstract: Gene families are groups of genes that have descended from a common ancestral gene present in the species under study. Current, widely used gene family building algorithms can produce family clusters that may be fragmented or missing true family sequences (under-clustering). Here we present a classification method based on sequence pairs that first inspects given families for under-clustering and then predicts the missing sequences for the families using family-specific alignment score cutoffs. We have tested this method on a set of curated, gold-standard (“true”) families from the Yeast Gene Order Browser (YGOB) database, including 20 yeast species, as well as a test set of intentionally under-clustered (“deficient”) families derived from the YGOB families. For 83% of the modified yeast families, our pair-classification method was able to reliably detect under-clustering in “deficient” families that were missing 20% of sequences relative to the full (“true”) families. We also attempted to predict back the missing sequences using the family-specific alignment score cutoffs obtained during the detection phase. In the case of “pure” under-clustered families (under-clustered families with no “wrong”/unrelated sequences), the prediction precision and recall were ≥0.75 for 78% of families, with mean precision = 0.928 and mean recall = 0.859. For “impure” under-clustered families (under-clustered families containing closest sequences from outside the family, in addition to missing true family sequences), the prediction precision and recall were ≥0.75 for 63% of families, with mean precision = 0.790 and mean recall = 0.869. To check whether our method can detect and correct incomplete families obtained using existing family building methods, we attempted to correct 374 under-clustered yeast families produced using the OrthoFinder tool. We were able to predict missing sequences for at least 19 yeast families with mean precision of 0.9 and mean recall of 0.65.
We also analyzed 14,663 legume families built using the OrthoFinder program, with 14 legume species. We were able to identify 1,665 OrthoFinder families that were missing one or more sequences - sequences which were previously un-clustered or clustered into unusually small families. Further, using a simple merging strategy, we were able to merge 2,216 small families into 933 under-clustered families using the predicted missing sequences. Out of the 933 merged families, we could confirm correct mergings in at least 534 families using the maximum-likelihood phylogenies of the merged families. We also provide recommendations on different types of family-specific alignment score cutoffs that can be used for predicting the missing sequences based on the “purity” of under-clustered families and the chosen precision and recall for prediction. Finally, we provide the containerized version of the pair-classification method that can be applied on any given set of gene families.
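The idea of predicting missing family members with an alignment score cutoff, and then measuring precision and recall, can be sketched as follows. This is only an illustration under assumed inputs: the function names, the sequence identifiers, the scores, and the cutoff of 90 are all hypothetical, and the sketch does not reproduce how the authors derive family-specific cutoffs.

```python
def predict_missing(candidate_scores, cutoff):
    """Return candidate sequence IDs whose alignment score meets the cutoff."""
    return {seq for seq, score in candidate_scores.items() if score >= cutoff}


def precision_recall(predicted, truth):
    """Compute precision and recall of a predicted set against the true set."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical scores of non-member sequences against one family.
scores = {"s1": 120, "s2": 95, "s3": 40, "s4": 88}
truly_missing = {"s1", "s2", "s4"}

predicted = predict_missing(scores, cutoff=90)
precision, recall = precision_recall(predicted, truly_missing)
```

Raising the cutoff tends to favor precision and lowering it tends to favor recall, which is the trade-off behind the abstract's recommendations on choosing cutoffs.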



2019 ◽  
Author(s):  
Daan R. Speth ◽  
Victoria J. Orphan

Abstract: Rapid advances in sequencing technology have resulted in the availability of genomes from organisms across the tree of life. Accurately interpreting the function of proteins in these genomes is a major challenge, as annotation transfer based on homology frequently results in misannotation and error propagation. This challenge is especially pressing for organisms whose genomes are directly obtained from environmental samples, as interpretation of their physiology and ecology is often based solely on the genome sequence. For complex protein (super)families containing a large number of sequences, classification can be used to determine whether annotation transfer is appropriate, or whether experimental evidence for function is lacking. Here we present a novel computational approach for de novo classification of large protein (super)families, based on clustering an alignment score matrix obtained by aligning all sequences in the family to a small subset of the data. We evaluate our approach on the enolase family in the Structure Function Linkage Database.

Availability and implementation: ASM-Clust is implemented in bash with helper scripts in perl. Scripts comprising ASM-Clust are available for download from https://github.com/dspeth/bioinfo_scripts/tree/master/ASM_clust/



2019 ◽  
Author(s):  
Ronald P. Hart

Single-cell RNA sequencing (scRNAseq) is a robust technology for parsing gene expression in individual cells from a tissue or other complex source. One application involves experiments where cells from multiple species are recovered from a single sample, such as when human cells are transplanted into an animal model. We transplanted microglial precursor cells into newborn mouse brain and then recovered unenriched cortical tissue six months later. Dissociated cells were assessed by scRNAseq. The default method for analyzing these results begins by aligning sequencing reads with a mixture of both mouse and human reference genomes. While this clearly identifies the human cells as a distinct cluster, the clustering is artificially driven by expression from non-comparable gene identifiers from different species. We devised a method for translating expression counts from human to mouse and evaluated four algorithms for parsing mixed-species scRNAseq data. Our optimal approach split raw sequencing reads according to the best alignment score in each genome, and then re-aligned reads only with the appropriate genome. After gene symbol translation, pooled results indicate that cell types are more appropriately clustered and that differential expression analysis identifies species-specific patterns. This method should be applicable to any mixed-species scRNAseq experiment.

Summary of optimal strategy:
- Mixed-species scRNAseq data are aligned with a mixture of mouse and human reference genomes.
- The BAM file is scanned to find the best alignment score for each sequencing read identifier; these are used to split the paired FASTQ files into two sets of files.
- Each set of species-specific, paired FASTQ files is re-aligned with only the appropriate reference genome.
- Raw counts are imported into Seurat.
- The human counts table is translated to mouse gene symbols using a custom HomoloGene translation table.
- Results are merged and analyzed.
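The read-splitting step of the strategy summarized above can be illustrated with a small Python sketch. It assumes the alignments have already been reduced to (read identifier, genome, alignment score) tuples; it does not parse a BAM file, and the function name, the example reads, and the tie rule (the first genome seen wins) are assumptions rather than details taken from the source.

```python
def assign_reads(alignments):
    """Map each read identifier to the genome with its best alignment score.

    `alignments` is an iterable of (read_id, genome, score) tuples.
    On a tied score the genome seen first is kept (an assumed rule).
    """
    best = {}
    for read_id, genome, score in alignments:
        if read_id not in best or score > best[read_id][1]:
            best[read_id] = (genome, score)
    return {read_id: genome for read_id, (genome, _) in best.items()}


# Hypothetical records, as if extracted from a mixed-genome alignment.
records = [
    ("r1", "human", -3),
    ("r1", "mouse", -10),
    ("r2", "mouse", 0),
    ("r2", "human", -5),
]
assignment = assign_reads(records)
```

The resulting mapping could then drive the split of the paired FASTQ files into per-species sets before re-alignment.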



2018 ◽  
Vol 45 (7) ◽  
pp. 2898-2911 ◽  
Author(s):  
Catriona Hargrave ◽  
Timothy Deegan ◽  
Michael Poulsen ◽  
Tomasz Bednarz ◽  
Fiona Harden ◽  
...  


2018 ◽  
Author(s):  
Jacob Pritt ◽  
Nae-Chyun Chen ◽  
Ben Langmead

Abstract: There is growing interest in using genetic variants to augment the reference genome into a “graph genome” to improve read alignment accuracy and reduce allelic bias. While adding a variant has the positive effect of removing an undesirable alignment-score penalty, it also increases both the ambiguity of the reference genome and the cost of storing and querying the genome index. We introduce methods and a software tool called FORGe for modeling these effects and prioritizing variants accordingly. We show that FORGe enables a range of advantageous and measurable trade-offs between accuracy and computational overhead.



2017 ◽  
Author(s):  
Koen Deforche

Abstract

Motivation: The alignment of biological sequences is fundamental to their further interpretation. Current alignment algorithms typically align either nucleic acid or amino acid sequences. Using only nucleic acid sequence similarity, divergent sequences cannot be aligned reliably because of the limited alphabet and genetic saturation. To align divergent coding nucleic acid sequences, one can align using the translated amino acid sequences. This requires the detection of the correct open reading frame, is prone to eventual frame shift errors, and typically requires the treatment of genes separately. It was our motivation to design a nucleic acid sequence alignment algorithm that aligns a nucleic acid sequence against a (reference) genome sequence, works equally well for similar and divergent sequences, and produces an optimal alignment considering simultaneously the alignment of all annotated coding sequences.

Results: We define a genome alignment score for evaluating the quality of an alignment of a nucleic acid query sequence against a reference genome sequence for which coding sequence features have been annotated (for example in a GenBank record). The genome alignment score combines the affine gap score for the nucleic acid sequence with an affine gap score for all amino acid alignments resulting from coding sequences in open reading frames contained within the query sequence. We present a Dynamic Programming algorithm to compute the optimal global or local alignment using this genomic alignment score and provide a formal proof of correctness.
This algorithm allows the alignment of nucleic acid sequences from closely related and highly divergent sequences within the same software and using the same parameters, automatically correcting any eventual frame shift errors, and produces at the same time the aligned translated amino acid sequences of all relevant coding sequence features.

Availability: The software is available as a web application at http://www.genomedetective.com/app/aga and as a command-line application at https://github.com/emweb/aga
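As background for the affine gap score mentioned in the abstract above, the following Python sketch scores an already-aligned pair of sequences with an affine gap penalty. It is not the genome alignment score or the dynamic programming algorithm from the paper: the scoring values are assumed, and so is the convention that the first position of a gap costs gap_open and each further position costs gap_extend (other tools define this differently).

```python
def affine_alignment_score(aligned_a, aligned_b, match=2, mismatch=-1,
                           gap_open=-5, gap_extend=-1):
    """Score two equal-length gapped strings ('-' marks a gap position).

    All parameter values are illustrative assumptions.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    gap_in = None  # row holding the currently open gap: "a", "b", or None
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            row = "a" if x == "-" else "b"
            # Continuing a gap in the same row is cheaper than opening one.
            score += gap_extend if gap_in == row else gap_open
            gap_in = row
        else:
            score += match if x == y else mismatch
            gap_in = None
    return score


no_gaps = affine_alignment_score("ACGT", "ACGT")
one_long_gap = affine_alignment_score("AC--GT", "ACTTGT")
two_short_gaps = affine_alignment_score("A-C-G", "ATCTG")
```

With these assumed values, one gap of length two costs less than two separate gaps of length one, which is the behavior that distinguishes affine from linear gap penalties.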



2012 ◽  
Vol 10 (01) ◽  
pp. 1240001 ◽  
Author(s):  
GÜNHAN GÜLSOY ◽  
BHAVIK GANDHI ◽  
TAMER KAHVECI

We consider the problem of finding a subnetwork in a given biological network (i.e. target network) that is most similar to a given small query network. We aim to find the optimal solution (i.e. the subnetwork with the largest alignment score) with a provable confidence bound. There is no known polynomial time solution to this problem in the literature. Alon et al. have developed a state-of-the-art coloring method that reduces the cost of this problem. This method randomly colors the target network prior to alignment for many iterations until a user-supplied confidence is reached. Here we develop a novel coloring method, named k-hop coloring (k is a positive integer), that achieves a provable confidence value in a small number of iterations without sacrificing optimality. Our method considers the color assignments already made in the neighborhood of each target network node while assigning a color to a node. This way, it preemptively avoids many color assignments that are guaranteed to fail to produce the optimal alignment. We also develop a filtering method that, after each coloring instance, eliminates the nodes that cannot be aligned without reducing the alignment score. We demonstrate both theoretically and experimentally that our coloring method outperforms that of Alon et al., which is also used by a number of network alignment methods, including QPath and QNet, by a factor of three without reducing the confidence in the optimality of the result. Our experiments also suggest that the resulting alignment method is capable of identifying functionally enriched regions in the target network successfully.



2012 ◽  
Vol 5 (1) ◽  
pp. 286 ◽  
Author(s):  
Yonil Park ◽  
Sergey Sheetlin ◽  
Ning Ma ◽  
Thomas L Madden ◽  
John L Spouge

