similarity thresholds

2021 ◽  
Author(s):  
Peter W Schafran ◽  
Fay-Wei W Li ◽  
Carl Rothfels

Inferring the true biological sequences from amplicon mixtures remains a difficult bioinformatic problem. The traditional approach is to cluster sequencing reads by similarity thresholds and treat the consensus sequence of each cluster as an "operational taxonomic unit" (OTU). Recently, this approach has been improved upon by model-based methods that correct PCR and sequencing errors in order to infer "amplicon sequence variants" (ASVs). To date, ASV approaches have been used primarily in metagenomics, but they are also useful for identifying allelic or paralogous variants and for determining homeologs in polyploid organisms. To facilitate the usage of ASV methods among polyploidy researchers, we incorporated ASV inference alongside OTU clustering in PURC v2.0, a major update to PURC (Pipeline for Untangling Reticulate Complexes). In addition to preserving original PURC functions, PURC v2.0 allows users to process PacBio CCS/HiFi reads through DADA2 to generate and annotate ASVs for multiplexed data, with outputs including separate alignments for each locus ready for phylogenetic inference. In addition, PURC v2.0 features faster demultiplexing than the original version and has been updated to be compatible with Python 3. In this chapter we present results indicating that PURC v2.0 (using the ASV approach) is more likely to infer the correct biological sequences in comparison to the earlier OTU-based PURC, and describe how to prepare sequencing data, run PURC v2.0 under several different modes, and interpret the output. We expect that PURC v2.0 will provide biologists with a method for generating multi-locus "moderate data" datasets that are large enough to be phylogenetically informative and small enough for manual curation.


Author(s):  
Aurélie Bonin ◽  
Alessia Guerrieri ◽  
Francesco Ficetola

Clustering approaches are pivotal to handle the many sequence variants obtained in DNA metabarcoding datasets, therefore they have become a key step of metabarcoding analysis pipelines. Clustering often relies on a sequence similarity threshold to gather sequences in Molecular Operational Taxonomic Units (MOTUs) that ideally each represent a homogeneous taxonomic entity, e.g. a species or a genus. However, the choice of the clustering threshold is rarely justified, and its impact on MOTU over-splitting or over-merging even less tested. Here, we evaluated clustering threshold values for several metabarcoding markers under different criteria: limitation of MOTU over-merging, limitation of MOTU over-splitting, and trade-off between over-merging and over-splitting. We extracted sequences from a public database for eight markers, ranging from generalist markers targeting Bacteria or Eukaryota, to more specific markers targeting a class or a subclass (e.g. Insecta, Oligochaeta). Based on the distributions of pairwise sequence similarities within species and within genera and on the rates of over-splitting and over-merging across different clustering thresholds, we were able to propose threshold values minimizing the risk of over-splitting, that of over-merging, or offering a trade-off between the two risks. For generalist markers, high similarity thresholds (0.96-0.99) are generally appropriate, while more specific markers require lower values (0.85-0.96). These results do not support the use of a fixed clustering threshold (e.g. 0.97). Instead, we advocate a careful examination of the most appropriate threshold based on the research objectives, the potential costs of over-splitting and over-merging, and the features of the studied markers.
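The trade-off between over-splitting and over-merging described above can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the toy sequences are pre-aligned and of equal length, identity is a simple per-position match fraction, and the clustering is a basic greedy centroid scheme rather than the procedure used in the study.

```python
from collections import defaultdict


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold):
    """Assign each sequence to the first centroid it matches at >= threshold.

    A sequence that matches no existing centroid becomes a new centroid.
    Returns one cluster label per input sequence.
    """
    centroids, labels = [], []
    for s in seqs:
        for i, c in enumerate(centroids):
            if identity(s, c) >= threshold:
                labels.append(i)
                break
        else:
            centroids.append(s)
            labels.append(len(centroids) - 1)
    return labels


def split_merge_counts(labels, species):
    """Count over-split species (in >1 cluster) and over-merged clusters (>1 species)."""
    clusters_per_species = defaultdict(set)
    species_per_cluster = defaultdict(set)
    for lab, sp in zip(labels, species):
        clusters_per_species[sp].add(lab)
        species_per_cluster[lab].add(sp)
    over_split = sum(len(c) > 1 for c in clusters_per_species.values())
    over_merged = sum(len(s) > 1 for s in species_per_cluster.values())
    return over_split, over_merged


# Hypothetical toy data: two sequences for each of two species.
toy_seqs = ["AAAAAAAAAA", "AAAAAAAAAT", "AAAAAAAATT", "AAAAAACCTT"]
toy_species = ["A", "A", "B", "B"]

for t in (0.95, 0.9, 0.8, 0.6):
    labels = greedy_cluster(toy_seqs, t)
    print(t, split_merge_counts(labels, toy_species))
```

On this toy data a high threshold splits both species, a low threshold merges them into one cluster, and intermediate values show a mix of the two errors, which is the kind of behaviour a marker-specific threshold is meant to balance.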


2021 ◽  
pp. 155335062110027
Author(s):  
Elham Rastegari ◽  
Donovan Orn ◽  
Mohsen Zahiri ◽  
Carl Nelson ◽  
Hesham Ali ◽  
...  

Background: Medical devices are becoming more complex, and doctors need to learn quickly how to use new medical tools. However, it is challenging to objectively assess fundamental laparoscopic surgical skill level and to determine readiness for advancement. There is a lack of objective models for comparing performance between medical trainees and experienced doctors. Methods: This article discusses the use of similarity network models, for individual tasks and for a combination of tasks, to show the level of similarity between residents and medical students while performing each task and their overall laparoscopic surgical skill level using a medical device (e.g. laparoscopic instruments). When a medical student is connected to most residents, that student is considered competent to advance to the next training level. The performance of sixteen participants (5 residents and 11 students) on 3 tasks under 3 different training schedules is used in this study. Results: The results show a general positive progression of students over 4 training sessions. Our results also indicate that students with different training schedules have different performance levels. Students progress more quickly on a task when the training sessions are held close together than when they are far apart in time. Conclusions: This study provides a graph-based framework for evaluating new learners’ performance on medical devices and their readiness for advancement. This similarity network method could be used to classify students’ performance using similarity thresholds, facilitating decision-making related to training and progression through curricula.
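The rule "a student connected to most residents is ready to advance" can be sketched as a thresholded similarity network. This is a minimal illustration under stated assumptions: the abstract does not name the similarity measure or the features, so cosine similarity, the two-element performance vectors, and the names `ready_students`, `residents` and `students` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ready_students(students, residents, threshold):
    """Return students linked (similarity >= threshold) to more than half of the residents."""
    ready = []
    for name, vec in students.items():
        links = sum(cosine(vec, r) >= threshold for r in residents.values())
        if links > len(residents) / 2:
            ready.append(name)
    return ready


# Hypothetical performance vectors (e.g. normalized task metrics).
residents = {"R1": (1.0, 0.0), "R2": (0.9, 0.1), "R3": (0.8, 0.2)}
students = {"S1": (0.95, 0.05), "S2": (0.0, 1.0)}

print(ready_students(students, residents, threshold=0.95))
```

Raising or lowering `threshold` changes which edges exist in the network, and therefore which students are classified as ready.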


2020 ◽  
Author(s):  
Ruben De Lange ◽  
Slavomír Adamčík ◽  
Katarína Adamčíkova ◽  
Pieter Asselman ◽  
Jan Borovička ◽  
...  

Russula albonigra is considered a well-known species, morphologically delimited by the context of the basidiomata that is blackening without intermediate reddening, and the menthol-cooling taste of the lamellae. It is supposed to have a broad ecological amplitude and a large distribution area. A thorough molecular analysis based on four nuclear markers (ITS, LSU, RPB2 and TEF1-α) shows this traditional concept of R. albonigra s.l. represents a species complex consisting of at least five European, three North-American and one Chinese species. Morphological study shows traditional characters used to delimit R. albonigra are not always reliable. Therefore, a new delimitation of the R. albonigra lineage is proposed and a key to the described European species of R. subg. Compactae is presented. A lectotype and an epitype are designated for R. albonigra and three new European species are described: R. ambusta, R. nigrifacta and R. ustulata. UNITE species hypotheses at different thresholds were tested against the taxonomic data. The species hypotheses at the similarity threshold 0.5% give a perfect match to the phylogenetically defined species within the R. albonigra lineage. Publicly available sequence data can contribute to species delimitation and expand knowledge on ecology and distribution, but short and low-quality sequences are pitfalls. The importance of updating public taxonomic data and using correct sequence similarity thresholds is emphasised.


Author(s):  
Peter Christen ◽  
Eilidh Garrett ◽  
Beata Nowok ◽  
Alice Reid ◽  
Lee Williamson ◽  
...  

Introduction: The Digitising Scotland project (https://digitisingscotland.ac.uk/) has transcribed all Scottish birth, death, and marriage certificates from 1855 to 1974. The linkage of these data will provide formidable challenges for linkage experts and a multitude of opportunities for health and social science researchers. Objectives and approach: We linked births between November 1916 and December 1923 to deaths of women aged 15 to 49 who died between January 1917 and December 1923. We only linked a death occurring up to 42 days after a birth. We compared parent names with those of the deceased and her spouse, as well as their address, using string matching functions. Given the lack of ground truth data, we conducted a sensitivity analysis using different similarity thresholds to classify births linked to deaths as matches (if their similarity was at least a given threshold), or otherwise as non-matches. We then calculated the maternal mortality rate (MMR) on a monthly basis for January 1917 to July 1918 (the pre-flu period), and October 1918 to March 1919 (the flu period), as the number of matched birth-to-death certificates divided by the average monthly number of births in the month of death and the previous month. Results: Puerperal risk of death during this period was much higher than for the comparable female age group during the different phases of the 1918 flu. During the October 1918 wave, puerperal women were at greater risk than women of the same age, while the risk dropped below the expected level in the summer of 1919. We carried out a sensitivity analysis by linking with different similarity thresholds and found that our findings were robust to these decisions. Conclusions: We have shown how a large and complex data collection can be successfully linked, resulting in new opportunities for various studies in the health and social sciences.
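The threshold-based match classification and the sensitivity analysis can be sketched as follows. This is only an illustration: the abstract does not say which string matching functions were used, so Python's `difflib.SequenceMatcher` stands in for them, and the candidate name pairs are invented.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] from difflib's ratio()."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def count_matches(pairs, threshold):
    """Number of candidate (birth, death) name pairs classified as matches."""
    return sum(name_similarity(a, b) >= threshold for a, b in pairs)


# Hypothetical candidate pairs: (mother's name on birth record, name on death record).
candidate_pairs = [
    ("Margaret McDonald", "Margaret MacDonald"),  # spelling variant
    ("Ann Smith", "Ann Smith"),                   # exact agreement
    ("Ann Smith", "Jean Brown"),                  # different person
]

# Sensitivity analysis: how does the number of links change with the threshold?
for t in (0.7, 0.8, 0.9, 0.99):
    print(t, count_matches(candidate_pairs, t))
```

If the match count, and any rate derived from it, stays stable across a plausible range of thresholds, conclusions drawn from the linked data are less likely to be an artefact of one particular cut-off.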


Viruses ◽  
2020 ◽  
Vol 12 (11) ◽  
pp. 1268
Author(s):  
Cristina Moraru ◽  
Arvind Varsani ◽  
Andrew M. Kropinski

Nucleotide-based intergenomic similarities are useful to understand how viruses are related with each other and to classify them. Here we have developed VIRIDIC, which implements the traditional algorithm used by the International Committee on Taxonomy of Viruses (ICTV), Bacterial and Archaeal Viruses Subcommittee, to calculate virus intergenomic similarities. When compared with other software, VIRIDIC gave the best agreement with the traditional algorithm, which is based on the percent identity between two genomes determined by BLASTN. Furthermore, VIRIDIC proved best at estimating the relatedness between more distantly-related phages, relatedness that other tools can significantly overestimate. In addition to the intergenomic similarities, VIRIDIC also calculates three indicators of the alignment ability to capture the relatedness between viruses: the aligned fractions for each genome in a pair and the length ratio between the two genomes. The main output of VIRIDIC is a heatmap integrating the intergenomic similarity values with information regarding the genome lengths and the aligned genome fraction. Additionally, VIRIDIC can group viruses into clusters, based on user-defined intergenomic similarity thresholds. The sensitivity of VIRIDIC is given by the BLASTN. Thus, it is able to capture relationships between viruses having in common even short genomic regions, with as low as 65% similarity. Below this similarity level, protein-based analyses should be used, as they are the best suited to capture distant relationships. VIRIDIC is available at viridic.icbm.de, both as a web-service and a stand-alone tool. It allows fast analysis of large phage genome datasets, especially in the stand-alone version, which can be run on the user’s own servers and can be integrated in bioinformatics pipelines. VIRIDIC was developed having viruses of Bacteria and Archaea in mind; however, it could potentially be used for eukaryotic viruses as well, as long as they are monopartite.


GigaScience ◽  
2019 ◽  
Vol 8 (10) ◽  
Author(s):  
Sion C Bayliss ◽  
Harry A Thorpe ◽  
Nicola M Coyle ◽  
Samuel K Sheppard ◽  
Edward J Feil

Background: Cataloguing the distribution of genes within natural bacterial populations is essential for understanding evolutionary processes and the genetic basis of adaptation. Advances in whole genome sequencing technologies have led to a vast expansion in the number of bacterial genomes deposited in public databases. There is a pressing need for software solutions which are able to cluster, catalogue and characterise genes, or other features, in increasingly large genomic datasets. Results: Here we present a pangenomics toolbox, PIRATE (Pangenome Iterative Refinement and Threshold Evaluation), which identifies and classifies orthologous gene families in bacterial pangenomes over a wide range of sequence similarity thresholds. PIRATE builds upon recent scalable software developments to allow for the rapid interrogation of thousands of isolates. PIRATE clusters genes (or other annotated features) over a wide range of amino acid or nucleotide identity thresholds and uses the clustering information to rapidly identify paralogous gene families and putative fission/fusion events. Furthermore, PIRATE orders the pangenome using a directed graph, provides a measure of allelic variation, and estimates sequence divergence for each gene family. Conclusions: We demonstrate that PIRATE scales linearly with both number of samples and computation resources, allowing for analysis of large genomic datasets, and compares favorably to other popular tools. PIRATE provides a robust framework for analysing bacterial pangenomes, from largely clonal to panmictic species.


2019 ◽  
Vol 47 (W1) ◽  
pp. W357-W364 ◽  
Author(s):  
Antoine Daina ◽  
Olivier Michielin ◽  
Vincent Zoete

SwissTargetPrediction is a web tool, on-line since 2014, that aims to predict the most probable protein targets of small molecules. Predictions are based on the similarity principle, through reverse screening. Here, we describe the 2019 version, which represents a major update in terms of underlying data, backend and web interface. The bioactivity data were updated, the model retrained and similarity thresholds redefined. In the new version, the predictions are performed by searching for similar molecules, in 2D and 3D, within a larger collection of 376 342 compounds known to be experimentally active on an extended set of 3068 macromolecular targets. An efficient backend implementation speeds up the process, returning results for a druglike molecule on human proteins in 15–20 s. The refreshed web interface enhances user experience with new features for easy input and improved analysis. Interoperability capacity enables straightforward submission of any input or output molecule to other on-line computer-aided drug design tools, developed by the SIB Swiss Institute of Bioinformatics. High levels of predictive performance were maintained despite more extended biological and chemical spaces to be explored, e.g. achieving at least one correct human target in the top 15 predictions for >70% of external compounds. The new SwissTargetPrediction is available free of charge (www.swisstargetprediction.ch).
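The similarity principle behind reverse screening can be illustrated with a deliberately simplified sketch. It is not the SwissTargetPrediction model: it assumes fingerprints are plain sets of "on" bit indices, uses the Tanimoto coefficient as the only similarity measure, and the function names, targets and fingerprints are hypothetical.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) coefficient between two fingerprint bit sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def predict_targets(query, actives, threshold):
    """Rank targets by the best similarity between the query and any known active.

    `actives` maps a target name to a list of fingerprint sets of its known
    ligands. Targets whose best similarity falls below `threshold` are dropped.
    """
    scores = {}
    for target, fingerprints in actives.items():
        best = max((tanimoto(query, fp) for fp in fingerprints), default=0.0)
        if best >= threshold:
            scores[target] = best
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical fingerprints.
query_fp = {1, 2, 3, 4}
known_actives = {
    "T1": [{1, 2, 3, 4, 5}],
    "T2": [{1, 2, 9, 10}],
    "T3": [{1, 2, 3, 4}, {7, 8}],
}

print(predict_targets(query_fp, known_actives, threshold=0.5))
```

The threshold controls how similar a known active must be before its target is proposed at all, which is the role a redefined similarity threshold would play in a pipeline of this general shape.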


2019 ◽  
Vol 4 (1) ◽  
Author(s):  
Chang Chen ◽  
Shixue Sun ◽  
Zhixin Cao ◽  
Yan Shi ◽  
Baoqing Sun ◽  
...  

Sample entropy is a powerful tool for analyzing the complexity and irregularity of physiological signals, which may be associated with human health. Nevertheless, the sophistication of its calculation hinders its universal application. As of today, the R language provides multiple open-source packages for calculating sample entropy. All of them, however, are designed for different scenarios. Therefore, when searching for a suitable package, investigators may be confused about parameter settings and the selection of algorithms. To ease their selection, we have explored the functions of five existing R packages for calculating sample entropy and have compared their computing capability in several dimensions. We used four published datasets on respiratory and heart rate to study their input parameters, types of entropy, and program running time. In summary, NonlinearTseries and CGManalyzer can provide the analysis of sample entropy with different embedding dimensions and similarity thresholds. CGManalyzer is a good choice for calculating multiscale sample entropy of physiological signals because it not only shows sample entropy of all scales simultaneously but also provides various visualization plots. MSMVSampEn is the only package that can calculate multivariate multiscale entropies. In terms of computing time, NonlinearTseries, CGManalyzer, and MSMVSampEn run significantly faster than the other two packages. Moreover, we identify issues in the MSMVSampEn package. This article provides guidelines for researchers to find a suitable R package for their analysis and applications using sample entropy.
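For readers unfamiliar with the two parameters mentioned above, the following is a minimal sketch of sample entropy with embedding dimension `m` and similarity threshold `r`. It follows the common textbook definition (Chebyshev distance, self-matches excluded, the same number of templates for both lengths) and is written in Python for illustration; it is not the code of any of the R packages compared. Note that `r` is an absolute tolerance here, whereas many implementations express it as a fraction of the signal's standard deviation.

```python
import math


def sample_entropy(x, m=2, r=0.2):
    """Sample entropy of sequence x for embedding dimension m and tolerance r.

    Counts template pairs within Chebyshev distance <= r for lengths m and
    m + 1 (self-matches excluded) and returns -ln(A / B). Returns infinity
    when no matches are found, where the statistic is undefined.
    """
    n = len(x)

    def count_matches(length):
        # Use n - m templates for both lengths so the counts are comparable.
        templates = [x[i:i + length] for i in range(n - m)]
        count = 0
        for i in range(len(templates)):
            for j in range(i + 1, len(templates)):
                dist = max(abs(a - b) for a, b in zip(templates[i], templates[j]))
                if dist <= r:
                    count += 1
        return count

    b = count_matches(m)
    a = count_matches(m + 1)
    if a == 0 or b == 0:
        return float("inf")
    return -math.log(a / b)


# A perfectly regular signal: every length-m match extends to length m + 1.
print(sample_entropy([0, 1] * 10, m=2, r=0.1))
```

This direct implementation is quadratic in the signal length, which is one reason running time differs so much between packages on long recordings.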

