Excluding Loci With Substitution Saturation Improves Inferences From Phylogenomic Data

Systematic Biology ◽

10.1093/sysbio/syab075 ◽

2021 ◽

Author(s):

David A Duchêne ◽

Niklas Mather ◽

Cara Van Der Wal ◽

Simon Y W Ho

Keyword(s):

Negative Impact ◽

Effective Means ◽

Model Performance ◽

Phylogenetic Inference ◽

Nucleotide Sequences ◽

Data Sets ◽

Data Set ◽

Deep Time ◽

Substitution Saturation ◽

Genome Scale

Abstract The historical signal in nucleotide sequences becomes eroded over time by substitutions occurring repeatedly at the same sites. This phenomenon, known as substitution saturation, is recognized as one of the primary obstacles to deep-time phylogenetic inference using genome-scale data sets. We present a new test of substitution saturation and demonstrate its performance in simulated and empirical data. For some of the 36 empirical phylogenomic data sets that we examined, we detect substitution saturation in around 50% of loci. We found that saturation tends to be flagged as problematic in loci with highly discordant phylogenetic signals across sites. Within each data set, the loci with smaller numbers of informative sites are more likely to be flagged as containing problematic levels of saturation. The entropy saturation test proposed here is sensitive to high evolutionary rates relative to the evolutionary timeframe, while also being sensitive to several factors known to mislead phylogenetic inference, including short internal branches relative to external branches, short nucleotide sequences, and tree imbalance. Our study demonstrates that excluding loci with substitution saturation can be an effective means of mitigating the negative impact of multiple substitutions on phylogenetic inferences. [Phylogenetic model performance; phylogenomics; substitution model; substitution saturation; test statistics.]

Excluding loci with substitution saturation improves inferences from phylogenomic data

10.1101/2021.08.28.457888 ◽

2021 ◽

Author(s):

David A. Duchêne ◽

Niklas Mather ◽

Cara Van Der Wal ◽

Simon Y.W. Ho

Keyword(s):

Negative Impact ◽

Effective Means ◽

Phylogenetic Inference ◽

Nucleotide Sequences ◽

Data Sets ◽

Data Set ◽

Deep Time ◽

Substitution Saturation ◽

Genome Scale ◽

Scale Data

Abstract The historical signal in nucleotide sequences becomes eroded over time by substitutions occurring repeatedly at the same sites. This phenomenon, known as substitution saturation, is recognized as one of the primary obstacles to deep-time phylogenetic inference using genome-scale data sets. We present a new test of substitution saturation and demonstrate its performance in simulated and empirical data. For some of the 36 empirical phylogenomic data sets that we examined, we detect substitution saturation in around 50% of loci. We found that saturation tends to be flagged as problematic in loci with highly discordant phylogenetic signals across sites. Within each data set, the loci with smaller numbers of informative sites are more likely to be flagged as containing problematic levels of saturation. The entropy saturation test proposed here is sensitive to high evolutionary rates relative to the evolutionary timeframe, while also being sensitive to several factors known to mislead phylogenetic inference, including short internal branches relative to external branches, short nucleotide sequences, and tree imbalance. Our study demonstrates that excluding loci with substitution saturation can be an effective means of mitigating the negative impact of multiple substitutions on phylogenetic inferences.

Using target enrichment sequencing to study the higher-level phylogeny of the largest lichen-forming fungi family: Parmeliaceae (Ascomycota)

IMA Fungus ◽

10.1186/s43008-020-00051-x ◽

2020 ◽

Vol 11 (1) ◽

Author(s):

Felix Grewe ◽

Claudio Ametrano ◽

Todd J. Widhelm ◽

Steven Leavitt ◽

Isabel Distefano ◽

...

Keyword(s):

Data Sets ◽

Target Enrichment ◽

Data Set ◽

Reduced Genome ◽

Genome Data ◽

Worldwide Distribution ◽

Phylogenetic Studies ◽

Genome Scale ◽

Scale Data ◽

Promising Avenue

Abstract Parmeliaceae is the largest family of lichen-forming fungi with a worldwide distribution. We used a target enrichment data set and a qualitative selection method for 250 out of 350 genes to infer the phylogeny of the major clades in this family including 81 taxa, with both subfamilies and all seven major clades previously recognized in the subfamily Parmelioideae. The reduced genome-scale data set was analyzed using concatenated-based Bayesian inference and two different Maximum Likelihood analyses, and a coalescent-based species tree method. The resulting topology was strongly supported with the majority of nodes being fully supported in all three concatenated-based analyses. The two subfamilies and each of the seven major clades in Parmelioideae were strongly supported as monophyletic. In addition, most backbone relationships in the topology were recovered with high nodal support. The genus Parmotrema was found to be polyphyletic and consequently, it is suggested to accept the genus Crespoa to accommodate the species previously placed in Parmotrema subgen. Crespoa. This study demonstrates the power of reduced genome-scale data sets to resolve phylogenetic relationships with high support. Due to lower costs, target enrichment methods provide a promising avenue for phylogenetic studies including larger taxonomic/specimen sampling than whole genome data would allow.

Impact of baseline clinical and radiological features on outcome of chronic rhinosinusitis in granulomatosis with polyangiitis

10.21203/rs.3.rs-67859/v2 ◽

2020 ◽

Author(s):

Sigrun Skaar Holme ◽

Karin Kilian ◽

Heidi B. Eggesbø ◽

Jon Magnus Moen ◽

Øyvind Molberg

Keyword(s):

Chronic Rhinosinusitis ◽

Granulomatosis With Polyangiitis ◽

Negative Impact ◽

Renal Involvement ◽

Ct Scans ◽

Data Sets ◽

Data Set ◽

Mucosal Disease ◽

Life Threatening ◽

Observation Period

Abstract Background: Granulomatosis with polyangiitis (GPA) causes a recurring inflammation in nose and paranasal sinuses that clinically resembles chronic rhinosinusitis (CRS) of other aetiologies. While sinonasal inflammation is not among the life-threatening features of GPA, patients report it to have major negative impact on quality of life. A relatively large proportion of GPA patients have severe CRS with extensive damage to nose and sinus structures evident by CT, but risk factors for severe CRS development remain largely unknown. In this study, we aimed to identify clinical and radiological predictors of CRS-related damage in GPA. Methods: We included GPA patients who had clinical data sets from time of diagnosis, and two or more paranasal sinus CT scans obtained ≥ 12 months apart available for analysis. We defined time from first to last CT as the study observation period, and evaluated CRS development across this period using CT scores for inflammatory sinus bone thickening (osteitis), bone destructions and sinus opacifications (here defined as mucosal disease). In logistic regression, we applied osteitis as main outcome measure for CRS-related damage. Results: We evaluated 697 CT scans obtained over median 5 years observation from 116 GPA patients. We found that 39% (45/116) of the GPA patients remained free from CRS damage across the study observation period, while 33% (38/116) had progressive damage. By end of observation, 32% (37/116) of the GPA patients had developed severe osteitis. We identified mucosal disease at baseline as a predictor for osteitis (Odds Ratio 1.33), and we found that renal involvement at baseline was less common in patients with severe osteitis at last CT (41%, 15/37) than in patients with no osteitis (60%, 27/45). Conclusions: In this largely unselected GPA patient cohort, baseline sinus mucosal disease associated with CRS-related damage, as measured by osteitis at end of follow-up. We found no significant association with clinical factors, but the data set indicated an inverse relationship between renal involvement and severe sinonasal affliction.

The benefits of segmentation: Evidence from a South African bank and other studies

South African Journal of Science ◽

10.17159/sajs.2017/20160345 ◽

2017 ◽

Vol 113 (9/10) ◽

Cited By ~ 2

Author(s):

Douw G. Breed ◽

Tanja Verster

Keyword(s):

South African ◽

Direct Marketing ◽

Model Performance ◽

Predictive Modelling ◽

Modelling Technique ◽

Gradient Boosting ◽

Data Sets ◽

Data Set ◽

Linear Modelling ◽

Modelling Techniques

We applied different modelling techniques to six data sets from different disciplines in the industry, on which predictive models can be developed, to demonstrate the benefit of segmentation in linear predictive modelling. We compared the model performance achieved on the data sets to the performance of popular non-linear modelling techniques, by first segmenting the data (using unsupervised, semi-supervised, as well as supervised methods) and then fitting a linear modelling technique. A total of eight modelling techniques was compared. We show that there is no one single modelling technique that always outperforms on the data sets. Specifically considering the direct marketing data set from a local South African bank, it is observed that gradient boosting performed the best. Depending on the characteristics of the data set, one technique may outperform another. We also show that segmenting the data benefits the performance of the linear modelling technique in the predictive modelling context on all data sets considered. Specifically, of the three segmentation methods considered, the semi-supervised segmentation appears the most promising.

An empirical investigation of alternative semi-supervised segmentation methodologies

South African Journal of Science ◽

10.17159/sajs.2019/5359 ◽

2019 ◽

Vol 115 (3/4) ◽

Author(s):

Douw G. Breed ◽

Tanja Verster

Keyword(s):

Logistic Regression ◽

Model Performance ◽

Predictive Modelling ◽

Data Sets ◽

Validation Data ◽

Data Set ◽

Supervised Segmentation ◽

Improved Performance ◽

Validation Set ◽

Combination Approach

Segmentation of data for the purpose of enhancing predictive modelling is a well-established practice in the banking industry. Unsupervised and supervised approaches are the two main types of segmentation and examples of improved performance of predictive models exist for both approaches. However, both focus on a single aspect – either target separation or independent variable distribution – and combining them may deliver better results. This combination approach is called semi-supervised segmentation. Our objective was to explore four new semi-supervised segmentation techniques that may offer alternative strengths. We applied these techniques to six data sets from different domains, and compared the model performance achieved. The original semi-supervised segmentation technique was the best for two of the data sets (as measured by the improvement in validation set Gini), but others outperformed for the other four data sets. Significance: We propose four newly developed semi-supervised segmentation techniques that can be used as additional tools for segmenting data before fitting a logistic regression. In all comparisons, using semi-supervised segmentation before fitting a logistic regression improved the modelling performance (as measured by the Gini coefficient on the validation data set) compared to using unsegmented logistic regression.

On the importance of observational data properties when assessing regional climate model performance of extreme precipitation

Hydrology and Earth System Sciences ◽

10.5194/hess-17-4323-2013 ◽

2013 ◽

Vol 17 (11) ◽

pp. 4323-4337 ◽

Cited By ~ 30

Author(s):

M. A. Sunyer ◽

H. J. D. Sørup ◽

O. B. Christensen ◽

H. Madsen ◽

D. Rosbjerg ◽

...

Keyword(s):

Spatial Pattern ◽

Observational Data ◽

Extreme Precipitation ◽

Climate Models ◽

Climate Model ◽

Regional Climate ◽

Model Performance ◽

Regional Climate Models ◽

Data Sets ◽

Data Set

Abstract. In recent years, there has been an increase in the number of climate studies addressing changes in extreme precipitation. A common step in these studies involves the assessment of the climate model performance. This is often measured by comparing climate model output with observational data. In the majority of such studies the characteristics and uncertainties of the observational data are neglected. This study addresses the influence of using different observational data sets to assess the climate model performance. Four different data sets covering Denmark using different gauge systems and comprising both networks of point measurements and gridded data sets are considered. Additionally, the influence of using different performance indices and metrics is addressed. A set of indices ranging from mean to extreme precipitation properties is calculated for all the data sets. For each of the observational data sets, the regional climate models (RCMs) are ranked according to their performance using two different metrics. These are based on the error in representing the indices and the spatial pattern. In comparison to the mean, extreme precipitation indices are highly dependent on the spatial resolution of the observations. The spatial pattern also shows differences between the observational data sets. These differences have a clear impact on the ranking of the climate models, which is highly dependent on the observational data set, the index and the metric used. The results highlight the need to be aware of the properties of observational data chosen in order to avoid overconfident and misleading conclusions with respect to climate model performance.

Crime, corruption and the role of institutions

Indian Growth and Development Review ◽

10.1108/igdr-11-2011-0040 ◽

2014 ◽

Vol 7 (1) ◽

pp. 73-95 ◽

Cited By ~ 2

Author(s):

Ishita Chatterjee ◽

Ranjan Ray

Keyword(s):

Negative Impact ◽

Probit Model ◽

Data Sets ◽

Older Individuals ◽

Crime Victim ◽

Data Set ◽

Content Type ◽

Country Level ◽

Level Data

Purpose – There have been very few attempts in the economics literature to empirically study the link between criminal and corrupt behaviour due to lack of data sets on simultaneous information on both types of illegitimate activities. The paper aims to discuss these issues. Design/methodology/approach – The present study uses a large cross-country data set containing individual responses to questions on crime and corruption along with information on the respondents' characteristics. These micro-level data are supplemented by country-level macro and institutional indicators. A methodological contribution of this study is the estimation of an ordered probit model based on outcomes defined as combinations of crime and bribe victimisation. Findings – The authors find that: a crime victim is more likely to face bribe demands, males are more likely victims of corruption while females are of serious crime, older individuals and those living in the smaller towns are less exposed to crime and corruption, increasing levels of income and education increase the likelihood of crime and bribe victimisation to be reported and a stronger legal system and a happier society reduce both crime and corruption. However, the authors find no evidence of a strong and uniformly negative impact of either crime or corruption on a country's growth rate. Originality/value – This paper is, to the authors' knowledge, the first in the literature to explore the nexus between crime and corruption, their magnitudes, determinants and their effects on growth rates.

Evaluation of a Model for Predicting the Drift of Iceberg Ensembles

Journal of Offshore Mechanics and Arctic Engineering ◽

10.1115/1.3257047 ◽

1988 ◽

Vol 110 (2) ◽

pp. 172-179 ◽

Cited By ~ 2

Author(s):

H. El-Tahan ◽

S. Venkatesh ◽

M. El-Tahan

Keyword(s):

East Coast ◽

Current System ◽

Model Performance ◽

Critical Examination ◽

Data Sets ◽

Data Set ◽

Grand Banks ◽

Qualitative And Quantitative ◽

Large Numbers ◽

Grid Block

This paper describes the evaluation of a model for predicting the drift of iceberg ensembles. The model was developed in preparation for providing an iceberg forecasting service off the Canadian east coast north of about 45°N. It was envisaged that 1–5 day forecasts of iceberg ensemble drift will be available. Following a critical examination of all available data, 10 data sets containing up to 404 icebergs in the Grand Banks area off Newfoundland were selected for detailed study. The winds measured in the vicinity of the study area as well as the detailed current system developed by the International Ice Patrol were used as inputs to the model. A discussion on the accuracy and limitations of the input data is presented. Qualitative and quantitative criteria were used to evaluate model performance. Applying these criteria to the results of the computer simulations, it is shown that the model provides good predictions. The degree of predictive success varied from one data set to another. The study demonstrated the validity of the assumption of random positioning for icebergs within a grid block, especially for ensembles with large numbers of icebergs. It was found that an “average” iceberg size can be used to represent all icebergs. The study also showed that in order to achieve improved results it will be necessary to account for the deterioration (complete melting of icebergs), especially during the summer months.

EukProt: a database of genome-scale predicted proteins across the diversity of eukaryotic life

10.1101/2020.06.30.180687 ◽

2020 ◽

Cited By ~ 4

Author(s):

Daniel J. Richter ◽

Cédric Berney ◽

Jürgen F. H. Strassert ◽

Fabien Burki ◽

Colomban de Vargas

Keyword(s):

Gene Family ◽

Gene Family Evolution ◽

Data Sets ◽

Data Set ◽

Eukaryotic Diversity ◽

Persistent Identifier ◽

Taxonomic Framework ◽

Genome Scale

Abstract EukProt is a database of published and publicly available predicted protein sets and unannotated genomes selected to represent eukaryotic diversity, including 742 species from all major supergroups as well as orphan taxa. The goal of the database is to provide a single, convenient resource for studies in phylogenomics, gene family evolution, and other gene-based research across the spectrum of eukaryotic life. Each species is placed within the UniEuk taxonomic framework in order to facilitate downstream analyses, and each data set is associated with a unique, persistent identifier to facilitate comparison and replication among analyses. The database is currently in version 2, and all versions will be permanently stored and made available via FigShare. We invite the community to provide suggestions for new data sets and new annotation features to be included in subsequent versions, with the goal of building a collaborative resource that will promote research to understand eukaryotic diversity and diversification.

Impact of Baseline Clinical and Radiological Features on Outcome of Chronic Rhinosinusitis in Granulomatosis with Polyangiitis

10.21203/rs.3.rs-67859/v1 ◽

2020 ◽

Author(s):

Sigrun Skaar Holme ◽

Karin Kilian ◽

Heidi B. Eggesbø ◽

Jon Magnus Moen ◽

Øyvind Molberg

Keyword(s):

Chronic Rhinosinusitis ◽

Granulomatosis With Polyangiitis ◽

Negative Impact ◽

Renal Involvement ◽

Ct Scans ◽

Data Sets ◽

Data Set ◽

Mucosal Disease ◽

Life Threatening ◽

Observation Period

Abstract Background: Granulomatosis with polyangiitis (GPA) causes a recurring inflammation in nose and paranasal sinuses that clinically resembles chronic rhinosinusitis (CRS) of other aetiologies. While sinonasal inflammation is not among the life-threatening features of GPA, patients report it to have major negative impact on quality of life. A relatively large proportion of GPA patients have severe CRS with extensive damage to nose and sinus structures evident by CT, but risk factors for severe CRS development remain largely unknown. In this study, we aimed to identify clinical and radiological predictors of CRS-related damage in GPA. Methods: We included GPA patients who had clinical data sets from time of diagnosis, and two or more paranasal sinus CT scans obtained ≥ 12 months apart available for analysis. We defined time from first to last CT as the study observation period, and evaluated CRS development across this period by CT scores for inflammatory sinus bone thickening (osteitis), bone destructions and sinus opacifications (here defined as mucosal disease). In logistic regression, we applied osteitis as main outcome measure for CRS-related damage. Results: We evaluated 697 CT scans obtained over median 5 years observation from 116 GPA patients. We found that 39% (45/116) of the GPA patients remained free from CRS damage across the study observation period, while 33% (38/116) had progressive damage. By end of observation, 32% (37/116) of the GPA patients had developed severe osteitis. We identified mucosal disease at baseline as a predictor for osteitis (Odds Ratio 1.34), and we found that renal involvement at baseline was less common in patients with severe osteitis at last CT (41%, 15/37) than in patients with no osteitis (60%, 27/45). Conclusions: In this largely unselected GPA patient cohort, baseline sinus mucosal disease associated with CRS-related damage, as measured by osteitis at end of follow-up. We found no significant association with clinical factors, but the data set indicated an inverse relationship between renal involvement and severe sinonasal affliction.
