Excluding loci with substitution saturation improves inferences from phylogenomic data

Systematic Biology, 2021
DOI: 10.1093/sysbio/syab075
Author(s): David A Duchêne, Niklas Mather, Cara Van Der Wal, Simon Y W Ho
Keyword(s): Negative Impact, Effective Means, Model Performance, Phylogenetic Inference, Nucleotide Sequences, Data Sets, Data Set, Deep Time, Substitution Saturation, Genome Scale

Abstract The historical signal in nucleotide sequences becomes eroded over time by substitutions occurring repeatedly at the same sites. This phenomenon, known as substitution saturation, is recognized as one of the primary obstacles to deep-time phylogenetic inference using genome-scale data sets. We present a new test of substitution saturation and demonstrate its performance in simulated and empirical data. For some of the 36 empirical phylogenomic data sets that we examined, we detect substitution saturation in around 50% of loci. We found that saturation tends to be flagged as problematic in loci with highly discordant phylogenetic signals across sites. Within each data set, the loci with smaller numbers of informative sites are more likely to be flagged as containing problematic levels of saturation. The entropy saturation test proposed here is sensitive to high evolutionary rates relative to the evolutionary timeframe, while also being sensitive to several factors known to mislead phylogenetic inference, including short internal branches relative to external branches, short nucleotide sequences, and tree imbalance. Our study demonstrates that excluding loci with substitution saturation can be an effective means of mitigating the negative impact of multiple substitutions on phylogenetic inferences. [Phylogenetic model performance; phylogenomics; substitution model; substitution saturation; test statistics.]
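The abstract does not say how the entropy saturation statistic is computed, so the following is only a minimal sketch of the general idea behind entropy at alignment sites, not the authors' test. It assumes that each site's entropy is the Shannon entropy of its nucleotide frequencies, which approaches the 2-bit maximum as the four bases become equally frequent, as would be expected once repeated substitutions have randomized a site. The function names `site_entropy` and `mean_site_entropy` are hypothetical.

```python
import math
from collections import Counter


def site_entropy(column):
    """Shannon entropy (in bits) of the nucleotide frequencies in one alignment column."""
    # Gaps and ambiguity codes are ignored (an assumption made for this sketch).
    bases = [b for b in column.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    total = len(bases)
    counts = Counter(bases)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mean_site_entropy(alignment):
    """Average per-site entropy across a list of equal-length aligned sequences."""
    columns = ["".join(col) for col in zip(*alignment)]
    return sum(site_entropy(c) for c in columns) / len(columns)


# Toy alignments: one fully conserved, one with every base at every site.
conserved = ["ACGT", "ACGT", "ACGT", "ACGT"]
scrambled = ["ACGT", "CGTA", "GTAC", "TACG"]

print(mean_site_entropy(conserved))
print(mean_site_entropy(scrambled))
```

In this toy case the conserved alignment has zero entropy at every site, while the scrambled one reaches the 2-bit maximum. A real test would also need a null distribution to decide when a locus counts as saturated, which this sketch does not attempt.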

Using target enrichment sequencing to study the higher-level phylogeny of the largest lichen-forming fungi family: Parmeliaceae (Ascomycota)

IMA Fungus, 2020, Vol 11 (1)
DOI: 10.1186/s43008-020-00051-x
Author(s): Felix Grewe, Claudio Ametrano, Todd J. Widhelm, Steven Leavitt, Isabel Distefano, ...
Keyword(s): Data Sets, Target Enrichment, Data Set, Reduced Genome, Genome Data, Worldwide Distribution, Phylogenetic Studies, Genome Scale, Scale Data, Promising Avenue

Abstract Parmeliaceae is the largest family of lichen-forming fungi with a worldwide distribution. We used a target enrichment data set and a qualitative selection method for 250 out of 350 genes to infer the phylogeny of the major clades in this family including 81 taxa, with both subfamilies and all seven major clades previously recognized in the subfamily Parmelioideae. The reduced genome-scale data set was analyzed using concatenated-based Bayesian inference and two different Maximum Likelihood analyses, and a coalescent-based species tree method. The resulting topology was strongly supported with the majority of nodes being fully supported in all three concatenated-based analyses. The two subfamilies and each of the seven major clades in Parmelioideae were strongly supported as monophyletic. In addition, most backbone relationships in the topology were recovered with high nodal support. The genus Parmotrema was found to be polyphyletic and consequently, it is suggested to accept the genus Crespoa to accommodate the species previously placed in Parmotrema subgen. Crespoa. This study demonstrates the power of reduced genome-scale data sets to resolve phylogenetic relationships with high support. Due to lower costs, target enrichment methods provide a promising avenue for phylogenetic studies including larger taxonomic/specimen sampling than whole genome data would allow.

Optimal Rates for Phylogenetic Inference and Experimental Design in the Era of Genome-Scale Data Sets

Systematic Biology, 2018, Vol 68 (1), pp. 145-156
DOI: 10.1093/sysbio/syy047
Cited By ~ 16
Author(s): Alex Dornburg, Zhuo Su, Jeffrey P Townsend
Keyword(s): Experimental Design, Phylogenetic Inference, Data Sets, Genome Scale, Scale Data

JTK_CYCLE: An Efficient Nonparametric Algorithm for Detecting Rhythmic Components in Genome-Scale Data Sets

Journal of Biological Rhythms, 2010, Vol 25 (5), pp. 372-380
DOI: 10.1177/0748730410379711
Cited By ~ 485
Author(s): Michael E. Hughes, John B. Hogenesch, Karl Kornacker
Keyword(s): Data Sets, Nonparametric Algorithm, Genome Scale, Scale Data

Impact of baseline clinical and radiological features on outcome of chronic rhinosinusitis in granulomatosis with polyangiitis

2020
DOI: 10.21203/rs.3.rs-67859/v2
Author(s): Sigrun Skaar Holme, Karin Kilian, Heidi B. Eggesbø, Jon Magnus Moen, Øyvind Molberg
Keyword(s): Chronic Rhinosinusitis, Granulomatosis With Polyangiitis, Negative Impact, Renal Involvement, Ct Scans, Data Sets, Data Set, Mucosal Disease, Life Threatening, Observation Period

Abstract

Background: Granulomatosis with polyangiitis (GPA) causes a recurring inflammation in nose and paranasal sinuses that clinically resembles chronic rhinosinusitis (CRS) of other aetiologies. While sinonasal inflammation is not among the life-threatening features of GPA, patients report it to have major negative impact on quality of life. A relatively large proportion of GPA patients have severe CRS with extensive damage to nose and sinus structures evident by CT, but risk factors for severe CRS development remain largely unknown. In this study, we aimed to identify clinical and radiological predictors of CRS-related damage in GPA.

Methods: We included GPA patients who had clinical data sets from time of diagnosis, and two or more paranasal sinus CT scans obtained ≥ 12 months apart available for analysis. We defined time from first to last CT as the study observation period, and evaluated CRS development across this period using CT scores for inflammatory sinus bone thickening (osteitis), bone destructions and sinus opacifications (here defined as mucosal disease). In logistic regression, we applied osteitis as main outcome measure for CRS-related damage.

Results: We evaluated 697 CT scans obtained over median 5 years observation from 116 GPA patients. We found that 39% (45/116) of the GPA patients remained free from CRS damage across the study observation period, while 33% (38/116) had progressive damage. By end of observation, 32% (37/116) of the GPA patients had developed severe osteitis. We identified mucosal disease at baseline as a predictor for osteitis (Odds Ratio 1.33), and we found that renal involvement at baseline was less common in patients with severe osteitis at last CT (41%, 15/37) than in patients with no osteitis (60%, 27/45).

Conclusions: In this largely unselected GPA patient cohort, baseline sinus mucosal disease associated with CRS-related damage, as measured by osteitis at end of follow-up. We found no significant association with clinical factors, but the data set indicated an inverse relationship between renal involvement and severe sinonasal affliction.

Discovery of pathways using multiple genome-scale data sets

2005, Vol 2005 (Fall)
DOI: 10.1240/sav_gbm_2005_h_001274
Author(s): Benno Schwikowski
Keyword(s): Data Sets, Multiple Genome, Genome Scale, Scale Data

Apple Disease Recognition Based on Small-scale Data Sets

Applied Engineering in Agriculture, 2021, Vol 37 (3), pp. 481-490
DOI: 10.13031/aea.14187
Author(s): Chenyong Song, Dongwei Wang, Haoran Bai, Weihao Sun
Keyword(s): Identification Accuracy, Model Complexity, Geometric Transformation, Small Scale, Model Parameters, Data Sets, List Type, Data Set, Disease Identification, Scale Data

Highlights: The proposed data enhancement method can be used for small-scale data sets with rich sample image features. The accuracy of the new model reaches 98.5%, which is better than the traditional CNN method.

Abstract: GoogLeNet offers far better performance in identifying apple disease compared to traditional methods. However, the complexity of GoogLeNet is relatively high. For small volumes of data, GoogLeNet does not achieve the same performance as it does with large-scale data. We propose a new apple disease identification model using GoogLeNet’s inception module. The model adopts a variety of methods to optimize its generalization ability. First, geometric transformation and image modification of data enhancement methods (including rotation, scaling, noise interference, random elimination, color space enhancement) and random probability and appropriate combination of strategies are used to amplify the data set. Second, we employ a deep convolution generative adversarial network (DCGAN) to enhance the richness of generated images by increasing the diversity of the noise distribution of the generator. Finally, we optimize the GoogLeNet model structure to reduce model complexity and model parameters, making it more suitable for identifying apple tree diseases. The experimental results show that our approach quickly detects and classifies apple diseases including rust, spotted leaf disease, and anthrax. It outperforms the original GoogLeNet in recognition accuracy and model size, with identification accuracy reaching 98.5%, making it a feasible method for apple disease classification.

Keywords: Apple disease identification, Data enhancement, DCGAN, GoogLeNet.
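The abstract names rotation and random elimination among its data enhancement steps but gives no implementation details. As a rough illustration only, the sketch below applies those two transformations to a tiny grid of pixel values in plain Python. The helper names `rotate90` and `random_erase`, the patch size, and the use of zero as the fill value are all assumptions for this example, not the authors' pipeline.

```python
import random


def rotate90(image):
    """Rotate a 2-D grid of pixel values 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def random_erase(image, size, rng):
    """Return a copy with a random size x size patch set to zero.

    This is a simple stand-in for the "random elimination" step; the fill
    value and the square patch shape are assumptions.
    """
    rows, cols = len(image), len(image[0])
    top = rng.randrange(rows - size + 1)
    left = rng.randrange(cols - size + 1)
    out = [row[:] for row in image]  # copy so the input is not modified
    for r in range(top, top + size):
        for c in range(left, left + size):
            out[r][c] = 0
    return out


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
rng = random.Random(0)  # seeded so the augmented set is reproducible

# One original image yields several training variants.
augmented = [image, rotate90(image), random_erase(image, 2, rng)]
for variant in augmented:
    print(variant)
```

Each transformation returns a new grid, so variants can be combined or chained to enlarge a small data set without altering the original sample.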

Landscape Influences on Stream Habitats and Biological Assemblages

Landscape Influences on Stream Habitats and Biological Assemblages, 2006
DOI: 10.47886/9781888569766.ch27
Keyword(s): Land Use, Point Source, Agricultural Land, Biological Assessment, Agricultural Land Use, Data Sets, Fine Scale, Data Set, Scale Data, Point Source Discharges

Abstract.—We used data sets of differing geographic extents and sampling intensities to examine how data structure affects the outcome of biological assessment. An intensive sampling (n = 97) of the Muskegon River basin provided our example of fine scale data, while two regional and statewide data sets (n = 276, 310) represented data sets of coarser geographic scales. We constructed significant multiple linear regression models (R² from 21% to 79%) to predict expected fish assemblage metrics (total fish, game fish, intolerant fish, and benthic fish species richness) and to regionally normalize Muskegon basin samples. We then examined the sensitivity of assessments based on each of five data sets with differing geographic extents to landscape stressors (urban and agricultural land use, dam density, and point source discharges). Assessment scores generated from the different data extents were significantly correlated and suggested that the Muskegon basin was generally in good condition. However, using coarser scale data extents to determine reference conditions resulted in greater sensitivity to land-use stressors (urban and agricultural land use). This was due in part to significant covariance between land use and drainage area in the fine scale data set. Our results show that the scale of data used to determine reference condition can significantly influence the results of a biological assessment. The training data sets with broader spatial range appeared to produce the most sensitive and accurate catchment assessment. A covariance structure analysis using a data set with broad spatial range suggested that impounded channels and point source discharges have the strongest negative effects on intolerant fish richness in the Muskegon River basin, which provides a focus for conservation, mitigation, and rehabilitation opportunities.

GeoBoost: An Incremental Deep Learning Approach toward Global Mapping of Buildings from VHR Remote Sensing Images

Remote Sensing, 2020, Vol 12 (11), pp. 1794
DOI: 10.3390/rs12111794
Author(s): Naisen Yang, Hong Tang
Keyword(s): Neural Networks, Convolutional Neural Networks, Large Scale, Semantic Segmentation, Data Sets, Data Set, Large Scale Data, Learning Tasks, Global Mapping, Scale Data

Modern convolutional neural networks (CNNs) are often trained on pre-set data sets with a fixed size. As for the large-scale applications of satellite images, for example, global or regional mappings, these images are collected incrementally by multiple stages in general. In other words, the sizes of training datasets might be increased for the tasks of mapping rather than be fixed beforehand. In this paper, we present a novel algorithm, called GeoBoost, for the incremental-learning tasks of semantic segmentation via convolutional neural networks. Specifically, the GeoBoost algorithm is trained in an end-to-end manner on the newly available data, and it does not decrease the performance of previously trained models. The effectiveness of the GeoBoost algorithm is verified on the large-scale data set of DREAM-B. This method avoids the need for training on the enlarged data set from scratch and would become more effective along with more available data.

Crime, corruption and the role of institutions

Indian Growth and Development Review, 2014, Vol 7 (1), pp. 73-95
DOI: 10.1108/igdr-11-2011-0040
Cited By ~ 2
Author(s): Ishita Chatterjee, Ranjan Ray
Keyword(s): Negative Impact, Probit Model, Data Sets, Older Individuals, Crime Victim, Data Set, Content Type, Country Level, Level Data

Purpose – There have been very few attempts in the economics literature to empirically study the link between criminal and corrupt behaviour due to lack of data sets on simultaneous information on both types of illegitimate activities. The paper aims to discuss these issues. Design/methodology/approach – The present study uses a large cross-country data set containing individual responses to questions on crime and corruption along with information on the respondents' characteristics. These micro-level data are supplemented by country-level macro and institutional indicators. A methodological contribution of this study is the estimation of an ordered probit model based on outcomes defined as combinations of crime and bribe victimisation. Findings – The authors find that: a crime victim is more likely to face bribe demands, males are more likely victims of corruption while females are of serious crime, older individuals and those living in the smaller towns are less exposed to crime and corruption, increasing levels of income and education increase the likelihood of crime and bribe victimisation to be reported and a stronger legal system and a happier society reduce both crime and corruption. However, the authors find no evidence of a strong and uniformly negative impact of either crime or corruption on a country's growth rate. Originality/value – This paper is, to the authors' knowledge, the first in the literature to explore the nexus between crime and corruption, their magnitudes, determinants and their effects on growth rates.
