Excluding loci with substitution saturation improves inferences from phylogenomic data

2021 ◽  
Author(s):  
David A Duchêne ◽  
Niklas Mather ◽  
Cara Van Der Wal ◽  
Simon Y W Ho

Abstract The historical signal in nucleotide sequences becomes eroded over time by substitutions occurring repeatedly at the same sites. This phenomenon, known as substitution saturation, is recognized as one of the primary obstacles to deep-time phylogenetic inference using genome-scale data sets. We present a new test of substitution saturation and demonstrate its performance in simulated and empirical data. For some of the 36 empirical phylogenomic data sets that we examined, we detect substitution saturation in around 50% of loci. We found that saturation tends to be flagged as problematic in loci with highly discordant phylogenetic signals across sites. Within each data set, the loci with smaller numbers of informative sites are more likely to be flagged as containing problematic levels of saturation. The entropy saturation test proposed here is sensitive to high evolutionary rates relative to the evolutionary timeframe, while also being sensitive to several factors known to mislead phylogenetic inference, including short internal branches relative to external branches, short nucleotide sequences, and tree imbalance. Our study demonstrates that excluding loci with substitution saturation can be an effective means of mitigating the negative impact of multiple substitutions on phylogenetic inferences. [Phylogenetic model performance; phylogenomics; substitution model; substitution saturation; test statistics.]
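The abstract does not spell out how the entropy saturation test is computed, so the following is only a minimal sketch of the underlying quantity, per-site Shannon entropy of an alignment, and not the authors' test statistic. The function names and the toy alignment are hypothetical. A column in which all four bases are equally frequent has the maximum entropy of 2 bits, which is the value that heavily saturated sites would be expected to approach when base frequencies are equal.

```python
import math
from collections import Counter


def site_entropy(column):
    """Shannon entropy (in bits) of the nucleotides in one alignment column."""
    # Ignore gaps and ambiguity codes; this is a simplifying assumption.
    bases = [b for b in column.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    counts = Counter(bases)
    total = len(bases)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mean_site_entropy(alignment):
    """Mean per-site entropy of an alignment given as equal-length strings."""
    columns = ["".join(col) for col in zip(*alignment)]
    return sum(site_entropy(col) for col in columns) / len(columns)


# Hypothetical toy alignment: four taxa, four sites.
toy_alignment = ["ACGT", "ACGA", "ACTC", "AAGG"]
print(mean_site_entropy(toy_alignment))
```

A locus-level decision would still need a reference distribution for this value (for example, from simulation under saturation), which the sketch does not attempt.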


IMA Fungus ◽  
2020 ◽  
Vol 11 (1) ◽  
Author(s):  
Felix Grewe ◽  
Claudio Ametrano ◽  
Todd J. Widhelm ◽  
Steven Leavitt ◽  
Isabel Distefano ◽  
...  

Abstract: Parmeliaceae is the largest family of lichen-forming fungi with a worldwide distribution. We used a target enrichment data set and a qualitative selection method for 250 out of 350 genes to infer the phylogeny of the major clades in this family including 81 taxa, with both subfamilies and all seven major clades previously recognized in the subfamily Parmelioideae. The reduced genome-scale data set was analyzed using concatenated-based Bayesian inference and two different Maximum Likelihood analyses, and a coalescent-based species tree method. The resulting topology was strongly supported with the majority of nodes being fully supported in all three concatenated-based analyses. The two subfamilies and each of the seven major clades in Parmelioideae were strongly supported as monophyletic. In addition, most backbone relationships in the topology were recovered with high nodal support. The genus Parmotrema was found to be polyphyletic and consequently, it is suggested to accept the genus Crespoa to accommodate the species previously placed in Parmotrema subgen. Crespoa. This study demonstrates the power of reduced genome-scale data sets to resolve phylogenetic relationships with high support. Due to lower costs, target enrichment methods provide a promising avenue for phylogenetic studies including larger taxonomic/specimen sampling than whole genome data would allow.


2010 ◽  
Vol 25 (5) ◽  
pp. 372-380 ◽  
Author(s):  
Michael E. Hughes ◽  
John B. Hogenesch ◽  
Karl Kornacker

2020 ◽  
Author(s):  
Sigrun Skaar Holme ◽  
Karin Kilian ◽  
Heidi B. Eggesbø ◽  
Jon Magnus Moen ◽  
Øyvind Molberg

Abstract Background: Granulomatosis with polyangiitis (GPA) causes a recurring inflammation in the nose and paranasal sinuses that clinically resembles chronic rhinosinusitis (CRS) of other aetiologies. While sinonasal inflammation is not among the life-threatening features of GPA, patients report it to have a major negative impact on quality of life. A relatively large proportion of GPA patients have severe CRS with extensive damage to nose and sinus structures evident by CT, but risk factors for severe CRS development remain largely unknown. In this study, we aimed to identify clinical and radiological predictors of CRS-related damage in GPA. Methods: We included GPA patients who had clinical data sets from time of diagnosis, and two or more paranasal sinus CT scans obtained ≥ 12 months apart available for analysis. We defined time from first to last CT as the study observation period, and evaluated CRS development across this period using CT scores for inflammatory sinus bone thickening (osteitis), bone destructions and sinus opacifications (here defined as mucosal disease). In logistic regression, we applied osteitis as the main outcome measure for CRS-related damage. Results: We evaluated 697 CT scans obtained over a median of 5 years of observation from 116 GPA patients. We found that 39% (45/116) of the GPA patients remained free from CRS damage across the study observation period, while 33% (38/116) had progressive damage. By end of observation, 32% (37/116) of the GPA patients had developed severe osteitis. We identified mucosal disease at baseline as a predictor for osteitis (Odds Ratio 1.33), and we found that renal involvement at baseline was less common in patients with severe osteitis at last CT (41%, 15/37) than in patients with no osteitis (60%, 27/45). Conclusions: In this largely unselected GPA patient cohort, baseline sinus mucosal disease was associated with CRS-related damage, as measured by osteitis at end of follow-up. We found no significant association with clinical factors, but the data set indicated an inverse relationship between renal involvement and severe sinonasal affliction.
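The inverse relationship reported for renal involvement can be illustrated with a crude, unadjusted odds ratio computed directly from the counts in the abstract (15/37 versus 27/45). This is a sketch for orientation only: the helper name is hypothetical, and the result is not the regression estimate from the study, whose model and covariates the abstract does not describe.

```python
def odds_ratio(exposed_a, total_a, exposed_b, total_b):
    """Crude odds ratio of exposure in group A relative to group B."""
    odds_a = exposed_a / (total_a - exposed_a)
    odds_b = exposed_b / (total_b - exposed_b)
    return odds_a / odds_b


# Counts taken from the abstract: renal involvement at baseline in patients
# with severe osteitis (15 of 37) versus patients with no osteitis (27 of 45).
renal_or = odds_ratio(15, 37, 27, 45)
print(f"Crude odds ratio: {renal_or:.2f}")  # a value below 1 indicates an inverse association

# The reported percentages can be checked the same way.
print(round(100 * 15 / 37), round(100 * 27 / 45))
```

A crude ratio like this ignores confounders and gives no confidence interval, so it says nothing about statistical significance.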


2021 ◽  
Vol 37 (3) ◽  
pp. 481-490
Author(s):  
Chenyong Song ◽  
Dongwei Wang ◽  
Haoran Bai ◽  
Weihao Sun

Highlights: The proposed data enhancement method can be used for small-scale data sets with rich sample image features. The accuracy of the new model reaches 98.5%, which is better than the traditional CNN method.

Abstract: GoogLeNet offers far better performance in identifying apple disease compared to traditional methods. However, the complexity of GoogLeNet is relatively high. For small volumes of data, GoogLeNet does not achieve the same performance as it does with large-scale data. We propose a new apple disease identification model using GoogLeNet’s inception module. The model adopts a variety of methods to optimize its generalization ability. First, geometric transformation and image modification of data enhancement methods (including rotation, scaling, noise interference, random elimination, color space enhancement) and random probability and appropriate combination of strategies are used to amplify the data set. Second, we employ a deep convolution generative adversarial network (DCGAN) to enhance the richness of generated images by increasing the diversity of the noise distribution of the generator. Finally, we optimize the GoogLeNet model structure to reduce model complexity and model parameters, making it more suitable for identifying apple tree diseases. The experimental results show that our approach quickly detects and classifies apple diseases including rust, spotted leaf disease, and anthrax. It outperforms the original GoogLeNet in recognition accuracy and model size, with identification accuracy reaching 98.5%, making it a feasible method for apple disease classification. Keywords: Apple disease identification, Data enhancement, DCGAN, GoogLeNet.
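The probabilistic combination of augmentations mentioned in the abstract might look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' pipeline: images are plain 2D lists of grayscale values in 0 to 255, only three of the listed transforms are shown (rotation, noise interference, random elimination), and all function names, the noise level, and the probability `p` are hypothetical.

```python
import random


def rotate90(image):
    """Rotate a 2D image (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def add_noise(image, sigma, rng):
    """Add Gaussian noise to every pixel, clamping the result to 0..255."""
    return [[min(255, max(0, round(p + rng.gauss(0, sigma)))) for p in row]
            for row in image]


def random_erase(image, height, width, rng, fill=0):
    """Overwrite a randomly placed height x width patch with a fill value."""
    rows, cols = len(image), len(image[0])
    top = rng.randrange(rows - height + 1)
    left = rng.randrange(cols - width + 1)
    out = [row[:] for row in image]
    for r in range(top, top + height):
        for c in range(left, left + width):
            out[r][c] = fill
    return out


def augment(image, rng, p=0.5):
    """Apply each transform independently with probability p."""
    out = [row[:] for row in image]
    if rng.random() < p:
        out = rotate90(out)
    if rng.random() < p:
        out = add_noise(out, 10.0, rng)
    if rng.random() < p:
        out = random_erase(out, 1, 1, rng)
    return out


# Generate several augmented variants of one hypothetical 3x3 image.
rng = random.Random(0)
image = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
variants = [augment(image, rng) for _ in range(4)]
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible, which helps when comparing models trained on the same enlarged data set.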


Abstract: We used data sets of differing geographic extents and sampling intensities to examine how data structure affects the outcome of biological assessment. An intensive sampling (n = 97) of the Muskegon River basin provided our example of fine scale data, while two regional and statewide data sets (n = 276, 310) represented data sets of coarser geographic scales. We constructed significant multiple linear regression models (R² from 21% to 79%) to predict expected fish assemblage metrics (total fish, game fish, intolerant fish, and benthic fish species richness) and to regionally normalize Muskegon basin samples. We then examined the sensitivity of assessments based on each of five data sets with differing geographic extents to landscape stressors (urban and agricultural land use, dam density, and point source discharges). Assessment scores generated from the different data extents were significantly correlated and suggested that the Muskegon basin was generally in good condition. However, using coarser scale data extents to determine reference conditions resulted in greater sensitivity to land-use stressors (urban and agricultural land use). This was due in part to significant covariance between land use and drainage area in the fine scale data set. Our results show that the scale of data used to determine reference condition can significantly influence the results of a biological assessment. The training data sets with broader spatial range appeared to produce the most sensitive and accurate catchment assessment. A covariance structure analysis using a data set with broad spatial range suggested that impounded channels and point source discharges have the strongest negative effects on intolerant fish richness in the Muskegon River basin, which provides a focus for conservation, mitigation, and rehabilitation opportunities.


2020 ◽  
Vol 12 (11) ◽  
pp. 1794
Author(s):  
Naisen Yang ◽  
Hong Tang

Modern convolutional neural networks (CNNs) are often trained on pre-set data sets with a fixed size. As for the large-scale applications of satellite images, for example, global or regional mappings, these images are collected incrementally by multiple stages in general. In other words, the sizes of training datasets might be increased for the tasks of mapping rather than be fixed beforehand. In this paper, we present a novel algorithm, called GeoBoost, for the incremental-learning tasks of semantic segmentation via convolutional neural networks. Specifically, the GeoBoost algorithm is trained in an end-to-end manner on the newly available data, and it does not decrease the performance of previously trained models. The effectiveness of the GeoBoost algorithm is verified on the large-scale data set of DREAM-B. This method avoids the need for training on the enlarged data set from scratch and would become more effective along with more available data.


2014 ◽  
Vol 7 (1) ◽  
pp. 73-95 ◽  
Author(s):  
Ishita Chatterjee ◽  
Ranjan Ray

Purpose – There have been very few attempts in the economics literature to empirically study the link between criminal and corrupt behaviour due to lack of data sets on simultaneous information on both types of illegitimate activities. The paper aims to discuss these issues. Design/methodology/approach – The present study uses a large cross-country data set containing individual responses to questions on crime and corruption along with information on the respondents' characteristics. These micro-level data are supplemented by country-level macro and institutional indicators. A methodological contribution of this study is the estimation of an ordered probit model based on outcomes defined as combinations of crime and bribe victimisation. Findings – The authors find that: a crime victim is more likely to face bribe demands, males are more likely victims of corruption while females are of serious crime, older individuals and those living in the smaller towns are less exposed to crime and corruption, increasing levels of income and education increase the likelihood of crime and bribe victimisation to be reported and a stronger legal system and a happier society reduce both crime and corruption. However, the authors find no evidence of a strong and uniformly negative impact of either crime or corruption on a country's growth rate. Originality/value – This paper is, to the authors' knowledge, the first in the literature to explore the nexus between crime and corruption, their magnitudes, determinants and their effects on growth rates.

