Gene Similarity Networks Unveil a Potential Novel Unicellular Group Closely Related to Animals from the Tara Oceans Expedition

2020
Vol 12 (9)
pp. 1664-1678
Author(s):
Alicia S Arroyo
Romain Iannes
Eric Bapteste
Iñaki Ruiz-Trillo

Abstract The Holozoa clade comprises animals and several unicellular lineages (choanoflagellates, filastereans, and teretosporeans). Understanding their full diversity is essential to address the origins of animals and other evolutionary questions. However, these unicellular lineages remain poorly known. To provide more insight into the real diversity of holozoans and to check for undiscovered diversity, we here analyzed 18S rDNA metabarcoding data from the global Tara Oceans expedition. To overcome the low phylogenetic information contained in the metabarcoding data set (composed of sequences from the short V9 region of the gene), we used similarity networks, combining two data sets: unknown environmental sequences from Tara Oceans and known reference sequences from GenBank. We then calculated network metrics to compare environmental sequences with reference sequences. These metrics reflected the divergence between both types of sequences and provided an effective way to search for evolutionarily relevant diversity, further validated by phylogenetic placements. Our results showed that a considerable percentage of unicellular holozoan diversity remains hidden. We found novelties in several lineages, especially in Acanthoecida choanoflagellates. We also identified a potential new holozoan group that could not be assigned to any of the described extant clades. Data on geographical distribution showed that, although ubiquitous, each unicellular holozoan lineage exhibits a different distribution pattern. We also identified a positive association between new animal hosts and the ichthyosporean symbiont Creolimax fragrantissima, as well as for other holozoans previously reported as free-living. Overall, our analyses provide a fresh perspective on the diversity and ecology of unicellular holozoans, highlighting the amount of undescribed diversity.
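
The network step can be made concrete with a small sketch. This is illustrative Python only: the sequence names, identity scores, and the 85% edge threshold are invented, and the paper's own pipeline (all-versus-all sequence comparison followed by network metrics) is not reproduced here.

```python
import networkx as nx

# Hypothetical pairwise percent identities between V9 sequences
# (env_* = environmental Tara Oceans reads, ref_* = GenBank references).
identities = {
    ("env_1", "ref_choanoflagellate"): 97.5,
    ("env_2", "ref_filasterean"): 88.0,
    ("env_2", "env_3"): 95.0,
    ("env_3", "ref_ichthyosporean"): 86.5,
}

THRESHOLD = 85.0  # illustrative identity cutoff for drawing an edge

G = nx.Graph()
for (a, b), ident in identities.items():
    if ident >= THRESHOLD:
        G.add_edge(a, b, weight=ident)

# One simple divergence proxy: for each environmental node, the identity
# of its best hit to any reference sequence. Low values flag candidate
# novel diversity worth phylogenetic placement.
for node in G:
    if node.startswith("env_"):
        ref_hits = [G[node][nbr]["weight"]
                    for nbr in G[node] if nbr.startswith("ref_")]
        best = max(ref_hits) if ref_hits else None
        print(node, "best identity to a reference:", best)
```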


Paleobiology
2010
Vol 36 (2)
pp. 253-282
Author(s):
Philip D. Mannion
Paul Upchurch

Both the body fossils and trackways of sauropod dinosaurs indicate that they inhabited a range of inland and coastal environments during their 160-Myr evolutionary history. Quantitative paleoecological analyses of a large data set of sauropod occurrences reveal a statistically significant positive association between non-titanosaurs and coastal environments, and between titanosaurs and inland environments. Similarly, “narrow-gauge” trackways are positively associated with coastal environments and “wide-gauge” trackways are associated with inland environments. The statistical support for these associations suggests that this is a genuine ecological signal: non-titanosaur sauropods preferred coastal environments such as carbonate platforms, whereas titanosaurs preferred inland environments such as fluvio-lacustrine systems. These results remain robust when the data set is time sliced and jackknifed in various ways. When the analyses are repeated using the more inclusive groupings of titanosauriforms and Macronaria, the signal is weakened or lost. These results reinforce the hypothesis that “wide-gauge” trackways were produced by titanosaurs. It is commonly assumed that the trackway and body fossil records will give different results, with the former providing a more reliable guide to the habitats occupied by extinct organisms because footprints are produced during life, whereas carcasses can be transported to different environments prior to burial. However, this view is challenged by our observation that separate body fossil and trackway data sets independently support the same conclusions regarding environmental preferences in sauropod dinosaurs. Similarly, analyzing localities and individuals independently results in the same environmental associations. We demonstrate that conclusions about environmental patterns among fossil taxa can be highly sensitive to an investigator's choices regarding analytical protocols. In particular, decisions regarding the taxonomic groupings used for comparison, the time range represented by the data set, and the criteria used to identify the number of localities can all have a marked effect on conclusions regarding the existence and nature of putative environmental associations. We recommend that large data sets be explored for such associations at a variety of different taxonomic and temporal scales.
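
The abstract does not state which statistical test underlies the reported associations; as a hedged illustration of how a taxon-environment association of this kind is commonly assessed, here is a minimal 2x2 contingency analysis with Fisher's exact test, using invented counts.

```python
from scipy.stats import fisher_exact

# Hypothetical counts of sauropod body-fossil localities:
#                  coastal  inland
# non-titanosaur      45      20
# titanosaur          15      50
table = [[45, 20],
         [15, 50]]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.4g}")
# A small p-value supports a non-random association between
# taxonomic group and depositional environment.
```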



2016
Author(s):
Grasiela Casas
Vinicius A.G. Bastazini
Vanderlei J. Debastiani
Valério D. Pillar

Abstract Sampling the full diversity of interactions in an ecological community is a highly intensive effort. Recent studies have demonstrated that many network metrics are sensitive to both sampling effort and network size. Here, we develop a statistical framework, based on bootstrap resampling, that aims to assess sampling sufficiency for some of the most widely used metrics in network ecology, namely connectance, nestedness (NODF, nested overlap and decreasing fill) and modularity (using the QuaBiMo algorithm). Our framework can generate confidence intervals for each network metric with increasing sample size (i.e., the number of sampled interaction events, or number of sampled individuals), which can be used to evaluate sampling sufficiency. The sample is considered sufficient when the confidence limits reach stability or lie within an acceptable level of precision for the aims of the study. We illustrate our framework with data from three quantitative networks of plants and frugivorous birds, varying in size from 16 to 115 species and 17 to 2,745 interactions. These data sets illustrate that, for the same data set, sampling sufficiency may be reached at different sample sizes depending on the metric of interest. The bootstrap confidence limits reached stability for the two largest networks, but were wide and unstable with increasing sample size for all three metrics estimated from the smallest network. The bootstrap method is useful for empirical ecologists to indicate the minimum number of interactions necessary to reach sampling sufficiency for a specific network metric. It is also useful for comparing network sampling techniques in their capacity to reach sampling sufficiency. Our method is general enough to be applied to different types of metrics and networks.
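
The core bootstrap loop is straightforward to sketch. The following is a minimal illustration for connectance only, with a synthetic event list; the nestedness (NODF) and modularity (QuaBiMo) computations from the paper are omitted.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical list of observed interaction events (plant, bird).
events = [(p, b) for p in range(10) for b in range(8) if (p + b) % 3]
events = rng.permutation(np.array(events))

def connectance(sample):
    plants = np.unique(sample[:, 0])
    birds = np.unique(sample[:, 1])
    links = {tuple(row) for row in sample}
    return len(links) / (len(plants) * len(birds))

# Bootstrap CI of connectance as the number of sampled events grows.
for n in (10, 25, 50, len(events)):
    boot = [connectance(events[rng.integers(0, n, size=n)])
            for _ in range(999)]
    lo, hi = np.percentile(boot, [2.5, 97.5])
    print(f"n={n:3d}  95% CI for connectance: [{lo:.2f}, {hi:.2f}]")
# Sampling is judged sufficient once the interval stabilizes or is
# narrow enough for the aims of the study.
```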



Database
2020
Vol 2020
Author(s):
Carlota Cardoso
Rita T Sousa
Sebastian Köhler
Catia Pesquita

Abstract The ability to compare entities within a knowledge graph is a cornerstone technique for several applications, ranging from the integration of heterogeneous data to machine learning. It is of particular importance in the biomedical domain, where semantic similarity can be applied to the prediction of protein–protein interactions, associations between diseases and genes, cellular localization of proteins, among others. In recent years, several knowledge graph-based semantic similarity measures have been developed, but building a gold standard data set to support their evaluation is non-trivial. We present a collection of 21 benchmark data sets that aim at circumventing the difficulties in building benchmarks for large biomedical knowledge graphs by exploiting proxies for biomedical entity similarity. These data sets include data from two successful biomedical ontologies, Gene Ontology and Human Phenotype Ontology, and explore proxy similarities calculated based on protein sequence similarity, protein family similarity, protein–protein interactions and phenotype-based gene similarity. Data sets have varying sizes and cover four different species at different levels of annotation completion. For each data set, we also provide semantic similarity computations with state-of-the-art representative measures. Database URL: https://github.com/liseda-lab/kgsim-benchmark.
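
Evaluation against such benchmarks typically correlates a measure's scores with the proxy similarity. Here is a minimal sketch under that assumption, with invented protein pairs and scores; the benchmark's actual file format is not assumed.

```python
from scipy.stats import spearmanr

# Hypothetical protein pairs with a proxy similarity (e.g., normalized
# sequence similarity) and a KG-based semantic similarity score.
pairs = [
    ("P12345", "P67890", 0.82, 0.75),
    ("P11111", "P22222", 0.40, 0.52),
    ("P33333", "P44444", 0.10, 0.18),
    ("P55555", "P66666", 0.95, 0.88),
]

proxy = [p[2] for p in pairs]
semantic = [p[3] for p in pairs]

rho, p_value = spearmanr(proxy, semantic)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A higher rank correlation indicates the semantic similarity measure
# better recovers the proxy notion of biomedical entity similarity.
```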



2013
Vol 22 (05)
pp. 1360001
Author(s):
Xiaocong Fan
Meng Su

Diffusion geometry offers a fresh perspective on multi-scale information analysis, which is critical to multiagent systems that need to process massive data sets. A recent study has shown that when the "diffusion distance" concept is applied to human decision experiences, its performance on solution synthesis can be significantly better than using Euclidean distance. However, as a data set expands over time, it can quickly exceed the processing capacity of a single agent. In this paper, we propose a multi-agent diffusion approach in which a massive data set is split into several subsets and each diffusion agent only needs to work with one subset in the diffusion computation. We conducted experiments with different splitting strategies applied to a set of decision experiences. The results indicate that the multi-agent diffusion approach is beneficial, and that it is even possible to benefit from using a larger group of diffusion agents if their subsets have common and pairwise-shared experiences. Our study also shows that system performance can be affected significantly by the splitting granularity (the size of each splitting unit). This study paves the way for applying the multi-agent diffusion approach to massive data analysis.
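
As background, the diffusion-distance computation on a single (unsplit) data set can be sketched as follows; the Gaussian kernel bandwidth, diffusion time, and number of retained eigenvectors are illustrative choices, and the multi-agent splitting itself is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))           # hypothetical decision experiences

# Gaussian affinity and row-normalized Markov matrix.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
eps = np.median(sq)                     # illustrative bandwidth
K = np.exp(-sq / eps)
P = K / K.sum(axis=1, keepdims=True)

# Spectral decomposition gives diffusion coordinates.
vals, vecs = np.linalg.eig(P)
order = np.argsort(-vals.real)
vals, vecs = vals.real[order], vecs.real[:, order]

t = 2                                   # diffusion time
coords = vecs[:, 1:4] * (vals[1:4] ** t)   # skip the trivial eigenvector

# Diffusion distance between items i and j is the Euclidean distance
# between their diffusion coordinates.
i, j = 0, 1
print("diffusion distance:", np.linalg.norm(coords[i] - coords[j]))
```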



2018
Vol 154 (2)
pp. 149-155
Author(s):
Michael Archer

1. Yearly records of worker Vespula germanica (Fabricius) taken in suction traps at Silwood Park (28 years) and at Rothamsted Research (39 years) are examined. 2. Using the autocorrelation function (ACF), a significant negative 1-year lag followed by a lesser, non-significant positive 2-year lag was found in all, or parts of, each data set, indicating an underlying population dynamic of a 2-year cycle with a damped waveform. 3. The minimum number of years of records before the 2-year cycle with damped waveform appeared varied between 17 and 26, or the cycle was not found at all in some data sets. 4. Ecological factors delaying or preventing the occurrence of the 2-year cycle are considered.
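
A minimal sketch of the ACF diagnostic described above, applied to a synthetic yearly count series with a damped 2-year alternation (not the Silwood Park or Rothamsted data):

```python
import numpy as np
from statsmodels.tsa.stattools import acf

rng = np.random.default_rng(1)
# Synthetic yearly worker counts: damped 2-year alternation plus noise.
t = np.arange(39)
counts = 100 + 30 * (-1.0) ** t * np.exp(-t / 20) + rng.normal(0, 5, len(t))

r = acf(counts, nlags=5)
bound = 1.96 / np.sqrt(len(counts))   # approximate 5% significance bound
for lag in range(1, 6):
    flag = "*" if abs(r[lag]) > bound else " "
    print(f"lag {lag}: r = {r[lag]:+.2f} {flag}")
# A significant negative lag-1 followed by a positive lag-2 is the
# signature of a 2-year cycle with a damped waveform.
```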



2018
Vol 21 (2)
pp. 117-124
Author(s):
Bakhtyar Sepehri
Nematollah Omidikia
Mohsen Kompany-Zareh
Raouf Ghavami

Aims & Scope: In this research, 8 variable selection approaches were used to investigate the effect of variable selection on the predictive power and stability of CoMFA models. Materials & Methods: Three data sets, comprising 36 EPAC antagonists, 79 CD38 inhibitors and 57 ATAD2 bromodomain inhibitors, were modelled by CoMFA. First, for each of the three data sets, a CoMFA model was created with all CoMFA descriptors; a new CoMFA model was then developed by applying each variable selection method, so 9 CoMFA models were built per data set. The results show that noisy and uninformative variables affect CoMFA results. Based on the created models, applying 5 variable selection approaches, namely FFD, SRD-FFD, IVE-PLS, SRD-UVE-PLS and SPA-jackknife, increases the predictive power and stability of CoMFA models significantly. Results & Conclusion: Among them, SPA-jackknife removes most of the variables, while FFD retains most of them. FFD and IVE-PLS are time-consuming processes, while SRD-FFD and SRD-UVE-PLS run in a few seconds. Applying FFD, SRD-FFD, IVE-PLS or SRD-UVE-PLS also preserves CoMFA contour map information for both fields.
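
The predictive power of CoMFA models is conventionally summarized by the cross-validated q²; the abstract does not give its exact protocol, so the following is only a hedged sketch of a leave-one-out q² for a PLS model on invented descriptor data, not the authors' workflow.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(7)
X = rng.normal(size=(36, 50))          # hypothetical field descriptors
y = X[:, :3].sum(axis=1) + rng.normal(0, 0.5, 36)   # hypothetical activity

press, ss = 0.0, ((y - y.mean()) ** 2).sum()
for train, test in LeaveOneOut().split(X):
    model = PLSRegression(n_components=3).fit(X[train], y[train])
    pred = model.predict(X[test]).ravel()
    press += ((y[test] - pred) ** 2).sum()

q2 = 1 - press / ss
print(f"leave-one-out q2 = {q2:.3f}")
# Comparing q2 before and after a variable-selection step is one way to
# quantify the gain in predictive power the abstract describes.
```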



Author(s):  
Kyungkoo Jun

Background & Objective: This paper proposes a Fourier-transform-inspired method to classify human activities from time series sensor data. Methods: Our method begins by decomposing a 1D input signal into 2D patterns, which is motivated by the Fourier conversion. The decomposition is aided by a Long Short-Term Memory (LSTM) network, which captures the temporal dependency in the signal and produces encoded sequences. The sequences, once arranged into a 2D array, can represent fingerprints of the signals. The benefit of such a transformation is that we can exploit recent advances in deep learning models for image classification, such as the Convolutional Neural Network (CNN). Results: The proposed model is therefore a combination of LSTM and CNN. We evaluate the model on two data sets. For the first data set, which is more standardized than the other, our model outperforms or at least equals previous work. For the second data set, we devise schemes to generate training and testing data by varying the window size, the sliding size, and the labeling scheme. Conclusion: The evaluation results show that the accuracy is over 95% in some cases. We also analyze the effect of these parameters on performance.
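
A minimal PyTorch sketch of the LSTM-then-CNN idea: the window length, hidden size, and the reshape into a square "fingerprint" are illustrative guesses, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class LSTMCNN(nn.Module):
    """Encode a 1D sensor window with an LSTM, reshape the encoded
    sequence into a 2D 'fingerprint', then classify it with a CNN."""
    def __init__(self, n_classes=6, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden,
                            batch_first=True)
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
            nn.Flatten(),
            nn.Linear(32 * 4 * 4, n_classes),
        )

    def forward(self, x):                 # x: (batch, seq_len, 1)
        encoded, _ = self.lstm(x)         # (batch, seq_len, hidden)
        image = encoded.unsqueeze(1)      # (batch, 1, seq_len, hidden)
        return self.cnn(image)

model = LSTMCNN()
window = torch.randn(8, 64, 1)            # 8 windows of 64 samples each
print(model(window).shape)                # torch.Size([8, 6])
```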



2019
Vol 73 (8)
pp. 893-901
Author(s):
Sinead J. Barton
Bryan M. Hennelly

Cosmic ray artifacts may be present in all photo-electric readout systems. In spectroscopy, they present as random unidirectional sharp spikes that distort spectra and may affect post-processing, possibly altering the results of multivariate statistical classification. A number of methods have previously been proposed to remove cosmic ray artifacts from spectra, but the goal of removing the artifacts while making no other change to the underlying spectrum is challenging. One of the most successful and commonly applied methods for the removal of cosmic ray artifacts involves the capture of two sequential spectra that are compared in order to identify spikes. The disadvantage of this approach is that at least two recordings are necessary, which may be problematic for dynamically changing spectra, and which can reduce the signal-to-noise (S/N) ratio when compared with a single recording of equivalent duration due to the inclusion of two instances of read noise. In this paper, a cosmic ray artifact removal algorithm is proposed that works in a similar way to the double acquisition method but requires only a single capture, so long as a data set of similar spectra is available. The method employs normalized covariance in order to identify a similar spectrum in the data set, from which a direct comparison reveals the presence of cosmic ray artifacts, which are then replaced with the corresponding values from the matching spectrum. The advantage of the proposed method over the double acquisition method is investigated in the context of the S/N ratio and is applied to various data sets of Raman spectra recorded from biological cells.
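
The matching-and-replacement step can be sketched directly. This is a minimal version, assuming the library of similar spectra is available as rows of an array; the spike threshold is illustrative, not the paper's.

```python
import numpy as np

def remove_cosmic_rays(spectrum, library, threshold=5.0):
    """Replace spike pixels in `spectrum` using the most similar
    spectrum in `library`, selected by normalized covariance
    (Pearson correlation)."""
    scores = [np.corrcoef(spectrum, ref)[0, 1] for ref in library]
    ref = library[int(np.argmax(scores))]

    # Unidirectional spikes show up as large positive residuals
    # against the matching spectrum.
    residual = spectrum - ref
    spikes = residual > threshold * residual.std()

    cleaned = spectrum.copy()
    cleaned[spikes] = ref[spikes]      # substitute values from the match
    return cleaned

rng = np.random.default_rng(3)
base = np.sin(np.linspace(0, 6, 500)) + 2
library = base + rng.normal(0, 0.02, (20, 500))   # similar spectra
spectrum = base + rng.normal(0, 0.02, 500)
spectrum[123] += 15.0                              # injected cosmic ray
cleaned = remove_cosmic_rays(spectrum, library)
print("spike removed:", abs(cleaned[123] - base[123]) < 1.0)
```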



2013
Vol 756-759
pp. 3652-3658
Author(s):
You Li Lu
Jun Luo

Within the framework of kernel methods, this paper puts forward two improved algorithms, called R-SVM and I-SVDD, to cope with imbalanced data sets in closed systems. R-SVM uses the K-means algorithm to cluster samples in feature space, while I-SVDD improves the performance of the original SVDD through imbalanced sample training. Experiments on two system-call data sets show that the two algorithms are more effective, and that R-SVM has lower complexity.
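
The abstract gives few algorithmic details; on one plausible reading of R-SVM, K-means compresses the majority class before SVM training. The following scikit-learn sketch implements that reading with synthetic data and should not be taken as the authors' exact algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(5)
# Hypothetical imbalanced data: 1000 normal calls vs 40 anomalous ones.
X_major = rng.normal(0, 1, (1000, 10))
X_minor = rng.normal(2, 1, (40, 10))

# Compress the majority class to cluster centers to rebalance training.
centers = KMeans(n_clusters=40, n_init=10, random_state=0) \
    .fit(X_major).cluster_centers_

X_train = np.vstack([centers, X_minor])
y_train = np.array([0] * len(centers) + [1] * len(X_minor))

clf = SVC(kernel="rbf").fit(X_train, y_train)
print("predictions on new anomalous-like points:",
      clf.predict(rng.normal(2, 1, (5, 10))))
```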



2021
Vol 8 (1)
Author(s):
Yahya Albalawi
Jim Buckley
Nikola S. Nikolov

Abstract This paper presents a comprehensive evaluation of data pre-processing and word embedding techniques in the context of Arabic document classification in the domain of health-related communication on social media. We evaluate 26 text pre-processing techniques applied to Arabic tweets within the process of training a classifier to identify health-related tweets. For this task we use the (traditional) machine learning classifiers KNN, SVM, Multinomial NB and Logistic Regression. Furthermore, we report experimental results with the deep learning architectures BLSTM and CNN for the same text classification problem. Since word embeddings are more typically used as the input layer in deep networks, in the deep learning experiments we evaluate several state-of-the-art pre-trained word embeddings with the same text pre-processing applied. To achieve these goals, we use two data sets: one for both training and testing, and another for testing the generality of our models only. Our results point to the conclusion that only four out of the 26 pre-processing techniques improve the classification accuracy significantly. For the first data set of Arabic tweets, we found that Mazajak CBOW pre-trained word embeddings as the input to a BLSTM deep network led to the most accurate classifier, with an F1 score of 89.7%. For the second data set, Mazajak Skip-Gram pre-trained word embeddings as the input to a BLSTM led to the most accurate model, with an F1 score of 75.2% and accuracy of 90.7%, compared to an F1 score of 90.8% but a lower accuracy of 70.89% achieved by Mazajak CBOW with the same architecture. Our results also show that the performance of the best traditional classifier we trained is comparable to that of the deep learning methods on the first data set, but significantly worse on the second data set.
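
A minimal Keras sketch of the embedding-plus-BLSTM setup: the vocabulary size, sequence length, and the embedding matrix stand in for vectors that would be loaded from a pre-trained model such as Mazajak; all of these are placeholders, not the paper's configuration.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

vocab_size, embed_dim, max_len = 5000, 300, 40   # illustrative sizes

# Placeholder embedding matrix; in practice the rows would be loaded
# from pre-trained vectors (e.g., Mazajak CBOW or Skip-Gram).
embedding_matrix = np.random.normal(size=(vocab_size, embed_dim))

inputs = keras.Input(shape=(max_len,))
x = layers.Embedding(
    vocab_size, embed_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,                      # keep pre-trained vectors fixed
)(inputs)
x = layers.Bidirectional(layers.LSTM(128))(x)
outputs = layers.Dense(1, activation="sigmoid")(x)  # health-related or not

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```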


