The potential of family-free rearrangements towards gene orthology inference

Author(s):  
Diego P. Rubert ◽  
Daniel Doerr ◽  
Marília D. V. Braga

Recently, we proposed an efficient ILP formulation [Rubert DP, Martinez FV, Braga MDV, Natural family-free genomic distance, Algorithms Mol Biol 16:4, 2021] for exactly computing the rearrangement distance of two genomes in a family-free setting. In such a setting, neither prior classification of genes into families, nor further restrictions on the genomes are imposed. Given two genomes, the mentioned ILP computes an optimal matching of the genes taking into account simultaneously local mutations, given by gene similarities, and large-scale genome rearrangements. Here, we explore the potential of using this ILP for inferring groups of orthologs across several species. More precisely, given a set of genomes, our method first computes all pairwise optimal gene matchings, which are then integrated into gene families in the second step. Our approach is implemented into a pipeline incorporating the pre-computation of gene similarities. It can be downloaded from gitlab.ub.uni-bielefeld.de/gi/FFGC. We obtained promising results with experiments on both simulated and real data.
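The second step described above (integrating pairwise gene matchings into gene families) can be illustrated with a minimal sketch. The abstract does not say how the integration is performed, so the rule used here (families are the connected components of the graph whose edges are matched gene pairs) is an assumption, as are the function name `families_from_matchings` and the `genome:gene` labels.

```python
from collections import defaultdict


def families_from_matchings(pairwise_matchings):
    """Group genes into families as connected components of matched pairs.

    Assumption: two genes belong to the same family if a chain of pairwise
    matchings links them. This is an illustrative rule, not necessarily the
    one used by the pipeline described in the abstract.
    """
    parent = {}

    def find(gene):
        parent.setdefault(gene, gene)
        # Path halving keeps the union-find trees shallow.
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for matching in pairwise_matchings:
        for gene_a, gene_b in matching:
            root_a, root_b = find(gene_a), find(gene_b)
            if root_a != root_b:
                parent[root_a] = root_b

    groups = defaultdict(set)
    for gene in list(parent):
        groups[find(gene)].add(gene)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical matchings for three genomes A, B and C.
matchings = [
    [("A:g1", "B:g1"), ("A:g2", "B:g2")],  # A vs B
    [("B:g1", "C:g1")],                    # B vs C
    [("A:g2", "C:g2")],                    # A vs C
]
families = families_from_matchings(matchings)
```

With these toy matchings, the genes labelled `g1` end up in one family and the genes labelled `g2` in another. A connected-components rule is simple but can merge families through a single spurious match, so a stricter integration rule may be preferable in practice.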

mBio ◽  
2018 ◽  
Vol 9 (1) ◽  
Author(s):  
Xyrus X. Maurer-Alcalá ◽  
Rob Knight ◽  
Laura A. Katz

ABSTRACT Separate germline and somatic genomes are found in numerous lineages across the eukaryotic tree of life, often separated into distinct tissues (e.g., in plants, animals, and fungi) or distinct nuclei sharing a common cytoplasm (e.g., in ciliates and some foraminifera). In ciliates, germline-limited (i.e., micronuclear-specific) DNA is eliminated during the development of a new somatic (i.e., macronuclear) genome in a process that is tightly linked to large-scale genome rearrangements, such as deletions and reordering of protein-coding sequences. Most studies of germline genome architecture in ciliates have focused on the model ciliates Oxytricha trifallax, Paramecium tetraurelia, and Tetrahymena thermophila, for which the complete germline genome sequences are known. Outside of these model taxa, only a few dozen germline loci have been characterized from a limited number of cultivable species, which is likely due to difficulties in obtaining sufficient quantities of “purified” germline DNA in these taxa. Combining single-cell transcriptomics and genomics, we have overcome these limitations and provide the first insights into the structure of the germline genome of the ciliate Chilodonella uncinata, a member of the understudied class Phyllopharyngea. Our analyses reveal the following: (i) large gene families contain a disproportionate number of genes from scrambled germline loci; (ii) germline-soma boundaries in the germline genome are demarcated by substantial shifts in GC content; (iii) single-cell omics techniques provide large-scale quality germline genome data with limited effort, at least for ciliates with extensively fragmented somatic genomes. Our approach provides an efficient means to understand better the evolution of genome rearrangements between germline and soma in ciliates.
IMPORTANCE Our understanding of the distinctions between germline and somatic genomes in ciliates has largely relied on studies of a few model genera (e.g., Oxytricha, Paramecium, Tetrahymena). We have used single-cell omics to explore germline-soma distinctions in the ciliate Chilodonella uncinata, which likely diverged from the better-studied ciliates ~700 million years ago. The analyses presented here indicate that developmentally regulated genome rearrangements between germline and soma are demarcated by rapid transitions in local GC composition and lead to diversification of protein families. The approaches used here provide the basis for future work aimed at discerning the evolutionary impacts of germline-soma distinctions among diverse ciliates.


2019 ◽  
Author(s):  
Kate B. Cook ◽  
Karine Le Roch ◽  
Jean Philippe Vert ◽  
William Stafford Noble

Abstract Chromatin conformation assays such as Hi-C cannot directly measure differences in 3D architecture between cell types or cell states. For this purpose, two or more Hi-C experiments must be carried out, but direct comparison of the resulting Hi-C matrices is confounded by several features of Hi-C data. Most notably, the genomic distance effect, whereby pairs of genomic loci that are proximal along the chromosome exhibit many more Hi-C contacts than distal pairs of loci, dominates every Hi-C matrix. Furthermore, the form that this distance effect takes often varies between different Hi-C experiments, even between replicate experiments. Thus, a statistical confidence measure designed to identify differential Hi-C contacts must accurately account for the genomic distance effect or risk being misled by large-scale but artifactual differences. ACCOST (Altered Chromatin Conformation STatistics) accomplishes this goal by extending the statistical model employed by DEseq, re-purposing the “size factors,” which were originally developed to account for differences in read depth between samples, to instead model the genomic distance effect. We show via analysis of simulated and real data that ACCOST provides unbiased statistical confidence estimates that compare favorably with competing methods such as diffHiC, FIND, and HiCcompare. ACCOST is freely available with an Apache license at https://bitbucket.org/noblelab/accost.
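The genomic distance effect mentioned above can be made concrete with a small sketch. It estimates the expected contact count at each genomic distance as the mean of the corresponding matrix diagonal and then divides observed by expected. This is not ACCOST's model (which, per the abstract, re-purposes DEseq-style size factors); the function names and the toy matrix are assumptions made for illustration only.

```python
def expected_by_distance(contacts):
    """Mean contact count for each genomic distance d = |i - j| (in bins)."""
    n = len(contacts)
    means = []
    for d in range(n):
        # All entries on the d-th diagonal share the same genomic distance.
        diagonal = [contacts[i][i + d] for i in range(n - d)]
        means.append(sum(diagonal) / len(diagonal))
    return means


def observed_over_expected(contacts):
    """Divide each entry by the mean of its diagonal (0.0 where the mean is 0)."""
    n = len(contacts)
    expected = expected_by_distance(contacts)
    return [
        [contacts[i][j] / expected[abs(i - j)] if expected[abs(i - j)] else 0.0
         for j in range(n)]
        for i in range(n)
    ]


# Hypothetical symmetric 4-bin contact matrix: counts decay away from the diagonal.
toy_contacts = [
    [100, 40, 10, 2],
    [40, 120, 50, 8],
    [10, 50, 80, 30],
    [2, 8, 30, 100],
]
expected = expected_by_distance(toy_contacts)
normalized = observed_over_expected(toy_contacts)
```

In the toy matrix the per-distance means fall from 100 on the main diagonal to 2 at the largest distance, which is the decay that a comparison between two experiments has to account for before any difference can be called significant.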


2021 ◽  
Vol 16 (1) ◽  
Author(s):  
Diego P. Rubert ◽  
Fábio V. Martinez ◽  
Marília D. V. Braga

Abstract Background A classical problem in comparative genomics is to compute the rearrangement distance, that is the minimum number of large-scale rearrangements required to transform a given genome into another given genome. The traditional approaches in this area are family-based, i.e., require the classification of DNA fragments of both genomes into families. Furthermore, the most elementary family-based models, which are able to compute distances in polynomial time, restrict the families to occur at most once in each genome. In contrast, the distance computation in models that allow multifamilies (i.e., families with multiple occurrences) is NP-hard. Very recently, Bohnenkämper et al. (J Comput Biol 28:410–431, 2021) proposed an ILP formulation for computing the genomic distance of genomes with multifamilies, allowing structural rearrangements, represented by the generic double cut and join (DCJ) operation, and content-modifying insertions and deletions of DNA segments. This ILP is very efficient, but must maximize a matching of the genes in each multifamily, in order to prevent the free lunch artifact that would otherwise let empty or almost empty matchings give smaller distances. Results In this paper, we adopt the alternative family-free setting that, instead of family classification, simply uses the pairwise similarities between DNA fragments of both genomes to compute their rearrangement distance. We adapted the ILP mentioned above and developed a model in which pairwise similarities are used to assign weights to both matched and unmatched genes, so that an optimal solution does not necessarily maximize the matching. Our model then results in a natural family-free genomic distance, that takes into consideration all given genes, without prior classification into families, and has a search space composed of matchings of any size. 
In spite of its bigger search space, our ILP seems to be boosted by a reduction of the number of co-optimal solutions due to the weights. Indeed, it converged faster than the original one by Bohnenkämper et al. for instances with the same number of multiple connections. We can handle not only bacterial genomes, but also fungi and insects, or sets of chromosomes of mammals and plants. In a comparison study of six fruit fly genomes, we obtained accurate results.


2021 ◽  
Author(s):  
Diego P. Rubert ◽  
Fábio V. Martinez ◽  
Marília Braga

Abstract Background: A classical problem in comparative genomics is to compute the rearrangement distance, that is the minimum number of large-scale rearrangements required to transform a given genome into another given genome. The traditional approaches in this area are family-based, i.e., require the classification of DNA fragments of both genomes into families. Furthermore, the most elementary family-based models, which are able to compute distances in polynomial time, restrict the families to occur at most once in each genome. In contrast, the distance computation in models that allow multifamilies (i.e., families with multiple occurrences) is NP-hard. Very recently, Bohnenkämper et al. (J. Comput. Biol., 2020) proposed an ILP formulation for computing the genomic distance of genomes with multifamilies, allowing structural rearrangements, represented by the generic double cut and join (DCJ) operation, and content-modifying insertions and deletions of DNA segments. This ILP is very efficient, but must maximize a matching of the genes in each multifamily, in order to prevent the free lunch artifact that would otherwise let empty or almost empty matchings give smaller distances. Results: In this paper, we adopt the alternative family-free setting that, instead of family classification, simply uses the pairwise similarities between DNA fragments of both genomes to compute their rearrangement distance. We adapted the ILP mentioned above and developed a model in which pairwise similarities are used to assign weights to both matched and unmatched genes, so that an optimal solution does not necessarily maximize the matching. Our model then results in a natural family-free genomic distance, that takes into consideration all given genes, without prior classification into families, and has a search space composed of matchings of any size. In spite of its bigger search space, our ILP seems to be boosted by a reduction of the number of co-optimal solutions due to the weights. Indeed, it converged faster than the original one by Bohnenkämper et al. for instances with the same number of multiple connections. We can handle not only bacterial genomes, but also fungi and insects, or sets of chromosomes of mammals and plants. In a comparison study of six fruit fly genomes, we obtained accurate results.


2020 ◽  
Vol 48 (5) ◽  
pp. 2303-2311 ◽  
Author(s):  
Kate B Cook ◽  
Borislav H Hristov ◽  
Karine G Le Roch ◽  
Jean Philippe Vert ◽  
William Stafford Noble

Abstract Chromatin conformation assays such as Hi-C cannot directly measure differences in 3D architecture between cell types or cell states. For this purpose, two or more Hi-C experiments must be carried out, but direct comparison of the resulting Hi-C matrices is confounded by several features of Hi-C data. Most notably, the genomic distance effect, whereby pairs of genomic loci that are proximal along the chromosome exhibit many more Hi-C contacts than distal pairs of loci, dominates every Hi-C matrix. Furthermore, the form that this distance effect takes often varies between different Hi-C experiments, even between replicate experiments. Thus, a statistical confidence measure designed to identify differential Hi-C contacts must accurately account for the genomic distance effect or risk being misled by large-scale but artifactual differences. ACCOST (Altered Chromatin COnformation STatistics) accomplishes this goal by extending the statistical model employed by DEseq, re-purposing the ‘size factors,’ which were originally developed to account for differences in read depth between samples, to instead model the genomic distance effect. We show via analysis of simulated and real data that ACCOST provides unbiased statistical confidence estimates that compare favorably with competing methods such as diffHiC, FIND and HiCcompare. ACCOST is freely available with an Apache license at https://bitbucket.org/noblelab/accost.


Author(s):  
Shahrani Shahbudin ◽  
Zaki Firdaus Mohmad ◽  
Saiful Izwan Suliman ◽  
Murizah Kassim ◽  
Roslina Mohamad

Power quality has become one of the important issues in the modern smart grid environment. A smart grid generally utilizes computational intelligence methods from the generation of electricity to its distribution to customers. This is done for the safety, reliability, tenacity and efficiency of the system. The classification of power disturbances has become a major topic in maintaining power quality. These disturbances occur due to faults, natural causes, load switching, transformer energizing, the starting of large motors, as well as the utilization of power electronic devices. The key issue is maintaining a continuous supply of electricity to end-users without any problem. If a problem occurs, it might increase production costs significantly, especially for large-scale industries. In this paper, the S-transform is used to extract distinctive features of real data from a transmission system, and a Support Vector Machine is utilized to classify four types of PQ disturbances, namely voltage sag, interruption, transient and normal voltage. The results obtained indicate that the One Against One classifier achieves high accuracy using k-fold cross validation and an RBF kernel.
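The One Against One scheme mentioned above decomposes the four-class problem into one binary classifier per pair of classes (six in total) and predicts by majority vote. The sketch below shows only that voting logic. The S-transform features, the RBF-kernel SVMs and the k-fold evaluation from the abstract are not reproduced: each binary classifier is replaced by a hypothetical nearest-centroid rule on a single made-up feature, and all names and values are assumptions.

```python
from collections import Counter
from itertools import combinations

CLASSES = ["sag", "interruption", "transient", "normal"]

# Hypothetical 1-D feature (a per-unit voltage magnitude summary) per class.
TOY_CENTROIDS = {"sag": 0.6, "interruption": 0.05, "transient": 1.3, "normal": 1.0}


def make_toy_classifier(class_a, class_b):
    """Stand-in for a binary SVM trained only on class_a versus class_b."""
    def classify(feature):
        dist_a = abs(feature - TOY_CENTROIDS[class_a])
        dist_b = abs(feature - TOY_CENTROIDS[class_b])
        return class_a if dist_a <= dist_b else class_b
    return classify


# One binary classifier per unordered pair of classes: 4 * 3 / 2 = 6.
PAIRWISE = {pair: make_toy_classifier(*pair) for pair in combinations(CLASSES, 2)}


def one_against_one_predict(feature):
    """Majority vote over all pairwise classifiers; ties go to the class
    listed first in CLASSES."""
    votes = Counter(classify(feature) for classify in PAIRWISE.values())
    return max(CLASSES, key=lambda name: votes[name])
```

For example, `one_against_one_predict(0.55)` collects three votes for `"sag"` because that class wins every pairwise contest it takes part in.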


2009 ◽  
pp. 27-53
Author(s):  
A. Yu. Kudryavtsev

Diversity of plant communities in the nature reserve “Privolzhskaya Forest-Steppe”, Ostrovtsovsky area, is analyzed on the basis of the large-scale vegetation mapping data from 2000. The plant community classification based on the Russian ecologic-phytocoenotic approach is carried out. 12 plant formations and 21 associations are distinguished according to dominant species and a combination of ecologic-phytocoenotic groups of species. A list of vegetation classification units as well as the characteristics of the shrub and woody communities are given in this paper.


1996 ◽  
pp. 64-67 ◽  
Author(s):  
Nguen Nghia Thin ◽  
Nguen Ba Thu ◽  
Tran Van Thuy

The tropical seasonal rainy evergreen broad-leaved forest vegetation of the Cucphoung National Park has been classified, and the distribution of plant communities has been shown on the map using the relations of vegetation to geology, geomorphology and pedology. The method of vegetation mapping includes: 1) identification of vegetation types in the remote-sensed materials (aerial photographs and satellite images); 2) field work to compile the interpretation keys and to characterize all the communities of the study area; 3) compilation of the final vegetation map using the combined information. In the classification presented, vegetation units of several levels have been identified: formation classes (3), formation sub-classes (3), formation groups (3), formations (4), subformations (10) and communities (19). Communities have been taken as mapping units. The vegetation map of the National Park therefore shows 19 vegetation categories altogether, of which 13 are natural primary communities and 6 are secondary, anthropogenic ones. The secondary succession goes through 3 main stages: grassland herbaceous xerophytic vegetation, xerophytic scrub, dense forest.


Author(s):  
P.L. Nikolaev

This article deals with a method of binary classification of images with small text on them. Classification is based on the fact that the text can have two orientations: it can be positioned horizontally and read from left to right, or it can be turned 180 degrees, so that the image must be rotated to read it. This type of text can be found on the covers of a variety of books, so when recognizing covers it is necessary to determine the orientation of the text before recognizing the text itself. The article proposes a deep neural network for determining the text orientation in the context of book cover recognition. The results of training and testing a convolutional neural network on synthetic data, as well as examples of the network operating on real data, are presented.
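The two-class setup described above (upright text versus text turned 180 degrees) suggests one simple way to build labelled training pairs: rotate each upright image and label the copy accordingly. The abstract states that synthetic data were used but not how they were generated, so the sketch below is an assumption, as are the function names and the label convention.

```python
def rotate_180(image):
    """Rotate a 2D image (a list of pixel rows) by 180 degrees."""
    return [row[::-1] for row in image[::-1]]


def make_labeled_pairs(upright_images):
    """Return (image, label) pairs: 0 = upright, 1 = rotated by 180 degrees."""
    samples = []
    for image in upright_images:
        samples.append((image, 0))
        samples.append((rotate_180(image), 1))
    return samples


# Hypothetical 2x3 grayscale "image" standing in for a rendered text snippet.
toy_image = [
    [1, 2, 3],
    [4, 5, 6],
]
dataset = make_labeled_pairs([toy_image])
```

Each source image yields one sample per class, so a dataset built this way is balanced by construction.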


2019 ◽  
pp. 1-13
Author(s):  
Luz Judith Rodríguez-Esparza ◽  
Diana Barraza-Barraza ◽  
Jesús Salazar-Ibarra ◽  
Rafael Gerardo Vargas-Pasaye

Objectives: To identify early suicide risk signs in depressive subjects, so that specialized care can be provided. Various studies have focused on expressions on social networks, where users pour out their emotions, to determine whether they show signs of depression. However, they have neglected the quantification of the risk of committing suicide. Therefore, this article proposes a new index for identifying suicide risk in Mexico. Methodology: The proposed index is constructed through opinion mining using Twitter and the Analytic Hierarchy Process. Contribution: Using the R statistical package, a study is presented considering real data, classifying people according to the obtained index and using information from psychologists. The proposed methodology represents an innovative alternative for suicide prevention.
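The Analytic Hierarchy Process mentioned above derives criterion weights from a matrix of pairwise importance judgements. The sketch below uses the row geometric-mean approximation of the AHP priority vector and then combines per-criterion scores into a single index. The abstract does not give the criteria, the judgements or the form of the index, so the matrix, the scores and the weighted-sum rule are hypothetical; the study itself reports using R, while Python is used here for illustration.

```python
import math


def ahp_weights(comparisons):
    """Priority weights from a reciprocal pairwise comparison matrix,
    using the row geometric-mean approximation."""
    n = len(comparisons)
    geo_means = [math.prod(row) ** (1.0 / n) for row in comparisons]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def weighted_index(scores, weights):
    """Combine per-criterion scores into one index as a weighted sum."""
    return sum(s * w for s, w in zip(scores, weights))


# Hypothetical judgements for three unnamed criteria: the first is judged
# twice as important as the second and four times as important as the third.
comparisons = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]
weights = ahp_weights(comparisons)  # approximately [0.571, 0.286, 0.143]
index = weighted_index([1.0, 0.0, 1.0], weights)
```

Because this toy matrix is perfectly consistent, the geometric-mean weights coincide with the exact priority vector; with real, inconsistent judgements a consistency check would normally accompany the weights.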

