NAToRA, a relatedness-pruning method to minimize the loss of dataset size in genetic and omics analyses

2021
Author(s): Thiago Peixoto Leal, Vinicius C Furlan, Mateus Henrique Gouveia, Julia Maria Saraiva Duarte, Pablo AS Fonseca, et al.

Genetic and omics analyses frequently require independent observations, which is not guaranteed in real datasets. When relatedness cannot be accounted for, the usual solution is to remove related individuals (or observations), consequently reducing the amount of available data. We developed a network-based relatedness-pruning method that minimizes dataset reduction while removing unwanted relationships from a dataset. It uses the node degree centrality metric to identify highly connected nodes (individuals) and implements heuristics that approximate the minimal reduction of a dataset, allowing it to be applied to large datasets. NAToRA outperformed two popular methodologies (implemented in the software packages PLINK and KING), showing the most effective relatedness pruning: it removed all related pairs while keeping the largest possible number of individuals in all datasets tested, with a similar or smaller reduction in genetic diversity. NAToRA is freely available, both as a standalone tool that can easily be incorporated into a pipeline and as a graphical web tool that allows visualization of relatedness networks. NAToRA also accepts a variety of relationship metrics as input, which facilitates its use. We also present a genealogy simulator used for several of the tests performed in the manuscript.
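
The abstract does not spell out the pruning algorithm, but the degree-centrality heuristic it describes can be sketched as a greedy routine on the relatedness network. The following is a minimal illustration, not NAToRA's actual implementation, assuming a networkx graph whose edges connect pairs above some kinship threshold:

```python
# Greedy degree-based relatedness pruning, in the spirit of NAToRA's
# approach (a sketch, not the authors' code). Individuals are nodes; an
# edge joins any pair whose relatedness exceeds a threshold. Repeatedly
# removing the most-connected node approximates the smallest set of
# removals that leaves no related pair.
import networkx as nx

def prune_related(pairs):
    """pairs: iterable of (id1, id2) tuples for related pairs.
    Returns the set of individuals to remove."""
    g = nx.Graph()
    g.add_edges_from(pairs)
    removed = set()
    while g.number_of_edges() > 0:
        # Pick the node with the highest degree (the most relatives).
        node = max(g.degree, key=lambda kv: kv[1])[0]
        g.remove_node(node)
        removed.add(node)
    return removed

# Example: A-B, B-C, B-D -> removing B alone breaks every related pair.
print(prune_related([("A", "B"), ("B", "C"), ("B", "D")]))  # {'B'}
```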


Author(s): Natarajan Meghanathan

We present a correlation analysis between the centrality values observed for nodes (a computationally lightweight metric) and the maximal clique size (a computationally hard metric) that each node is part of in complex real-world network graphs. We consider the four common centrality metrics: degree centrality (DegC), eigenvector centrality (EVC), closeness centrality (ClC), and betweenness centrality (BWC). We define the maximal clique size for a node as the size of the largest clique (in terms of the number of constituent nodes) the node is part of. The real-world network graphs studied range from regular random network graphs to scale-free network graphs. We observe that the correlation between the centrality value and the maximal clique size for a node increases with the spectral radius ratio for node degree, a measure of the variation of node degree in the network. We observe that the degree-based centrality metrics (DegC and EVC) correlate better with the maximal clique size than the shortest-path-based centrality metrics (ClC and BWC).
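
A minimal sketch of this correlation study, assuming networkx and scipy; the graph below is an illustrative stand-in for the real-world networks analyzed:

```python
# Compute the four centrality metrics and each node's maximal clique
# size, then the Pearson correlation between them (a sketch of the
# study's measurement, not its exact pipeline).
import networkx as nx
from scipy.stats import pearsonr

g = nx.karate_club_graph()  # example graph
nodes = list(g.nodes)

centralities = {
    "DegC": nx.degree_centrality(g),
    "EVC": nx.eigenvector_centrality(g, max_iter=1000),
    "ClC": nx.closeness_centrality(g),
    "BWC": nx.betweenness_centrality(g),
}
# Size of the largest clique each node belongs to (the hard metric).
clique_size = nx.node_clique_number(g)

for name, cent in centralities.items():
    r, _ = pearsonr([cent[v] for v in nodes],
                    [clique_size[v] for v in nodes])
    print(f"PCC({name}, maximal clique size) = {r:.3f}")
```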


Database
2016
Vol 2016
Author(s): Hans Ienasescu, Kang Li, Robin Andersson, Morana Vitezic, Sarah Rennie, et al.

Genomics consortia have produced large datasets profiling the expression of genes, micro-RNAs, enhancers and more across human tissues or cells. There is a need for intuitive tools to select the subsets of such data that are most relevant to specific studies. To this end, we present SlideBase, a web tool that offers a new way of selecting genes, promoters, enhancers and microRNAs that are preferentially expressed/used in a specified set of cells/tissues, based on interactive sliders. Using the sliders, SlideBase enables users to define custom expression thresholds for individual cell types/tissues, producing sets of genes, enhancers, etc. that satisfy these constraints. Changes in slider settings result in simultaneous changes in the selected sets, updated in real time. SlideBase is linked to major databases from genomics consortia, including FANTOM, GTEx, The Human Protein Atlas and BioGPS. Database URL: http://slidebase.binf.ku.dk
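
The slider logic can be illustrated with a small sketch; this is an assumed reimplementation, not SlideBase's code, and the gene names and expression values are hypothetical. Re-evaluating the constraints on every slider change yields the real-time updates described:

```python
# Filter a gene-by-tissue expression matrix with per-tissue minimum and
# maximum thresholds, keeping genes that satisfy every constraint.
import pandas as pd

expr = pd.DataFrame(
    {"liver": [90, 5, 60], "brain": [10, 80, 55]},
    index=["GENE_A", "GENE_B", "GENE_C"],  # hypothetical values
)

def select(expr, min_thr=None, max_thr=None):
    mask = pd.Series(True, index=expr.index)
    for tissue, thr in (min_thr or {}).items():
        mask &= expr[tissue] >= thr
    for tissue, thr in (max_thr or {}).items():
        mask &= expr[tissue] <= thr
    return expr.index[mask]

# Genes preferentially expressed in liver: high in liver, low in brain.
print(list(select(expr, min_thr={"liver": 50}, max_thr={"brain": 20})))
# ['GENE_A']
```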


2017
Vol 10 (2)
pp. 52
Author(s): Natarajan Meghanathan

Results of a correlation study (using Pearson's correlation coefficient, PCC) between decay centrality (DEC) vs. degree centrality (DEG) and closeness centrality (CLC) for a suite of 48 real-world networks indicate an interesting trend: PCC(DEC, DEG) decreases as the decay parameter δ (0 < δ < 1) increases, and PCC(DEC, CLC) decreases as δ decreases. We make use of this trend of monotonic decrease in the PCC values (from both sides of the δ-search space) and propose a binary search algorithm that, given a threshold value r for the PCC, can identify a value of δ (if one exists, we say there exists a positive δ-space_r) for a real-world network such that PCC(DEC, DEG) ≥ r as well as PCC(DEC, CLC) ≥ r. We show how the binary search algorithm can find the maximum threshold PCC value r_max (such that δ-space_{r_max} is positive) for a real-world network. We observe a very strong correlation between r_max and PCC(DEG, CLC), and find that real-world networks with a larger variation in node degree are more likely to have a lower r_max value, and vice versa.
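
A compact sketch of the binary search idea, with decay centrality taken as DEC_δ(v) = Σ_{u≠v} δ^d(v,u); this is an illustration under the monotonicity trend reported above, not the paper's exact algorithm:

```python
# Per the abstract, PCC(DEC, DEG) falls and PCC(DEC, CLC) rises as delta
# grows, so their difference is monotone decreasing and the crossing
# point yields r_max.
import networkx as nx
from scipy.stats import pearsonr

def find_rmax(g, tol=1e-3):
    dist = dict(nx.all_pairs_shortest_path_length(g))
    nodes = list(g)
    deg = [nx.degree_centrality(g)[v] for v in nodes]
    clc = [nx.closeness_centrality(g)[v] for v in nodes]

    def dec(delta):  # decay centrality for every node
        return [sum(delta ** d for u, d in dist[v].items() if u != v)
                for v in nodes]

    def gap(delta):  # PCC(DEC, DEG) - PCC(DEC, CLC), decreasing in delta
        vals = dec(delta)
        return pearsonr(vals, deg)[0] - pearsonr(vals, clc)[0]

    lo, hi = tol, 1.0 - tol
    while hi - lo > tol:  # binary search for the crossing point
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if gap(mid) > 0 else (lo, mid)
    vals = dec(lo)
    return min(pearsonr(vals, deg)[0], pearsonr(vals, clc)[0])

print(find_rmax(nx.karate_club_graph()))
```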


2016
Author(s): Suyash S. Shringarpure, Carlos D. Bustamante, Kenneth L. Lange, David H. Alexander

Background: A number of large genomic datasets are being generated for studies of human ancestry and diseases. The ADMIXTURE program is commonly used to infer individual ancestry from genomic data. Results: We describe two improvements to the ADMIXTURE software. The first enables ADMIXTURE to infer ancestry for a new set of individuals using cluster allele frequencies from a reference set of individuals. Using data from the 1000 Genomes Project, we show that this allows ADMIXTURE to infer ancestry for 10,920 individuals in a few hours (a 5x speedup). This mode also allows ADMIXTURE to correctly estimate individual ancestry and allele frequencies from a set of related individuals. The second modification allows ADMIXTURE to correctly handle X-chromosome (and other haploid) data from both males and females. We demonstrate increased power to detect sex-biased admixture in African-American individuals from the 1000 Genomes Project using this extension. Conclusions: These modifications make ADMIXTURE more efficient and versatile, allowing users to extract more information from large genomic datasets.
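
The projection mode can be driven from a script roughly as follows. This is a hedged sketch: the -P flag and the study.K.P.in naming follow my reading of the ADMIXTURE manual, and all file names here are placeholders; consult the manual for the authoritative usage.

```python
# Project new samples onto allele frequencies learned from a reference
# run. The reference .P file is supplied under the name ADMIXTURE is
# assumed to look for, and -P tells it to hold those frequencies fixed
# while estimating ancestry fractions (.Q) for the new samples.
import shutil
import subprocess

K = 8  # number of ancestral populations used in the reference run
# Assumed convention: ADMIXTURE reads study.<K>.P.in next to study.bed.
shutil.copy(f"reference.{K}.P", f"study.{K}.P.in")
subprocess.run(["admixture", "-P", "study.bed", str(K)], check=True)
# study.8.Q now holds projected ancestry fractions for the new samples.
```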


2017
Author(s): Fabrizio Mafessoni, Rashmi B Prasad, Leif Groop, Ola Hansson, Kay Prüfer

It is often unavoidable to combine data from different sequencing centers or sequencing platforms when compiling datasets with a large number of individuals. However, the different data sources are likely to contain their own systematic errors, which will appear as SNPs. Here, we devise a method to detect systematic errors in combined datasets. To measure quality differences between individual genomes, we study pairs of variants that reside on different chromosomes and co-occur in individuals. The abundance of these pairs of variants in different genomes is then used to detect systematic errors due to batch effects. Applying our method to the 1000 Genomes dataset, we find that coding regions are enriched for errors: about 1% of the higher-frequency variants there are predicted to be erroneous, whereas errors outside of coding regions are much rarer (<0.001%). As expected, predicted errors are found less often than other variants in a dataset that was generated with a different sequencing technology, indicating that many of the candidates are indeed errors. However, predicted 1000 Genomes errors are also found in other large datasets; our observation is thus not specific to the 1000 Genomes dataset. Our results show that batch effects can be turned into a virtue by using the resulting variation in large-scale datasets to detect systematic errors.
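
The co-occurrence idea can be illustrated with a toy simulation; this is entirely my construction, not the authors' implementation, and the injected column indices and thresholds are arbitrary:

```python
# Genuine variants on different chromosomes co-occur across individuals
# roughly at random; shared batch artifacts co-occur in the same subset
# of genomes. Variants that sit in many strongly co-occurring
# cross-chromosome pairs are flagged as error candidates.
import numpy as np

rng = np.random.default_rng(0)
n_ind, n_var = 200, 50
chrom = np.arange(n_var) % 22 + 1            # chromosome of each variant
geno = rng.random((n_ind, n_var)) < 0.05     # simulated carrier matrix
batch = rng.random(n_ind) < 0.5              # individuals from one center
geno[:, [3, 17, 29, 41]] |= batch[:, None]   # inject shared batch errors

carriers = geno.astype(float)
co = carriers.T @ carriers                   # joint-carrier counts per pair
expected = np.outer(carriers.sum(0), carriers.sum(0)) / n_ind
excess = (co - expected) / np.sqrt(expected + 1)  # crude excess score
excess[chrom[:, None] == chrom[None, :]] = 0 # keep cross-chromosome pairs

score = (excess > 5).sum(axis=1)             # strongly linked partners
print(np.nonzero(score >= 3)[0])             # should recover 3, 17, 29, 41
```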


2018
Vol 10 (12)
pp. 4480
Author(s): Na Zhang, Yu Yang, Jianxin Wang, Baodong Li, Jiafu Su

Changes in customer needs are unavoidable during the design process of complex mechanical products and may have severely negative impacts on product design, such as extra costs and delays. An effective way to prevent and reduce these negative impacts is to evaluate and manage the core parts of the product. Therefore, in this paper, a modified Dempster-Shafer (D-S) evidential approach is proposed for identifying the core parts. Firstly, an undirected weighted network model is constructed to systematically describe the product structure. Secondly, a modified D-S evidential approach is proposed to evaluate the core parts systematically, taking into account node degrees, node weights, node positions, and global network information. Finally, the core parts of a wind turbine are evaluated to illustrate the effectiveness of the proposed method. The results show that the modified D-S evidential approach identifies core parts better than the node degree, node betweenness, and node closeness centrality measures.
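
The evidence-fusion step rests on Dempster's rule of combination. Below is a textbook sketch of that rule, not the paper's modified version, with hypothetical basic probability assignments standing in for evidence derived from two centrality scores:

```python
# Combine basic probability assignments (BPAs) over the frame
# {core, peripheral} with Dempster's rule: multiply masses on every pair
# of focal sets, accumulate mass on non-empty intersections, and
# renormalize by the non-conflicting mass.
from itertools import product

FRAME = frozenset({"core", "peripheral"})

def dempster(m1, m2):
    """Combine two BPAs (dict: frozenset -> mass) with Dempster's rule."""
    combined, conflict = {}, 0.0
    for (a, x), (b, y) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + x * y
        else:
            conflict += x * y
    return {s: v / (1 - conflict) for s, v in combined.items()}

# Hypothetical BPAs for one part, e.g. from normalized degree and
# betweenness scores, with leftover mass on the whole frame (ignorance).
m_degree = {frozenset({"core"}): 0.6, FRAME: 0.4}
m_between = {frozenset({"core"}): 0.5,
             frozenset({"peripheral"}): 0.2, FRAME: 0.3}
print(dempster(m_degree, m_between))  # mass on {core} rises to ~0.77
```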


2020
Vol 36 (16)
pp. 4519-4520
Author(s): Ying Zhou, Sharon R Browning, Brian L Browning

Motivation: Estimation of pairwise kinship coefficients in large datasets is computationally challenging because the number of pairs of individuals increases quadratically with sample size. Results: We present IBDkin, a software package written in C for estimating kinship coefficients from identity-by-descent (IBD) segments. We use IBDkin to estimate kinship coefficients for 7.95 billion pairs of individuals in the UK Biobank who share at least one detected IBD segment with length ≥ 4 cM. Availability and implementation: https://github.com/YingZhou001/IBDkin. Supplementary information: Supplementary data are available at Bioinformatics online.
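
The underlying estimator can be sketched from the standard relation between kinship and IBD sharing, φ = (0.25·L_IBD1 + 0.5·L_IBD2)/L_genome. This is a back-of-the-envelope version; IBDkin itself handles IBD coverage, segment merging and more, and the genome map length used here is an assumption:

```python
# Kinship coefficient from total IBD segment lengths, where L_IBD1 and
# L_IBD2 are the total lengths (cM) covered by one or two shared
# segments, respectively.

def kinship(ibd1_cm, ibd2_cm, genome_cm=3545.0):
    """genome_cm: approximate autosomal map length (an assumption)."""
    return (0.25 * ibd1_cm + 0.5 * ibd2_cm) / genome_cm

# Full siblings share roughly half the genome IBD1 and a quarter IBD2,
# giving a kinship near the expected 0.25.
print(kinship(ibd1_cm=0.5 * 3545.0, ibd2_cm=0.25 * 3545.0))  # ~0.25
```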

