NAToRA, a relatedness-pruning method to minimize the loss of dataset size in genetic and omics analyses

2021
Author(s): Thiago Peixoto Leal, Vinicius C Furlan, Mateus Henrique Gouveia, Julia Maria Saraiva Duarte, Pablo AS Fonseca, et al.

Genetic and omics analyses frequently require independent observations, which is not guaranteed in real datasets. When relatedness cannot be accounted for, the usual solution is to remove related individuals (or observations), consequently reducing the amount of available data. We developed a network-based relatedness-pruning method that minimizes dataset reduction while removing unwanted relationships from a dataset. It uses the node degree centrality metric to identify highly connected nodes (individuals) and implements heuristics that approximate the minimal reduction of a dataset, allowing it to be applied to large datasets. NAToRA outperformed two popular methodologies (implemented in the software packages PLINK and KING), showing the most effective relatedness pruning: it removed all related pairs while keeping the largest possible number of individuals in all datasets tested, with a similar or smaller reduction in genetic diversity. NAToRA is freely available, both as a standalone tool that can easily be incorporated into a pipeline and as a graphical web tool that allows visualization of relatedness networks. NAToRA also accepts a variety of relationship metrics as input, which facilitates its use. We also present a genealogy simulator used for several of the tests performed in the manuscript.
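
The abstract does not spell out the pruning algorithm, but the degree-centrality heuristic it describes can be sketched as a greedy routine on the relatedness network. The following is a minimal illustration, not NAToRA's actual implementation, assuming a networkx graph whose edges connect pairs above some kinship threshold:

```python
# Greedy degree-based relatedness pruning, in the spirit of NAToRA's
# approach (a sketch, not the authors' code). Individuals are nodes; an
# edge joins any pair whose relatedness exceeds a threshold. Repeatedly
# removing the most-connected node approximates the smallest set of
# removals that leaves no related pair.
import networkx as nx

def prune_related(pairs):
    """pairs: iterable of (id1, id2) tuples for related pairs.
    Returns the set of individuals to remove."""
    g = nx.Graph()
    g.add_edges_from(pairs)
    removed = set()
    while g.number_of_edges() > 0:
        # Pick the node with the highest degree (the most relatives).
        node = max(g.degree, key=lambda kv: kv[1])[0]
        g.remove_node(node)
        removed.add(node)
    return removed

# Example: A-B, B-C, B-D -> removing B alone breaks every related pair.
print(prune_related([("A", "B"), ("B", "C"), ("B", "D")]))  # {'B'}
```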


Author(s): Natarajan Meghanathan

We present a correlation analysis between the centrality values observed for nodes (a computationally lightweight metric) and the maximal clique size (a computationally hard metric) that each node is part of in complex real-world network graphs. We consider the four common centrality metrics: degree centrality (DegC), eigenvector centrality (EVC), closeness centrality (ClC), and betweenness centrality (BWC). We define the maximal clique size for a node as the size of the largest clique (in terms of the number of constituent nodes) the node is part of. The real-world network graphs studied range from regular random network graphs to scale-free network graphs. We observe that the correlation between the centrality value and the maximal clique size for a node increases with the spectral radius ratio for node degree, a measure of the variation of node degree in the network. We observe that the degree-based centrality metrics (DegC and EVC) correlate better with the maximal clique size than the shortest-path-based centrality metrics (ClC and BWC).
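
A minimal sketch of this correlation study, assuming networkx and scipy; the graph below is an illustrative stand-in for the real-world networks analyzed:

```python
# Compute the four centrality metrics and each node's maximal clique
# size, then the Pearson correlation between them (a sketch of the
# study's measurement, not its exact pipeline).
import networkx as nx
from scipy.stats import pearsonr

g = nx.karate_club_graph()  # example graph
nodes = list(g.nodes)

centralities = {
    "DegC": nx.degree_centrality(g),
    "EVC": nx.eigenvector_centrality(g, max_iter=1000),
    "ClC": nx.closeness_centrality(g),
    "BWC": nx.betweenness_centrality(g),
}
# Size of the largest clique each node belongs to (the hard metric).
clique_size = nx.node_clique_number(g)

for name, cent in centralities.items():
    r, _ = pearsonr([cent[v] for v in nodes],
                    [clique_size[v] for v in nodes])
    print(f"PCC({name}, maximal clique size) = {r:.3f}")
```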


Database
2016
Vol 2016
Author(s): Hans Ienasescu, Kang Li, Robin Andersson, Morana Vitezic, Sarah Rennie, et al.

Genomics consortia have produced large datasets profiling the expression of genes, micro-RNAs, enhancers and more across human tissues or cells. There is a need for intuitive tools to select the subsets of such data that are most relevant to specific studies. To this end, we present SlideBase, a web tool that offers a new way of selecting genes, promoters, enhancers and microRNAs that are preferentially expressed/used in a specified set of cells/tissues, based on interactive sliders. Using the sliders, SlideBase enables users to define custom expression thresholds for individual cell types/tissues, producing sets of genes, enhancers, etc. that satisfy these constraints. Changes in slider settings result in simultaneous changes in the selected sets, updated in real time. SlideBase is linked to major databases from genomics consortia, including FANTOM, GTEx, The Human Protein Atlas and BioGPS. Database URL: http://slidebase.binf.ku.dk
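
The slider logic can be illustrated with a small sketch; this is an assumed reimplementation, not SlideBase's code, and the gene names and expression values are hypothetical. Re-evaluating the constraints on every slider change yields the real-time updates described:

```python
# Filter a gene-by-tissue expression matrix with per-tissue minimum and
# maximum thresholds, keeping genes that satisfy every constraint.
import pandas as pd

expr = pd.DataFrame(
    {"liver": [90, 5, 60], "brain": [10, 80, 55]},
    index=["GENE_A", "GENE_B", "GENE_C"],  # hypothetical values
)

def select(expr, min_thr=None, max_thr=None):
    mask = pd.Series(True, index=expr.index)
    for tissue, thr in (min_thr or {}).items():
        mask &= expr[tissue] >= thr
    for tissue, thr in (max_thr or {}).items():
        mask &= expr[tissue] <= thr
    return expr.index[mask]

# Genes preferentially expressed in liver: high in liver, low in brain.
print(list(select(expr, min_thr={"liver": 50}, max_thr={"brain": 20})))
# ['GENE_A']
```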


2017
Vol 10 (2)
pp. 52
Author(s): Natarajan Meghanathan

Results of a correlation study (using Pearson's correlation coefficient, PCC) between decay centrality (DEC) vs. degree centrality (DEG) and closeness centrality (CLC) for a suite of 48 real-world networks indicate an interesting trend: PCC(DEC, DEG) decreases as the decay parameter δ (0 < δ < 1) increases, and PCC(DEC, CLC) decreases as δ decreases. We make use of this trend of monotonic decrease in the PCC values (from both sides of the δ-search space) and propose a binary search algorithm that, given a threshold value r for the PCC, can identify a value of δ (if one exists, we say there exists a positive δ-space_r) for a real-world network such that PCC(DEC, DEG) ≥ r as well as PCC(DEC, CLC) ≥ r. We show how the binary search algorithm can find the maximum threshold PCC value r_max (such that δ-space_{r_max} is positive) for a real-world network. We observe a very strong correlation between r_max and PCC(DEG, CLC), and find that real-world networks with a larger variation in node degree are more likely to have a lower r_max value, and vice versa.
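
A compact sketch of the binary search idea, with decay centrality taken as DEC_δ(v) = Σ_{u≠v} δ^d(v,u); this is an illustration under the monotonicity trend reported above, not the paper's exact algorithm:

```python
# Per the abstract, PCC(DEC, DEG) falls and PCC(DEC, CLC) rises as delta
# grows, so their difference is monotone decreasing and the crossing
# point yields r_max.
import networkx as nx
from scipy.stats import pearsonr

def find_rmax(g, tol=1e-3):
    dist = dict(nx.all_pairs_shortest_path_length(g))
    nodes = list(g)
    deg = [nx.degree_centrality(g)[v] for v in nodes]
    clc = [nx.closeness_centrality(g)[v] for v in nodes]

    def dec(delta):  # decay centrality for every node
        return [sum(delta ** d for u, d in dist[v].items() if u != v)
                for v in nodes]

    def gap(delta):  # PCC(DEC, DEG) - PCC(DEC, CLC), decreasing in delta
        vals = dec(delta)
        return pearsonr(vals, deg)[0] - pearsonr(vals, clc)[0]

    lo, hi = tol, 1.0 - tol
    while hi - lo > tol:  # binary search for the crossing point
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if gap(mid) > 0 else (lo, mid)
    vals = dec(lo)
    return min(pearsonr(vals, deg)[0], pearsonr(vals, clc)[0])

print(find_rmax(nx.karate_club_graph()))
```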


2016
Author(s): Suyash S. Shringarpure, Carlos D. Bustamante, Kenneth L. Lange, David H. Alexander

Background: A number of large genomic datasets are being generated for studies of human ancestry and diseases. The ADMIXTURE program is commonly used to infer individual ancestry from genomic data. Results: We describe two improvements to the ADMIXTURE software. The first enables ADMIXTURE to infer ancestry for a new set of individuals using cluster allele frequencies from a reference set of individuals. Using data from the 1000 Genomes Project, we show that this allows ADMIXTURE to infer ancestry for 10,920 individuals in a few hours (a 5x speedup). This mode also allows ADMIXTURE to correctly estimate individual ancestry and allele frequencies from a set of related individuals. The second modification allows ADMIXTURE to correctly handle X-chromosome (and other haploid) data from both males and females. We demonstrate increased power to detect sex-biased admixture in African-American individuals from the 1000 Genomes Project using this extension. Conclusions: These modifications make ADMIXTURE more efficient and versatile, allowing users to extract more information from large genomic datasets.
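
The projection mode can be driven from a script roughly as follows. This is a hedged sketch: the -P flag and the study.K.P.in naming follow my reading of the ADMIXTURE manual, and all file names here are placeholders; consult the manual for the authoritative usage.

```python
# Project new samples onto allele frequencies learned from a reference
# run. The reference .P file is supplied under the name ADMIXTURE is
# assumed to look for, and -P tells it to hold those frequencies fixed
# while estimating ancestry fractions (.Q) for the new samples.
import shutil
import subprocess

K = 8  # number of ancestral populations used in the reference run
# Assumed convention: ADMIXTURE reads study.<K>.P.in next to study.bed.
shutil.copy(f"reference.{K}.P", f"study.{K}.P.in")
subprocess.run(["admixture", "-P", "study.bed", str(K)], check=True)
# study.8.Q now holds projected ancestry fractions for the new samples.
```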


2017
Author(s): Fabrizio Mafessoni, Rashmi B Prasad, Leif Groop, Ola Hansson, Kay Prüfer

It is often unavoidable to combine data from different sequencing centers or sequencing platforms when compiling datasets with a large number of individuals. However, the different data sources are likely to contain their own systematic errors, which will appear as SNPs. Here, we devise a method to detect systematic errors in combined datasets. To measure quality differences between individual genomes, we study pairs of variants that reside on different chromosomes and co-occur in individuals. The abundance of these pairs of variants in different genomes is then used to detect systematic errors due to batch effects. Applying our method to the 1000 Genomes dataset, we find that coding regions are enriched for errors: about 1% of the higher-frequency variants there are predicted to be erroneous, whereas errors outside of coding regions are much rarer (<0.001%). As expected, predicted errors are found less often than other variants in a dataset that was generated with a different sequencing technology, indicating that many of the candidates are indeed errors. However, predicted 1000 Genomes errors are also found in other large datasets; our observation is thus not specific to the 1000 Genomes dataset. Our results show that batch effects can be turned into a virtue by using the resulting variation in large-scale datasets to detect systematic errors.
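
The co-occurrence idea can be illustrated with a toy simulation; this is entirely my construction, not the authors' implementation, and the injected column indices and thresholds are arbitrary:

```python
# Genuine variants on different chromosomes co-occur across individuals
# roughly at random; shared batch artifacts co-occur in the same subset
# of genomes. Variants that sit in many strongly co-occurring
# cross-chromosome pairs are flagged as error candidates.
import numpy as np

rng = np.random.default_rng(0)
n_ind, n_var = 200, 50
chrom = np.arange(n_var) % 22 + 1            # chromosome of each variant
geno = rng.random((n_ind, n_var)) < 0.05     # simulated carrier matrix
batch = rng.random(n_ind) < 0.5              # individuals from one center
geno[:, [3, 17, 29, 41]] |= batch[:, None]   # inject shared batch errors

carriers = geno.astype(float)
co = carriers.T @ carriers                   # joint-carrier counts per pair
expected = np.outer(carriers.sum(0), carriers.sum(0)) / n_ind
excess = (co - expected) / np.sqrt(expected + 1)  # crude excess score
excess[chrom[:, None] == chrom[None, :]] = 0 # keep cross-chromosome pairs

score = (excess > 5).sum(axis=1)             # strongly linked partners
print(np.nonzero(score >= 3)[0])             # should recover 3, 17, 29, 41
```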


2018
Vol 10 (12)
pp. 4480
Author(s): Na Zhang, Yu Yang, Jianxin Wang, Baodong Li, Jiafu Su

Changes in customer needs are unavoidable during the design process of complex mechanical products and may have severely negative impacts on product design, such as extra costs and delays. An effective way to prevent and reduce these negative impacts is to evaluate and manage the core parts of the product. Therefore, in this paper, a modified Dempster-Shafer (D-S) evidential approach is proposed for identifying the core parts. Firstly, an undirected weighted network model is constructed to systematically describe the product structure. Secondly, a modified D-S evidential approach is proposed to evaluate the core parts systematically, taking into account node degrees, node weights, node positions, and global network information. Finally, the core parts of a wind turbine are evaluated to illustrate the effectiveness of the proposed method. The results show that the modified D-S evidential approach identifies core parts better than the node degree, node betweenness, and node closeness centrality measures.
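
The evidence-fusion step rests on Dempster's rule of combination. Below is a textbook sketch of that rule, not the paper's modified version, with hypothetical basic probability assignments standing in for evidence derived from two centrality scores:

```python
# Combine basic probability assignments (BPAs) over the frame
# {core, peripheral} with Dempster's rule: multiply masses on every pair
# of focal sets, accumulate mass on non-empty intersections, and
# renormalize by the non-conflicting mass.
from itertools import product

FRAME = frozenset({"core", "peripheral"})

def dempster(m1, m2):
    """Combine two BPAs (dict: frozenset -> mass) with Dempster's rule."""
    combined, conflict = {}, 0.0
    for (a, x), (b, y) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + x * y
        else:
            conflict += x * y
    return {s: v / (1 - conflict) for s, v in combined.items()}

# Hypothetical BPAs for one part, e.g. from normalized degree and
# betweenness scores, with leftover mass on the whole frame (ignorance).
m_degree = {frozenset({"core"}): 0.6, FRAME: 0.4}
m_between = {frozenset({"core"}): 0.5,
             frozenset({"peripheral"}): 0.2, FRAME: 0.3}
print(dempster(m_degree, m_between))  # mass on {core} rises to ~0.77
```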


2020
Vol 36 (16)
pp. 4519-4520
Author(s): Ying Zhou, Sharon R Browning, Brian L Browning

Motivation: Estimation of pairwise kinship coefficients in large datasets is computationally challenging because the number of pairs of individuals increases quadratically with sample size. Results: We present IBDkin, a software package written in C for estimating kinship coefficients from identity-by-descent (IBD) segments. We use IBDkin to estimate kinship coefficients for 7.95 billion pairs of individuals in the UK Biobank who share at least one detected IBD segment with length ≥ 4 cM. Availability and implementation: https://github.com/YingZhou001/IBDkin. Supplementary information: Supplementary data are available at Bioinformatics online.
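
The underlying estimator can be sketched from the standard relation between kinship and IBD sharing, φ = (0.25·L_IBD1 + 0.5·L_IBD2)/L_genome. This is a back-of-the-envelope version; IBDkin itself handles IBD coverage, segment merging and more, and the genome map length used here is an assumption:

```python
# Kinship coefficient from total IBD segment lengths, where L_IBD1 and
# L_IBD2 are the total lengths (cM) covered by one or two shared
# segments, respectively.

def kinship(ibd1_cm, ibd2_cm, genome_cm=3545.0):
    """genome_cm: approximate autosomal map length (an assumption)."""
    return (0.25 * ibd1_cm + 0.5 * ibd2_cm) / genome_cm

# Full siblings share roughly half the genome IBD1 and a quarter IBD2,
# giving a kinship near the expected 0.25.
print(kinship(ibd1_cm=0.5 * 3545.0, ibd2_cm=0.25 * 3545.0))  # ~0.25
```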

