Parallel graph Laplacian for large datasets

In this article, I discuss three statistical tools that have proven pivotal in linguistic research, particularly those studies that seek to evaluate large datasets. These tools are the Gaussian Curve, significance tests, and hierarchical clustering. I present a brief description of these tools and their general uses. Then, I apply them to an analysis of the variations between the “biblical” DSS and our other witnesses, focusing upon variations involving particles. Finally, I engage the recent debate surrounding the diachronic study of Biblical Hebrew. This article serves a dual function. First, it presents statistical tools that are useful for many linguistic studies. Second, it develops an analysis of the he-locale, as it is used in the “biblical” Dead Sea Scrolls, Masoretic Text, and Samaritan Pentateuch. Through that analysis, this article highlights the value of inferential statistical tools as we attempt to better understand the Hebrew of our ancient witnesses.


mmpdb: An Open Source Matched Molecular Pair Platform for Large Multi-Property Datasets

10.26434/chemrxiv.5999375 ◽

2018 ◽

Author(s):

Andrew Dalke ◽

Jerome Hert ◽

Christian Kramer

Keyword(s):

Open Source ◽

Large Datasets ◽

Molecular Pair ◽

New Algorithms

We present mmpdb, an open source Matched Molecular Pair (MMP) platform to create, compile, store, retrieve, and use MMP rules. mmpdb is suitable for the large datasets typically found in pharmaceutical and agrochemical companies and provides new algorithms for fragment canonicalization and stereochemistry handling. The platform is written in Python and based on the RDKit toolkit. mmpdb is freely available.


Fast Learning of Generalized Minimum Enclosing Ball for Large Datasets

ACTA AUTOMATICA SINICA ◽

10.3724/sp.j.1004.2012.01831 ◽

2012 ◽

Vol 38 (11) ◽

pp. 1831

Author(s):

Wen-Jun HU ◽

Shi-Tong WANG ◽

Juan WANG ◽

Wen-Hao YING

Keyword(s):

Large Datasets ◽

Fast Learning ◽

Minimum Enclosing Ball


Variable Reduction and Variable Selection Methods Using Small, Medium and Large Datasets: A Forecast Comparison for the PEEIs

SSRN Electronic Journal ◽

10.2139/ssrn.2444421 ◽

2014 ◽

Author(s):

George Kapetanios ◽

Massimiliano Giuseppe Marcellino ◽

Fotis Papailias

Keyword(s):

Variable Selection ◽

Large Datasets ◽

Selection Methods ◽

Variable Reduction ◽

Forecast Comparison


Mathematics of networks

10.1093/oso/9780198805090.003.0006 ◽

2018 ◽

Author(s):

Mark Newman

Keyword(s):

Random Walks ◽

Adjacency Matrix ◽

Graph Partitioning ◽

Dynamic Networks ◽

Graph Laplacian ◽

Network Visualization ◽

Final Part ◽

Bipartite Networks ◽

Component Structure ◽

Basic Properties

An introduction to the mathematical tools used in the study of networks. Topics discussed include: the adjacency matrix; weighted, directed, acyclic, and bipartite networks; multilayer and dynamic networks; trees; planar networks. Some basic properties of networks are then discussed, including degrees, density and sparsity, paths on networks, component structure, and connectivity and cut sets. The final part of the chapter focuses on the graph Laplacian and its applications to network visualization, graph partitioning, the theory of random walks, and other problems.
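As a minimal illustration of the graph Laplacian mentioned in this abstract, the sketch below builds L = D - A for a small undirected graph and checks two standard properties: every row sums to zero, and the quadratic form x^T L x equals the sum of squared differences across edges. The function names and the example graph are assumptions made for this sketch and are not taken from the chapter.

```python
def graph_laplacian(n, edges):
    """Build L = D - A for an undirected simple graph on vertices 0..n-1."""
    lap = [[0] * n for _ in range(n)]
    for u, v in edges:
        lap[u][u] += 1   # degree contribution (D)
        lap[v][v] += 1
        lap[u][v] -= 1   # adjacency contribution (-A)
        lap[v][u] -= 1
    return lap


def quadratic_form(lap, x):
    """Compute x^T L x by direct summation."""
    n = len(x)
    return sum(x[i] * lap[i][j] * x[j] for i in range(n) for j in range(n))


# Hypothetical example: a 4-cycle with one chord.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
lap = graph_laplacian(4, edges)
x = [1, -2, 0, 3]

row_sums = [sum(row) for row in lap]
edge_sum = sum((x[u] - x[v]) ** 2 for u, v in edges)

print("row sums:", row_sums)
print("x^T L x:", quadratic_form(lap, x), "edge sum:", edge_sum)
```

The identity between the quadratic form and the edge sum is what makes the Laplacian useful for graph partitioning: a vector that takes similar values on connected vertices gives a small value of x^T L x.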


Unsupervised dimensionality reduction for very large datasets: Are we going to the right direction?

Knowledge-Based Systems ◽

10.1016/j.knosys.2020.105777 ◽

2020 ◽

Vol 196 ◽

pp. 105777

Author(s):

Jadson Jose Monteiro Oliveira ◽

Robson Leonardo Ferreira Cordeiro

Keyword(s):

Dimensionality Reduction ◽

Large Datasets ◽

Very Large Datasets ◽

The Right


Country of Origin Effects on the Average Annual Values of NHL Player Contracts

International Journal of Financial Studies ◽

10.3390/ijfs7020024 ◽

2019 ◽

Vol 7 (2) ◽

pp. 24

Author(s):

Aju J. Fenn ◽

Lucas Gerdes ◽

Samuel Rothstein

Keyword(s):

Quantile Regression ◽

Fixed Effects ◽

Country Of Origin ◽

Large Datasets ◽

National Hockey League ◽

Dummy Variables ◽

Country Of Origin Effects ◽

Performance Statistics ◽

Career Performance ◽

Using Data

Using data from 2005 to 2016, this paper examines whether players in the National Hockey League (NHL) are being paid a positive differential for their services due to competition from the Kontinental Hockey League (KHL) and the Swedish Hockey League (SHL). In order to control for performance, we use two different large datasets (N = 4046 and N = 1717). In keeping with the existing literature, we use lagged performance statistics and dummy variables to control for the type of NHL contract. The first dataset contains lagged career performance statistics, while the performance statistics in the second dataset are based on the statistics generated during the years under the player’s previous contract. Fixed effects least squares (FELS) and quantile regression results suggest that player production statistics, contract status, and country of origin are significant determinants of NHL player salaries.


Spectral Gap of the Largest Eigenvalue of the Normalized Graph Laplacian

Communications in Mathematics and Statistics ◽

10.1007/s40304-020-00222-7 ◽

2021 ◽

Author(s):

Jürgen Jost ◽

Raffaella Mulas ◽

Florentin Münch

Keyword(s):

Lower Bound ◽

Bipartite Graph ◽

Spectral Gap ◽

Complete Bipartite Graph ◽

Graph Laplacian ◽

New Method ◽

Single Edge ◽

Largest Eigenvalue ◽

Complement Graph ◽

The Largest Eigenvalue

We offer a new method for proving that the maximal eigenvalue of the normalized graph Laplacian of a graph with n vertices is at least $$\frac{n+1}{n-1}$$ provided the graph is not complete, and that equality is attained if and only if the complement graph is a single edge or a complete bipartite graph with both parts of size $$\frac{n-1}{2}$$. With the same method, we also prove a new lower bound to the largest eigenvalue in terms of the minimum vertex degree, provided this is at most $$\frac{n-1}{2}$$.
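The bound (n+1)/(n-1) stated in this abstract can be checked numerically on small graphs. The sketch below is not the authors' proof method; it assumes the symmetric normalized Laplacian I - D^{-1/2} A D^{-1/2} and estimates its largest eigenvalue with plain power iteration from a seeded random start. All function names are hypothetical. The example graphs are complete graphs with a single edge removed, which is one of the equality cases named in the abstract.

```python
import math
import random


def normalized_laplacian(adj):
    """Return I - D^{-1/2} A D^{-1/2} for a 0/1 adjacency matrix."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    if min(deg) == 0:
        raise ValueError("isolated vertices are not supported in this sketch")
    return [
        [(1.0 if i == j else 0.0) - adj[i][j] / math.sqrt(deg[i] * deg[j])
         for j in range(n)]
        for i in range(n)
    ]


def largest_eigenvalue(mat, iters=2000, seed=0):
    """Estimate the top eigenvalue of a symmetric PSD matrix by power iteration."""
    rng = random.Random(seed)
    n = len(mat)
    x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        y = [sum(mat[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    # Rayleigh quotient of the (unit-length) iterate.
    y = [sum(mat[i][j] * x[j] for j in range(n)) for i in range(n)]
    return sum(a * b for a, b in zip(x, y))


def complete_minus_edge(n):
    """Adjacency matrix of K_n with the edge {0, 1} removed."""
    return [
        [1 if i != j and {i, j} != {0, 1} else 0 for j in range(n)]
        for i in range(n)
    ]


for n in range(4, 8):
    lam = largest_eigenvalue(normalized_laplacian(complete_minus_edge(n)))
    print(n, round(lam, 6), round((n + 1) / (n - 1), 6))
```

For these graphs the two printed numbers should agree up to rounding. Power iteration is used only to keep the sketch dependency-free; a dense symmetric eigensolver would be the usual choice for anything larger.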


Quantitative disease risk scores from EHR with applications to clinical risk stratification and genetic studies

npj Digital Medicine ◽

10.1038/s41746-021-00488-3 ◽

2021 ◽

Vol 4 (1) ◽

Author(s):

Danqing Xu ◽

Chen Wang ◽

Atlas Khan ◽

Ning Shang ◽

Zihuai He ◽

...

Keyword(s):

Risk Stratification ◽

Disease Risk ◽

Association Studies ◽

Large Datasets ◽

Risk Scores ◽

Sequencing Data ◽

Case Definitions ◽

Phenotypic Data ◽

Clinical Risk ◽

Phenotypic Features

Labeling clinical data from electronic health records (EHR) in health systems requires extensive knowledge from human experts and painstaking review by clinicians. Furthermore, existing phenotyping algorithms are not uniformly applied across large datasets and can suffer from inconsistencies in case definitions across different algorithms. We describe here quantitative disease risk scores based on almost unsupervised methods that require minimal input from clinicians, can be applied to large datasets, and alleviate some of the main weaknesses of existing phenotyping algorithms. We show applications to phenotypic data on approximately 100,000 individuals in eMERGE, and focus on several complex diseases, including Chronic Kidney Disease, Coronary Artery Disease, Type 2 Diabetes, Heart Failure, and a few others. We demonstrate that relative to existing approaches, the proposed methods have higher prediction accuracy, can better identify phenotypic features relevant to the disease under consideration, can perform better at clinical risk stratification, and can identify undiagnosed cases based on phenotypic features available in the EHR. Using genetic data from the eMERGE-seq panel that includes sequencing data for 109 genes on 21,363 individuals from multiple ethnicities, we also show how the new quantitative disease risk scores help improve the power of genetic association studies relative to the standard use of disease phenotypes. The results demonstrate the effectiveness of quantitative disease risk scores derived from rich phenotypic EHR databases to provide a more meaningful characterization of clinical risk for diseases of interest beyond the prevalent binary (case-control) classification.
