Multivariate return periods in hydrology: a critical and practical review focusing on synthetic design hydrograph estimation

Abstract. Most of the hydrological and hydraulic studies refer to the notion of a return period to quantify design variables. When dealing with multiple design variables, the well-known univariate statistical analysis is no longer satisfactory, and several issues challenge the practitioner. How should one incorporate the dependence between variables? How should a multivariate return period be defined and applied in order to yield a proper design event? In this study an overview of the state of the art for estimating multivariate design events is given and the different approaches are compared. The construction of multivariate distribution functions is done through the use of copulas, given their practicality in multivariate frequency analyses and their ability to model numerous types of dependence structures in a flexible way. A synthetic case study is used to generate a large data set of simulated discharges that is used for illustrating the effect of different modelling choices on the design events. Based on different uni- and multivariate approaches, the design hydrograph characteristics of a 3-D phenomenon composed of annual maximum peak discharge, its volume, and duration are derived. These approaches are based on regression analysis, bivariate conditional distributions, bivariate joint distributions and Kendall distribution functions, highlighting theoretical and practical issues of multivariate frequency analysis. Also an ensemble-based approach is presented. For a given design return period, the approach chosen clearly affects the calculated design event, and much attention should be given to the choice of the approach used as this depends on the real-world problem at hand.
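To make the difference between the joint return period definitions concrete, the following is a minimal sketch and not the authors' implementation. It assumes a Gumbel-Hougaard copula with a hypothetical dependence parameter `theta` (the abstract does not name a copula family) and a mean inter-arrival time `mu` of one year, as is usual for annual maxima. The function names are illustrative. It evaluates three commonly used definitions for the same pair of marginal non-exceedance probabilities: the "OR" return period, the "AND" return period, and the Kendall return period based on the Kendall distribution function.

```python
import math


def gumbel_copula(u, v, theta):
    """Gumbel-Hougaard copula C(u, v); theta >= 1, theta = 1 gives independence."""
    s = (-math.log(u)) ** theta + (-math.log(v)) ** theta
    return math.exp(-(s ** (1.0 / theta)))


def kendall_gumbel(t, theta):
    """Kendall distribution function K_C(t) = t - t*ln(t)/theta for the Gumbel copula."""
    return t - t * math.log(t) / theta


def return_periods(u, v, theta, mu=1.0):
    """Joint return periods (in units of mu) for marginal probabilities u and v."""
    c = gumbel_copula(u, v, theta)
    return {
        # at least one variable exceeds its threshold
        "OR": mu / (1.0 - c),
        # both variables exceed their thresholds
        "AND": mu / (1.0 - u - v + c),
        # event lies beyond the copula level curve C(u, v) = c
        "KENDALL": mu / (1.0 - kendall_gumbel(c, theta)),
    }


# Hypothetical example: both margins at their univariate 100-year level.
rp = return_periods(0.99, 0.99, theta=2.0)
for name, value in rp.items():
    print(f"{name}: {value:.1f} years")
```

For this assumed setting the three definitions give clearly different return periods for the same event, which is in line with the abstract's point that the chosen approach affects the resulting design event.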

Joint return periods in hydrology: a critical and practical review focusing on synthetic design hydrograph estimation

Hydrology and Earth System Sciences Discussions ◽

10.5194/hessd-9-6781-2012 ◽

2012 ◽

Vol 9 (5) ◽

pp. 6781-6828 ◽

Cited By ~ 15

Author(s):

S. Vandenberghe ◽

M. J. van den Berg ◽

B. Gräler ◽

A. Petroselli ◽

S. Grimaldi ◽

...

Keyword(s):

Frequency Analysis ◽

Return Period ◽

Three Dimensional ◽

Distribution Functions ◽

Return Periods ◽

Design Variables ◽

Joint Return ◽

Design Values ◽

Multivariate Frequency Analysis ◽

Joint Return Period

Abstract. Most of the hydrological and hydraulic studies refer to the notion of a return period to quantify design variables. When dealing with multiple design variables, the well-known univariate statistical analysis is no longer satisfactory and several issues challenge the practitioner. How should one incorporate the dependence between variables? How should the joint return period be defined and applied? In this study, an overview of the state of the art for defining joint return periods is given. The construction of multivariate distribution functions is done through the use of copulas, given their practicality in multivariate frequency analysis and their ability to model numerous types of dependence structures in a flexible way. A case study focusing on the selection of design hydrograph characteristics is presented and the design values of a three-dimensional phenomenon composed of peak discharge, volume and duration are derived. Joint return period methods based on regression analysis, bivariate conditional distributions, bivariate joint distributions, and Kendall distribution functions are investigated and compared, highlighting theoretical and practical issues of multivariate frequency analysis. Also an ensemble-based method is introduced. For a given design return period, the method chosen clearly affects the calculated design event. Finally, light is shed on the practical implications of a chosen method.

Some statistical and CI models to predict chaotic high-frequency financial data

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-189107 ◽

2020 ◽

Vol 39 (5) ◽

pp. 6419-6430

Author(s):

Dusan Marcek

Keyword(s):

Time Series Data ◽

Moving Average ◽

Methodological Approach ◽

Back Propagation ◽

Large Data ◽

Series Data ◽

Data Set ◽

Training Time ◽

Optimal Population ◽

Forecast Time

To forecast time series data, two methodological frameworks of statistical and computational intelligence modelling are considered. The statistical methodological approach is based on the theory of invertible ARIMA (Auto-Regressive Integrated Moving Average) models with the Maximum Likelihood (ML) estimation method. As a competitive tool to statistical forecasting models, we use the popular classic neural network (NN) of perceptron type. To train the NN, the Back-Propagation (BP) algorithm and heuristics such as genetic and micro-genetic algorithms (GA and MGA) are implemented on the large data set. A comparative analysis of selected learning methods is performed and evaluated. From the experiments performed, we find that the optimal population size will likely be 20, with the lowest training time among all NNs trained by the evolutionary algorithms, while the prediction accuracy is lower but still acceptable to managers.

In silico Prediction of Inhibitory Constant of Thrombin Inhibitors Using Machine Learning

Combinatorial Chemistry & High Throughput Screening ◽

10.2174/1386207322666181220130232 ◽

2019 ◽

Vol 21 (9) ◽

pp. 662-669 ◽

Cited By ~ 1

Author(s):

Junnan Zhao ◽

Lu Zhu ◽

Weineng Zhou ◽

Lingfeng Yin ◽

Yuchen Wang ◽

...

Keyword(s):

Machine Learning ◽

Prediction Models ◽

Regression Tree ◽

Large Data ◽

Thrombin Inhibitors ◽

Coagulation Cascade ◽

Gradient Boosting ◽

Support Vector ◽

Data Set ◽

Descriptor Selection

Background: Thrombin is the central protease of the vertebrate blood coagulation cascade, which is closely related to cardiovascular diseases. The inhibitory constant Ki is the most significant property of thrombin inhibitors. Method: This study was carried out to predict Ki values of thrombin inhibitors based on a large data set by using machine learning methods. Because it can find non-intuitive regularities in high-dimensional data sets, machine learning can be used to build effective predictive models. A total of 6554 descriptors for each compound were collected and an efficient descriptor selection method was chosen to find the appropriate descriptors. Four different methods including multiple linear regression (MLR), K Nearest Neighbors (KNN), Gradient Boosting Regression Tree (GBRT) and Support Vector Machine (SVM) were implemented to build prediction models with these selected descriptors. Results: The SVM model was the best one among these methods with R2=0.84, MSE=0.55 for the training set and R2=0.83, MSE=0.56 for the test set. Several validation methods, such as the y-randomization test and applicability domain evaluation, were adopted to assess the robustness and generalization ability of the model. The final model shows excellent stability and predictive ability and can be employed for rapid estimation of the inhibitory constant, which is helpful for designing novel thrombin inhibitors.
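The R2 and MSE figures quoted above are standard regression metrics. The sketch below shows how they are usually computed from observed and predicted values. It is a generic illustration with made-up numbers, not the authors' evaluation code, and the function names are hypothetical.

```python
def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up observed and predicted values, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(f"MSE = {mse(observed, predicted):.3f}, R2 = {r2(observed, predicted):.3f}")
```

Reporting both metrics on the training set and on a held-out test set, as the abstract does, is what allows a reader to judge whether the model generalizes or merely fits the training data.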

Correlation between the structure and skin permeability of compounds

Scientific Reports ◽

10.1038/s41598-021-89587-5 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Ruolan Zeng ◽

Jiyong Deng ◽

Limin Dang ◽

Xinliang Yu

Keyword(s):

Large Data ◽

Qsar Model ◽

Coefficient Of Determination ◽

Support Vector ◽

Skin Permeability ◽

Data Set ◽

Test Set ◽

Svm Algorithm ◽

Svm Model ◽

Toxicity Relationship

Abstract. A three-descriptor quantitative structure–activity/toxicity relationship (QSAR/QSTR) model was developed for the skin permeability of a sufficiently large data set consisting of 274 compounds, by applying support vector machine (SVM) together with genetic algorithm. The optimal SVM model possesses the coefficient of determination R2 of 0.946 and root mean square (rms) error of 0.253 for the training set of 139 compounds; and an R2 of 0.872 and rms of 0.302 for the test set of 135 compounds. Compared with other models reported in the literature, our SVM model shows better statistical performance in a model that deals with more samples in the test set. Therefore, applying an SVM algorithm to develop a nonlinear QSAR model for skin permeability was achieved.

Do galactic bars depend on environment?: an information theoretic analysis of Galaxy Zoo 2

Monthly Notices of the Royal Astronomical Society ◽

10.1093/mnras/staa3665 ◽

2020 ◽

Vol 501 (1) ◽

pp. 994-1001

Author(s):

Suman Sarkar ◽

Biswajit Pandey ◽

Snehasish Bhattacharjee

Keyword(s):

Spatial Distribution ◽

Mutual Information ◽

Local Density ◽

Statistical Significance ◽

Distribution Functions ◽

Cumulative Distribution ◽

Host Galaxy ◽

Data Sets ◽

Data Set ◽

Information Theoretic

ABSTRACT We use an information theoretic framework to analyse data from the Galaxy Zoo 2 project and study if there are any statistically significant correlations between the presence of bars in spiral galaxies and their environment. We measure the mutual information between the barredness of galaxies and their environments in a volume limited sample (Mr ≤ −21) and compare it with the same in data sets where (i) the bar/unbar classifications are randomized and (ii) the spatial distribution of galaxies are shuffled on different length scales. We assess the statistical significance of the differences in the mutual information using a t-test and find that both randomization of morphological classifications and shuffling of spatial distribution do not alter the mutual information in a statistically significant way. The non-zero mutual information between the barredness and environment arises due to the finite and discrete nature of the data set that can be entirely explained by mock Poisson distributions. We also separately compare the cumulative distribution functions of the barred and unbarred galaxies as a function of their local density. Using a Kolmogorov–Smirnov test, we find that the null hypothesis cannot be rejected even at 75 per cent confidence level. Our analysis indicates that environments do not play a significant role in the formation of a bar, which is largely determined by the internal processes of the host galaxy.
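The comparison between measured and randomized mutual information can be illustrated with a small sketch. It is not the authors' pipeline: the labels are synthetic, the variable names are hypothetical, and the environment is reduced to a two-valued label for simplicity. It computes the mutual information between two discrete label sequences and then repeats the measurement after shuffling one of them, which destroys any real association while keeping the sample finite and discrete.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equally long discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


# Synthetic labels: bar flag (0/1) and a two-valued environment label,
# drawn independently of each other by construction.
rng = random.Random(0)
barred = [rng.randint(0, 1) for _ in range(500)]
environment = [rng.randint(0, 1) for _ in range(500)]

mi_observed = mutual_information(barred, environment)

# Shuffle baseline: randomize the bar classifications and measure again.
shuffled = barred[:]
rng.shuffle(shuffled)
mi_shuffled = mutual_information(shuffled, environment)

print(f"observed MI = {mi_observed:.5f} bits, shuffled MI = {mi_shuffled:.5f} bits")
```

Even for independent labels the estimate is generally slightly above zero in a finite sample, which is why a comparison against a randomized baseline is needed before a non-zero value can be read as a physical correlation.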

Galaxy spin direction distribution in HST and SDSS show similar large-scale asymmetry

Publications of the Astronomical Society of Australia ◽

10.1017/pasa.2020.46 ◽

2020 ◽

Vol 37 ◽

Author(s):

Lior Shamir

Keyword(s):

Large Scale ◽

Spiral Galaxies ◽

Hubble Space Telescope ◽

Gravitational Interaction ◽

Large Data ◽

Sloan Digital Sky Survey ◽

Data Sets ◽

Dipole Axis ◽

Data Set ◽

The Asymmetry

Abstract Several recent observations using large data sets of galaxies showed non-random distribution of the spin directions of spiral galaxies, even when the galaxies are too far from each other to have gravitational interaction. Here, a data set of $\sim8.7\cdot10^3$ spiral galaxies imaged by Hubble Space Telescope (HST) is used to test and profile a possible asymmetry between galaxy spin directions. The asymmetry between galaxies with opposite spin directions is compared to the asymmetry of galaxies from the Sloan Digital Sky Survey. The two data sets contain different galaxies at different redshift ranges, and each data set was annotated using a different annotation method. The results show that both data sets show a similar asymmetry in the COSMOS field, which is covered by both telescopes. Fitting the asymmetry of the galaxies to cosine dependence shows a dipole axis with probabilities of $\sim2.8\sigma$ and $\sim7.38\sigma$ in HST and SDSS, respectively. The most likely dipole axis identified in the HST galaxies is at $(\alpha=78^\circ,\delta=47^\circ)$ and is well within the $1\sigma$ error range compared to the location of the most likely dipole axis in the SDSS galaxies with $z>0.15$, identified at $(\alpha=71^\circ,\delta=61^\circ)$.

The Complete DNA Sequence of the Mitochondrial Genome of a “Living Fossil,” the Coelacanth (Latimeria chalumnae)

Genetics ◽

10.1093/genetics/146.3.995 ◽

1997 ◽

Vol 146 (3) ◽

pp. 995-1010 ◽

Cited By ~ 1

Author(s):

Rafael Zardoya ◽

Axel Meyer

Keyword(s):

Mitochondrial Genome ◽

Tandem Repeats ◽

Phylogenetic Analyses ◽

Large Data ◽

Molecular Data ◽

Phylogenetic Position ◽

Data Set ◽

Living Fossil ◽

Latimeria Chalumnae ◽

Relationship Of

The complete nucleotide sequence of the 16,407-bp mitochondrial genome of the coelacanth (Latimeria chalumnae) was determined. The coelacanth mitochondrial genome order is identical to the consensus vertebrate gene order which is also found in all ray-finned fishes, the lungfish, and most tetrapods. Base composition and codon usage also conform to typical vertebrate patterns. The entire mitochondrial genome was PCR-amplified with 24 sets of primers that are expected to amplify homologous regions in other related vertebrate species. Analyses of the control region of the coelacanth mitochondrial genome revealed the existence of four 22-bp tandem repeats close to its 3′ end. The phylogenetic analyses of a large data set combining genes coding for rRNAs, tRNA, and proteins (16,140 characters) confirmed the phylogenetic position of the coelacanth as a lobe-finned fish; it is more closely related to tetrapods than to ray-finned fishes. However, different phylogenetic methods applied to this largest available molecular data set were unable to resolve unambiguously the relationship of the coelacanth to the two other groups of extant lobe-finned fishes, the lungfishes and the tetrapods. Maximum parsimony favored a lungfish/coelacanth or a lungfish/tetrapod sistergroup relationship depending on which transversion:transition weighting is assumed. Neighbor-joining and maximum likelihood supported a lungfish/tetrapod sistergroup relationship.

Climatology of nutrient distributions in the South China Sea based on a large data set derived from a new algorithm

Progress In Oceanography ◽

10.1016/j.pocean.2021.102586 ◽

2021 ◽

pp. 102586

Author(s):

Chuanjun Du ◽

Ruoying He ◽

Zhiyu Liu ◽

Tao Huang ◽

Lifang Wang ◽

...

Keyword(s):

South China Sea ◽

South China ◽

Large Data ◽

The South China Sea ◽

The South ◽

Data Set ◽

China Sea ◽

Large Data Set

Spike detection: Inter-reader agreement and a statistical Turing test on a large data set

Clinical Neurophysiology ◽

10.1016/j.clinph.2016.11.005 ◽

2017 ◽

Vol 128 (1) ◽

pp. 243-250 ◽

Cited By ~ 55

Author(s):

Mark L. Scheuer ◽

Anto Bagic ◽

Scott B. Wilson

Keyword(s):

Large Data ◽

Turing Test ◽

Spike Detection ◽

Data Set ◽

Large Data Set

Horizontal interactions in local personal income taxes

The Annals of Regional Science ◽

10.1007/s00168-020-01039-6 ◽

2021 ◽

Author(s):

Johan Lundberg

Keyword(s):

Population Size ◽

Political Representation ◽

Large Data ◽

Personal Income ◽

Set Covering ◽

Tax Rates ◽

Data Set ◽

Relative Population ◽

Tax Rate ◽

Local Council

Abstract. Theories of inter-jurisdictional tax and yardstick competition assume that the tax decisions of one jurisdiction will influence the tax decisions of other jurisdictions. This paper empirically addresses the issue of horizontal dependence in local personal income tax rates across jurisdictions. Based on a large data set covering Swedish municipalities over a period of 14 years, we test for interactions across municipalities that share a common border, across municipalities within a distance of 100 km of each other, and across municipalities with similar political representation in the local council. We also test the hypothesis that the tax rate of relatively larger municipalities has a greater influence on their neighbors' tax rate compared to the influence of their smaller neighbors. Our results suggest that when lagged tax rates are controlled for, the horizontal correlation across municipalities that share a common border or are within a distance of 100 km from each other becomes insignificant. This result is of importance as it suggests that lagged tax rates should be included or at least tested for when testing for horizontal interactions or mimicking in local tax rates. However, our results support the hypothesis of horizontal interactions across municipalities that share a common border when the influence of neighboring municipalities is also weighted by their relative population size, i.e. relatively larger neighbors tend to have a greater impact on their neighbor's tax rates than their relatively smaller neighbors. This is of importance as it suggests that distance or proximity matters, although only in combination with the relative population size. We also find some evidence of horizontal dependence across municipalities with similar political preferences.
