Accelerating the HyperLogLog Cardinality Estimation Algorithm

In recent years, vast amounts of data of many kinds have been generated, from pictures and videos from our cameras to software logs from sensor networks and Internet routers operating day and night. This has led to new big data problems, which require new algorithms that can handle these volumes and are therefore very computationally demanding. In this paper, we parallelize one of these algorithms, HyperLogLog, which estimates the number of distinct items in a large data set with minimal memory usage, lowering the typical memory usage of this type of calculation from O(n) to O(1). We have implemented parallelizations based on OpenMP and OpenCL and evaluated them on a standard multicore system, an Intel Xeon Phi, and two GPUs from different vendors. The results of our experiments, in which we reach a speedup of 88.6 with respect to an optimized sequential implementation, are very positive, particularly given the need to run this kind of algorithm on large amounts of data.
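To make the idea in this abstract concrete, the following is a minimal sequential sketch of a HyperLogLog estimator in Python. It is not the authors' OpenMP or OpenCL implementation: the class name `HyperLogLogSketch`, the choice of SHA-1 as the hash, and the precision `p = 12` are all assumptions made for illustration. The `merge` method shows why the algorithm lends itself to parallelization in general: sketches built independently over disjoint parts of the data can be combined with an element-wise maximum.

```python
import hashlib
import math


class HyperLogLogSketch:
    """Illustrative HyperLogLog estimator (hypothetical name, not from the paper)."""

    def __init__(self, p=12):
        # p index bits give m = 2**p registers; memory is fixed regardless of input size.
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        # Standard bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # Assumption: SHA-1 truncated to 64 bits serves as the hash function.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                 # first p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits (1-based).
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Element-wise maximum combines sketches built over separate chunks.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def estimate(self):
        # Harmonic-mean based raw estimate.
        z = sum(2.0 ** -r for r in self.registers)
        e = self.alpha * self.m * self.m / z
        # Small-range correction: fall back to linear counting.
        zeros = self.registers.count(0)
        if e <= 2.5 * self.m and zeros:
            e = self.m * math.log(self.m / zeros)
        return e


if __name__ == "__main__":
    sketch = HyperLogLogSketch()
    for i in range(100_000):
        sketch.add(f"item-{i}")
    print(round(sketch.estimate()))
```

With 2^12 registers the expected relative error is roughly 1.04 / sqrt(4096), about 1.6%, while the sketch itself stays the same size no matter how many items are added.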


Some statistical and CI models to predict chaotic high-frequency financial data

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-189107 ◽

2020 ◽

Vol 39 (5) ◽

pp. 6419-6430

Author(s):

Dusan Marcek

Keyword(s):

Time Series Data ◽

Moving Average ◽

Methodological Approach ◽

Back Propagation ◽

Large Data ◽

Series Data ◽

Data Set ◽

Training Time ◽

Optimal Population ◽

Forecast Time

To forecast time series data, two methodological frameworks, statistical and computational intelligence modelling, are considered. The statistical approach is based on the theory of invertible ARIMA (Auto-Regressive Integrated Moving Average) models with the Maximum Likelihood (ML) estimation method. As a competitor to statistical forecasting models, we use the popular classic neural network (NN) of perceptron type. To train the NN, the Back-Propagation (BP) algorithm and heuristics such as the genetic and micro-genetic algorithms (GA and MGA) are applied to the large data set. A comparative analysis of the selected learning methods is performed and evaluated. From the experiments we find that the optimal population size is likely 20, which gives the lowest training time among all NNs trained by the evolutionary algorithms, while the prediction accuracy is lower but still acceptable to managers.


In silico Prediction of Inhibitory Constant of Thrombin Inhibitors Using Machine Learning

Combinatorial Chemistry & High Throughput Screening ◽

10.2174/1386207322666181220130232 ◽

2019 ◽

Vol 21 (9) ◽

pp. 662-669 ◽

Cited By ~ 1

Author(s):

Junnan Zhao ◽

Lu Zhu ◽

Weineng Zhou ◽

Lingfeng Yin ◽

Yuchen Wang ◽

...

Keyword(s):

Machine Learning ◽

Prediction Models ◽

Regression Tree ◽

Large Data ◽

Thrombin Inhibitors ◽

Coagulation Cascade ◽

Gradient Boosting ◽

Support Vector ◽

Data Set ◽

Descriptor Selection

Background: Thrombin is the central protease of the vertebrate blood coagulation cascade, which is closely related to cardiovascular diseases. The inhibitory constant Ki is the most significant property of thrombin inhibitors. Method: This study was carried out to predict Ki values of thrombin inhibitors based on a large data set by using machine learning methods. Because it can find non-intuitive regularities in high-dimensional datasets, machine learning can be used to build effective predictive models. A total of 6554 descriptors were collected for each compound, and an efficient descriptor selection method was chosen to find the appropriate descriptors. Four different methods, multiple linear regression (MLR), K Nearest Neighbors (KNN), Gradient Boosting Regression Tree (GBRT) and Support Vector Machine (SVM), were implemented to build prediction models with the selected descriptors. Results: The SVM model was the best among these methods, with R² = 0.84 and MSE = 0.55 for the training set and R² = 0.83 and MSE = 0.56 for the test set. Several validation methods, such as the y-randomization test and applicability domain evaluation, were adopted to assess the robustness and generalization ability of the model. The final model shows excellent stability and predictive ability and can be employed for rapid estimation of the inhibitory constant, which is helpful for designing novel thrombin inhibitors.
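For readers unfamiliar with the two figures of merit quoted in this abstract, the sketch below computes the coefficient of determination R² and the mean squared error (MSE) from observed and predicted values using their standard definitions. The function name `r2_and_mse` and the sample numbers are illustrative assumptions and are not taken from the study.

```python
def r2_and_mse(y_true, y_pred):
    """Return (R^2, MSE) for paired observed and predicted values.

    Hypothetical helper for illustration; uses the standard definitions
    R^2 = 1 - SS_res / SS_tot and MSE = SS_res / n.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    mean_y = sum(y_true) / n
    # Residual sum of squares: squared prediction errors.
    ss_res = sum((y - f) ** 2 for y, f in zip(y_true, y_pred))
    # Total sum of squares: variation of the observations around their mean.
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observed values are equal")
    return 1.0 - ss_res / ss_tot, ss_res / n


if __name__ == "__main__":
    # Made-up example values, not data from the paper.
    observed = [5.1, 6.3, 7.8, 4.9, 6.0]
    predicted = [5.4, 6.0, 7.5, 5.2, 6.1]
    r2, mse = r2_and_mse(observed, predicted)
    print(f"R^2 = {r2:.3f}, MSE = {mse:.3f}")
```

Similar R² values on the training and test sets, as reported above, are commonly read as a sign that a model is not badly overfitted.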


MILC Code Performance on High End CPU and GPU Supercomputer Clusters

EPJ Web of Conferences ◽

10.1051/epjconf/201817502009 ◽

2018 ◽

Vol 175 ◽

pp. 02009

Author(s):

Carleton DeTar ◽

Steven Gottlieb ◽

Ruizi Li ◽

Doug Toussaint

Keyword(s):

Conjugate Gradient ◽

Memory Hierarchy ◽

Xeon Phi ◽

Intel Xeon Phi ◽

Code Performance ◽

Recent Developments ◽

Knights Landing ◽

Many Core ◽

Intel Xeon

With recent developments in parallel supercomputing architecture, many-core, multi-core, and GPU processors are now commonplace, resulting in more levels of parallelism, memory hierarchy, and programming complexity. It has been necessary to adapt the MILC code to these new processors, starting with NVIDIA GPUs and, more recently, the Intel Xeon Phi processors. We report on our efforts to port and optimize our code for the Intel Knights Landing architecture. We consider the performance of the MILC code with MPI and OpenMP, and optimizations with QOPQDP and QPhiX. For the latter approach, we concentrate on the staggered conjugate gradient and gauge force. We also consider performance on recent NVIDIA GPUs using the QUDA library.


Sequential Monte Carlo based parameter estimation for structural health monitoring with an Intel Xeon Phi optimized ultrasound kernel

10.1063/1.5099739 ◽

2019 ◽

Author(s):

William C. Schneck ◽

Heather Reed ◽

Elizabeth D. Gregory ◽

Cara A. C. Leckey

Keyword(s):

Monte Carlo ◽

Parameter Estimation ◽

Structural Health Monitoring ◽

Health Monitoring ◽

Sequential Monte Carlo ◽

Xeon Phi ◽

Intel Xeon Phi ◽

Structural Health ◽

Intel Xeon


Correlation between the structure and skin permeability of compounds

Scientific Reports ◽

10.1038/s41598-021-89587-5 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Ruolan Zeng ◽

Jiyong Deng ◽

Limin Dang ◽

Xinliang Yu

Keyword(s):

Large Data ◽

Qsar Model ◽

Coefficient Of Determination ◽

Support Vector ◽

Skin Permeability ◽

Data Set ◽

Test Set ◽

Svm Algorithm ◽

Svm Model ◽

Toxicity Relationship

A three-descriptor quantitative structure–activity/toxicity relationship (QSAR/QSTR) model was developed for the skin permeability of a sufficiently large data set consisting of 274 compounds, by applying a support vector machine (SVM) together with a genetic algorithm. The optimal SVM model has a coefficient of determination R² of 0.946 and a root mean square (rms) error of 0.253 for the training set of 139 compounds, and an R² of 0.872 and an rms error of 0.302 for the test set of 135 compounds. Compared with other models reported in the literature, our SVM model shows better statistical performance while handling more samples in the test set. Thus, an SVM algorithm was successfully applied to develop a nonlinear QSAR model for skin permeability.


Galaxy spin direction distribution in HST and SDSS show similar large-scale asymmetry

Publications of the Astronomical Society of Australia ◽

10.1017/pasa.2020.46 ◽

2020 ◽

Vol 37 ◽

Author(s):

Lior Shamir

Keyword(s):

Large Scale ◽

Spiral Galaxies ◽

Hubble Space Telescope ◽

Gravitational Interaction ◽

Large Data ◽

Sloan Digital Sky Survey ◽

Data Sets ◽

Dipole Axis ◽

Data Set ◽

The Asymmetry

Several recent observations using large data sets of galaxies showed a non-random distribution of the spin directions of spiral galaxies, even when the galaxies are too far from each other to have gravitational interaction. Here, a data set of $\sim8.7\cdot10^3$ spiral galaxies imaged by the Hubble Space Telescope (HST) is used to test and profile a possible asymmetry between galaxy spin directions. The asymmetry between galaxies with opposite spin directions is compared to the asymmetry of galaxies from the Sloan Digital Sky Survey (SDSS). The two data sets contain different galaxies at different redshift ranges, and each data set was annotated using a different annotation method. The results show that both data sets exhibit a similar asymmetry in the COSMOS field, which is covered by both telescopes. Fitting the asymmetry of the galaxies to cosine dependence shows a dipole axis with statistical significance of $\sim2.8\sigma$ and $\sim7.38\sigma$ in HST and SDSS, respectively. The most likely dipole axis identified in the HST galaxies is at $(\alpha=78^{\circ},\delta=47^{\circ})$ and is well within the $1\sigma$ error range compared to the location of the most likely dipole axis in the SDSS galaxies with $z>0.15$, identified at $(\alpha=71^{\circ},\delta=61^{\circ})$.


The Complete DNA Sequence of the Mitochondrial Genome of a “Living Fossil,” the Coelacanth (Latimeria chalumnae)

Genetics ◽

10.1093/genetics/146.3.995 ◽

1997 ◽

Vol 146 (3) ◽

pp. 995-1010 ◽

Cited By ~ 1

Author(s):

Rafael Zardoya ◽

Axel Meyer

Keyword(s):

Mitochondrial Genome ◽

Tandem Repeats ◽

Phylogenetic Analyses ◽

Large Data ◽

Molecular Data ◽

Phylogenetic Position ◽

Data Set ◽

Living Fossil ◽

Latimeria Chalumnae ◽

Relationship Of

The complete nucleotide sequence of the 16,407-bp mitochondrial genome of the coelacanth (Latimeria chalumnae) was determined. The coelacanth mitochondrial genome order is identical to the consensus vertebrate gene order which is also found in all ray-finned fishes, the lungfish, and most tetrapods. Base composition and codon usage also conform to typical vertebrate patterns. The entire mitochondrial genome was PCR-amplified with 24 sets of primers that are expected to amplify homologous regions in other related vertebrate species. Analyses of the control region of the coelacanth mitochondrial genome revealed the existence of four 22-bp tandem repeats close to its 3′ end. The phylogenetic analyses of a large data set combining genes coding for rRNAs, tRNA, and proteins (16,140 characters) confirmed the phylogenetic position of the coelacanth as a lobe-finned fish; it is more closely related to tetrapods than to ray-finned fishes. However, different phylogenetic methods applied to this largest available molecular data set were unable to resolve unambiguously the relationship of the coelacanth to the two other groups of extant lobe-finned fishes, the lungfishes and the tetrapods. Maximum parsimony favored a lungfish/coelacanth or a lungfish/tetrapod sistergroup relationship depending on which transversion:transition weighting is assumed. Neighbor-joining and maximum likelihood supported a lungfish/tetrapod sistergroup relationship.


Climatology of nutrient distributions in the South China Sea based on a large data set derived from a new algorithm

Progress In Oceanography ◽

10.1016/j.pocean.2021.102586 ◽

2021 ◽

pp. 102586

Author(s):

Chuanjun Du ◽

Ruoying He ◽

Zhiyu Liu ◽

Tao Huang ◽

Lifang Wang ◽

...

Keyword(s):

South China Sea ◽

South China ◽

Large Data ◽

The South China Sea ◽

The South ◽

Data Set ◽

China Sea ◽

Large Data Set


Spike detection: Inter-reader agreement and a statistical Turing test on a large data set

Clinical Neurophysiology ◽

10.1016/j.clinph.2016.11.005 ◽

2017 ◽

Vol 128 (1) ◽

pp. 243-250 ◽

Cited By ~ 55

Author(s):

Mark L. Scheuer ◽

Anto Bagic ◽

Scott B. Wilson

Keyword(s):

Large Data ◽

Turing Test ◽

Spike Detection ◽

Data Set ◽

Large Data Set


Horizontal interactions in local personal income taxes

The Annals of Regional Science ◽

10.1007/s00168-020-01039-6 ◽

2021 ◽

Author(s):

Johan Lundberg

Keyword(s):

Population Size ◽

Political Representation ◽

Large Data ◽

Personal Income ◽

Set Covering ◽

Tax Rates ◽

Data Set ◽

Relative Population ◽

Tax Rate ◽

Local Council

Theories of inter-jurisdictional tax and yardstick competition assume that the tax decisions of one jurisdiction will influence the tax decisions of other jurisdictions. This paper empirically addresses the issue of horizontal dependence in local personal income tax rates across jurisdictions. Based on a large data set covering Swedish municipalities over a period of 14 years, we test for interactions across municipalities that share a common border, across municipalities within a distance of 100 km of each other, and across municipalities with similar political representation in the local council. We also test the hypothesis that the tax rates of relatively larger municipalities have a greater influence on their neighbors' tax rates than those of their smaller neighbors. Our results suggest that when lagged tax rates are controlled for, the horizontal correlation across municipalities that share a common border or are within 100 km of each other becomes insignificant. This result is important because it suggests that lagged tax rates should be included, or at least tested for, when testing for horizontal interactions or mimicking in local tax rates. However, our results support the hypothesis of horizontal interactions across municipalities that share a common border when the influence of neighboring municipalities is also weighted by their relative population size, i.e. relatively larger neighbors tend to have a greater impact on their neighbors' tax rates than relatively smaller neighbors. This is important because it suggests that distance or proximity matters, although only in combination with relative population size. We also find some evidence of horizontal dependence across municipalities with similar political preferences.
