scholarly journals Spanish electoral archive. SEA database

2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Virgilio Pérez ◽  
Cristina Aybar ◽  
Jose M. Pavía

AbstractThis paper introduces the SEA database (acronym for Spanish Electoral Archive). SEA brings together the most complete public repository available to date on Spanish election outcomes. SEA holds all the results recorded from the electoral processes of General (1979–2019), Regional (1989–2021), Local (1979–2019) and European Parliamentary (1987–2019) elections held in Spain since the restoration of democracy in the late 70 s, in addition to other data sets with electoral content. The data are offered for free and is presented in a homogeneous and friendly format. Most of the databases are available for download with data from various electoral levels, including from the ballot box level. This paper details how the information is organized, what the main variables are on offer for each election, which processes were applied to the data for their homogenization, and discusses future areas of work. This data has many applications, for example, as inputs in election prediction models and in ecological inference algorithms, to study determinants of turnout or voting, or for defining marketing strategies.

2021 ◽  
pp. 089443932110408
Author(s):  
Jose M. Pavía

Ecological inference models aim to infer individual-level relationships using aggregate data. They are routinely used to estimate voter transitions between elections, disclose split-ticket voting behaviors, or infer racial voting patterns in U.S. elections. A large number of procedures have been proposed in the literature to solve these problems; therefore, an assessment and comparison of them are overdue. The secret ballot however makes this a difficult endeavor since real individual data are usually not accessible. The most recent work on ecological inference has assessed methods using a very small number of data sets with ground truth, combined with artificial, simulated data. This article dramatically increases the number of real instances by presenting a unique database (available in the R package ei.Datasets) composed of data from more than 550 elections where the true inner-cell values of the global cross-classification tables are known. The article describes how the data sets are organized, details the data curation and data wrangling processes performed, and analyses the main features characterizing the different data sets.


Author(s):  
William A Freyman ◽  
Kimberly F McManus ◽  
Suyash S Shringarpure ◽  
Ethan M Jewett ◽  
Katarzyna Bryc ◽  
...  

Abstract Estimating the genomic location and length of identical-by-descent (IBD) segments among individuals is a crucial step in many genetic analyses. However, the exponential growth in the size of biobank and direct-to-consumer (DTC) genetic data sets makes accurate IBD inference a significant computational challenge. Here we present the templated positional Burrows-Wheeler transform (TPBWT) to make fast IBD estimates robust to genotype and phasing errors. Using haplotype data simulated over pedigrees with realistic genotyping and phasing errors we show that the TPBWT outperforms other state-of-the-art IBD inference algorithms in terms of speed and accuracy. For each phase-aware method, we explore the false positive and false negative rates of inferring IBD by segment length and characterize the types of error commonly found. Our results highlight the fragility of most phased IBD inference methods; the accuracy of IBD estimates can be highly sensitive to the quality of haplotype phasing. Additionally we compare the performance of the TPBWT against a widely used phase-free IBD inference approach that is robust to phasing errors. We introduce both in-sample and out-of-sample TPBWT-based IBD inference algorithms and demonstrate their computational efficiency on massive-scale datasets with millions of samples. Furthermore we describe the binary file format for TPBWT-compressed haplotypes that results in fast and efficient out-of-sample IBD computes against very large cohort panels. Finally, we demonstrate the utility of the TPBWT in a brief empirical analysis exploring geographic patterns of haplotype sharing within Mexico. Hierarchical clustering of IBD shared across regions within Mexico reveals geographically structured haplotype sharing and a strong signal of isolation by distance. Our software implementation of the TPBWT is freely available for non-commercial use in the code repository https://github.com/23andMe/phasedibd.


Author(s):  
Chen Lin ◽  
Xiaolin Shen ◽  
Si Chen ◽  
Muhua Zhu ◽  
Yanghua Xiao

The study of consumer psychology reveals two categories of consumption decision procedures: compensatory rules and non-compensatory rules. Existing recommendation models which are based on latent factor models assume the consumers follow the compensatory rules, i.e. they evaluate an item over multiple aspects and compute a weighted or/and summated score which is used to derive the rating or ranking of the item. However, it has been shown in the literature of consumer behavior that, consumers adopt non-compensatory rules more often than compensatory rules. Our main contribution in this paper is to study the unexplored area of utilizing non-compensatory rules in recommendation models.Our general assumptions are (1) there are K universal hidden aspects. In each evaluation session, only one aspect is chosen as the prominent aspect according to user preference. (2) Evaluations over prominent and non-prominent aspects are non-compensatory. Evaluation is mainly based on item performance on the prominent aspect. For non-prominent aspects the user sets a minimal acceptable threshold. We give a conceptual model for these general assumptions. We show how this conceptual model can be realized in both pointwise rating prediction models and pair-wise ranking prediction models. Experiments on real-world data sets validate that adopting non-compensatory rules improves recommendation performance for both rating and ranking models.


2015 ◽  
Vol 17 (5) ◽  
pp. 719-732
Author(s):  
Dulakshi Santhusitha Kumari Karunasingha ◽  
Shie-Yui Liong

A simple clustering method is proposed for extracting representative subsets from lengthy data sets. The main purpose of the extracted subset of data is to use it to build prediction models (of the form of approximating functional relationships) instead of using the entire large data set. Such smaller subsets of data are often required in exploratory analysis stages of studies that involve resource consuming investigations. A few recent studies have used a subtractive clustering method (SCM) for such data extraction, in the absence of clustering methods for function approximation. SCM, however, requires several parameters to be specified. This study proposes a clustering method, which requires only a single parameter to be specified, yet it is shown to be as effective as the SCM. A method to find suitable values for the parameter is also proposed. Due to having only a single parameter, using the proposed clustering method is shown to be orders of magnitudes more efficient than using SCM. The effectiveness of the proposed method is demonstrated on phase space prediction of three univariate time series and prediction of two multivariate data sets. Some drawbacks of SCM when applied for data extraction are identified, and the proposed method is shown to be a solution for them.


2019 ◽  
Vol 15 (2) ◽  
pp. 201-214 ◽  
Author(s):  
Mahmoud Elish

Purpose Effective and efficient software security inspection is crucial as the existence of vulnerabilities represents severe risks to software users. The purpose of this paper is to empirically evaluate the potential application of Stochastic Gradient Boosting Trees (SGBT) as a novel model for enhanced prediction of vulnerable Web components compared to common, popular and recent machine learning models. Design/methodology/approach An empirical study was conducted where the SGBT and 16 other prediction models have been trained, optimized and cross validated using vulnerability data sets from multiple versions of two open-source Web applications written in PHP. The prediction performance of these models have been evaluated and compared based on accuracy, precision, recall and F-measure. Findings The results indicate that the SGBT models offer improved prediction over the other 16 models and thus are more effective and reliable in predicting vulnerable Web components. Originality/value This paper proposed a novel application of SGBT for enhanced prediction of vulnerable Web components and showed its effectiveness.


2015 ◽  
Vol 15 (6) ◽  
pp. 1407-1423 ◽  
Author(s):  
R. D. Field ◽  
A. C. Spessa ◽  
N. A. Aziz ◽  
A. Camia ◽  
A. Cantin ◽  
...  

Abstract. The Canadian Forest Fire Weather Index (FWI) System is the mostly widely used fire danger rating system in the world. We have developed a global database of daily FWI System calculations, beginning in 1980, called the Global Fire WEather Database (GFWED) gridded to a spatial resolution of 0.5° latitude by 2/3° longitude. Input weather data were obtained from the NASA Modern Era Retrospective-Analysis for Research and Applications (MERRA), and two different estimates of daily precipitation from rain gauges over land. FWI System Drought Code calculations from the gridded data sets were compared to calculations from individual weather station data for a representative set of 48 stations in North, Central and South America, Europe, Russia, Southeast Asia and Australia. Agreement between gridded calculations and the station-based calculations tended to be most different at low latitudes for strictly MERRA-based calculations. Strong biases could be seen in either direction: MERRA DC over the Mato Grosso in Brazil reached unrealistically high values exceeding DC = 1500 during the dry season but was too low over Southeast Asia during the dry season. These biases are consistent with those previously identified in MERRA's precipitation, and they reinforce the need to consider alternative sources of precipitation data. GFWED can be used for analyzing historical relationships between fire weather and fire activity at continental and global scales, in identifying large-scale atmosphere–ocean controls on fire weather, and calibration of FWI-based fire prediction models.


Author(s):  
Afrand Agah ◽  
Mehran Asadi

This article introduces a new method to discover the role of influential people in online social networks and presents an algorithm that recognizes influential users to reach a target in the network, in order to provide a strategic advantage for organizations to direct the scope of their digital marketing strategies. Social links among friends play an important role in dictating their behavior in online social networks, these social links determine the flow of information in form of wall posts via shares, likes, re-tweets, mentions, etc., which determines the influence of a node. This article initially identities the correlated nodes in large data sets using customized divide-and-conquer algorithm and then measures the influence of each of these nodes using a linear function. Furthermore, the empirical results show that users who have the highest influence are those whose total number of friends are closer to the total number of friends of each node divided by the total number of nodes in the network.


2020 ◽  
Author(s):  
William A. Freyman ◽  
Kimberly F. McManus ◽  
Suyash S. Shringarpure ◽  
Ethan M. Jewett ◽  
Katarzyna Bryc ◽  
...  

AbstractEstimating the genomic location and length of identical-by-descent (IBD) segments among individuals is a crucial step in many genetic analyses. However, the exponential growth in the size of biobank and direct-to-consumer (DTC) genetic data sets makes accurate IBD inference a significant computational challenge. Here we present the templated positional Burrows-Wheeler transform (TPBWT) to make fast IBD estimates robust to genotype and phasing errors. Using haplotype data simulated over pedigrees with realistic genotyping and phasing errors we show that the TPBWT outperforms other state-of-the-art IBD inference algorithms in terms of speed and accuracy. For each phase-aware method, we explore the false positive and false negative rates of inferring IBD by segment length and characterize the types of error commonly found. Our results highlight the fragility of most phased IBD inference methods; the accuracy of IBD estimates can be highly sensitive to the quality of haplotype phasing. Additionally we compare the performance of the TPBWT against a widely used phase-free IBD inference approach that is robust to phasing errors. We introduce both in-sample and out-of-sample TPBWT-based IBD inference algorithms and demonstrate their computational efficiency on massive-scale datasets with millions of samples. Furthermore we describe the binary file format for TPBWT-compressed haplotypes that results in fast and efficient out-of-sample IBD computes against very large cohort panels. Finally, we demonstrate the utility of the TPBWT in a brief empirical analysis exploring geographic patterns of haplotype sharing within Mexico. Hierarchical clustering of IBD shared across regions within Mexico reveals geographically structured haplotype sharing and a strong signal of isolation by distance. Our software implementation of the TPBWT is freely available for non-commercial use in the code repository https://github.com/23andMe/phasedibd.


2021 ◽  
Author(s):  
Yifan Feng ◽  
Ye Wang ◽  
Yangqin Xie ◽  
Shuwei Wu ◽  
Yuyang Li ◽  
...  

Abstract BackgroundThe purpose of this study is to explore the factors that affect the prognosis of overall survival (OS) and cancer special survival (CSS) in cervical cancer with stage IIIC1 and establish nomogram models to predict this prognosis.MethodsData from The Surveil-lance, Epidemiology, and End Results (SEER) Program meeting the inclusion criterions were classified into training group, and data of validation were obtained from the First Affiliated Hospital of Anhui Medical University from 2010 to 2019. The incidence, Kaplan‐Meier curves, OS and CSS of stage IIIC1 were evaluated according to the training group. Nomograms were established according to the results of univariate and multivariate Cox regression models. Harrell’s C-index and receiver operating characteristic curve (ROC) were calculated to measure the accuracy of the prediction models. Calibration plots show the relationship between the predicted probability and the actual outcome. Decision-curve analysis (DCA) was applied to evaluate the clinical applicability of the constructed nomogram.ResultsThe incidence of pelvic lymph node metastasis, a high-risk factor for prognosis in cervical cancer, decreased slightly over time. There are eight independent prognostic variables for OS, including age, race, histology, differentiation, extension range, tumor size, radiation recode and surgery, but seven for CSS with age excluded. Nomograms of OS and CSS were established based on the results. The C-index for the nomograms of OS and CSS were 0.692, 0.689 respectively when random sampling of SEER data sets, and 0.706, 0.737 respectively when random sampling of external data sets. AUCs for the nomogram of OS were 0.648, 0.644 respectively, and 0.683, 0.675 for the nomogram of CSS. Calibration plots for the nomograms were almost identical to the actual observations. The DCA also proved the value of the two models.ConclusionAge, race, histology, differentiation, extension range, tumor size, radiation recode and surgery were all independent prognosis factors for OS. Only age excepts in CSS. OS and CSS nomograms were established in our study based on the result of multivariate Cox proportional hazard regression, and both own good predictive and clinical application value after validation.


Author(s):  
Afrand Agah ◽  
Mehran Asadi

This article introduces a new method to discover the role of influential people in online social networks and presents an algorithm that recognizes influential users to reach a target in the network, in order to provide a strategic advantage for organizations to direct the scope of their digital marketing strategies. Social links among friends play an important role in dictating their behavior in online social networks, these social links determine the flow of information in form of wall posts via shares, likes, re-tweets, mentions, etc., which determines the influence of a node. This article initially identities the correlated nodes in large data sets using customized divide-and-conquer algorithm and then measures the influence of each of these nodes using a linear function. Furthermore, the empirical results show that users who have the highest influence are those whose total number of friends are closer to the total number of friends of each node divided by the total number of nodes in the network.


Sign in / Sign up

Export Citation Format

Share Document