Fast SVM training using data reconstruction for classification of very large datasets

2019 ◽  
Vol 15 (3) ◽  
pp. 372-381
Author(s):  
Peifeng Liang ◽  
Weite Li ◽  
Jinglu Hu
2020 ◽  
Vol 196 ◽  
pp. 105777
Author(s):  
Jadson Jose Monteiro Oliveira ◽  
Robson Leonardo Ferreira Cordeiro

2019 ◽  
Vol 7 (2) ◽  
pp. 24
Author(s):  
Aju J. Fenn ◽  
Lucas Gerdes ◽  
Samuel Rothstein

Using data from 2005 to 2016, this paper examines whether players in the National Hockey League (NHL) are paid a positive differential for their services because of competition from the Kontinental Hockey League (KHL) and the Swedish Hockey League (SHL). To control for performance, we use two large datasets (N = 4046 and N = 1717). In keeping with the existing literature, we use lagged performance statistics and dummy variables to control for the type of NHL contract. The first dataset contains lagged career performance statistics, while the second contains performance statistics generated during the years under the player’s previous contract. Fixed effects least squares (FELS) and quantile regression results suggest that player production statistics, contract status, and country of origin are significant determinants of NHL player salaries.


Author(s):  
I Misztal ◽  
I Aguilar ◽  
D Lourenco ◽  
L Ma ◽  
J Steibel ◽  
...  

Genomic selection is now practiced successfully across many species. However, many questions remain, such as long-term effects, estimation of genomic parameters, robustness of GWAS with small and large datasets, and stability of genomic predictions. This study summarizes presentations from the 2020 ASAS symposium. The focus of many studies until now has been on linkage disequilibrium (LD) between two loci. Ignoring higher level equilibrium may lead to phantom dominance and epistasis. The Bulmer effect leads to a reduction of the additive variance; however, selection for increased recombination rate can release genetic variance anew. With genomic information, estimates of genetic parameters may be biased by genomic preselection, but costs of estimation can increase drastically due to the dense form of the genomic information. To make computation of estimates feasible, genotypes could be retained only for the most important animals, and methods of estimation should use algorithms that can recognize dense blocks in sparse matrices. GWAS studies using small genomic datasets frequently find many marker-trait associations, whereas studies using much bigger datasets find only a few. Most current tools use very simple models for GWAS, possibly causing artifacts. These models are adequate for large datasets where pseudo-phenotypes such as deregressed proofs indirectly account for important effects for traits of interest. Artifacts arising in GWAS with small datasets can be minimized by using data from all animals (whether genotyped or not), realistic models, and methods that account for population structure. Recent developments permit computation of p-values from GBLUP, where models can be arbitrarily complex but restricted to genotyped animals only, and from single-step GBLUP, which also uses phenotypes from ungenotyped animals.
Stability was an important part of nongenomic evaluations, where genetic predictions were stable in the absence of new data even with low prediction accuracies. Unfortunately, genomic evaluations for such animals change because all animals with genotypes are connected. A top ranked animal can easily drop in the next evaluation, causing a crisis of confidence in genomic evaluations. While correlations between consecutive genomic evaluations are high, outliers can have differences as high as one SD. A solution to fluctuating genomic evaluations is to base selection decisions on groups of animals. While many issues in genomic selection have been solved, many new issues that require additional research continue to surface.


2021 ◽  
Vol 11 (9) ◽  
pp. 3974
Author(s):  
Laila Bashmal ◽  
Yakoub Bazi ◽  
Mohamad Mahmoud Al Rahhal ◽  
Haikel Alhichri ◽  
Naif Al Ajlan

In this paper, we present an approach for the multi-label classification of remote sensing images based on data-efficient transformers. During the training phase, we generated a second view of each training image using data augmentation. Both the image and its augmented version were then reshaped into a sequence of flattened patches and fed to the transformer encoder. The encoder extracts a compact feature representation from each image with the help of a self-attention mechanism, which can handle the global dependencies between different regions of the high-resolution aerial image. On top of the encoder, we mounted two classifiers: a token classifier and a distiller classifier. During training, we minimized a global loss consisting of two terms, each corresponding to one of the two classifiers. In the test phase, we took the average of the two classifiers' outputs as the final class labels. Experiments on two datasets acquired over the cities of Trento and Civezzano with a ground resolution of two centimeters demonstrated the effectiveness of the proposed model.
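
The patch-sequence input and the two-classifier test-time averaging described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names, the choice to average per-label sigmoid probabilities (rather than logits), and the 0.5 decision threshold are all assumptions made for illustration, and the transformer encoder itself is omitted.

```python
import math


def to_patches(image, patch):
    """Split an H x W grid (list of rows) into a row-major sequence of
    flattened patch x patch blocks, as a transformer encoder would consume."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    sequence = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            sequence.append([
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return sequence


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(token_logits, distill_logits, threshold=0.5):
    """Average the per-label probabilities of the two classifier heads
    (assumed here) and threshold them into multi-label 0/1 decisions."""
    averaged = [
        (sigmoid(t) + sigmoid(d)) / 2.0
        for t, d in zip(token_logits, distill_logits)
    ]
    return [int(p >= threshold) for p in averaged]


# Toy 4 x 4 "image" with pixel values 0..15, cut into 2 x 2 patches.
toy_image = [[4 * r + c for c in range(4)] for r in range(4)]
toy_patches = to_patches(toy_image, 2)

# Hypothetical logits from the token and distiller heads for three labels.
toy_labels = predict_labels([2.0, -3.0, 0.5], [1.0, -1.0, -2.0])
```

Because the task is multi-label, each label is scored independently with a sigmoid instead of a softmax over classes; a different threshold or averaging rule would be an equally plausible reading of the abstract.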


2021 ◽  
Vol 12 (1) ◽  
pp. 21-28
Author(s):  
Dmitry Nartymov ◽  
Evgeny Kharitonov ◽  
Elena Dubina ◽  
Sergey Garkusha ◽  
Margarita Ruban ◽  
...  

This article presents a methodology for describing the main morphological and cultural traits of the Pyricularia oryzae Cav. strains widespread in the south of Russia. The types of traits that make it possible to unambiguously determine the uniqueness and variety of the pathogen are identified and listed. The relationships and patterns established using cluster and statistical analysis make it possible to identify the conditions for pathogen development that determine its predominant forms. The research shows that P. oryzae strains isolated from rice plants with the leaf form of blast disease have an equally directional colony growth pattern with a felt structure, whereas strains isolated from neck-affected plants often produce a colony zone with a clumpy structure. The classification of cultural traits will make it possible to obtain scientifically grounded and comparable data that can be used to analyze the interaction of P. oryzae strains with rice plants across varieties and agro-technological conditions, in order to improve and rationalize agricultural activities. The study opens up the possibility of using these data in breeding by identifying the forms of the pathogen that infect particular varieties.


Author(s):  
K. G. Yashchenkov ◽  
K. S. Dymko ◽  
N. O. Ukhanov ◽  
A. V. Khnykin

This paper considers the use of data analysis methods to find and correct errors in the reports issued by meteorologists. The features of processing various types of meteorological messages are studied, and the advantages and disadvantages of existing text classification methods are considered. The classification methods are compared in order to identify the one best suited to the developed algorithm for analyzing meteorological messages, and the prospects of using each method in that algorithm are described. An algorithm for processing the source data is proposed that uses syntactic and logical analysis to clean the data of various kinds of noise and to detect format errors for each type of message. After this preliminary preparation, the classification method matches the resulting set of message characteristics against a previously trained model to determine the error in the current weather report and outputs the corresponding message to the operator in real time. The software tools used to develop and implement the algorithm are described. The processing of a meteorological message is described in full, from the moment the message is entered in a text editor until it is sent to the international weather message exchange service. The developed software implementing the proposed algorithm is demonstrated; it improves the quality of messages and, as a result, the quality of meteorological forecasts. The results of deploying the new algorithm are described by comparing the number of messages containing various types of errors before and after its implementation.


2018 ◽  
Vol 14 (1) ◽  
pp. 43-50
Author(s):  
Anna Fitzpatrick ◽  
Joseph A Stone ◽  
Simon Choppin ◽  
John Kelley

Performance analysis and identifying performance characteristics associated with success are of great importance to players and coaches in any sport. However, while large amounts of data are available within elite tennis, very few players employ an analyst or attempt to exploit the data to enhance their performance; this is partly attributable to the considerable time and complex techniques required to interpret these large datasets. Using data from the 2016 and 2017 French Open tournaments, we tested the agreement between the results of a simple new method for identifying important performance characteristics (the Percentage of matches in which the Winner Outscored the Loser, PWOL) and the results of two standard statistical methods to establish the validity of the simple method. Spearman’s rank-order correlations between the results of the three methods demonstrated excellent agreement, with all methods identifying the same three performance characteristics (points won of 0–4 rally length, baseline points won and first serve points won) as strongly associated with success. Consequently, we propose that the PWOL method is valid for identifying performance characteristics associated with success in tennis, and is therefore a suitable alternative to more complex statistical methods, as it is simpler to calculate, interpret and contextualise.
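
As its name suggests, PWOL is the percentage of matches in which the winner outscored the loser on a given characteristic. The sketch below shows one plausible way to compute it; the data layout, the function names, the treatment of ties (counted as "not outscored"), and the match figures are all hypothetical and are not taken from the paper.

```python
def pwol(matches, characteristic):
    """Percentage of matches in which the winner outscored the loser
    on one performance characteristic. Ties count as not outscored
    (an assumption; the abstract does not say how ties are handled)."""
    if not matches:
        raise ValueError("at least one match is required")
    outscored = sum(
        1
        for match in matches
        if match["winner"][characteristic] > match["loser"][characteristic]
    )
    return 100.0 * outscored / len(matches)


def rank_characteristics(matches, characteristics):
    """Order characteristics from highest to lowest PWOL."""
    return sorted(characteristics, key=lambda name: pwol(matches, name), reverse=True)


# Made-up match statistics, for illustration only.
toy_matches = [
    {"winner": {"first_serve_points_won": 40, "aces": 3},
     "loser": {"first_serve_points_won": 31, "aces": 5}},
    {"winner": {"first_serve_points_won": 35, "aces": 7},
     "loser": {"first_serve_points_won": 36, "aces": 2}},
    {"winner": {"first_serve_points_won": 44, "aces": 4},
     "loser": {"first_serve_points_won": 30, "aces": 4}},
    {"winner": {"first_serve_points_won": 38, "aces": 1},
     "loser": {"first_serve_points_won": 29, "aces": 6}},
]

serve_pwol = pwol(toy_matches, "first_serve_points_won")  # 3 of 4 matches
aces_pwol = pwol(toy_matches, "aces")                     # 1 of 4 matches
ranking = rank_characteristics(toy_matches, ["aces", "first_serve_points_won"])
```

A characteristic with a PWOL near 100 is one on which winners almost always outscore losers, while a value near 50 suggests little association with match outcome.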

