MAPPING LAND USES/COVERS WITH THE SEMI-AUTOMATIC CLASSIFICATION PLUGIN: WHICH DATA SET, CLASSIFIER AND SAMPLING DESIGN?

Nativa ◽  
2019 ◽  
Vol 7 (1) ◽  
pp. 70 ◽  
Author(s):  
Luís Flávio Pereira ◽  
Ricardo Morato Fiúza Guimarães

This paper proposes guidelines for better mapping land uses with the Semi-automatic Classification Plugin (SCP) for QGIS, highlighting the best data sets, classifiers and training sampling designs. Four data sets derived from a Sentinel 2A image were combined with three classifiers available in the SCP and two sampling designs, with training samples (ROIs) kept separate or dissolved into a single sample, yielding 24 treatments. The treatments were evaluated for accuracy (Kappa coefficient), visual quality of the final map and processing time. The results show that: (1) the SCP is suitable for mapping land uses; (2) the larger the data set, the better the classifier performance; and (3) dissolving the ROIs always decreases processing time but has an ambiguous effect on the different classifiers. For the best results, we recommend applying the Maximum Likelihood classifier to the largest data set available, using training samples collected so as to cover all intraclass variations and subsequently dissolved into a single ROI.
Keywords: remote sensing, training samples, QGIS, Sentinel 2A.
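
The accuracy comparison above relies on the Kappa coefficient computed from a classification error (confusion) matrix. As a point of reference, the sketch below computes Cohen's Kappa, the form commonly used in remote sensing accuracy assessment, from such a matrix with NumPy; the example matrix is hypothetical and not taken from the paper.

```python
import numpy as np

def cohens_kappa(confusion):
    """Cohen's Kappa from a square confusion matrix (rows: reference, cols: predicted)."""
    confusion = np.asarray(confusion, dtype=float)
    total = confusion.sum()
    observed = np.trace(confusion) / total  # overall agreement
    expected = (confusion.sum(axis=0) * confusion.sum(axis=1)).sum() / total**2  # chance agreement
    return (observed - expected) / (1.0 - expected)

# Hypothetical 3-class error matrix (e.g., forest, pasture, urban).
cm = [[50, 3, 2],
      [4, 45, 6],
      [1, 5, 44]]
print(round(cohens_kappa(cm), 3))
```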

Author(s):  
Brendan Juba ◽  
Hai S. Le

Practitioners of data mining and machine learning have long observed that the imbalance of classes in a data set negatively impacts the quality of classifiers trained on that data. Numerous techniques for coping with such imbalances have been proposed, but nearly all lack any theoretical grounding. By contrast, the standard theoretical analysis of machine learning admits no dependence on the imbalance of classes at all. The basic theorems of statistical learning establish the number of examples needed to estimate the accuracy of a classifier as a function of its complexity (VC-dimension) and the confidence desired; the class imbalance does not enter these formulas anywhere. In this work, we consider classifier performance in terms of precision and recall, measures that are widely suggested as more appropriate to the classification of imbalanced data. We observe that whenever the precision is moderately large, the worse of the precision and recall is within a small constant factor of the accuracy weighted by the class imbalance. A corollary of this observation is that a larger number of examples is necessary and sufficient to address class imbalance, a finding we also illustrate empirically.
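
To make the quantities in this observation concrete, the hedged sketch below computes precision, recall, accuracy, and the class imbalance on a synthetic imbalanced sample; the data and error rates are invented for illustration and are not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Synthetic imbalanced labels (~5% positives) and an imperfect classifier's predictions.
y_true = (rng.random(n) < 0.05).astype(int)
y_pred = np.where(y_true == 1,
                  (rng.random(n) < 0.70).astype(int),   # ~70% of positives detected
                  (rng.random(n) < 0.02).astype(int))   # ~2% false-positive rate

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = np.mean(y_pred == y_true)
imbalance = y_true.mean()  # fraction of the minority (positive) class

# The paper's observation concerns how min(precision, recall) tracks the accuracy
# once the error rate is viewed relative to the class imbalance.
print(f"min(precision, recall) = {min(precision, recall):.3f}")
print(f"error rate / imbalance = {(1 - accuracy) / imbalance:.3f}")
```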


2020 ◽  
Vol 17 (6) ◽  
pp. 916-925
Author(s):  
Niyati Behera ◽  
Guruvayur Mahalakshmi

Attributes, whether qualitative or non-qualitative, are the formal description of any real-world entity and are crucial in modern knowledge representation models such as ontologies. Although the literature offers ample evidence of research on mining non-qualitative attributes (such as the part-of relation) from text and from the Web, comparatively little work addresses mining qualitative attributes (e.g., size, color, taste). In this article, an analytical framework is proposed to retrieve qualitative attribute values from unstructured domain text. The research objective covers two aspects of information retrieval: (1) acquiring quality values (adjectives) from unstructured text and (2) assigning attributes to them by comparing the Google-derived meaning, or context, of the attributes and of the quality values. The goal is accomplished with a framework that integrates Vector Space Modelling (VSM) with a probabilistic Multinomial Naive Bayes (MNB) classifier. Performance evaluation was carried out on two data sets: (1) the HeiPLAS development data set (106 adjective-noun exemplary phrases) and (2) a text data set in the Medicinal Plant Domain (MPD). The system is found to perform better with the probabilistic approach than with the existing pattern-based framework in the state of the art.
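
As a rough illustration of coupling a vector-space representation with a Multinomial Naive Bayes classifier (the two components named above), the sketch below uses scikit-learn on toy adjective-noun phrases; the phrases and attribute labels are invented for illustration and are not the HeiPLAS or MPD data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy adjective-noun phrases labelled with the qualitative attribute they express.
phrases = ["red flower", "bitter leaf", "tall tree", "yellow fruit", "sweet root", "short shrub"]
attributes = ["color", "taste", "size", "color", "taste", "size"]

# Vector space model (TF-IDF weighting) feeding a Multinomial Naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(phrases, attributes)

# Test phrases reuse words seen only in the 'taste' and 'size' training examples.
print(model.predict(["sweet leaf", "short tree"]))
```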


2016 ◽  
Author(s):  
Julian Zubek ◽  
Dariusz M Plewczynski

We describe a method for assessing data set complexity based on the estimation of the underlying probability distribution and the Hellinger distance. Contrary to some popular measures, it focuses not on the shape of the decision boundary in a classification task but on the amount of available data with respect to the attribute structure. Complexity is expressed as a graphical plot, which we call the complexity curve. We use it to propose a new variant of the learning curve plot called the generalisation curve. The generalisation curve is a standard learning curve with the x-axis rescaled according to the data set complexity curve. It is a classifier performance measure which shows how well the information present in the data is utilised. We perform a theoretical and experimental examination of the properties of the introduced complexity measure and show its relation to the variance component of classification error. We compare it with popular data complexity measures on 81 diverse data sets and show that it can contribute to explaining the performance of specific classifiers on these sets. We then apply our methodology to a panel of benchmarks of standard machine learning algorithms on typical data sets, demonstrating how it can be used in practice to gain insights into data characteristics and classifier behaviour. Moreover, we show that the complexity curve is an effective tool for reducing the size of the training set (data pruning), allowing the learning process to be sped up significantly without reducing classification accuracy. Associated code is available to download at: https://github.com/zubekj/complexity_curve (open source Python implementation).
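
The complexity curve rests on comparing estimated probability distributions via the Hellinger distance; a minimal sketch of that distance for discrete distributions is shown below (the authors' open-source implementation at the linked repository is the reference version, and the example distributions are illustrative).

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# Empirical attribute-value distributions of a data subset vs. the full data set (illustrative).
full = np.array([0.5, 0.3, 0.2])
subset = np.array([0.6, 0.25, 0.15])
print(round(hellinger(full, subset), 4))
```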


2021 ◽  
Author(s):  
Petya Kindalova ◽  
Ioannis Kosmidis ◽  
Thomas E. Nichols

Objectives: White matter lesions are a very common finding on MRI in older adults, and their presence increases the risk of stroke and dementia. Accurate and computationally efficient modelling methods are necessary to map the association of lesion incidence with risk factors such as hypertension. However, there is no consensus in the brain mapping literature on whether a voxel-wise modelling approach is better for binary lesion data than a more computationally intensive spatial modelling approach that accounts for voxel dependence. Methods: We review three regression approaches for modelling binary lesion masks: mass-univariate probit regression with either maximum likelihood estimates or mean bias-reduced estimates, and spatial Bayesian modelling, where the regression coefficients have a conditional autoregressive prior to account for local spatial dependence. We design a novel simulation framework of artificial lesion maps to compare the three lesion mapping methods. The age effect on lesion probability estimated from a reference data set (13,680 individuals from the UK Biobank) is used to simulate a realistic voxel-wise distribution of lesions across age. To mimic the real features of lesion masks, we suggest matching brain lesion summaries (total lesion volume, average lesion size and lesion count) between the reference data set and the simulated data sets. Thus, we allow for a fair comparison between the modelling approaches under a realistic simulation setting. Results: Our findings suggest that bias-reduced estimates for voxel-wise binary-response generalized linear models (GLMs) overcome the drawbacks of infinite and biased maximum likelihood estimates and scale well for large data sets because voxel-wise estimation can be performed in parallel across voxels. Contrary to the assumption that spatial dependence is key in lesion mapping, our results show that voxel-wise bias reduction and spatial modelling produce largely similar estimates. Conclusion: Bias-reduced estimates for voxel-wise GLMs are not only accurate but also computationally efficient, which will become increasingly important as more biobank-scale neuroimaging data sets become available.
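
As a hedged sketch of the mass-univariate baseline (plain maximum likelihood probit fits, not the bias-reduced variant, which would need a dedicated implementation), the code below fits an independent probit regression of lesion status on age at each voxel with statsmodels; the data are simulated and the setup is illustrative only.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(42)
n_subjects, n_voxels = 500, 10

# Simulated covariate (age) and binary lesion indicators with weak, voxel-specific age effects.
age = rng.uniform(45, 80, n_subjects)
X = sm.add_constant((age - age.mean()) / age.std())
true_beta = rng.normal(0.3, 0.1, n_voxels)
lesion_prob = norm.cdf(-2.0 + np.outer(X[:, 1], true_beta))
lesions = (rng.random((n_subjects, n_voxels)) < lesion_prob).astype(int)

# Mass-univariate probit fits: each voxel is modelled independently (parallelisable in practice).
age_effects = [sm.Probit(lesions[:, v], X).fit(disp=0).params[1] for v in range(n_voxels)]
print(np.round(age_effects, 3))
```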


Symmetry ◽  
2021 ◽  
Vol 13 (7) ◽  
pp. 1297
Author(s):  
Guillermo Martínez-Flórez ◽  
Heleno Bolfarine ◽  
Yolanda M. Gómez

In this paper, the skew-elliptical sinh-alpha-power distribution is developed as a natural follow-up to the skew-elliptical log-linear Birnbaum–Saunders alpha-power distribution previously studied in the literature. Special cases include the ordinary log-linear Birnbaum–Saunders and skewed log-linear Birnbaum–Saunders distributions. As shown, it is able to surpass the ordinary sinh-normal models when fitting data sets whose degree of asymmetry exceeds what the sinh-normal can accommodate. Maximum likelihood estimation is developed, with the inverse of the observed information matrix used for standard error estimation. Large sample properties of the maximum likelihood estimators, such as consistency and asymptotic normality, are established. An application is reported for a data set previously analyzed in the literature, where the performance of the new distribution is compared with other proposed alternative models.
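
The estimation recipe described here, maximize the log-likelihood numerically and take standard errors from the inverse of the observed information matrix, can be sketched generically. The example below applies it to scipy's skew-normal model rather than the proposed sinh-alpha-power family, purely to illustrate the mechanics; the BFGS inverse Hessian is only an approximation to the inverse observed information.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import skewnorm

rng = np.random.default_rng(1)
data = skewnorm.rvs(a=4, loc=0, scale=2, size=300, random_state=rng)  # skewed sample

def neg_loglik(theta):
    a, loc, log_scale = theta  # log-parameterised scale keeps it positive
    return -np.sum(skewnorm.logpdf(data, a, loc=loc, scale=np.exp(log_scale)))

# Maximize the likelihood numerically; BFGS also returns an inverse-Hessian estimate
# whose diagonal approximates the variances obtained from the observed information matrix.
res = minimize(neg_loglik, x0=np.array([1.0, 0.0, 0.0]), method="BFGS")
std_err = np.sqrt(np.diag(res.hess_inv))
print("estimates:", np.round(res.x, 3))
print("approx. standard errors:", np.round(std_err, 3))
```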


Author(s):  
Samuel U. Enogwe ◽  
Chisimkwuo John ◽  
Happiness O. Obiora-Ilouno ◽  
Chrisogonus K. Onyekwere

In this paper, we propose a new lifetime distribution called the generalized weighted Rama (GWR) distribution, which extends the two-parameter Rama distribution and has the Rama distribution as a special case. The GWR distribution can model positively skewed data sets with an upside-down bathtub-shaped hazard rate. Expressions for the mathematical and reliability properties of the GWR distribution are derived. Parameters were estimated by the method of maximum likelihood, and a simulation was performed to verify the stability of the maximum likelihood estimates of the model parameters. The asymptotic confidence intervals of the parameters of the proposed distribution are obtained. The applicability of the GWR distribution is illustrated with a real data set, and the results show that the GWR distribution is a better candidate for the data than the other competing distributions investigated.
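
The final comparison of candidate lifetime models on a real data set is typically carried out with the maximized log-likelihood and information criteria. The GWR distribution itself is not available in standard libraries, so the hedged sketch below only illustrates that comparison step with scipy's built-in lifetime distributions on simulated data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
data = stats.gamma.rvs(a=2.0, scale=1.5, size=200, random_state=rng)  # stand-in lifetime data

# Fit several candidate lifetime distributions by maximum likelihood and rank them by AIC.
candidates = {"gamma": stats.gamma, "weibull_min": stats.weibull_min, "lognorm": stats.lognorm}
for name, dist in candidates.items():
    params = dist.fit(data, floc=0)      # fix the location at zero for lifetime data
    loglik = np.sum(dist.logpdf(data, *params))
    k = len(params) - 1                  # free parameters (location is fixed)
    print(f"{name:12s} AIC = {2 * k - 2 * loglik:.1f}")
```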


Author(s):  
Muhammad H. Tahir ◽  
Muhammad Adnan Hussain ◽  
Gauss Cordeiro ◽  
Mahmoud El-Morshedy ◽  
Mohammed S. Eliwa

For the bounded unit interval, we propose a new Kumaraswamy generalized (G) family of distributions from a new generator, which could be an alternative to the Kumaraswamy-G family proposed earlier by Cordeiro and de Castro in 2011. This new generator can also be used to develop alternative G-classes such as beta-G, McDonald-G, Topp-Leone-G, Marshall-Olkin-G and Transmuted-G for the bounded unit interval. Some mathematical properties of this new family are obtained, and the maximum likelihood method is used for estimating the family parameters. We investigate the properties of one special model called the new Kumaraswamy-Weibull (NKwW) distribution. Parameter estimation is addressed, and the maximum likelihood estimators are assessed through a simulation study. Two real-life data sets are analyzed to illustrate the importance and flexibility of this distribution. In fact, this model outperforms several generalized Weibull models, such as the Kumaraswamy-Weibull, McDonald-Weibull, beta-Weibull, exponentiated-generalized Weibull, gamma-Weibull, odd log-logistic-Weibull, Marshall-Olkin-Weibull, transmuted-Weibull, exponentiated-Weibull and Weibull distributions, when applied to these data sets. The bivariate extension of the family is proposed and the estimation of its parameters is given. The usefulness of the bivariate NKwW model is illustrated empirically by means of a real-life data set.
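
The new family's exact generator is not reproduced here, but the earlier Kumaraswamy-G construction it aims to complement takes a baseline CDF G(x) to F(x) = 1 - (1 - G(x)^a)^b. The hedged sketch below implements that classic Kw-G transform with a Weibull baseline (i.e., the standard Kumaraswamy-Weibull, not the proposed NKwW model), with illustrative parameter values.

```python
import numpy as np
from scipy.stats import weibull_min

def kw_g_cdf(x, a, b, baseline_cdf):
    """Kumaraswamy-G CDF: F(x) = 1 - (1 - G(x)**a)**b for a baseline CDF G."""
    g = baseline_cdf(x)
    return 1.0 - (1.0 - g**a) ** b

# Weibull baseline with shape 1.5 and scale 2.0 (illustrative parameters).
baseline = lambda x: weibull_min.cdf(x, c=1.5, scale=2.0)

x = np.linspace(0.1, 6.0, 5)
print(np.round(kw_g_cdf(x, a=2.0, b=0.5, baseline_cdf=baseline), 4))
```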


2017 ◽  
Author(s):  
Pavel Sagulenko ◽  
Vadim Puller ◽  
Richard A. Neher

Mutations that accumulate in the genome of replicating biological organisms can be used to infer their evolutionary history. In the case of measurably evolving organisms genomes often reveal their detailed spatiotemporal spread. Such phylodynamic analyses are particularly useful to understand the epidemiology of rapidly evolving viral pathogens. The number of genome sequences available for different pathogens, however, has increased dramatically over the last couple of years and traditional methods for phylodynamic analysis scale poorly with growing data sets. Here, we present TreeTime, a Python based framework for phylodynamic analysis using an approximate Maximum Likelihood approach. TreeTime can estimate ancestral states, infer evolution models, reroot trees to maximize temporal signals, estimate molecular clock phylogenies and population size histories. The run time of TreeTime scales linearly with data set size.
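
As a usage sketch, the snippet below invokes the TreeTime command-line tool on a hypothetical alignment, tree and sampling-dates table. The flag names are recalled from TreeTime's documented interface and the file names are placeholders; verify both against `treetime --help` for the installed version.

```python
import subprocess

# Hypothetical inputs: a FASTA alignment, a Newick tree, and a CSV of sampling dates.
cmd = [
    "treetime",
    "--aln", "sequences.fasta",
    "--tree", "tree.nwk",
    "--dates", "dates.csv",
    "--outdir", "treetime_results",
]
# Writes the time-scaled tree and inferred model parameters to the output directory.
subprocess.run(cmd, check=True)
```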


2017 ◽  
Vol 17 (3) ◽  
pp. 239-262 ◽  
Author(s):  
Marek Vokoun

A sample of 18 papers and 32 data sets revealed 210,404 firm-level observations about European firms making decisions about innovation. A total of 66,965 observations describe the activities of innovators between 1986 and 2008. This paper uses a basic literature review to assess the properties of innovation across the relatively rare studies that estimate the full CDM (Crépon, Duguet, and Mairesse) model. The study compares results from two systems of estimation and shows that both international and regional comparisons are rather problematic because of differing definitions of the innovation variables and differing data set representativeness. On average, a typical firm that engaged in innovation was a large firm competing in international markets in the sample of firms with 20+ employees. Smaller firms, however, invested more in research and development (R&D), and no linear relationship was found for the output characteristics. Cooperation on R&D projects increased overall innovation intensity. There is strong evidence that public funding had an ambiguous effect on R&D spending and no additional effect on innovation output on average. Innovation output, measured by sales of innovated goods and services, was on average positively related to labour productivity; however, a detailed view suggests this effect was present only for product innovation. The paper shows that the results of innovation studies cannot be compared or used in research without a deeper analysis of the data sample (micro companies, industries, active firms, entrants, etc.), the dependent variable (innovator, R&D expenditures, sales, productivity, new product, new service, etc.) and the baseline company defined by the independent variables.

