An Empirical Comparison of Portuguese and Multilingual BERT Models for Auto-Classification of NCM Codes in International Trade

2022 ◽  
Vol 6 (1) ◽  
pp. 8
Author(s):  
Roberta Rodrigues de Lima ◽  
Anita M. R. Fernandes ◽  
James Roberto Bombasar ◽  
Bruno Alves da Silva ◽  
Paul Crocker ◽  
...  

Classification problems are common in many domains, and supervised learning algorithms have shown great promise in addressing them. The classification of goods in international trade in Brazil represents a real challenge due to the complexity involved in assigning the correct category codes to a good, especially considering the tax penalties and legal implications of a misclassification. This work focuses on the training process of a classifier based on bidirectional encoder representations from transformers (BERT) for the tax classification of goods with NCM codes, which are the official classification system for import and export products in Brazil. In particular, this article presents results from using a specific Portuguese-language-pretrained BERT model, as well as results from using a multilingual-pretrained BERT model. Experimental results show that the Portuguese model performed slightly better than the multilingual model, achieving an MCC of 0.8491, and confirm that the classifiers could be used to improve specialists’ performance in the classification of goods.
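The abstract reports the Matthews correlation coefficient (MCC) without saying how it was computed. The following is a minimal sketch of the standard multiclass generalization of MCC, written from the textbook definition; it is not the authors' evaluation code. The function name `multiclass_mcc` and the four-digit labels in the usage lines are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import sqrt


def multiclass_mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient (illustrative sketch)."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    s = len(y_true)                                  # total number of samples
    c = sum(t == p for t, p in zip(y_true, y_pred))  # correctly classified samples
    t_k = Counter(y_true)  # how often each class truly occurs
    p_k = Counter(y_pred)  # how often each class was predicted
    classes = set(t_k) | set(p_k)
    cov_tp = c * s - sum(t_k[k] * p_k[k] for k in classes)
    cov_tt = s * s - sum(t_k[k] ** 2 for k in classes)
    cov_pp = s * s - sum(p_k[k] ** 2 for k in classes)
    denom = sqrt(cov_tt * cov_pp)
    # By convention, return 0.0 when the score is undefined
    # (for example, when every prediction is the same class).
    return cov_tp / denom if denom else 0.0


# Hypothetical labels standing in for product codes.
y_true = ["8471", "8471", "0901", "0901"]
y_pred = ["8471", "0901", "0901", "0901"]
score = multiclass_mcc(y_true, y_pred)
```

For two classes this reduces to the familiar binary MCC, which makes it a convenient single score for imbalanced label sets such as product-code catalogues.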

Author(s):  
Roberta Rodrigues de Lima ◽  
Anita M. R. Fernandes ◽  
James Roberto Bombasar ◽  
Bruno Alves da Silva ◽  
Paul Crocker ◽  
...  

The classification of goods involved in international trade in Brazil is based on the Mercosur Common Nomenclature (NCM). The classification of these goods represents a real challenge due to the complexity involved in assigning the correct category codes, especially considering the legal and fiscal implications of misclassification. This work focuses on the training of a classifier based on Bidirectional Encoder Representations from Transformers (BERT) for the tax classification of goods with NCM codes. In particular, this article presents results from using a specific Portuguese-language-tuned BERT model as well as results from using a multilingual BERT. Experimental results justify the use of these models in the classification process and also show that the language-specific model performs slightly better.


2021 ◽  
Vol 13 (9) ◽  
pp. 1623
Author(s):  
João E. Batista ◽  
Ana I. R. Cabral ◽  
Maria J. P. Vasconcelos ◽  
Leonardo Vanneschi ◽  
Sara Silva

Genetic programming (GP) is a powerful machine learning (ML) algorithm that can produce readable white-box models. Although successfully used for solving an array of problems in different scientific areas, GP is still not well known in the field of remote sensing. The M3GP algorithm, a variant of the standard GP algorithm, performs feature construction by evolving hyperfeatures from the original ones. In this work, we use the M3GP algorithm on several sets of satellite images over different countries to create hyperfeatures from satellite bands to improve the classification of land cover types. We add the evolved hyperfeatures to the reference datasets and observe a significant improvement of the performance of three state-of-the-art ML algorithms (decision trees, random forests, and XGBoost) on multiclass classifications and no significant effect on the binary classifications. We show that adding the M3GP hyperfeatures to the reference datasets brings better results than adding the well-known spectral indices NDVI, NDWI, and NBR. We also compare the performance of the M3GP hyperfeatures in the binary classification problems with those created by other feature construction methods such as FFX and EFS.
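The spectral indices named above are all normalized differences of two bands. The sketch below shows the commonly used definitions of these baseline features for a single pixel; it does not reproduce the M3GP hyperfeatures. The function names and the band pairing are assumptions, since the abstract does not state them. In particular, NDWI has more than one definition in common use, and the green/NIR form is assumed here.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or 0.0 when both band values are zero."""
    total = a + b
    return (a - b) / total if total else 0.0


def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return normalized_difference(nir, red)


def ndwi(green, nir):
    """Normalized Difference Water Index, green/NIR form (assumed variant)."""
    return normalized_difference(green, nir)


def nbr(nir, swir):
    """Normalized Burn Ratio: (NIR - SWIR) / (NIR + SWIR)."""
    return normalized_difference(nir, swir)


# Hypothetical surface-reflectance values for one pixel.
pixel = {"green": 0.08, "red": 0.10, "nir": 0.50, "swir": 0.20}
features = [
    ndvi(pixel["nir"], pixel["red"]),
    ndwi(pixel["green"], pixel["nir"]),
    nbr(pixel["nir"], pixel["swir"]),
]
```

Appending such per-pixel values to the original band values is one way to read "adding indices to the reference datasets"; the evolved hyperfeatures would be appended in the same manner.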


Author(s):  
Leandro Skowronski ◽  
Paula Martin de Moraes ◽  
Mario Luiz Teixeira de Moraes ◽  
Wesley Nunes Gonçalves ◽  
Michel Constantino ◽  
...  

Algorithms ◽  
2021 ◽  
Vol 14 (5) ◽  
pp. 134
Author(s):  
Loai Abdallah ◽  
Murad Badarna ◽  
Waleed Khalifa ◽  
Malik Yousef

In the computational biology community, many biological problems are treated as multi-one-class classification problems. Examples include the classification of multiple tumor types, protein fold recognition, and the molecular classification of multiple cancer types. In all of these cases, appropriately characterized real-world negative cases or outliers are impractical to obtain, and the positive cases might consist of different clusters, which in turn might lead to accuracy degradation. In this paper we present a novel algorithm named MultiKOC, a multi-one-class classifier based on K-means, to deal with this problem. The main idea is to execute a clustering algorithm over the positive samples to capture the hidden subdata of the given positive data, and then to build a one-class classifier for each cluster's examples separately; in other words, to train an OC classifier on each piece of subdata. For a given new sample, the generated classifiers are applied. If it is rejected by all of those classifiers, the given sample is considered a negative sample; otherwise it is a positive sample. The results of MultiKOC are compared with the traditional one-class, multi-one-class, ensemble one-class, and two-class methods, yielding a significant improvement over the one-class methods and performance similar to that of the two-class methods.
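The cluster-then-classify idea described above can be sketched as follows. This is a toy illustration, not the authors' implementation: the abstract does not say which one-class classifier is used, so a simple centroid-plus-radius rule stands in for it, and the names `kmeans`, `RadiusOneClass`, `train_multikoc`, and `predict`, along with the `slack` parameter, are hypothetical.

```python
import random
from math import dist


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns the non-empty clusters as lists of points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        new_centers = [
            tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # assignments are stable
            break
        centers = new_centers
    return [c for c in clusters if c]


class RadiusOneClass:
    """Stand-in one-class model: accept points near the cluster centroid."""

    def __init__(self, members, slack=1.2):
        self.center = tuple(sum(xs) / len(members) for xs in zip(*members))
        self.radius = slack * max(dist(m, self.center) for m in members)

    def accepts(self, x):
        return dist(x, self.center) <= self.radius


def train_multikoc(positives, k):
    """Cluster the positive samples, then fit one one-class model per cluster."""
    return [RadiusOneClass(cluster) for cluster in kmeans(positives, k)]


def predict(models, x):
    """Positive if any model accepts x; negative only if all of them reject it."""
    return any(m.accepts(x) for m in models)


# Hypothetical positives forming two well-separated groups.
positives = [(0, 0), (0, 1), (1, 0), (1, 1),
             (10, 10), (10, 11), (11, 10), (11, 11)]
models = train_multikoc(positives, k=2)
inside = predict(models, (0.5, 0.5))
outside = predict(models, (5, 5))
```

A single one-class model fitted to all eight points would have to cover the empty region between the two groups, which is the accuracy degradation the per-cluster design is meant to avoid.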


2021 ◽  
Vol 503 (2) ◽  
pp. 1828-1846
Author(s):  
Burger Becker ◽  
Mattia Vaccari ◽  
Matthew Prescott ◽  
Trienko Grobler

The morphological classification of radio sources is important to gain a full understanding of galaxy evolution processes and their relation with local environmental properties. Furthermore, the complex nature of the problem, its appeal for citizen scientists, and the large data rates generated by existing and upcoming radio telescopes combine to make the morphological classification of radio sources an ideal test case for the application of machine learning techniques. One approach that has shown great promise recently is convolutional neural networks (CNNs). The literature, however, lacks two major things when it comes to CNNs and radio galaxy morphological classification. First, a proper analysis is needed of whether overfitting occurs when training CNNs to perform radio galaxy morphological classification using a small curated training set. Second, a good comparative study regarding the practical applicability of the CNN architectures in the literature is required. Both of these shortcomings are addressed in this paper. Multiple performance metrics are used for the latter comparative study, such as inference time, model complexity, computational complexity, and mean per-class accuracy. As part of this study, we also investigate the effect that receptive field, stride length, and coverage have on recognition performance. For the sake of completeness, we also investigate the recognition performance gains that we can obtain by employing classification ensembles. A ranking system based upon recognition and computational performance is proposed. MCRGNet, Radio Galaxy Zoo, and ConvXpress (novel classifier) are the architectures that best balance computational requirements with recognition performance.


1997 ◽  
Vol 08 (01) ◽  
pp. 15-41 ◽  
Author(s):  
Carl H. Smith ◽  
Rolf Wiehagen ◽  
Thomas Zeugmann

The present paper studies a particular collection of classification problems, i.e., the classification of recursive predicates and languages, for arriving at a deeper understanding of what classification really is. In particular, the classification of predicates and languages is compared with the classification of arbitrary recursive functions and with their learnability. The investigation undertaken is refined by introducing classification within a resource bound resulting in a new hierarchy. Furthermore, a formalization of multi-classification is presented and completely characterized in terms of standard classification. Additionally, consistent classification is introduced and compared with both resource bounded classification and standard classification. Finally, the classification of families of languages that have attracted attention in learning theory is studied, too.


2019 ◽  
pp. 28-53
Author(s):  
Igor Martins Oliveira ◽  
Luiz Andrei Gonçalves Pereira

In the era of globalization, the world economy has undergone a process of productive restructuring, intensifying flows in the territories inherent to the spatial interactions of resources, goods, and services that circulate between national and international markets. The objective of this work is to analyze the sociospatial dynamics of the international trade flows of the state of Minas Gerais through the logistics of fruit import and export networks, from 2000 to 2017. As a result, it was identified that, in the external fruit market, Minas Gerais trades with 88 countries, 52 in the export networks and 36 in the import network. In the operationalization of flows in global trade, transport logistics was carried out through the road, sea, and air modes, constituting a geographic element, since commercial transactions demand the management of the fluidity, planning, and organization of the different territories. KEYWORDS: Logistics, International Trade, Fruit Growing.


2017 ◽  
Vol 117 (6) ◽  
pp. 1109-1126 ◽  
Author(s):  
Shubhadeep Mukherjee ◽  
Pradip Kumar Bala

Purpose: The purpose of this paper is to study sarcasm in online text, specifically on Twitter, to better understand customer opinions about social issues, products, services, etc. This can be immensely helpful in reducing incorrect classification of consumer sentiment toward issues, products, and services. Design/methodology/approach: In this study, 5,000 tweets were downloaded and analyzed. Relevant features were extracted, and supervised learning algorithms were applied to identify the features that best differentiate a sarcastic from a non-sarcastic sentence. Findings: The results using two different classification algorithms, namely Naïve Bayes and maximum entropy, show that function words and content words together are most effective in identifying sarcasm in tweets. The most differentiating features between a sarcastic and a non-sarcastic tweet were identified. Practical implications: Understanding the use of sarcasm in tweets lets companies do better sentiment analysis and product recommendations for users. This could help businesses attract new customers and retain existing ones, resulting in better customer management. Originality/value: This paper uses novel features to identify sarcasm in online text, which is one of the most challenging problems in natural language processing. To the authors’ knowledge, this is the first study on sarcasm detection from a customer management perspective.


Author(s):  
I. Kotlyarov

The paper analyzes the existing types of outsourcing. It demonstrates that outsourcing can be analyzed from managerial and economic points of view. A classification of types of outsourcing based on their economic nature is proposed. Distinctive features of outsourcing are highlighted. Models of interaction between companies in outsourcing arrangements are described.


Author(s):  
Rehan Ullah ◽  
Abdullah Khan ◽  
Syed Bakhtawar Shah Abid ◽  
Siyab Khan ◽  
Said Khalid Shah ◽  
...  

DNA sequence classification is one of the main research activities in bioinformatics, and many researchers have worked on it and continue to do so. In bioinformatics, machine learning can be applied to the analysis of genomic sequences, for example the classification and comparison of DNA sequences. This article proposes a new hybrid meta-heuristic model called Crow-ENN for leukemia DNA sequence classification. The proposed algorithm combines the Crow Search Algorithm (CSA) and the Elman Neural Network (ENN). Leukemia DNA sequences are used to train and test the proposed hybrid model. Five other comparable models, i.e., Crow-ANN, Crow-BPNN, ANN, BPNN, and ENN, are also trained and tested on these DNA sequences. The performance of the models is evaluated in terms of accuracy and MSE. The overall simulation results show that the proposed model outperformed all five comparable models, attaining the highest accuracy of over 99%. Given these promising results, the model may also be applicable to classification problems in other fields.

