Study of text classification methods for data sets with huge features

Author(s):  
Guiying Wei ◽  
Xuedong Gao ◽  
Sen Wu
2019 ◽  
Vol 7 ◽  
pp. 201-215 ◽  
Author(s):  
Jihyeok Kim ◽  
Reinald Kim Amplayo ◽  
Kyungjae Lee ◽  
Sua Sung ◽  
Minji Seo ◽  
...  

The performance of text classification has improved tremendously using intelligently engineered neural-based models, especially those injecting categorical metadata as additional information, e.g., using user/product information for sentiment classification. This information has been used to modify parts of the model (e.g., word embeddings, attention mechanisms) such that results can be customized according to the metadata. We observe that current representation methods for categorical metadata, which are devised for human consumption, are not as effective as claimed in popular classification methods, outperformed even by simple concatenation of categorical features in the final layer of the sentence encoder. We conjecture that categorical features are harder to represent for machine use, as available context only indirectly describes the category, and even such context is often scarce (for tail category). To this end, we propose using basis vectors to effectively incorporate categorical metadata on various parts of a neural-based model. This additionally decreases the number of parameters dramatically, especially when the number of categorical features is large. Extensive experiments on various data sets with different properties are performed and show that through our method, we can represent categorical metadata more effectively to customize parts of the model, including unexplored ones, and increase the performance of the model greatly.


IEEE Access ◽  
2021 ◽  
pp. 1-1
Author(s):  
Le Wang ◽  
Meng Han ◽  
Xiaojuan Li ◽  
Ni Zhang ◽  
Haodong Cheng

2005 ◽  
Vol 01 (01) ◽  
pp. 129-145 ◽  
Author(s):  
XIAOBO ZHOU ◽  
XIAODONG WANG ◽  
EDWARD R. DOUGHERTY

In microarray-based cancer classification, gene selection is an important issue owing to the large number of variables (gene expressions) and the small number of experimental conditions. Many gene-selection and classification methods have been proposed; however most of these treat gene selection and classification separately, and not under the same model. We propose a Bayesian approach to gene selection using the logistic regression model. The Akaike information criterion (AIC), the Bayesian information criterion (BIC) and the minimum description length (MDL) principle are used in constructing the posterior distribution of the chosen genes. The same logistic regression model is then used for cancer classification. Fast implementation issues for these methods are discussed. The proposed methods are tested on several data sets including those arising from hereditary breast cancer, small round blue-cell tumors, lymphoma, and acute leukemia. The experimental results indicate that the proposed methods show high classification accuracies on these data sets. Some robustness and sensitivity properties of the proposed methods are also discussed. Finally, mixing logistic-regression based gene selection with other classification methods and mixing logistic-regression-based classification with other gene-selection methods are considered.


RSC Advances ◽  
2018 ◽  
Vol 8 (1) ◽  
pp. 10-21 ◽  
Author(s):  
Anita Rácz ◽  
Attila Gere ◽  
Dávid Bajusz ◽  
Károly Héberger

A thorough survey of classification data sets and a rigorous comparison of classification methods show the unambiguous superiority of other techniques over soft independent modeling of class analogies (SIMCA – one class modeling) for classification.


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Samar Ali Shilbayeh ◽  
Sunil Vadera

Purpose This paper aims to describe the use of a meta-learning framework for recommending cost-sensitive classification methods with the aim of answering an important question that arises in machine learning, namely, “Among all the available classification algorithms, and in considering a specific type of data and cost, which is the best algorithm for my problem?” Design/methodology/approach This paper describes the use of a meta-learning framework for recommending cost-sensitive classification methods for the aim of answering an important question that arises in machine learning, namely, “Among all the available classification algorithms, and in considering a specific type of data and cost, which is the best algorithm for my problem?” The framework is based on the idea of applying machine learning techniques to discover knowledge about the performance of different machine learning algorithms. It includes components that repeatedly apply different classification methods on data sets and measures their performance. The characteristics of the data sets, combined with the algorithms and the performance provide the training examples. A decision tree algorithm is applied to the training examples to induce the knowledge, which can then be used to recommend algorithms for new data sets. The paper makes a contribution to both meta-learning and cost-sensitive machine learning approaches. Those both fields are not new, however, building a recommender that recommends the optimal case-sensitive approach for a given data problem is the contribution. The proposed solution is implemented in WEKA and evaluated by applying it on different data sets and comparing the results with existing studies available in the literature. The results show that a developed meta-learning solution produces better results than METAL, a well-known meta-learning system. The developed solution takes the misclassification cost into consideration during the learning process, which is not available in the compared project. Findings The proposed solution is implemented in WEKA and evaluated by applying it to different data sets and comparing the results with existing studies available in the literature. The results show that a developed meta-learning solution produces better results than METAL, a well-known meta-learning system. Originality/value The paper presents a major piece of new information in writing for the first time. Meta-learning work has been done before but this paper presents a new meta-learning framework that is costs sensitive.


Author(s):  
Saranya N. ◽  
Saravana Selvam

After an era of managing data collection difficulties, these days the issue has turned into the problem of how to process these vast amounts of information. Scientists, as well as researchers, think that today, probably the most essential topic in computing science is Big Data. Big Data is used to clarify the huge volume of data that could exist in any structure. This makes it difficult for standard controlling approaches for mining the best possible data through such large data sets. Classification in Big Data is a procedure of summing up data sets dependent on various examples. There are distinctive classification frameworks which help us to classify data collections. A few methods that discussed in the chapter are Multi-Layer Perception Linear Regression, C4.5, CART, J48, SVM, ID3, Random Forest, and KNN. The target of this chapter is to provide a comprehensive evaluation of classification methods that are in effect commonly utilized.


Sign in / Sign up

Export Citation Format

Share Document