The Influence of Feature Representation of Text on the Performance of Document Classification

2019 ◽  
Vol 9 (4) ◽  
pp. 743 ◽  
Author(s):  
Sanda Martinčić-Ipšić ◽  
Tanja Miličić ◽  
and Todorovski

In this paper we perform a comparative analysis of three models for a feature representation of text documents in the context of document classification. In particular, we consider the most often used family of bag-of-words models, the recently proposed continuous space models word2vec and doc2vec, and the model based on the representation of text documents as language networks. While the bag-of-words models have been extensively used for the document classification task, the performance of the other two models for the same task has not been well understood. This is especially true for the network-based models, which have rarely been considered for the representation of text documents for classification. In this study, we measure the performance of the document classifiers trained using the method of random forests for features generated with the three models and their variants. Multi-objective rankings are proposed as the framework for multi-criteria comparative analysis of the results. The results of the empirical comparison show that the commonly used bag-of-words model has a performance comparable to the one obtained by the emerging continuous-space model of doc2vec. In particular, the low-dimensional variants of doc2vec generating up to 75 features are among the top-performing document representation models. Finally, the results point out that doc2vec shows a superior performance in the tasks of classifying large documents.
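As a rough illustration of the bag-of-words representation mentioned in the abstract above, the following minimal sketch maps documents to term-frequency vectors using only the Python standard library. The tokenizer, the function names (`tokenize`, `build_vocabulary`, `bag_of_words`) and the toy corpus are assumptions made for illustration. This is not the authors' pipeline, and it does not cover the random forest classifier, word2vec, doc2vec, or the language-network features.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_vocabulary(documents):
    """Return the sorted list of distinct tokens across all documents."""
    return sorted({token for doc in documents for token in tokenize(doc)})


def bag_of_words(document, vocabulary):
    """Map a document to a term-frequency vector aligned with the vocabulary."""
    counts = Counter(tokenize(document))
    # Counter returns 0 for terms that do not occur in the document.
    return [counts[term] for term in vocabulary]


if __name__ == "__main__":
    docs = ["the cat sat", "the dog sat on the mat"]  # hypothetical toy corpus
    vocab = build_vocabulary(docs)
    for doc in docs:
        print(bag_of_words(doc, vocab))
```

Each vector has one entry per vocabulary term, so the dimensionality grows with the vocabulary. Low-dimensional embeddings such as doc2vec avoid this growth.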

Author(s):  
Jiajie Peng ◽  
Hansheng Xue ◽  
Zhongyu Wei ◽  
Idil Tuncali ◽  
Jianye Hao ◽  
...  

Motivation: The emergence of abundant biological networks, which benefit from the development of advanced high-throughput techniques, contributes to describing and modeling complex internal interactions among biological entities such as genes and proteins. Multiple networks provide rich information for inferring the function of genes or proteins. To extract functional patterns of genes based on multiple heterogeneous networks, network embedding-based methods, aiming to capture non-linear and low-dimensional feature representation based on network biology, have recently achieved remarkable performance in gene function prediction. However, existing methods do not consider the shared information among different networks during the feature learning process. Results: Taking the correlation among the networks into account, we design a novel semi-supervised autoencoder method to integrate multiple networks and generate a low-dimensional feature representation. Then we utilize a convolutional neural network based on the integrated feature embedding to annotate unlabeled gene functions. We test our method on both yeast and human datasets and compare it with three state-of-the-art methods. The results demonstrate the superior performance of our method. We not only provide a comprehensive analysis of the performance of the newly proposed algorithm but also provide a tool for extracting features of genes based on multiple networks, which can be used in the downstream machine learning task. Availability: DeepMNE-CNN is freely available at https://github.com/xuehansheng/DeepMNE-CNN


2019 ◽  
Author(s):  
Hansheng Xue ◽  
Jiajie Peng ◽  
Xuequn Shang

Motivation: The emergence of abundant biological networks, which benefit from the development of advanced high-throughput techniques, contributes to describing and modeling complex internal interactions among biological entities such as genes and proteins. Multiple networks provide rich information for inferring the function of genes or proteins. To extract functional patterns of genes based on multiple heterogeneous networks, network embedding-based methods, aiming to capture non-linear and low-dimensional feature representation based on network biology, have recently achieved remarkable performance in gene function prediction. However, most existing methods do not consider the shared information among different networks during the feature learning process. Thus, we propose a novel multi-network embedding-based function prediction method based on a semi-supervised autoencoder and a feature convolution neural network, named DeepMNE-CNN, which captures complex topological structures of multiple networks and takes the correlation among them into account. Results: We design a novel semi-supervised autoencoder method to integrate multiple networks and generate a low-dimensional feature representation. Then we utilize a convolutional neural network based on the integrated feature embedding to annotate unlabeled gene functions. We test our method on both yeast and human datasets and compare it with four state-of-the-art methods. The results demonstrate the superior performance of our method over these four algorithms. From further exploration, we find that the semi-supervised autoencoder-based multi-network integration method and the CNN-based feature learning method both contribute to the task of function prediction. Availability: DeepMNE-CNN is freely available at https://github.com/xuehansheng/DeepMNE-CNN


2021 ◽  
pp. 1-11
Author(s):  
Velichka Traneva ◽  
Stoyan Tranev

Analysis of variance (ANOVA) is an important method in data analysis, which was developed by Fisher. There are situations in which the data are imprecise. In order to analyze such data, this paper introduces for the first time an intuitionistic fuzzy two-factor ANOVA (2-D IFANOVA) without replication as an extension of the classical ANOVA and the one-way IFANOVA for the case where the data are intuitionistic fuzzy rather than real numbers. The proposed approach employs the apparatus of intuitionistic fuzzy sets (IFSs) and index matrices (IMs). The paper also analyzes a unique set of data on daily ticket sales for a year in a multiplex of Cinema City Bulgaria, part of Cineworld PLC Group, applying the two-factor ANOVA and the proposed 2-D IFANOVA to study the influence of the “season” and “ticket price” factors. A comparative analysis of the results obtained by applying ANOVA and 2-D IFANOVA to the real data set is also presented.
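For orientation, the classical (crisp) two-factor ANOVA without replication that the abstract above takes as its starting point can be sketched as follows. This is only the textbook real-valued procedure under assumed names (`two_factor_anova`, `table`). It is not the intuitionistic fuzzy 2-D IFANOVA proposed in the paper, and the numbers in the demo are hypothetical rather than the cinema ticket data.

```python
def two_factor_anova(table):
    """Classical two-factor ANOVA without replication.

    `table[i][j]` is the single observation for level i of factor A (rows)
    and level j of factor B (columns). Returns sums of squares and F ratios.
    Assumes at least two levels per factor and a non-zero residual sum of
    squares (otherwise the F ratios are undefined).
    """
    a, b = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (a * b)
    row_means = [sum(row) / b for row in table]
    col_means = [sum(table[i][j] for i in range(a)) / a for j in range(b)]

    # Partition the total variation into factor A, factor B and residual.
    ss_a = b * sum((m - grand) ** 2 for m in row_means)
    ss_b = a * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in table for x in row)
    ss_error = ss_total - ss_a - ss_b

    df_a, df_b = a - 1, b - 1
    df_error = df_a * df_b
    ms_error = ss_error / df_error

    return {
        "SS_A": ss_a,
        "SS_B": ss_b,
        "SS_error": ss_error,
        "F_A": (ss_a / df_a) / ms_error,
        "F_B": (ss_b / df_b) / ms_error,
    }


if __name__ == "__main__":
    # Hypothetical 2 x 2 table: rows are levels of factor A, columns of factor B.
    print(two_factor_anova([[1, 2], [3, 5]]))
```

Each F ratio would then be compared against the F distribution with the corresponding degrees of freedom to decide whether a factor has a significant effect.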


2021 ◽  
pp. 016344372110227
Author(s):  
Yingzi Wang ◽  
Thoralf Klein

This paper examines the changes and continuities in TV representations of the Chinese Communist Party’s revolutionary history and interprets them within the broader context of China’s political, economic and cultural transformations since the 1990s. Drawing on a comparative analysis of three state-sponsored TV dramas produced between the late 1990s and mid-2010s, it traces how the state-sanctioned revolutionary narratives have changed over time in response to the Party’s propaganda imperatives on the one hand, and to the market-oriented production environment on the other. The paper argues that while TV productions in the new century have made increasing concessions to audience taste by adopting visually stimulating depictions and introducing fictional characters as points of identification for the audience, the revolutionary narratives were still aligned with the Party’s propaganda agenda at different times. This shows the ongoing competition between ideological and commercial interests in Chinese TV production during the era of market reforms.


2018 ◽  
Vol 35 (16) ◽  
pp. 2757-2765 ◽  
Author(s):  
Balachandran Manavalan ◽  
Shaherin Basith ◽  
Tae Hwan Shin ◽  
Leyi Wei ◽  
Gwang Lee

Motivation: Cardiovascular disease is the primary cause of death globally, accounting for approximately 17.7 million deaths per year. One of the risk factors linked with cardiovascular diseases and other complications is hypertension. Naturally derived bioactive peptides with antihypertensive activities serve as promising alternatives to pharmaceutical drugs. So far, there has been no comprehensive analysis or assessment of diverse features, or implementation of various machine-learning (ML) algorithms, for antihypertensive peptide (AHTP) model construction. Results: In this study, we utilized six different ML algorithms, namely AdaBoost, extremely randomized tree (ERT), gradient boosting (GB), k-nearest neighbor, random forest (RF) and support vector machine (SVM), using 51 feature descriptors derived from eight different feature encodings for the prediction of AHTPs. Since ERT-based models performed consistently better than the other algorithms regardless of the feature descriptors, we treated them as baseline predictors, whose predicted probability of AHTPs was further used as input features separately for four different ML algorithms (ERT, GB, RF and SVM), and developed their corresponding meta-predictors using a two-step feature selection protocol. Subsequently, the integration of the four meta-predictors through an ensemble learning approach improved the balanced prediction performance and model robustness on the independent dataset. Upon comparison with existing methods, mAHTPred showed superior performance with an overall improvement of approximately 6–7% in both benchmarking and independent datasets. Availability and implementation: The user-friendly online prediction tool, mAHTPred, is freely accessible at http://thegleelab.org/mAHTPred. Supplementary information: Supplementary data are available at Bioinformatics online.


10.28945/3033 ◽  
2006 ◽  
Author(s):  
G. Adesola Aderounmu ◽  
Bosede Oyatokun ◽  
Matthew Adigun

This paper presents a comparative analysis of the Remote Method Invocation (RMI) and Mobile Agent (MA) paradigms used to implement an information storage and retrieval system in a distributed computing environment. A simulation program was developed in an object-oriented programming language to measure the performance of MA and RMI. We used search time, fault tolerance and invocation cost as performance parameters in this research work. Experimental results showed that the Mobile Agent paradigm offers superior performance compared to the RMI paradigm: it offers faster computational speed and incurs lower invocation cost by making local invocations instead of remote invocations over the network, thereby reducing network bandwidth usage. Finally, MA has better fault tolerance than RMI. With a probability of failure pr = 0.1, the mobile agent degrades gracefully.


2003 ◽  
Vol 2 (3) ◽  
pp. 75-80
Author(s):  
I. D. Yevtushenko ◽  
A. Sh. Makhmutkhodzhayev ◽  
T. V. Ivanova ◽  
O. V. Parshina ◽  
I. A. Ryzhova ◽  
...  

A prospective clinical examination was carried out of 90 women with full-term pregnancy and indications for labor induction because of insufficient cervical ripeness. The aim was to compare the efficacy of intravaginal administration of misoprostol, a synthetic prostaglandin E1 analogue («Cytotec»), and intracervical administration of the prostaglandin E2 dinoprostone («Prepidil» gel) for cervical ripening and labor induction at full term. Misoprostol at a dose of 25 µg was administered to the women in group 1 (n=44) every 4 hours, not more than 3 times. If the membranes ruptured or labor activity began, a repeat dose was not given. Dinoprostone was administered intracervically in a single dose to the women in group 2 (n=46). The use of misoprostol was accompanied by spontaneous onset of labor twice as often as the use of dinoprostone. The number of vaginal births within 12 and 24 hours of observation was significantly greater, and the interval between the start of administration and labor significantly shorter, in the group that received misoprostol than in the group that received dinoprostone. No differences were found between the groups in the frequency of uterine hyperstimulation, labor duration, the frequency of abdominal and vaginal delivery, or perinatal outcomes.


Author(s):  
Fareed Moosa

Sections 45 and 63 of the Tax Administration Act 28 of 2011 (TAA) confer drastic information gathering powers on officials of the South African Revenue Service (SARS). On the one hand, section 45 permits warrantless routine (non-targeted) and non-routine (targeted) inspections by a SARS official in respect of records, books of accounts and documents found at premises where a taxpayer is reasonably believed to be conducting a trade or enterprise. The purpose of such inspection is to determine whether there has been compliance with specific obligations by the taxpayer. Section 63, on the other hand, permits, on the grounds of urgency and expediency in exceptional circumstances only, warrantless non-routine (targeted) searches by a senior SARS official of a taxpayer and of third parties associated with a taxpayer, as well as searches of a taxpayer's premises and those of third parties. In addition, section 63 permits the seizure of relevant material found at premises searched. All searches and seizures must occur for the purposes of the efficient and effective administration of tax Acts generally. A comparative analysis of sections 45 and 63 of the TAA reveals the existence of key differences in the substance and practical operation of their provisions. This article distils these differences through an in-depth discussion of the nature and extent of the powers of inspection and search conferred by these provisions, as well as by conceptualising the terms “inspection” and “search” for the purposes of sections 45 and 63 respectively.    


2020 ◽  
Vol 49 (3) ◽  
pp. 421-437
Author(s):  
Genggeng Liu ◽  
Lin Xie ◽  
Chi-Hua Chen

Dimensionality reduction plays an important role in data processing for machine learning and data mining, because it makes the processing of high-dimensional data more efficient. Dimensionality reduction extracts a low-dimensional feature representation of high-dimensional data, and an effective method not only retains most of the useful information in the original data but also removes useless noise. Dimensionality reduction methods can be applied to all types of data, especially image data. Although supervised learning has achieved good results in dimensionality reduction, its performance depends on the number of labeled training samples. With the growth of information on the internet, labeling data requires more resources and becomes more difficult. Therefore, using unsupervised learning to learn the features of data has important research value. In this paper, an unsupervised multilayered variational auto-encoder model is studied on text data, so that the mapping from high-dimensional features to low-dimensional features becomes efficient and the low-dimensional features retain as much of the main information as possible. Low-dimensional features obtained by different dimensionality reduction methods are compared with the dimensionality reduction results of the variational auto-encoder (VAE), and the method shows a significant improvement over the other methods compared.


2017 ◽  
Vol 33 (1) ◽  
pp. 155-186
Author(s):  
Marcela Cohen Martelotte ◽  
Reinaldo Castro Souza ◽  
Eduardo Antônio Barros da Silva

Considering that many macroeconomic time series present changing seasonal behaviour, there is a need for filters that are robust to such changes. This article proposes a method to design seasonal filters that address this problem. The design was made in the frequency domain to estimate seasonal fluctuations that are spread around specific bands of frequencies. We assessed the generated filters by applying them to artificial data with known seasonal behaviour based on that of real macroeconomic series, and we compared their performance with that of X-13A-S. The results show that the designed filters have superior performance for series with pronounced moving seasonality, making them a good alternative in these cases.

