Efficient text feature extraction by integrating the average linkage and K-medoids clustering

2021 ◽  
pp. 2150151
Author(s):  
Dasong Sun

Clustering feature words not only reduces the dimension of feature subsets but also eliminates redundant features. However, for a feature set with very large dimensions, it is difficult for the traditional K-medoids algorithm to accurately estimate the value of K. Moreover, the clustering results of the average linkage (AL) algorithm cannot be divided again, and the AL algorithm cannot be used directly for text classification. To overcome the limitations of AL and K-medoids, in this paper we combine the two algorithms so that they complement each other. In particular, to suit text classification, we improve the AL algorithm and propose the [Formula: see text] testing statistics to obtain the approximate number of clusters. Finally, the central feature words are preserved and the other feature words are deleted. The experimental results show that the new algorithm largely eliminates feature redundancy. Compared with the traditional TF-IDF algorithms, the new algorithm improves text classification performance.
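
The general idea of this abstract (let average linkage suggest how many clusters there are, then keep one central word per cluster) can be sketched roughly as follows. This is only an illustration, not the author's algorithm: the fixed distance `threshold` is an assumed stand-in for the paper's testing statistic, the medoid is picked in a single pass rather than by iterating K-medoids, and the word vectors are made-up toy data.

```python
from itertools import combinations
from math import dist


def average_linkage(vectors, threshold):
    """Merge the closest clusters until the closest pair exceeds threshold.

    The threshold is an assumption standing in for a statistical stopping test.
    """
    clusters = [[i] for i in range(len(vectors))]

    def avg_dist(a, b):
        # Average pairwise distance between the members of two clusters.
        total = sum(dist(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: avg_dist(clusters[p[0]], clusters[p[1]]),
        )
        if avg_dist(clusters[x], clusters[y]) > threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # y > x, so index x is unaffected
    return clusters


def medoid(cluster, vectors):
    """Return the member with the smallest total distance to its cluster."""
    return min(
        cluster,
        key=lambda i: sum(dist(vectors[i], vectors[j]) for j in cluster),
    )


# Hypothetical feature words with toy 2-D vectors.
words = ["goal", "score", "match", "stock", "market"]
vectors = [(1.0, 0.0), (0.9, 0.1), (0.8, 0.0), (0.0, 1.0), (0.1, 0.9)]

clusters = average_linkage(vectors, threshold=0.5)
kept = [words[medoid(c, vectors)] for c in clusters]  # central words to keep
```

Here the number of clusters falls out of the linkage step instead of being guessed in advance, and only the central word of each cluster is retained.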

2014 ◽  
Vol 1046 ◽  
pp. 444-448 ◽  
Author(s):  
Lu Chen ◽  
Tao Zhang ◽  
Yuan Yuan Ma ◽  
Cheng Zhou

With the rapid development of Internet and information technology and the emergence of large volumes of document data, text classification techniques for handling massive amounts of data are becoming increasingly important. This paper presents a distributed text feature extraction method based on the MapReduce distributed computing model. The method addresses the text size limits and inadequate performance encountered in mass text processing, and offers a new approach to research on text feature extraction.
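
To make the MapReduce idea concrete, here is a minimal single-process sketch that computes document frequency, a common text feature statistic, in map, shuffle and reduce phases. It only simulates the programming model in plain Python; it is not the paper's method, and the function names and sample documents are assumptions for illustration.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit (term, 1) once per distinct term in a document."""
    for term in set(text.lower().split()):
        yield term, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(term, counts):
    """Sum the per-document counts to get the term's document frequency."""
    return term, sum(counts)


# Hypothetical toy corpus.
docs = {
    "d1": "stock market rises",
    "d2": "market falls",
    "d3": "team wins match",
}

pairs = [p for doc_id, text in docs.items() for p in map_phase(doc_id, text)]
doc_freq = dict(reduce_phase(t, c) for t, c in shuffle(pairs).items())
```

In a real cluster the map calls would run in parallel on separate splits of the corpus, which is what lifts the text size limit on a single machine.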


2014 ◽  
Vol 599-601 ◽  
pp. 1824-1828
Author(s):  
Juan Wang ◽  
Zhi Xun Zhang ◽  
Yong Dong Wang

Feature extraction is a key step in text categorization[1]. The accuracy of extraction directly affects the accuracy of text classification. This paper introduces and compares four commonly used methods of text feature extraction: IG (information gain), MI (mutual information), CHI (chi-square statistic), and DF (document frequency), and proposes an improved method based on CHI. Experimental results show that the proposed method can improve the accuracy of text categorization.
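
For reference, the standard CHI score of a term for a category can be computed from a 2x2 document count table, as in the sketch below. This is the textbook chi-square statistic, not the improved variant the paper proposes (which the abstract does not describe). The argument names `a`, `b`, `c`, `d` and the example counts are assumptions for illustration.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic of a term for a category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # a degenerate table carries no evidence of association
    return n * (a * d - c * b) ** 2 / denominator


# Hypothetical counts: the term appears mostly inside the category.
dependent = chi_square(40, 10, 10, 40)

# Hypothetical counts: the term is spread evenly, so the score is zero.
independent = chi_square(25, 25, 25, 25)
```

Terms are then ranked by score per category, and the highest-scoring terms are kept as features.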


2017 ◽  
Vol 2017 ◽  
pp. 1-9 ◽  
Author(s):  
Lili Chen ◽  
Yaru Hao

Preterm birth (PTB) is the leading cause of perinatal mortality and long-term morbidity, which results in significant health and economic problems. The early detection of PTB has great significance for its prevention. The electrohysterogram (EHG), which reflects uterine contraction, is a noninvasive, real-time, and automatic technology that can be used to detect, diagnose, or predict PTB. This paper presents a method for feature extraction and classification of EHG between pregnancy and labour groups, based on the Hilbert-Huang transform (HHT) and the extreme learning machine (ELM). For each sample, each channel was decomposed into a set of intrinsic mode functions (IMFs) using empirical mode decomposition (EMD). Then, the Hilbert transform was applied to each IMF to obtain its analytic function. The maximum amplitude of the analytic function was extracted as the feature. The identification model was constructed based on ELM. Experimental results reveal that the best classification performance of the proposed method reaches an accuracy of 88.00%, a sensitivity of 91.30%, and a specificity of 85.19%. The area under the receiver operating characteristic (ROC) curve is 0.88. These results indicate that the method developed in this work could be effective in the classification of EHG between pregnancy and labour groups.
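
The feature described here, the maximum amplitude of the analytic function of an IMF, can be illustrated with a small pure-Python sketch. It assumes the input is already a single IMF (the EMD step is not shown) and builds the analytic signal with a naive discrete Fourier transform, which is fine for short toy signals but far too slow for real recordings. The test signal is synthetic, not EHG data.

```python
import cmath
import math


def analytic_signal(x):
    """Analytic signal via the frequency-domain Hilbert transform."""
    n = len(x)
    spectrum = [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    for k in range(n):
        if k == 0 or (n % 2 == 0 and k == n // 2):
            continue  # keep the DC and Nyquist bins unchanged
        if k < (n + 1) // 2:
            spectrum[k] *= 2  # double the positive frequencies
        else:
            spectrum[k] = 0  # discard the negative frequencies
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def max_amplitude(imf):
    """Largest instantaneous amplitude (envelope value) of one IMF."""
    return max(abs(z) for z in analytic_signal(imf))


# Synthetic stand-in for an IMF: a cosine with amplitude 3, 4 cycles in 64 samples.
imf = [3.0 * math.cos(2 * math.pi * 4 * t / 64) for t in range(64)]
feature = max_amplitude(imf)
```

One such value per IMF and channel would form the feature vector passed to the classifier.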


Author(s):  
Stuti Mehta ◽  
Suman K. Mitra

Text classification is an extremely important area of Natural Language Processing (NLP). This paper studies various methods for embedding and classification in the Gujarati language. The dataset comprises Gujarati news headlines classified into various categories. Different embedding methods for the Gujarati language and various classifiers are used to classify the headlines into the given categories. Gujarati is a low-resource language that is not commonly worked on. This paper deals with one of the most important NLP tasks, classification, and also gives an idea of the various embedding techniques for the Gujarati language, since they support feature extraction for classification. The paper first performs embedding to obtain a valid representation of the textual data and then uses existing robust classifiers to classify the embedded data. Additionally, the paper provides insight into how various NLP tasks can be performed on a low-resource language like Gujarati. Finally, the paper carries out a comparative analysis of the performance of various existing embedding and classification methods to show which combination gives a better outcome.


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Athar A. Ein Shoka ◽  
Monagi H. Alkinani ◽  
A. S. El-Sherbeny ◽  
Ayman El-Sayed ◽  
Mohamed M. Dessouky

Seizure is an abnormal electrical activity of the brain. Neurologists can diagnose a seizure using several methods, such as neurological examination, blood tests, computerized tomography (CT), magnetic resonance imaging (MRI) and electroencephalogram (EEG). Medical data, such as the EEG signal, usually include a number of features and attributes that do not contain important information. This paper proposes an automatic seizure classification system based on extracting the most significant EEG features for seizure diagnosis. The proposed algorithm consists of five steps. The first step is channel selection, which minimizes dimensionality by selecting the most affected channels using the variance parameter. The second step is feature extraction, which extracts the 11 most relevant features from the selected channels. The third step is to average the 11 features extracted from each channel. The fourth step is the classification of the averaged features. Finally, the fifth step is cross-validation and testing of the proposed algorithm by dividing the dataset into training and testing sets. This paper presents a comparative study of seven classifiers. These classifiers were tested using two different methods: random case testing and continuous case testing. In the random case test, the KNN classifier had greater precision, specificity, and positive predictability than the other classifiers, while the ensemble classifier had a higher sensitivity and a lower miss rate (2.3%) than the other classifiers. In the continuous case test, the ensemble classifier had higher metric values than the other classifiers. In addition, the ensemble classifier was able to detect all seizure cases without any mistakes.
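
The first three steps of this pipeline (variance-based channel selection, per-channel feature extraction, and averaging across channels) can be sketched as below. Several details are assumptions: "most affected" is taken to mean highest variance, only three illustrative features are computed because the abstract does not name the 11 it uses, and the channel data are toy values rather than EEG recordings.

```python
from statistics import mean, pvariance


def select_channels(channels, k):
    """Step 1: keep the k channels with the highest variance (assumed criterion)."""
    ranked = sorted(channels, key=lambda name: pvariance(channels[name]), reverse=True)
    return ranked[:k]


def extract_features(signal):
    """Step 2: illustrative features only (mean, variance, peak-to-peak range)."""
    return [mean(signal), pvariance(signal), max(signal) - min(signal)]


def averaged_features(channels, k):
    """Step 3: average each feature across the selected channels."""
    selected = select_channels(channels, k)
    per_channel = [extract_features(channels[name]) for name in selected]
    return [mean(column) for column in zip(*per_channel)]


# Hypothetical toy channels: c1 is flat, c3 varies the most.
channels = {
    "c1": [0, 0, 0, 0],
    "c2": [1, -1, 1, -1],
    "c3": [4, -4, 4, -4],
}

selected = select_channels(channels, 2)
features = averaged_features(channels, 2)
```

The resulting vector has one entry per feature regardless of how many channels were selected, so it can be passed directly to any of the compared classifiers.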


1997 ◽  
Vol 6 (1) ◽  
pp. 57-62 ◽  
Author(s):  
Wayne O. Olsen ◽  
Terri L. Pratt ◽  
Christopher D. Bauch

Multichannel ABR recordings for 30 otoneurologic patients were reviewed independently by three audiologists to assess interjudge consistency in determining absolute latencies and overall interpretation of ABR results. Four months later, the tracings were reviewed a second time to evaluate intrajudge consistency in interpretation of ABR waveforms. Interjudge agreement in marking latencies for waves I, III, and V within 0.2 ms was on the order of 90% or better. Intrajudge consistency was slightly higher. Only rarely did inter- or intrajudge differences in latency measurements exceed 0.3 ms. Agreement in overall interpretation of ABR results as "normal" or "abnormal" was unanimous for 90% of the patients. Across pairs of judges, the agreement for "normal" and "abnormal" classification of the ABR tracings was 97%. Intrajudge consistency for "normal" and "abnormal" categorization of the ABR results was 100% for one judge, 97% for the other two judges.


Author(s):  
I. R. Khuzina ◽  
V. N. Komarov

The paper considers a point of view based on a broad understanding of taxa. According to this view, rhyncholites of the subgenera Dentatobeccus and Microbeccus are treated as synonyms of the genus Rhynchoteuthis, and the subgenus Romanovichella is treated as a synonym of the genus Palaeoteuthis. The criteria that influence the different approaches to the classification of rhyncholites are analyzed (age and individual variability, sexual dimorphism, pathological and teratological features, and the degree of disintegration of the material); underestimating them can lead to inaccuracy. Stripping the subgenera Dentatobeccus, Microbeccus and Romanovichella, which possess very distinctive morphological characteristics, of their independent status and reducing them to synonyms is noted to be unjustified. An artificial system (in any proposed variant), with all its drawbacks, is the only feasible system for rhyncholites. The main criterion that minimizes its drawbacks and justifies the separation of a new taxon is the availability of abundant material. It is emphasized that a narrow understanding of the genus, applied within sensible limits, makes it easier to convey the concept of the genus to other investigators and to identify rhyncholites for practical purposes.


Author(s):  
Padmavathi .S ◽  
M. Chidambaram

Text classification has become increasingly significant in managing and organizing text data due to the tremendous growth of online information. It classifies documents into a fixed number of predefined categories. The rule-based approach and the machine learning approach are the two ways of performing text classification. In the rule-based approach, documents are classified according to manually defined rules. In the machine learning approach, classification rules or classifiers are derived automatically from example documents. The machine learning approach offers higher recall and faster processing. This paper presents an investigation of text classification using different machine learning techniques.


Author(s):  
I. Kukhtevich

Functional autonomic disorders account for a significant part of the practice of neurologists and of professionals in other specialties. However, there is no generally accepted classification of such disorders. In this paper the authors tried to show that functional autonomic pathology corresponds to the concept of somatoform disorders, which combines syndromes manifested by visceral, borderline psychopathological, and neurological symptoms that do not have an organic basis. The problem of somatoform disorders is relevant because, on the one hand, many health professionals are not familiar enough with the manifestations of borderline neuropsychiatric disorders, which often underlie functional autonomic disorders, and on the other hand, they overestimate somatoform symptoms that resemble somatic diseases.


ARTic ◽  
2019 ◽  
Vol 4 ◽  
pp. 167-176
Author(s):  
Risti Puspita Sari Hunowu

This research studies the Hunto Sultan Amay Mosque in Gorontalo City. The Hunto Sultan Amay Mosque is the oldest mosque in the city of Gorontalo. It was built as proof of Sultan Amay's love for a daughter and is a representation of Islam in Gorontalo. The researcher investigates the visual form of the Hunto Sultan Amay Mosque, which originally resembled the ancient mosques of the archipelago. This can be seen from the shape of the roof, which initially was an overlapping roof and was later converted into a dome like those of mosques elsewhere in the world; it is certain that the mosque adopted a dome roof after the arrival of the Dutch colonists. The researcher used a qualitative method, observing the existing form of the mosque building in detail with an aesthetic approach, reviewing objects, and selecting ornaments and classifying their shapes, so that this selection became the reference material for the research. Based on the analysis, the form of the Hunto Sultan Amay Mosque resembles that of the mosques in the archipelago, and the ornaments in the mosque, as decorative structures, support its grandeur. In addition, the ornaments of the Hunto Mosque convey a teaching. This teaching is manifested in the form of motifs that do not depict living beings in a realist or naturalist manner. The decorative forms of the Hunto Sultan Amay Mosque generally tend toward floral forms, geometric ornaments, and calligraphic ornaments, dominated by the distinctive colors of Islam, namely gold, white, red, yellow and green.

