Analyzing Linguistic Features for Classifying Why-Type Non-Factoid Questions

Author(s):  
Manvi Breja ◽  
Sanjay Kumar Jain

Why-type non-factoid questions are more complex and harder to answer than factoid questions. A key challenge in finding an accurate answer to a non-factoid question is understanding the user's intent, which varies with the user's knowledge and with the context in which the question is asked. Predicting the correct type of a question and of its answer with a classification model is important because it affects the subsequent processing of the answer. In this paper, a classification model is proposed that is trained on a combination of lexical, syntactic, and semantic features to classify open-domain why-type questions. Various supervised classifiers are trained on the resulting feature dataset, of which the support vector machine achieves the highest accuracy: 81% in determining question type and 76.8% in determining answer type, a 14.6 percentage-point improvement in answer-type prediction over a baseline why-type classifier with 62.2% accuracy.

2020 ◽  
Vol 15 ◽  
Author(s):  
Chun Qiu ◽  
Sai Li ◽  
Shenghui Yang ◽  
Lin Wang ◽  
Aihui Zeng ◽  
...  

Aim: To search for genes related to the mechanisms by which glioma occurs and to try to build a prediction model for glioblastomas. Background: The morbidity and mortality of glioblastomas are very high, which seriously endangers human health. At present, many investigations of gliomas aim mainly to understand the cause and mechanism of these tumors at the molecular level and to explore clinical diagnosis and treatment methods. However, there is no effective early diagnosis method for this disease, and there are no effective prevention, diagnosis or treatment measures. Methods: First, gene expression profiles were downloaded from GEO. Then, differentially expressed genes (DEGs) between the disease samples and the control samples were identified. After that, GO and KEGG enrichment analyses of the DEGs were performed with DAVID. Furthermore, the correlation-based feature subset (CFS) method was applied to select key DEGs. In addition, a classification model separating the glioblastoma samples from the controls was built with a Support Vector Machine (SVM) based on the selected key genes. Results and Discussion: Thirty-six DEGs, including 17 upregulated and 19 downregulated genes, were selected by the CFS method as the feature genes for building the classification model between the glioma samples and the control samples. The accuracy of the classification model was 76.25% in a 10-fold cross-validation test and 70.3% in an independent set test. In addition, PPP2R2B and CYBB were also among the top 5 hub genes screened by the protein–protein interaction (PPI) network. Conclusions: This study indicated that the CFS method is a useful tool for identifying key genes in glioblastomas. In addition, we predicted that genes such as PPP2R2B and CYBB might be potential biomarkers for the diagnosis of glioblastomas.
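
The correlation-based feature subset criterion named in the Methods can be illustrated with its usual merit score, which rewards correlation between features and the class and penalizes correlation among the features themselves. The sketch below is not the authors' pipeline: it assumes Pearson correlation as the association measure (the abstract does not say which measure was used), and the names `gene_a`, `gene_b`, and `labels` and their values are invented for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def cfs_merit(features, labels):
    """CFS merit of a subset: k * r_cf / sqrt(k + k * (k - 1) * r_ff).

    r_cf is the mean absolute feature-class correlation and r_ff the
    mean absolute feature-feature correlation over all pairs.
    """
    k = len(features)
    r_cf = sum(abs(pearson(f, labels)) for f in features) / k
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    if pairs:
        r_ff = sum(abs(pearson(features[i], features[j])) for i, j in pairs) / len(pairs)
    else:
        r_ff = 0.0  # a single feature has no redundancy term
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

# Hypothetical expression values: three control and three disease samples.
labels = [0, 0, 0, 1, 1, 1]
gene_a = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
gene_b = [2 * v for v in gene_a]  # perfectly redundant copy of gene_a

merit_single = cfs_merit([gene_a], labels)
merit_redundant = cfs_merit([gene_a, gene_b], labels)
```

Adding the perfectly redundant `gene_b` leaves the merit unchanged, which is why a CFS search tends to keep a compact set of genes rather than many correlated ones.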


2011 ◽  
Vol 181-182 ◽  
pp. 830-835
Author(s):  
Min Song Li

Latent Semantic Indexing (LSI) is an effective feature extraction method that can capture the underlying latent semantic structure between words in documents. However, it is probably not the most appropriate method for selecting a feature subspace for text categorization, since it orders the extracted features by their variance, not by their classification power. We propose a method based on the support vector machine to extract features and select an LSI subspace that is suited to classification. Experimental results indicate that the method improves classification performance with a more compact representation.


Molecules ◽  
2012 ◽  
Vol 17 (4) ◽  
pp. 4560-4582 ◽  
Author(s):  
Khac-Minh Thai ◽  
Thuy-Quyen Nguyen ◽  
Trieu-Du Ngo ◽  
Thanh-Dao Tran ◽  
Thi-Ngoc-Phuong Huynh

2019 ◽  
Vol 2 (2) ◽  
pp. 43
Author(s):  
Lalu Mutawalli ◽  
Mohammad Taufan Asri Zaen ◽  
Wire Bagye

In the era of technological disruption of mass communication, social media has become a reference for gauging public opinion. Social media users produce digital data very rapidly as they express their feelings in status posts and comments. This data production by the public gives rise to very large data sets, referred to as big data: collections that are very large, complex, and generated so quickly that they are difficult to handle. Big data can be analyzed with data mining methods to extract the patterns of knowledge it contains. This study analyzes the sentiment of Twitter users regarding the stabbing of Mr. Wiranto. The sentiment analysis showed that 41% of comments on the event were positive, 29% were neutral, and 29% were negative. In addition, the data were modeled with a support vector machine algorithm to create a system capable of classifying positive, neutral, and negative connotations. The resulting classification model was then tested using a confusion matrix, yielding a precision of 83%, a recall of 80%, and an accuracy of 80%.


Author(s):  
Noviah Dwi Putranti ◽  
Edi Winarko

Sentiment analysis in this research is the process of classifying textual documents into two classes, positive and negative sentiment. Opinion data were obtained from the social network Twitter using queries in Indonesian. The study aims to determine public sentiment toward a particular object as expressed on Twitter in Indonesian, and thereby to support market research on public opinion. The collected data were preprocessed and POS-tagged to produce a classification model through a training process. Sentiment-bearing words were collected with a dictionary-based approach, which yielded 18,069 words in this study. The Maximum Entropy algorithm was used for the POS tagger, and the Support Vector Machine algorithm was used to build the classification model from the training data. The features are unigrams with TF-IDF weighting. The classifier achieved an accuracy of 86.81% under 7-fold cross-validation with the sigmoid kernel. Manual class labeling with the POS tagger yielded an accuracy of 81.67%. Keywords: sentiment analysis, classification, maximum entropy POS tagger, support vector machine, Twitter.
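
The unigram TF-IDF weighting mentioned in the abstract can be sketched as follows. This is one common variant (relative term frequency multiplied by the natural log of the inverse document frequency); the abstract does not state which variant the authors used, and the function name `tfidf_unigrams` and the sample texts are hypothetical.

```python
import math
from collections import Counter

def tfidf_unigrams(docs):
    """Return one {term: weight} dict per document.

    Weight = (term count / document length) * ln(N / document frequency).
    Tokenization is plain lowercase whitespace splitting, which is an
    assumption; real tweets would need proper preprocessing.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

# Invented example texts, not data from the study.
docs = ["great phone great battery", "terrible phone", "great camera"]
vectors = tfidf_unigrams(docs)
```

In the first document, "battery" occurs once but appears in no other text, so it outweighs "great", which occurs twice but is shared with another document. Vectors of this kind would then be passed to the classifier.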


2021 ◽  
Vol 39 (11) ◽  
Author(s):  
Sahar Zolfaghari ◽  
Mohammad Hamiruce Marhaban ◽  
Siti Anom Ahmad ◽  
Asnor Juraiza Ishak ◽  
Pegah Khosropanah ◽  
...  

Motor-imagery brain-computer interfaces (BCIs), as rehabilitation tools for motor-disabled individuals, could inherently enrich neuroplasticity and subsequently restore mobility. However, a significant challenge in this endeavour is classifying left and right leg motor imagery (MI) tasks from non-stationary EEG signals. A subject-independent feature extraction method is essential in a BCI system, and this work develops a subject-independent algorithm to classify left/right leg motion intention. Multivariate Empirical Mode Decomposition was used to decompose the EEG recorded during left and right foot movement imagery tasks. We validated the proposed algorithm on open-access motor imagery data to detect the user's mental intention from EEG. Five subjects of various performance categories, each with almost 150 trials per left/right leg MI task from the hand/leg/tongue (HaLT) paradigm, were examined using the C3, C4, and Cz channels in order to generalize this study to all subjects. A set of statistical features was extracted from the intrinsic mode functions, and the most relevant features were selected for classification using Sequential Floating Feature Selection. Different classifiers were trained on the extracted features, and their performance was evaluated. The findings suggest that the non-linear support vector machine is the best classification model, with mean classification sensitivity, specificity, precision, negative predictive value, F-measure, and accuracy of 98.15%, 90.74%, 91.97%, 98.33%, 94.72%, and 94.44%, respectively. The proposed subject-independent signal processing method significantly improved the offline calibration mode by eliminating the frequency selection step, making it a common method for different types of MI-based BCI participants. Offline evaluations suggest that it can lead to significant increases in classification accuracy in comparison to current approaches.
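
The performance measures listed above all derive from the four cells of a binary confusion matrix. A minimal sketch, with one class (say, left-leg imagery) treated as positive; the function name `binary_metrics` and the counts are hypothetical and are not taken from the study.

```python
def binary_metrics(tp, fn, tn, fp):
    """Standard measures from a 2x2 confusion matrix.

    tp/fn: positive-class trials classified correctly/incorrectly.
    tn/fp: negative-class trials classified correctly/incorrectly.
    """
    sensitivity = tp / (tp + fn)   # recall on the positive class
    specificity = tn / (tn + fp)   # recall on the negative class
    precision = tp / (tp + fp)     # positive predictive value
    npv = tn / (tn + fn)           # negative predictive value
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "npv": npv,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }

# Invented counts for illustration: 50 trials per class.
metrics = binary_metrics(tp=45, fn=5, tn=40, fp=10)
```

With equal numbers of trials per class, accuracy is the mean of sensitivity and specificity (here 0.85 from 0.90 and 0.80).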


Molecules ◽  
2020 ◽  
Vol 25 (6) ◽  
pp. 1442 ◽  
Author(s):  
Tao Shen ◽  
Hong Yu ◽  
Yuan-Zhong Wang

Gentiana is one of the largest genera of Gentianoideae; most of its species have potential pharmaceutical value and are used in local traditional medicine. Because phytochemical diversity and bioactive compounds differ among species, it is crucial to identify authentic Gentiana species accurately. In this paper, the feasibility of using infrared spectroscopy combined with chemometric analysis to identify Gentiana and its related species was studied. A total of 180 batches of raw spectral fingerprints were obtained from 18 species of Gentiana and Tripterospermum by near-infrared (NIR: 10,000–4000 cm−1) and Fourier transform mid-infrared (MIR: 4000–600 cm−1) spectroscopy. First, principal component analysis (PCA) was used to explore the natural grouping of the 180 samples. Second, random forest (RF), support vector machine (SVM), and K-nearest neighbors (KNN) models were built using the full spectra (1487 NIR variables and 1214 FT-MIR variables, respectively). Based on the results for the calibration and prediction sets, the MIR-SVM model had a higher classification accuracy than the other models. Five feature selection strategies, VIP (variable importance in the projection), Boruta, GARF (genetic algorithm combined with random forest), GASVM (genetic algorithm combined with support vector machine), and Venn diagram calculation, were then used to reduce the number of variables for modeling; 101 NIR and 73 FT-MIR bands were finally selected as the feature variables, respectively. Third, stacking models were built on the optimal spectral dataset. Most of the stacking models performed better than the full-spectra models. RF and SVM as base learners, combined with an SVM meta-classifier, formed the optimal stacked generalization strategy. For the SG-Ven-MIR-SVM model, the accuracy (ACC) was 100% on both the calibration set and the validation set. Sensitivity (SE), specificity (SP), efficiency (EFF), Matthews correlation coefficient (MCC), and Cohen's kappa coefficient (K) were all 1, which showed that the model had the best authenticity identification performance. These results indicate that stacked generalization combined with feature selection is probably an important technique for improving the predictive accuracy of a classification model and avoiding overfitting. The study can provide a valuable reference for the safety and effectiveness of the clinical application of medicinal Gentiana.
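
The Matthews correlation coefficient and Cohen's kappa reported above both equal 1 only when every sample is classified correctly. The sketch below shows the two-class forms of both statistics; the study itself involves 18 species and would need the multiclass versions, so this is a simplification, and the function names and counts are hypothetical.

```python
import math

def matthews_corrcoef_binary(tp, fn, tn, fp):
    """Two-class MCC from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator

def cohen_kappa_binary(tp, fn, tn, fp):
    """Two-class Cohen's kappa: observed agreement corrected for chance."""
    n = tp + fn + tn + fp
    observed = (tp + tn) / n
    # Chance agreement from the marginal totals of truth and prediction.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented counts: a perfect classifier and one with three errors.
mcc_perfect = matthews_corrcoef_binary(tp=10, fn=0, tn=10, fp=0)
kappa_perfect = cohen_kappa_binary(tp=10, fn=0, tn=10, fp=0)
mcc_imperfect = matthews_corrcoef_binary(tp=8, fn=2, tn=9, fp=1)
kappa_imperfect = cohen_kappa_binary(tp=8, fn=2, tn=9, fp=1)
```

Both statistics drop below 1 as soon as any sample is misclassified, so they are stricter summaries than accuracy alone when classes are imbalanced.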


2019 ◽  
Vol 2019 ◽  
pp. 1-6
Author(s):  
Lu Xu ◽  
Qiong Shi ◽  
Bang-Cheng Tang ◽  
Shunping Xie

A rapid indicator of mercury in soil was established using a plant (Artemisia lavandulaefolia DC., ALDC) commonly distributed in mercury mining areas, by combining Fourier-transform near-infrared (FT-NIR) spectroscopy with the least squares support vector machine (LS-SVM). Representative samples of ALDC (stem and leaf) were gathered from areas surrounding and distant from the mercury mines. As a reference method, the total mercury contents of the soil and ALDC samples were determined by a direct mercury analyzer incorporating high-temperature decomposition, catalytic adsorption for impurity removal, amalgamation capture, and atomic absorption spectrometry (AAS). Based on the FT-NIR data of the ALDC samples, LS-SVM models were established to distinguish mercury-contaminated soil from ordinary soil. The reference analysis showed that the mercury level in the areas surrounding the mercury mines (0–3 kilometers, 7.52–88.59 mg/kg) was significantly higher than in the areas distant from them (>5 kilometers, 0–0.75 mg/kg). LS-SVM classification models of the ALDC samples were established on the original spectra, smoothed spectra, second-derivative (D2) spectra, and standard normal variate (SNV) spectra, respectively. The prediction accuracy of D2-LS-SVM was the highest (0.950). FT-NIR combined with LS-SVM modeling can quickly and accurately identify contaminated ALDC. Compared with traditional methods, which rely on naked-eye observation of plants, this method is objective, more sensitive, and more widely applicable.
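
Two of the spectral pretreatments named above can be sketched in a few lines. SNV centers and scales each spectrum by its own mean and standard deviation; the second derivative is approximated here by plain central differences, which is an assumption, since the abstract does not say how the D2 spectra were computed (smoothed derivative filters are common in practice). The absorbance values are invented.

```python
import math

def snv(spectrum):
    """Standard normal variate: per-spectrum centering and scaling."""
    n = len(spectrum)
    mean = sum(spectrum) / n
    # Sample standard deviation (n - 1 in the denominator).
    std = math.sqrt(sum((v - mean) ** 2 for v in spectrum) / (n - 1))
    return [(v - mean) / std for v in spectrum]

def second_derivative(spectrum):
    """Central-difference second derivative, assuming unit spacing.

    The result is two points shorter than the input because the end
    points have no neighbor on one side.
    """
    return [
        spectrum[i - 1] - 2 * spectrum[i] + spectrum[i + 1]
        for i in range(1, len(spectrum) - 1)
    ]

# Hypothetical absorbance readings at evenly spaced wavenumbers.
absorbance = [0.10, 0.12, 0.18, 0.30, 0.22, 0.15]
snv_spectrum = snv(absorbance)
d2_spectrum = second_derivative(absorbance)
```

A constant offset or a linear baseline slope contributes nothing to the second derivative, which is one reason derivative spectra can help a classifier ignore baseline drift between samples.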


2020 ◽  
Vol 20 (S14) ◽  
Author(s):  
Bin Ma ◽  
Zhaolong Wu ◽  
Shengyu Li ◽  
Ryan Benton ◽  
Dongqi Li ◽  
...  

Background: Obstructive sleep apnea syndrome (OSAS) is a breathing disorder that occurs only during sleep. While polysomnography (PSG) is the premier standard for diagnosing OSAS, it is quite costly, complicated to use, and involves a significant delay between testing and diagnosis. Methods: This work describes a novel architecture and algorithm designed to diagnose OSAS efficiently using smart phones. In our algorithm, features are extracted from the data, specifically blood oxygen saturation as represented by SpO2. These features are used by a support vector machine (SVM) based strategy to create a classification model, which can then be employed to diagnose OSAS. To allow remote diagnosis, we have combined a simple monitoring system with our algorithm. The system allows physiological data to be obtained from a smart phone, uploaded to the cloud for processing, and returned to the smart phone as a diagnostic report in real time. Results: Our initial evaluation of this algorithm on actual patient data finds its sensitivity, accuracy, and specificity to be 87.6%, 90.2%, and 94.1%, respectively. Discussion: Our architecture can monitor human physiological readings in real time and give early warning of abnormal physiological parameters. Moreover, our evaluation finds that 5G technology offers higher bandwidth with lower delays, ensuring more effective monitoring. In addition, when evaluated on real-world data, the proposed approach shows high accuracy, sensitivity, and specificity, demonstrating that it is very promising. Conclusions: Experimental results on the apnea data in the University College Dublin (UCD) Database have demonstrated the efficiency and effectiveness of our methodology. This work is a pilot project and still under development. There is no clinical validation and no support. In addition, the Internet of Things (IoT) architecture enables real-time monitoring of human physiological parameters, combined with diagnostic algorithms to provide early warning of abnormal data.

