An Extension of the VSM Documents Representation using Word Embedding

2017 ◽  
Vol 2 (1) ◽  
pp. 249-257
Author(s):  
Daniel Morariu ◽  
Lucian Vințan ◽  
Radu Crețulescu

Abstract In this paper, we present experiments that integrate the power of Word Embedding representations into real document-classification problems. Word Embedding is a recent trend in the natural language processing domain that represents each word of a document as a vector. This representation embeds the semantic context in which the word most frequently occurs. We add this new representation to a classical VSM document representation and evaluate it using a learning algorithm based on the Support Vector Machine. The added information makes classification harder in practice, since it increases both learning time and memory requirements, and the results obtained are slightly weaker than those of the classical VSM representation alone. By combining the WE representation with the classical VSM representation, we also aim to broaden the current educational paradigm for computer science students, which is generally limited to the VSM representation.
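A minimal sketch of the idea described above: concatenating a TF-IDF (VSM) document vector with an averaged word-embedding vector, then training an SVM on the combined representation. The embedding table, documents, and labels below are toy placeholders, not the paper's actual data or embedding model.

```python
# Sketch: extend a classical VSM (TF-IDF) representation with averaged
# word embeddings and train a linear SVM on the concatenation.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["the cat sat on the mat", "stocks fell sharply today"]
labels = [0, 1]

# Hypothetical pretrained embeddings: word -> 3-dimensional vector.
embeddings = {"cat": np.array([0.1, 0.3, 0.2]),
              "mat": np.array([0.0, 0.2, 0.1]),
              "stocks": np.array([0.9, 0.1, 0.4]),
              "fell": np.array([0.8, 0.0, 0.3])}
dim = 3

def doc_embedding(doc):
    """Average the embeddings of in-vocabulary words (zeros if none)."""
    vecs = [embeddings[w] for w in doc.split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

vsm = TfidfVectorizer()
X_vsm = vsm.fit_transform(docs).toarray()           # classical VSM part
X_we = np.vstack([doc_embedding(d) for d in docs])  # embedding part
X = np.hstack([X_vsm, X_we])                        # extended representation

clf = LinearSVC().fit(X, labels)
print(clf.predict(X))
```

Note that the combined vector is wider than the VSM vector alone, which is consistent with the abstract's observation that the extension increases learning time and memory use.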

Author(s):  
Chaudhary Jashubhai Rameshbhai ◽  
Joy Paulose

Opinion Mining, also known as Sentiment Analysis, is a technique that uses Natural Language Processing (NLP) to classify the sentiment expressed in text. Various NLP tools are available for processing text data, and much opinion-mining research has targeted online blogs, Twitter, Facebook, and similar sources. This paper proposes a new opinion mining technique that applies a Support Vector Machine (SVM) and NLP tools to newspaper headlines. Relevant words are extracted using Stanford CoreNLP and passed to the SVM via a count vectorizer. Comparing three models using a confusion matrix, the results indicate that TF-IDF with a linear SVM provides better accuracy on the smaller dataset, while on the larger dataset the SGD-trained linear SVM outperforms the other models.
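A hedged sketch of the two set-ups the abstract compares: TF-IDF features with a batch-trained linear SVM versus an SGD-trained linear model with hinge loss (an SVM-style objective). The headlines and labels are invented placeholders, and Stanford CoreNLP preprocessing is omitted for brevity.

```python
# Sketch: compare TF-IDF + LinearSVC against TF-IDF + SGD (hinge loss)
# for headline sentiment classification, scored with a confusion matrix.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix

headlines = ["markets rally on strong earnings",
             "flood damages hundreds of homes",
             "team celebrates championship win",
             "wildfire forces mass evacuation"]
sentiment = [1, 0, 1, 0]  # 1 = positive, 0 = negative

svm_model = make_pipeline(TfidfVectorizer(), LinearSVC())
sgd_model = make_pipeline(TfidfVectorizer(),
                          SGDClassifier(loss="hinge"))  # SVM-style loss

for name, model in [("TF-IDF + LinearSVC", svm_model),
                    ("TF-IDF + SGD", sgd_model)]:
    model.fit(headlines, sentiment)
    print(name)
    print(confusion_matrix(sentiment, model.predict(headlines)))
```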


2020 ◽  
Vol 2 (4) ◽  
Author(s):  
Alex Mathew

The paper reviews how human-centered artificial intelligence and security primitives have influenced life in the modern world and how they will be useful in the future. Human-centered A.I. has enhanced our capabilities through intelligent, human-informed technology, creating technology that lets machines and computers carry out their functions intelligently. Security primitives have enhanced the safety of data and increased its accessibility from anywhere, provided the password is known. This has improved personalized customer experiences and narrowed the gap between human and machine. These advances rely on heuristics, which solve problems experimentally; support vector machines, which evaluate and group data; and natural language processing systems, which convert speech to language. The results will extend to image recognition, games, speech recognition, translation, and question answering. In conclusion, human-centered A.I. and security primitives form an advanced mode of technology that uses statistical and mathematical models to provide tools for performing certain work. These results keep advancing and spreading over the years and will become common in our lives.


2021 ◽  
Vol 8 (6) ◽  
pp. 1265
Author(s):  
Muhammad Alkaff ◽  
Andreyan Rizky Baskara ◽  
Irham Maulani

Lapor! is a service system through which the Indonesian public can submit aspirations and complaints about government services. The government has long used this system to address the bureaucratic problems reported by Indonesian citizens. However, the growing volume of reports, which operators sort by reading every complaint that arrives through the system, frequently leads to errors in which operators forward reports to the wrong agency. A solution is therefore needed that can determine a report's context automatically using Natural Language Processing techniques. This study aims to classify reports automatically by topic and route them to the authorized agency by combining Latent Dirichlet Allocation (LDA) and Support Vector Machine (SVM). Topic modeling for each report is performed with the LDA method, which extracts reports to find specific patterns in the documents and outputs topic-distribution values. The classification that determines a report's destination agency is then carried out with an SVM operating on the topic values extracted by the LDA method. The LDA-SVM model's performance is measured with a confusion matrix, computing accuracy, precision, recall, and F1 score. Test results using a 70:30 train-test split show that the model performs well, with 79.85% accuracy, 79.98% precision, 72.37% recall, and a 74.67% F1 score.
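An illustrative LDA-to-SVM pipeline in the spirit of this abstract: LDA topic distributions become the SVM's input features, evaluated with a 70:30 split. The complaint texts, agency labels, and topic count are invented stand-ins for the Lapor! data.

```python
# Sketch: LDA topic-distribution features feeding an SVM classifier
# that routes citizen reports to a destination agency.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

reports = ["pothole on the main road needs repair",
           "teacher absent from the village school again",
           "broken streetlight on the main road",
           "school lacks textbooks for students"] * 5  # repeated so the split works
agency = (["public_works", "education"] * 2) * 5

counts = CountVectorizer().fit_transform(reports)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topics = lda.fit_transform(counts)  # each row: a topic-distribution vector

X_tr, X_te, y_tr, y_te = train_test_split(topics, agency,
                                          test_size=0.3, random_state=0)
clf = SVC().fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```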


Detecting the author of each sentence in a collective document can be done by choosing a suitable set of features and implementing them with Natural Language Processing and Machine Learning. The basic idea is to train a machine to identify the author of a specific sentence. This is done through eight NLP steps, such as applying a stemming algorithm, finding stop-list words, and preprocessing the data, before passing the result to a machine learning classifier, a Support Vector Machine (SVM), which partitions the dataset into classes corresponding to the authors and assigns an author name to each sentence with an accuracy of 82%. This paper helps readers who are interested in knowing which authors wrote specific passages.
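A hedged sketch of the pipeline outlined above: stop-word filtering and stemming feed a TF-IDF + SVM classifier that predicts an author per sentence. The stop-word list, sentences, and author labels are toy stand-ins; NLTK's PorterStemmer is assumed to be installed.

```python
# Sketch: sentence-level authorship attribution with stop-word removal,
# stemming, TF-IDF features, and a linear SVM.
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

stop_words = {"the", "a", "an", "of", "and", "to", "in", "it", "is", "was"}
stemmer = PorterStemmer()

def preprocess(sentence):
    """Lowercase, drop stop-list words, and stem the remainder."""
    tokens = sentence.lower().split()
    return " ".join(stemmer.stem(t) for t in tokens if t not in stop_words)

sentences = ["The whale surfaced in the cold morning light",
             "It is a truth universally acknowledged",
             "The harpoon struck deep into the leviathan",
             "She was the handsomest creature of the county"]
authors = ["Melville", "Austen", "Melville", "Austen"]

X = TfidfVectorizer().fit_transform(preprocess(s) for s in sentences)
clf = LinearSVC().fit(X, authors)
print(clf.predict(X))  # predicted author per sentence
```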


SAINTEKBU ◽  
2020 ◽  
Vol 12 (2) ◽  
pp. 40-44
Author(s):  
Iin Kurniasari

Facebook is one of the most frequently used social media platforms, especially during the current COVID-19 pandemic. A great deal of public sentiment circulates there, particularly in the form of comments on COVID-19 information, and analyzing it for various purposes is challenging. NLP (Natural Language Processing) techniques consisting of case folding, tokenizing, filtering, and stemming can be applied in this case. This study focuses on developing sentiment analysis of Facebook data using a lexicon approach and a Support Vector Machine. The lexicon-based results achieved lower accuracy than the Support Vector Machine.
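A minimal sketch contrasting the two approaches this abstract compares, after the four NLP steps it names (case folding, tokenizing, filtering, stemming). The lexicon entries, comments, labels, and the crude suffix-stripping "stemmer" are all invented for illustration.

```python
# Sketch: lexicon-based sentiment scoring versus an SVM classifier,
# after case folding, tokenizing, stop-word filtering, and stemming.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

STOP_WORDS = {"the", "is", "and", "a"}

def preprocess(text):
    tokens = text.lower().split()                        # case folding + tokenizing
    tokens = [t for t in tokens if t not in STOP_WORDS]  # filtering
    return [t.rstrip("s") for t in tokens]               # crude stand-in for stemming

lexicon = {"good": 1, "helpful": 1, "bad": -1, "slow": -1}

def lexicon_sentiment(tokens):
    """Sum lexicon polarities; the sign gives the predicted sentiment."""
    score = sum(lexicon.get(t, 0) for t in tokens)
    return 1 if score >= 0 else 0

comments = ["the vaccine news is good", "testing is slow and bad",
            "helpful update", "bad service"]
labels = [1, 0, 1, 0]

processed = [" ".join(preprocess(c)) for c in comments]
print([lexicon_sentiment(p.split()) for p in processed])  # lexicon predictions

X = CountVectorizer().fit_transform(processed)
svm = LinearSVC().fit(X, labels)
print(svm.predict(X))  # SVM predictions on the same comments
```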


2020 ◽  
Vol 132 (4) ◽  
pp. 738-749 ◽  
Author(s):  
Michael L. Burns ◽  
Michael R. Mathis ◽  
John Vandervest ◽  
Xinyu Tan ◽  
Bo Lu ◽  
...  

Abstract

Background: Accurate anesthesiology procedure code data are essential to quality improvement, research, and reimbursement tasks within anesthesiology practices. Advanced data science techniques, including machine learning and natural language processing, offer opportunities to develop classification tools for Current Procedural Terminology codes across anesthesia procedures.

Methods: Models were created using a Train/Test dataset including 1,164,343 procedures from 16 academic and private hospitals. Five supervised machine learning models were created to classify anesthesiology Current Procedural Terminology codes, with accuracy defined as first-choice classification matching the institution-assigned code existing in the perioperative database. The two best-performing models were further refined and tested on a Holdout dataset from a single institution distinct from Train/Test. A tunable confidence parameter was created to identify cases for which the models were highly accurate, with a goal of at least 95% accuracy, above the reported 2018 Centers for Medicare and Medicaid Services (Baltimore, Maryland) fee-for-service accuracy. Actual submitted claim data from billing specialists were used as the reference standard.

Results: The support vector machine and the neural network label-embedding attentive model were the best-performing models, demonstrating overall accuracies of 87.9% and 84.2% (single best code), and 96.8% and 94.0% (within top three), respectively. Classification accuracy was 96.4% in 47.0% of cases using the support vector machine and 94.4% in 62.2% of cases using the label-embedding attentive model within the Train/Test dataset. In the Holdout dataset, the respective classification accuracies were 93.1% in 58.0% of cases and 95.0% in 62.0%. The most important feature in model training was procedure text.

Conclusions: Through application of machine learning and natural language processing techniques, highly accurate real-time models were created for anesthesiology Current Procedural Terminology code classification. The increased processing speed and a priori targeted accuracy of this classification approach may provide performance optimization and cost reduction for quality improvement, research, and reimbursement tasks reliant on anesthesiology procedure codes.
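A hedged sketch of the "tunable confidence parameter" idea: accept the model's top code prediction only when its decision margin clears a threshold, routing low-confidence cases to human coders. The procedure texts, CPT codes, threshold value, and margin heuristic are all illustrative, not the paper's actual method.

```python
# Sketch: SVM CPT-code classification with a tunable confidence gate
# based on the gap between the top two decision-function margins.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

procedure_text = ["laparoscopic cholecystectomy general anesthesia",
                  "total knee arthroplasty spinal anesthesia",
                  "cataract extraction monitored anesthesia care",
                  "open cholecystectomy general anesthesia",
                  "knee replacement revision spinal anesthesia",
                  "cataract surgery topical anesthesia"]
cpt_codes = ["00790", "01402", "00142", "00790", "01402", "00142"]  # illustrative

X = TfidfVectorizer().fit_transform(procedure_text)
clf = LinearSVC().fit(X, cpt_codes)

margins = clf.decision_function(X)  # shape: (n_samples, n_classes)
order = np.argsort(margins, axis=1)
rows = np.arange(len(procedure_text))
gap = margins[rows, order[:, -1]] - margins[rows, order[:, -2]]

THRESHOLD = 0.2  # the tunable confidence parameter (arbitrary value here)
for text, pred, g in zip(procedure_text, clf.predict(X), gap):
    decision = pred if g >= THRESHOLD else "route to billing specialist"
    print(f"{text[:40]:40s} -> {decision}")
```

Raising the threshold trades coverage for accuracy, which matches the abstract's reported pattern of higher accuracy on a smaller, high-confidence fraction of cases.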


2013 ◽  
Vol 427-429 ◽  
pp. 2572-2575
Author(s):  
Xiao Hua Li ◽  
Shu Xian Liu

This article first provides a brief introduction to Natural Language Processing and to the basics of Machine Learning and the Support Vector Machine. It then gives a more detailed account of how SVM models are used in several major directions within NLP, and it closes with a brief summary of the application of SVMs in Natural Language Processing.
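The survey's theme in miniature: an SVM applied to a token-level NLP task. Below is a toy part-of-speech tagger with hand-crafted contextual features, purely a sketch of the general SVM-in-NLP pattern rather than any specific system from the article.

```python
# Sketch: a tiny SVM-based part-of-speech tagger using per-token
# dictionary features vectorized with DictVectorizer.
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

def token_features(sentence, i):
    """Simple contextual features for the i-th token."""
    return {"word": sentence[i].lower(),
            "suffix2": sentence[i][-2:],
            "prev": sentence[i - 1].lower() if i > 0 else "<s>",
            "is_title": sentence[i].istitle()}

train = [(["The", "dog", "barks"], ["DET", "NOUN", "VERB"]),
         (["A", "cat", "sleeps"], ["DET", "NOUN", "VERB"])]

X = [token_features(s, i) for s, _ in train for i in range(len(s))]
y = [t for _, tags in train for t in tags]

tagger = make_pipeline(DictVectorizer(), LinearSVC()).fit(X, y)
test = ["The", "bird", "sings"]
print(tagger.predict([token_features(test, i) for i in range(len(test))]))
```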


2018 ◽  
Vol 19 (1) ◽  
pp. 61-79
Author(s):  
Yu-Yun Chang ◽  
Shu-Kai Hsieh

Abstract In Generative Lexicon Theory (GLT) (Pustejovsky 1995), co-composition is one of the generative devices proposed to explain cases of verbal polysemous behavior where more than one function application is allowed. The English baking verbs were used as examples to illustrate how their arguments co-specify the verb through qualia unification. Some studies (Blutner 2002; Carston 2002; Falkum 2007) have argued that pragmatic information and world knowledge need to be considered as well. This study therefore examines whether GLT can be put into practice in a real-world Natural Language Processing (NLP) application using collocations. We conducted a fine-grained logical polysemy disambiguation task, taking the open-source Leiden Weibo Corpus as the resource and computing with a Support Vector Machine (SVM) classifier. Within the classifier, we took collocated verbs under GLT as the main features; in addition, measure words and syntactic patterns were extracted as additional features for comparison. Our study investigates the logical polysemy of the Chinese verb kao 'bake'. We find that GLT helps in identifying logically polysemous cases and that the additional features help the classifier achieve higher performance.
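A heavily hedged mock-up of the feature design described: collocates of the target verb kao 'bake', plus a measure-word feature, feeding an SVM that labels the sense. The romanized collocates, sense labels, and feature inventory are invented for illustration and do not reproduce the study's actual features or annotation.

```python
# Sketch: collocation-based features (object collocate + measure word)
# for SVM disambiguation of a logically polysemous verb.
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

# Each instance: features extracted around the verb kao in one sentence.
instances = [
    {"object": "mianbao",   "measure_word": "ge"},   # 'bread'         -> process reading
    {"object": "kaoxiang",  "measure_word": "tai"},  # 'oven'          -> artifact reading
    {"object": "dangao",    "measure_word": "ge"},   # 'cake'          -> process reading
    {"object": "mianbaoji", "measure_word": "tai"},  # 'bread machine' -> artifact reading
]
senses = ["process", "artifact", "process", "artifact"]

clf = make_pipeline(DictVectorizer(), SVC(kernel="linear")).fit(instances, senses)
print(clf.predict([{"object": "binggan", "measure_word": "ge"}]))  # 'biscuit'
```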


2020 ◽  
pp. 016555152093091
Author(s):  
Saeed-Ul Hassan ◽  
Aneela Saleem ◽  
Saira Hanif Soroya ◽  
Iqra Safder ◽  
Sehrish Iqbal ◽  
...  

The purpose of the study is to (a) contribute an annotated Altmetrics dataset spanning five disciplines, (b) undertake sentiment analysis using various machine learning and natural language processing–based algorithms, (c) identify the best-performing model and (d) provide a Python library for sentiment analysis of an Altmetrics dataset. First, the researchers gave a set of guidelines to two human annotators familiar with annotating tweets about scientific literature. They duly labelled the sentiments, achieving an inter-annotator agreement (IAA) of 0.80 (Cohen's Kappa). The same experiments were then run on two versions of the dataset: one with tweets in English and the other with tweets in 23 languages, including English. Using 6388 tweets about 300 papers indexed in Web of Science, the effectiveness of the employed machine learning and natural language processing models was measured against well-known sentiment analysis models, SentiStrength and Sentiment140, as baselines. Support Vector Machine with uni-grams outperformed all other classifiers and baseline methods employed, with an accuracy of over 85%, followed by Logistic Regression at 83% accuracy and Naïve Bayes at 80%. The precision, recall and F1 scores for Support Vector Machine, Logistic Regression and Naïve Bayes were (0.89, 0.86, 0.86), (0.86, 0.83, 0.80) and (0.85, 0.81, 0.76), respectively.
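A sketch of the three-way comparison this abstract reports: unigram count features feeding an SVM, logistic regression, and naive Bayes, each scored with precision, recall, and F1. The tweets and labels below are placeholders, not the annotated Altmetrics data.

```python
# Sketch: compare SVM, Logistic Regression, and Naive Bayes classifiers
# on unigram features, reporting macro-averaged P/R/F1.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import precision_recall_fscore_support

tweets = ["great paper, very useful results", "misleading study, weak methods",
          "interesting findings on climate", "poorly designed experiment"]
sentiment = ["pos", "neg", "pos", "neg"]

for name, model in [("SVM", LinearSVC()),
                    ("LogReg", LogisticRegression()),
                    ("NaiveBayes", MultinomialNB())]:
    clf = make_pipeline(CountVectorizer(ngram_range=(1, 1)),  # uni-grams only
                        model).fit(tweets, sentiment)
    p, r, f, _ = precision_recall_fscore_support(
        sentiment, clf.predict(tweets), average="macro", zero_division=0)
    print(f"{name}: P={p:.2f} R={r:.2f} F1={f:.2f}")
```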

