word representation
Recently Published Documents


TOTAL DOCUMENTS

206
(FIVE YEARS 113)

H-INDEX

15
(FIVE YEARS 3)

2022 ◽  
Vol 16 (4) ◽  
pp. 1-30
Author(s):  
Muhammad Abulaish ◽  
Mohd Fazil ◽  
Mohammed J. Zaki

Domain-specific keyword extraction is a vital task in the field of text mining. There are various research tasks, such as spam e-mail classification, abusive language detection, sentiment analysis, and emotion mining, in which a set of domain-specific keywords (aka a lexicon) is highly effective. Existing keyword-extraction methods list all keywords from a document corpus rather than only the domain-specific ones. Moreover, most existing approaches perform well on formal document corpora but fail on the noisy and informal user-generated content of online social media. In this article, we present a hybrid approach that jointly models the local and global contextual semantics of words, utilizing the strength of distributional word representations and a contrasting-domain corpus for domain-specific keyword extraction. Starting with a seed set of a few domain-specific keywords, we model the text corpus as a weighted word-graph. In this graph, the initial weight of a node (word) represents its semantic association with the target domain, calculated as a linear combination of three semantic association metrics, and the weight of an edge connecting a pair of nodes represents the co-occurrence count of the respective words. Thereafter, a modified PageRank method is applied to the word-graph to identify the most relevant words for expanding the initial set of domain-specific keywords. We evaluate our method on both formal and informal text corpora (comprising six datasets) and show that it performs significantly better than state-of-the-art methods. Furthermore, we generalize our approach to the language-agnostic case and show that it outperforms existing language-agnostic approaches.
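The expansion step described above can be sketched as a personalized PageRank over the co-occurrence word-graph. The toy vocabulary, co-occurrence counts, and seed weights below are illustrative assumptions, not data or code from the paper; the fixed `node_weight` vector stands in for the paper's linear combination of three semantic association metrics.

```python
import numpy as np

# Hypothetical toy graph: nodes are words, edge weights are co-occurrence counts.
words = ["spam", "offer", "free", "meeting", "agenda"]
cooccur = np.array([
    [0, 4, 3, 0, 0],
    [4, 0, 5, 1, 0],
    [3, 5, 0, 0, 1],
    [0, 1, 0, 0, 6],
    [0, 0, 1, 6, 0],
], dtype=float)

# Seed-based node weights: semantic association with the target domain
# (a stand-in for the paper's combined association metrics).
node_weight = np.array([0.9, 0.6, 0.5, 0.1, 0.1])

def personalized_pagerank(adj, personalization, damping=0.85, iters=100):
    """Modified PageRank: the teleport distribution follows the domain weights."""
    p = personalization / personalization.sum()
    # Column-stochastic transition matrix from co-occurrence weights.
    col_sums = adj.sum(axis=0)
    col_sums[col_sums == 0] = 1.0
    m = adj / col_sums
    r = np.full(len(p), 1.0 / len(p))
    for _ in range(iters):
        r = damping * (m @ r) + (1 - damping) * p
    return r

rank = personalized_pagerank(cooccur, node_weight)
# Words ranked by relevance to the seed domain; top entries expand the lexicon.
expanded = [words[i] for i in np.argsort(-rank)]
print(expanded)
```

With these toy weights the domain words ("spam", "offer", "free") outrank the off-domain pair, which is exactly the expansion behaviour the abstract describes.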


Author(s):  
Tomoya TACHIBANA ◽  
Koki SHODA ◽  
Aiza Syamimi Binti Abd Rani ◽  
Yutaka NOMAGUCHI ◽  
Kazuya OKAMOTO ◽  
...  

2021 ◽  
pp. 001698622110618
Author(s):  
Selcuk Acar ◽  
Kelly Berthiaume ◽  
Katalin Grajzel ◽  
Denis Dumas ◽  
Charles “Tedd” Flemister ◽  
...  

In this study, we applied different text-mining methods to the originality scoring of the Unusual Uses Test (UUT) and the Just Suppose Test (JST) from the Torrance Tests of Creative Thinking (TTCT)-Verbal. Responses from 102 and 123 participants who completed Form A and Form B, respectively, were scored using three different text-mining methods. The validity of these scoring methods was tested against the TTCT's manual-based scoring and a subjective snapshot scoring method. Results indicated that text-mining systems are applicable to both UUT and JST items across both forms, and that students' performance on those items can predict total originality and creativity scores across all six tasks in the TTCT-Verbal. Comparatively, the text-mining methods worked better for UUT than for JST. Of the three text-mining models we tested, the Global Vectors for Word Representation (GloVe) model produced the most reliable and valid scores. These findings indicate that creativity assessment can be done quickly and at lower cost using text-mining approaches.
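The core idea of embedding-based originality scoring can be illustrated with a minimal sketch. The vectors below are random stand-ins (real scoring would load pretrained GloVe embeddings), and `originality` is a hypothetical helper for illustration, not the study's scoring pipeline:

```python
import numpy as np

# Toy GloVe-style vectors; real use would load pretrained embeddings.
rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=8) for w in ["brick", "build", "wall", "paperweight", "art"]}

def embed(text):
    """Mean-pool the word vectors of a response."""
    vecs = [vocab[w] for w in text.split() if w in vocab]
    return np.mean(vecs, axis=0)

def originality(prompt_word, response):
    """1 - cosine similarity: semantically distant responses score higher."""
    a, b = vocab[prompt_word], embed(response)
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - cos

# A response that simply repeats the prompt word scores near zero.
print(originality("brick", "brick"))
```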


Algorithms ◽  
2021 ◽  
Vol 14 (12) ◽  
pp. 352
Author(s):  
Ke Zhao ◽  
Lan Huang ◽  
Rui Song ◽  
Qiang Shen ◽  
Hao Xu

Short text classification is an important problem of natural language processing (NLP), and graph neural networks (GNNs) have been successfully used to solve different NLP problems. However, few studies employ GNN for short text classification, and most of the existing graph-based models ignore sequential information (e.g., word orders) in each document. In this work, we propose an improved sequence-based feature propagation scheme, which fully uses word representation and document-level word interaction and overcomes the limitations of textual features in short texts. On this basis, we utilize this propagation scheme to construct a lightweight model, sequential GNN (SGNN), and its extended model, ESGNN. Specifically, we build individual graphs for each document in the short text corpus based on word co-occurrence and use a bidirectional long short-term memory network (Bi-LSTM) to extract the sequential features of each document; therefore, word nodes in the document graph retain contextual information. Furthermore, two different simplified graph convolutional networks (GCNs) are used to learn word representations based on their local structures. Finally, word nodes combined with sequential information and local information are incorporated as the document representation. Extensive experiments on seven benchmark datasets demonstrate the effectiveness of our method.
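The per-document graph construction and the simplified, parameter-free GCN propagation can be sketched as below. This is an assumed reading of the abstract, not the authors' code; the Bi-LSTM sequential-feature step is omitted for brevity, and the random features stand in for word vectors:

```python
import numpy as np

# Nodes = words of one short text; edges = co-occurrence in a sliding window.
def build_graph(tokens, window=2):
    n = len(tokens)
    adj = np.zeros((n, n))
    for i in range(n):
        for j in range(max(0, i - window + 1), min(n, i + window)):
            if i != j:
                adj[i, j] = 1.0
    return adj

def simplified_gcn(adj, features):
    """One parameter-free propagation step: D^-1/2 (A + I) D^-1/2 X."""
    a_hat = adj + np.eye(len(adj))
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))
    return d_inv_sqrt @ a_hat @ d_inv_sqrt @ features

tokens = ["graph", "neural", "networks", "classify", "text"]
x = np.random.default_rng(1).normal(size=(len(tokens), 4))  # stand-in word vectors
h = simplified_gcn(build_graph(tokens), x)
# Mean-pool word nodes into a document representation.
doc_vec = h.mean(axis=0)
```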


The goal of dependency parsing is to find functional relationships among words; for instance, it identifies the subject-object relation in a sentence. Parsing Indonesian requires information about a word's morphology: Indonesian grammar relies heavily on affixation, combining root words with affixes to form new words, so morphological information should be incorporated. Fortunately, it can be encoded implicitly by a word representation. Embeddings from Language Models (ELMo) is a word representation able to capture morphological information. Unlike the most widely used word representations, such as word2vec or Global Vectors (GloVe), ELMo applies a Convolutional Neural Network (CNN) over characters, so the affixation process can be encoded in the word representation. We performed an analysis using nearest-neighbor words and T-distributed Stochastic Neighbor Embedding (t-SNE) word visualizations to compare word2vec and ELMo. Our results showed that the ELMo representation encodes morphological information more richly than its counterpart. We trained our parser using both word2vec and ELMo. Unsurprisingly, the parser using ELMo achieves higher accuracy than the one using word2vec: an Unlabeled Attachment Score (UAS) of 83.08 for ELMo versus 81.35 for word2vec. Hence, we confirmed that morphological information is necessary, especially in a morphologically rich language like Indonesian. Keywords: ELMo, Dependency Parser, Natural Language Processing, word2vec
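The nearest-neighbor analysis can be sketched as follows. The random vectors stand in for trained word2vec or ELMo embeddings (which must be trained separately), and the Indonesian words are illustrative:

```python
import numpy as np

# Toy embedding table; real analysis would load trained word2vec/ELMo vectors.
rng = np.random.default_rng(2)
emb = {w: rng.normal(size=16) for w in ["makan", "memakan", "dimakan", "buku", "meja"]}

def nearest(word, k=2):
    """Return the k words whose vectors are closest by cosine similarity."""
    q = emb[word]
    def cos(v):
        return q @ v / (np.linalg.norm(q) * np.linalg.norm(v))
    ranked = sorted((w for w in emb if w != word), key=lambda w: -cos(emb[w]))
    return ranked[:k]

# For a morphology-aware representation such as ELMo, affixed forms of the
# same root ("memakan", "dimakan") should appear among the neighbors of "makan".
print(nearest("makan"))
```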


2021 ◽  
Vol 2021 ◽  
pp. 1-13
Author(s):  
Fazeel Abid ◽  
Ikram Ud Din ◽  
Ahmad Almogren ◽  
Hasan Ali Khattak ◽  
Mirza Waqar Baig

Deep learning-based methodologies are significant for performing sentiment analysis on social media data. The valuable insights obtained from social media data through sentiment analysis can be employed to develop intelligent applications. Among many networks, convolutional neural networks (CNNs) are widely used in conventional text classification tasks and perform a significant role. However, to capture long-term contextual information and address the detail-loss problem, CNNs require stacking multiple convolutional layers, which demands massive computation and the tuning of additional parameters. To solve these problems, in this paper, contextualized concatenated word representations (CCWRs) are initialized from text-based social media data, which is essential for handling misspelled and out-of-vocabulary (OOV) words. In CCWRs, different word representation models, for example, Word2Vec, its optimized version FastText, and Global Vectors (GloVe), collectively create contextualized representations of the input sequence. Second, a three-layered dilated convolutional neural network (3D-CNN) is proposed that places dilated convolution kernels instead of conventional CNN kernels. The resulting extension of the receptive field's size successfully solves the detail-loss problem and captures long-term context information with different dilation rates. Experiments on the datasets demonstrate that the proposed framework achieves reliable accuracy, and that careful selection of hyperparameters and configurations improves optimization while reducing the computational resources required.
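The receptive-field argument behind dilated convolutions can be seen in a minimal 1-D sketch (illustrative only, not the paper's 3D-CNN code): a kernel of size k with dilation d covers a span of d*(k-1)+1 positions, so stacking layers with growing dilation rates reaches long contexts without extra parameters.

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """Valid 1-D convolution where kernel taps are spaced `dilation` apart."""
    k = len(kernel)
    span = dilation * (k - 1) + 1
    out = []
    for i in range(len(x) - span + 1):
        out.append(sum(kernel[j] * x[i + j * dilation] for j in range(k)))
    return np.array(out)

x = np.arange(10, dtype=float)
# Same 3-tap kernel, growing dilation rate: the covered span widens from 3 to 5.
y1 = dilated_conv1d(x, [1.0, 1.0, 1.0], dilation=1)
y2 = dilated_conv1d(x, [1.0, 1.0, 1.0], dilation=2)
print(len(y1), len(y2))
```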


2021 ◽  
Vol 2082 (1) ◽  
pp. 012019
Author(s):  
Hongming Dai

Abstract: Parsing natural language into a corresponding programming language has attracted much attention in recent years. Natural Language to SQL (NL2SQL) appears widely in practical Internet applications. Previous solutions converted the input into a heterogeneous graph, which failed to learn good word representations for the question utterance. In this paper, we propose a relation-aware framework named LinGAN, which has powerful semantic parsing abilities and can jointly encode the question utterance and the syntax information of the object language. We also propose the pre-norm residual shrinkage unit to solve the deep-degradation problem of Linformer. Experiments show that LinGAN achieves excellent performance on multiple code generation tasks.
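A pre-norm residual block with a shrinkage (soft-thresholding) step, as the abstract's name suggests, might look like the following sketch. This is our reading of the term, not the authors' implementation; `shrinkage_block` and its threshold `tau` are hypothetical names:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (pre-norm step)."""
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def soft_threshold(x, tau):
    """Shrink small activations toward zero (the 'shrinkage' step)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def shrinkage_block(x, w, tau=0.1):
    """Pre-norm residual: normalize, transform, soft-threshold, add residual."""
    return x + soft_threshold(layer_norm(x) @ w, tau)

rng = np.random.default_rng(3)
x = rng.normal(size=(4, 8))
w = rng.normal(size=(8, 8)) * 0.1
out = shrinkage_block(x, w)
```

The residual connection keeps the identity path intact, which is the standard remedy for degradation in deep stacks; the thresholding additionally zeroes out weak, noise-like features.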


2021 ◽  
Vol 8 (5) ◽  
pp. 1067
Author(s):  
Yuliska Yuliska ◽  
Dini Hidayatul Qudsi ◽  
Juanda Hakim Lubis ◽  
Khairul Umum Syaliman ◽  
Nina Fadilah Najwa

Reviews or suggestions from customers can be very important to a service provider, as can students' suggestions about the services of a unit at a university; reviews can serve as performance indicators for the service provider, and processing them provides a reference for decision making and for improving services in the future. This study applies sentiment analysis to student reviews of the performance of departments at a college, Politeknik Caltex Riau. Sentiment analysis is performed using a Convolutional Neural Network (CNN) with Word2Vec word embeddings as the word representation. CNN is known for its good performance in text classification: its convolutional technique combines several word windows in a sentence and selects the most representative window. Word2Vec is used to represent the review data and as the initial input to the CNN; its dense vectors represent the relationships between words in the reviews well. Student suggestions can be very long sentences, so combining Word2Vec as the word representation with the CNN's convolutional technique yields a representative feature from such long sentences. This study uses two CNN architectures, Simple CNN and DoubleMax CNN, to identify the effect of architectural complexity on the sentiment classification results. Our experiments show that DoubleMax CNN classifies sentiment in the student reviews very well, achieving a best Accuracy of 98%, with Recall of 97%, Precision of 98%, and F1-Score of 98%.
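The window-convolution-plus-pooling idea described in the abstract above can be sketched as follows. The sizes and random values are toy stand-ins, and single max pooling is shown for illustration; the exact DoubleMax CNN architecture is not specified here:

```python
import numpy as np

def conv_windows(sent_emb, filt, window=3):
    """Slide a filter over 3-word windows, producing one score per position."""
    n, d = sent_emb.shape
    return np.array([np.sum(sent_emb[i:i + window] * filt)
                     for i in range(n - window + 1)])

rng = np.random.default_rng(4)
sent = rng.normal(size=(7, 5))        # 7 words, 5-dim Word2Vec-style vectors
filters = rng.normal(size=(4, 3, 5))  # 4 convolution filters over 3-word windows
feature_map = np.array([conv_windows(sent, f) for f in filters])
# Max over positions keeps the most representative window per filter.
sentence_vec = feature_map.max(axis=1)
```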

