Semantic Similarity Analysis for Corpus Development and Paraphrase Detection in Arabic

Paraphrase detection determines whether an original document and a suspect document convey the same meaning. It has attracted attention from researchers in many Natural Language Processing (NLP) tasks, such as plagiarism detection, question answering, and information retrieval. Traditional methods (e.g., Term Frequency-Inverse Document Frequency (TF-IDF), Latent Dirichlet Allocation (LDA), and Latent Semantic Analysis (LSA)) cannot efficiently capture hidden semantic relations when sentences share no common words or when words rarely co-occur. Therefore, we proposed a deep learning model based on Global Vectors (GloVe) word embeddings and a Recurrent Convolutional Neural Network (RCNN). It was efficient at capturing more contextual dependencies between word vectors with precise semantic meanings. Given the lack of publicly available resources for the Arabic language, we developed a paraphrased corpus automatically. It preserved the syntactic and semantic structures of Arabic sentences using a word2vec model and Part-Of-Speech (POS) annotation. Overall, experiments showed that our proposed model outperformed state-of-the-art methods in terms of precision and recall.
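
The limitation of term-matching methods described in this abstract can be illustrated with a minimal sketch. It is not the authors' model: it only computes TF-IDF cosine similarity on hypothetical English toy sentences (the function names and the smoothed IDF formula are assumptions made for illustration) and shows that a paraphrase pair with no shared words receives a score of zero.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumption) keeps every weight strictly positive.
        vectors.append({t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1)
                        for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy sentences, not taken from the paper's corpus.
a = "the physician examined the child".split()
b = "a doctor checked our kid".split()            # paraphrase of a, no shared words
c = "the physician examined the patient".split()  # shares most words with a

va, vb, vc = tfidf_vectors([a, b, c])
paraphrase_score = cosine(va, vb)
overlap_score = cosine(va, vc)
print(paraphrase_score, overlap_score)
```

Because the score depends only on shared surface terms, the paraphrase pair is rated as unrelated, which is the gap that embedding-based models such as the one described above aim to close.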

Part-of-Speech Tagging for Arabic Text using Particle Swarm Optimization and Genetic Algorithm

Recent Advances in Computer Science and Communications ◽

10.2174/2666255814666210114120558 ◽

2021 ◽

Vol 14 ◽

Author(s):

Ahmad T. Al-Taani ◽

Fadi A. ALkhazaaleh

Keyword(s):

Particle Swarm Optimization ◽

Language Processing ◽

Question Answering ◽

Particle Swarm ◽

Pso Algorithm ◽

Arabic Language ◽

Genetic Operators ◽

Swarm Optimization ◽

Pos Tagging ◽

Part Of Speech

Background: Part of Speech (POS) tagging is the process of assigning the suitable part of speech to each word in a given context, such as determining whether a word is a verb, a noun, or a particle. POS tagging is an important preprocessing step in many Natural Language Processing (NLP) applications such as question answering, text summarization, and information retrieval. Objective: The performance of NLP applications depends on the accuracy of POS taggers, since assigning the right tags to the words in a sentence enables the application to work properly after tagging. Many approaches have been proposed for the Arabic language, but more investigation is needed to improve the efficiency of Arabic POS taggers. Method: In this study, we propose a supervised POS tagging system for the Arabic language using Particle Swarm Optimization (PSO) and Genetic Algorithms (GA) as well as a Hidden Markov Model (HMM). The tagging process is treated as an optimization problem and modeled as a swarm consisting of a group of particles. Each particle represents a sequence of tags. The PSO algorithm is applied to find the best sequence of tags, which represents the correct tags of the sentence. The genetic operators, crossover and mutation, are used to find the personal best, global best, and velocity of the PSO algorithm. The HMM is used to compute the fitness of the particles in the swarm. Results: The performance of the proposed approach is evaluated on the KALIMAT dataset, which consists of 18 million words, with a tag set of 45 tags that covers all Arabic POS tags. The proposed tagger achieved an accuracy of 90.5%. Conclusion: Experimental results revealed that the proposed tagger achieved promising results compared to four existing approaches. Other approaches can identify only three tags: noun, verb, and particle. Also, the accuracy for some tags exceeds that achieved by other approaches.
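
The idea of scoring each particle (a candidate tag sequence) with an HMM can be sketched as follows. This is only an illustration under stated assumptions: the probabilities, the three-tag set, the English toy sentence, and the `fitness` function are hypothetical, and the sketch covers fitness evaluation and selection of the global best only, not the PSO or genetic update steps described in the abstract.

```python
import math

# Hypothetical toy HMM parameters (not estimated from KALIMAT).
START = {"NOUN": 0.5, "VERB": 0.4, "PART": 0.1}
TRANS = {
    "NOUN": {"NOUN": 0.3, "VERB": 0.4, "PART": 0.3},
    "VERB": {"NOUN": 0.6, "VERB": 0.1, "PART": 0.3},
    "PART": {"NOUN": 0.7, "VERB": 0.2, "PART": 0.1},
}
EMIT = {
    "NOUN": {"student": 0.5, "book": 0.5},
    "VERB": {"reads": 1.0},
    "PART": {"the": 1.0},
}
FLOOR = 1e-6  # small probability assigned to unseen word/tag pairs


def fitness(words, tags):
    """Log joint probability of the words and a tag sequence under the HMM."""
    score = math.log(START[tags[0]])
    for i, (word, tag) in enumerate(zip(words, tags)):
        if i > 0:
            score += math.log(TRANS[tags[i - 1]][tag])
        score += math.log(EMIT[tag].get(word, FLOOR))
    return score


words = ["the", "student", "reads", "the", "book"]
# Each particle is one candidate tag sequence for the sentence.
swarm = [
    ["PART", "NOUN", "VERB", "PART", "NOUN"],
    ["NOUN", "NOUN", "VERB", "PART", "NOUN"],
    ["PART", "VERB", "NOUN", "PART", "NOUN"],
]
global_best = max(swarm, key=lambda tags: fitness(words, tags))
print(global_best)
```

Working in log space avoids numerical underflow on long sentences; a full tagger would repeatedly update the particles and re-evaluate this fitness until the swarm converges.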

Designing a Chat-Bot for College Information using Information Retrieval and Automatic Text Summarization Techniques

Current Chinese Computer Science ◽

10.2174/2665997201999201022191540 ◽

2020 ◽

Vol 01 ◽

Author(s):

Radha Guha

Keyword(s):

Information Retrieval ◽

Language Processing ◽

Latent Dirichlet Allocation ◽

Semantic Analysis ◽

Text Summarization ◽

The Internet ◽

Specific Domain ◽

User Query ◽

College Information ◽

Chat Bot

Background: In the era of information overload, it is very difficult for a human reader to quickly make sense of the vast information available on the internet. Even for a specific domain such as a college or university website, it may be difficult for a user to browse through all the links to get relevant answers quickly. Objective: In this scenario, the design of a chat-bot that can answer questions related to college information and compare colleges will be very useful and novel. Methods: In this paper, a novel conversational-interface chat-bot application with information retrieval and text summarization skills is designed and implemented. First, the chat-bot has a simple dialog skill: when it can understand the intent of the user query, it responds from the stored collection of answers. Second, for unknown queries, the chat-bot can search the internet and then perform text summarization using advanced techniques of natural language processing (NLP) and text mining (TM). Results: Advances in the NLP capabilities of information retrieval and text summarization using the machine learning techniques of Latent Semantic Analysis (LSA, also called LSI), Latent Dirichlet Allocation (LDA), Word2Vec, Global Vectors (GloVe), and TextRank are first reviewed and compared in this paper before they are implemented in the chat-bot design. The chat-bot improves the user experience considerably by returning concise answers to specific queries, which takes less time than reading the entire document. Students, parents, and faculty can more efficiently get answers on a variety of information such as admission criteria, fees, course offerings, notice board, attendance, grades, placements, faculty profiles, research papers, and patents. Conclusion: The purpose of this paper was to follow the advancement in NLP technologies and implement them in a novel application.

BUILD KNOWLEDGE GRAPH FROM HETEROGENEOUS DOCUMENTS

Journal of Science and Technology - IUH ◽

10.46242/jst-iuh.v47i05.761 ◽

2021 ◽

Vol 47 (05) ◽

Author(s):

NGUYỄN CHÍ HIẾU

Keyword(s):

Information Retrieval ◽

Deep Learning ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Question Answering ◽

Semantic Analysis ◽

Knowledge Graph ◽

Question Answering Systems ◽

Knowledge Graphs

In recent years, knowledge graphs have been applied in many fields such as search engines, semantic analysis, and question answering. However, there are many obstacles to building knowledge graphs, including methodologies, data, and tools. This paper introduces a novel methodology for building a knowledge graph from heterogeneous documents. We use Natural Language Processing and deep learning methodologies to build this graph. The knowledge graph can be used in question answering systems and information retrieval, especially in the computing domain.

Events Automatic Extraction from Arabic Texts

Natural Language Processing ◽

10.4018/978-1-7998-0951-7.ch078 ◽

2020 ◽

pp. 1686-1704

Author(s):

Emna Hkiri ◽

Souheyl Mallat ◽

Mounir Zrigui

Keyword(s):

Information Retrieval ◽

Natural Language Processing ◽

Text Mining ◽

Machine Translation ◽

Language Processing ◽

Question Answering ◽

Arabic Language ◽

Event Extraction ◽

Mining Machine ◽

Open Domain

The event extraction task consists of determining and classifying events within an open-domain text. It is very new for the Arabic language, whereas it has reached maturity for some languages such as English and French. Event extraction has also been shown to help Natural Language Processing tasks such as information retrieval, question answering, text mining, and machine translation achieve higher performance. In this article, we present an ongoing effort to build a system for event extraction from Arabic texts using the GATE platform and other tools.

Open-Ended Questions

Employee Surveys and Sensing ◽

10.1093/oso/9780190939717.003.0013 ◽

2020 ◽

pp. 202-218

Author(s):

Subhadra Dutta ◽

Eric M. O’Rourke

Keyword(s):

Machine Learning ◽

Language Processing ◽

Latent Dirichlet Allocation ◽

Semantic Analysis ◽

Written Language ◽

Text Data ◽

Employee Survey ◽

Trade Offs ◽

Word Relatedness ◽

Survey Responses

Natural language processing (NLP) is the field of decoding human written language. This chapter responds to the growing interest in using machine learning–based NLP approaches for analyzing open-ended employee survey responses. These techniques address scalability and the ability to provide real-time insights to make qualitative data collection equally or more desirable in organizations. The chapter walks through the evolution of text analytics in industrial–organizational psychology and discusses relevant supervised and unsupervised machine learning NLP methods for survey text data, such as latent Dirichlet allocation, latent semantic analysis, sentiment analysis, word relatedness methods, and so on. The chapter also lays out preprocessing techniques and the trade-offs of growing NLP capabilities internally versus externally, points the readers to available resources, and ends with discussing implications and future directions of these approaches.

A Persian Medical Question Answering System

International Journal of Artificial Intelligence Tools ◽

10.1142/s0218213020500190 ◽

2020 ◽

Vol 29 (06) ◽

pp. 2050019

Author(s):

Hadi Veisi ◽

Hamed Fakour Shandi

Keyword(s):

Natural Language ◽

Language Processing ◽

Question Answering ◽

Document Retrieval ◽

Main Concept ◽

Question Answering System ◽

Detection Algorithms ◽

Persian Language ◽

Part Of Speech ◽

Answer Extraction

A question answering system is a type of information retrieval system that takes a question from a user in natural language as the input and returns the best answer to it as the output. In this paper, a medical question answering system in the Persian language is designed and implemented. During this research, a dataset of diseases and drugs is collected and structured. The proposed system includes three main modules: question processing, document retrieval, and answer extraction. For the question processing module, a sequential architecture is designed which retrieves the main concept of a question by using different components. In these components, rule-based methods, natural language processing, and dictionary-based techniques are used. In the document retrieval module, the documents are indexed and searched using the Lucene library. The retrieved documents are ranked using similarity detection algorithms, and the highest-ranked document is selected to be used by the answer extraction module. This module is responsible for extracting the most relevant section of the text in the retrieved document. During this research, different customized language processing tools, such as a part-of-speech tagger and a lemmatizer, are also developed for Persian. Evaluation results show that this system performs well for answering different questions about diseases and drugs. The accuracy of the system for 500 sample questions is 83.6%.

Thematic Context Derivator Algorithm for Enhanced Context Vector Machine: eCVM

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.b4564.129219 ◽

2019 ◽

Vol 9 (2) ◽

pp. 4872-4877

Keyword(s):

Language Processing ◽

Latent Semantic Analysis ◽

Latent Dirichlet Allocation ◽

Semantic Analysis ◽

Named Entities ◽

Pagerank Algorithm ◽

Context Vector ◽

Improved Performance ◽

Evaluation Parameters ◽

Thematic Context

Natural Language Processing uses word embeddings to map words into vectors. The context vector is one technique for mapping words into vectors. The context vector gives the importance of terms in the document corpus. Context vectors are derived using various methods such as neural networks, latent semantic analysis, and knowledge-base methods. This paper proposes a novel system to devise an enhanced context vector machine called eCVM. eCVM is able to determine the context phrases and their importance. eCVM uses latent semantic analysis, the existing context vector machine, dependency parsing, named entities, topics from latent Dirichlet allocation, and various forms of words such as nouns, adjectives, and verbs to build the context. eCVM uses the context vector and the PageRank algorithm to find the importance of a term in a document and is tested on the BBC news dataset. Results of eCVM are compared with the state of the art for context derivation. The proposed system shows improved performance over existing systems for standard evaluation parameters.
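
The use of PageRank to estimate term importance can be sketched on a small term graph. The abstract does not specify how eCVM builds its graph, so the sentence-level co-occurrence graph, the toy sentences, and the `pagerank` helper below are assumptions chosen for illustration rather than the paper's method.

```python
from itertools import combinations


def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over an undirected graph of adjacency sets."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {}
        for n in nodes:
            # In an undirected graph, the neighbours of n are the nodes linking to n.
            incoming = sum(rank[m] / len(graph[m]) for m in graph[n])
            new_rank[n] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank


# Hypothetical toy sentences, already tokenized and stripped of stop words.
sentences = [
    ["bank", "raises", "interest", "rates"],
    ["interest", "rates", "affect", "mortgages"],
    ["bank", "offers", "mortgages"],
]

# Link every pair of terms that co-occur in the same sentence.
graph = {}
for sent in sentences:
    for a, b in combinations(set(sent), 2):
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

scores = pagerank(graph)
for term in sorted(scores, key=scores.get, reverse=True):
    print(f"{term}: {scores[term]:.3f}")
```

Terms that co-occur with many other well-connected terms accumulate a higher score, which is one plausible reading of "importance of a term in a document".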

Development of Reading Comprehension System for Kannada Text Documents

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.f1008.0486s419 ◽

2019 ◽

Vol 8 (6S4) ◽

pp. 42-45

Keyword(s):

Reading Comprehension ◽

Natural Language ◽

Language Processing ◽

Primary Language ◽

Grammatical Structure ◽

Text Documents ◽

Inverse Document Frequency ◽

Document Frequency ◽

Proposed Model ◽

The Given

Reading Comprehension (RC) plays an important role in Natural Language Processing (NLP), as it involves reading and understanding text written in natural language. Reading Comprehension systems comprehend a given document and answer questions in the context of that document. This paper proposes a Reading Comprehension system for Kannada documents. The RC system analyses text in the Kannada script and allows users to pose questions to it in Kannada. This system is aimed at people whose primary language is Kannada, who would otherwise have difficulty parsing through vast Kannada documents for the information they require. This paper discusses the proposed model, built using Term Frequency-Inverse Document Frequency (TF-IDF), and its performance in extracting answers from the context document. The proposed model captures the grammatical structure of Kannada to provide the most accurate answers to the user.
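
A TF-IDF-based answer selector of the kind this abstract describes can be sketched as below. The sketch is an assumption-laden illustration, not the paper's model: it uses whitespace tokenization, English toy sentences in place of Kannada text, and a hypothetical `best_sentence` function that returns the context sentence whose question terms carry the most TF-IDF weight.

```python
import math
from collections import Counter


def best_sentence(question, sentences):
    """Return the context sentence with the highest TF-IDF overlap with the question."""
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    df = Counter(t for sent in tokenized for t in set(sent))
    q_terms = set(question.lower().split())

    def score(sent):
        tf = Counter(sent)
        # Sum the TF-IDF weights of the question terms found in this sentence.
        return sum((tf[t] / len(sent)) * math.log(n / df[t])
                   for t in q_terms if t in tf)

    scores = [score(sent) for sent in tokenized]
    return sentences[scores.index(max(scores))]


# Hypothetical toy context and question (English stand-ins for Kannada text).
context = [
    "Bengaluru is the capital of Karnataka",
    "Kannada is the official language of Karnataka",
    "The Kaveri river flows through Karnataka",
]
question = "what is the official language of Karnataka"
answer = best_sentence(question, context)
print(answer)
```

Terms that appear in every sentence receive an IDF of zero, so the selection is driven by the rarer question terms; capturing grammatical structure, as the abstract mentions, would require additional language-specific processing.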

Towards relation extraction from Arabic text: a review

International Robotics & Automation Journal ◽

10.15406/iratj.2019.05.00195 ◽

2019 ◽

Vol 5 (5) ◽

pp. 212-215

Author(s):

Abeer AlArfaj

Keyword(s):

Language Processing ◽

Question Answering ◽

State Of The Art ◽

Relation Extraction ◽

Arabic Language ◽

Extraction Methods ◽

The State ◽

Semantic Relations ◽

Arabic Text ◽

Taxonomic Relation

Semantic relation extraction is an important component of ontologies that can support many applications, e.g., text mining, question answering, and information extraction. However, extracting semantic relations between concepts is not trivial and is one of the main challenges in the field of Natural Language Processing (NLP). The Arabic language has complex morphological, grammatical, and semantic aspects, since it is a highly inflectional and derivational language, which makes the task even more challenging. In this paper, we present a review of the state of the art for relation extraction from texts, addressing the progress and difficulties in this field. We discuss several aspects related to this task, considering both taxonomic and non-taxonomic relation extraction methods. The majority of relation extraction approaches implement a combination of statistical and linguistic techniques to extract semantic relations from text. We also give special attention to the state of the work on relation extraction from Arabic texts, which needs further progress.

Supervised attention for answer selection in community question answering

IAES International Journal of Artificial Intelligence (IJ-AI) ◽

10.11591/ijai.v9.i2.pp203-211 ◽

2020 ◽

Vol 9 (2) ◽

pp. 203

Author(s):

Thanh Thi Ha ◽

Atsuhiro Takasu ◽

Thanh Chinh Nguyen ◽

Kiem Hieu Nguyen ◽

Van Nha Nguyen ◽

...

Keyword(s):

Language Processing ◽

Question Answering ◽

Irrelevant Information ◽

Social Question ◽

Community Question Answering ◽

Basic Model ◽

Proposed Model ◽

Questions And Answers ◽

Word Attention ◽

Better Than

Answer selection is an important task in Community Question Answering (CQA). In recent years, attention-based neural networks have been extensively studied in various natural language processing problems, including question answering. This paper explores matchLSTM for answer selection in CQA. The lexical gap in CQA is more challenging because questions and answers typically contain multiple sentences, irrelevant information, and noisy expressions. In our investigation, word-by-word attention in the original model does not work well on social question-answer pairs. We propose integrating supervised attention into matchLSTM. Specifically, we leverage lexical-semantic information from external sources to guide the learning of attention weights for question-answer pairs. The proposed model learns more meaningful attention, which allows it to perform better than the basic model. Our performance is among the top on SemEval datasets.
