Abstract
In this paper, we present an approach to improving the performance of open-domain Arabic Question Answering systems. We focus on the passage retrieval phase, which aims to retrieve the passages most relevant to the correct answer.
To extract passages related to the question, the system passes through three phases: Question Analysis, Document Retrieval, and Passage Retrieval. We define a passage as a sentence that ends with a period ("."). In the Question Analysis phase, we applied the traditional NLP steps of tokenization, removal of stopwords and irrelevant symbols, and stemming, in which the question words are replaced with their stems. We also applied Query Expansion by adding synonyms of the question words. In the Document Retrieval phase, we used the Vector Space Model (VSM) with a TF-IDF vectorizer and cosine similarity. In the Passage Retrieval phase, which is the core of our system, we measured the similarity between each passage and the question using a combination of the BM25 ranker and a Word Embedding approach.
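As an illustration of the Passage Retrieval phase, the following minimal Python sketch ranks passages by combining a BM25 score with a word-embedding similarity. It is not the system's actual implementation: the weighted sum with the parameter alpha, the normalization of BM25 scores by their maximum, the averaging of word vectors, the BM25 parameters k1 and b, and the toy English tokens and two-dimensional embeddings are all assumptions made for the example, since the combination scheme is not specified here.

```python
import math
from collections import Counter


def bm25_scores(query, passages, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized passage against a tokenized query."""
    n = len(passages)
    avg_len = sum(len(p) for p in passages) / n
    # Document frequency: number of passages containing each term.
    df = Counter(term for p in passages for term in set(p))
    scores = []
    for p in passages:
        tf = Counter(p)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(p) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def mean_vector(tokens, embeddings, dim):
    """Average the vectors of the tokens that have an embedding."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(a * a for a in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_passages(query, passages, embeddings, dim, alpha=0.5, top_k=3):
    """Return indices of the top_k passages by a weighted sum of both scores.

    The weighted sum is an assumed combination scheme, not the paper's.
    """
    bm25 = bm25_scores(query, passages)
    max_bm25 = max(bm25) or 1.0  # avoid division by zero when nothing matches
    q_vec = mean_vector(query, embeddings, dim)
    combined = []
    for i, p in enumerate(passages):
        emb_sim = cosine(q_vec, mean_vector(p, embeddings, dim))
        combined.append((alpha * bm25[i] / max_bm25 + (1 - alpha) * emb_sim, i))
    combined.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in combined[:top_k]]


# Toy data (hypothetical): already tokenized, stopwords kept for simplicity.
passages = [
    ["cairo", "is", "the", "capital", "of", "egypt"],
    ["the", "nile", "is", "a", "river"],
    ["paris", "is", "the", "capital", "of", "france"],
]
embeddings = {
    "capital": [1.0, 0.0], "egypt": [0.9, 0.1], "cairo": [0.9, 0.2],
    "france": [0.5, 0.5], "paris": [0.6, 0.4],
    "nile": [0.1, 0.9], "river": [0.0, 1.0],
}
query = ["capital", "egypt"]
ranked = rank_passages(query, passages, embeddings, dim=2)
print(ranked)
```

In a sketch like this, alpha controls how much weight the lexical match (BM25) receives relative to the semantic match (embeddings); a real system would use pretrained Arabic word vectors and the preprocessed, expanded query terms.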
We tested our system on the ACRD dataset, which contains 1395 questions from different domains. The system achieved a precision of 92.2% and a recall of 79.9% in finding the top-3 related passages for a query.