Hybrid Segmentation Prototype for Arabic Text-Based Documents

2015 ◽

Vol 6 (1) ◽

pp. 63-74 ◽

Cited By ~ 3

Author(s):

Sonia Alouane-Ksouri ◽

Minyar Sassi Hidri

Keyword(s):

Morphological Analysis ◽

Matrix Representation ◽

Document Analysis ◽

Arabic Text ◽

Computation Model ◽

Document Similarity ◽

Similarity Computation ◽

Hybrid Segmentation ◽

Abstraction Levels ◽

Processing Steps

This work contributes to the field of Arabic text-based document analysis for plagiarism detection. The analysis follows the triadic computation model of document similarity. The authors propose a hybrid segmentation prototype for Arabic text-based documents that links different processing steps in order to generate the similarity rate between the documents of an Arabic corpus. It combines two segmentation systems and a morphological analysis to obtain a matrix representation suited to triadic similarity computation at three abstraction levels: documents, sentences, and words.
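
The three abstraction levels mentioned above can be illustrated with a minimal Python sketch that builds a documents-by-sentences matrix and a sentences-by-words matrix. This is not the authors' prototype: the regex sentence splitter and whitespace tokenizer are simple stand-ins for the two segmentation systems and the morphological analysis, and the function names `split_sentences` and `build_matrices` are hypothetical.

```python
import re


def split_sentences(text):
    # Naive splitter on ., !, ? and the Arabic question mark (U+061F).
    # Assumption: a stand-in for a real sentence segmentation system.
    parts = re.split(r"[.!?\u061F]+", text)
    return [p.strip() for p in parts if p.strip()]


def build_matrices(documents):
    """Return (doc_sent, sent_word, vocab) for a list of raw document strings."""
    sentences = []        # flat list of all sentences in the corpus
    doc_of_sentence = []  # index of the document that owns each sentence
    for d, text in enumerate(documents):
        for s in split_sentences(text):
            sentences.append(s)
            doc_of_sentence.append(d)

    # Whitespace tokens stand in for morphologically analysed words.
    vocab = sorted({w for s in sentences for w in s.split()})
    column = {w: j for j, w in enumerate(vocab)}

    # documents x sentences: 1 where the sentence belongs to the document.
    doc_sent = [[0] * len(sentences) for _ in documents]
    for i, d in enumerate(doc_of_sentence):
        doc_sent[d][i] = 1

    # sentences x words: number of occurrences of each word in each sentence.
    sent_word = [[0] * len(vocab) for _ in sentences]
    for i, s in enumerate(sentences):
        for w in s.split():
            sent_word[i][column[w]] += 1

    return doc_sent, sent_word, vocab
```

A similarity computation would then operate on these two matrices; how the triadic model combines them is not specified in the abstract and is left out here.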


A Document Similarity Computation Method Based on Word Embedding and Citation Analysis

Advances in Intelligent Systems and Computing - Recent Findings in Intelligent Computing Techniques ◽

10.1007/978-981-10-8633-5_17 ◽

2018 ◽

pp. 161-168

Author(s):

K. Lamiya ◽

Anuraj Mohan

Keyword(s):

Citation Analysis ◽

Word Embedding ◽

Computation Method ◽

Document Similarity ◽

Similarity Computation


A Chinese short text semantic similarity computation model based on stop words and TongyiciCilin

2017 6th International Conference on Computer Science and Network Technology (ICCSNT) ◽

10.1109/iccsnt.2017.8343708 ◽

2017 ◽

Author(s):

Tang Shancheng ◽

Bai Yunyue ◽

Ma Fuyu

Keyword(s):

Semantic Similarity ◽

Computation Model ◽

Short Text ◽

Model Based ◽

Similarity Computation


Computation Model of Case Similarity Based on Uneven Weight Distance Coefficient

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.44-47.3965 ◽

2010 ◽

Vol 44-47 ◽

pp. 3965-3969

Author(s):

Shou Ming Hou ◽

Li Juan He ◽

Wen Peng Xu ◽

Hua Tao Fan ◽

Zhong Qi Sheng

Keyword(s):

Conceptual Design ◽

Rapid Response ◽

Computational Method ◽

Structure Model ◽

Computation Model ◽

Attribute Weight ◽

Retrieval Problem ◽

Similarity Computation ◽

Rolling Guide ◽

Distance Coefficient

To solve the problem of retrieving design examples in rapid response design, a similarity computation model among design examples is given. The attribute weights of a design example are computed with the uneven weight distance coefficient method, and the similarity between the product structure model that meets customer requirements and the existing design examples is computed with the combination weight method. The method has been applied to rolling guide conceptual design based on example inference. The results indicate that the proposed weight computation method is simple and reliable, and that it can solve the intelligent retrieval problem of design examples in rapid response design.
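
As a rough illustration of weight-based example retrieval, the Python sketch below scores stored design examples against a requirement vector with a generic weighted-distance similarity. It is not the paper's uneven weight distance coefficient or combination weight method, whose formulas the abstract does not give. It assumes attributes are already normalized to [0, 1] and that the weights are supplied by the caller; `weighted_similarity` and `retrieve` are hypothetical names.

```python
def weighted_similarity(query, example, weights):
    """Similarity in [0, 1]: one minus the weight-normalized absolute distance.

    Assumption: every attribute value lies in [0, 1] and weights are positive.
    """
    if not (len(query) == len(example) == len(weights)):
        raise ValueError("query, example and weights must have equal length")
    total = sum(weights)
    distance = sum(w * abs(q - e) for q, e, w in zip(query, example, weights))
    return 1.0 - distance / total


def retrieve(query, examples, weights):
    """Return the name of the stored example most similar to the query."""
    return max(
        examples,
        key=lambda name: weighted_similarity(query, examples[name], weights),
    )
```

For instance, with weights `[1, 3]` the second attribute counts three times as much as the first, so an example that matches on the second attribute is preferred over one that matches only on the first.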


Sentence Similarity Calculation Based on Probabilistic Tolerance Rough Sets

Mathematical Problems in Engineering ◽

10.1155/2021/1635708 ◽

2021 ◽

Vol 2021 ◽

pp. 1-9

Author(s):

Ruiteng Yan ◽

Dong Qiu ◽

Haihuan Jiang

Keyword(s):

Language Processing ◽

Rough Set ◽

Time Complexity ◽

Computation Model ◽

Text Data ◽

Efficient Performance ◽

Sentence Similarity ◽

Similarity Calculation ◽

Tolerance Rough Set ◽

Similarity Computation

Sentence similarity calculation is one of the important foundations of natural language processing. Existing sentence similarity measures are based either on shallow semantics, which capture latent semantic information inadequately, or on deep learning algorithms, which are limited by the need for supervision. In this paper, we improve the traditional tolerance rough set model so that it has lower time complexity and is incremental compared to the traditional one. We then propose a sentence similarity computation model, based on the probabilistic tolerance rough set model, from the perspective of the uncertainty of text data. It can mine latent semantic information and is unsupervised. Experiments calculating sentence similarity on the SICK2014 task and the STSbenchmark dataset show significant and efficient performance of our model.
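
For orientation, the traditional tolerance rough set model that the paper starts from can be sketched in Python as follows. This shows only the classical, non-probabilistic variant: the tolerance class of a term is the term itself plus every term that co-occurs with it in at least `theta` sentences, and a sentence's upper approximation is the set of terms whose tolerance class overlaps the sentence. The Jaccard comparison of upper approximations and all function names are assumptions for illustration, not the authors' model.

```python
from collections import Counter
from itertools import combinations


def tolerance_classes(sentences, theta):
    """Map each term to its tolerance class under co-occurrence threshold theta."""
    cooccurrence = Counter()
    vocab = set()
    for sentence in sentences:
        terms = set(sentence)
        vocab |= terms
        # Count each unordered pair once per sentence.
        for a, b in combinations(sorted(terms), 2):
            cooccurrence[(a, b)] += 1

    classes = {t: {t} for t in vocab}  # every term tolerates itself
    for (a, b), count in cooccurrence.items():
        if count >= theta:
            classes[a].add(b)
            classes[b].add(a)
    return classes


def upper_approximation(terms, classes):
    """Terms whose tolerance class shares at least one term with the sentence."""
    terms = set(terms)
    return {t for t, cls in classes.items() if cls & terms}


def similarity(s1, s2, classes):
    """Jaccard overlap of the two upper approximations (an assumed measure)."""
    u1 = upper_approximation(s1, classes)
    u2 = upper_approximation(s2, classes)
    union = u1 | u2
    return len(u1 & u2) / len(union) if union else 0.0
```

Because upper approximations add related terms, two sentences with no word in common can still receive a nonzero score. Terms never seen when building the classes are ignored in this sketch.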


The Morphological Analysis of Zulu Clan Names

English Linguistics Research ◽

10.5430/elr.v9n3p36 ◽

2020 ◽

Vol 9 (3) ◽

pp. 36

Author(s):

Celani Lucky Zwane

Keyword(s):

Morphological Analysis ◽

Morphological Structure ◽

Document Analysis ◽

Famous People ◽

The People ◽

Research Findings

The focus of this paper is that some scholars and people are not aware of the morphological structure of Zulu clan names. The clan names themselves encode hidden information that may be a story or a history, perhaps a very long story, about the people of that clan, who could be kings, famous people, or a whole family. The main aim of the paper is to make people aware of the morphological structure of Zulu clan names. Research findings indicate that there is morphological structure in Zulu clan names that most scholars and Zulu people are not aware of. This study found that the structure of a clan name and its meaning are related. An example of such a clan name is Hlabangane (slaughter four); [Hlaba (slaughter) + nga (per) + -ne (four)], which indicates that the clan name giver saw people of this clan slaughtering four cows when they had traditional ceremonies. However, through the use of this clan name, the clan name giver appears as a person who experienced or observed Hlabangane people repeating the same procedure several times, and no one disagreed with him because it was a fact. The researcher used document analysis and in-depth personal interviews to gather data for this paper.


Neural Arabic Text Diacritization: State-of-the-Art Results and a Novel Approach for Arabic NLP Downstream Tasks

ACM Transactions on Asian and Low-Resource Language Information Processing ◽

10.1145/3470849 ◽

2022 ◽

Vol 21 (1) ◽

pp. 1-25

Author(s):

Ali Fadel ◽

Ibraheem Tuffaha ◽

Mahmoud Al-Ayyoub

Keyword(s):

Neural Network ◽

State Of The Art ◽

Conditional Random Field ◽

Arabic Text ◽

Learning Models ◽

Post Processing ◽

Feed Forward Neural Network ◽

Novel Approach ◽

Normalized Gradient ◽

Processing Steps

In this work, we present several deep learning models for the automatic diacritization of Arabic text. Our models are built using two main approaches, viz. Feed-Forward Neural Network (FFNN) and Recurrent Neural Network (RNN), with several enhancements such as 100-hot encoding, embeddings, Conditional Random Field (CRF), and Block-Normalized Gradient (BNG). The models are tested on the only freely available benchmark dataset, and the results show that our models are either better than or on par with other models, even those that, unlike ours, require human-crafted language-dependent post-processing steps. Moreover, we show how diacritics in Arabic can be used to enhance the models of downstream NLP tasks such as Machine Translation (MT) and Sentiment Analysis (SA) by proposing novel Translation over Diacritization (ToD) and Sentiment over Diacritization (SoD) approaches.
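
One common way to prepare data for diacritization is to treat it as per-character labeling: each base character is paired with the diacritics that follow it. The Python sketch below illustrates that preprocessing step only. The abstract does not describe the authors' data pipeline, so this framing, the restriction to the eight marks U+064B through U+0652, and the name `split_diacritics` are assumptions.

```python
# Arabic diacritics from fathatan (U+064B) through sukun (U+0652).
DIACRITICS = {chr(code) for code in range(0x064B, 0x0653)}


def split_diacritics(text):
    """Return a list of (base_character, diacritics) pairs.

    A diacritic with no preceding base character is dropped.
    """
    pairs = []
    for ch in text:
        if ch in DIACRITICS:
            if pairs:
                base, marks = pairs[-1]
                pairs[-1] = (base, marks + ch)  # attach to the previous base
        else:
            pairs.append((ch, ""))
    return pairs
```

Joining the first elements of the pairs gives the undiacritized input, and the second elements give the labels a model would be trained to predict.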


A New Method for Arabic Text Detection in Natural Scene Images

International Journal of Image and Graphics ◽

10.1142/s0219467823500109 ◽

2021 ◽

Author(s):

Houda Gaddour ◽

Slim Kanoun ◽

Nicole Vincent

Keyword(s):

Color Stability ◽

Spatial Relations ◽

Document Analysis ◽

Text Detection ◽

New Method ◽

Arabic Text ◽

Natural Scene ◽

Script Identification ◽

Homogeneous Regions ◽

Natural Scene Images

Text in scene images can provide useful and vital information for content-based image analysis. Therefore, text detection and script identification in images are important tasks. In this paper, we propose a new method for text detection in natural scene images, particularly for Arabic text, based on a bottom-up approach in which four principal steps can be highlighted. First, extremely stable and homogeneous regions of interest (ROIs) are detected using the proposed Color Stability and Homogeneity Regions (CSHR) technique. These regions are then labeled as textual or non-textual ROIs, an identification based on a structural approach. Next, the textual ROIs are grouped into zones according to the spatial relations between them. Finally, the textual or non-textual nature of the constituted zones is refined. This last identification is based on handcrafted features and on features built from a Convolutional Neural Network (CNN) after learning. The proposed method was evaluated on databases used for text detection in natural scene images: the competition organized in the 2017 edition of the International Conference on Document Analysis and Recognition (ICDAR2017), the Urdu-text database, and our Natural Scene Image Database for Arabic Text detection (NSIDAT). The obtained experimental results seem to be interesting.
