Hybrid Segmentation Prototype for Arabic Text-Based Documents

Author(s):  
Sonia Alouane-Ksouri ◽  
Minyar Sassi Hidri

The contribution of this work relates to the field of Arabic text-based document analysis for the detection of plagiarism. This analysis will be carried out according to the triadic computation model of document similarity. The authors propose a hybrid segmentation prototype for Arabic text-based documents that links different processing steps in order to generate the similarity rate between the documents of an Arabic corpus. It involves two segmentation systems and a morphological analysis in order to obtain a matrix representation adapted to the triadic similarity computation according to three abstraction levels: documents, sentences and words.
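The matrix representation over three abstraction levels can be illustrated with a minimal sketch. This is not the authors' prototype: the punctuation-based sentence splitter and the whitespace word tokenizer are simple stand-ins for the two segmentation systems and the morphological analysis, and the names `split_sentences` and `build_matrices` are hypothetical.

```python
import re

def split_sentences(text):
    """Naive splitter on Latin and Arabic sentence-final punctuation (assumption)."""
    parts = re.split(r"[.!?\u061F]+", text)  # \u061F is the Arabic question mark
    return [p.strip() for p in parts if p.strip()]

def build_matrices(docs):
    """Return sentences, vocabulary, a document x sentence incidence matrix,
    and a sentence x word count matrix for a list of raw document strings."""
    sentences, owner = [], []
    for d, text in enumerate(docs):
        for s in split_sentences(text):
            sentences.append(s)
            owner.append(d)  # index of the document that contains sentence s

    # Whitespace tokens stand in for morphologically analysed words.
    vocab = sorted({w for s in sentences for w in s.split()})
    column = {w: j for j, w in enumerate(vocab)}

    doc_sent = [[1 if owner[j] == d else 0 for j in range(len(sentences))]
                for d in range(len(docs))]
    sent_word = [[0] * len(vocab) for _ in sentences]
    for i, s in enumerate(sentences):
        for w in s.split():
            sent_word[i][column[w]] += 1
    return sentences, vocab, doc_sent, sent_word
```

Two sentences that share all their words get identical rows in the sentence x word matrix, which is the kind of signal a similarity computation across documents, sentences and words can build on.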


2010 ◽  
Vol 44-47 ◽  
pp. 3965-3969
Author(s):  
Shou Ming Hou ◽  
Li Juan He ◽  
Wen Peng Xu ◽  
Hua Tao Fan ◽  
Zhong Qi Sheng

To solve the problem of retrieving design examples in rapid response design, a similarity computation model for design examples is given. Attribute weights of design examples are computed with the uneven weight distance coefficient method, and the similarity between the product structure model that meets customer requirements and the existing design examples is computed with the combination weight method. The method has been applied to the conceptual design of rolling guides based on example inference. The results indicate that the proposed weight computation method is simple and reliable, and that it can solve the intelligent retrieval problem of design examples in rapid response design.
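As a rough illustration of weighted retrieval of design examples, the sketch below uses a generic weighted attribute similarity of the kind common in case-based reasoning. It does not reproduce the uneven weight distance coefficient or combination weight methods named in the abstract: the weights and attribute ranges are supplied by hand, and the function names are hypothetical.

```python
def weighted_similarity(query, case, weights, ranges):
    """Weighted mean of per-attribute similarities in [0, 1].

    Each local similarity is 1 - |q - c| / range, clipped at 0 (assumption:
    all attributes are numeric and every range is positive).
    """
    score = 0.0
    for q, c, w, r in zip(query, case, weights, ranges):
        local = 1.0 - min(abs(q - c) / r, 1.0)
        score += w * local
    return score / sum(weights)

def retrieve(query, cases, weights, ranges):
    """Return the name of the stored design example most similar to the query."""
    return max(cases,
               key=lambda name: weighted_similarity(query, cases[name], weights, ranges))
```

For example, with `cases = {"A": [100, 2.0], "B": [150, 4.0]}`, weights `[0.6, 0.4]` and ranges `[100, 4]`, the query `[100, 2.0]` retrieves `"A"`.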


2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Ruiteng Yan ◽  
Dong Qiu ◽  
Haihuan Jiang

Sentence similarity calculation is one of the important foundations of natural language processing. Existing sentence similarity measures are based either on shallow semantics, which capture latent semantic information inadequately, or on deep learning algorithms, which require supervision. In this paper, we improve the traditional tolerance rough set model so that it has lower time complexity and can be updated incrementally. We then propose a sentence similarity computation model, based on the probabilistic tolerance rough set model, that treats the problem from the perspective of the uncertainty of text data. The model can mine latent semantic information and is unsupervised. Experiments on the SICK2014 task and the STSbenchmark dataset show significant and efficient performance for our model.
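The idea behind a tolerance rough set model can be sketched as follows. This is the classical, non-probabilistic formulation, not the improved model proposed in the paper: two terms are tolerant of each other when they co-occur in at least `theta` sentences, a sentence is enriched to its upper approximation, and similarity is taken here as the Jaccard overlap of the enriched sets. The Jaccard choice and all function names are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

def tolerance_classes(sentences, theta):
    """Map each term to the set of terms co-occurring with it in >= theta sentences."""
    cooc = Counter()
    vocab = set()
    for s in sentences:
        terms = set(s.split())
        vocab |= terms
        for a, b in combinations(sorted(terms), 2):
            cooc[(a, b)] += 1
    classes = {t: {t} for t in vocab}  # every term is tolerant of itself
    for (a, b), n in cooc.items():
        if n >= theta:
            classes[a].add(b)
            classes[b].add(a)
    return classes

def upper_approximation(sentence, classes):
    """All known terms whose tolerance class intersects the sentence's terms."""
    terms = set(sentence.split())
    return {t for t, cls in classes.items() if cls & terms}

def similarity(s1, s2, classes):
    """Jaccard overlap of the two upper approximations."""
    u1 = upper_approximation(s1, classes)
    u2 = upper_approximation(s2, classes)
    union = u1 | u2
    return len(u1 & u2) / len(union) if union else 0.0
```

Because the upper approximation adds related terms, two sentences with no words in common can still receive a non-zero score, which is how such a model surfaces latent relations without supervision.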


2020 ◽  
Vol 9 (3) ◽  
pp. 36
Author(s):  
Celani Lucky Zwane

This paper addresses the fact that some scholars and other people are not aware of the morphological structure of Zulu clan names. Clan names encode hidden information: a story, a history, perhaps a very long account of the people of that clan, who may be kings, famous people or a whole family. The main aim of the paper is to make people aware of the morphological structure of Zulu clan names. Research findings indicate that Zulu clan names have a morphological structure that most scholars and Zulu people are not aware of. This study found that the structure of a clan name and its meaning are related. An example of such a clan name is Hlabangane (slaughter four); [Hlaba (slaughter) + nga (per) + -ne (four)], which indicates that the clan name giver saw people of this clan slaughtering four cows when they held traditional ceremonies. Through the use of this clan name, the clan name giver appears as a person who experienced or observed Hlabangane people repeating the same procedure several times, and no one disagreed with him because it was a fact. The researcher has used document analysis and in-depth personal interviews to gather data for this paper.


Author(s):  
Ali Fadel ◽  
Ibraheem Tuffaha ◽  
Mahmoud Al-Ayyoub

In this work, we present several deep learning models for the automatic diacritization of Arabic text. Our models are built using two main approaches, viz. Feed-Forward Neural Network (FFNN) and Recurrent Neural Network (RNN), with several enhancements such as 100-hot encoding, embeddings, Conditional Random Field (CRF), and Block-Normalized Gradient (BNG). The models are tested on the only freely available benchmark dataset, and the results show that our models are either better than or on par with other models, even those that, unlike ours, require human-crafted language-dependent post-processing steps. Moreover, we show how diacritics in Arabic can be used to enhance the models of downstream NLP tasks such as Machine Translation (MT) and Sentiment Analysis (SA) by proposing novel Translation over Diacritization (ToD) and Sentiment over Diacritization (SoD) approaches.
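Diacritization is commonly framed as predicting, for each base character, the diacritic marks that follow it. The sketch below shows one way such (character, label) pairs could be derived from diacritized text. It is an illustrative assumption, not the authors' preprocessing: it only handles the eight marks in the Unicode range U+064B to U+0652, and the names `DIACRITICS` and `to_pairs` are hypothetical.

```python
# Fathatan through sukun (U+064B..U+0652); other marks are ignored in this sketch.
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}

def to_pairs(text):
    """Split diacritized text into (base character, diacritic string) pairs."""
    pairs = []
    for ch in text:
        if ch in DIACRITICS:
            if pairs:  # attach the mark to the preceding base character
                base, marks = pairs[-1]
                pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs
```

Joining the base characters gives the undiacritized model input, and the second element of each pair gives the per-character target label.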


Author(s):  
Houda Gaddour ◽  
Slim Kanoun ◽  
Nicole Vincent

Text in scene images can provide useful and vital information for content-based image analysis. Therefore, text detection and script identification in images are important tasks. In this paper, we propose a new method for text detection in natural scene images, particularly for Arabic text, based on a bottom-up approach in which four principal steps can be highlighted. First, extremely stable and homogeneous regions of interest (ROIs) are detected with the proposed Color Stability and Homogeneity Regions (CSHR) technique. These regions are then labeled as textual or non-textual ROIs using a structural approach. Next, the textual ROIs are grouped into zones according to the spatial relations between them. Finally, the textual or non-textual nature of the resulting zones is refined, using handcrafted features together with features learned by a Convolutional Neural Network (CNN). The proposed method was evaluated on databases used for text detection in natural scene images: those of the competitions organized in the 2017 edition of the International Conference on Document Analysis and Recognition (ICDAR2017), the Urdu-text database and our Natural Scene Image Database for Arabic Text detection (NSIDAT). The obtained experimental results seem to be interesting.

