Supervised Learning to Measure the Semantic Similarity Between Arabic Sentences

Author(s):  
Wafa Wali ◽  
Bilel Gargouri ◽  
Abdelmajid Ben hamadou
2014 ◽  
Vol 6 (2) ◽  
pp. 46-51
Author(s):  
Galang Amanda Dwi P. ◽  
Gregorius Edwadr ◽  
Agus Zainal Arifin

Nowadays, a large amount of information cannot be reached by readers because text-based documents are misclassified. Misclassified data can also lead readers to the wrong information. The method proposed in this paper aims to classify documents into the correct group. Each document has a membership value in several different classes. Semantic similarity is used to measure the degree of similarity between two documents. In fact, no document is entirely unrelated to another, although the relationship may be close to 0. The method calculates the similarity between two documents by taking into account the similarity of their words and of those words' synonyms. Once all inter-document similarity values have been obtained, they are assembled into a matrix, which is then used as a semi-supervised factor. The output of the method is the membership value of each document in each class; the class with the greatest membership value indicates where the document is grouped. The classification result computed by the method is good, at 90%. Index Terms - Fuzzy co-clustering, Heuristic, Semantic Similarity, Semi-supervised learning.
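The inter-document similarity matrix described in this abstract can be illustrated with a minimal sketch. The abstract does not give the similarity formula, so the Jaccard overlap of synonym-expanded word sets used here is an assumption, and the `SYNONYMS` table and all function names are hypothetical.

```python
from itertools import combinations

# Hypothetical toy synonym table (an assumption for illustration);
# a real system would draw on a thesaurus or similar lexical resource.
SYNONYMS = {
    "car": {"automobile"},
    "automobile": {"car"},
    "fast": {"quick"},
    "quick": {"fast"},
}


def expand(tokens):
    """Return the token set together with every listed synonym."""
    expanded = set(tokens)
    for token in tokens:
        expanded |= SYNONYMS.get(token, set())
    return expanded


def similarity(doc_a, doc_b):
    """Jaccard overlap of synonym-expanded word sets (assumed measure)."""
    a = expand(doc_a.lower().split())
    b = expand(doc_b.lower().split())
    if not a and not b:
        return 1.0  # two empty documents are treated as identical
    return len(a & b) / len(a | b)


def similarity_matrix(docs):
    """Build the symmetric matrix of pairwise document similarities."""
    n = len(docs)
    matrix = [[1.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = similarity(docs[i], docs[j])
    return matrix
```

Calling `similarity_matrix(["fast car", "quick automobile", "slow train"])` yields a 3x3 symmetric matrix in which the first two documents are maximally similar because they match through synonyms. How the paper feeds such a matrix into fuzzy co-clustering is not specified in the abstract and is not shown here.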


2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Rita T. Sousa ◽  
Sara Silva ◽  
Catia Pesquita

Abstract. Background: In recent years, biomedical ontologies have become important for describing existing biological knowledge in the form of knowledge graphs. Data mining approaches that work with knowledge graphs have been proposed, but they are based on vector representations that do not capture the full underlying semantics. An alternative is to use machine learning approaches that explore semantic similarity. However, since ontologies can model multiple perspectives, semantic similarity computations for a given learning task need to be fine-tuned to account for this. Obtaining the best combination of semantic similarity aspects for each learning task is not trivial and typically depends on expert knowledge. Results: We have developed a novel approach, evoKGsim, that applies Genetic Programming over a set of semantic similarity features, each based on a semantic aspect of the data, to obtain the best combination for a given supervised learning task. The approach was evaluated on several benchmark datasets for protein-protein interaction prediction using the Gene Ontology as the knowledge graph to support semantic similarity, and it outperformed competing strategies, including manually selected combinations of semantic aspects emulating expert knowledge. evoKGsim was also able to learn species-agnostic models with different combinations of species for training and testing, effectively addressing the limitations of predicting protein-protein interactions for species with fewer known interactions. Conclusions: evoKGsim can overcome one of the limitations in knowledge graph-based semantic similarity applications: the need to expertly select which aspects should be taken into account for a given application. Applying this methodology to protein-protein interaction prediction proved successful, paving the way to broader applications.


2020 ◽  
Author(s):  
Junyi Li ◽  
Xuejie Zhang ◽  
Xiaobing Zhou

BACKGROUND: In recent years, as the volume of information has grown and information screening has become more important, the calculation of textual semantic similarity has received increasing attention. In the medical field, with the rapid increase in electronic medical data, electronic medical records and medical research documents have become important data resources for clinical research. Medical textual semantic similarity calculation has become an urgent problem. The 2019 N2C2/OHNLP shared task track on Clinical Semantic Textual Similarity is one of the significant tasks for medical textual semantic similarity calculation. OBJECTIVE: This research aims to solve two problems: 1) medical datasets are small, so models learn and understand insufficiently; 2) information is lost during long-distance propagation, so models cannot grasp key information. METHODS: This paper combines a text data augmentation method with a self-ensemble ALBERT model under semi-supervised learning to perform clinical textual semantic similarity calculation. RESULTS: Compared with the competing methods in the 2019 N2C2/OHNLP Track 1 ClinicalSTS, our method achieves a state-of-the-art result, with a Pearson correlation coefficient of 0.92, and surpasses the best result by 2 percentage points. CONCLUSIONS: When a medical dataset is small, data augmentation and improved semi-supervised learning can increase the size of the dataset and boost the learning efficiency of the model. Additionally, self-ensembling improves model performance significantly. The results show that our method performs very well and has great potential to help with related medical problems.
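The evaluation metric named in this abstract, the Pearson correlation coefficient between predicted and gold similarity scores, can be computed as in the sketch below. The function name `pearson` and its error handling are assumptions for illustration; they do not come from the paper or the shared task's scoring script.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of centered products and squares; the 1/n factors cancel out.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 means the predicted scores rise and fall in step with the gold annotations; the metric is insensitive to the absolute scale of the predictions.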


2018 ◽  
Vol 2018 (15) ◽  
pp. 132-1-1323
Author(s):  
Shijie Zhang ◽  
Zhengtian Song ◽  
G. M. Dilshan P. Godaliyadda ◽  
Dong Hye Ye ◽  
Atanu Sengupta ◽  
...  

Author(s):  
Linna Fan ◽  
Shize Zhang ◽  
Yichao Wu ◽  
Zhiliang Wang ◽  
Chenxin Duan ◽  
...  

2018 ◽  
Vol 2 (2) ◽  
pp. 70-82 ◽  
Author(s):  
Binglu Wang ◽  
Yi Bu ◽  
Win-bin Huang

Abstract: In the field of scientometrics, the principal purpose of author co-citation analysis (ACA) is to map knowledge domains by quantifying the relationship between co-cited author pairs. However, traditional ACA has been criticized because its input, which simply counts authors’ co-citation frequencies, is insufficiently informative. To address this issue, this paper introduces a new method, Document- and Keyword-Based Author Co-Citation Analysis (DKACA), that reconstructs the raw co-citation matrices by considering document unit counts and the keywords of references. Building on traditional ACA, DKACA counts co-citation pairs by document units instead of authors, from the global network perspective. Moreover, by incorporating keyword information from cited papers, DKACA captures the semantic similarity between co-cited papers. In the method validation part, we used network visualization and MDS measurement to evaluate the effectiveness of DKACA. Results suggest that the proposed DKACA method not only reveals insights that were previously unknown but also improves the performance and accuracy of knowledge domain mapping, providing a new basis for further studies.
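For context, the traditional ACA input that this abstract criticizes can be sketched as a simple pair count. This illustrates only the baseline frequency counting, not the DKACA method, whose document-unit and keyword weighting is not specified in the abstract; the function name and the data layout (one list of cited author names per citing paper) are assumptions.

```python
from collections import Counter
from itertools import combinations


def author_cocitation_counts(reference_lists):
    """Count, for each author pair, how many citing papers cite both authors.

    reference_lists: one list of cited author names per citing paper
    (a hypothetical input layout chosen for this sketch).
    """
    counts = Counter()
    for cited_authors in reference_lists:
        # Deduplicate so an author cited twice in one paper counts once,
        # and sort so each unordered pair has a single canonical key.
        for pair in combinations(sorted(set(cited_authors)), 2):
            counts[pair] += 1
    return counts
```

For example, `author_cocitation_counts([["Smith", "Jones"], ["Jones", "Smith", "Lee"]])` gives the pair `("Jones", "Smith")` a count of 2. The resulting counts are the raw co-citation matrix entries that methods such as DKACA aim to enrich.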

