document similarity
Recently Published Documents


TOTAL DOCUMENTS

243
(FIVE YEARS 68)

H-INDEX

19
(FIVE YEARS 2)

Author(s):  
Yan Xiang ◽  
Zhengtao Yu ◽  
Junjun Guo ◽  
Yuxin Huang ◽  
Yantuan Xian

Opinion target classification of microblog comments is one of the most important tasks in public opinion analysis of an event. Due to the high cost of manual labeling, opinion target classification is generally treated as a weakly supervised task. This article addresses the opinion target classification of microblog comments with an event graph convolution network (EventGCN) in a weakly supervised manner. Specifically, we take microblog contents and comments as document nodes, and construct an event graph with three typical relationships of event microblogs: the co-occurrence of event keywords extracted from microblogs, the reply relationship of comments, and document similarity. Finally, under the supervision of a small number of labels, both word features and comment features can be represented well enough to complete the classification. Experimental results on two event microblog datasets show that EventGCN significantly improves classification performance over baseline models.
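The abstract's graph construction step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the input names (`keywords`, `replies`, `sim`) and the similarity threshold are assumptions, and the result is the kind of adjacency matrix a GCN layer would consume after normalisation.

```python
from itertools import combinations

def build_event_graph(docs, keywords, replies, sim, sim_threshold=0.5):
    """Combine the three edge types described in the abstract into one
    symmetric adjacency matrix over document nodes (posts and comments):
    keyword co-occurrence, comment replies, and document similarity."""
    n = len(docs)
    adj = [[0.0] * n for _ in range(n)]

    def connect(i, j, w=1.0):
        adj[i][j] = max(adj[i][j], w)
        adj[j][i] = max(adj[j][i], w)

    # 1) co-occurrence: documents sharing an event keyword are linked
    for kw, doc_ids in keywords.items():
        for i, j in combinations(doc_ids, 2):
            connect(i, j)

    # 2) reply relationship: a comment is linked to the node it replies to
    for child, parent in replies:
        connect(child, parent)

    # 3) document similarity above a threshold becomes a weighted edge
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= sim_threshold:
                connect(i, j, sim[i][j])

    # self-loops, as is conventional before GCN normalisation
    for i in range(n):
        adj[i][i] = 1.0
    return adj
```

A GCN then propagates word and comment features over this adjacency, which is how a small number of labels can supervise the whole graph.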


2022 ◽  
Vol 2 (2) ◽  
pp. 90-95
Author(s):  
Muhammad Azmi

Plagiarism is the act of duplicating or imitating someone else's work and presenting it as one's own, without the author's permission and without citing the source. Plagiarism is not difficult to commit: with a copy-paste-modify technique applied to part or all of a document, the resulting document can be called a plagiarised or duplicated work.

The practice of plagiarism occurs because students are accustomed to taking the writings of others without citing the original source, sometimes even copying them in their entirety. Plagiarism is mostly committed by students, especially when completing a final project or thesis.

One way to counter the practice of plagiarism is to prevent and detect it. Plagiarism detection based on the concept of document similarity is one way to detect both copy-and-paste plagiarism and disguised plagiarism. An appropriate method for detecting plagiarism is to analyse the degree of document plagiarism using the cosine similarity method with TF-IDF term weighting. This research produces an application that computes the similarity value of the documents under test. Testing shows agreement between manual calculations and the algorithm as implemented in the application. The stemming library used proved quite effective in the stemming process: calculations with stemming yield a higher similarity value than calculations without stemming.
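The TF-IDF-plus-cosine-similarity pipeline named in this abstract can be written in a few lines. This is a generic sketch of the standard technique, not the application described above; tokenisation and stemming are assumed to have happened already.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute sparse TF-IDF vectors (dicts) for tokenised documents.
    Terms occurring in every document get idf = 0, i.e. no weight."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * idf[t] for t in tf})
    return vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Because stemming maps inflected forms onto one term, it increases term overlap between a source and its paraphrase, which is consistent with the higher similarity values reported above.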


2021 ◽  
Author(s):  
Harshit Jain ◽  
Naveen Pundir

India and many other countries, such as the UK, Australia, and Canada, follow the 'common law system', which gives substantial importance to prior related cases in determining the outcome of the current case. Better similarity methods can help in finding earlier similar cases, which can help lawyers searching for precedents. Prior approaches to computing the similarity of legal judgements use a basic representation, either a bag-of-words or a dense embedding learned using only the words present in the document. They, however, either neglect or do not emphasize the vital 'legal' information in the judgements, e.g. citations to prior cases, and act and article numbers or names. In this paper, we propose a novel approach to learning embeddings of legal documents using the citation network of the documents. Experimental results demonstrate that the learned embedding is on par with state-of-the-art methods for document similarity on a standard legal dataset.
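The intuition behind citation-based similarity can be illustrated with a much simpler proxy than the learned embeddings the paper proposes: compare two judgments by the overlap of the prior cases they cite. This sketch is an assumption-laden stand-in (Jaccard overlap of citation sets), not the paper's method.

```python
def citation_similarity(citations_a, citations_b):
    """Jaccard overlap of the prior cases two judgments cite --
    a minimal citation-network proxy for legal document similarity.
    Two judgments that cite many of the same precedents are likely
    to concern related legal questions, even if their wording differs."""
    a, b = set(citations_a), set(citations_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

A learned embedding generalises this idea: instead of requiring directly shared citations, it places documents near each other when they are close in the citation graph.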


2021 ◽  
pp. 146511652110644
Author(s):  
Maximilian Haag

Informal trilogue meetings are the main legislative bargaining forum in the European Union, yet their dynamics remain largely understudied in a quantitative context. This article builds on the assumption that the negotiating delegations of the European Parliament and the Council play a two-level game whereby these actors can use their intra-institutional constraint to extract inter-institutional bargaining success. Negotiators can credibly claim that their hands are tied if the members of their parent institutions hold similar preferences and do not accept alternative proposals or if their institution is divided and negotiators need to defend a fragile compromise. Employing a measure of document similarity (minimum edit distance) between an institution's negotiation mandate and the trilogue outcome to measure bargaining success, the analysis supports the hypothesis for the European Parliament, but not for the Council.
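The bargaining-success measure described here, minimum edit distance between a mandate and the trilogue outcome, can be sketched with the standard Levenshtein dynamic program. The normalisation into a success score is an illustrative assumption, not necessarily the article's exact operationalisation.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists),
    computed row by row to keep memory at O(len(b))."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def bargaining_success(mandate, outcome):
    """1.0 when the outcome reproduces the mandate exactly, 0.0 when
    every position differs: a normalised edit-distance proxy."""
    dist = edit_distance(mandate, outcome)
    return 1.0 - dist / max(len(mandate), len(outcome), 1)
```

Running this over token sequences rather than raw characters makes the measure less sensitive to trivial rewording.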


2021 ◽  
Vol 11 (24) ◽  
pp. 12040
Author(s):  
Mustafa A. Al Sibahee ◽  
Ayad I. Abdulsada ◽  
Zaid Ameen Abduljabbar ◽  
Junchao Ma ◽  
Vincent Omollo Nyangaresi ◽  
...  

Applications for document similarity detection are widespread in diverse communities, including institutions and corporations. However, currently available detection systems fail to take into account the private nature of material or documents that have been outsourced to remote servers. None of the existing solutions can be described as lightweight techniques compatible with a lightweight client implementation, and this deficiency can limit the effectiveness of these systems. For instance, discovering similarity between the submissions of two conferences or journals must preserve the privacy of the submitted papers in a lightweight manner, so that the security and application requirements of resource-limited devices are fulfilled. This paper considers the problem of lightweight similarity detection between document sets while preserving the privacy of the material. The proposed solution permits documents to be compared without disclosing their content to untrusted servers. The fingerprint set of each document is computed efficiently, and an inverted index is built over the whole set of fingerprints. Before being uploaded to the untrusted server, this index is secured with the Paillier cryptosystem. This study develops a secure yet efficient method for scalable encrypted document comparison. To evaluate the computational performance of this method, the paper carries out several comparative assessments against other major approaches.
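The plaintext half of such a pipeline, turning each document into a fingerprint set before anything is encrypted, can be sketched with winnowing-style k-gram hashing. This is a generic fingerprinting sketch under assumed parameters (k, window size); the paper's Paillier encryption of the inverted index is deliberately omitted here.

```python
import hashlib

def fingerprints(text, k=5, window=4):
    """Winnowing-style fingerprint set: hash every k-gram of the
    normalised text, then keep the minimum hash in each sliding
    window of consecutive hashes."""
    text = "".join(text.lower().split())  # drop case and whitespace
    hashes = [
        int.from_bytes(hashlib.sha1(text[i:i + k].encode()).digest()[:8], "big")
        for i in range(len(text) - k + 1)
    ]
    if not hashes:
        return set()
    return {min(hashes[i:i + window])
            for i in range(max(1, len(hashes) - window + 1))}

def overlap(doc_a, doc_b):
    """Jaccard overlap of fingerprint sets -- the similarity score a
    server could compute without ever seeing the raw text."""
    fa, fb = fingerprints(doc_a), fingerprints(doc_b)
    return len(fa & fb) / len(fa | fb) if fa | fb else 0.0
```

Because only fixed-size hashes leave the client, the server-side index never needs the original wording, which is what makes an encrypted inverted index over these fingerprints feasible.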


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Didi Surian ◽  
Florence T. Bourgeois ◽  
Adam G. Dunn

Abstract Background Clinical trial registries can be used as sources of clinical evidence for systematic review synthesis and updating. Our aim was to evaluate methods for identifying clinical trial registrations that should be screened for inclusion in updates of published systematic reviews. Methods A set of 4644 clinical trial registrations (ClinicalTrials.gov) included in 1089 systematic reviews (PubMed) were used to evaluate two methods (document similarity and hierarchical clustering) and representations (L2-normalised TF-IDF, Latent Dirichlet Allocation, and Doc2Vec) for ranking 163,501 completed clinical trials by relevance. Clinical trial registrations were ranked for each systematic review using seeding clinical trials, simulating how new relevant clinical trials could be automatically identified for an update. Performance was measured by the number of clinical trials that need to be screened to identify all relevant clinical trials. Results Using the document similarity method with TF-IDF feature representation and Euclidean distance metric, all relevant clinical trials for half of the systematic reviews were identified after screening 99 trials (IQR 19 to 491). The best-performing hierarchical clustering was using Ward agglomerative clustering (with TF-IDF representation and Euclidean distance) and needed to screen 501 clinical trials (IQR 43 to 4363) to achieve the same result. Conclusion An evaluation using a large set of mined links between published systematic reviews and clinical trial registrations showed that document similarity outperformed hierarchical clustering for identifying relevant clinical trials to include in systematic review updates.
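The best-performing setup above, ranking candidate registrations by distance in TF-IDF space to already-included "seed" trials, can be sketched in a few lines. This is an illustrative reconstruction with sparse dict vectors and a seed centroid, not the authors' exact ranking code.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two sparse vectors stored as dicts."""
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in keys))

def rank_candidates(seed_vecs, candidate_vecs):
    """Rank candidate trial registrations by distance to the centroid
    of the seed (already-included) trials; closest first. Screening
    the list in this order should surface relevant trials early."""
    centroid = {}
    for vec in seed_vecs:
        for t, w in vec.items():
            centroid[t] = centroid.get(t, 0.0) + w / len(seed_vecs)
    return sorted(range(len(candidate_vecs)),
                  key=lambda i: euclidean(centroid, candidate_vecs[i]))
```

The evaluation metric in the abstract then corresponds to the rank position of the last relevant trial in this ordering.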


2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Meijing Li ◽  
Tianjie Chen ◽  
Keun Ho Ryu ◽  
Cheng Hao Jin

Semantic mining is always a challenge for big biomedical text data. Ontology has been widely proven useful for extracting semantic information. However, ontology-based semantic similarity calculation is so complex that it cannot scale to measuring similarity over big text data. To solve this problem, we propose a parallelized semantic similarity measurement method based on Hadoop MapReduce for big text data. First, we preprocess the documents and extract their semantic features. Then, we calculate document semantic similarity based on the ontology network structure under the MapReduce framework. Finally, based on the generated semantic document similarity, document clusters are generated via clustering algorithms. To validate the effectiveness, we use two kinds of open datasets. The experimental results show that the traditional methods can hardly handle more than ten thousand biomedical documents, whereas the proposed method remains efficient and accurate on big datasets and offers high parallelism and scalability.
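The MapReduce decomposition of pairwise similarity can be illustrated with plain functions for the map, shuffle, and reduce phases. This sketch runs sequentially and uses a simple shared-feature count in place of the paper's ontology-based weighting; on Hadoop, the framework would distribute the map and reduce calls across nodes.

```python
from collections import defaultdict
from itertools import combinations

def map_phase(doc_id, features):
    """Map: each document emits (feature, doc_id) pairs for its
    extracted semantic features."""
    return [(f, doc_id) for f in set(features)]

def shuffle(mapped):
    """Shuffle: group document ids by shared feature, as the
    MapReduce framework would between the two phases."""
    groups = defaultdict(set)
    for feature, doc_id in mapped:
        groups[feature].add(doc_id)
    return groups

def reduce_phase(groups):
    """Reduce: every pair of documents sharing a feature gains one
    count; the counts form an unnormalised semantic overlap score
    that a clustering step could consume."""
    scores = defaultdict(int)
    for doc_ids in groups.values():
        for a, b in combinations(sorted(doc_ids), 2):
            scores[(a, b)] += 1
    return dict(scores)
```

The key scalability point is that no single node ever needs the full pairwise similarity matrix; each reducer only sees documents that actually share a feature.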


2021 ◽  
Author(s):  
M. Saef Ullah Miah ◽  
Junaida Sulaiman ◽  
Saiful Azad ◽  
Kamal Z. Zamli ◽  
Rajan Jose
