scholarly journals Finding Efficient Linguistic Feature Set for Authorship Verification

2013 ◽  
Vol 1 (1) ◽  
pp. 35-43
Author(s):  
Sandaruwan Prabath Kumara Ranatunga

Authorship verification rely on identification of a given document is written by a particular author or not. Internally analyzing the document itself with respect to the variations in writing style of the author and identification of the author’s own idiolect is the main context of the authorship verification. Mainly, the detection performance depends on the used feature set for clustering the document. Linguistic features and stylistic features have been utilized for author identification according to the writing style of a particular author. Disclose the shallow changes of the author’s writing style is the major problem which should be addressed in the domain of authorship verification. It motivates the computer science researchers to do research on authorship verification in the field of computer forensics and this research also focuses this problem. The contributions from the research are two folded: Former is introducing a new feature extracting method with Natural Language Processing (NLP) and later is propose a new more efficient linguistic feature set for verification of author of the given document. Experiments on a corpus composed of freely downloadable genuine 19th century English Books and Self Organizing Maps has been used as the classifier to cluster the documents. Proper word segmentation also introduced in this work and it helps to demonstrate that the proposed strategy can produced promising results. Finally, it is realized that more accurate classification is generated by the proposed strategy with extracted linguistic feature set.

Author(s):  
Vittoria Cuteri ◽  
Giulia Minori ◽  
Gloria Gagliardi ◽  
Fabio Tamburini ◽  
Elisabetta Malaspina ◽  
...  

Abstract Purpose Attention has recently been paid to Clinical Linguistics for the detection and support of clinical conditions. Many works have been published on the “linguistic profile” of various clinical populations, but very few papers have been devoted to linguistic changes in patients with eating disorders. Patients with Anorexia Nervosa (AN) share similar psychological features such as disturbances in self-perceived body image, inflexible and obsessive thinking and anxious or depressive traits. We hypothesize that these characteristics can result in altered linguistic patterns and be detected using the Natural Language Processing tools. Methods We enrolled 51 young participants from December 2019 to February 2020 (age range: 14–18): 17 girls with a clinical diagnosis of AN, and 34 normal-weighted peers, matched by gender, age and educational level. Participants in each group were asked to produce three written texts (around 10–15 lines long). A rich set of linguistic features was extracted from the text samples and the statistical significance in pinpointing the pathological process was measured. Results Comparison between the two groups showed several linguistics indexes as statistically significant, with syntactic reduction as the most relevant trait of AN productions. In particular, the following features emerge as statistically significant in distinguishing AN girls and their normal-weighted peers: the length of the sentences, the complexity of the noun phrase, and the global syntactic complexity. This peculiar pattern of linguistic erosion may be due to the severe metabolic impairment also affecting the central nervous system in AN. Conclusion These preliminary data showed the existence of linguistic parameters as probable linguistic markers of AN. However, the analysis of a bigger cohort, still ongoing, is needed to consolidate this assumption. Level of evidence III Evidence obtained from case–control analytic studies.


2018 ◽  
Vol 6 (2) ◽  
pp. 83
Author(s):  
Refat Aljumily

The aim of this paper was to evaluate the efficiency of automated linguistic features to test its capacity or discriminating power as style markers for author identification in short text messages of the Facebook genre. The corpus used to evaluate the automated linguistics features was compiled from 221 Facebook texts (each text is about 2 to 3 lines/35-40 words) written in English, which were written in the same genre and topic and posted in the same year group, totaling 7530 words. To compose the dataset for linguistic features performance or evaluation, frequency values were collected from 16 linguistic feature types involving parts of speech, function words, word bigrams, character tri grams, average sentence length in terms of words, average sentence length in terms of characters, Yule’s K measure, Simpson’s D measure, average words length, FW/CW ratio, average characters, content specific key words, type/token ratio, total number of short words less than four characters, contractions, and total number of characters in words which were selected from five corpora, totalling 328 test features. The evaluation of the 16 linguistic feature types differ from those of other analyses because the study used different variable selection methods including feature type frequency, variance, term frequency/ inverse document frequency (TF.IDF), signal-noise ratio, and Poisson term distribution. The relationships between known and anonymous text messages were examined using hierarchical linear and non-hierarchical nonlinear clustering methods, taking into accounts the nonlinear patterns among the data. There were similarities between the anonymous text messages and the authors of the non-anonymous text messages in terms function word and parts of speech usages based on TF.IDF technique and the efficiency of function word usages (=60%) and the efficiency of parts of speech frequencies (=50%). There were no similarities between the anonymous text messages and the authors of the non-anonymous text messages in terms of the other features using feature type frequency and variance techniques in this test and the efficiency of these features in the corpus (< 40%). There was a positive effect on identification performance using parts of speech and function word frequency usages and applying TF.IDF technique as the length of text messages increased (N≥ 100). Through this way, the performance and efficiency of syntactic features and function word usages to identify anonymous authors or text messages is improved by increasing the length of the text messages using TF.IDF variable selection technique, but decreased as feature type frequency and variance techniques in the selection process apply.


2021 ◽  
Author(s):  
Vittoria Cuteri ◽  
Giulia Minori ◽  
Gloria Gagliardi ◽  
Fabio Tamburini ◽  
Elisabetta Malaspina ◽  
...  

Abstract Purpose: attention has recently been paid to Clinical Linguistics for the detection and support of clinical conditions. Many works have been published on the “linguistic profile” of various clinical populations, but very few papers have been devoted to linguistic changes in patients with eating disorders. Patients with Anorexia Nervosa (AN) share similar psychological features such as disturbances in self-perceived body image, inflexible and obsessive thinking and anxious or depressive traits. We hypothesize that these characteristics can result in altered linguistic patterns and be detected using the Natural Language Processing tools. Methods: we enrolled 51 young participants from December 2019 to February 2020 (age range: 14-18): 17 girls with a clinical diagnosis of AN, and 34 normal weighted peers, matched by gender, age and educational level. Participants in each group were asked to produce three written texts (around 10-15 lines long). A rich set of linguistic features was extracted from the text samples and the statistical significance in pinpointing the pathological process was measured.Results: comparison between the two groups showed several linguistics indexes as statistically significant, with syntactic reduction as the most relevant trait of AN productions. Conclusion: these preliminary data showed the existence of linguistic parameters as probable linguistic markers of AN. However, the analysis of a bigger cohort, still ongoing, is needed to consolidate this assumption.Level of evidence III: evidence obtained from case-control analytic studies.


2019 ◽  
Vol 35 (2) ◽  
pp. 471-484
Author(s):  
Chunlin Wang ◽  
Irene Castellón ◽  
Elisabet Comelles

Abstract Semantic Textual Similarity (STS), which measures the equivalence of meanings between two textual segments, is an important and useful task in Natural Language Processing. In this article, we have analyzed the datasets provided by the Semantic Evaluation (SemEval) 2012–2014 campaigns for this task in order to find out appropriate linguistic features for each dataset, taking into account the influence that linguistic features at different levels (e.g. syntactic constituents and lexical semantics) might have on the sentence similarity. Results indicate that a linguistic feature may have a different effect on different corpus due to the great difference in sentence structure and vocabulary between datasets. Thus, we conclude that the selection of linguistic features according to the genre of the text might be a good strategy for obtaining better results in the STS task. This analysis could be a useful reference for measuring system building and linguistic feature tuning.


Author(s):  
Mubin Shoukat Tamboli ◽  
Rajesh Prasad

Abstract Over the last few decades, there has been tremendous growth in online communication through different types of media. Communication via the Internet is anonymous, which causes a critical issue regarding identity tracing. Authorship identification can apply to tasks such as identifying an anonymous author, detecting plagiarism, or finding a ghostwriter. Previous research has outlined the various methods and their improvements for the identification of anonymous authors based on stylometry. However, changes in the writing style of an author over a long period has not been addressed. In this article, we propose a methodology for author identification where the writing style of an author changes. The proposed methodology consists of two phases: the first will show the change in writing style of the author and in another phase the change is mitigated by a new feature normalization technique. A novel Transform Feature to Current Time function is proposed for normalization, where features are shifted to current time and made available for further classification. A machine-learning algorithm is used to identify an author candidate. The experiments of the proposed methodology conducted on a set of text samples by several authors were collected over a different time period and the results show an improvement in performance.


Author(s):  
Shizhe Chen ◽  
Qin Jin ◽  
Alexander Hauptmann

Bilingual lexicon induction, translating words from the source language to the target language, is a long-standing natural language processing task. Recent endeavors prove that it is promising to employ images as pivot to learn the lexicon induction without reliance on parallel corpora. However, these vision-based approaches simply associate words with entire images, which are constrained to translate concrete words and require object-centered images. We humans can understand words better when they are within a sentence with context. Therefore, in this paper, we propose to utilize images and their associated captions to address the limitations of previous approaches. We propose a multi-lingual caption model trained with different mono-lingual multimodal data to map words in different languages into joint spaces. Two types of word representation are induced from the multi-lingual caption model: linguistic features and localized visual features. The linguistic feature is learned from the sentence contexts with visual semantic constraints, which is beneficial to learn translation for words that are less visual-relevant. The localized visual feature is attended to the region in the image that correlates to the word, so that it alleviates the image restriction for salient visual representation. The two types of features are complementary for word translation. Experimental results on multiple language pairs demonstrate the effectiveness of our proposed method, which substantially outperforms previous vision-based approaches without using any parallel sentences or supervision of seed word pairs.


Author(s):  
Una Stojnić

On the received view, the resolution of context-sensitivity is at least partly determined by non-linguistic features of utterance situation. If I say ‘He’s happy’, what ‘he’ picks out is underspecified by its linguistic meaning, and is only fixed through extra-linguistic supplementation: the speaker’s intention, and/or some objective, non-linguistic feature of the utterance situation. This underspecification is exhibited by most context-sensitive expressions, with the exception of pure indexicals, like ‘I.’ While this received view is prima facie appealing, I argue it is deeply mistaken. I defend an account according to which context-sensitivity resolution is governed by linguistic mechanisms determining prominence of candidate resolutions of context-sensitive items. Thus, on this account, the linguistic meaning of a context-sensitive expression fully specifies its resolution in a context, automatically selecting the resolution antecedently set by the prominence-governing linguistic mechanisms.


Author(s):  
Santosh Kumar Mishra ◽  
Rijul Dhir ◽  
Sriparna Saha ◽  
Pushpak Bhattacharyya

Image captioning is the process of generating a textual description of an image that aims to describe the salient parts of the given image. It is an important problem, as it involves computer vision and natural language processing, where computer vision is used for understanding images, and natural language processing is used for language modeling. A lot of works have been done for image captioning for the English language. In this article, we have developed a model for image captioning in the Hindi language. Hindi is the official language of India, and it is the fourth most spoken language in the world, spoken in India and South Asia. To the best of our knowledge, this is the first attempt to generate image captions in the Hindi language. A dataset is manually created by translating well known MSCOCO dataset from English to Hindi. Finally, different types of attention-based architectures are developed for image captioning in the Hindi language. These attention mechanisms are new for the Hindi language, as those have never been used for the Hindi language. The obtained results of the proposed model are compared with several baselines in terms of BLEU scores, and the results show that our model performs better than others. Manual evaluation of the obtained captions in terms of adequacy and fluency also reveals the effectiveness of our proposed approach. Availability of resources : The codes of the article are available at https://github.com/santosh1821cs03/Image_Captioning_Hindi_Language ; The dataset will be made available: http://www.iitp.ac.in/∼ai-nlp-ml/resources.html .


Author(s):  
Rodrigo Igawa ◽  
Alex Almeida ◽  
Bruno Zarpelão ◽  
Sylvio Jr

In this work, we propose an approach for recognition of compromised Twitter accounts based on Authorship Verification. Our solution can detect accounts that became compromised by analysing their user writing styles. This way, when an account content does not match its user writing style, we affirm that the account has been compromised, similar to Authorship Verification. Our approach follows the profile-based paradigm and uses N-grams as its kernel. Then, a threshold is found to represent the boundary of an account writing style. Experiments were performed using a subsampled dataset from Twitter. Experimental results showed that the developed model is very suitable for compromised recognition of Online Social Networks accounts due to the capability of recognize user styles over 95% accuracy.


Sign in / Sign up

Export Citation Format

Share Document