Author Identification from Handwritten Characters using Siamese CNN

Abstract: How to classify short texts effectively remains an important question in computational stylometry. This study presents the results of an experiment involving authorship attribution of ancient Greek texts. These texts were chosen to explore the effectiveness of digital methods as a supplement to the author’s work on text classification based on traditional stylometry. Here it is crucial to avoid the confounding effects of shared topic and similar factors. Therefore, this study attempts to identify authorship using only morpho-syntactic data, without regard to specific vocabulary items. The data are taken from the dependency annotations published in the Ancient Greek and Latin Dependency Treebank. The independent variables for classification are combinations generated from the dependency label and the morphology of each word in the corpus and of its dependency parent. To avoid the effects of combinatorial explosion, only the most frequent combinations are retained as input features. The authorship classification (with thirteen classes) is done with standard algorithms: logistic regression and support vector classification. During classification, the corpus is partitioned into increasingly smaller ‘texts’. To explore and control for possible confounding effects of, e.g., differences in genre and annotator, three corpora were tested: a mixed corpus of several genres of both prose and verse; a corpus of prose including oratory, history, and essay; and a corpus restricted to narrative history. Results are surprisingly good compared to those previously published. Accuracy for fifty-word inputs is 84.2–89.6%. Thus, this approach may prove an important addition to the prevailing methods for small text classification.
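The feature construction described in this abstract (combining each word's dependency label and morphology with those of its dependency parent, then keeping only the most frequent combinations) can be sketched roughly as follows. This is a minimal illustration, not the study's implementation: the token layout `(label, morph, parent_index)`, the function names, and the example tags are all assumptions, and the study may build its combinations differently.

```python
from collections import Counter

# Assumed token layout (hypothetical, not taken from the treebank format):
# (dependency_label, morphology_tag, parent_index), where parent_index is the
# position of the dependency parent in the sentence, or None for the root.

def combo_features(sentence):
    """Pair each word's label and morphology with those of its parent."""
    feats = []
    for label, morph, parent in sentence:
        if parent is None:
            p_label, p_morph = "ROOT", "-"
        else:
            p_label, p_morph, _ = sentence[parent]
        feats.append((label, morph, p_label, p_morph))
    return feats

def top_k_vocabulary(sentences, k):
    """Keep only the k most frequent combinations across the corpus."""
    counts = Counter(f for s in sentences for f in combo_features(s))
    return [f for f, _ in counts.most_common(k)]

def vectorize(sentences, vocab):
    """Relative frequency of each retained combination in one 'text'."""
    counts = Counter(f for s in sentences for f in combo_features(s))
    total = sum(counts.values()) or 1
    return [counts[f] / total for f in vocab]
```

Vectors produced this way would then be passed to a classifier such as logistic regression or a support vector classifier; that step is omitted here.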


Author Identification of Micro-Messages via Multi-Channel Convolutional Neural Networks

2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC) ◽

10.1109/smc42975.2020.9283214 ◽

2020 ◽

Author(s):

Sarp Aykent ◽

Gerry Dozier

Keyword(s):

Neural Networks ◽

Convolutional Neural Networks ◽

Author Identification


ICodeNet - A Hierarchical Neural Network Approach For Source Code Author Identification

2021 13th International Conference on Machine Learning and Computing ◽

10.1145/3457682.3457709 ◽

2021 ◽

Author(s):

Pranali Bora ◽

Tulika Awalgaonkar ◽

Himanshu Palve ◽

Raviraj Joshi ◽

Purvi Goel

Keyword(s):

Neural Network ◽

Source Code ◽

Network Approach ◽

Neural Network Approach ◽

Author Identification ◽

Hierarchical Neural Network


Author Identification on Literature in Different Languages: A Systematic Survey

2018 International Conference On Advances in Communication and Computing Technology (ICACCT) ◽

10.1109/icacct.2018.8529635 ◽

2018 ◽

Cited By ~ 1

Author(s):

Kale Sunil Digamberrao ◽

Rajesh S. Prasad

Keyword(s):

Systematic Survey ◽

Author Identification


Evaluation of the Performance and Efficiency of the Automated Linguistic Features for Author Identification in Short Text Messages Using Different Variable Selection Techniques

Studies in Media and Communication ◽

10.11114/smc.v6i2.3892 ◽

2018 ◽

Vol 6 (2) ◽

pp. 83

Author(s):

Refat Aljumily

Keyword(s):

Variable Selection ◽

Text Messages ◽

Sentence Length ◽

Function Word ◽

Linguistic Features ◽

Linguistic Feature ◽

Short Text ◽

Parts Of Speech ◽

Author Identification ◽

Type Frequency

The aim of this paper was to evaluate automated linguistic features and test their discriminating power as style markers for author identification in short text messages of the Facebook genre. The corpus used to evaluate the automated linguistic features was compiled from 221 Facebook texts (each about 2 to 3 lines, or 35-40 words) written in English in the same genre and on the same topic and posted in the same year group, totaling 7530 words. To compose the dataset for evaluating feature performance, frequency values were collected from 16 linguistic feature types: parts of speech, function words, word bigrams, character trigrams, average sentence length in words, average sentence length in characters, Yule’s K measure, Simpson’s D measure, average word length, FW/CW ratio, average characters, content-specific keywords, type/token ratio, total number of short words of fewer than four characters, contractions, and total number of characters in words. These were selected from five corpora, totaling 328 test features. The evaluation of the 16 linguistic feature types differs from other analyses because the study used different variable selection methods, including feature type frequency, variance, term frequency/inverse document frequency (TF.IDF), signal-noise ratio, and Poisson term distribution. The relationships between known and anonymous text messages were examined using hierarchical linear and non-hierarchical nonlinear clustering methods, taking into account the nonlinear patterns among the data. There were similarities between the anonymous text messages and the authors of the non-anonymous text messages in terms of function word and parts of speech usage based on the TF.IDF technique, with an efficiency of 60% for function word usage and 50% for parts of speech frequencies. There were no similarities between the anonymous text messages and the authors of the non-anonymous text messages in terms of the other features using the feature type frequency and variance techniques in this test, and the efficiency of these features in the corpus was below 40%. Using parts of speech and function word frequencies with the TF.IDF technique had a positive effect on identification performance as the length of text messages increased (N ≥ 100). Thus, the performance and efficiency of syntactic features and function word usage in identifying anonymous authors or text messages improve as message length increases under TF.IDF variable selection, but decrease when feature type frequency and variance techniques are applied in the selection process.
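Three of the vocabulary-richness features named in this abstract (type/token ratio, Yule's K, and Simpson's D) can be computed from a token list as sketched below. This uses the textbook definitions of the measures; the paper does not state its exact formulas or tokenization, so the function name `lexical_richness` and the whitespace tokenization in the usage line are assumptions.

```python
from collections import Counter

def lexical_richness(tokens):
    """Type/token ratio, Yule's K and Simpson's D for a list of tokens.

    Textbook definitions, with V_i the number of types occurring i times
    and N the number of tokens:
      K = 10^4 * (sum_i i^2 * V_i - N) / N^2
      D = sum_i V_i * (i / N) * ((i - 1) / (N - 1))
    """
    n = len(tokens)
    if n < 2:
        raise ValueError("need at least two tokens")
    type_counts = Counter(tokens)
    spectrum = Counter(type_counts.values())  # maps i -> V_i
    ttr = len(type_counts) / n
    yule_k = 1e4 * (sum(i * i * v for i, v in spectrum.items()) - n) / (n * n)
    simpson_d = sum(v * (i / n) * ((i - 1) / (n - 1)) for i, v in spectrum.items())
    return {"ttr": ttr, "yule_k": yule_k, "simpson_d": simpson_d}

# Usage with naive whitespace tokenization (an assumption for illustration).
measures = lexical_richness("a a b c".split())
```

On texts of only 35-40 words these measures are likely to be noisy, which is consistent with the abstract's observation that performance improves with longer messages.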


USING TEXT MINING AND RANDOM FORESTS FOR AUTHOR IDENTIFICATION. THE CASE OF CLARIVATE WEB OF SCIENCE DATABASE

10.12948/ie2019.01.01 ◽

2019 ◽

Author(s):

Marin FOTACHE

Keyword(s):

Text Mining ◽

Random Forests ◽

Web Of Science ◽

Author Identification


A robust authorship attribution on big period

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v9i4.pp3167-3174 ◽

2019 ◽

Vol 9 (4) ◽

pp. 3167 ◽

Cited By ~ 1

Author(s):

Mubin Shoukat Tamboli ◽

Rajesh Prasad

Keyword(s):

Identification Problem ◽

Authorship Attribution ◽

Support Vector ◽

Writing Style ◽

Author Identification ◽

Time Period ◽

N Gram ◽

Corpus Selection ◽

Writing Sample ◽

Small Period

Authorship attribution is the task of identifying the writer of an unknown text by assigning it to a known writer. The writing style of each author is distinct and can be used for discrimination, although various parameters can cause that style to change. When the writing samples collected for an author belong to a small time period, they can contribute effectively to identifying an unknown sample. This paper considers the author identification problem in which writing samples are not available from the same time period; such evidence is collected over a long period of time. Character n-gram, word n-gram, and PoS n-gram features are used to build the model, as they contribute to the style of a writer in terms of content as well as the statistical characteristics of writing style. We applied the support vector machine algorithm for classification, and the experiments produced effective results. When discriminating among multiple authors, corpus selection and construction were the most tedious tasks, and both were implemented effectively. Accuracy was observed to vary by feature type: word and character n-grams showed better accuracy than PoS n-grams.
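The character and word n-gram features mentioned in this abstract can be extracted along the following lines. This is a minimal sketch with hypothetical function names; it assumes whitespace tokenization and lowercasing for word n-grams, leaves out PoS n-grams (which need a tagger), and stops before the support vector machine step, which would take the resulting profiles as input.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count overlapping character n-grams in a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def word_ngrams(text, n):
    """Count overlapping word n-grams (lowercased, whitespace-split)."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def relative_profile(counts):
    """Convert raw n-gram counts to relative frequencies."""
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}

# Usage: a character-bigram profile for one writing sample.
profile = relative_profile(char_ngrams("abab", 2))
```

Relative frequencies rather than raw counts are used here so that samples of different lengths remain comparable; that is a design choice of this sketch, not something the abstract specifies.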


A Content Analysis of Indian Research Data Repositories Prospects and Possibilities

DESIDOC Journal of Library & Information Technology ◽

10.14429/djlit.39.06.15137 ◽

2019 ◽

Vol 39 (06) ◽

pp. 280-289 ◽

Cited By ~ 1

Author(s):

Raj Kumar Bhardwaj

Keyword(s):

Application Programming Interface ◽

Research Data ◽

Identification System ◽

Microsoft Excel ◽

Data Repositories ◽

Data Formats ◽

Author Identification ◽

Metadata Standards ◽

Application Programming ◽

Content Coverage

The study aims to trace the development of Indian research data repositories (RDRs) and explore their content with a view to identifying prospects and possibilities. Further, it analyses the distribution of data repositories on the basis of content coverage, types of content, author identification system followed, software and application programming interface used, number of repositories by subject, etc. The study is based on data repositories listed in the registry of data repositories accessible at http://www.re3data.org. The dataset was exported in Microsoft Excel format for analysis. A simple percentage method was followed in the data analysis, and results are presented through tables and figures. The study found a total of 2829 data repositories in existence worldwide, covering 72 countries. Further, it was seen that 1526 (53.9 %) are open and 924 (32.4 %) are restricted data repositories. Also, there are embargoed data repositories numbering 225 (8.0 %) and closed ones numbering 154 (5.4 %). The study found that out of a total of 45 Indian RDRs, only 30 (67 %) are open, followed by 12 (27 %) that are restricted and 3 (6 %) that are closed. The largest number of Indian RDRs (20) were developed in the year 2014. The study found that the largest group of Indian RDRs (17) are ‘disciplinary’. Further, the study also revealed that statistical data formats are available in a maximum of 31 (68.9 %) Indian RDRs. It was also seen that the majority of Indian RDRs (28) have datasets relating to ‘Life Sciences’. It was identified that only 20% of data repositories use metadata standards; the remaining 80% do not use any standards in metadata entry. This study covered only the research data repositories in India registered in the registry of data repositories. RDRs not listed in the registry are left out.


Source Code Author Identification Method Combining Semantics and Statistical Features

Business Intelligence and Information Technology - Lecture Notes on Data Engineering and Communications Technologies ◽

10.1007/978-3-030-92632-8_14 ◽

2021 ◽

pp. 141-151

Author(s):

Xu Sun ◽

Yutong Sun ◽

Leilei Kong ◽

Yong Han ◽

Hui Ning

Keyword(s):

Source Code ◽

Statistical Features ◽

Identification Method ◽

Author Identification


Author Profiling and Related Applications

The Oxford Handbook of Computational Linguistics 2nd edition ◽

10.1093/oxfordhb/9780199573691.013.53 ◽

2019 ◽

Author(s):

Michael P. Oakes

Keyword(s):

Computer Code ◽

Data Sets ◽

Plagiarism Detection ◽

Linguistic Features ◽

Original Source ◽

Research Systems ◽

Author Identification ◽

Document Collection ◽

Intrinsic Plagiarism Detection ◽

Author Profiling

Author profiling is the analysis of people’s writing in an attempt to find out which classes they belong to, such as gender, age group or native language. Many of the techniques for author profiling are derived from the related task of Author Identification, so we will look at this topic first. Author identification is the task of finding out who is most likely to have written a disputed document, and there are a number of computational approaches to this. The three main subtasks are the compilation of corpora of texts known to be written by the candidate authors, the selection of linguistic features to represent those texts, and statistics for discriminating between those features which are most indicative of a particular author’s writing style. Plagiarism is the unacknowledged use of another author’s original work, and we will look at software for its detection. The chapter will cover the types of text obfuscation strategies used by plagiarists, commercial plagiarism detection software and its shortcomings, and recent research systems. Strategies have been developed for both external plagiarism detection (where the original source is searched for in a large document collection) and intrinsic plagiarism detection (where the source text is not available, necessitating a search for inconsistencies within the suspicious document). The specific problems of plagiarism by translation of an original in another language, and the unauthorized copying of sections of computer code, are described. Evaluation forums and publicly available test data sets are covered for each of the main topics of this chapter.
