Author Identification from Handwritten Characters using Siamese CNN

Author(s):  
Nkosikhona Dlamini ◽  
Terence L van Zyl
2019 ◽  
Vol 35 (4) ◽  
pp. 812-825 ◽  
Author(s):  
Robert Gorman

Abstract: How to classify short texts effectively remains an important question in computational stylometry. This study presents the results of an experiment involving authorship attribution of ancient Greek texts. These texts were chosen to explore the effectiveness of digital methods as a supplement to the author’s work on text classification based on traditional stylometry. Here it is crucial to avoid confounding effects of shared topic, etc. Therefore, this study attempts to identify authorship using only morpho-syntactic data without regard to specific vocabulary items. The data are taken from the dependency annotations published in the Ancient Greek and Latin Dependency Treebank. The independent variables for classification are combinations generated from the dependency label and the morphology of each word in the corpus and its dependency parent. To avoid the effects of the combinatorial explosion, only the most frequent combinations are retained as input features. The authorship classification (with thirteen classes) is done with standard algorithms: logistic regression and support vector classification. During classification, the corpus is partitioned into increasingly smaller ‘texts’. To explore and control for the possible confounding effects of, e.g., different genre and annotator, three corpora were tested: a mixed corpus of several genres of both prose and verse, a corpus of prose including oratory, history, and essay, and a corpus restricted to narrative history. Results are surprisingly good as compared to those previously published. Accuracy for fifty-word inputs is 84.2–89.6%. Thus, this approach may prove an important addition to the prevailing methods for small text classification.
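The feature construction described in this abstract can be illustrated with a minimal sketch. It is one possible reading, not the author's implementation: each token is paired with its dependency parent, the labels and morphology tags are joined into a single string, and only the most frequent strings are kept as features. The record fields (`rel`, `morph`, `head`), the tag values, and the function names are all hypothetical and do not reflect the treebank's actual schema.

```python
from collections import Counter

# Hypothetical token records; field names and tag values are illustrative only.
# "head" is the index of the dependency parent within the sentence, or None for the root.
SENTENCE = [
    {"rel": "SBJ", "morph": "n-s---mn-", "head": 1},
    {"rel": "PRED", "morph": "v3spia---", "head": None},
    {"rel": "OBJ", "morph": "n-s---fa-", "head": 1},
]

def combos(sentence):
    """Yield one feature string per token: its own label and morphology
    joined with those of its dependency parent (ROOT if it has none)."""
    for tok in sentence:
        parent = sentence[tok["head"]] if tok["head"] is not None else None
        p_rel = parent["rel"] if parent else "ROOT"
        p_morph = parent["morph"] if parent else "ROOT"
        yield f'{tok["rel"]}|{tok["morph"]}|{p_rel}|{p_morph}'

def top_features(sentences, k):
    """Keep only the k most frequent combinations across the corpus."""
    counts = Counter(f for s in sentences for f in combos(s))
    return [feat for feat, _ in counts.most_common(k)]

def vectorize(sentence, features):
    """Count how often each retained combination occurs in one sentence."""
    counts = Counter(combos(sentence))
    return [counts[f] for f in features]
```

The resulting count vectors could then be passed to any standard classifier; the classification step itself is not shown here.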


2018 ◽  
Vol 6 (2) ◽  
pp. 83
Author(s):  
Refat Aljumily

The aim of this paper was to evaluate automatically extracted linguistic features and test their discriminating power as style markers for author identification in short text messages of the Facebook genre. The corpus used for the evaluation was compiled from 221 Facebook texts (each about 2 to 3 lines, or 35-40 words) written in English, in the same genre and on the same topic, and posted in the same year group, totaling 7530 words. To compose the dataset for evaluating feature performance, frequency values were collected for 16 linguistic feature types: parts of speech, function words, word bigrams, character trigrams, average sentence length in words, average sentence length in characters, Yule’s K measure, Simpson’s D measure, average word length, FW/CW ratio, average characters, content-specific keywords, type/token ratio, total number of short words (fewer than four characters), contractions, and total number of characters in words. These were selected from five corpora, totaling 328 test features. The evaluation of the 16 linguistic feature types differs from other analyses because the study used several variable selection methods, including feature type frequency, variance, term frequency/inverse document frequency (TF.IDF), signal-noise ratio, and Poisson term distribution. The relationships between known and anonymous text messages were examined using hierarchical linear and non-hierarchical nonlinear clustering methods, taking into account the nonlinear patterns in the data. There were similarities between the anonymous text messages and the authors of the non-anonymous text messages in terms of function word and part-of-speech usage based on the TF.IDF technique, with an efficiency of 60% for function word usage and 50% for part-of-speech frequencies. There were no similarities between the anonymous text messages and the authors of the non-anonymous text messages in terms of the other features using the feature type frequency and variance techniques in this test, and the efficiency of these features in the corpus was below 40%. Using part-of-speech and function word frequencies with the TF.IDF technique had a positive effect on identification performance as the length of the text messages increased (N ≥ 100). Thus, the performance and efficiency of syntactic features and function word usage in identifying anonymous authors or text messages improved with longer text messages under TF.IDF variable selection, but decreased when the feature type frequency and variance techniques were applied in the selection process.
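Three of the feature types listed in this abstract (type/token ratio, Yule’s K, and Simpson’s D) have standard textbook definitions, so a short sketch can make them concrete. The code below uses those standard formulas and assumes whitespace-separated tokens; the function name `richness` is hypothetical, and the paper's own tokenization and implementation may differ.

```python
from collections import Counter

def richness(tokens):
    """Compute three vocabulary-richness measures for a list of tokens.

    Requires at least two tokens, since Simpson's D divides by N * (N - 1).
    """
    n = len(tokens)
    freqs = Counter(tokens)              # frequency of each word type
    spectrum = Counter(freqs.values())   # V_i: number of types occurring exactly i times
    ttr = len(freqs) / n                 # type/token ratio
    # Yule's K = 10^4 * (sum_i i^2 * V_i - N) / N^2
    yule_k = 1e4 * (sum(i * i * v for i, v in spectrum.items()) - n) / (n * n)
    # Simpson's D = sum_i V_i * i * (i - 1) / (N * (N - 1))
    simpson_d = sum(v * i * (i - 1) for i, v in spectrum.items()) / (n * (n - 1))
    return {"ttr": ttr, "yule_k": yule_k, "simpson_d": simpson_d}
```

For the tokens `a a b c`, this gives a type/token ratio of 0.75 and a Yule's K of 1250. Texts of 35-40 words yield very small frequency spectra, so these measures are noisy at that length.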


Author(s):  
Mubin Shoukat Tamboli ◽  
Rajesh Prasad

Authorship attribution is the task of identifying the writer of an unknown text by assigning it to one of a set of known writers. Each author's writing style is distinct and can be used for discrimination. Different parameters are responsible for such differences. When the writing samples collected for an author belong to a short period, they can contribute efficiently to the identification of an unknown sample. This paper considers the author identification problem where writing samples are not available from the same time period; such evidence is collected over a long period of time. Character n-gram, word n-gram, and PoS n-gram features are used to build the model, as they contribute to a writer's style in terms of both content and the statistical characteristics of the writing. We applied the support vector machine algorithm for classification. The experiments produced effective results. When discriminating among multiple authors, corpus selection and construction was the most tedious task, and it was implemented effectively. Accuracy was observed to vary with feature type: word and character n-grams showed better accuracy than PoS n-grams.
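The character and word n-gram features named in this abstract can be sketched as follows. This is a minimal illustration, assuming lowercased whitespace tokenization for words and raw character windows; the function names are hypothetical, PoS n-grams would additionally need a tagger, and the support vector machine stage is not shown.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count overlapping character n-grams, spaces included."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def word_ngrams(text, n=2):
    """Count overlapping word n-grams over lowercased whitespace tokens."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def relative_freqs(counts):
    """Normalize counts so that samples of different lengths are comparable."""
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}
```

Normalizing to relative frequencies is a design choice made here so that long and short samples produce comparable vectors; the abstract does not state how the features were scaled.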


2019 ◽  
Vol 39 (06) ◽  
pp. 280-289 ◽  
Author(s):  
Raj Kumar Bhardwaj

The study aims to trace the development of Indian research data repositories (RDRs) and explore their content with a view to identifying prospects and possibilities. Further, it analyses the distribution of data repositories on the basis of content coverage, types of content, author identification system followed, software and application programming interface used, subject-wise number of repositories, etc. The study is based on data repositories listed in the registry of data repositories accessible at http://www.re3data.org. The dataset was exported in Microsoft Excel format for analysis. A simple percentage method was followed in the data analysis, and results are presented through tables and figures. The study found a total of 2829 data repositories in existence worldwide, covering 72 countries. Of these, 1526 (53.9 %) are open and 924 (32.4 %) are restricted data repositories. Also, there are embargoed data repositories numbering 225 (8.0 %) and closed ones numbering 154 (5.4 %). The study found that out of a total of 45 Indian RDRs, only 30 (67 %) are open, followed by 12 (27 %) that are restricted and 3 (6 %) that are closed. The largest share of Indian RDRs (20) were developed in the year 2014. The study found that the largest share of Indian RDRs (17) are ‘disciplinary’. Further, the study also revealed that statistical data formats are the most widely available, found in 31 (68.9 %) Indian RDRs. It was also seen that the majority of Indian RDRs (28) have datasets relating to ‘Life Sciences’. It was identified that only 20% of data repositories use metadata standards; the remaining 80% do not use any standards in metadata entry. This study covered only the research data repositories in India registered in the registry of data repositories. RDRs not listed in the registry of data repositories are left out.


Author(s):  
Michael P. Oakes

Author profiling is the analysis of people’s writing in an attempt to find out which classes they belong to, such as gender, age group or native language. Many of the techniques for author profiling are derived from the related task of Author Identification, so we will look at this topic first. Author identification is the task of finding out who is most likely to have written a disputed document, and there are a number of computational approaches to this. The three main subtasks are the compilation of corpora of texts known to be written by the candidate authors, the selection of linguistic features to represent those texts, and statistics for discriminating between those features which are most indicative of a particular author’s writing style. Plagiarism is the unacknowledged use of another author’s original work, and we will look at software for its detection. The chapter will cover the types of text obfuscation strategies used by plagiarists, commercial plagiarism detection software and its shortcomings, and recent research systems. Strategies have been developed for both external plagiarism detection (where the original source is searched for in a large document collection) and intrinsic plagiarism detection (where the source text is not available, necessitating a search for inconsistencies within the suspicious document). The specific problems of plagiarism by translation of an original in another language, and the unauthorized copying of sections of computer code, are described. Evaluation forums and publicly available test data sets are covered for each of the main topics of this chapter.

