A robust authorship attribution on big period

Authorship attribution is the task of identifying the writer of an unknown text by assigning it to one of a set of known writers. Each author's writing style is distinct and can be used for discrimination. Different parameters are responsible for variation in that style. When the writing samples collected for an author belong to a short period, they can contribute efficiently to the identification of an unknown sample. This paper considers the author identification problem where writing samples are not available from the same time period; the evidence is collected over a long period of time. Character n-gram, word n-gram and PoS n-gram features are used to build the model, as they capture the writer's style in terms of both content and the statistical characteristics of the writing. We applied the support vector machine algorithm for classification. The experiments produced effective results. While discriminating among multiple authors, corpus selection and construction were the most tedious tasks. It is observed that accuracy varied with feature type: word and character n-grams showed better accuracy than PoS n-grams.
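As a rough illustration of the character and word n-gram features mentioned in this abstract, the following minimal sketch counts overlapping n-grams and normalises them to relative frequencies. The function names, the whitespace tokenisation and the lowercasing are assumptions made for illustration; the paper's own preprocessing is not described here, and PoS n-grams (which need a tagger) are left out.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count overlapping character n-grams in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text, n):
    """Count overlapping word n-grams (assumed: lowercase, whitespace split)."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def relative_frequencies(counts):
    """Normalise counts so that texts of different length are comparable."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}
```

Frequency vectors of this kind could then be passed to any standard classifier, such as the support vector machine named in the abstract.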

Author identification of short texts using dependency treebanks without vocabulary

Digital Scholarship in the Humanities ◽

10.1093/llc/fqz070 ◽

2019 ◽

Vol 35 (4) ◽

pp. 812-825 ◽

Cited By ~ 1

Author(s):

Robert Gorman

Keyword(s):

Text Classification ◽

Authorship Attribution ◽

Support Vector ◽

Ancient Greek ◽

Combinatorial Explosion ◽

Independent Variables ◽

Author Identification ◽

Important Addition ◽

Digital Methods ◽

And Control

Abstract How to classify short texts effectively remains an important question in computational stylometry. This study presents the results of an experiment involving authorship attribution of ancient Greek texts. These texts were chosen to explore the effectiveness of digital methods as a supplement to the author’s work on text classification based on traditional stylometry. Here it is crucial to avoid confounding effects of shared topic, etc. Therefore, this study attempts to identify authorship using only morpho-syntactic data without regard to specific vocabulary items. The data are taken from the dependency annotations published in the Ancient Greek and Latin Dependency Treebank. The independent variables for classification are combinations generated from the dependency label and the morphology of each word in the corpus and its dependency parent. To avoid the effects of the combinatorial explosion, only the most frequent combinations are retained as input features. The authorship classification (with thirteen classes) is done with standard algorithms—logistic regression and support vector classification. During classification, the corpus is partitioned into increasingly smaller ‘texts’. To explore and control for the possible confounding effects of, e.g. different genre and annotator, three corpora were tested: a mixed corpus of several genres of both prose and verse, a corpus of prose including oratory, history, and essay, and a corpus restricted to narrative history. Results are surprisingly good as compared to those previously published. Accuracy for fifty-word inputs is 84.2–89.6%. Thus, this approach may prove an important addition to the prevailing methods for small text classification.
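The feature construction described in this abstract (combining the dependency label and morphology of each word with those of its dependency parent, then keeping only the most frequent combinations) might be sketched as follows. The token layout, the key names `id`, `head`, `deprel` and `morph`, and the `ROOT` placeholder are hypothetical; the study's exact combinations are not specified here, so this shows only one possible combination.

```python
from collections import Counter


def token_features(sentence):
    """Yield one (label, morph, parent_label, parent_morph) tuple per token.

    `sentence` is a list of dicts with the assumed keys 'id', 'head',
    'deprel' and 'morph'; a head of 0 marks the root of the sentence.
    """
    by_id = {tok["id"]: tok for tok in sentence}
    for tok in sentence:
        parent = by_id.get(tok["head"])
        parent_label = parent["deprel"] if parent else "ROOT"
        parent_morph = parent["morph"] if parent else "ROOT"
        yield (tok["deprel"], tok["morph"], parent_label, parent_morph)


def top_k_features(sentences, k):
    """Keep only the k most frequent combinations to limit the feature space."""
    counts = Counter(f for sent in sentences for f in token_features(sent))
    return [feature for feature, _ in counts.most_common(k)]


def vectorize(sentences, vocabulary):
    """Relative frequency of each retained combination within one text."""
    counts = Counter(f for sent in sentences for f in token_features(sent))
    total = sum(counts.values()) or 1
    return [counts[feature] / total for feature in vocabulary]
```

Because no word forms or lemmas enter the feature tuples, a vector built this way carries no vocabulary information, which matches the stated aim of avoiding topic effects.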

Analysis of authorship attribution technique on Urdu tweets empowered by machine learning

International Journal of Advanced Trends in Computer Science and Engineering ◽

10.30534/ijatcse/2021/911032021 ◽

2021 ◽

Vol 10 (3) ◽

pp. 2150-2157

Keyword(s):

Machine Learning ◽

Cosine Similarity ◽

Authorship Attribution ◽

Writing Style ◽

Accuracy Rate ◽

The World ◽

Linguistic Level ◽

N Gram ◽

Investigation Process

The process of identifying the author of an anonymous document from a group of unknown documents is called authorship attribution. As the world trends towards shorter communications, online criminal activities such as phishing and bullying are also increasing. Criminals hide their identity behind a screen name and connect anonymously, which makes them difficult to trace during the cybercrime investigation process. This paper evaluates current techniques of authorship attribution at the linguistic level and compares their accuracy rates in English and Urdu contexts, using the LDA model with the n-gram technique and cosine similarity on stylometric features to identify the writing style of a specific author. Two datasets are used, Urdu_TD and English_TD, based on 180 English and Urdu tweets per author. The overall accuracy achieved is 84.52% on Urdu_TD and 93.17% on English_TD. The task is done without using any labels for authorship.
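To make the cosine-similarity step concrete, here is a minimal sketch that compares two texts through their character n-gram count profiles. It is an assumption-laden simplification: the LDA topic model used in the paper is omitted, and the function names and the default of n = 3 are chosen only for illustration.

```python
import math
from collections import Counter


def ngram_profile(text, n=3):
    """Character n-gram counts for one text (n=3 is an assumed default)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[gram] * b[gram] for gram in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

Values close to 1 indicate that two texts use n-grams in similar proportions, while 0 means they share none.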

Multi-View Vehicle Recognition Based on WRT-SVM

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.694-697.1987 ◽

2013 ◽

Vol 694-697 ◽

pp. 1987-1992 ◽

Cited By ~ 1

Author(s):

Xing Gang Wu ◽

Cong Guo

Keyword(s):

False Positive Rate ◽

Identification Problem ◽

Classification Problem ◽

Random Trees ◽

Support Vector ◽

Image Size ◽

Scale Invariant ◽

Vehicle Recognition ◽

Positive Rate ◽

Image Pairs

This paper proposes an approach to identify vehicles under variations in image size, illumination, and view angle across different cameras, using a Support Vector Machine with weighted random trees (WRT-SVM). By quantizing the scale-invariant features of image pairs with the weighted random trees, the identification problem is formulated as a same-different classification problem. Results show the efficiency of building the randomized trees using the sample weights, and the control of the false-positive rate of the identification system.

A Self-Supervised Representation Learning of Sentence Structure for Authorship Attribution

ACM Transactions on Knowledge Discovery from Data ◽

10.1145/3491203 ◽

2022 ◽

Vol 16 (4) ◽

pp. 1-16

Author(s):

Fereshteh Jafariakinabad ◽

Kien A. Hua

Keyword(s):

Structural Information ◽

Syntactic Structure ◽

Representation Learning ◽

Authorship Attribution ◽

Sentence Structure ◽

Vector Representation ◽

Writing Style ◽

Neural Models ◽

Syntactic Information ◽

Classification Tasks

The syntactic structure of sentences in a document substantially informs about its authorial writing style. Sentence representation learning has been widely explored in recent years and it has been shown that it improves the generalization of different downstream tasks across many domains. Even though utilizing probing methods in several studies suggests that these learned contextual representations implicitly encode some amount of syntax, explicit syntactic information further improves the performance of deep neural models in the domain of authorship attribution. These observations have motivated us to investigate the explicit representation learning of syntactic structure of sentences. In this article, we propose a self-supervised framework for learning structural representations of sentences. The self-supervised network contains two components: a lexical sub-network and a syntactic sub-network, which take the sequence of words and their corresponding structural labels as the input, respectively. Due to the n-to-1 mapping of words to their structural labels, each word will be embedded into a vector representation which mainly carries structural information. We evaluate the learned structural representations of sentences using different probing tasks, and subsequently utilize them in the authorship attribution task. Our experimental results indicate that the structural embeddings significantly improve the classification tasks when concatenated with the existing pre-trained word embeddings.

Implementation of n-gram Methodology for Rotten Tomatoes Review Dataset Sentiment Analysis

International Journal of Knowledge Discovery in Bioinformatics ◽

10.4018/ijkdb.2017010103 ◽

2017 ◽

Vol 7 (1) ◽

pp. 30-41 ◽

Cited By ~ 12

Author(s):

Prayag Tiwari ◽

Brojo Kishore Mishra ◽

Sachin Kumar ◽

Vivek Kumar

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Sentiment Analysis ◽

Maximum Entropy ◽

Learning Strategies ◽

Supervised Machine Learning ◽

Support Vector ◽

N Gram ◽

F Measure ◽

Blog Posts

Sentiment analysis aims to capture the underlying opinion of a piece of content, which may be anything that holds a subjective view, for example an online review, comments on blog posts, or a film rating. These reviews and blogs can be classified into polarity groups such as negative, positive, and neutral in order to extract information from the input dataset. Supervised machine learning strategies classify these reviews. In this paper, three machine learning algorithms, namely Support Vector Machine (SVM), Maximum Entropy (ME) and Naive Bayes (NB), are considered for the classification of human opinions. The accuracy of the different strategies is critically examined in order to assess their performance on the basis of measures such as precision, recall, F-measure, and accuracy.
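The standard evaluation measures for classifiers of this kind (precision, recall, F-measure and accuracy) can be computed from predicted and true labels as in the sketch below. The function name and the one-class-versus-rest treatment of the `positive` label are assumptions for illustration, not the paper's evaluation code.

```python
def binary_metrics(y_true, y_pred, positive="positive"):
    """Precision, recall and F1 for one class, plus overall accuracy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = correct / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}
```

For a three-way polarity task, calling the function once per class and averaging the per-class scores gives macro-averaged figures.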

Simulations and difficult problems

Digital Scholarship in the Humanities ◽

10.1093/llc/fqz034 ◽

2019 ◽

Author(s):

David L Hoover

Keyword(s):

Early Modern ◽

Authorship Attribution ◽

The Third ◽

Adequate Number ◽

Uncertain Evidence ◽

N Gram ◽

Problematic Situations ◽

Initial Uncertainty ◽

Number Of Authors

Abstract An authorship attribution investigation ideally begins with a well-defined set of possible authors and an adequate number of firmly attributed roughly contemporaneous long texts in the same genre by those authors. Many significant or intriguing problems, however, suffer from deficiencies or limitations that reduce the effectiveness or validity of some kinds of analysis and make others impossible. These problematic situations can be approached by creating simulations that are designed to overcome or mitigate the difficulties of the problems. The results of the simulations can be used to suggest at least tentative solutions. Here, simulations are used to investigate four difficult problems. One involves fewer and shorter texts than would be ideal–texts that are also chronologically earlier than the known texts by the target author. The second involves too small a number of well attributed texts by the authors in question, and initial uncertainty about the genres of the texts, the number of authors involved, and their genders. The third is a tricky case of co-authorship with only relatively vague and uncertain evidence about the nature and extent of each author’s contribution; here simulations with sections of well-attributed texts by the two authors are used to test Rolling Classify. The fourth addresses the sparsity of well-attributed and confidently-dated Early Modern plays, using simulations to evaluate Brian Vickers’ rare n-gram approach to the attribution of such plays.

Dynamics and an efficient malware detection system using opcode sequence graph generation and ml algorithm

E3S Web of Conferences ◽

10.1051/e3sconf/202018401009 ◽

2020 ◽

Vol 184 ◽

pp. 01009

Author(s):

Bharathi Panduri ◽

Madhurika Vummenthala ◽

Spoorthi Jonnalagadda ◽

Garwandha Ashwini ◽

Naruvadi Nagamani ◽

...

Keyword(s):

Detection System ◽

Future Research ◽

Support Vector ◽

Web Page ◽

Code Injection ◽

Mission Success ◽

Sequence Graph ◽

N Gram ◽

Iot Devices ◽

Garbage Code

IoT (Internet of Things), for the most part, comprises a wide range of Internet-connected gadgets and hubs. In the context of military and defence systems (called IoBT), these gadgets could be personnel wearable battle outfits, tracking devices, cameras, clinical gadgets, etc. The integrity and safety of these devices are critical to mission success, and it is of utmost importance to keep them secure. One of the typical ways of attacking these gadgets is through malware, whose aim could be to compromise the device and/or breach the communications. Generally, these IoBT gadgets and hubs are a much more significant target for cyber criminals than IoT devices because of the value they hold. In this paper we attempt to create a significant learning-based procedure to distinguish, classify and track such malware in IoBT (Internet of Battlefield Things) through operational code (OpCode) sequences. This is achieved by transforming the OpCodes into a vector space, upon which a Deep Eigenspace learning technique is applied to differentiate between harmful and safe applications. For robust classification, support vector machine and n-gram sequencing algorithms are proposed in this paper. Moreover, we evaluate the quality of our proposed approach in malware recognition and also its maintainability against garbage code injection assault. These results are presented on a web page which has separate components and levels of accessibility for user and admin credentials. For the purpose of tracking the prevalence of various malware on the network, counts and trends of different malicious opcodes are displayed for both user and admin. Our proposed approach will thereby be beneficial for users, especially those who want to communicate confidential information within the network. It is also beneficial if a user wants to know whether a message is secure or not. The malware test has also been made accessible, which ideally will profit future research endeavors.

Automatic language identification using support vector machines and phonetic N-gram

2008 International Conference on Audio, Language and Image Processing ◽

10.1109/icalip.2008.4590023 ◽

2008 ◽

Cited By ~ 4

Author(s):

Yan Deng ◽

Jia Liu

Keyword(s):

Support Vector Machines ◽

Language Identification ◽

Support Vector ◽

Vector Machines ◽

N Gram

N-gram support vector machines for scalable procedure and diagnosis classification, with applications to clinical free text data from the intensive care unit

Journal of the American Medical Informatics Association ◽

10.1136/amiajnl-2014-002694 ◽

2014 ◽

Vol 21 (5) ◽

pp. 871-875 ◽

Cited By ~ 31

Author(s):

Ben J Marafino ◽

Jason M Davies ◽

Naomi S Bardach ◽

Mitzi L Dean ◽

R Adams Dudley

Keyword(s):

Intensive Care Unit ◽

Intensive Care ◽

Support Vector Machines ◽

Support Vector ◽

Free Text ◽

Text Data ◽

Vector Machines ◽

N Gram ◽

Diagnosis Classification

Detecting sexual predators in chats using behavioral features and imbalanced learning

Natural Language Engineering ◽

10.1017/s1351324916000395 ◽

2017 ◽

Vol 23 (4) ◽

pp. 589-616 ◽

Cited By ~ 2

Author(s):

CLAUDIA CARDEI ◽

TRAIAN REBEDEA

Keyword(s):

Text Categorization ◽

Real Life ◽

Identification Problem ◽

Online Discussions ◽

Support Vector ◽

Two Stage ◽

Online Chat ◽

Sexual Predator ◽

Sexual Predators ◽

Behavioral Features

Abstract This paper presents a system developed for detecting sexual predators in online chat conversations using a two-stage classification and behavioral features. A sexual predator is defined as a person who tries to obtain sexual favors in a predatory manner, usually with underage people. The proposed approach uses several text categorization methods and empirical behavioral features developed especially for the task at hand. After investigating various approaches for solving the sexual predator identification problem, we have found that a two-stage classifier achieves the best results. In the first stage, we employ a Support Vector Machine classifier to distinguish conversations having suspicious content from safe online discussions. This is useful as most chat conversations in real life do not contain a sexual predator; therefore it can be viewed as a filtering phase that enables the actual detection of predators to be done only for suspicious chats that contain a sexual predator with a very high degree. In the second stage, we detect which of the users in a suspicious discussion is an actual predator using a Random Forest classifier. The system was tested on the corpus provided by the PAN 2012 workshop organizers and the results are encouraging because, as far as we know, our solution outperforms all previous approaches developed for solving this task.
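The two-stage control flow described in this abstract can be sketched independently of the specific models. In the minimal example below, `is_suspicious` and `is_predator` are hypothetical callables standing in for the trained Support Vector Machine and Random Forest classifiers, and the conversation layout (a mapping from conversation id to a list of `(user, text)` pairs) is an assumption made for illustration.

```python
def two_stage_detect(conversations, is_suspicious, is_predator):
    """Return {conversation_id: [flagged users]} using a filter-then-identify flow.

    conversations: dict mapping an id to a list of (user, text) pairs.
    is_suspicious: stage-one callable taking the whole conversation.
    is_predator:   stage-two callable taking one user's list of messages.
    """
    flagged = {}
    for conv_id, messages in conversations.items():
        # Stage 1: discard conversations judged safe, so stage 2 runs rarely.
        if not is_suspicious(messages):
            continue
        # Stage 2: group messages by user and classify each user separately.
        by_user = {}
        for user, text in messages:
            by_user.setdefault(user, []).append(text)
        users = [user for user, texts in by_user.items() if is_predator(texts)]
        if users:
            flagged[conv_id] = users
    return flagged
```

Keeping the two stages as separate callables means either model can be swapped or retrained without touching the filtering logic.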
