Author identification of short texts using dependency treebanks without vocabulary

2019 ◽  
Vol 35 (4) ◽  
pp. 812-825 ◽  
Author(s):  
Robert Gorman

How to classify short texts effectively remains an important question in computational stylometry. This study presents the results of an experiment involving authorship attribution of ancient Greek texts. These texts were chosen to explore the effectiveness of digital methods as a supplement to the author’s work on text classification based on traditional stylometry. Here it is crucial to avoid confounding effects of shared topic, etc. Therefore, this study attempts to identify authorship using only morpho-syntactic data without regard to specific vocabulary items. The data are taken from the dependency annotations published in the Ancient Greek and Latin Dependency Treebank. The independent variables for classification are combinations generated from the dependency label and the morphology of each word in the corpus and its dependency parent. To avoid the effects of the combinatorial explosion, only the most frequent combinations are retained as input features. The authorship classification (with thirteen classes) is done with standard algorithms—logistic regression and support vector classification. During classification, the corpus is partitioned into increasingly smaller ‘texts’. To explore and control for the possible confounding effects of, e.g. different genre and annotator, three corpora were tested: a mixed corpus of several genres of both prose and verse, a corpus of prose including oratory, history, and essay, and a corpus restricted to narrative history. Results are surprisingly good as compared to those previously published. Accuracy for fifty-word inputs is 84.2–89.6%. Thus, this approach may prove an important addition to the prevailing methods for small text classification.
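The feature-construction step described above — combining each word's dependency label and morphology with its parent's morphology, then keeping only the most frequent combinations — can be sketched as follows. The token fields and tag values here are hypothetical placeholders, not the treebank's actual annotation scheme:

```python
from collections import Counter

def token_combinations(sentence):
    """Yield (dep_label, morph, parent_morph) combinations for one parsed
    sentence. `sentence` is a list of dicts with 'dep', 'morph', and
    'head' (index of the dependency parent, or None for the root)."""
    for tok in sentence:
        parent_morph = (sentence[tok["head"]]["morph"]
                        if tok["head"] is not None else "ROOT")
        yield (tok["dep"], tok["morph"], parent_morph)

def top_k_features(sentences, k):
    """Count combinations over the corpus and keep only the k most
    frequent ones as input features, avoiding combinatorial explosion."""
    counts = Counter()
    for sent in sentences:
        counts.update(token_combinations(sent))
    return [combo for combo, _ in counts.most_common(k)]
```

Each (small) text would then be represented by the frequencies of these retained combinations and passed to the classifier.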

Author(s):  
Mubin Shoukat Tamboli ◽  
Rajesh Prasad

Authorship attribution is the task of identifying the writer of an unknown text and assigning it to a known author. Each author's writing style is distinct and can be used for discrimination, and different parameters are responsible for capturing such differences. When the writing samples collected for an author belong to a short period, they can be used efficiently to identify an unknown sample. In this paper, we consider an author identification problem in which writing samples from the same time period are not available; the evidence was instead collected over a long period of time. Character n-gram, word n-gram, and POS n-gram features were used to build the model, as they capture the writer's style in terms of both content and statistical characteristics. We applied a support vector machine algorithm for classification, and the experiments produced effective results. When discriminating among multiple authors, corpus selection and construction were the most tedious tasks, and these were implemented effectively. We observed that accuracy varied with feature type: word and character n-grams showed better accuracy than POS n-grams.
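As a rough illustration of the n-gram features mentioned above (the exact n values and preprocessing are the paper's and are not reproduced here), overlapping character and word n-grams can be extracted like this:

```python
def char_ngrams(text, n):
    """Overlapping character n-grams of a text, e.g. n=3 trigrams."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(tokens, n):
    """Overlapping word n-grams of a tokenised text."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

Counting these n-grams per document yields the frequency vectors an SVM can be trained on; POS n-grams are the same construction applied to a part-of-speech tag sequence instead of words.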


2020 ◽  
Author(s):  
Aviel J. Stein ◽  
Janith Weerasinghe ◽  
Spiros Mancoridis ◽  
Rachel Greenstadt

News articles are important for providing timely, historical information. However, the Internet is replete with text that may contain irrelevant or unhelpful information, so means of processing it and distilling content are important and useful to human readers as well as to information-extraction tools. Two common questions we may want to answer are “what is this article about?” and “who wrote it?”. In this work we compare machine learning models on two common NLP tasks, topic and authorship attribution, using the 2017 Vox Media dataset. Additionally, we use the models to classify on a subsection, about 20%, of the original text, which proves better for classification than the provided blurbs. Because of the large number of topics, we take topic overlap into account and address it via top-n accuracy and hierarchical groupings of topics. We also consider edge cases in authorship by classifying on inter-topic and intra-topic author distributions. Our results show that both topics and authors are readily identifiable, and that classification consistently performs best with neural networks rather than support vector, random forest, or naive Bayes classifiers, although the latter methods perform acceptably.
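Top-n accuracy, used above to handle topic overlap, simply checks whether the gold label appears among the model's n highest-ranked predictions; a minimal sketch:

```python
def top_n_accuracy(rankings, gold, n):
    """Fraction of examples whose gold label appears in the model's
    top-n ranked predictions. `rankings` is a list of label lists,
    best prediction first; `gold` is the list of true labels."""
    hits = sum(1 for ranked, g in zip(rankings, gold) if g in ranked[:n])
    return hits / len(gold)
```

With n = 1 this reduces to ordinary accuracy; larger n credits predictions of closely overlapping topics.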


2019 ◽  
Vol 26 (1) ◽  
pp. 49-71 ◽  
Author(s):  
Renkui Hou ◽  
Chu-Ren Huang

In this article, we propose an innovative and robust approach to stylometric analysis without annotation and leveraging lexical and sub-lexical information. In particular, we propose to leverage the phonological information of tones and rimes in Mandarin Chinese automatically extracted from unannotated texts. The texts from different authors were represented by tones, tone motifs, and word length motifs as well as rimes and rime motifs. Support vector machines and random forests were used to establish the text classification model for authorship attribution. From the results of the experiments, we conclude that the combination of bigrams of rimes, word-final rimes, and segment-final rimes can discriminate the texts from different authors effectively when using random forests to establish the classification model. This robust approach can in principle be applied to other languages with an established phonological inventory of onsets and rimes.


Author(s):  
Sarmad Mahar ◽  
Sahar Zafar ◽  
Kamran Nishat

Headnotes are the precise explanation and summary of legal points in an issued judgment. Law journals hire experienced lawyers to write these headnotes, which help the reader quickly determine the issue discussed in the case. Headnotes comprise two parts: the first states the topic discussed in the judgment, and the second contains a summary of that judgment. In this thesis, we design, develop, and evaluate headnote prediction using machine learning, without human involvement. We divided this task into a two-step process. In the first step, we predict the law points used in the judgment using text classification algorithms. The second step generates a summary of the judgment using text summarization techniques. To achieve this, we created a databank by extracting data from different law sources in Pakistan, and labelled the training data based on Pakistani law websites. We tested different feature extraction methods on the judiciary data to improve our system, and used them to develop a dictionary of terminology for ease of reference and utility. Our approach achieves 65% accuracy using Linear Support Vector Classification with tri-grams and without a stemmer. Using active learning, our system can continuously improve its accuracy as users of the system provide additional labelled examples.
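The active-learning step is not specified in detail in the abstract; one common selection strategy for a linear SVC (an assumption here, not necessarily the authors' choice) is uncertainty sampling, which asks users to label the unlabelled judgments nearest the decision boundary:

```python
def select_for_labelling(scores, k):
    """Pick the k unlabelled examples the classifier is least sure about.

    `scores` maps example ids to signed decision-function values (as a
    linear SVC would produce); small absolute values mean the example
    lies near the decision boundary and is most informative to label."""
    return sorted(scores, key=lambda ex: abs(scores[ex]))[:k]
```

Each round, the newly labelled examples are added to the training set and the classifier is retrained, which is how accuracy can keep improving with user feedback.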


2014 ◽  
Vol 136 (11) ◽  
Author(s):  
Michael W. Glier ◽  
Daniel A. McAdams ◽  
Julie S. Linsey

Bioinspired design is the adaptation of methods, strategies, or principles found in nature to solve engineering problems. One formalized approach to bioinspired solution seeking is the abstraction of the engineering problem into a functional need and then seeking solutions to this function using a keyword-type search method on text-based biological knowledge. These function keyword search approaches have shown potential for success, but as with many text-based search methods, they produce a large number of results, many of little relevance to the problem in question. In this paper, we develop a method to train a computer to identify text passages more likely to suggest a solution to a human designer. The work presented examines the possibility of filtering biological keyword search results by using text mining algorithms to automatically identify which results are likely to be useful to a designer. The text mining algorithms are trained on a pair of surveys administered to human subjects to empirically identify a large number of sentences that are, or are not, helpful for idea generation. We develop and evaluate three text classification algorithms, namely, a Naïve Bayes (NB) classifier, a k-nearest neighbors (kNN) classifier, and a support vector machine (SVM) classifier. Of these methods, the NB classifier generally had the best performance. Based on the analysis of 60 word stems, an NB classifier's precision is 0.87, recall is 0.52, and F score is 0.65. We find that word stem features that describe a physical action or process are correlated with helpful sentences. Similarly, we find biological jargon feature words are correlated with unhelpful sentences.
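The reported F score follows from the stated precision and recall as their harmonic mean:

```python
def f1(precision, recall):
    """F score (F1): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# With the reported precision 0.87 and recall 0.52, f1 comes out
# to roughly 0.65, matching the abstract.
```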


2021 ◽  
Vol 2021 ◽  
pp. 1-24
Author(s):  
Youness Mourtaji ◽  
Mohammed Bouhorma ◽  
Daniyal Alghazzawi ◽  
Ghadah Aldabbagh ◽  
Abdullah Alghamdi

The phenomenon of phishing has become a common threat, since many individuals and webpages have been observed to be attacked by phishers. The common purpose of phishing activities is to obtain users’ personal information for illegitimate use. Considering the growing intensity of the issue, this study is aimed at developing a new hybrid rule-based solution by incorporating six different algorithm models that may efficiently detect and control the phishing issue. The study incorporates 37 features extracted from different methods, including the blacklist method, the lexical and host method, the content method, the identity method, the identity similarity method, the visual similarity method, and the behavioral method. Furthermore, a comparative analysis was undertaken between different machine learning and deep learning models, including CART (decision trees), SVM (support vector machines), and KNN (k-nearest neighbors), and deep learning models such as MLP (multilayer perceptron) and CNN (convolutional neural networks). The findings of the study indicated that the method was effective in analysing URLs from different viewpoints, supporting the validity of the model. The highest accuracy was obtained for deep learning, with values of 97.945% for the CNN model and 93.216% for the MLP model, respectively. The study therefore concludes that the new hybrid solution should be implemented at a practical level to reduce phishing activities, due to its high efficiency and accuracy.
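As a sketch of the lexical and host method mentioned above (the feature names here are illustrative and do not reproduce the study's 37-feature set), a few URL-level features commonly used in phishing detection might be extracted like this:

```python
from urllib.parse import urlparse

def lexical_url_features(url):
    """A few lexical/host features of the kind used in phishing
    detection: URL length, subdomain dot count, presence of '@'
    (often used to obscure the real host), and a raw-IP host."""
    parsed = urlparse(url)
    host = parsed.netloc
    return {
        "url_length": len(url),
        "num_dots": host.count("."),
        "has_at": "@" in url,
        "host_is_ip": host.replace(".", "").isdigit(),
    }
```

Feature vectors like these, concatenated with features from the other methods, would then be fed to the CART/SVM/KNN or MLP/CNN models being compared.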


2020 ◽  
pp. 3397-3407
Author(s):  
Nur Syafiqah Mohd Nafis ◽  
Suryanti Awang

Text documents are unstructured and high dimensional. Effective feature selection is required to select the most important and significant features from the sparse feature space. Thus, this paper proposes an embedded feature selection technique based on Term Frequency-Inverse Document Frequency (TF-IDF) and Support Vector Machine-Recursive Feature Elimination (SVM-RFE) for unstructured and high-dimensional text classification. This technique has the ability to measure a feature's importance in a high-dimensional text document. In addition, it aims to increase the efficiency of feature selection and hence obtain promising text classification accuracy. In the first stage, TF-IDF acts as a filter approach that measures the importance of features in the text documents. In the second stage, SVM-RFE uses a backward feature elimination scheme to recursively remove insignificant features from the filtered feature subsets. This research executes sets of experiments using text documents retrieved from a benchmark repository comprising a collection of Twitter posts. Pre-processing is applied to extract relevant features, after which the pre-processed features are divided into training and testing datasets. Next, feature selection is implemented on the training dataset by calculating the TF-IDF score for each feature, and SVM-RFE is applied for feature ranking as the next feature selection step. Only top-ranked features are selected for text classification using the SVM classifier. The experiments show that the proposed technique is able to achieve 98% accuracy, outperforming other existing techniques. In conclusion, the proposed technique is able to select the significant features in unstructured and high-dimensional text documents.
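The first-stage TF-IDF filter scores each feature by its frequency in a document, discounted by how common the term is across the corpus; a minimal sketch using the standard formulation (the paper's exact weighting variant is not specified):

```python
import math

def tf_idf(term_count, doc_length, n_docs, doc_freq):
    """TF-IDF score of one term in one document: term frequency
    scaled by the inverse document frequency, so terms appearing
    in many documents are discounted."""
    tf = term_count / doc_length            # term frequency in this document
    idf = math.log(n_docs / doc_freq)       # rarity across the corpus
    return tf * idf
```

Features scoring highest under this filter would form the subset that SVM-RFE then prunes by recursively dropping the features with the smallest SVM weights.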


2020 ◽  
Vol 19 (2) ◽  
Author(s):  
Vitalia Fina Carla Rettobjaan

This study aims to analyze financial ratios for predicting bankruptcy. The sample used in this study consists of SMEs listed in PEFINDO25 for the period 2013 to 2017. The independent variables in this study are liquidity, profitability, debt structure, solvency, and activity ratio; the control variables are size and age; and the dependent variable is bankruptcy. The sample comprises 32 PEFINDO25 companies, selected using purposive sampling. Data analysis was performed using logistic regression with SPSS version 23. The results of this research show that liquidity, profitability, and age have a significant negative effect on bankruptcy, and debt structure has a significant positive effect on bankruptcy, while solvency, activity ratio, and size do not significantly affect bankruptcy.
