Analysis of authorship attribution technique on Urdu tweets empowered by machine learning

The process of identifying the author of an anonymous document from a set of candidate authors is called authorship attribution. As the world trends towards shorter communications, online criminal activities such as phishing and bullying are also increasing. Criminals hide their identities behind screen names and connect anonymously, which makes tracing them difficult during the cybercrime investigation process. This paper evaluates current authorship attribution techniques at the linguistic level and compares their accuracy in English and Urdu contexts, using an LDA model with an n-gram technique and cosine similarity applied to stylometric features to identify the writing style of a specific author. Two datasets are used, Urdu_TD and English_TD, containing 180 Urdu and English tweets per author respectively. The overall accuracy achieved is 84.52% on Urdu_TD and 93.17% on English_TD. The task is done without using any authorship labels.
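The abstract does not spell out its pipeline in detail, but its core attribution step, comparing stylometric n-gram profiles with cosine similarity, can be sketched as below. The English sample sentences, author names, and the choice of character trigrams are illustrative assumptions, not the paper's data:

```python
from collections import Counter
from math import sqrt

def char_ngrams(text, n=3):
    """Character n-grams capture sub-word stylistic habits."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def cosine_similarity(a, b):
    """Cosine similarity between two n-gram count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[g] * cb[g] for g in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

# Attribute an unknown tweet to the known author whose profile is most similar.
profiles = {
    "author_a": "the quick brown fox jumps over the lazy dog",
    "author_b": "colourless green ideas sleep furiously tonight",
}
unknown = "the lazy dog sleeps under the quick fox"
best = max(profiles, key=lambda a: cosine_similarity(
    char_ngrams(profiles[a]), char_ngrams(unknown)))
```

Because no labels are used, attribution reduces to a nearest-profile lookup; the paper's LDA component would additionally project documents into topic space before comparison.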

Author(s):  
Mubin Shoukat Tamboli ◽  
Rajesh Prasad

Authorship attribution is the task of identifying the writer of an unknown text and assigning it to a known writer. Each author's writing style is distinct and can be used for discrimination, and different parameters help capture such distinctions. When the writing samples collected for an author belong to a short time period, they can be used efficiently to identify an unknown sample. In this paper, the author identification problem is considered where writing samples are not available from the same time period; such evidence is collected over a long period of time. Character n-gram, word n-gram, and POS n-gram features are used to build the model, as they capture the writer's style in terms of both content and statistical characteristics. A support vector machine algorithm is applied for classification, and the experiments produced effective results. While discriminating among multiple authors, corpus selection and construction were the most tedious tasks, but they were implemented effectively. Accuracy was observed to vary with feature type: word and character n-grams showed better accuracy than POS n-grams.
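A minimal sketch of the n-gram-plus-SVM classification the abstract describes, using scikit-learn (the paper does not name a library, and the toy texts and author labels below are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus: two invented "authors" with distinct registers.
texts = [
    "I shall be most grateful for your kind reply",
    "kindly do the needful and revert at the earliest",
    "gonna grab some food, text me later ok",
    "lol ok text me when u get there",
]
labels = ["author_1", "author_1", "author_2", "author_2"]

# Character 2-3-grams as stylistic features; a word-level vectorizer
# (analyzer="word") or a POS-tag stream could be plugged in the same way.
model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(2, 3)),
    LinearSVC(),
)
model.fit(texts, labels)
```

For samples spread over a long time period, as in the paper, the training set would mix samples from different periods so the classifier learns period-stable style markers.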


2021 ◽  
pp. 1-11
Author(s):  
Carolina Martín-del-Campo-Rodríguez ◽  
Grigori Sidorov ◽  
Ildar Batyrshin

This paper presents a computational model for the unsupervised authorship attribution task based on a traditional machine learning scheme. An improvement over the state of the art is achieved by comparing different feature selection methods on the PAN17 author clustering dataset. To achieve this improvement, specific pre-processing and feature extraction methods were proposed, such as a method to separate tokens by type so that each is assigned to only one category. Similarly, special characters are treated as punctuation marks to improve the result obtained when applying typed character n-grams. A weighted cosine similarity measure is applied to improve the B³ F-score by reducing the vector values where attributes are exclusive. This measure is used to define distances between documents, which are then used by the clustering algorithm to perform authorship attribution.
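The abstract does not give the exact formula, but one plausible reading of "reducing the vector values where attributes are exclusive" is to down-weight features that occur in only one of the two documents before computing cosine similarity. The `exclusive_weight` parameter below is an assumed knob, not the paper's specification:

```python
from math import sqrt

def weighted_cosine(u, v, exclusive_weight=0.5):
    """Cosine similarity that down-weights features present in only
    one of the two documents (an illustrative reading of the paper)."""
    w = [1.0 if a > 0 and b > 0 else exclusive_weight for a, b in zip(u, v)]
    uw = [a * wi for a, wi in zip(u, w)]
    vw = [b * wi for b, wi in zip(v, w)]
    dot = sum(a * b for a, b in zip(uw, vw))
    denom = sqrt(sum(a * a for a in uw)) * sqrt(sum(b * b for b in vw))
    return dot / denom if denom else 0.0

# Two toy count vectors: features 0-1 are shared, 2-3 are exclusive.
doc_a = [3, 2, 0, 1]
doc_b = [2, 1, 4, 0]
sim = weighted_cosine(doc_a, doc_b)
```

Shrinking exclusive features shrinks the norms more than the dot product, so documents sharing attributes score higher than under plain cosine, which is the direction of improvement the abstract claims.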


Author(s):  
Ritu Banga ◽  
Akanksha Bhardwaj ◽  
Sheng-Lung Peng ◽  
Gulshan Shrivastava

This chapter gives comprehensive knowledge of various machine learning classifiers used to achieve authorship attribution (AA) on short texts, specifically tweets. The need for authorship identification stems from increasing crime on the internet, which breaches cyber ethics by raising the level of anonymity. AA of online messages has attracted interest from many research communities. Linguists and researchers have proposed many methods, both statistical and computational, to identify an author from their writing style. Various ways of extracting and selecting features on the basis of the dataset are reviewed. The authors focus on n-gram features, as they have proved very effective in identifying the true author from a given list of known authors. The study demonstrates that AA is achievable on small texts depending on the selection of features and methods, and also shows that the accuracy of the analysis changes with the combination of features. The authors find that character n-grams are good features for identifying the author but cannot yet identify the author on their own.
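Since the chapter's finding is that combinations of feature types outperform any single type, one way to sketch that is a feature extractor merging character and word n-grams into a single vector. The sample tweet, n sizes, and helper names are assumptions, not the chapter's code:

```python
from collections import Counter

def word_ngrams(tokens, n):
    """Contiguous word n-grams from a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tweet_features(tweet, char_n=3, word_n=2):
    """Combine character and word n-grams into one count vector;
    either half could be dropped to test a single feature type."""
    feats = Counter(tweet[i:i + char_n] for i in range(len(tweet) - char_n + 1))
    feats.update(word_ngrams(tweet.split(), word_n))
    return feats

feats = tweet_features("just landed in new york")
```

Feeding such merged vectors to the classifiers the chapter surveys makes the feature-combination comparison a matter of toggling which n-gram families are included.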


2021 ◽  
Author(s):  
Petr Plecháč

Contemporary stylometry uses various methods, including machine learning, to discover a poem’s author based on features like the frequencies of words and character n-grams. However, there is one potential textual fingerprint stylometry tends to ignore: versification, or the very making of language into verse. Using poetic texts in three languages (Czech, German, and Spanish), Petr Plecháč asks whether versification features like rhythm patterns and types of rhyme can help determine authorship. He then tests his findings on two unsolved literary mysteries. In the first, Plecháč distinguishes the parts of the Elizabethan verse play The Two Noble Kinsmen written by William Shakespeare from those written by his coauthor, John Fletcher. In the second, he seeks to solve a case of suspected forgery: how authentic was a group of poems first published as the work of the nineteenth-century Russian author Gavriil Stepanovich Batenkov? This book of poetic investigation should appeal to literary sleuths the world over.


Author(s):  
Kunal Parikh ◽  
Tanvi Makadia ◽  
Harshil Patel

Dengue is unquestionably one of the biggest health concerns in India and many other developing countries, and unfortunately many people have lost their lives to it. Every year, approximately 390 million dengue infections occur around the world, of which about 500,000 are severe and around 25,000 result in death. Many factors influence the spread of dengue, such as temperature, humidity, precipitation, inadequate public health infrastructure, and others. In this paper, we propose a method to perform predictive analytics on a dengue dataset using KNN, a machine-learning algorithm. This analysis would help predict future cases and could save many lives.
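A minimal, self-contained sketch of the KNN classification the abstract proposes. The feature set (temperature, humidity, precipitation) follows the factors the abstract lists, but the rows, thresholds, and labels are synthetic placeholders, not the paper's dataset:

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training points under Euclidean distance."""
    neighbours = sorted(train, key=lambda row: dist(row[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical rows: (temperature °C, humidity %, precipitation mm) -> label.
train = [
    ((32, 85, 200), "outbreak"),
    ((31, 80, 180), "outbreak"),
    ((33, 90, 220), "outbreak"),
    ((22, 40, 10), "no outbreak"),
    ((20, 35, 5), "no outbreak"),
    ((24, 45, 20), "no outbreak"),
]
prediction = knn_predict(train, (30, 82, 190))
```

In practice the features would be standardized first, since precipitation in millimetres otherwise dominates the distance computation.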


2018 ◽  
Vol 12 ◽  
pp. 85-98
Author(s):  
Bojan Kostadinov ◽  
Mile Jovanov ◽  
Emil Stankov

Data collection and machine learning are changing the world. Whether it is medicine, sports, or education, companies and institutions are investing a lot of time and money in systems that gather, process, and analyse data. Likewise, to improve competitiveness, many countries are changing their educational policy to support STEM disciplines. It is therefore important to put effort into using various data sources to help students succeed in STEM. In this paper, we present a platform that can analyse students' activity on various contest and e-learning systems, combine and process the data, and then present it in ways that are easy to understand. This in turn enables teachers and organizers to recognize talented and hardworking students, identify issues, and motivate students to practice and work on areas where they are weaker.


2021 ◽  
pp. 1-4
Author(s):  
Mathieu D'Aquin ◽  
Stefan Dietze

The 29th ACM International Conference on Information and Knowledge Management (CIKM) was held online from the 19th to the 23rd of October 2020. CIKM is an annual computer science conference focused on research at the intersection of information retrieval, machine learning, databases, and semantic and knowledge-based technologies. Since it was first held in the United States in 1992, 28 conferences have been hosted in 9 countries around the world.


Author(s):  
Salman Bin Naeem ◽  
Maged N. Kamel Boulos

Low digital health literacy affects large percentages of populations around the world and is a direct contributor to the spread of COVID-19-related online misinformation (together with bots). The ease and ‘viral’ nature of social media sharing further complicate the situation. This paper provides a quick overview of the magnitude of the problem of COVID-19 misinformation on social media, its devastating effects, and its intricate relation to digital health literacy. The main strategies, methods and services that can be used to detect and prevent the spread of COVID-19 misinformation, including machine learning-based approaches, health literacy guidelines, checklists, mythbusters and fact-checkers, are then briefly reviewed. Given the complexity of the COVID-19 infodemic, it is very unlikely that any of these approaches or tools will be fully effective alone in stopping the spread of COVID-19 misinformation. Instead, a mixed, synergistic approach, combining the best of these strategies, methods, and services together, is highly recommended in tackling online health misinformation, and mitigating its negative effects in COVID-19 and future pandemics. Furthermore, techniques and tools should ideally focus on evaluating both the message (information content) and the messenger (information author/source) and not just rely on assessing the latter as a quick and easy proxy for the trustworthiness and truthfulness of the former. Surveying and improving population digital health literacy levels are also essential for future infodemic preparedness.


2021 ◽  
Vol 1 ◽  
pp. 1755-1764
Author(s):  
Rongyan Zhou ◽  
Julie Stal-Le Cardinal

Industry 4.0 is a great opportunity and a tremendous challenge for every sector of society. Our study combines complex network and qualitative methods to analyze Industry 4.0 macroeconomic issues and the global supply chain, enriching qualitative analysis and machine learning in macroscopic and strategic research. Unsupervised complex graph network models are used to explore how Industry 4.0 reshapes the world. Based on the in-degree and out-degree of the weighted and unweighted edges of each node, combined with grouping results based on unsupervised learning, our study shows that the cooperation groups of Industry 4.0 differ from previous traditional alliances. Macroeconomic issues are also studied. Finally, strongly cohesive groups are identified and recommendations for businesspeople and policymakers are proposed.
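The weighted in-degree and out-degree statistics the study builds its grouping on can be sketched as a simple pass over a directed edge list. The countries and edge weights below are invented placeholders, not the study's data:

```python
from collections import defaultdict

# Hypothetical weighted cooperation edges: (source, target, weight).
edges = [
    ("Germany", "China", 5.0),
    ("China", "Germany", 4.0),
    ("USA", "China", 3.0),
    ("Germany", "USA", 2.0),
]

# Weighted in-/out-degree per node; the unweighted variants would
# simply add 1 per edge instead of the edge weight.
in_deg, out_deg = defaultdict(float), defaultdict(float)
for src, dst, w in edges:
    out_deg[src] += w
    in_deg[dst] += w
```

These per-node degree vectors would then feed an unsupervised grouping step (e.g. a community detection or clustering algorithm) to surface the cohesive cooperation groups the abstract mentions.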

