Analysis of authorship attribution technique on Urdu tweets empowered by machine learning

The process of identifying the author of an anonymous document from a set of candidate authors is called authorship attribution. As the world trends towards shorter communications, online criminal activities such as phishing and bullying are also increasing. Criminals hide their identities behind screen names and connect anonymously, which makes tracing them difficult during the cybercrime investigation process. This paper evaluates current authorship attribution techniques at the linguistic level and compares their accuracy in English and Urdu contexts, using an LDA model with an n-gram technique and cosine similarity applied to stylometric features to identify the writing style of a specific author. Two datasets are used, Urdu_TD and English_TD, containing 180 Urdu and English tweets per author respectively. The overall accuracy achieved is 84.52% on Urdu_TD and 93.17% on English_TD. The task is done without using any authorship labels.
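The abstract does not spell out its pipeline in detail, but its core attribution step, comparing stylometric n-gram profiles with cosine similarity, can be sketched as below. The English sample sentences, author names, and the choice of character trigrams are illustrative assumptions, not the paper's data:

```python
from collections import Counter
from math import sqrt

def char_ngrams(text, n=3):
    """Character n-grams capture sub-word stylistic habits."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def cosine_similarity(a, b):
    """Cosine similarity between two n-gram count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[g] * cb[g] for g in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

# Attribute an unknown tweet to the known author whose profile is most similar.
profiles = {
    "author_a": "the quick brown fox jumps over the lazy dog",
    "author_b": "colourless green ideas sleep furiously tonight",
}
unknown = "the lazy dog sleeps under the quick fox"
best = max(profiles, key=lambda a: cosine_similarity(
    char_ngrams(profiles[a]), char_ngrams(unknown)))
```

Because no labels are used, attribution reduces to a nearest-profile lookup; the paper's LDA component would additionally project documents into topic space before comparison.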

Author(s):  
Mubin Shoukat Tamboli ◽  
Rajesh Prasad

Authorship attribution is the task of identifying the writer of an unknown text and assigning it to a known writer. Each author's writing style is distinct and can be used for discrimination, and different parameters help capture such distinctions. When the writing samples collected for an author belong to a short time period, they can be used efficiently to identify an unknown sample. In this paper, the author identification problem is considered where writing samples are not available from the same time period; such evidence is collected over a long period of time. Character n-gram, word n-gram, and POS n-gram features are used to build the model, as they capture the writer's style in terms of both content and statistical characteristics. A support vector machine algorithm is applied for classification, and the experiments produced effective results. While discriminating among multiple authors, corpus selection and construction were the most tedious tasks, but they were implemented effectively. Accuracy was observed to vary with feature type: word and character n-grams showed better accuracy than POS n-grams.
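A minimal sketch of the n-gram-plus-SVM classification the abstract describes, using scikit-learn (the paper does not name a library, and the toy texts and author labels below are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus: two invented "authors" with distinct registers.
texts = [
    "I shall be most grateful for your kind reply",
    "kindly do the needful and revert at the earliest",
    "gonna grab some food, text me later ok",
    "lol ok text me when u get there",
]
labels = ["author_1", "author_1", "author_2", "author_2"]

# Character 2-3-grams as stylistic features; a word-level vectorizer
# (analyzer="word") or a POS-tag stream could be plugged in the same way.
model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(2, 3)),
    LinearSVC(),
)
model.fit(texts, labels)
```

For samples spread over a long time period, as in the paper, the training set would mix samples from different periods so the classifier learns period-stable style markers.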


2021 ◽  
pp. 1-11
Author(s):  
Carolina Martín-del-Campo-Rodríguez ◽  
Grigori Sidorov ◽  
Ildar Batyrshin

This paper presents a computational model for the unsupervised authorship attribution task based on a traditional machine learning scheme. An improvement over the state of the art is achieved by comparing different feature selection methods on the PAN17 author clustering dataset. To achieve this improvement, specific pre-processing and feature extraction methods were proposed, such as a method to separate tokens by type so that each is assigned to only one category. Similarly, special characters are treated as punctuation marks to improve the result obtained when applying typed character n-grams. A weighted cosine similarity measure is applied to improve the B³ F-score by reducing the vector values where attributes are exclusive. This measure is used to define distances between documents, which are then used by the clustering algorithm to perform authorship attribution.
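The abstract does not give the exact formula, but one plausible reading of "reducing the vector values where attributes are exclusive" is to down-weight features that occur in only one of the two documents before computing cosine similarity. The `exclusive_weight` parameter below is an assumed knob, not the paper's specification:

```python
from math import sqrt

def weighted_cosine(u, v, exclusive_weight=0.5):
    """Cosine similarity that down-weights features present in only
    one of the two documents (an illustrative reading of the paper)."""
    w = [1.0 if a > 0 and b > 0 else exclusive_weight for a, b in zip(u, v)]
    uw = [a * wi for a, wi in zip(u, w)]
    vw = [b * wi for b, wi in zip(v, w)]
    dot = sum(a * b for a, b in zip(uw, vw))
    denom = sqrt(sum(a * a for a in uw)) * sqrt(sum(b * b for b in vw))
    return dot / denom if denom else 0.0

# Two toy count vectors: features 0-1 are shared, 2-3 are exclusive.
doc_a = [3, 2, 0, 1]
doc_b = [2, 1, 4, 0]
sim = weighted_cosine(doc_a, doc_b)
```

Shrinking exclusive features shrinks the norms more than the dot product, so documents sharing attributes score higher than under plain cosine, which is the direction of improvement the abstract claims.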


Author(s):  
Ritu Banga ◽  
Akanksha Bhardwaj ◽  
Sheng-Lung Peng ◽  
Gulshan Shrivastava

This chapter gives comprehensive knowledge of various machine learning classifiers used to achieve authorship attribution (AA) on short texts, specifically tweets. The need for authorship identification stems from increasing crime on the internet, which breaches cyber ethics by raising the level of anonymity. AA of online messages has attracted interest from many research communities. Linguists and researchers have proposed many methods, both statistical and computational, to identify an author from their writing style. Various ways of extracting and selecting features on the basis of the dataset are reviewed. The authors focus on n-gram features, as they have proved very effective in identifying the true author from a given list of known authors. The study demonstrates that AA is achievable on small texts depending on the selection of features and methods, and also shows that the accuracy of the analysis changes with the combination of features. The authors find that character n-grams are good features for identifying the author but cannot yet identify the author on their own.
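Since the chapter's finding is that combinations of feature types outperform any single type, one way to sketch that is a feature extractor merging character and word n-grams into a single vector. The sample tweet, n sizes, and helper names are assumptions, not the chapter's code:

```python
from collections import Counter

def word_ngrams(tokens, n):
    """Contiguous word n-grams from a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tweet_features(tweet, char_n=3, word_n=2):
    """Combine character and word n-grams into one count vector;
    either half could be dropped to test a single feature type."""
    feats = Counter(tweet[i:i + char_n] for i in range(len(tweet) - char_n + 1))
    feats.update(word_ngrams(tweet.split(), word_n))
    return feats

feats = tweet_features("just landed in new york")
```

Feeding such merged vectors to the classifiers the chapter surveys makes the feature-combination comparison a matter of toggling which n-gram families are included.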


2021 ◽  
Author(s):  
Petr Plecháč

Contemporary stylometry uses various methods, including machine learning, to discover a poem’s author based on features like the frequencies of words and character n-grams. However, there is one potential textual fingerprint stylometry tends to ignore: versification, or the very making of language into verse. Using poetic texts in three languages (Czech, German, and Spanish), Petr Plecháč asks whether versification features like rhythm patterns and types of rhyme can help determine authorship. He then tests his findings on two unsolved literary mysteries. In the first, Plecháč distinguishes the parts of the Elizabethan verse play The Two Noble Kinsmen written by William Shakespeare from those written by his coauthor, John Fletcher. In the second, he seeks to solve a case of suspected forgery: how authentic was a group of poems first published as the work of the nineteenth-century Russian author Gavriil Stepanovich Batenkov? This book of poetic investigation should appeal to literary sleuths the world over.


Author(s):  
Kunal Parikh ◽  
Tanvi Makadia ◽  
Harshil Patel

Dengue is unquestionably one of the biggest health concerns in India and many other developing countries, and unfortunately many people have lost their lives to it. Every year, approximately 390 million dengue infections occur around the world, of which about 500,000 are severe and around 25,000 result in death. Many factors influence the spread of dengue, such as temperature, humidity, precipitation, inadequate public health infrastructure, and others. In this paper, we propose a method to perform predictive analytics on a dengue dataset using KNN, a machine-learning algorithm. This analysis would help predict future cases and could save many lives.
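A minimal, self-contained sketch of the KNN classification the abstract proposes. The feature set (temperature, humidity, precipitation) follows the factors the abstract lists, but the rows, thresholds, and labels are synthetic placeholders, not the paper's dataset:

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training points under Euclidean distance."""
    neighbours = sorted(train, key=lambda row: dist(row[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical rows: (temperature °C, humidity %, precipitation mm) -> label.
train = [
    ((32, 85, 200), "outbreak"),
    ((31, 80, 180), "outbreak"),
    ((33, 90, 220), "outbreak"),
    ((22, 40, 10), "no outbreak"),
    ((20, 35, 5), "no outbreak"),
    ((24, 45, 20), "no outbreak"),
]
prediction = knn_predict(train, (30, 82, 190))
```

In practice the features would be standardized first, since precipitation in millimetres otherwise dominates the distance computation.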


2018 ◽  
Vol 12 ◽  
pp. 85-98
Author(s):  
Bojan Kostadinov ◽  
Mile Jovanov ◽  
Emil Stankov

Data collection and machine learning are changing the world. Whether it is medicine, sports, or education, companies and institutions are investing a lot of time and money in systems that gather, process, and analyse data. Likewise, to improve competitiveness, many countries are changing their educational policy to support STEM disciplines. It is therefore important to put effort into using various data sources to help students succeed in STEM. In this paper, we present a platform that can analyse students' activity on various contest and e-learning systems, combine and process the data, and then present it in ways that are easy to understand. This in turn enables teachers and organizers to recognize talented and hardworking students, identify issues, and motivate students to practice and work on areas where they are weaker.


2021 ◽  
pp. 1-4
Author(s):  
Mathieu D'Aquin ◽  
Stefan Dietze

The 29th ACM International Conference on Information and Knowledge Management (CIKM) was held online from the 19th to the 23rd of October 2020. CIKM is an annual computer science conference focused on research at the intersection of information retrieval, machine learning, databases, and semantic and knowledge-based technologies. Since it was first held in the United States in 1992, 28 conferences have been hosted in 9 countries around the world.


Author(s):  
Salman Bin Naeem ◽  
Maged N. Kamel Boulos

Low digital health literacy affects large percentages of populations around the world and is a direct contributor to the spread of COVID-19-related online misinformation (together with bots). The ease and ‘viral’ nature of social media sharing further complicate the situation. This paper provides a quick overview of the magnitude of the problem of COVID-19 misinformation on social media, its devastating effects, and its intricate relation to digital health literacy. The main strategies, methods and services that can be used to detect and prevent the spread of COVID-19 misinformation, including machine learning-based approaches, health literacy guidelines, checklists, mythbusters and fact-checkers, are then briefly reviewed. Given the complexity of the COVID-19 infodemic, it is very unlikely that any of these approaches or tools will be fully effective alone in stopping the spread of COVID-19 misinformation. Instead, a mixed, synergistic approach, combining the best of these strategies, methods, and services together, is highly recommended in tackling online health misinformation, and mitigating its negative effects in COVID-19 and future pandemics. Furthermore, techniques and tools should ideally focus on evaluating both the message (information content) and the messenger (information author/source) and not just rely on assessing the latter as a quick and easy proxy for the trustworthiness and truthfulness of the former. Surveying and improving population digital health literacy levels are also essential for future infodemic preparedness.


2021 ◽  
Vol 1 ◽  
pp. 1755-1764
Author(s):  
Rongyan Zhou ◽  
Julie Stal-Le Cardinal

Industry 4.0 is a great opportunity and a tremendous challenge for every sector of society. Our study combines complex network and qualitative methods to analyze Industry 4.0 macroeconomic issues and the global supply chain, enriching qualitative analysis and machine learning in macroscopic and strategic research. Unsupervised complex graph network models are used to explore how Industry 4.0 reshapes the world. Based on the in-degree and out-degree of the weighted and unweighted edges of each node, combined with grouping results based on unsupervised learning, our study shows that the cooperation groups of Industry 4.0 differ from previous traditional alliances. Macroeconomic issues are also studied. Finally, strongly cohesive groups are identified and recommendations for businesspeople and policymakers are proposed.
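The weighted in-degree and out-degree statistics the study builds its grouping on can be sketched as a simple pass over a directed edge list. The countries and edge weights below are invented placeholders, not the study's data:

```python
from collections import defaultdict

# Hypothetical weighted cooperation edges: (source, target, weight).
edges = [
    ("Germany", "China", 5.0),
    ("China", "Germany", 4.0),
    ("USA", "China", 3.0),
    ("Germany", "USA", 2.0),
]

# Weighted in-/out-degree per node; the unweighted variants would
# simply add 1 per edge instead of the edge weight.
in_deg, out_deg = defaultdict(float), defaultdict(float)
for src, dst, w in edges:
    out_deg[src] += w
    in_deg[dst] += w
```

These per-node degree vectors would then feed an unsupervised grouping step (e.g. a community detection or clustering algorithm) to surface the cohesive cooperation groups the abstract mentions.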

