Can Linguistic Predictors Detect Fraudulent Financial Filings?

2010 ◽  
Vol 7 (1) ◽  
pp. 25-46 ◽  
Author(s):  
Sunita Goel ◽  
Jagdish Gangolly ◽  
Sue R. Faerman ◽  
Ozlem Uzuner

ABSTRACT: Extensive research has been done on the analytical and empirical examination of financial data in annual reports to detect fraud; however, there is scant research on the analysis of text in annual reports to detect fraud. The basic premise of this research is that there are clues hidden in the text that can be detected to determine the likelihood of fraud. In this research, we examine both the verbal content and the presentation style of the qualitative portion of the annual reports using natural language processing tools and explore linguistic features that distinguish fraudulent annual reports from nonfraudulent annual reports. Our results indicate that employment of linguistic features is an effective means for detecting fraud. We were able to improve the prediction accuracy of our fraud detection model from initial baseline results of 56.75 percent accuracy, using a “bag of words” approach, to 89.51 percent accuracy when we incorporated linguistically motivated features inspired by our informed reasoning and domain knowledge.
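
For concreteness, a minimal sketch of the kind of "bag of words" baseline the abstract mentions is given below, using scikit-learn. The report texts, labels, and choice of classifier (logistic regression) are illustrative assumptions, not the authors' actual pipeline.

```python
# Hedged sketch of a bag-of-words fraud-detection baseline.
# Texts, labels, and the classifier are placeholder assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

reports = [
    "revenue grew modestly amid challenging market conditions",
    "our strong controls ensure accurate and transparent reporting",
    "aggressive growth targets were met through strategic adjustments",
    "the audit committee reviewed all material transactions",
]
labels = [1, 0, 1, 0]  # 1 = fraudulent, 0 = non-fraudulent (placeholders)

# Each report becomes a vector of raw word counts ("bag of words").
pipeline = make_pipeline(
    CountVectorizer(lowercase=True),
    LogisticRegression(max_iter=1000),
)

# Cross-validated accuracy, analogous to the baseline figure reported above.
scores = cross_val_score(pipeline, reports, labels, cv=2, scoring="accuracy")
print("mean accuracy:", scores.mean())
```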

Much research has examined the analytical and empirical portions of annual reports to detect fraud. Annual reports describe a company's activities over the year, and analyzing them can reveal whether a company is in crisis or operating soundly. This research uses data obtained from the text of annual reports to estimate the probability that a report is fraudulent. The verbal content of the reports is analyzed with natural language processing tools to extract linguistic features that distinguish fraudulent from non-fraudulent financial reports. A set of 60 annual reports was used for the study: 30 labelled as fraudulent and 30 as non-fraudulent. Fraudulent reports were identified from reported fraud cases; non-fraudulent counterparts were drawn from a different company, or from the same company in a year with no reported case. Features were selected using a wrapper-method search algorithm. A Multi-Layer Perceptron (MLP) neural network was used to classify the data, achieving an accuracy of 85.1%. Support Vector Machine (SVM), Logistic Regression, Naïve Bayes, and Random Forest classifiers were also evaluated to identify the best-performing algorithm. The performance of all techniques used in this paper is analyzed and presented in terms of accuracy, precision, recall, F1 score, TN rate, and FN rate.
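
The evaluation step described above can be sketched as follows. The feature matrix is simulated and the MLP size and split are assumptions; the point is how accuracy, precision, recall, F1, TN rate, and FN rate are all derived from the confusion matrix.

```python
# Hedged sketch: MLP classification of 60 reports plus the metrics above.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

rng = np.random.default_rng(0)
X = rng.random((60, 10))           # 60 reports x 10 linguistic features (simulated)
y = np.array([1] * 30 + [0] * 30)  # 30 fraud, 30 non-fraud labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)

tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
print("accuracy :", accuracy_score(y_te, pred))
print("precision:", precision_score(y_te, pred, zero_division=0))
print("recall   :", recall_score(y_te, pred, zero_division=0))
print("F1       :", f1_score(y_te, pred, zero_division=0))
print("TN rate  :", tn / (tn + fp))  # true negatives among actual negatives
print("FN rate  :", fn / (fn + tp))  # false negatives among actual positives
```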


2021 ◽  
Vol 33 (5) ◽  
pp. 42-73
Author(s):  
Mohammad Kamel Daradkeh

With the proliferation of big data and business analytics practices, data storytelling has gained increasing importance as an effective means of communicating analytical insights to a target audience to support decision-making and improve business performance. However, there is limited empirical understanding of the relationship between data storytelling competency, decision-making quality, and business performance. Drawing on the resource-based view (RBV), this study develops and validates the concept of data storytelling competency as a multidimensional construct consisting of data quality, story quality, storytelling tool quality, storyteller skills, and storyteller domain knowledge. It also develops a mediation model to examine the relationship between data storytelling competency and business performance, and whether this relationship is mediated by decision-making quality. Based on an empirical analysis of data collected from business analytics practitioners, the results of this study reveal that data storytelling competency is positively linked to business performance, and that this relationship is partially mediated by decision-making quality. These results provide a theoretical basis for further investigation of possible antecedents and consequences of data storytelling competency. They also offer guidance for practitioners on how to leverage data storytelling capabilities in business analytics practice to improve decision-making and business performance.
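
As a rough illustration of the mediation logic tested in the study (competency to decision-making quality to performance), the sketch below estimates indirect and direct effects with two OLS regressions on simulated data. The variables and the product-of-coefficients approach are assumptions, not the authors' exact procedure.

```python
# Hedged sketch of a simple mediation analysis on simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
competency = rng.normal(size=200)                                # X
decision_quality = 0.5 * competency + rng.normal(size=200)       # mediator M
performance = 0.4 * decision_quality + 0.2 * competency \
              + rng.normal(size=200)                             # outcome Y

# Path a: competency -> decision-making quality
a = sm.OLS(decision_quality, sm.add_constant(competency)).fit().params[1]

# Paths b (mediator) and c' (direct effect), estimated jointly
X = sm.add_constant(np.column_stack([decision_quality, competency]))
fit = sm.OLS(performance, X).fit()
b, c_direct = fit.params[1], fit.params[2]

print("indirect effect (a*b):", a * b)
print("direct effect (c'):   ", c_direct)  # nonzero -> partial mediation
```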


2021 ◽  
Vol 13 ◽  
Author(s):  
Aparna Balagopalan ◽  
Benjamin Eyre ◽  
Jessica Robin ◽  
Frank Rudzicz ◽  
Jekaterina Novikova

Introduction: Research related to the automatic detection of Alzheimer's disease (AD) is important, given the high prevalence of AD and the high cost of traditional diagnostic methods. Since AD significantly affects the content and acoustics of spontaneous speech, natural language processing and machine learning provide promising techniques for reliably detecting AD. There has been a recent proliferation of classification models for AD, but these vary in the datasets used, model types, and training and testing paradigms. In this study, we compare and contrast the performance of two common approaches for automatic AD detection from speech on the same, well-matched dataset, to determine the advantages of using domain knowledge vs. pre-trained transfer models.

Methods: Audio recordings and corresponding manually transcribed speech transcripts of a picture description task administered to 156 demographically matched older adults, 78 with Alzheimer's disease (AD) and 78 cognitively intact (healthy), were classified using machine learning and natural language processing as "AD" or "non-AD." The audio was acoustically enhanced and post-processed to improve the quality of the speech recordings as well as to control for variation caused by recording conditions. Two approaches were used for classification of these speech samples: (1) using domain knowledge: extracting an extensive set of clinically relevant linguistic and acoustic features derived from speech and transcripts, based on prior literature; and (2) using transfer learning and leveraging large pre-trained machine learning models: using transcript representations that are automatically derived from state-of-the-art pre-trained language models, by fine-tuning Bidirectional Encoder Representations from Transformers (BERT)-based sequence classification models.

Results: We compared the utility of speech transcript representations obtained from recent natural language processing models (i.e., BERT) to more clinically interpretable language-feature-based methods. Both the feature-based approaches and the fine-tuned BERT models significantly outperformed the baseline linguistic model using a small set of linguistic features, demonstrating the importance of extensive linguistic information for detecting cognitive impairments related to AD. We observed that fine-tuned BERT models numerically outperformed feature-based approaches on the AD detection task, but the difference was not statistically significant. Our main contribution is the observation that, when trained on the same, demographically balanced dataset and tested on independent, unseen data, both domain knowledge and pre-trained linguistic models have good predictive performance for detecting AD based on speech. It is notable that linguistic information alone achieves comparable, and even numerically better, performance than models including both acoustic and linguistic features here. We also try to shed light on the inner workings of the more black-box natural language processing model by performing an interpretability analysis, and find that attention weights reveal interesting patterns, such as higher attribution to more important information content units in the picture description task, as well as to pauses and filler words.

Conclusion: This approach supports the value of well-performing machine learning and linguistically focused processing techniques to detect AD from speech, and highlights the need to compare model performance on carefully balanced datasets, using consistent training parameters and independent test datasets, in order to determine the best-performing predictive model.
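
A minimal sketch of the second approach, fine-tuning a BERT-based sequence classifier on transcripts, might look like the following with the Hugging Face transformers library. The transcripts, labels, and hyperparameters are placeholders, not the study's configuration.

```python
# Hedged sketch: fine-tuning BERT for binary AD / non-AD classification.
import torch
from transformers import (BertTokenizer, BertForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

transcripts = ["the boy is on the stool reaching for the cookie jar",
               "the woman is washing dishes by the open window"]
labels = [1, 0]  # 1 = AD, 0 = healthy (illustrative placeholders)

enc = tokenizer(transcripts, truncation=True, padding=True,
                return_tensors="pt")

class TranscriptDataset(torch.utils.data.Dataset):
    """Wraps tokenized transcripts and labels for the Trainer API."""
    def __len__(self):
        return len(labels)
    def __getitem__(self, i):
        item = {k: v[i] for k, v in enc.items()}
        item["labels"] = torch.tensor(labels[i])
        return item

args = TrainingArguments(output_dir="out", num_train_epochs=3,
                         per_device_train_batch_size=2)
Trainer(model=model, args=args,
        train_dataset=TranscriptDataset()).train()
```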


Complexity ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-16
Author(s):  
Aoshuang Ye ◽  
Lina Wang ◽  
Run Wang ◽  
Wenqi Wang ◽  
Jianpeng Ke ◽  
...  

The social network has become the primary medium of rumor propagation, and manual identification of rumors is extremely time-consuming and laborious, so it is crucial to identify rumors automatically. Machine learning technology is widely implemented in the identification and detection of misinformation on social networks. However, traditional machine learning methods rely heavily on feature engineering and domain knowledge, and their ability to learn temporal features is insufficient. Furthermore, the features used by deep learning methods based on natural language processing are heavily limited. Therefore, it is of great significance and practical value to study rumor detection methods that are independent of feature engineering and can effectively aggregate heterogeneous features to adapt to complex and variable social networks. In this paper, a deep neural network (DNN)-based feature aggregation modeling method is proposed, which makes full use of the propagation pattern features and text content features of social network events without feature engineering and domain knowledge. The experimental results show that the feature aggregation model achieves 94.4% accuracy, the best performance among recent works.
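
The paper's architecture is not reproduced here, but a generic sketch of feature aggregation in the sense described, learning separate representations for text content and propagation patterns and fusing them for classification, could look like this in PyTorch. All dimensions and layer choices are assumptions.

```python
# Hedged sketch of heterogeneous feature aggregation for rumor detection.
import torch
import torch.nn as nn

class FeatureAggregationNet(nn.Module):
    def __init__(self, text_dim=300, prop_dim=20, hidden=64):
        super().__init__()
        # Separate branches learn representations for each feature type.
        self.text_branch = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU())
        self.prop_branch = nn.Sequential(nn.Linear(prop_dim, hidden), nn.ReLU())
        # Fused representation is classified as rumor / non-rumor.
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2))

    def forward(self, text_feats, prop_feats):
        fused = torch.cat([self.text_branch(text_feats),
                           self.prop_branch(prop_feats)], dim=-1)
        return self.classifier(fused)

model = FeatureAggregationNet()
logits = model(torch.randn(8, 300), torch.randn(8, 20))  # batch of 8 events
```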


Author(s):  
Vittoria Cuteri ◽  
Giulia Minori ◽  
Gloria Gagliardi ◽  
Fabio Tamburini ◽  
Elisabetta Malaspina ◽  
...  

Abstract
Purpose: Attention has recently been paid to Clinical Linguistics for the detection and support of clinical conditions. Many works have been published on the "linguistic profile" of various clinical populations, but very few papers have been devoted to linguistic changes in patients with eating disorders. Patients with Anorexia Nervosa (AN) share similar psychological features, such as disturbances in self-perceived body image, inflexible and obsessive thinking, and anxious or depressive traits. We hypothesize that these characteristics can result in altered linguistic patterns and be detected using Natural Language Processing tools.
Methods: We enrolled 51 young participants from December 2019 to February 2020 (age range: 14–18): 17 girls with a clinical diagnosis of AN and 34 normal-weighted peers, matched by gender, age and educational level. Participants in each group were asked to produce three written texts (around 10–15 lines long). A rich set of linguistic features was extracted from the text samples, and the statistical significance of each feature in pinpointing the pathological process was measured.
Results: Comparison between the two groups showed several linguistic indexes to be statistically significant, with syntactic reduction as the most relevant trait of AN productions. In particular, the following features emerged as statistically significant in distinguishing AN girls from their normal-weighted peers: the length of the sentences, the complexity of the noun phrase, and the global syntactic complexity. This peculiar pattern of linguistic erosion may be due to the severe metabolic impairment that also affects the central nervous system in AN.
Conclusion: These preliminary data show the existence of linguistic parameters that may serve as linguistic markers of AN. However, the analysis of a bigger cohort, still ongoing, is needed to consolidate this assumption.
Level of evidence: III, evidence obtained from case–control analytic studies.
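
To illustrate the kind of analysis described, the sketch below computes one such index (mean sentence length) and tests the group difference for significance. The texts are placeholders and the choice of a Mann-Whitney U test is an assumption.

```python
# Hedged sketch: one linguistic index plus a group significance test.
from scipy import stats

def mean_sentence_length(text):
    """Average number of words per sentence (naive splitting)."""
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    return sum(len(s.split()) for s in sentences) / len(sentences)

an_texts = ["I eat. I sleep.", "It is cold. I go home."]             # placeholders
control_texts = ["Yesterday we all walked to the old market together."]  # placeholder

an_scores = [mean_sentence_length(t) for t in an_texts]
ctrl_scores = [mean_sentence_length(t) for t in control_texts]

# Mann-Whitney U is a reasonable choice for small, non-normal samples.
u, p = stats.mannwhitneyu(an_scores, ctrl_scores, alternative="two-sided")
print(f"U = {u}, p = {p:.3f}")
```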


2021 ◽  
pp. 016555152110077
Author(s):  
Sulong Zhou ◽  
Pengyu Kan ◽  
Qunying Huang ◽  
Janet Silbernagel

Natural disasters cause significant damage, casualties and economic losses. Twitter has been used to support prompt disaster response and management because people tend to communicate and spread information on public social media platforms during disaster events. To retrieve real-time situational awareness (SA) information from tweets, the most effective way to mine the text is natural language processing (NLP). Among the advanced NLP models, the supervised approach can classify tweets into different categories to gain insight and leverage useful SA information from social media data. However, high-performing supervised models require domain knowledge to specify categories and involve costly labelling tasks. This research proposes a guided latent Dirichlet allocation (LDA) workflow to investigate temporal latent topics from tweets during a recent disaster event, the 2020 Hurricane Laura. By integrating prior knowledge, a coherence model, LDA topic visualisation and validation from official reports, our guided approach reveals that most tweets contain several latent topics during the 10-day period of Hurricane Laura. This result indicates that state-of-the-art supervised models have not fully utilised tweet information because they assign each tweet only a single label. In contrast, our model can not only identify emerging topics during different disaster events but also provide multilabel references for the classification schema. In addition, our results can help responders, stakeholders and the general public to quickly identify and extract SA information so that they can adopt timely response strategies and wisely allocate resources during hurricane events.
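
One step of such a workflow, fitting an LDA topic model to tokenized tweets and scoring it with a coherence model, can be sketched with gensim as below. The seeded (guided) priors and visualisation steps are omitted, and the tweets are placeholders.

```python
# Hedged sketch: LDA over tweets plus coherence scoring.
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

tweets = [["hurricane", "laura", "landfall", "louisiana"],
          ["power", "outage", "shelter", "help"],
          ["donate", "relief", "supplies", "volunteers"]]

dictionary = Dictionary(tweets)
corpus = [dictionary.doc2bow(t) for t in tweets]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=0)

# Topic coherence (c_v) guides the choice of num_topics in the workflow.
coherence = CoherenceModel(model=lda, texts=tweets, dictionary=dictionary,
                           coherence="c_v").get_coherence()
print("coherence:", coherence)

# Each tweet gets a topic distribution, i.e. potentially multiple labels.
print(lda.get_document_topics(corpus[0]))
```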


2021 ◽  
pp. 1063293X2098297
Author(s):  
Ivar Örn Arnarsson ◽  
Otto Frost ◽  
Emil Gustavsson ◽  
Mats Jirstrand ◽  
Johan Malmqvist

Product development companies collect data in the form of Engineering Change Requests for logged design issues, tests, and product iterations. These documents are rich in unstructured data (e.g. free text). Previous research affirms that product developers find that current IT systems lack the capabilities to accurately retrieve relevant documents containing unstructured data. In this research, we demonstrate a method using Natural Language Processing and document clustering algorithms to find structurally or contextually related documents in databases of Engineering Change Request documents. The aim is to radically decrease the time needed to effectively search for related engineering documents, organize search results, and create labeled clusters from these documents by utilizing Natural Language Processing algorithms. A domain knowledge expert at the case company evaluated the results and confirmed that the algorithms we applied managed to find relevant document clusters for the queries tested.
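
A plausible minimal version of this method, TF-IDF vectors over Engineering Change Request texts followed by k-means clustering with top terms as cluster labels, is sketched below. The texts, the cluster count, and the labelling heuristic are assumptions.

```python
# Hedged sketch: cluster ECR documents and label clusters by top terms.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

ecr_texts = ["bracket cracked during vibration test",
             "software timeout in brake controller",
             "bracket fastener torque out of spec"]  # placeholder documents

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(ecr_texts)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Label each cluster with its highest-weight terms.
terms = vectorizer.get_feature_names_out()
for c in range(2):
    top = [terms[i] for i in km.cluster_centers_[c].argsort()[::-1][:3]]
    print(f"cluster {c}: {top}")
```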


2015 ◽  
Vol 53 (5) ◽  
pp. 932-956 ◽  
Author(s):  
Han Lin ◽  
Saixing Zeng ◽  
Hanyang Ma ◽  
Hongquan Chen

Purpose – The purpose of this paper is to develop a better understanding of the mechanisms by which symbolic commitment to self-regulation influences corporate environmental performance through the adoption of substantive actions.
Design/methodology/approach – Using a sample of Chinese listed private firms in manufacturing sectors, this paper empirically investigates whether and how corporate symbolic commitment to environmental self-regulation really improves the consequences of corporate activities with respect to environmental issues in the current Chinese context. A moderated mediation analysis is employed to test the hypotheses and examine the relationships proposed in the research framework.
Findings – The authors argue that making a commitment to environmental self-regulation can motivate firms to implement effective means of being green. The intriguing and robust results show that firms with higher-ranking environmental commitment are more likely to use political connections to obtain resources (green subsidies) and thereby improve environmental performance.
Practical implications – The results of this study provide a snapshot of the mechanism linking symbolic promises and real outcomes.
Originality/value – The authors theorize about and test both direct and indirect effects of commitment to self-regulation on real outcomes, which provides empirical evidence for the incipient but growing understanding of self-regulation.


Author(s):  
Ángela Almela ◽  
Gema Alcaraz-Mármol ◽  
Arancha García-Pinar ◽  
Clara Pallejá

In this paper, the methods for developing a database of Spanish writing that can be used for forensic linguistic research are presented, including our data collection procedures. Specifically, the main instrument used for data collection has been translated into Spanish and adapted from Chaski (2001). It consists of ten tasks, by means of which the subjects are asked to write formal and informal texts about different topics. To date, 93 undergraduates from Spanish universities and a group of prisoners convicted of gender-based abuse have participated in the study. A twofold analysis has been performed, since the data collected have been approached from a semantic and a morphosyntactic perspective. Regarding the semantic analysis, psycholinguistic categories have been used, many of them taken from the LIWC dictionary (Pennebaker et al., 2001). In order to obtain a more comprehensive depiction of the linguistic data, some other ad-hoc categories have been created, based on the corpus itself, using a double-check method for their validation so as to ensure inter-rater reliability. Furthermore, as regards the morphosyntactic analysis, the natural language processing tool ALIAS TATTLER is being developed for Spanish. Results show that it is possible to differentiate abusers from non-abusers with high accuracy based on linguistic features.
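
The inter-rater reliability check mentioned above can be quantified, for example, with Cohen's kappa over two annotators' category assignments, as in the sketch below; the categories shown are placeholders.

```python
# Hedged sketch: inter-rater agreement via Cohen's kappa.
from sklearn.metrics import cohen_kappa_score

rater_a = ["anger", "neutral", "anger", "sadness", "neutral"]
rater_b = ["anger", "neutral", "sadness", "sadness", "neutral"]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement
```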


Data ◽  
2020 ◽  
Vol 5 (3) ◽  
pp. 60
Author(s):  
Nasser Alshammari ◽  
Saad Alanazi

This article outlines a novel data descriptor that provides the Arabic natural language processing community with a dataset dedicated to named entity recognition tasks for diseases. The dataset comprises more than 60 thousand words, which were annotated manually by two independent annotators using the inside–outside (IO) annotation scheme. To ensure the reliability of the annotation process, the inter-annotator agreement rate was calculated, reaching 95.14%. Due to the lack of research efforts in the literature dedicated to studying Arabic multi-annotation schemes, a distinguishing and novel aspect of this dataset is the inclusion of six additional annotation schemes that will bridge the gap by allowing researchers to explore and compare the effects of these schemes on the performance of Arabic named entity recognizers. These annotation schemes are IOE, IOB, BIES, IOBES, IE, and BI. Additionally, five linguistic features, including part-of-speech tags, stopwords, gazetteers, lexical markers, and the presence of the definite article, are provided for each record in the dataset.
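
For illustration, converting the dataset's IO tags to the IOB scheme amounts to prefixing the first token of each entity span with B-, as in the hedged sketch below; the tokens and the "DIS" tag are hypothetical examples, not the dataset's actual labels.

```python
# Hedged sketch: IO -> IOB annotation scheme conversion.
def io_to_iob(tags):
    """Rewrite IO tags so each entity span starts with a B- tag."""
    iob, prev = [], "O"
    for tag in tags:
        if tag != "O" and prev != tag:
            iob.append("B-" + tag.split("-", 1)[-1])  # span start
        elif tag != "O":
            iob.append("I-" + tag.split("-", 1)[-1])  # span continuation
        else:
            iob.append("O")
        prev = tag
    return iob

tokens = ["patient", "has", "type", "2", "diabetes"]
io_tags = ["O", "O", "I-DIS", "I-DIS", "I-DIS"]
print(io_to_iob(io_tags))  # ['O', 'O', 'B-DIS', 'I-DIS', 'I-DIS']
```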

