Comparing tagging suggestion models on discrete corpora

2020 ◽  
Vol 16 (2) ◽  
pp. 201-221
Author(s):  
Bojan Bozic ◽  
Andre Rios ◽  
Sarah Jane Delany

Purpose This paper aims to investigate the methods for the prediction of tags on a textual corpus that describes diverse data sets based on short messages; as an example, the authors demonstrate the usage of methods based on hotel staff inputs in a ticketing system as well as the publicly available StackOverflow corpus. The aim is to improve the tagging process and find the most suitable method for suggesting tags for a new text entry. Design/methodology/approach The paper consists of two parts: exploration of existing sample data, which includes statistical analysis and visualisation of the data to provide an overview, and evaluation of tag prediction approaches. The authors have included different approaches from different research fields to cover a broad spectrum of possible solutions. As a result, the authors have tested a machine learning model for multi-label classification (using gradient boosting), a statistical approach (using frequency heuristics) and three similarity-based classification approaches (nearest centroid, k-nearest neighbours (k-NN) and naive Bayes). The experiment that compares the approaches uses recall to measure the quality of results. Finally, the authors provide a recommendation of the modelling approach that produces the best accuracy in terms of tag prediction on the sample data. Findings The authors have calculated the performance of each method against the test data set by measuring recall. The authors show recall for each method with different features (except for frequency heuristics, which does not provide the option to add additional features) for the dmbook pro and StackOverflow data sets. k-NN clearly provides the best recall. As k-NN turned out to provide the best results, the authors have performed further experiments with values of k from 1–10. This helped the authors to observe the impact of the number of neighbours used on the performance and to identify the best value for k.
Originality/value The value and originality of the paper are given by extensive experiments with several methods from different domains. The authors have used probabilistic methods, such as naive Bayes, statistical methods, such as frequency heuristics, and similarity approaches, such as k-NN. Furthermore, the authors have produced results on an industrial-scale data set that has been provided by a company and used directly in their project, as well as a community-based data set with a large amount of data and dimensionality. The study results can be used to select a model based on diverse corpora for a specific use case, taking into account advantages and disadvantages when applying the model to your data.
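
The evaluation loop described in this abstract (k-NN tag suggestion scored by recall for k from 1 to 10) can be illustrated with a minimal sketch. It is not the authors' implementation: the bag-of-words features, the cosine similarity, the vote-based tag ranking, the `n_tags` cut-off and the toy ticket messages are all assumptions made for illustration.

```python
from collections import Counter
import math


def vectorize(text):
    """Bag-of-words term counts for a short message (assumed feature choice)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def suggest_tags(text, train, k, n_tags=3):
    """Suggest up to n_tags tags by voting among the k most similar messages."""
    query = vectorize(text)
    neighbours = sorted(
        train, key=lambda ex: cosine(query, vectorize(ex[0])), reverse=True
    )[:k]
    votes = Counter(tag for _, tags in neighbours for tag in tags)
    return [tag for tag, _ in votes.most_common(n_tags)]


def recall(suggested, actual):
    """Fraction of the true tags that appear among the suggestions."""
    return len(set(suggested) & set(actual)) / len(actual) if actual else 0.0


# Hypothetical toy data standing in for ticketing-system messages.
TRAIN = [
    ("broken shower in room 12", {"maintenance", "bathroom"}),
    ("shower leaking water on floor", {"maintenance", "bathroom"}),
    ("guest requests extra towels", {"housekeeping"}),
    ("extra pillows and towels for room 7", {"housekeeping"}),
    ("wifi not working in lobby", {"it"}),
]
TEST = [
    ("shower broken again", {"maintenance", "bathroom"}),
    ("need more towels", {"housekeeping"}),
]


def mean_recall(k):
    """Average recall over the test messages for a given neighbour count."""
    scores = [recall(suggest_tags(text, TRAIN, k), tags) for text, tags in TEST]
    return sum(scores) / len(scores)


# Sweep k from 1 to 10, mirroring the range reported in the abstract.
for k in range(1, 11):
    print(k, round(mean_recall(k), 3))
```

On a real corpus the sweep would be run on a held-out test set, and the value of k with the highest mean recall would be selected.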

2017 ◽  
Vol 30 (3) ◽  
pp. 235-247 ◽  
Author(s):  
Alison Leary ◽  
Barbara Tomai ◽  
Adrian Swift ◽  
Andrew Woodward ◽  
Keith Hurst

Purpose Despite the generation of mass data by the nursing workforce, determining the impact of the contribution to patient safety remains challenging. Several cross-sectional studies have indicated a relationship between staffing and safety. The purpose of this paper is to uncover possible associations and explore if a deeper understanding of relationships between staffing and other factors such as safety could be revealed within routinely collected national data sets. Design/methodology/approach Two longitudinal routinely collected data sets consisting of 30 years of UK nurse staffing data and seven years of National Health Service (NHS) benchmark data such as survey results, safety and other indicators were used. A correlation matrix was built and a linear correlation operation was applied (Pearson product-moment correlation coefficient). Findings A number of associations were revealed within both the UK staffing data set and the NHS benchmarking data set. However, the challenges of using these data sets soon became apparent. Practical implications Staff time and effort are required to collect these data. The limitations of these data sets include inconsistent data collection and quality. The mode of data collection and the itemset collected should be reviewed to generate a data set with robust clinical application. Originality/value This paper revealed that relationships are likely to be complex and non-linear; however, the main contribution of the paper is the identification of the limitations of routinely collected data. Much time and effort is expended in collecting this data; however, its validity, usefulness and method of routine national data collection appear to require re-examination.
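
As a rough illustration of the correlation-matrix step mentioned above, the following sketch computes Pearson product-moment coefficients between every pair of columns. The column names and values are hypothetical and are not taken from the staffing or benchmark data sets.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when a series is constant
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Nested dict holding the coefficient for every pair of named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical yearly indicators, for illustration only.
data = {
    "nurses_per_bed": [1.0, 1.2, 1.1, 1.4, 1.5],
    "reported_incidents": [40, 35, 38, 30, 28],
    "survey_score": [3.1, 3.4, 3.2, 3.9, 4.0],
}

matrix = correlation_matrix(data)
for row_name, row in matrix.items():
    print(row_name, {name: round(value, 2) for name, value in row.items()})
```

Because the coefficient only captures linear association, a value near zero does not rule out the complex, non-linear relationships the abstract suspects.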


2016 ◽  
Vol 8 (3) ◽  
pp. 282-297 ◽  
Author(s):  
Faizul Haque ◽  
Rehnuma Shahid

Purpose This paper examines the effect of ownership structure on bank risk-taking and performance in emerging economies by using India as a case study. Design/methodology/approach We use the generalised method of moments (GMM) estimation technique to analyse an unbalanced panel data set covering 217 bank-year observations from 2008 to 2011. Findings Overall, our study results suggest that government ownership is positively associated with default risk and negatively related to bank profitability. Interestingly, we find foreign ownership having a positive effect on default risk and a negative effect on profitability among the listed commercial banks. The effect of ownership concentration on bank risk-taking and profitability appears to be statistically insignificant. Originality/value This study is among the first to consider the impact of ownership on bank risk-taking and profitability from an emerging economy perspective. It also addresses the problem of endogenous relationships among ownership, risk-taking and performance of a bank. This study is likely to have implications for policymakers in undertaking regulatory reforms relating to ownership, risk management and banking sector stability.


Author(s):  
Athanasios Theofilatos ◽  
Cong Chen ◽  
Constantinos Antoniou

Although there are numerous studies examining the impact of real-time traffic and weather parameters on crash occurrence on freeways, to the best of the authors’ knowledge there are no studies which have compared the prediction performances of machine learning (ML) and deep learning (DL) models. The present study adds to current knowledge by comparing and validating ML and DL methods to predict real-time crash occurrence. To achieve this, real-time traffic and weather data from Attica Tollway in Greece were linked with historical crash data. The total data set was split into training/estimation (75%) and validation (25%) subsets, which were then standardized. First, the ML and DL prediction models were trained/estimated using the training data set. Afterwards, the models were compared on the basis of their performance metrics (accuracy, sensitivity, specificity, and area under curve, or AUC) on the test set. The models considered were k-nearest neighbor, Naïve Bayes, decision tree, random forest, support vector machine, shallow neural network, and, lastly, deep neural network. Overall, the DL model seems to be more appropriate, because it outperformed all other candidate models. More specifically, the DL model managed to achieve a balanced performance among all metrics compared with other models (total accuracy = 68.95%, sensitivity = 0.521, specificity = 0.77, AUC = 0.641). It is surprising though that the Naïve Bayes model achieved a good performance despite being far less complex than other models. The study findings are particularly useful, because they provide a first insight into performance of ML and DL models.
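
The four performance metrics named in this abstract can be sketched as follows. This is an illustrative calculation only: the labels, the predicted scores and the 0.5 decision threshold are hypothetical, and the AUC is computed with the pairwise-ranking definition rather than with any particular library.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, where 1 marks a crash case."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Share of all cases classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def sensitivity(tp, fn):
    """True positive rate: share of crash cases that were detected."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: share of non-crash cases correctly rejected."""
    return tn / (tn + fp)


def auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and model scores, for illustration only.
y_true = [1, 1, 0, 0, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.3, 0.6, 0.2, 0.7, 0.1, 0.5]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print("accuracy", accuracy(tp, tn, fp, fn))
print("sensitivity", sensitivity(tp, fn))
print("specificity", specificity(tn, fp))
print("auc", auc(y_true, scores))
```

Reporting sensitivity and specificity alongside accuracy matters for crash prediction, where a model can reach high accuracy while missing most of the rare crash cases.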


2020 ◽  
Vol 7 (1) ◽  
pp. 15
Author(s):  
Fakhriza Firdaus ◽  
Ali Mukhlis

A number of studies on bankruptcy prediction have applied data mining techniques to find useful knowledge automatically, based on management's assessment of the risks that exist in a company. In the process of risk assessment, the actual knowledge of experts is still considered important because the experts' predictions depend on their effectiveness. This study aims to extract information from qualitative bankruptcy data sets so that it can be used as a learning resource for improving the management of a company. The technique used in this study is classification using the Naive Bayes algorithm. Naive Bayes uses probabilistic predictions to classify data.
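
How Naive Bayes turns qualitative ratings into a probabilistic classification can be sketched as below. This is a minimal categorical Naive Bayes with Laplace smoothing, not the study's code; the three attributes, the P/A/N ratings and the six training rows are hypothetical.

```python
from collections import Counter, defaultdict
import math


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class value frequencies for each attribute."""
    class_counts = Counter(labels)
    value_counts = defaultdict(lambda: defaultdict(Counter))
    values_seen = defaultdict(set)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[label][i][value] += 1
            values_seen[i].add(value)
    return class_counts, value_counts, values_seen


def predict(model, row):
    """Return the class with the highest log posterior under the independence assumption."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for i, value in enumerate(row):
            # Laplace smoothing keeps unseen values from zeroing the product.
            numerator = value_counts[label][i][value] + 1
            denominator = count + len(values_seen[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical ratings (P = positive, A = average, N = negative) for
# (industrial_risk, management_risk, financial_flexibility).
rows = [
    ("N", "N", "N"),
    ("N", "A", "N"),
    ("A", "N", "N"),
    ("P", "P", "P"),
    ("P", "A", "P"),
    ("A", "P", "P"),
]
labels = ["bankrupt", "bankrupt", "bankrupt",
          "non-bankrupt", "non-bankrupt", "non-bankrupt"]

model = train_naive_bayes(rows, labels)
print(predict(model, ("N", "N", "N")))
print(predict(model, ("P", "P", "P")))
```

Working in log space avoids numerical underflow when many attribute probabilities are multiplied together.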


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Quentin Grossetti ◽  
Cedric du Mouza ◽  
Nicolas Travers ◽  
Camelia Constantin

Purpose Social network platforms are considered today as a major means of communication. Their success leads to an unprecedented growth of user-generated content; therefore, finding interesting content for a given user has become a major issue. Recommender systems allow these platforms to personalize individual experience and increase user engagement by filtering messages according to user interest and/or neighborhood. Recent research results show, however, that this content personalization might increase the echo chamber effect and create filter bubbles that restrain the diversity of opinions regarding the recommended content. Design/methodology/approach The purpose of this paper is to present a thorough study of communities on a large Twitter data set that quantifies the effect of recommender systems on users’ behavior by creating filter bubbles. The authors further propose their community-aware model (CAM) that counters the impact of different recommender systems on information consumption. Findings The study results show that filter bubble effects concern up to 10% of users and that the proposed model, based on the similarities between communities, enhances recommendations. Originality/value The authors proposed the CAM approach, which relies on similarities between communities to re-rank lists of recommendations to weaken the filter bubble effect for these users.


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Yahya Albalawi ◽  
Jim Buckley ◽  
Nikola S. Nikolov

Abstract This paper presents a comprehensive evaluation of data pre-processing and word embedding techniques in the context of Arabic document classification in the domain of health-related communication on social media. We evaluate 26 text pre-processings applied to Arabic tweets within the process of training a classifier to identify health-related tweets. For this task we use the (traditional) machine learning classifiers KNN, SVM, Multinomial NB and Logistic Regression. Furthermore, we report experimental results with the deep learning architectures BLSTM and CNN for the same text classification problem. Since word embeddings are more typically used as the input layer in deep networks, in the deep learning experiments we evaluate several state-of-the-art pre-trained word embeddings with the same text pre-processing applied. To achieve these goals, we use two data sets: one for both training and testing, and another for testing the generality of our models only. Our results point to the conclusion that only four out of the 26 pre-processings improve the classification accuracy significantly. For the first data set of Arabic tweets, we found that Mazajak CBOW pre-trained word embeddings as the input to a BLSTM deep network led to the most accurate classifier with F1 score of 89.7%. For the second data set, Mazajak Skip-Gram pre-trained word embeddings as the input to BLSTM led to the most accurate model with F1 score of 75.2% and accuracy of 90.7% compared to F1 score of 90.8% achieved by Mazajak CBOW for the same architecture but with lower accuracy of 70.89%. Our results also show that the performance of the best of the traditional classifiers we trained is comparable to the deep learning methods on the first data set, but significantly worse on the second data set.


2015 ◽  
Vol 42 (12) ◽  
pp. 1071-1089
Author(s):  
Alan Chan ◽  
Bruce G. Fawcett ◽  
Shu-Kam Lee

Purpose – Church giving and attendance are two important indicators of church health and performance. In the literature, they are usually understood to be simultaneously determined. The purpose of this paper is to estimate whether there is a sustainable church congregation size using Wintrobe’s (1998) dictatorship model. The authors also want to examine the impact of youth and adult ministry. Design/methodology/approach – Using the data collected from among Canadian Baptist churches in Eastern Canada, this study investigates the factors affecting the level of the two indicators by the panel-instrumental variable technique. Applying Wintrobe’s (1998) political economy model on dictatorship, the equilibrium level of worship attendance and giving is predicted. Findings – Through various simulation exercises, the actual church congregation size is found to be approximately 50 percent of the predicted value, implying inefficiency and misallocation of church resources. The paper concludes with insights on effective ways church leaders can allocate scarce resources to promote growth within churches. Originality/value – To the authors’ knowledge, they are the only researchers to have obtained permission from the Atlantic Canada Baptist Convention to use its mega data set on church giving and congregation sizes. The authors are also applying a theoretical model on dictatorship to religious/not-for-profit organizations.


2018 ◽  
Vol 10 (4) ◽  
pp. 467-477 ◽  
Author(s):  
Matvey S. Oborin ◽  
Irina Kozhushkina ◽  
Tatyana Gvarliani ◽  
Nikolay Ivanov

Purpose This paper aims to analyze the modern problems and the main trends of development of the health-improving tourism sector in the southern part of Russia and to identify significant factors in overcoming the complex challenges related to specific socio-economic conditions in the study area. Design/methodology/approach The material that served as the basis of the study comprises statistical data from the Southern Federal District and its subjects, as well as data about the development of tourism infrastructure on the official websites of governments, Ministry of Tourism and the population of the Southern Federal District. This information was systematized from a number of perspectives, including identification of the chronology of health-improving tourism infrastructure development in the chosen territory, as well as the advantages and disadvantages in this area. Based on the results of the study, the authors also developed some recommendations to overcome existing inactive trends in the field of health tourism. Findings This paper sheds light on the understanding of the challenges and changes that took place in the resort agglomerations of the south of Russia in terms of current issues and those that must be addressed in the coming years. It was concluded that health tourism in the south of Russia has old traditions based on the natural resource potential of territories that are included in the composition of the Southern Federal District. At the same time, the authors came to the conclusion that, unfortunately, not all resort agglomerations are fully utilized. Furthermore, some historic resorts were not well maintained by local authorities and have suffered more recently because of lack of investment.
At present, the financial results of health resorts and others related to health-improving tourism are precarious as most operations are unprofitable, and so complex decisions are needed to address the underlying problem of resource optimization because of the important social and economic role of the cities in this region. They have special natural and resource potential and preserve traditions related to health-improving tourism. Research limitations/implications The paper provides a conceptual analysis based on limited empirical data combined with some directions for further research. Originality/value The paper attempts to reveal the impact of social, economic and geopolitical factors, both negative and positive, on the development of the health-improving tourism segment, restructuring of the Russian tourism market and the emergence of promising opportunities and new directions for development. The findings also provide insights for practitioners and researchers, and the tourism industry can draw on this analysis to guide the development of strategy, increase investment attractiveness, make more effective use of the natural resource potential and maintain pressure on government partners to provide support to tourism.

