scholarly journals Considering Performance in the Automated and Manual Coding of Sociolinguistic Variables: Lessons From Variable (ING)

2021 ◽  
Vol 4 ◽  
Author(s):  
Tyler Kendall ◽  
Charlotte Vaughn ◽  
Charlie Farrington ◽  
Kaylynn Gunter ◽  
Jaidan McLean ◽  
...  

Impressionistic coding of sociolinguistic variables like English (ING), the alternation between pronunciations like talkin' and talking, has been a central part of the analytic workflow in studies of language variation and change for over a half-century. Techniques for automating the measurement and coding for a wide range of sociolinguistic data have been on the rise over recent decades but procedures for coding some features, especially those without clearly defined acoustic correlates like (ING), have lagged behind others, such as vowels and sibilants. This paper explores computational methods for automatically coding variable (ING) in speech recordings, examining the use of automatic speech recognition procedures related to forced alignment (using the Montreal Forced Aligner) as well as supervised machine learning algorithms (linear and radial support vector machines, and random forests). Considering the automated coding of pronunciation variables like (ING) raises broader questions for sociolinguistic methods, such as how much different human analysts agree in their impressionistic codes for such variables and what data might act as the “gold standard” for training and testing of automated procedures. This paper explores several of these considerations in automated, and manual, coding of sociolinguistic variables and provides baseline performance data for automated and manual coding methods. We consider multiple ways of assessing algorithms' performance, including agreement with human coders, as well as the impact on the outcome of an analysis of (ING) that includes linguistic and social factors. Our results show promise for automated coding methods but also highlight that variability in results should be expected even with careful human coded data. All data for our study come from the public Corpus of Regional African American Language and code and derivative datasets (including our hand-coded data) are available with the paper.

2021 ◽  
Vol 11 (10) ◽  
pp. 4443
Author(s):  
Rokas Štrimaitis ◽  
Pavel Stefanovič ◽  
Simona Ramanauskaitė ◽  
Asta Slotkienė

Financial area analysis is not limited to enterprise performance analysis. It is worth analyzing as wide an area as possible to obtain the full impression of a specific enterprise. News website content is a datum source that expresses the public’s opinion on enterprise operations, status, etc. Therefore, it is worth analyzing the news portal article text. Sentiment analysis in English texts and financial area texts exist, and are accurate, the complexity of Lithuanian language is mostly concentrated on sentiment analysis of comment texts, and does not provide high accuracy. Therefore in this paper, the supervised machine learning model was implemented to assign sentiment analysis on financial context news, gathered from Lithuanian language websites. The analysis was made using three commonly used classification algorithms in the field of sentiment analysis. The hyperparameters optimization using the grid search was performed to discover the best parameters of each classifier. All experimental investigations were made using the newly collected datasets from four Lithuanian news websites. The results of the applied machine learning algorithms show that the highest accuracy is obtained using a non-balanced dataset, via the multinomial Naive Bayes algorithm (71.1%). The other algorithm accuracies were slightly lower: a long short-term memory (71%), and a support vector machine (70.4%).


Electronics ◽  
2021 ◽  
Vol 10 (14) ◽  
pp. 1701
Author(s):  
Theodor Panagiotakopoulos ◽  
Sotiris Kotsiantis ◽  
Georgios Kostopoulos ◽  
Omiros Iatrellis ◽  
Achilles Kameas

Over recent years, massive open online courses (MOOCs) have gained increasing popularity in the field of online education. Students with different needs and learning specificities are able to attend a wide range of specialized online courses offered by universities and educational institutions. As a result, large amounts of data regarding students’ demographic characteristics, activity patterns, and learning performances are generated and stored in institutional repositories on a daily basis. Unfortunately, a key issue in MOOCs is low completion rates, which directly affect student success. Therefore, it is of utmost importance for educational institutions and faculty members to find more effective practices and reduce non-completer ratios. In this context, the main purpose of the present study is to employ a plethora of state-of-the-art supervised machine learning algorithms for predicting student dropout in a MOOC for smart city professionals at an early stage. The experimental results show that accuracy exceeds 96% based on data collected during the first week of the course, thus enabling effective intervention strategies and support actions.


Author(s):  
Noor Asyikin Sulaiman ◽  
Md Pauzi Abdullah ◽  
Hayati Abdullah ◽  
Muhammad Noorazlan Shah Zainudin ◽  
Azdiana Md Yusop

Air conditioning system is a complex system and consumes the most energy in a building. Any fault in the system operation such as cooling tower fan faulty, compressor failure, damper stuck, etc. could lead to energy wastage and reduction in the system’s coefficient of performance (COP). Due to the complexity of the air conditioning system, detecting those faults is hard as it requires exhaustive inspections. This paper consists of two parts; i) to investigate the impact of different faults related to the air conditioning system on COP and ii) to analyse the performances of machine learning algorithms to classify those faults. Three supervised learning classifier models were developed, which were deep learning, support vector machine (SVM) and multi-layer perceptron (MLP). The performances of each classifier were investigated in terms of six different classes of faults. Results showed that different faults give different negative impacts on the COP. Also, the three supervised learning classifier models able to classify all faults for more than 94%, and MLP produced the highest accuracy and precision among all.


Materials ◽  
2022 ◽  
Vol 15 (2) ◽  
pp. 647
Author(s):  
Meijun Shang ◽  
Hejun Li ◽  
Ayaz Ahmad ◽  
Waqas Ahmad ◽  
Krzysztof Adam Ostrowski ◽  
...  

Environment-friendly concrete is gaining popularity these days because it consumes less energy and causes less damage to the environment. Rapid increases in the population and demand for construction throughout the world lead to a significant deterioration or reduction in natural resources. Meanwhile, construction waste continues to grow at a high rate as older buildings are destroyed and demolished. As a result, the use of recycled materials may contribute to improving the quality of life and preventing environmental damage. Additionally, the application of recycled coarse aggregate (RCA) in concrete is essential for minimizing environmental issues. The compressive strength (CS) and splitting tensile strength (STS) of concrete containing RCA are predicted in this article using decision tree (DT) and AdaBoost machine learning (ML) techniques. A total of 344 data points with nine input variables (water, cement, fine aggregate, natural coarse aggregate, RCA, superplasticizers, water absorption of RCA and maximum size of RCA, density of RCA) were used to run the models. The data was validated using k-fold cross-validation and the coefficient correlation coefficient (R2), mean square error (MSE), mean absolute error (MAE), and root mean square error values (RMSE). However, the model’s performance was assessed using statistical checks. Additionally, sensitivity analysis was used to determine the impact of each variable on the forecasting of mechanical properties.


2007 ◽  
Vol 6 (2) ◽  
pp. 1-17
Author(s):  
K J Raman ◽  
A Marcus

Raman and Marcus (2007) have studied the impact of Automation in Public sector Banks as per the reflections of bank customers and bank officials belong to Chennai region. Marcus (2006) studied the public sector banks with special reference to selected branches in Chennai city and the perception of customers due to inception of Information Technology in the banking sector. Customers vary in their perception on information technology. In reality, customers are not against for automation and IT inception. The main concern for them is the delay in transaction due to technical snag and the increased cost of operation due to automation. Most of the customers have accounts in the private sector banks and they are well informed about the new development and up gradation that is happening in those banks. The customers believe that crores of money is being spent by the banks in the name of developing software, training the staff in IT and in providing better ambience to keep abreast with the private banks, but the ultimate outcome of which is not noteworthy.The present study is based on the reflections of 674 bank customers of the public sector banks who have various types of bank accounts in the branches of Chennai city. Branches of public sector banks in Chennai city, consisting of 19 nationalized banks and State Bank of India with its 7 Associates were covered in the process. A wide range of customers through various domains of banking operations have been studied to identify their overall perception.


2021 ◽  
Vol 9 (1) ◽  
pp. 215-223
Author(s):  
Prateek Mishra, Dr.Anurag Sharma, Dr. Abhishek Badholia

Adverse effects can be seen in the entire body due to the major disorders known as Diabetes. The risk of dangers like diabetic nephropathy, cardiac stroke and other disorders can increase severally because of the undiagnosed diabetes. Around the globe the people are suffering from this disease. For a healthy life early detection of this disease is very curtail. As the causes of the diabetes is increasing rapidly this disease might turn up as a reason for worldwide concern. Increasing the chances for a more accurate predictions and form experiences automatic learning by computational method may be provided by Machine Learning (ML). With the help of R data manipulation tool for trends development and with risk factor patterns detection in Pima Indian diabetes technique of machine learning is been used in the current researches. With the use of R data manipulation tool analysis and development five different predictive models is done for the categorization of patients into diabetic and non- diabetic.  supervised machine learning algorithms namely multifactor dimensionality reduction (MDR), k-nearest neighbor (k-NN), artificial neural network (ANN) radial basis function (RBF) kernel support vector machine and linear kernel support vector machine (SVM-linear) are used for this purpose.


2009 ◽  
Vol 15 (2) ◽  
pp. 241-271 ◽  
Author(s):  
YAOYONG LI ◽  
KALINA BONTCHEVA ◽  
HAMISH CUNNINGHAM

AbstractSupport Vector Machines (SVM) have been used successfully in many Natural Language Processing (NLP) tasks. The novel contribution of this paper is in investigating two techniques for making SVM more suitable for language learning tasks. Firstly, we propose an SVM with uneven margins (SVMUM) model to deal with the problem of imbalanced training data. Secondly, SVM active learning is employed in order to alleviate the difficulty in obtaining labelled training data. The algorithms are presented and evaluated on several Information Extraction (IE) tasks, where they achieved better performance than the standard SVM and the SVM with passive learning, respectively. Moreover, by combining SVMUM with the active learning algorithm, we achieve the best reported results on the seminars and jobs corpora, which are benchmark data sets used for evaluation and comparison of machine learning algorithms for IE. In addition, we also evaluate the token based classification framework for IE with three different entity tagging schemes. In comparison to previous methods dealing with the same problems, our methods are both effective and efficient, which are valuable features for real-world applications. Due to the similarity in the formulation of the learning problem for IE and for other NLP tasks, the two techniques are likely to be beneficial in a wide range of applications1.


Author(s):  
Soma Das ◽  
Pooja Rai ◽  
Sanjay Chatterji

The tremendous increase in the growth of misinformation in news articles has the potential threat for the adverse effects on society. Hence, the detection of misinformation in news data has become an appealing research area. The task of annotating and detecting distorted news article sentences is the immediate need in this research direction. Therefore, an attempt has been made to formulate the legitimacy annotation guideline followed by annotation and detection of the legitimacy in Bengali e-papers. The sentence-level manual annotation of Bengali news has been carried out in two levels, namely “Level-1 Shallow Level Classification” and “Level-2 Deep Level Classification” based on semantic properties of Bengali sentences. The tagging of 1,300 anonymous Bengali e-paper sentences has been done using the formulated guideline-based tags for both levels. The validation of the annotation guideline has been done by applying benchmark supervised machine learning algorithms using the lexical feature, syntactic feature, domain-specific feature, and Level-2 specific feature in both levels. Performance evaluation of these classifiers is done in terms of Accuracy, Precision, Recall, and F-Measure. In both levels, Support Vector Machine outperforms other benchmark classifiers with an accuracy of 72% and 65% in Level-1 and Level-2, respectively.


2019 ◽  
Vol 1 (1) ◽  
pp. 384-399 ◽  
Author(s):  
Thais de Toledo ◽  
Nunzio Torrisi

The Distributed Network Protocol (DNP3) is predominately used by the electric utility industry and, consequently, in smart grids. The Peekaboo attack was created to compromise DNP3 traffic, in which a man-in-the-middle on a communication link can capture and drop selected encrypted DNP3 messages by using support vector machine learning algorithms. The communication networks of smart grids are a important part of their infrastructure, so it is of critical importance to keep this communication secure and reliable. The main contribution of this paper is to compare the use of machine learning techniques to classify messages of the same protocol exchanged in encrypted tunnels. The study considers four simulated cases of encrypted DNP3 traffic scenarios and four different supervised machine learning algorithms: Decision tree, nearest-neighbor, support vector machine, and naive Bayes. The results obtained show that it is possible to extend a Peekaboo attack over multiple substations, using a decision tree learning algorithm, and to gather significant information from a system that communicates using encrypted DNP3 traffic.


Author(s):  
Ruopeng Xie ◽  
Jiahui Li ◽  
Jiawei Wang ◽  
Wei Dai ◽  
André Leier ◽  
...  

Abstract Virulence factors (VFs) enable pathogens to infect their hosts. A wealth of individual, disease-focused studies has identified a wide variety of VFs, and the growing mass of bacterial genome sequence data provides an opportunity for computational methods aimed at predicting VFs. Despite their attractive advantages and performance improvements, the existing methods have some limitations and drawbacks. Firstly, as the characteristics and mechanisms of VFs are continually evolving with the emergence of antibiotic resistance, it is more and more difficult to identify novel VFs using existing tools that were previously developed based on the outdated data sets; secondly, few systematic feature engineering efforts have been made to examine the utility of different types of features for model performances, as the majority of tools only focused on extracting very few types of features. By addressing the aforementioned issues, the accuracy of VF predictors can likely be significantly improved. This, in turn, would be particularly useful in the context of genome wide predictions of VFs. In this work, we present a deep learning (DL)-based hybrid framework (termed DeepVF) that is utilizing the stacking strategy to achieve more accurate identification of VFs. Using an enlarged, up-to-date dataset, DeepVF comprehensively explores a wide range of heterogeneous features with popular machine learning algorithms. Specifically, four classical algorithms, including random forest, support vector machines, extreme gradient boosting and multilayer perceptron, and three DL algorithms, including convolutional neural networks, long short-term memory networks and deep neural networks are employed to train 62 baseline models using these features. In order to integrate their individual strengths, DeepVF effectively combines these baseline models to construct the final meta model using the stacking strategy. Extensive benchmarking experiments demonstrate the effectiveness of DeepVF: it achieves a more accurate and stable performance compared with baseline models on the benchmark dataset and clearly outperforms state-of-the-art VF predictors on the independent test. Using the proposed hybrid ensemble model, a user-friendly online predictor of DeepVF (http://deepvf.erc.monash.edu/) is implemented. Furthermore, its utility, from the user’s viewpoint, is compared with that of existing toolkits. We believe that DeepVF will be exploited as a useful tool for screening and identifying potential VFs from protein-coding gene sequences in bacterial genomes.


Sign in / Sign up

Export Citation Format

Share Document