Stress Detection using Natural Language Processing and Machine Learning over social Interactions

Abstract Cyberspace is a vast soapbox for people to post anything that they witness in their day-to-day lives. Subsequently, it can be used as a very effective tool in detecting the stress levels of an individual based on the posts and comments shared by him/her on social networking platforms. We leverage large-scale datasets with tweets to successfully accomplish sentiment analysis with the aid of machine learning algorithms. We take the help of a capable deep learning pre-trained model called BERT to solve the problems which come with sentiment classification. The BERT model outperforms a lot of other well-known models for this job without any sophisticated architecture. We also adopted Latent Dirichlet Allocation which is an unsupervised machine learning method that’s skilled in scanning a group of documents, recognizing the word and phrase patterns within them, and gathering word groups and alike expressions that most precisely illustrate a set of documents. This helps us predict which topic is linked to the textual data. With the aid of the models suggested, we will be able to detect the emotion of users online. We are primarily working with Twitter data because Twitter is a website where people express their thoughts often. In conclusion, this proposal is for the well- being of one’s mental health. The results are evaluated using various metric at macro and micro level and indicate that the trained model detects the status of emotions bases on social interactions.

Download Full-text

SALTClass: classifying clinical short notes using background knowledge from unlabeled data

10.1101/801944 ◽

2019 ◽

Author(s):

Ayoub Bagheri ◽

Daniel Oberski ◽

Arjan Sammani ◽

Peter G.M. van der Heijden ◽

Folkert W. Asselbergs

Keyword(s):

Machine Learning ◽

Language Processing ◽

Text Classification ◽

Latent Dirichlet Allocation ◽

Machine Learning Algorithms ◽

Unlabeled Data ◽

Specific Information ◽

Short Text ◽

Link Type ◽

Python Package

AbstractBackgroundWith the increasing use of unstructured text in electronic health records, extracting useful related information has become a necessity. Text classification can be applied to extract patients’ medical history from clinical notes. However, the sparsity in clinical short notes, that is, excessively small word counts in the text, can lead to large classification errors. Previous studies demonstrated that natural language processing (NLP) can be useful in the text classification of clinical outcomes. We propose incorporating the knowledge from unlabeled data, as this may alleviate the problem of short noisy sparse text.ResultsThe software package SALTClass (short and long text classifier) is a machine learning NLP toolkit. It uses seven clustering algorithms, namely, latent Dirichlet allocation, K-Means, MiniBatchK-Means, BIRCH, MeanShift, DBScan, and GMM. Smoothing methods are applied to the resulting cluster information to enrich the representation of sparse text. For the subsequent prediction step, SALTClass can be used on either the original document-term matrix or in an enrichment pipeline. To this end, ten different supervised classifiers have also been integrated into SALTClass. We demonstrate the effectiveness of the SALTClass NLP toolkit in the identification of patients’ family history in a Dutch clinical cardiovascular text corpus from University Medical Center Utrecht, the Netherlands.ConclusionsThe considerable amount of unstructured short text in healthcare applications, particularly in clinical cardiovascular notes, has created an urgent need for tools that can parse specific information from text reports. Using machine learning algorithms for enriching short text can improve the representation for further applications.AvailabilitySALTClass can be downloaded as a Python package from Python Package Index (PyPI) website athttps://pypi.org/project/saltclassand from GitHub athttps://github.com/bagheria/saltclass.

Download Full-text

Implementing Artificial Intelligence in Predicting Metrics for Characterizing Laser Propagation in Atmospheric Turbulence

Journal of Fluids Engineering ◽

10.1115/1.4043706 ◽

2019 ◽

Vol 141 (12) ◽

Author(s):

Diego Alberto Lozano Jimenez ◽

V. M.Krushnarao Kotteda ◽

Vinod Kumar ◽

V. S. Rao Gudimetla

Keyword(s):

Machine Learning ◽

Laser Beam ◽

Atmospheric Turbulence ◽

Language Processing ◽

Large Scale ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Outer Scale ◽

Laser Propagation ◽

Turbulence Effects

The effects of a laser beam propagating through atmospheric turbulence are investigated using the phase screen approach. Turbulence effects are modeled by the Kolmogorov description of the energy cascade theory, and outer scale effect is implemented by the von Kármán refractive power spectral density. In this study, we analyze a plane wave propagating through varying atmospheric horizontal paths. An important consideration for the laser beam propagation of long distances is the random variations in the refractive index due to atmospheric turbulence. To characterize the random behavior, statistical analysis of the phase data and related metrics are examined at the output signal. We train three different machine learning algorithms in tensorflow library with the data at varying propagation lengths, outer scale lengths, and levels of turbulence intensity to predict statistical parameters that describe the atmospheric turbulence effects on laser propagation. tensorflow is an interface for demonstrating machine learning algorithms and an implementation for executing such algorithms on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets to large-scale distributed systems and thousands of computational devices such as GPU cards. The library contains a wide variety of algorithms including training and inference algorithms for deep neural network models. Therefore, it has been used for deploying machine learning systems in many fields including speech recognition, computer vision, natural language processing, and text mining.

Download Full-text

A Systematic Literature Review on Using Machine Learning Algorithms for Software Requirements Identification on Stack Overflow

Security and Communication Networks ◽

10.1155/2020/8830683 ◽

2020 ◽

Vol 2020 ◽

pp. 1-19

Author(s):

Arshad Ahmad ◽

Chong Feng ◽

Muzammil Khan ◽

Asif Khan ◽

Ayaz Ullah ◽

...

Keyword(s):

Machine Learning ◽

Literature Review ◽

Language Processing ◽

Systematic Literature Review ◽

Latent Dirichlet Allocation ◽

Machine Learning Algorithms ◽

Software Requirements ◽

Quality Systems ◽

Stack Overflow ◽

Requirements Identification

Context. The improvements made in the last couple of decades in the requirements engineering (RE) processes and methods have witnessed a rapid rise in effectively using diverse machine learning (ML) techniques to resolve several multifaceted RE issues. One such challenging issue is the effective identification and classification of the software requirements on Stack Overflow (SO) for building quality systems. The appropriateness of ML-based techniques to tackle this issue has revealed quite substantial results, much effective than those produced by the usual available natural language processing (NLP) techniques. Nonetheless, a complete, systematic, and detailed comprehension of these ML based techniques is considerably scarce. Objective. To identify or recognize and classify the kinds of ML algorithms used for software requirements identification primarily on SO. Method. This paper reports a systematic literature review (SLR) collecting empirical evidence published up to May 2020. Results. This SLR study found 2,484 published papers related to RE and SO. The data extraction process of the SLR showed that (1) Latent Dirichlet Allocation (LDA) topic modeling is among the widely used ML algorithm in the selected studies and (2) precision and recall are amongst the most commonly utilized evaluation methods for measuring the performance of these ML algorithms. Conclusion. Our SLR study revealed that while ML algorithms have phenomenal capabilities of identifying the software requirements on SO, they still are confronted with various open problems/issues that will eventually limit their practical applications and performances. Our SLR study calls for the need of close collaboration venture between the RE and ML communities/researchers to handle the open issues confronted in the development of some real world machine learning-based quality systems.

Download Full-text

Administrative Law in the Automated State

Daedalus ◽

10.1162/daed_a_01862 ◽

2021 ◽

Vol 150 (3) ◽

pp. 104-120

Author(s):

Cary Coglianese

Keyword(s):

Machine Learning ◽

Large Scale ◽

Learning Algorithms ◽

Administrative Law ◽

Machine Learning Algorithms ◽

Democratic Accountability ◽

Mathematical Properties ◽

The Status ◽

Scale Shift ◽

Data Driven Decisions

Abstract In the future, administrative agencies will rely increasingly on digital automation powered by machine learning algorithms. Can U.S. administrative law accommodate such a future? Not only might a highly automated state readily meet long-standing administrative law principles, but the responsible use of machine learning algorithms might perform even better than the status quo in terms of fulfilling administrative law's core values of expert decision-making and democratic accountability. Algorithmic governance clearly promises more accurate, data-driven decisions. Moreover, due to their mathematical properties, algorithms might well prove to be more faithful agents of democratic institutions. Yet even if an automated state were smarter and more accountable, it might risk being less empathic. Although the degree of empathy in existing human-driven bureaucracies should not be overstated, a large-scale shift to government by algorithm will pose a new challenge for administrative law: ensuring that an automated state is also an empathic one.

Download Full-text

Machine learning in medicine: a practical introduction to natural language processing

BMC Medical Research Methodology ◽

10.1186/s12874-021-01347-1 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Conrad J. Harrison ◽

Chris J. Sidey-Gibbons

Keyword(s):

Machine Learning ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Latent Dirichlet Allocation ◽

Mental Health Problems ◽

Characteristic Curve ◽

Machine Learning Algorithms ◽

Supervised Machine Learning ◽

Support Vector

Abstract Background Unstructured text, including medical records, patient feedback, and social media comments, can be a rich source of data for clinical research. Natural language processing (NLP) describes a set of techniques used to convert passages of written text into interpretable datasets that can be analysed by statistical and machine learning (ML) models. The purpose of this paper is to provide a practical introduction to contemporary techniques for the analysis of text-data, using freely-available software. Methods We performed three NLP experiments using publicly-available data obtained from medicine review websites. First, we conducted lexicon-based sentiment analysis on open-text patient reviews of four drugs: Levothyroxine, Viagra, Oseltamivir and Apixaban. Next, we used unsupervised ML (latent Dirichlet allocation, LDA) to identify similar drugs in the dataset, based solely on their reviews. Finally, we developed three supervised ML algorithms to predict whether a drug review was associated with a positive or negative rating. These algorithms were: a regularised logistic regression, a support vector machine (SVM), and an artificial neural network (ANN). We compared the performance of these algorithms in terms of classification accuracy, area under the receiver operating characteristic curve (AUC), sensitivity and specificity. Results Levothyroxine and Viagra were reviewed with a higher proportion of positive sentiments than Oseltamivir and Apixaban. One of the three LDA clusters clearly represented drugs used to treat mental health problems. A common theme suggested by this cluster was drugs taking weeks or months to work. Another cluster clearly represented drugs used as contraceptives. Supervised machine learning algorithms predicted positive or negative drug ratings with classification accuracies ranging from 0.664, 95% CI [0.608, 0.716] for the regularised regression to 0.720, 95% CI [0.664,0.776] for the SVM. Conclusions In this paper, we present a conceptual overview of common techniques used to analyse large volumes of text, and provide reproducible code that can be readily applied to other research studies using open-source software.

Download Full-text

Efficient Image Retrieval approach for Large-scale Chest X Ray data using Hand-Crafted Features and Machine Learning Algorithms

International Journal of Computer Sciences and Engineering ◽

10.26438/ijcse/v6i11.890896 ◽

2018 ◽

Vol 6 (11) ◽

pp. 890-896

Author(s):

Irene Getzi S ◽

D. Christopher Durairaj ◽

V Joseph Raj

Keyword(s):

Machine Learning ◽

Image Retrieval ◽

Large Scale ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

X Ray ◽

Chest X Ray

Download Full-text

Automated Data-Driven Generation of Personalized Pedagogical Interventions in Intelligent Tutoring Systems

International Journal of Artificial Intelligence in Education ◽

10.1007/s40593-021-00267-x ◽

2021 ◽

Author(s):

Ekaterina Kochmar ◽

Dung Do Vu ◽

Robert Belfer ◽

Varun Gupta ◽

Iulian Vlad Serban ◽

...

Keyword(s):

Machine Learning ◽

Student Performance ◽

Language Processing ◽

Intelligent Tutoring Systems ◽

Large Scale ◽

Intelligent Tutoring ◽

Performance Outcomes ◽

Data Driven ◽

Personalized Feedback ◽

Tutoring Systems

AbstractIntelligent tutoring systems (ITS) have been shown to be highly effective at promoting learning as compared to other computer-based instructional approaches. However, many ITS rely heavily on expert design and hand-crafted rules. This makes them difficult to build and transfer across domains and limits their potential efficacy. In this paper, we investigate how feedback in a large-scale ITS can be automatically generated in a data-driven way, and more specifically how personalization of feedback can lead to improvements in student performance outcomes. First, in this paper we propose a machine learning approach to generate personalized feedback in an automated way, which takes individual needs of students into account, while alleviating the need of expert intervention and design of hand-crafted rules. We leverage state-of-the-art machine learning and natural language processing techniques to provide students with personalized feedback using hints and Wikipedia-based explanations. Second, we demonstrate that personalized feedback leads to improved success rates at solving exercises in practice: our personalized feedback model is used in , a large-scale dialogue-based ITS with around 20,000 students launched in 2019. We present the results of experiments with students and show that the automated, data-driven, personalized feedback leads to a significant overall improvement of 22.95% in student performance outcomes and substantial improvements in the subjective evaluation of the feedback.

Download Full-text

Clinician checklist for assessing suitability of machine learning applications in healthcare

BMJ Health & Care Informatics ◽

10.1136/bmjhci-2020-100251 ◽

2021 ◽

Vol 28 (1) ◽

pp. e100251

Author(s):

Ian Scott ◽

Stacey Carter ◽

Enrico Coiera

Keyword(s):

Machine Learning ◽

Large Scale ◽

Clinical Decision Making ◽

Improve Patient Care ◽

Clinical Decision ◽

Routine Care ◽

Machine Learning Algorithms ◽

Clinical Settings ◽

Machine Learning Applications ◽

Key Issues

Machine learning algorithms are being used to screen and diagnose disease, prognosticate and predict therapeutic responses. Hundreds of new algorithms are being developed, but whether they improve clinical decision making and patient outcomes remains uncertain. If clinicians are to use algorithms, they need to be reassured that key issues relating to their validity, utility, feasibility, safety and ethical use have been addressed. We propose a checklist of 10 questions that clinicians can ask of those advocating for the use of a particular algorithm, but which do not expect clinicians, as non-experts, to demonstrate mastery over what can be highly complex statistical and computational concepts. The questions are: (1) What is the purpose and context of the algorithm? (2) How good were the data used to train the algorithm? (3) Were there sufficient data to train the algorithm? (4) How well does the algorithm perform? (5) Is the algorithm transferable to new clinical settings? (6) Are the outputs of the algorithm clinically intelligible? (7) How will this algorithm fit into and complement current workflows? (8) Has use of the algorithm been shown to improve patient care and outcomes? (9) Could the algorithm cause patient harm? and (10) Does use of the algorithm raise ethical, legal or social concerns? We provide examples where an algorithm may raise concerns and apply the checklist to a recent review of diagnostic imaging applications. This checklist aims to assist clinicians in assessing algorithm readiness for routine care and identify situations where further refinement and evaluation is required prior to large-scale use.

Download Full-text

57 Precision neoantigen discovery using novel algorithms and expanded HLA-ligandome datasets

Journal for ImmunoTherapy of Cancer ◽

10.1136/jitc-2020-sitc2020.0057 ◽

2020 ◽

Vol 8 (Suppl 3) ◽

pp. A62-A62

Author(s):

Dattatreya Mellacheruvu ◽

Rachel Pyke ◽

Charles Abbott ◽

Nick Phillips ◽

Sejal Desai ◽

...

Keyword(s):

Machine Learning ◽

Cell Lines ◽

Antigen Processing ◽

Large Scale ◽

Prediction Models ◽

K562 Cells ◽

Machine Learning Algorithms ◽

Training Data ◽

High Quality ◽

Tissue Samples

BackgroundAccurately identified neoantigens can be effective therapeutic agents in both adjuvant and neoadjuvant settings. A key challenge for neoantigen discovery has been the availability of accurate prediction models for MHC peptide presentation. We have shown previously that our proprietary model based on (i) large-scale, in-house mono-allelic data, (ii) custom features that model antigen processing, and (iii) advanced machine learning algorithms has strong performance. We have extended upon our work by systematically integrating large quantities of high-quality, publicly available data, implementing new modelling algorithms, and rigorously testing our models. These extensions lead to substantial improvements in performance and generalizability. Our algorithm, named Systematic HLA Epitope Ranking Pan Algorithm (SHERPA™), is integrated into the ImmunoID NeXT Platform®, our immuno-genomics and transcriptomics platform specifically designed to enable the development of immunotherapies.MethodsIn-house immunopeptidomic data was generated using stably transfected HLA-null K562 cells lines that express a single HLA allele of interest, followed by immunoprecipitation using W6/32 antibody and LC-MS/MS. Public immunopeptidomics data was downloaded from repositories such as MassIVE and processed uniformly using in-house pipelines to generate peptide lists filtered at 1% false discovery rate. Other metrics (features) were either extracted from source data or generated internally by re-processing samples utilizing the ImmunoID NeXT Platform.ResultsWe have generated large-scale and high-quality immunopeptidomics data by using approximately 60 mono-allelic cell lines that unambiguously assign peptides to their presenting alleles to create our primary models. Briefly, our primary ‘binding’ algorithm models MHC-peptide binding using peptide and binding pockets while our primary ‘presentation’ model uses additional features to model antigen processing and presentation. Both primary models have significantly higher precision across all recall values in multiple test data sets, including mono-allelic cell lines and multi-allelic tissue samples. To further improve the performance of our model, we expanded the diversity of our training set using high-quality, publicly available mono-allelic immunopeptidomics data. Furthermore, multi-allelic data was integrated by resolving peptide-to-allele mappings using our primary models. We then trained a new model using the expanded training data and a new composite machine learning architecture. The resulting secondary model further improves performance and generalizability across several tissue samples.ConclusionsImproving technologies for neoantigen discovery is critical for many therapeutic applications, including personalized neoantigen vaccines, and neoantigen-based biomarkers for immunotherapies. Our new and improved algorithm (SHERPA) has significantly higher performance compared to a state-of-the-art public algorithm and furthers this objective.

Download Full-text

Machine Learning for Dissimulating Reality

Proceedings ◽

10.3390/proceedings2021077017 ◽

2021 ◽

Vol 77 (1) ◽

pp. 17

Author(s):

Andrea Giussani

Keyword(s):

Machine Learning ◽

Language Processing ◽

Huge Amount ◽

Generative Adversarial Network ◽

Adversarial Network ◽

Technological Advances ◽

Textual Data ◽

Musical Scores ◽

Mathematical Formulas

In the last decade, advances in statistical modeling and computer science have boosted the production of machine-produced contents in different fields: from language to image generation, the quality of the generated outputs is remarkably high, sometimes better than those produced by a human being. Modern technological advances such as OpenAI’s GPT-2 (and recently GPT-3) permit automated systems to dramatically alter reality with synthetic outputs so that humans are not able to distinguish the real copy from its counteracts. An example is given by an article entirely written by GPT-2, but many other examples exist. In the field of computer vision, Nvidia’s Generative Adversarial Network, commonly known as StyleGAN (Karras et al. 2018), has become the de facto reference point for the production of a huge amount of fake human face portraits; additionally, recent algorithms were developed to create both musical scores and mathematical formulas. This presentation aims to stimulate participants on the state-of-the-art results in this field: we will cover both GANs and language modeling with recent applications. The novelty here is that we apply a transformer-based machine learning technique, namely RoBerta (Liu et al. 2019), to the detection of human-produced versus machine-produced text concerning fake news detection. RoBerta is a recent algorithm that is based on the well-known Bidirectional Encoder Representations from Transformers algorithm, known as BERT (Devlin et al. 2018); this is a bi-directional transformer used for natural language processing developed by Google and pre-trained over a huge amount of unlabeled textual data to learn embeddings. We will then use these representations as an input of our classifier to detect real vs. machine-produced text. The application is demonstrated in the presentation.

Download Full-text