Non-Functional Requirement Detection Using Machine Learning and Natural Language Processing

Author(s):  
Hazlina Shariff et al.

A key aspect of software quality is that the software operates as intended and meets user needs. A primary concern with non-functional requirements (NFRs) is that they are often neglected because the information describing them is hidden in documents. NFRs represent tacit knowledge about the system, and users usually find it hard to describe them, so NFRs tend to be absent from the elicitation process. The software engineer therefore has to act proactively and ask the user about software quality criteria so that the objectives of the requirements can be achieved. To overcome these problems, we use machine learning to detect indicator terms of NFRs in textual requirements, so that the software engineer can be reminded to elicit the missing NFRs. We developed a prototype tool that supports our approach by classifying textual requirements using supervised machine learning algorithms. A survey was conducted to evaluate the effectiveness of the prototype tool in detecting NFRs.
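As a minimal, hedged sketch of the kind of supervised classification such a prototype might perform, the snippet below flags likely non-functional requirements in requirement sentences; the tiny labelled set, TF-IDF features, and logistic regression model are illustrative assumptions, not the authors' actual tool.

```python
# Illustrative sketch: flagging likely non-functional requirements (NFRs)
# in textual requirements with a supervised classifier. The tiny labelled
# set and the choice of model are assumptions for demonstration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_sentences = [
    "The system shall respond to any query within 2 seconds.",  # NFR (performance)
    "All stored passwords must be encrypted.",                  # NFR (security)
    "The user can export the report as a PDF file.",            # FR
    "The administrator can add and remove user accounts.",      # FR
]
train_labels = ["NFR", "NFR", "FR", "FR"]

# TF-IDF over word uni- and bigrams captures indicator terms such as
# "within 2 seconds" or "encrypted" that signal quality attributes.
classifier = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
classifier.fit(train_sentences, train_labels)

new_requirement = "The application must be available 99.9% of the time."
print(classifier.predict([new_requirement]))  # ideally flags an NFR
```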

2019, Vol. 2 (1)
Author(s):  
Ari Z. Klein ◽  
Abeed Sarker ◽  
Davy Weissenbacher ◽  
Graciela Gonzalez-Hernandez

Abstract Social media has recently been used to identify and study a small cohort of Twitter users whose pregnancies with birth defect outcomes—the leading cause of infant mortality—could be observed via their publicly available tweets. In this study, we exploit social media on a larger scale by developing natural language processing (NLP) methods to automatically detect, among thousands of users, a cohort of mothers reporting that their child has a birth defect. We used 22,999 annotated tweets to train and evaluate supervised machine learning algorithms—feature-engineered and deep learning-based classifiers—that automatically distinguish tweets referring to the user’s pregnancy outcome from tweets that merely mention birth defects. Because 90% of the tweets merely mention birth defects, we experimented with under-sampling and over-sampling approaches to address this class imbalance. An SVM classifier achieved the best performance for the two positive classes: an F1-score of 0.65 for the “defect” class and 0.51 for the “possible defect” class. We deployed the classifier on 20,457 unlabeled tweets that mention birth defects, which helped identify 542 additional users for potential inclusion in our cohort. Contributions of this study include (1) NLP methods for automatically detecting tweets by users reporting their birth defect outcomes, (2) findings that an SVM classifier can outperform a deep neural network-based classifier for highly imbalanced social media data, (3) evidence that automatic classification can be used to identify additional users for potential inclusion in our cohort, and (4) a publicly available corpus for training and evaluating supervised machine learning algorithms.
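A rough sketch of the kind of pipeline described (TF-IDF features, random under-sampling of the majority class, and a linear SVM) might look like the following; the placeholder tweets, label names, and sampling routine are assumptions, not the study's released code or corpus.

```python
# Hedged sketch: linear SVM over TF-IDF features with simple random
# under-sampling of the majority "non-defect" class. The tweets, labels,
# and sampling routine are placeholders, not the study's corpus or code.
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def undersample(texts, labels, majority_label):
    """Keep only as many majority-class examples as there are minority examples."""
    minority = [(t, y) for t, y in zip(texts, labels) if y != majority_label]
    majority = [(t, y) for t, y in zip(texts, labels) if y == majority_label]
    random.shuffle(majority)
    balanced = minority + majority[:len(minority)]
    random.shuffle(balanced)
    return [t for t, _ in balanced], [y for _, y in balanced]

texts = [
    "my baby was born with a congenital heart defect",        # user reporting an outcome
    "our son has spina bifida and he is doing great",          # user reporting an outcome
    "read an interesting article about birth defects today",   # merely mentions defects
    "folic acid can help prevent certain birth defects",
    "birth defects awareness month starts next week",
    "new study links pollution to birth defects",
]
labels = ["defect", "defect", "non-defect", "non-defect", "non-defect", "non-defect"]

texts_bal, labels_bal = undersample(texts, labels, majority_label="non-defect")

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
clf = LinearSVC().fit(vectorizer.fit_transform(texts_bal), labels_bal)

print(clf.predict(vectorizer.transform(["our daughter was born with a cleft palate"])))
```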


2021, Vol. 11 (2), pp. 15-23
Author(s):  
Sabrina Jahan Maisha ◽  
Nuren Nafisa ◽  
Abdul Kadar Muhammad Masum

We can state with confidence that the Bangla language is rich enough to support a wide range of Natural Language Processing (NLP) tasks, yet the field has hardly been explored for Bangla despite deserving proper attention. In this age of digitalization, a large amount of Bangla news content is generated on online platforms, some of it inappropriate for children or elderly people. Motivated by the need to filter news content easily, the aim of this work is to perform document-level sentiment analysis (SA) on Bangla online news. To this end, a dataset was created by collecting news from an online Bangla newspaper archive, and the documents were manually annotated into positive and negative classes. A composite "Pipeline" process comprising a count vectorizer, a TF-IDF transformer, and machine learning (ML) classifiers is employed to extract features and train on the dataset. Six supervised ML classifiers (Multinomial Naive Bayes (MNB), K-Nearest Neighbor (K-NN), Random Forest (RF), (C4.5) Decision Tree (DT), Logistic Regression (LR), and Linear Support Vector Machine (LSVM)) are compared to identify the best classifier for the proposed model. Very little work has been done on SA of Bangla news, so this work is a small attempt to contribute to the field. The model showed remarkable efficiency, with strong results in both percentage-split validation and 10-fold cross-validation. Among the six classifiers, RF outperformed the others with 99% accuracy; even LSVM, with the lowest accuracy of 80%, can still be considered a good result. The model also performed well on recent and critical Bangla news, indicating that the extracted features were appropriate for building the model.
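A plausible reconstruction of the described pipeline, assuming scikit-learn's Pipeline with a count vectorizer, TF-IDF transformer, and one of the six compared classifiers; the toy English stand-in documents and default Random Forest settings are illustrative only, since the actual corpus consists of annotated Bangla news documents.

```python
# Plausible reconstruction of the described pipeline: bag-of-words counts,
# TF-IDF re-weighting, then one of the six compared classifiers. The toy
# English stand-in documents and default Random Forest settings are
# assumptions; the actual corpus is annotated Bangla news documents.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier

documents = [
    "economy improves as exports grow",             # stand-in for a positive article
    "new hospital opens to serve rural patients",   # stand-in for a positive article
    "flood leaves thousands homeless",              # stand-in for a negative article
    "road accident kills several passengers",       # stand-in for a negative article
]
labels = ["positive", "positive", "negative", "negative"]

model = Pipeline([
    ("counts", CountVectorizer()),      # term counts
    ("tfidf", TfidfTransformer()),      # TF-IDF weighting
    ("clf", RandomForestClassifier()),  # one of the six classifiers compared
])
model.fit(documents, labels)
print(model.predict(["storm damages coastal villages"]))
```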


Author(s):  
Wai-Ling Mui ◽  
Edward P. Argenta ◽  
Teresa Quitugua ◽  
Christopher Kiley

Objective
The National Biosurveillance Integration Center (NBIC) and the Defense Threat Reduction Agency's Chemical and Biological Technologies Department (DTRA J9 CB) have partnered to co-develop the Biosurveillance Ecosystem (BSVE), an emerging capability that aims to provide a virtual, customizable analyst workbench that integrates health and non-health data. This partnership promotes engagement between diverse health surveillance entities to increase awareness and improve decision-making capabilities.

Introduction
NBIC collects, analyzes, and shares key biosurveillance information to support the nation's response to biological events of concern. Integration of this information enables early warning and shared situational awareness to inform critical decision making and direct response and recovery efforts. DTRA J9 CB leads DoD S&T to anticipate, defend, and safeguard against chemical and biological threats for the warfighter and the nation. These agencies have partnered to meet the evolving needs of the biosurveillance community and address gaps in technology and data sharing capabilities. High-profile events such as the 2009 H1N1 pandemic, the West African Ebola outbreak, and the recent emergence of Zika virus disease have underscored the need for integration of disparate biosurveillance systems to provide a more functional infrastructure. This allows analysts and others in the community to collect, analyze, and share relevant data across organizations securely and efficiently. Leveraging existing biosurveillance efforts provides the federal public health community, and its partners, with a comprehensive interagency platform that enables engagement and data sharing.

Methods
NBIC and DTRA are leveraging existing biosurveillance projects to share data feeds, work processes, resources, and lessons learned. A multi-stakeholder Agile process was implemented to represent the interests of NBIC, DTRA, and their respective partners. System requirements generated by both agencies were combined to form a single backlog of prioritized needs. Functional requirements from NBIC support the development of the prototype by refining system capabilities and providing an operational perspective. DTRA's technical expertise and research and development (R&D) portfolio ensure that robust analytic applications are embedded within a secure, scalable system architecture. Integration of analyst-validated data from the NBIC Biofeeds system serves as a gold standard to improve analytic development in machine learning and natural language processing. Additionally, working groups are formed using NBIC and DTRA extended partnerships with academia and private industry to expand R&D possibilities. These expansions include leveraging existing ontology efforts for improved system functionality and integrating social media algorithms for improved topic analysis output.

Results
The combined efforts of these two agencies to develop the BSVE and improve overall biosurveillance processes across the federal government have enhanced understanding of the needs of the community in a variety of mission spaces. To date, co-creation of products, joint analysis, and sharing of data feeds have become a major priority for both partners to advance biosurveillance outcomes. Within the larger effort of system development, possible coordination with other agencies such as the Department of Veterans Affairs (VA) and the US Geological Survey (USGS) could expand the reach of the system to ensure fulfillment of health surveillance requirements as a whole.

Conclusions
The NBIC and DTRA partnership has demonstrated value in improving biosurveillance capabilities for each agency and their partners. BSVE will provide NBIC analysts with a collaborative tool that can leverage applications that visualize near real-time global epidemic and outbreak data from a range of unique and trusted sources. The continued collaboration means ongoing access to new data streams and analytic processes for all analysts, as well as advanced machine learning algorithms that increase capabilities for joint analysis, rapid product creation, and continuous interagency communication.


2021, Vol. 21 (1)
Author(s):  
Conrad J. Harrison ◽  
Chris J. Sidey-Gibbons

Abstract Background Unstructured text, including medical records, patient feedback, and social media comments, can be a rich source of data for clinical research. Natural language processing (NLP) describes a set of techniques used to convert passages of written text into interpretable datasets that can be analysed by statistical and machine learning (ML) models. The purpose of this paper is to provide a practical introduction to contemporary techniques for the analysis of text data, using freely available software. Methods We performed three NLP experiments using publicly available data obtained from medicine review websites. First, we conducted lexicon-based sentiment analysis on open-text patient reviews of four drugs: Levothyroxine, Viagra, Oseltamivir and Apixaban. Next, we used unsupervised ML (latent Dirichlet allocation, LDA) to identify similar drugs in the dataset, based solely on their reviews. Finally, we developed three supervised ML algorithms to predict whether a drug review was associated with a positive or negative rating. These algorithms were: a regularised logistic regression, a support vector machine (SVM), and an artificial neural network (ANN). We compared the performance of these algorithms in terms of classification accuracy, area under the receiver operating characteristic curve (AUC), sensitivity and specificity. Results Levothyroxine and Viagra were reviewed with a higher proportion of positive sentiments than Oseltamivir and Apixaban. One of the three LDA clusters clearly represented drugs used to treat mental health problems. A common theme suggested by this cluster was drugs taking weeks or months to work. Another cluster clearly represented drugs used as contraceptives. Supervised machine learning algorithms predicted positive or negative drug ratings with classification accuracies ranging from 0.664, 95% CI [0.608, 0.716] for the regularised regression to 0.720, 95% CI [0.664, 0.776] for the SVM. Conclusions In this paper, we present a conceptual overview of common techniques used to analyse large volumes of text, and provide reproducible code that can be readily applied to other research studies using open-source software.
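As an independent illustration of the first of the three experiments (lexicon-based sentiment scoring), a minimal Python sketch is given below; the mini-lexicon and example reviews are assumptions, and the paper itself supplies its own reproducible code and lexicons.

```python
# Independent illustration of lexicon-based sentiment scoring, the first of
# the paper's three experiments. The mini-lexicon and example reviews are
# assumptions; the paper supplies its own reproducible code and lexicons.
import re

POSITIVE = {"effective", "great", "improved", "works"}
NEGATIVE = {"nausea", "terrible", "worse", "useless"}

def sentiment_score(review: str) -> int:
    """Count positive words minus negative words in a review."""
    tokens = re.findall(r"[a-z']+", review.lower())
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

reviews = [
    "this drug works and my energy levels improved",
    "terrible side effects and constant nausea, feeling worse",
]
for review in reviews:
    print(sentiment_score(review), review)
```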


2020, Vol. 14 (2), pp. 140-159
Author(s):  
Anthony-Paul Cooper ◽  
Emmanuel Awuni Kolog ◽  
Erkki Sutinen

This article builds on previous research exploring the content of church-related tweets. It does so by exploring whether the qualitative thematic coding of such tweets can, in part, be automated by the use of machine learning. It compares three supervised machine learning algorithms to understand how useful each is at a classification task, based on a dataset of human-coded church-related tweets. The study finds that one such algorithm, Naïve Bayes, performs better than the other algorithms considered, returning Precision, Recall and F-measure values which each exceed an acceptable threshold of 70%. This has far-reaching consequences at a time when the high volume of social media data, in this case Twitter data, means that the resource intensity of manual coding approaches can act as a barrier to understanding how the online community interacts with, and talks about, church. The findings presented in this article offer a way forward for scholars of digital theology to better understand the content of online church discourse.
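A hedged sketch of the kind of comparison described, assuming a multinomial Naive Bayes classifier on bag-of-words tweet features scored with precision, recall, and F-measure; the toy tweets and thematic codes are placeholders, not the study's human-coded dataset.

```python
# Hedged sketch: multinomial Naive Bayes over bag-of-words tweet features,
# scored with precision, recall and F-measure. The toy tweets and thematic
# codes are placeholders, not the study's human-coded dataset.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_recall_fscore_support

train_tweets = [
    "beautiful worship service at church this morning",
    "our church is collecting food for the local shelter",
    "walked past a lovely old church building on holiday",
    "the church building on main street is up for sale",
]
train_codes = ["worship", "outreach", "building", "building"]  # human-assigned themes

vectorizer = CountVectorizer()
clf = MultinomialNB().fit(vectorizer.fit_transform(train_tweets), train_codes)

test_tweets = ["volunteering with my church food bank today"]
test_codes = ["outreach"]
predicted = clf.predict(vectorizer.transform(test_tweets))

p, r, f, _ = precision_recall_fscore_support(
    test_codes, predicted, average="macro", zero_division=0
)
print(p, r, f)  # each would need to exceed 0.70 to meet the acceptability threshold
```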


2017
Author(s):  
Sabrina Jaeger ◽  
Simone Fulle ◽  
Samo Turk

Inspired by natural language processing techniques, we here introduce Mol2vec, an unsupervised machine learning approach to learn vector representations of molecular substructures. Similarly to Word2vec models, where vectors of closely related words are in close proximity in the vector space, Mol2vec learns vector representations of molecular substructures that point in similar directions for chemically related substructures. Compounds can finally be encoded as vectors by summing up the vectors of the individual substructures and, for instance, fed into supervised machine learning approaches to predict compound properties. The underlying substructure vector embeddings are obtained by training an unsupervised machine learning approach on a so-called corpus of compounds that consists of all available chemical matter. The resulting Mol2vec model is pre-trained once, yields dense vector representations, and overcomes drawbacks of common compound feature representations such as sparseness and bit collisions. The prediction capabilities are demonstrated on several compound property and bioactivity data sets and compared with results obtained for Morgan fingerprints as a reference compound representation. Mol2vec can be easily combined with ProtVec, which applies the same Word2vec concept to protein sequences, resulting in a proteochemometric approach that is alignment-independent and can thus also be easily used for proteins with low sequence similarities.
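A conceptual sketch of the Mol2vec idea, using gensim's Word2vec directly on "sentences" of substructure identifiers and summing substructure vectors into a compound vector; the toy identifiers are placeholders (the actual method derives Morgan substructure identifiers from molecules with RDKit), so this is an illustration rather than the released Mol2vec implementation.

```python
# Conceptual sketch of the Mol2vec idea with gensim's Word2vec (gensim >= 4
# parameter names): each compound is a "sentence" of substructure identifiers,
# and a compound vector is the sum of its substructure vectors. The toy
# identifiers are placeholders; the real method derives Morgan substructure
# identifiers from molecules with RDKit.
import numpy as np
from gensim.models import Word2Vec

compound_sentences = [
    ["sub_12", "sub_87", "sub_33", "sub_87"],
    ["sub_12", "sub_45", "sub_33"],
    ["sub_87", "sub_45", "sub_90", "sub_12"],
]

# Skip-gram embeddings over the substructure "corpus"
model = Word2Vec(compound_sentences, vector_size=32, window=5, min_count=1, sg=1)

def compound_vector(substructures):
    """Encode a compound as the sum of its substructure embeddings."""
    return np.sum([model.wv[s] for s in substructures], axis=0)

vec = compound_vector(compound_sentences[0])
print(vec.shape)  # (32,) dense vector, ready for a supervised property model
```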

