A dependency-based machine learning approach to the identification of research topics: a case in COVID-19 studies

2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Haoran Zhu ◽  
Lei Lei

Purpose Previous research concerning automatic extraction of research topics mostly used rule-based or topic modeling methods, which were challenged due to the limited rules, the interpretability issue and the heavy dependence on human judgment. This study aims to address these issues with the proposal of a new method that integrates machine learning models with linguistic features for the identification of research topics. Design/methodology/approach First, dependency relations were used to extract noun phrases from research article texts. Second, the extracted noun phrases were classified into topics and non-topics via machine learning models and linguistic and bibliometric features. Lastly, a trend analysis was performed to identify hot research topics, i.e. topics with increasing popularity. Findings The new method was experimented on a large dataset of COVID-19 research articles and achieved satisfactory results in terms of f-measures, accuracy and AUC values. Hot topics of COVID-19 research were also detected based on the classification results. Originality/value This study demonstrates that information retrieval methods can help researchers gain a better understanding of the latest trends in both COVID-19 and other research areas. The findings are significant to both researchers and policymakers.

2019 ◽  
Vol 14 (2) ◽  
pp. 97-106
Author(s):  
Ning Yan ◽  
Oliver Tat-Sheung Au

Purpose The purpose of this paper is to analyze the correlation between students’ online learning behavior features and course grade, and to attempt to build an effective prediction model based on limited data. Design/methodology/approach The prediction label in this paper is the course grade of students, and the eigenvalues available are student age, student gender, connection time, hits count and days of access. The machine learning model used in this paper is the classical three-layer feedforward neural network, trained with the scaled conjugate gradient algorithm. Pearson correlation analysis is used to find the relationships between course grade and the student eigenvalues. Findings Days of access has the highest correlation with course grade, followed by hits count, and connection time is less relevant to students’ course grade. Student age and gender have the lowest correlation with course grade. Binary classification models have much higher prediction accuracy than multi-class classification models. Data normalization and data discretization can effectively improve the prediction accuracy of machine learning models, such as the ANN model in this paper. Originality/value This paper may help teachers find clues for identifying students with learning difficulties in advance and give timely help through the online learning behavior data. It shows that acceptable prediction models based on machine learning can be built using a small and limited data set. However, introducing external data into machine learning models to improve their prediction accuracy is still a valuable and hard issue.
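
The Pearson correlation analysis mentioned in this abstract can be illustrated with a minimal sketch. The function name `pearson_r` and the toy records below are hypothetical and are not taken from the paper; the sketch only shows how a correlation between one behavior feature and course grade might be computed.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (numerator) and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy records (NOT data from the paper).
days_of_access = [3, 10, 15, 22, 30]
course_grade = [52, 61, 70, 78, 88]
r = pearson_r(days_of_access, course_grade)
print(f"r = {r:.3f}")
```

A value of r near +1 or -1 indicates a strong linear relationship, while a value near 0 indicates a weak one; ranking features by |r| is one simple way to reproduce the kind of ordering the abstract reports.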


mSystems ◽  
2020 ◽  
Vol 5 (1) ◽  
Author(s):  
D. Aytan-Aktug ◽  
P. T. L. C. Clausen ◽  
V. Bortolaia ◽  
F. M. Aarestrup ◽  
O. Lund

ABSTRACT Machine learning has proven to be a powerful method to predict antimicrobial resistance (AMR) without using prior knowledge for selected bacterial species-antimicrobial combinations. To date, only species-specific machine learning models have been developed, and to the best of our knowledge, the inclusion of information from multiple species has not been attempted. The aim of this study was to determine the feasibility of including information from multiple bacterial species to predict AMR for an individual species, since this may make it easier to train and update resistance predictions for multiple species and may lead to improved predictions. Whole-genome sequence data and susceptibility profiles from 3,528 Mycobacterium tuberculosis, 1,694 Escherichia coli, 658 Salmonella enterica, and 1,236 Staphylococcus aureus isolates were included. We developed machine learning models trained on the features detected by the PointFinder and ResFinder programs to predict binary (susceptible/resistant) AMR profiles. We tested four feature representation methods to determine the most efficient way of introducing features into the models. When training the model only on the Mycobacterium tuberculosis isolates, high prediction performances were obtained for the six AMR profiles included. By adding information on ciprofloxacin from the additional 3,588 isolates, there was no reduction in performance for the other antimicrobials but an increased performance for ciprofloxacin AMR profile prediction for Mycobacterium tuberculosis and Escherichia coli. In conclusion, the species-independent models can predict multi-AMR profiles for multiple species without losing any robustness. IMPORTANCE Machine learning is a proven method to predict AMR; however, the performance of any machine learning model depends on the quality of the input data. Therefore, we evaluated different methods of representing information about mutations as well as mobilizable genes, so that the information can serve as input for a robust model. We combined data from multiple bacterial species in order to develop species-independent machine learning models that can predict resistance profiles for multiple antimicrobials and species with high performance.


Significance It required arguably the single largest computational effort for a machine learning model to date, and it is capable of producing text at times indistinguishable from the work of a human author. This has generated considerable excitement about potentially transformative business applications -- and concerns about the system's weaknesses and possible misuse. Impacts Stereotypes and biases in machine learning models will become increasingly problematic as they are adopted by businesses and governments. The use of flawed AI tools that result in embarrassing failures risks cuts to public funding for AI research. Academia and industry face pressure to advance research into explainable AI, but progress is slow.


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Amirhessam Tahmassebi ◽  
Mehrtash Motamedi ◽  
Amir H. Alavi ◽  
Amir H. Gandomi

Purpose Engineering design and operational decisions depend largely on deep understanding of applications that requires assumptions for simplification of the problems in order to find proper solutions. Cutting-edge machine learning algorithms can be used as one of the emerging tools to simplify this process. In this paper, we propose a novel scalable and interpretable machine learning framework to automate this process and fill the current gap. Design/methodology/approach The essential principles of the proposed pipeline are mainly (1) scalability, (2) interpretability and (3) robust probabilistic performance across engineering problems. The lack of interpretability of complex machine learning models prevents their use in various problems including engineering computation assessments. Many consumers of machine learning models would not trust the results if they cannot understand the method. Thus, the SHapley Additive exPlanations (SHAP) approach is employed to interpret the developed machine learning models. Findings The proposed framework can be applied to a variety of engineering problems including seismic damage assessment of structures. The performance of the proposed framework is investigated using two case studies of failure identification in reinforced concrete (RC) columns and shear walls. In addition, the reproducibility, reliability and generalizability of the results were validated and the results of the framework were compared to the benchmark studies. The results of the proposed framework outperformed the benchmark results with high statistical significance. Originality/value Although the current study reveals that the geometric input features and reinforcement indices are the most important variables in failure mode detection, a better model can be achieved by employing more robust strategies to establish a proper database and decrease the errors in the identification of some of the failure modes.


mBio ◽  
2020 ◽  
Vol 11 (4) ◽  
Author(s):  
Nathan B. Pincus ◽  
Egon A. Ozer ◽  
Jonathan P. Allen ◽  
Marcus Nguyen ◽  
James J. Davis ◽  
...  

ABSTRACT Variation in the genome of Pseudomonas aeruginosa, an important pathogen, can have dramatic impacts on the bacterium’s ability to cause disease. We therefore asked whether it was possible to predict the virulence of P. aeruginosa isolates based on their genomic content. We applied a machine learning approach to a genetically and phenotypically diverse collection of 115 clinical P. aeruginosa isolates using genomic information and corresponding virulence phenotypes in a mouse model of bacteremia. We defined the accessory genome of these isolates through the presence or absence of accessory genomic elements (AGEs), sequences present in some strains but not others. Machine learning models trained using AGEs were predictive of virulence, with a mean nested cross-validation accuracy of 75% using the random forest algorithm. However, individual AGEs did not have a large influence on the algorithm’s performance, suggesting instead that virulence predictions are derived from a diffuse genomic signature. These results were validated with an independent test set of 25 P. aeruginosa isolates whose virulence was predicted with 72% accuracy. Machine learning models trained using core genome single-nucleotide variants and whole-genome k-mers also predicted virulence. Our findings are a proof of concept for the use of bacterial genomes to predict pathogenicity in P. aeruginosa and highlight the potential of this approach for predicting patient outcomes. IMPORTANCE Pseudomonas aeruginosa is a clinically important Gram-negative opportunistic pathogen. P. aeruginosa shows a large degree of genomic heterogeneity both through variation in sequences found throughout the species (core genome) and through the presence or absence of sequences in different isolates (accessory genome). P. aeruginosa isolates also differ markedly in their ability to cause disease. In this study, we used machine learning to predict the virulence level of P. aeruginosa isolates in a mouse bacteremia model based on genomic content. We show that both the accessory and core genomes are predictive of virulence. This study provides a machine learning framework to investigate relationships between bacterial genomes and complex phenotypes such as virulence.


2022 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Dinda Thalia Andariesta ◽  
Meditya Wasesa

Purpose This research presents machine learning models for predicting international tourist arrivals in Indonesia during the COVID-19 pandemic using multisource Internet data. Design/methodology/approach To develop the prediction models, this research utilizes multisource Internet data from TripAdvisor travel forum and Google Trends. Temporal factors, posts and comments, search queries index and previous tourist arrivals records are set as predictors. Four sets of predictors and three distinct data compositions were utilized for training the machine learning models, namely artificial neural networks (ANNs), support vector regression (SVR) and random forest (RF). To evaluate the models, this research uses three accuracy metrics, namely root mean square error (RMSE), mean absolute error (MAE) and mean absolute percentage error (MAPE). Findings Prediction models trained using multisource Internet data predictors have better accuracy than those trained using single-source Internet data or other predictors. In addition, using more training sets that cover the phenomenon of interest, such as COVID-19, will enhance the prediction model's learning process and accuracy. The experiments show that the RF models have better prediction accuracy than the ANN and SVR models. Originality/value First, this study pioneers the practice of a multisource Internet data approach in predicting tourist arrivals amid the unprecedented COVID-19 pandemic. Second, the use of multisource Internet data to improve prediction performance is validated with real empirical data. Finally, this is one of the few papers to provide perspectives on the current dynamics of Indonesia's tourism demand.
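
The three accuracy metrics named in this abstract follow standard definitions, which the sketch below spells out. The function names and the arrival figures are hypothetical illustrations, not values from the study.

```python
import math


def rmse(actual, predicted):
    """Root mean square error: sqrt(mean((prediction - actual)^2))."""
    return math.sqrt(
        sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def mae(actual, predicted):
    """Mean absolute error: mean(|prediction - actual|)."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Undefined when any actual value is zero, so that case is rejected.
    """
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return 100.0 * sum(
        abs((p - a) / a) for a, p in zip(actual, predicted)
    ) / len(actual)


# Hypothetical monthly arrivals (NOT data from the paper).
actual_arrivals = [100, 200, 400]
predicted_arrivals = [110, 190, 380]

print(rmse(actual_arrivals, predicted_arrivals))
print(mae(actual_arrivals, predicted_arrivals))
print(mape(actual_arrivals, predicted_arrivals))
```

RMSE penalizes large errors more heavily than MAE, while MAPE is scale-free, which is presumably why studies of this kind report all three together.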


2021 ◽  
Author(s):  
Tomasz Konopka ◽  
Damian Smedley

Abstract Biomedical ontologies are established tools that organize knowledge in specialized research areas. They can also be used to train machine-learning models. However, it is unclear to what extent representations of ontology concepts learned by machine-learning models capture the relationships intended by ontology curators. It is also unclear whether the representations can provide insights to improve the curation process. Here, we investigate ontologies from across the spectrum of biological research and assess the concordance of formal ontology hierarchies with representations based on plain-text definitions. By comparing the internal properties of each ontology, we describe general patterns across the pan-ontology landscape and pinpoint areas with discrepancies in individual domains. We suggest specific mechanisms through which machine-learning approaches can lead to clarifications of ontology definitions. Synchronizing patterns in machine-derived representations with those intended by the ontology curators will likely streamline the use of ontologies in downstream applications.


2019 ◽  
Vol 13 (4) ◽  
pp. 100983 ◽  
Author(s):  
Shuo Xu ◽  
Liyuan Hao ◽  
Xin An ◽  
Guancan Yang ◽  
Feifei Wang

2021 ◽  
Author(s):  
Najlaa Maaroof ◽  
Antonio Moreno ◽  
Mohammed Jabreel ◽  
Aida Valls

Despite the broad adoption of Machine Learning models in many domains, they remain mostly black boxes. There is a pressing need to ensure that Machine Learning models are interpretable, so that designers and users can understand the reasons behind their predictions. In this work, we propose a new method called C-LORE-F to explain the decisions of fuzzy-based black box models. This new method uses some contextual information about the attributes as well as the knowledge of the fuzzy sets associated with the linguistic labels of the fuzzy attributes to provide actionable explanations. The experimental results on three datasets reveal the effectiveness of C-LORE-F when compared with the most relevant related works.


Author(s):  
Ji In Choi ◽  
Madeleine Georges ◽  
Jung Ah Shin ◽  
Olivia Wang ◽  
Tiffany Zhu ◽  
...  

With advances in edge applications in industry and healthcare, machine learning models are increasingly trained on the edge. However, storage and memory infrastructure at the edge are often primitive, due to cost and real-estate constraints. A simple, effective method is to learn machine learning models from quantized data stored with low arithmetic precision (1-8 bits). In this work, we introduce two stochastic quantization methods, dithering and stochastic rounding. In dithering, additive noise from a uniform distribution is added to the sample before quantization. In stochastic rounding, each sample is quantized to the upper level with probability p and to a lower level with probability 1-p. The key contributions of the paper are as follows: For 3 standard machine learning models, Support Vector Machines, Decision Trees and Linear (Logistic) Regression, we compare the performance loss for a standard static quantization and stochastic quantization for 55 classification and 30 regression datasets with 1-8 bits quantization. We showcase that for 4- and 8-bit quantization over regression datasets, stochastic quantization demonstrates statistically significant improvement. We investigate the performance loss as a function of dataset attributes viz. number of features, standard deviation, skewness. This helps create a transfer function which will recommend the best quantizer for a given dataset. We propose 2 future research areas, dynamic quantizer update where the model is trained using streaming data and the quantizer is updated after each batch and precision re-allocation under budget constraints where different precision is used for different features.
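
The two stochastic quantizers described in this abstract can be sketched as follows. Several details are assumptions made for illustration and are not stated in the abstract: a uniform grid of 2^bits levels over a known range [lo, hi], dither noise drawn from one quantization step centered on zero, and a round-up probability p equal to the sample's fractional distance above the lower level (which makes the rounding unbiased in expectation). All function names are hypothetical.

```python
import math
import random


def _grid(lo, hi, bits):
    """Number of levels and step size of a uniform grid on [lo, hi]."""
    if bits < 1 or hi <= lo:
        raise ValueError("need bits >= 1 and hi > lo")
    levels = 2 ** bits
    return levels, (hi - lo) / (levels - 1)


def quantize_static(x, lo, hi, bits):
    """Deterministic baseline: snap x to the nearest grid level."""
    levels, step = _grid(lo, hi, bits)
    idx = math.floor((x - lo) / step + 0.5)
    idx = min(max(idx, 0), levels - 1)  # clip to the valid level range
    return lo + idx * step


def quantize_dither(x, lo, hi, bits, rng):
    """Dithering: add uniform noise, then quantize to the nearest level.

    Assumption: noise is uniform on [-step/2, +step/2].
    """
    _, step = _grid(lo, hi, bits)
    noisy = x + rng.uniform(-step / 2, step / 2)
    return quantize_static(noisy, lo, hi, bits)


def quantize_stochastic_round(x, lo, hi, bits, rng):
    """Stochastic rounding: go up with probability p, down with 1 - p.

    Assumption: p is the fractional position of x between its two
    neighbouring levels.
    """
    levels, step = _grid(lo, hi, bits)
    t = min(max((x - lo) / step, 0.0), levels - 1)
    lower = math.floor(t)
    p_up = t - lower
    idx = lower + (1 if rng.random() < p_up else 0)
    return lo + idx * step


rng = random.Random(0)
# With 1 bit on [0, 1], the only levels are 0.0 and 1.0.
static_value = quantize_static(0.3, 0.0, 1.0, 1)
dithered = [quantize_dither(0.3, 0.0, 1.0, 1, rng) for _ in range(20000)]
rounded = [quantize_stochastic_round(0.3, 0.0, 1.0, 1, rng) for _ in range(20000)]
print(static_value, sum(dithered) / len(dithered), sum(rounded) / len(rounded))
```

Under these assumptions the static quantizer always maps 0.3 to the same level, whereas the averages of the two stochastic variants stay close to the original value, which is the intuition behind preferring them at very low precision.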

