BACKGROUND
The volume of available textual health data, such as scientific and biomedical literature, is constantly growing, making it increasingly challenging for health professionals to summarise those data properly and, in consequence, to practice evidence-based clinical decision making. Moreover, exploring large unstructured health text data is very challenging for non-experts because of limited time, resources and skills. Current tools for exploring text data lack ease of use, require high computational effort, and have difficulty incorporating domain knowledge and focusing on topics of interest.
OBJECTIVE
We developed a methodology that allows experts and non-experts to explore and target topics of interest via an interactive user interface. We aim to reach near state-of-the-art performance while reducing memory consumption, increasing scalability and minimizing user interaction effort, in order to improve the clinical decision-making process. Performance was evaluated on diabetes-related abstracts from PubMed.
METHODS
The methodology consists of four parts: 1) a novel, interpretable hierarchical clustering of documents in which each node is defined by headwords (the words that best describe the documents in that node); 2) an efficient classification system to target topics; 3) minimized user interaction effort through active learning; 4) a visual user interface through which the user interacts with the system.
We evaluated our approach on 50,911 diabetes-related abstracts from PubMed, which come with Medical Subject Headings (MeSH) codes, a hierarchical structure in which each code is a unique identifier for a topic. Hierarchical clustering performance was compared against the implementation in the machine learning library scikit-learn. On a subset of 2,000 randomly chosen diabetes abstracts, our active learning strategy was compared against three other strategies: random selection of training instances; uncertainty sampling, which chooses the instances the model is most uncertain about; and an expected gradient length strategy based on convolutional neural networks (CNN).
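To illustrate the uncertainty sampling baseline mentioned above, the following is a minimal sketch using the least-confidence criterion, which is one common variant; the abstract does not state which variant was used. The function names and the probability values are hypothetical and serve only as an example.

```python
from typing import List, Sequence


def least_confidence(probabilities: Sequence[float]) -> float:
    """Uncertainty score: 1 minus the probability of the most likely class."""
    return 1.0 - max(probabilities)


def select_most_uncertain(
    pool_probabilities: Sequence[Sequence[float]], batch_size: int
) -> List[int]:
    """Return the indices of the `batch_size` most uncertain pool instances."""
    scores = [least_confidence(p) for p in pool_probabilities]
    # Rank pool indices from most to least uncertain.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:batch_size]


# Hypothetical predicted class probabilities for four unlabeled abstracts.
pool = [
    [0.95, 0.05],
    [0.55, 0.45],
    [0.70, 0.30],
    [0.51, 0.49],
]

# Indices of the two abstracts that would be sent to the user for labeling.
query_indices = select_most_uncertain(pool, batch_size=2)
```

In an active learning loop, the selected instances would be labeled by the user, added to the training set, and the classifier retrained before the next selection round.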
RESULTS
For the hierarchical clustering performance, we achieved an F1-Score of 0.73, compared with 0.76 for scikit-learn. Concerning active learning performance, after 200 training samples had been chosen by each strategy, our approach reached a satisfactory weighted F1-Score of 0.62 over all MeSH codes, compared with 0.61 for the uncertainty strategy, 0.61 for the CNN and 0.45 for the random strategy.
Moreover, as the number of documents increased, our methodology showed constant, low memory use but increased execution time.
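As a reminder of how a weighted F1-Score is typically computed, the sketch below averages the per-class F1-Scores, weighting each class by its number of true instances. It assumes a simplified single-label setting with hypothetical labels "A" and "B"; the exact evaluation setup over MeSH codes is not specified in this abstract.

```python
from collections import Counter
from typing import Hashable, Sequence


def weighted_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denominator = 2 * tp + fp + fn
        f1 = 2 * tp / denominator if denominator else 0.0
        # Weight the class F1 by its share of the true labels.
        score += (count / total) * f1
    return score


# Hypothetical labels: three documents of class "A", one of class "B".
y_true = ["A", "A", "A", "B"]
y_pred = ["A", "A", "B", "B"]
result = weighted_f1(y_true, y_pred)
```

Here class "A" has an F1 of 0.8 and class "B" an F1 of 2/3, so the weighted score is 0.75 × 0.8 + 0.25 × 2/3 ≈ 0.767.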
CONCLUSIONS
We proposed an easy-to-use tool that enables experts and non-experts to combine domain knowledge with topic exploration and to target specific topics of interest while improving transparency. Furthermore, our approach is very memory efficient and highly parallelizable, making it interesting for Big Data sets. This approach can be used by health professionals to rapidly gain deep insights into biomedical literature and ultimately improve the evidence-based clinical decision-making process.