Frequent Itemsets as Meaningful Events in Graphs for Summarizing Biomedical Texts

The streams where multiple transactions are associated with the same key are prevalent in practice, e.g., a customer has multiple shopping records arriving at different time. Itemset frequency estimation on such streams is very challenging since sampling based methods, such as the popularly used reservoir sampling, cannot be used. In this article, we propose a novel k -Minimum Value (KMV) synopsis based method to estimate the frequency of itemsets over multi-transaction streams. First, we extract the KMV synopses for each item from the stream. Then, we propose a novel estimator to estimate the frequency of an itemset over the KMV synopses. Comparing to the existing estimator, our method is not only more accurate and efficient to calculate but also follows the downward-closure property. These properties enable the incorporation of our new estimator with existing frequent itemset mining (FIM) algorithm (e.g., FP-Growth) to mine frequent itemsets over multi-transaction streams. To demonstrate this, we implement a KMV synopsis based FIM algorithm by integrating our estimator into existing FIM algorithms, and we prove it is capable of guaranteeing the accuracy of FIM with a bounded size of KMV synopsis. Experimental results on massive streams show our estimator can significantly improve on the accuracy for both estimating itemset frequency and FIM compared to the existing estimators.

Download Full-text

A Mining Frequent Itemsets Algorithm in Stream Data Based on Sliding Time Decay Window

Proceedings of the 2020 3rd International Conference on Artificial Intelligence and Pattern Recognition ◽

10.1145/3430199.3430226 ◽

2020 ◽

Author(s):

Xin Lu ◽

Shaonan Jin ◽

Xun Wang ◽

Jiao Yuan ◽

Kun Fu ◽

...

Keyword(s):

Frequent Itemsets ◽

Time Decay ◽

Stream Data ◽

Mining Frequent Itemsets

Download Full-text

BioVerbNet: a large semantic-syntactic classification of verbs in biomedicine

Journal of Biomedical Semantics ◽

10.1186/s13326-021-00247-z ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Olga Majewska ◽

Charlotte Collins ◽

Simon Baker ◽

Jari Björne ◽

Susan Windisch Brown ◽

...

Keyword(s):

Natural Language ◽

Language Processing ◽

Model Performance ◽

Representation Learning ◽

Verb Classes ◽

Expert Annotation ◽

Biomedical Texts ◽

Time Required ◽

Verb Meaning

Abstract Background Recent advances in representation learning have enabled large strides in natural language understanding; However, verbal reasoning remains a challenge for state-of-the-art systems. External sources of structured, expert-curated verb-related knowledge have been shown to boost model performance in different Natural Language Processing (NLP) tasks where accurate handling of verb meaning and behaviour is critical. The costliness and time required for manual lexicon construction has been a major obstacle to porting the benefits of such resources to NLP in specialised domains, such as biomedicine. To address this issue, we combine a neural classification method with expert annotation to create BioVerbNet. This new resource comprises 693 verbs assigned to 22 top-level and 117 fine-grained semantic-syntactic verb classes. We make this resource available complete with semantic roles and VerbNet-style syntactic frames. Results We demonstrate the utility of the new resource in boosting model performance in document- and sentence-level classification in biomedicine. We apply an established retrofitting method to harness the verb class membership knowledge from BioVerbNet and transform a pretrained word embedding space by pulling together verbs belonging to the same semantic-syntactic class. The BioVerbNet knowledge-aware embeddings surpass the non-specialised baseline by a significant margin on both tasks. Conclusion This work introduces the first large, annotated semantic-syntactic classification of biomedical verbs, providing a detailed account of the annotation process, the key differences in verb behaviour between the general and biomedical domain, and the design choices made to accurately capture the meaning and properties of verbs used in biomedical texts. The demonstrated benefits of leveraging BioVerbNet in text classification suggest the resource could help systems better tackle challenging NLP tasks in biomedicine.

Download Full-text

Knowledge Enhanced LSTM for Coreference Resolution on Biomedical Texts

Bioinformatics ◽

10.1093/bioinformatics/btab153 ◽

2021 ◽

Author(s):

Yufei Li ◽

Xiaoyong Ma ◽

Xiangyu Zhou ◽

Pengzhen Cheng ◽

Kai He ◽

...

Keyword(s):

Information Integration ◽

Short Term Memory ◽

Superior Performance ◽

Supplementary Information ◽

Specific Information ◽

Coreference Resolution ◽

Fine Grained ◽

Domain Specific ◽

Memory Network ◽

Biomedical Texts

Abstract Motivation Bio-entity Coreference Resolution focuses on identifying the coreferential links in biomedical texts, which is crucial to complete bio-events’ attributes and interconnect events into bio-networks. Previously, as one of the most powerful tools, deep neural network-based general domain systems are applied to the biomedical domain with domain-specific information integration. However, such methods may raise much noise due to its insufficiency of combining context and complex domain-specific information. Results In this paper, we explore how to leverage the external knowledge base in a fine-grained way to better resolve coreference by introducing a knowledge-enhanced Long Short Term Memory network (LSTM), which is more flexible to encode the knowledge information inside the LSTM. Moreover, we further propose a knowledge attention module to extract informative knowledge effectively based on contexts. The experimental results on the BioNLP and CRAFT datasets achieve state-of-the-art performance, with a gain of 7.5 F1 on BioNLP and 10.6 F1 on CRAFT. Additional experiments also demonstrate superior performance on the cross-sentence coreferences. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text