Multi-label classification of research articles using Word2Vec and identification of similarity threshold

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Ghulam Mustafa ◽  
Muhammad Usman ◽  
Lisu Yu ◽  
Muhammad Tanvir Afzal ◽  
Muhammad Sulaiman ◽  
...  

Abstract Every year, around 28,100 journals publish 2.5 million research publications. Search engines, digital libraries, and citation indexes are used extensively to search these publications. When a user submits a query, it returns a large number of documents, of which only a few are relevant. Due to inadequate indexing, the resulting documents are largely unstructured. Publicly known systems mostly index research papers using keywords rather than a subject hierarchy. Numerous methods reported for performing single-label classification (SLC) or multi-label classification (MLC) are based on content and metadata features. Content-based techniques achieve better results owing to the richness of available features, but their drawback is that the full text is unavailable in most cases. The use of metadata-based parameters, such as title, keywords, and general terms, acts as an alternative to content. However, existing metadata-based techniques show low accuracy because they rely on traditional statistical measures, such as BOW, TF, and TF-IDF, to express textual properties in quantitative form; these measures may not capture the semantic context of words. Existing MLC techniques also require a specified threshold value to map articles into predetermined categories, for which domain knowledge is necessary. The objective of this paper is to overcome these limitations of SLC and MLC techniques. To capture the semantic and contextual information of words, the proposed approach leverages the Word2Vec paradigm for textual representation. The proposed model determines threshold values through rigorous data analysis, removing the need for domain expertise. Experimentation is carried out on two datasets from the field of computer science (JUCS and ACM). In comparison to current state-of-the-art methodologies, the proposed model performed well. Experiments yielded average accuracies of 0.86 and 0.84 on JUCS and ACM for SLC, and 0.81 and 0.80 on JUCS and ACM for MLC. On both datasets, the proposed SLC model improved accuracy by up to 4%, while the proposed MLC model improved accuracy by up to 3%.
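
The abstract does not include an implementation, but the core idea of representing article metadata with Word2Vec vectors and assigning every category whose similarity exceeds a threshold can be sketched roughly as follows. This is a minimal illustration, not the authors' model: gensim is assumed, and the toy corpus, category keyword lists, and threshold value are invented for demonstration.

```python
# Hypothetical sketch: Word2Vec-based multi-label assignment via a similarity threshold.
# Assumes gensim and numpy; the corpus, categories, and threshold are illustrative.
import numpy as np
from gensim.models import Word2Vec

# Toy metadata "documents" (title + keywords), tokenized.
docs = [
    ["neural", "networks", "image", "classification"],
    ["database", "query", "optimization", "indexing"],
    ["reinforcement", "learning", "agents", "rewards"],
]
categories = {
    "machine_learning": ["learning", "classification", "neural"],
    "databases": ["database", "indexing", "query"],
}

model = Word2Vec(sentences=docs, vector_size=50, min_count=1, seed=1)

def doc_vector(tokens):
    """Average the Word2Vec vectors of the tokens present in the vocabulary."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

THRESHOLD = 0.30  # illustrative; the paper derives its threshold from data analysis

for doc in docs:
    dv = doc_vector(doc)
    labels = [name for name, kw in categories.items()
              if cosine(dv, doc_vector(kw)) >= THRESHOLD]
    print(doc, "->", labels or ["<no label above threshold>"])
```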

Sensors ◽  
2021 ◽  
Vol 21 (14) ◽  
pp. 4890
Author(s):  
Athanasios Dimitriadis ◽  
Christos Prassas ◽  
Jose Luis Flores ◽  
Boonserm Kulvatunyou ◽  
Nenad Ivezic ◽  
...  

Cyber threat information sharing is an imperative process towards achieving collaborative security, but it poses several challenges. One crucial challenge is the sheer volume of shared threat information; therefore, there is a need for better filtering of such information. While the state of the art in filtering relies primarily on keyword- and domain-based searching, these approaches require sizable human involvement and rarely available domain expertise. Recent research revealed the need to harvest business information to fill the gap in filtering, although it yielded only coarse-grained filtering based on such information. This paper presents a novel contextualized filtering approach that exploits standardized, multi-level contextual information of business processes. The contextual information describes the conditions under which a given piece of threat information is actionable from an organization's perspective. It can therefore automate filtering by measuring the equivalence between the context of the shared threat information and the context of the consuming organization. The paper contributes directly to the filtering challenge and indirectly to automated, customized threat information sharing. Moreover, the paper proposes the architecture of a cyber threat information sharing ecosystem that operates according to the proposed filtering approach, and defines the characteristics that are advantageous to filtering approaches. Implementation of the proposed approach can support compliance with Special Publication 800-150 of the National Institute of Standards and Technology.
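
As a rough illustration of measuring context equivalence (not the paper's implementation), one can score how well the context attached to a shared threat-information item matches an organization's own context and filter on that score. The attribute names, values, and matching rule below are assumptions made only for demonstration.

```python
# Illustrative sketch: compare multi-level contextual attributes of a shared threat
# item against the consuming organization's context. All names and the rule are assumed.
def context_match(threat_context: dict, org_context: dict) -> float:
    """Return the fraction of shared contextual attributes that agree."""
    keys = set(threat_context) & set(org_context)
    if not keys:
        return 0.0
    hits = sum(1 for k in keys
               if set(threat_context[k]) & set(org_context[k]))
    return hits / len(keys)

shared_item = {
    "industry": ["healthcare"],
    "business_process": ["patient-records-management"],
    "platform": ["windows", "linux"],
}
organization = {
    "industry": ["healthcare"],
    "business_process": ["billing", "patient-records-management"],
    "platform": ["linux"],
}

score = context_match(shared_item, organization)
print("relevant" if score >= 0.5 else "filtered out", f"(score={score:.2f})")
```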


2022 ◽  
Vol 24 (3) ◽  
pp. 0-0

In this digital era, people are keen to share their feedback about products, services, and current issues on social networks and other platforms. A careful analysis of this feedback can give a clear picture of what people think about a particular topic. This work proposes an almost unsupervised Aspect-Based Sentiment Analysis approach for textual reviews. Latent Dirichlet Allocation, along with linguistic rules, is used for aspect extraction. Aspects are ranked by their probability distribution values and then clustered into predefined categories using frequent terms together with domain knowledge. The SentiWordNet lexicon is used for sentiment scoring and classification. Experiments with two popular datasets show the superiority of the strategy compared to existing methods, achieving 85% average accuracy when tested on manually labeled data.
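
A very condensed sketch of such a pipeline is shown below: LDA supplies candidate aspect terms and SentiWordNet supplies sentiment scores. The toy reviews, topic count, and aggregation rule are assumptions; the original work additionally applies linguistic rules and domain knowledge, which are omitted here.

```python
# Rough sketch of the pipeline described above: LDA for candidate aspect terms,
# SentiWordNet for sentiment scoring. Reviews, topic count, and scoring are toy choices.
import nltk
from gensim import corpora
from gensim.models import LdaModel
from nltk.corpus import sentiwordnet as swn

nltk.download("wordnet", quiet=True)
nltk.download("sentiwordnet", quiet=True)

reviews = [
    ["battery", "life", "is", "great"],
    ["screen", "quality", "is", "poor"],
    ["battery", "drains", "fast"],
]
dictionary = corpora.Dictionary(reviews)
corpus = [dictionary.doc2bow(r) for r in reviews]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10, random_state=1)

# Take the highest-probability words per topic as candidate aspect terms.
aspects = {w for t in range(lda.num_topics) for w, _ in lda.show_topic(t, topn=3)}

def sentiment(word):
    """Average positive-minus-negative score over the word's SentiWordNet synsets."""
    synsets = list(swn.senti_synsets(word))
    if not synsets:
        return 0.0
    return sum(s.pos_score() - s.neg_score() for s in synsets) / len(synsets)

for review in reviews:
    found = [a for a in aspects if a in review]
    score = sum(sentiment(w) for w in review)
    print(review, "aspects:", found,
          "sentiment:", "positive" if score > 0 else "negative/neutral")
```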


2010 ◽  
Vol 37 ◽  
pp. 247-277 ◽  
Author(s):  
S. Qu ◽  
J. Y. Chai

To tackle the vocabulary problem in conversational systems, previous work has applied unsupervised learning approaches to co-occurring speech and eye gaze during interaction to automatically acquire new words. Although these approaches have shown promise, several issues related to human language behavior and human-machine conversation have not been addressed. First, psycholinguistic studies have shown certain temporal regularities between human eye movement and language production. While these regularities can potentially guide the acquisition process, they have not been incorporated into the previous unsupervised approaches. Second, conversational systems generally have an existing knowledge base about the domain and vocabulary. While the existing knowledge can potentially help bootstrap and constrain the acquired new words, it has not been incorporated into the previous models. Third, eye gaze can serve different functions in human-machine conversation. Some gaze streams may not be closely coupled with the speech stream and are thus potentially detrimental to word acquisition. Automated recognition of closely coupled speech-gaze streams based on conversation context is important. To address these issues, we developed new approaches that incorporate user language behavior, domain knowledge, and conversation context in word acquisition. We evaluated these approaches in the context of situated dialogue in a virtual world. Our experimental results show that incorporating the above three types of contextual information significantly improves word acquisition performance.
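
To make the eye-voice regularity concrete, the toy sketch below associates spoken words with gazed objects by weighting co-occurrences with temporal proximity to an assumed gaze-before-speech lag. This is speculative illustration only; the paper's models are considerably more sophisticated, and all timestamps, names, and parameters here are invented.

```python
# Speculative illustration: weight word-object associations by how close the gaze
# fixation is to the expected eye-voice lag. Data and the 0.3 s lag are assumptions.
from collections import defaultdict
import math

# (word, onset_seconds) from speech and (object, fixation_seconds) from gaze.
speech = [("pot", 1.2), ("table", 3.0), ("lamp", 5.1)]
gaze = [("pot_3d_model", 0.9), ("table_3d_model", 2.7), ("lamp_3d_model", 4.9)]

EYE_LEADS_VOICE = 0.3   # assumed average lead of gaze before the word, in seconds
scores = defaultdict(float)
for word, w_t in speech:
    for obj, g_t in gaze:
        lag = w_t - g_t
        # Gaussian weight centred on the expected eye-voice lag.
        scores[(word, obj)] += math.exp(-((lag - EYE_LEADS_VOICE) ** 2) / 0.5)

for word, _ in speech:
    best = max((o for w, o in scores if w == word), key=lambda o: scores[(word, o)])
    print(word, "->", best)
```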


Author(s):  
HUI-NGO GOH ◽  
CHING-CHIEH KIU ◽  
LAY-KI SOON ◽  
BALI RANAIVO-MALANÇON

The field of ontology has received attention lately due to the increasing need to conceptualize domain knowledge for a variety of task demands. Numerous new techniques, tools, and applications have since been developed for managing knowledge. However, most existing work focuses on the non-fiction domain and on categorizing concepts into components or clusters; hence, the originality of the content flow is not preserved. This paper presents automated ontology construction in the fiction domain. The significance of the study lies in (1) designing a simple algorithmic framework for automated ontology construction that preserves the originality of the content flow in an ontology, (2) identifying a suitable threshold value for extracting true terms, and (3) automatically processing unstructured fiction-domain text into a meaningful structure.
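
One way to picture the threshold-based term extraction while keeping the content flow is sketched below. This is not the authors' framework: the story text, the frequency cut-off, and the flow-chaining rule are invented purely to illustrate the idea.

```python
# Minimal illustrative sketch: keep terms whose frequency exceeds a threshold and
# chain them in order of first appearance so the narrative flow is preserved.
from collections import Counter

text = ("the dragon guarded the castle . the knight entered the castle . "
        "the knight fought the dragon")
tokens = [t for t in text.split() if t.isalpha() and t not in {"the", "a"}]

counts = Counter(tokens)
THRESHOLD = 2  # assumed cut-off separating "true" terms from noise

terms_in_order = []
for t in tokens:
    if counts[t] >= THRESHOLD and t not in terms_in_order:
        terms_in_order.append(t)

# Chain consecutive terms to keep the story's original flow as ontology relations.
relations = list(zip(terms_in_order, terms_in_order[1:]))
print("terms:", terms_in_order)
print("flow relations:", relations)
```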


Symmetry ◽  
2018 ◽  
Vol 10 (8) ◽  
pp. 347 ◽  
Author(s):  
Mohanad Aljanabi ◽  
Yasa Özok ◽  
Javad Rahebi ◽  
Ahmad Abdullah

The occurrence rates of melanoma are rising rapidly, resulting in higher death rates. However, if melanoma is diagnosed in Phase I, survival rates increase. Segmentation of the melanoma is one of the most challenging tasks, with both under- and over-segmentation to contend with. In this work, a new approach based on the artificial bee colony (ABC) algorithm is proposed for the detection of melanoma from digital images. This method is simple, fast, flexible, and requires fewer parameters compared with other algorithms. The proposed approach is applied to the PH2, ISBI 2016 challenge, ISBI 2017 challenge, and Dermis datasets. These databases contain images affected by different abnormalities and collected from different sources, with varying resolution, lighting, and other conditions, so in the first step noise was removed from the images using morphological filtering. In the next step, the ABC algorithm is used to find the optimum threshold value for melanoma detection. The proposed approach achieved good results under conditions of high specificity. The experimental results suggest that the proposed method achieved high performance compared with ground-truth images provided by a dermatologist. For melanoma detection, the method achieved average accuracy and Jaccard's coefficient in the ranges of 95.24–97.61% and 83.56–85.25%, respectively, across the four databases. To show the robustness of this work, the results were compared with existing melanoma detection methods in the literature. High estimation-performance values confirmed that the proposed melanoma detection is better than other algorithms, which demonstrates the highly differential power of the newly introduced features.
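
A simplified picture of how ABC can search for a segmentation threshold is given below. This is not the paper's implementation: the fitness function here is Otsu-style between-class variance on a synthetic grayscale image, the employed and onlooker phases are merged for brevity, and colony parameters are arbitrary.

```python
# Simplified, illustrative artificial bee colony (ABC) search for an image threshold.
import numpy as np

rng = np.random.default_rng(0)
# Synthetic bimodal "lesion" image: dark lesion pixels and brighter skin pixels.
image = np.concatenate([rng.normal(60, 15, 4000), rng.normal(170, 20, 6000)])
image = np.clip(image, 0, 255)

def fitness(t):
    """Between-class variance of the two-class split at threshold t (higher is better)."""
    fg, bg = image[image <= t], image[image > t]
    if len(fg) == 0 or len(bg) == 0:
        return 0.0
    w0, w1 = len(fg) / len(image), len(bg) / len(image)
    return w0 * w1 * (fg.mean() - bg.mean()) ** 2

N_BEES, LIMIT, ITERS = 10, 5, 50
food = rng.uniform(0, 255, N_BEES)          # candidate thresholds (food sources)
trials = np.zeros(N_BEES)

for _ in range(ITERS):
    # Employed/onlooker phases (merged for brevity): perturb each source locally.
    for i in range(N_BEES):
        k = rng.integers(N_BEES)
        candidate = np.clip(food[i] + rng.uniform(-1, 1) * (food[i] - food[k]), 0, 255)
        if fitness(candidate) > fitness(food[i]):
            food[i], trials[i] = candidate, 0
        else:
            trials[i] += 1
    # Scout phase: abandon stagnant sources.
    for i in range(N_BEES):
        if trials[i] > LIMIT:
            food[i], trials[i] = rng.uniform(0, 255), 0

best = food[np.argmax([fitness(t) for t in food])]
print(f"ABC-selected threshold: {best:.1f}")
```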


Infolib ◽  
2021 ◽  
Vol 26 (2) ◽  
pp. 30-33
Author(s):  
Anastasia Privalova

The activity of the modern university library is connected, among other things, with creating strong links between educational and scientific organizations. The goal of such cooperation is to improve the quality and accessibility of Russian higher education, including its digital forms. These problems are addressed by a non-profit project of the electronic library system Lan, called the Consortium of Network Electronic Libraries (Consortium NEL). It helps universities to optimize costs and to increase the number of books and manuals at no charge. The project already involves 284 universities from Russia and the CIS countries (Kazakhstan, Belarus). Their publications form a collection of 37,000 textbooks, manuals, workshops, and lecture courses on the Lan platform. The article describes how the project was created, its current state, and its opportunities. The experience of these 284 Russian and CIS universities can be useful for libraries of all educational organizations.


2010 ◽  
pp. 170-184
Author(s):  
David DiBiase ◽  
Mark Gahegan

This chapter investigates the problem of connecting advanced domain knowledge (from geography educators in this instance) with the strong pedagogic descriptions provided by colleagues from the University of Southampton, as described in Chapter IX, and then adding to this the learning materials that together comprise a learning object. Specifically, the chapter describes our efforts to enhance our open-source concept mapping tool (ConceptVista) with a variety of tools and methods that support the visualization, integration, packaging, and publishing of learning objects. We give examples of learning objects created from existing course materials, but enhanced with formal descriptions of both domain content and pedagogy. We then show how such descriptions can offer significant advantages in terms of making domain and pedagogic knowledge explicit, browsing such knowledge to better communicate educational aims and processes, tracking the development of ideas amongst the learning community, providing richer indices into learning material, and packaging these learning materials together with their descriptive knowledge. We explain how the resulting learning objects might be deployed within next-generation digital libraries that provide rich search languages to help educators locate useful learning objects from vast collections of learning materials.


2020 ◽  
Vol 21 (S5) ◽  
Author(s):  
Jaehyun Lee ◽  
Doheon Lee ◽  
Kwang Hyung Lee

Abstract Biological contextual information helps in understanding the various phenomena occurring in biological systems, which consist of complex molecular relations. The construction of context-specific relational resources relies largely on laborious manual extraction from unstructured literature. In this paper, we propose COMMODAR, a machine learning-based literature mining framework for context-specific molecular relations using multimodal representations. The main idea of COMMODAR is feature augmentation through the cooperation of multimodal representations for relation extraction. We leveraged biomedical domain knowledge as well as canonical linguistic information for more comprehensive representations of textual sources. The models based on multiple modalities outperformed those based solely on the linguistic modality. We applied COMMODAR to 14 million PubMed abstracts and extracted 9214 context-specific molecular relations. All corpora, extracted data, evaluation results, and the implementation code are downloadable at https://github.com/jae-hyun-lee/commodar. CCS concepts: • Computing methodologies → Information extraction • Computing methodologies → Neural networks • Applied computing → Biological networks.
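
The feature-augmentation idea can be illustrated very crudely as concatenating a linguistic feature vector for a candidate relation with a domain-knowledge embedding of the entity pair before classification. This is a toy sketch, not COMMODAR itself; the vectors, labels, and dimensions below are fabricated for demonstration.

```python
# Toy sketch of multimodal feature augmentation for relation extraction.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_samples = 200
linguistic = rng.normal(size=(n_samples, 32))   # e.g., sentence-encoder features
domain_kg = rng.normal(size=(n_samples, 16))    # e.g., entity-pair embeddings from a KG
labels = (linguistic[:, 0] + domain_kg[:, 0] > 0).astype(int)  # synthetic relation labels

# Feature augmentation: simple concatenation of the two modalities.
features = np.hstack([linguistic, domain_kg])

clf = LogisticRegression(max_iter=1000).fit(features[:150], labels[:150])
print("held-out accuracy:", clf.score(features[150:], labels[150:]))
```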


2019 ◽  
Vol 11 (8) ◽  
pp. 180
Author(s):  
Fei Liao ◽  
Liangli Ma ◽  
Jingjing Pei ◽  
Linshan Tan

Military named entity recognition (MNER) is one of the key technologies in military information extraction. Traditional methods for the MNER task rely on cumbersome feature engineering and specialized domain knowledge. To solve this problem, we propose a method employing a bidirectional long short-term memory (BiLSTM) neural network with a self-attention mechanism to identify military entities automatically. We obtain distributed vector representations of the military corpus by unsupervised learning, and the BiLSTM model combined with the self-attention mechanism is adopted to fully capture the contextual information carried by the character vector sequence. The experimental results show that the self-attention mechanism can effectively improve the performance of the MNER task. The F-scores on military documents and network military texts were 90.15% and 89.34%, respectively, better than those of other models.
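
A minimal PyTorch sketch of a BiLSTM tagger with a self-attention layer, in the spirit of the model described above, is shown below. The vocabulary size, dimensions, tag set, and random inputs are placeholders; the original work uses distributed character vectors learned by unsupervised training rather than a randomly initialized embedding.

```python
# Minimal sketch: BiLSTM + self-attention for per-token (per-character) tagging.
import torch
import torch.nn as nn

class BiLSTMSelfAttentionTagger(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=64, hidden=64, num_tags=5, heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=heads, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_tags)

    def forward(self, token_ids):
        x = self.embed(token_ids)                 # (batch, seq, emb_dim)
        h, _ = self.bilstm(x)                     # (batch, seq, 2*hidden)
        ctx, _ = self.attn(h, h, h)               # self-attention over the sequence
        return self.classifier(ctx)               # per-token tag logits

model = BiLSTMSelfAttentionTagger()
tokens = torch.randint(0, 1000, (2, 12))          # two toy sequences of 12 characters
logits = model(tokens)
print(logits.shape)                               # torch.Size([2, 12, 5])
```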


2008 ◽  
Vol 18 (05) ◽  
pp. 405-418 ◽  
Author(s):  
ADNAN KHASHMAN ◽  
BORAN SEKEROGLU

Advances in digital technologies have allowed us to generate more images than ever. Images of scanned documents are one example and form a vital part of digital libraries and archives. Scanned degraded documents contain background noise and varying contrast and illumination; therefore, document image binarisation must be performed in order to separate the foreground from the background layers. Image binarisation is performed using either local adaptive thresholding or global thresholding, with local thresholding generally considered more successful. This paper presents a novel approach to global thresholding, where a neural network is trained using local threshold values of an image in order to determine an optimum global threshold value, which is then used to binarise the whole image. The proposed method is compared with five local thresholding methods, and the experimental results indicate that it is computationally cost-effective and capable of binarising scanned degraded documents with superior results.
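
The following sketch illustrates the general idea rather than the paper's network: compute per-block local threshold estimates, train a small regressor to map them to a single global threshold, and binarise with the predicted value. The synthetic document images, the block-mean local threshold, and the training target (the image mean) are all assumptions made for this demo.

```python
# Illustrative sketch: local threshold features -> small neural net -> global threshold.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

def make_document(bias):
    """Synthetic 64x64 'scanned page': bright background, dark text, added noise."""
    img = np.full((64, 64), 200.0 + bias)
    img[16:48, 8:56] -= 120.0                      # dark 'text' region
    return np.clip(img + rng.normal(0, 10, img.shape), 0, 255)

def local_thresholds(img, block=16):
    """Mean intensity of each block as a crude local threshold estimate."""
    return np.array([img[r:r+block, c:c+block].mean()
                     for r in range(0, img.shape[0], block)
                     for c in range(0, img.shape[1], block)])

# Build a tiny training set; the target global threshold is the image mean here.
images = [make_document(b) for b in rng.uniform(-30, 30, 100)]
X = np.stack([local_thresholds(im) for im in images])
y = np.array([im.mean() for im in images])

net = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0).fit(X, y)

test = make_document(10.0)
global_t = net.predict(local_thresholds(test).reshape(1, -1))[0]
binary = (test > global_t).astype(np.uint8)        # 1 = background, 0 = foreground text
print(f"predicted global threshold: {global_t:.1f}, foreground pixels: {(binary == 0).sum()}")
```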

