Co-Citation Count vs Correlation for Influence Network Visualization

2003 ◽  
Vol 2 (3) ◽  
pp. 160-170 ◽  
Author(s):  
Steven Noel ◽  
Chee-Hung Henry Chu ◽  
Vijay Raghavan

Visualization of author or document influence networks as a two-dimensional image can provide key insights into the direct influence of authors or documents on each other in a document collection. The influence network is constructed based on the minimum spanning tree, in which the nodes are documents and an edge is the most direct influence between two documents. Influence network visualizations have typically relied on co-citation correlation as a measure of document similarity. That is, the similarity between two documents is computed by correlating the sets of citations to each of the two documents. In a different line of research, co-citation count (the number of times two documents are jointly cited) has been applied as a document similarity measure. In this work, we demonstrate the impact of each of these similarity measures on the document influence network. We provide examples, and analyze the significance of the choice of similarity measure. We show that correlation-based visualizations exhibit chaining effects (low average vertex degree), a manifestation of multiple minor variations in document similarities. These minor similarity variations are absent in count-based visualizations. The result is that count-based influence network visualizations are more consistent with the intuitive expectation of authoritative documents being hubs that directly influence large numbers of documents.
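
As a hedged illustration of the two similarity measures being contrasted (not the authors' implementation), the sketch below computes both on a toy binary citing-by-cited matrix; the data are invented, and in practice the influence network would be the minimum spanning tree built over the resulting pairwise similarities.

```python
# Minimal sketch, assuming a binary citation matrix: rows are citing documents,
# columns are cited documents, entry 1 means "row cites column". Data are invented.
import numpy as np

citations = np.array([
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [1, 0, 1, 1],
])

def cocitation_count(c, i, j):
    """Number of documents that cite both document i and document j."""
    return int(np.sum(c[:, i] * c[:, j]))

def cocitation_correlation(c, i, j):
    """Pearson correlation between the citation profiles of documents i and j."""
    return float(np.corrcoef(c[:, i], c[:, j])[0, 1])

print(cocitation_count(citations, 0, 1))        # joint-citation count
print(cocitation_correlation(citations, 0, 1))  # citation-profile correlation
```

The count ignores the minor variations in citation profiles that the correlation picks up, which is the source of the chaining effect described above.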

Author(s):  
Sri Andayani ◽  
Ady Ryansyah

Document similarity measurement is a time-consuming problem. The large number of documents and the large number of pages per document make similarity measurement a complicated and laborious job to carry out manually. In this research, a system that automatically measures the similarity between documents is built by implementing TF-IDF. Measurement is carried out by first creating a vector representation of the documents being compared; this vector representation contains the weight of each term in the documents. The similarity value is then calculated using cosine similarity. The finished system can compare documents in PDF or Word format. Document comparison can be done using all the chapters in a report, or only a few selected chapters that are considered significant. Based on the experiments, it can be concluded that TF-IDF needs at least three documents to be available in the document collection being processed. The correlation test shows that, for documents in PDF format, there is a significant correlation between the number of characters in a document and the processing time.
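
A minimal sketch of the pipeline described above, using scikit-learn's TF-IDF vectorizer and cosine similarity; the example documents are placeholders, and the extraction of text from PDF or Word files is assumed to have already happened.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder documents; at least three are used, matching the finding that
# TF-IDF degenerates when the collection being processed is too small.
docs = [
    "information retrieval with tf idf term weighting",
    "measuring document similarity using cosine of tf idf vectors",
    "an unrelated report about greenhouse farming",
]

vectors = TfidfVectorizer().fit_transform(docs)  # one weighted term vector per document
similarity = cosine_similarity(vectors)          # pairwise cosine similarity matrix
print(similarity.round(2))
```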


2019 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Manjula Wijewickrema ◽  
Vivien Petras ◽  
Naomal Dias

Purpose The purpose of this paper is to develop a journal recommender system, which compares the content similarities between a manuscript and the existing journal articles in two subject corpora (covering the social sciences and medicine). The study examines the appropriateness of three text similarity measures and the impact of numerous aspects of corpus documents on system performance. Design/methodology/approach The authors implemented three similarity measures, one at a time, in a journal recommender system with two separate journal corpora. Two distinct samples of test abstracts were classified and evaluated using the normalized discounted cumulative gain. Findings The BM25 similarity measure outperforms both the cosine and unigram language similarity measures overall. The unigram language measure shows the lowest performance. The performance results are significantly different between each pair of similarity measures, while the BM25 and cosine similarity measures are moderately correlated. The cosine similarity achieves better performance for subjects with a higher density of technical vocabulary and shorter corpus documents. Moreover, increasing the number of corpus journals in the domain of the social sciences improved performance for cosine similarity and BM25. Originality/value This is the first work to compare the suitability of a number of string-based similarity measures with distinct corpora for journal recommender systems.
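
As a hedged sketch of the best-performing measure (not the authors' system), BM25 scoring of a manuscript abstract against candidate journal corpora might look as follows, here using the third-party rank_bm25 package; the corpus texts and query are invented placeholders.

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

corpus = [
    "randomized trial of statin therapy in cardiovascular disease",
    "survey methods in social science research",
    "qualitative analysis of classroom interaction",
]
tokenized_corpus = [doc.split() for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)
query = "statin therapy outcomes in a randomized cardiovascular trial".split()

scores = bm25.get_scores(query)                  # one BM25 score per corpus document
ranking = sorted(enumerate(scores), key=lambda x: -x[1])
print(ranking)                                   # best-matching corpus first
```

Ranked lists of this kind are what the normalized discounted cumulative gain evaluation in the study operates on.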


2016 ◽  
Vol 55 ◽  
pp. 283-316 ◽  
Author(s):  
Jan Rupnik ◽  
Andrej Muhic ◽  
Gregor Leban ◽  
Primoz Skraba ◽  
Blaz Fortuna ◽  
...  

In today's world, we follow news that is distributed globally. Significant events are reported by different sources and in different languages. In this work, we address the problem of tracking events in a large multilingual stream. Within Event Registry, a recently developed system, we examine two aspects of this problem: how to compare articles in different languages and how to link collections of articles in different languages that refer to the same event. Taking a multilingual stream and clusters of articles from each language, we compare different cross-lingual document similarity measures based on Wikipedia. This allows us to compute the similarity of any two articles regardless of language. Building on previous work, we show that there are methods which scale well and can compute a meaningful similarity between articles from languages with little or no direct overlap in the training data. Using this capability, we then propose an approach to link clusters of articles across languages which represent the same event. We provide an extensive evaluation of the system as a whole, as well as an evaluation of the quality and robustness of the similarity measure and the linking algorithm.
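
The abstract does not spell out the similarity computation, but a hedged sketch of the general idea behind Wikipedia-based cross-lingual similarity is shown below: if articles in any language can be mapped to weights over language-independent Wikipedia concepts, similarity reduces to a cosine between concept vectors. The concept identifiers and weights here are invented, and the paper's actual mapping and weighting are more involved.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sparse concept-weight dictionaries."""
    shared = set(a) & set(b)
    dot = sum(a[c] * b[c] for c in shared)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical language-independent concept identifiers with invented weights.
english_article = {"concept/earthquake": 0.8, "concept/japan": 0.4}
german_article  = {"concept/earthquake": 0.7, "concept/tsunami": 0.5}

print(cosine(english_article, german_article))
```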


Author(s):  
B. Mathura Bai ◽  
N. Mangathayaru ◽  
B. Padmaja Rani ◽  
Shadi Aljawarneh

Missing attribute values in medical datasets are one of the most common problems faced when mining medical datasets. Estimation of missing values is a major challenge in the pre-processing of datasets. Any wrong estimate of missing attribute values can lead to inefficient and improper classification, resulting in lower classifier accuracies. Similarity measures play a key role during the imputation process. The use of an appropriate and better similarity measure can help to achieve better imputation and improved classification accuracies. This paper proposes a novel imputation measure for finding the similarity between missing and non-missing instances in medical datasets. Experiments are carried out by applying both the proposed imputation technique and popular existing benchmark imputation techniques. Classification is carried out using the KNN, J48, SMO, and RBFN classifiers. The experimental analysis showed that, after imputation of medical records using the proposed imputation technique, the classification accuracies reported by the KNN, J48, and SMO classifiers improved compared with the other existing benchmark imputation techniques.
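
The proposed measure itself is not given in the abstract, so the following is only a hedged baseline sketch of similarity-driven imputation followed by classification, using scikit-learn's k-nearest-neighbour imputer and KNN classifier on an invented toy dataset.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.neighbors import KNeighborsClassifier

# Tiny invented dataset with missing attribute values (np.nan).
X = np.array([[1.0, 2.0, np.nan],
              [2.0, np.nan, 3.0],
              [1.5, 2.5, 3.5],
              [3.0, 4.0, 5.0]])
y = np.array([0, 0, 1, 1])

X_filled = KNNImputer(n_neighbors=2).fit_transform(X)   # fill gaps from similar records
clf = KNeighborsClassifier(n_neighbors=3).fit(X_filled, y)
print(clf.predict(X_filled))
```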


2018 ◽  
Author(s):  
Dave L Dixon ◽  
William L Baker

BACKGROUND The impact and quality of a faculty member's publications is a key factor in promotion and tenure decisions and career advancement. Traditional measures, including citation counts and journal impact factor, have notable limitations. Since 2010, alternative metrics have been proposed as another means of assessing the impact and quality of scholarly work. The Altmetric Attention Score is an objective score frequently used to determine the immediate reach of a published work across the web, including news outlets, blogs, social media, and more. Several studies evaluating the correlation between the Altmetric Attention Score and number of citations have found mixed results, which may be discipline-specific. OBJECTIVE To determine the correlation between higher Altmetric Attention Scores and citation count for journal articles published in major pharmacy journals. METHODS This cross-sectional study evaluated articles from major pharmacy journals ranked in the top 10% according to the Altmetric Attention Score. The sources of attention that determined the Altmetric Attention Score were obtained, as well as each article's open access status, article type, study design, and topic. Correlation between journal characteristics, including the Altmetric Attention Score and number of citations, was assessed using Spearman's correlation test. A Kruskal-Wallis 1-way analysis of variance (ANOVA) was used to compare the Altmetric Attention Scores between journals. RESULTS Six major pharmacy journals were identified. A total of 1,376 articles were published in 2017, and 137 of these represented the top 10% with the highest Altmetric Attention Scores. The median Altmetric Attention Score was 19 (IQR 15-28). Twitter and Mendeley were the most common sources of attention. Over half (56.2%) of the articles were original investigations, and 49.8% were either cross-sectional, qualitative, or cohort studies. No significant correlation was found between the Altmetric Attention Score and citation count (rs=0.07, P = 0.485). Mendeley was the only attention source that correlated with the number of citations (rs=0.486, P<0.001). The median Altmetric Attention Score varied widely between journals (P<0.001). CONCLUSIONS The overall median Altmetric Attention Score of 19 suggests that articles published in major pharmacy journals are near the top 5% of all scientific output. However, we found no correlation between the Altmetric Attention Score and number of citations for articles published in major pharmacy journals in 2017.
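
A minimal sketch of the correlation analysis described above, using SciPy's Spearman rank correlation; the Altmetric Attention Scores and citation counts below are invented placeholders rather than the study's data.

```python
from scipy.stats import spearmanr

# Invented paired observations: one Altmetric Attention Score and one citation
# count per article.
altmetric_scores = [19, 45, 15, 102, 28, 16, 33, 21]
citation_counts  = [ 3,  1,  5,   2,  7,  0,  4,  2]

rho, p_value = spearmanr(altmetric_scores, citation_counts)
print(f"rs = {rho:.3f}, p = {p_value:.3f}")
```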


2018 ◽  
Author(s):  
Edmund W. J. Lee ◽  
Han Zheng ◽  
Htet Htet Aung ◽  
Megha Rani Aroor ◽  
Chen Li ◽  
...  

BACKGROUND Promoting safety and health awareness and mitigating risks are of paramount importance to companies in high-risk industries. Yet, very few studies have synthesized findings from the existing online workplace safety and health literature to identify the key factors related to (a) safety awareness, (b) safety risks, (c) health awareness, and (d) health risks. OBJECTIVE As one of the first systematic reviews in the area of workplace health and safety, this study aims to identify the factors related to safety and health awareness as well as risks, and to systematically map these factors at three levels: organizational, cultural, and individual. This review also aims to assess the impact of these workplace safety and health publications in both academic (e.g., academic databases, Mendeley, and PlumX) and non-academic settings (e.g., social media platforms). METHODS The systematic review was conducted in line with the procedures recommended by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA). First, ProQuest, ScienceDirect, and Scopus were identified as suitable databases for the systematic review. Second, after inputting search queries related to safety and health awareness and risks, the articles were evaluated based on a set of inclusion and exclusion criteria. Third, the factors identified in the included articles were coded systematically. Fourth, the research team assessed the impact of the articles through a combination of traditional and new metric analysis methods: citation count, Altmetric Attention Score, Mendeley reader count, usage count, and capture count. RESULTS Out of a total of 4,831 articles retrieved from the three databases, 51 articles were included in the final sample and were systematically coded. The results revealed six organizational-level factors (management commitment, management support, organizational safety communication, safety management systems, physical work environment, and organizational environment), two cultural-level factors (interpersonal support and organizational culture), and four individual-level factors (perception, motivation, attitude, and behavior) that relate to safety and health awareness and risk. In terms of impact, the relationships between citation count and the various metrics measuring academic activity (e.g., Mendeley readers, usage count, and capture count) were mostly significant, while the relationship between citation count and the Altmetric Attention Score was non-significant. CONCLUSIONS This study provides a macro view of the current state of workplace safety and health research and gives scholars an indication of some of the key factors of safety and health awareness and risks. Researchers should also be cognizant that while their work may receive attention from the scholarly community, it is important to tailor their communication messages for the respective industries they are studying to maximize the receptivity and impact of their findings. CLINICALTRIAL N.A.


2021 ◽  
pp. 1-9
Author(s):  
Daniel P. Sew ◽  
Nigel E. Drury

Objective: The citation history of a published article reflects its impact on the literature over time. We conducted a comprehensive bibliometric analysis to identify the most cited papers on CHD in children. Methods: One hundred and ninety journals listed in Journal Citation Reports were accessed via Web of Science. Publications with 250 or more citations were identified from Science Citation Index Expanded (1900–2020), and those relating to structural CHD in children were reviewed. Articles were ranked by citation count and the 100 most cited were analysed. Results: The number of citations ranged from 2522 to 309 (median 431, IQR 356–518), with 35 published since 2000. All were written in English, most originated from the United States (74%), and were published in cardiovascular journals, with Circulation (28%) the most frequent. There were 86 original research articles, including 50 case series, 14 cohort studies, and 10 clinical trials. The most cited paper was by Hoffman JI and Kaplan S on the incidence of CHD. Thirteen authors had 4 or more publications in the top 100, all of whom had worked in Boston, Philadelphia, San Francisco, or Dallas, and the most prolific author was Newburger JW (9 articles). Conclusions: Citation analysis provides a historical perspective on scientific progress by assessing the impact of individual articles. Our study highlights the dominant position of US-based researchers and journals in this field. Most of the highly cited articles remain case series, with few randomised controlled trials in CHD appearing in recent years.


2020 ◽  
Vol 7 (1) ◽  
Author(s):  
Ali A. Amer ◽  
Hassan I. Abdalla

Similarity measures have long been utilized in the information retrieval and machine learning domains for multiple purposes, including text retrieval, text clustering, text summarization, plagiarism detection, and several other text-processing applications. However, the problem with these measures is that, until recently, no single measure had been recorded to be highly effective and efficient at the same time. Thus, the quest for an efficient and effective similarity measure is still an open-ended challenge. This study, in consequence, introduces a new highly effective and time-efficient similarity measure for text clustering and classification. Furthermore, the study aims to provide a comprehensive scrutinization of seven of the most widely used similarity measures, mainly concerning their effectiveness and efficiency. Using the K-nearest neighbor algorithm (KNN) for classification, the K-means algorithm for clustering, and the bag-of-words (BoW) model for feature selection, all similarity measures are carefully examined in detail. The experimental evaluation was made on two of the most popular datasets, namely Reuters-21 and Web-KB. The obtained results confirm that the proposed set theory-based similarity measure (STB-SM), as a pre-eminent measure, significantly outperforms all state-of-the-art measures with regard to both effectiveness and efficiency.
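
Since the abstract does not define STB-SM, the sketch below only illustrates the evaluation setup in hedged form: a nearest-neighbour text classifier over bag-of-words token sets, with Jaccard standing in for the set-based similarity; the documents and labels are invented.

```python
def jaccard(a, b):
    """Set-based similarity between two token sets (a stand-in, not STB-SM)."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Invented labelled training documents as bag-of-words sets.
train = [
    ({"stock", "market", "shares"}, "finance"),
    ({"match", "goal", "league"}, "sport"),
]

def nn_predict(doc_tokens, train):
    """1-nearest-neighbour prediction under the chosen set-based similarity."""
    _, best_label = max(train, key=lambda item: jaccard(doc_tokens, item[0]))
    return best_label

print(nn_predict({"shares", "market", "rally"}, train))  # -> "finance"
```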


2021 ◽  
Author(s):  
Antonios Makris ◽  
Camila Leite da Silva ◽  
Vania Bogorny ◽  
Luis Otavio Alvares ◽  
Jose Antonio Macedo ◽  
...  

During the last few years, the volume of data that constitutes trajectories has expanded to unparalleled quantities. This growth is challenging traditional trajectory analysis approaches, and solutions are sought in other domains. In this work, we focus on data compression techniques with the intention of minimizing the size of trajectory data while, at the same time, minimizing the impact on trajectory analysis methods. To this end, we evaluate five lossy compression algorithms: Douglas-Peucker (DP), Time Ratio (TR), Speed Based (SP), Time Ratio Speed Based (TR_SP), and Speed Based Time Ratio (SP_TR). The comparison is performed using four distinct real-world datasets against six different dynamically assigned thresholds. The effectiveness of the compression is evaluated using classification techniques and similarity measures. The results showed that there is a trade-off between the compression rate and the achieved quality. There is no "best algorithm" for every case, and the choice of the proper compression algorithm is an application-dependent process.
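
As a hedged sketch of one of the five algorithms compared (not the paper's implementation), the classic Douglas-Peucker simplification keeps only the points whose perpendicular distance from the current baseline exceeds a tolerance epsilon; the toy track below uses plain (x, y) points, whereas TR and SP additionally use timestamps.

```python
import math

def perpendicular_distance(p, start, end):
    """Distance from point p to the line segment baseline through start and end."""
    if start == end:
        return math.dist(p, start)
    (x1, y1), (x2, y2), (x0, y0) = start, end, p
    num = abs((y2 - y1) * x0 - (x2 - x1) * y0 + x2 * y1 - y2 * x1)
    return num / math.dist(start, end)

def douglas_peucker(points, epsilon):
    """Recursively drop points that deviate less than epsilon from the baseline."""
    if len(points) < 3:
        return points
    dists = [perpendicular_distance(p, points[0], points[-1]) for p in points[1:-1]]
    idx, dmax = max(enumerate(dists, start=1), key=lambda t: t[1])
    if dmax <= epsilon:
        return [points[0], points[-1]]            # the whole span is near-straight
    left = douglas_peucker(points[:idx + 1], epsilon)
    right = douglas_peucker(points[idx:], epsilon)
    return left[:-1] + right                      # avoid duplicating the split point

track = [(0, 0), (1, 0.1), (2, -0.1), (3, 5), (4, 6), (5, 7), (6, 8.1), (7, 9)]
print(douglas_peucker(track, epsilon=1.0))
```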


2021 ◽  
Vol 10 (2) ◽  
pp. 90
Author(s):  
Jin Zhu ◽  
Dayu Cheng ◽  
Weiwei Zhang ◽  
Ci Song ◽  
Jie Chen ◽  
...  

People spend more than 80% of their time in indoor spaces, such as shopping malls and office buildings. Indoor trajectories collected by indoor positioning devices, such as WiFi and Bluetooth devices, can reflect human movement behaviors in indoor spaces. Insightful indoor movement patterns can be discovered from indoor trajectories using various clustering methods. These methods are based on a measure that reflects the degree of similarity between indoor trajectories. Researchers have proposed many trajectory similarity measures. However, existing trajectory similarity measures ignore the indoor movement constraints imposed by the indoor space and the characteristics of indoor positioning sensors, which leads to an inaccurate measure of indoor trajectory similarity. Additionally, most of these works focus on the spatial and temporal dimensions of trajectories and pay less attention to indoor semantic information. Integrating indoor semantic information, such as indoor points of interest, into the indoor trajectory similarity measurement is beneficial for discovering pedestrians with similar intentions. In this paper, we propose an accurate and reasonable indoor trajectory similarity measure called the indoor semantic trajectory similarity measure (ISTSM), which considers the features of indoor trajectories and indoor semantic information simultaneously. The ISTSM is modified from the edit distance, a measure of the distance between string sequences. The key component of the ISTSM is an indoor navigation graph, transformed from an indoor floor plan representing the indoor space, for computing accurate indoor walking distances. The indoor walking distances and indoor semantic information are fused into the edit distance seamlessly. The ISTSM is evaluated using a synthetic dataset and a real dataset from a shopping mall. The experiment with the synthetic dataset reveals that the ISTSM is more accurate and reasonable than three other popular trajectory similarity measures, namely the longest common subsequence (LCSS), edit distance on real sequence (EDR), and the multidimensional similarity measure (MSM). The case study of the shopping mall shows that the ISTSM effectively reveals the movement patterns of indoor customers.
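
A hedged sketch of the edit-distance backbone that the ISTSM builds on is given below: a standard dynamic-programming edit distance over sequences of visited indoor places with a pluggable substitution cost. In the ISTSM that cost would combine indoor walking distance from the navigation graph with semantic information; here it is only a 0/1 mismatch cost, and the trajectories are invented.

```python
def edit_distance(seq_a, seq_b, subst_cost):
    """Dynamic-programming edit distance over two symbol sequences."""
    n, m = len(seq_a), len(seq_b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j] + 1,                                           # delete
                dp[i][j - 1] + 1,                                           # insert
                dp[i - 1][j - 1] + subst_cost(seq_a[i - 1], seq_b[j - 1]),  # substitute/match
            )
    return dp[n][m]

mismatch = lambda a, b: 0 if a == b else 1       # 0/1 cost; ISTSM would use a richer cost
traj_a = ["entrance", "cafe", "bookstore", "exit"]
traj_b = ["entrance", "cafe", "pharmacy", "exit"]
print(edit_distance(traj_a, traj_b, mismatch))   # -> 1
```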

