DISC: Disambiguating homonyms using graph structural clustering

2018 ◽  
Vol 44 (6) ◽  
pp. 830-847 ◽  
Author(s):  
Ijaz Hussain ◽  
Sohail Asghar

Author name ambiguity degrades information retrieval, database integration, search results and, more importantly, correct attributions in bibliographic databases. Some unresolved issues include how to ascertain the actual number of authors, how to improve the performance and how to make the method more effective in terms of representative clustering metrics (average cluster purity, average author purity, K-metric, pairwise precision, pairwise recall, pairwise-F1, cluster precision, cluster recall and cluster-F1). It is a non-trivial task to disambiguate authors using only the implicit bibliographic information. An effective method, ‘DISC’, is proposed that uses a graph community detection algorithm, feature vectors and graph operations to disambiguate homonyms. The citation data set is pre-processed and ambiguous author blocks are formed. A co-author graph is constructed using authors and their co-author relationships. A graph structural clustering algorithm, ‘gSkeletonClu’, is applied to identify hubs, outliers and clusters of nodes in the co-author graph. Homonyms are resolved by splitting these clusters of nodes across the hub if their feature vector similarity is less than a predefined threshold. DISC utilises only co-authors and titles, which are available in almost all bibliographic databases. With minor modifications, DISC can also be used for entity disambiguation. To validate the DISC performance, experiments are performed on two Arnetminer data sets and compared with five previous unsupervised methods. Despite using limited bibliographic metadata, DISC achieves on average K-metric, pairwise-F1, and cluster-F1 of 92%, 84% and 74%, respectively, using Arnetminer-S and 86%, 80% and 57%, respectively, using Arnetminer-L. About 77.5% and 73.2% of clusters are within the range (ground truth clusters ± 3) in Arnetminer-S and Arnetminer-L, respectively.
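The pairwise precision, pairwise recall and pairwise-F1 metrics named in the abstract above can be illustrated with a minimal sketch. It uses the standard pairwise definitions, which are assumed here to match the paper's usage. The function name `pairwise_scores` and the toy labels are hypothetical and are not taken from the DISC paper.

```python
from itertools import combinations


def pairwise_scores(predicted, truth):
    """Pairwise precision, recall and F1 for a clustering.

    predicted, truth: dicts mapping each publication id to a cluster label.
    A pair of publications counts as "linked" when both share a label.
    """
    items = sorted(predicted)
    pred_pairs = {p for p in combinations(items, 2)
                  if predicted[p[0]] == predicted[p[1]]}
    true_pairs = {p for p in combinations(items, 2)
                  if truth[p[0]] == truth[p[1]]}
    correct = len(pred_pairs & true_pairs)

    precision = correct / len(pred_pairs) if pred_pairs else 0.0
    recall = correct / len(true_pairs) if true_pairs else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example (hypothetical data): four papers, two true authors.
predicted = {"p1": "A", "p2": "A", "p3": "A", "p4": "B"}
truth = {"p1": "X", "p2": "X", "p3": "Y", "p4": "Y"}
print(pairwise_scores(predicted, truth))
```

In this toy case three pairs are predicted as linked, two pairs are truly linked, and one pair is in both sets, so precision is 1/3 and recall is 1/2.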

2021 ◽  
pp. 2142002
Author(s):  
Giuseppe Agapito ◽  
Marianna Milano ◽  
Mario Cannataro

A new coronavirus, causing a severe acute respiratory syndrome (COVID-19), emerged in Wuhan, China, in December 2019. The epidemic has rapidly spread across the world, becoming a pandemic that, as of today, has affected more than 70 million people, causing over 2 million deaths. To better understand the evolution of the spread of the COVID-19 pandemic, we developed PANC (Parallel Network Analysis and Communities Detection), a new parallel preprocessing methodology for network-based analysis and community detection on Italian COVID-19 data. The goal of the methodology is to analyze a set of homogeneous datasets (i.e. COVID-19 data in several regions) using a statistical test to find similar/dissimilar behaviours, mapping this similarity information onto a graph and then using a community detection algorithm to visualize and analyze the initial dataset. The methodology includes the following steps: (i) a parallel methodology to build similarity matrices that represent similar or dissimilar regions with respect to data; (ii) an effective workload balancing function to improve performance; (iii) the mapping of similarity matrices into networks where nodes represent Italian regions, and edges represent similarity relationships; (iv) the discovery and visualization of communities of regions that show similar behaviour. The methodology is general and can be applied to world-wide data about COVID-19, as well as to all types of data sets in tabular and matrix format. To estimate the scalability with increasing workloads, we analyzed three synthetic COVID-19 datasets with sizes of 90.0 MB, 180.0 MB, and 360.0 MB. Experiments showed that the amount of data that can be analyzed in a given amount of time increases almost linearly with the number of computing resources available. To perform community detection, we instead employed the real data set.
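Steps (iii) and (iv) of the abstract above, mapping a similarity matrix onto a graph and then grouping nodes, can be sketched as follows. This is an illustration under stated assumptions, not the PANC implementation: the threshold rule, the region labels and the similarity values are invented, and plain connected components stand in for whichever community detection algorithm the authors use.

```python
def similarity_graph(names, sim, threshold):
    """Link two regions when their similarity reaches the threshold.

    names: list of region labels; sim: symmetric matrix (list of lists).
    Returns an adjacency dict mapping each label to a set of neighbours.
    """
    graph = {name: set() for name in names}
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if sim[i][j] >= threshold:
                graph[names[i]].add(names[j])
                graph[names[j]].add(names[i])
    return graph


def connected_groups(graph):
    """Group nodes by connected component (a simple stand-in for
    community detection, assumed here for illustration only)."""
    seen, groups = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(graph[node] - seen)
        groups.append(sorted(group))
    return groups


# Hypothetical regions and made-up similarity values.
names = ["R1", "R2", "R3", "R4"]
sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
print(connected_groups(similarity_graph(names, sim, threshold=0.5)))
```

A real analysis would replace the fixed threshold with the outcome of the statistical test the abstract describes.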


Sensors ◽  
2020 ◽  
Vol 20 (3) ◽  
pp. 879 ◽  
Author(s):  
Uwe Köckemann ◽  
Marjan Alirezaie ◽  
Jennifer Renoux ◽  
Nicolas Tsiftes ◽  
Mobyen Uddin Ahmed ◽  
...  

As research in smart homes and activity recognition is increasing, it is of ever-increasing importance to have benchmark systems and data upon which researchers can compare methods. While synthetic data can be useful for certain method developments, real data sets that are open and shared are equally important. This paper presents the E-care@home system, its installation in a real home setting, and a series of data sets that were collected using the E-care@home system. Our first contribution, the E-care@home system, is a collection of software modules for data collection, labeling, and various reasoning tasks such as activity recognition, person counting, and configuration planning. It supports a heterogeneous set of sensors that can be extended easily and connects collected sensor data to higher-level Artificial Intelligence (AI) reasoning modules. Our second contribution is a series of open data sets which can be used to recognize activities of daily living. In addition to these data sets, we describe the technical infrastructure that we have developed to collect the data and the physical environment. Each data set is annotated with ground-truth information, making it relevant for researchers interested in benchmarking different algorithms for activity recognition.


Separations ◽  
2018 ◽  
Vol 5 (3) ◽  
pp. 44 ◽  
Author(s):  
Alyssa Allen ◽  
Mary Williams ◽  
Nicholas Thurn ◽  
Michael Sigman

Computational models for determining the strength of fire debris evidence based on likelihood ratios (LR) were developed and validated against data sets derived from different distributions of ASTM E1618-14 designated ignitable liquid class and substrate pyrolysis contributions using in-silico generated data. The models all perform well in cross validation against the distributions used to generate the model. However, a model generated based on data that does not contain representatives from all of the ASTM E1618-14 classes does not perform well in validation with data sets that contain representatives from the missing classes. A quadratic discriminant model based on a balanced data set (ignitable liquid versus substrate pyrolysis), with a uniform distribution of the ASTM E1618-14 classes, performed well (receiver operating characteristic area under the curve of 0.836) when tested against laboratory-developed casework-relevant samples of known ground truth.
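The receiver operating characteristic area under the curve quoted in the abstract above can be read as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one. A minimal sketch of that computation follows. The function name `roc_auc` and the scores are hypothetical and are not data from the study.

```python
def roc_auc(positive_scores, negative_scores):
    """ROC AUC as the fraction of (positive, negative) pairs in which
    the positive sample has the higher score; ties count as half."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Made-up classifier scores: samples with ignitable liquid (positive)
# versus substrate pyrolysis only (negative).
print(roc_auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))
```

Here eight of the nine positive/negative pairs are ordered correctly, so the result is 8/9.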


Geophysics ◽  
2020 ◽  
Vol 85 (5) ◽  
pp. KS149-KS160 ◽  
Author(s):  
Anna L. Stork ◽  
Alan F. Baird ◽  
Steve A. Horne ◽  
Garth Naldrett ◽  
Sacha Lapins ◽  
...  

This study presents the first demonstration of the transferability of a convolutional neural network (CNN) trained to detect microseismic events in one fiber-optic distributed acoustic sensing (DAS) data set to other data sets. DAS increasingly is being used for microseismic monitoring in industrial settings, and the dense spatial and temporal sampling provided by these systems produces large data volumes (approximately 650 GB/day for a 2 km long cable sampling at 2000 Hz with a spatial sampling of 1 m), requiring new processing techniques for near-real-time microseismic analysis. We have trained the CNN known as YOLOv3, an object detection algorithm, to detect microseismic events using synthetically generated waveforms with real noise superimposed. The performance of the CNN network is compared to the number of events detected using filtering and amplitude threshold (short-term average/long-term average) detection techniques. In the data set from which the real noise is taken, the network is able to detect >80% of the events identified by manual inspection and 14% more than detected by standard frequency-wavenumber filtering techniques. The false detection rate is approximately 2% or one event every 20 s. In other data sets, with monitoring geometries and conditions previously unseen by the network, >50% of events identified by manual inspection are detected by the CNN.
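The short-term average/long-term average (STA/LTA) amplitude detector that the abstract above uses as a baseline can be sketched in a few lines. This is a simplified illustration, assuming trailing windows over squared amplitudes and a fixed trigger threshold. The window lengths, the threshold and the toy trace are hypothetical and are not the settings used in the study.

```python
def sta_lta(signal, n_sta, n_lta):
    """STA/LTA ratio of signal energy using trailing windows.

    Returns one ratio per sample from index n_lta - 1 onward, that is,
    for every sample at which the long window is completely filled.
    """
    if not 0 < n_sta < n_lta <= len(signal):
        raise ValueError("need 0 < n_sta < n_lta <= len(signal)")
    energy = [x * x for x in signal]
    ratios = []
    for end in range(n_lta, len(energy) + 1):
        sta = sum(energy[end - n_sta:end]) / n_sta
        lta = sum(energy[end - n_lta:end]) / n_lta
        ratios.append(sta / lta if lta > 0 else 0.0)
    return ratios


def detect(signal, n_sta, n_lta, threshold):
    """Indices of samples whose STA/LTA ratio exceeds the threshold."""
    ratios = sta_lta(signal, n_sta, n_lta)
    return [k + n_lta - 1 for k, r in enumerate(ratios) if r > threshold]


# Toy trace: unit-amplitude noise with a three-sample burst at index 20.
trace = [1.0, -1.0] * 10 + [10.0, -10.0, 10.0] + [1.0, -1.0] * 10
print(detect(trace, n_sta=2, n_lta=10, threshold=3.0))  # [20, 21, 22]
```

The ratio stays at 1 on stationary noise and rises only when recent energy exceeds the longer-term background, which is why a fixed threshold can serve as a trigger.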


2020 ◽  
Vol 34 (35) ◽  
pp. 2050408
Author(s):  
Sumit Gupta ◽  
Dhirendra Pratap Singh

Many real-life problems and application data can be represented as graphs. As technology advances rapidly, applications generate vast amounts of valuable data, and the graphs that represent them grow accordingly. How to extract meaningful information from these data has become a hot research topic, and methodical algorithms are required to extract useful information from the raw data. These unstructured graphs are not scattered in nature; they show relationships between their basic entities. Identifying communities based on these relationships improves the understanding of the applications represented by the graphs. Community detection algorithms are one solution: they divide the graph into small clusters whose nodes are densely connected within the cluster and sparsely connected across clusters. During the last decade, many algorithms have been proposed, which fall into two broad categories: non-overlapping and overlapping community detection algorithms. The goal of this paper is to offer a comparative analysis of the various community detection algorithms. We bring together the state-of-the-art community detection algorithms of these two classes in a single article, along with their accessible benchmark data sets. Finally, we present a comparison of these algorithms with respect to two parameters: time efficiency and how accurately the communities are detected.
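The idea of clusters that are densely connected inside and sparsely connected across can be checked for a given partition by counting edges. The sketch below is only an illustration of that notion: the function name `edge_split` and the toy graph are hypothetical, and it does not implement any of the algorithms surveyed in the paper.

```python
def edge_split(edges, membership):
    """Count edges inside communities versus edges between them.

    edges: iterable of (u, v) pairs for an undirected graph.
    membership: dict mapping each node to its community label
    (non-overlapping communities are assumed).
    """
    edges = list(edges)
    inside = sum(1 for u, v in edges if membership[u] == membership[v])
    return inside, len(edges) - inside


# Toy graph: two triangles joined by a single bridge edge.
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
membership = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"}
print(edge_split(edges, membership))  # (6, 1)
```

A partition with many inside edges and few crossing edges matches the informal definition above. Overlapping communities would need a membership structure that allows several labels per node.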


2019 ◽  
Vol 7 (3) ◽  
pp. SE113-SE122 ◽  
Author(s):  
Yunzhi Shi ◽  
Xinming Wu ◽  
Sergey Fomel

Salt boundary interpretation is important for the understanding of salt tectonics and velocity model building for seismic migration. Conventional methods consist of computing salt attributes and extracting salt boundaries. We have formulated the problem as 3D image segmentation and evaluated an efficient approach based on deep convolutional neural networks (CNNs) with an encoder-decoder architecture. To train the model, we design a data generator that extracts randomly positioned subvolumes from a large-scale 3D training data set followed by data augmentation, then feed a large number of subvolumes into the network while using salt/nonsalt binary labels generated by thresholding the velocity model as ground truth labels. We test the model on validation data sets and compare the blind test predictions with the ground truth. Our results indicate that our method is capable of automatically capturing subtle salt features from the 3D seismic image with little or no need for manual input. We further test the model on a field example to indicate the generalization of this deep CNN method across different data sets.


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Yanjia Tian ◽  
Xiang Feng

With the explosive development of big data, data mining technology has also developed rapidly, and complex networks have become a hot research direction in data mining. In real life, many complex systems use network nodes for intelligent detection. Existing community detection algorithms exhibit many problems in use and therefore need improvement. The new detection algorithm proposed in this paper, CS-Cluster, is derived using the dissimilarity of node proximity. It builds on the IGC-CSM algorithm and makes certain improvements to it. CS-Cluster is evaluated alongside four algorithms: IGC-CSM, SA-Cluster, W-Cluster, and S-Cluster. Results comparing density and entropy values on the Political Blogs and DBLP data sets are shown. Finally, it is concluded that the CS-Cluster algorithm is the best in terms of the effect and quality of clustering and the degree of difference in the subgraph structure of clustering.


2022 ◽  
Vol 12 (1) ◽  
Author(s):  
Zsigmond Benkő ◽  
Tamás Bábel ◽  
Zoltán Somogyvári

Recognition of anomalous events is a challenging but critical task in many scientific and industrial fields, especially when the properties of anomalies are unknown. In this paper, we introduce a new anomaly concept called “unicorn” or unique event and present a new, model-free, unsupervised detection algorithm to detect unicorns. The key component of the new algorithm is the Temporal Outlier Factor (TOF) to measure the uniqueness of events in continuous data sets from dynamic systems. The concept of unique events differs significantly from traditional outliers in many aspects: while repetitive outliers are no longer unique events, a unique event is not necessarily an outlier; it does not necessarily fall out from the distribution of normal activity. The performance of our algorithm was examined in recognizing unique events on different types of simulated data sets with anomalies and it was compared with the Local Outlier Factor (LOF) and discord discovery algorithms. TOF had superior performance compared to LOF and discord detection algorithms even in recognizing traditional outliers and it also detected unique events that those did not. The benefits of the unicorn concept and the new detection method were illustrated by example data sets from very different scientific fields. Our algorithm successfully retrieved unique events in those cases where they were already known such as the gravitational waves of a binary black hole merger on LIGO detector data and the signs of respiratory failure on ECG data series. Furthermore, unique events were found on the LIBOR data set of the last 30 years.


Author(s):  
Himansu Sekhar Pattanayak ◽  
Harsh K. Verma ◽  
Amrit Lal Sangal

Community detection is a pivotal part of network analysis and is classified as an NP-hard problem. In this paper, a novel community detection algorithm is proposed, which probabilistically predicts communities’ diameter using the local information of random seed nodes. The gravitation method is then applied to discover communities surrounding the seed nodes. The individual communities are combined to get the community structure of the whole network. The proposed algorithm, named the Local Gravitational Community Detection Algorithm (LGCDA), can also work with overlapping communities. LGCDA is evaluated based on quality metrics and ground-truth data by comparing it with some of the widely used community detection algorithms using synthetic and real-world networks.


2014 ◽  
Vol 19 (4) ◽  
pp. 37-55 ◽  
Author(s):  
Sayan Mandal ◽  
Samit Biswas ◽  
Amit Kumar Das ◽  
Bhabatosh Chanda

Research on document image analysis has been actively pursued over the last few decades, and services like OCR, vectorization of drawings/graphics and various types of form processing are very common. Handwritten documents, old historical documents and documents captured through a camera are now subjects of active research. However, research on processing another very important type of paper document, the map document image, suffers from the inherent complexities of map documents and from the nonavailability of public benchmark data-sets. This paper presents a new data-set, the Land Map Image Database (LMIDb), which consists of a variety of land map images (446 images at present and growing; scanned at 200/300 dpi in TIF format) and the corresponding ground-truth. Using semiautomatic tools, the non-text parts of the images are deleted, and the text-only ground-truth is also kept in the database. This paper also presents a classification strategy for map images, with which the maps in the database are automatically classified into Political (Po), Physical (Ph), Resource (R) and Topographic (T) maps. The automatic classification of maps helps in indexing the images in LMIDb for archival and easy retrieval of the right maps to get the appropriate geographical information. Classification accuracy is also tested on the proposed data-set, and the result is encouraging.

