AUTOMATIC DISCOVERY OF PERSONAL NAME ALIASES FROM THE WEB

Author(s):  
Y. SARATH KUMAR ◽  
ESWAR KODALI ◽  
P. HARINI

An individual is typically referred to by numerous name aliases on the web. Accurate identification of the aliases of a given personal name is useful in various web-related tasks such as information retrieval, sentiment analysis, personal name disambiguation, and relation extraction. In this paper we propose a lexical-pattern-based approach to extract aliases of a given personal name from the web. We use a set of names and their aliases as training data to extract lexical patterns that describe the numerous ways in which information about the aliases of a name is presented on the web. Given a personal name, the proposed method first extracts a set of candidate aliases. Second, we rank the extracted candidates according to the likelihood of each candidate being a correct alias of the given name. We evaluate the proposed method on three data sets: an English personal names data set, an English place names data set, and a Japanese personal names data set. The proposed method outperforms numerous baselines and previously proposed name alias extraction methods, achieving a statistically significant mean reciprocal rank (MRR) of 0.67.
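As a rough illustration of the two-stage pipeline described above, the sketch below extracts candidate aliases by applying simple lexical patterns to web snippets and then ranks the candidates by how much pattern support they receive. The patterns, the snippet source, and the scoring rule are illustrative assumptions, not the authors' learned patterns or ranking features.

    import re
    from collections import Counter

    # Illustrative lexical patterns; the paper learns such patterns from
    # (name, alias) training pairs rather than hard-coding them.
    PATTERNS = [
        r"{name},?\s+(?:also\s+known\s+as|aka|alias)\s+([A-Z][\w.\- ]+)",
        r"{name}\s*\(\s*born\s+([A-Z][\w.\- ]+)\s*\)",
        r"([A-Z][\w.\- ]+),?\s+better\s+known\s+as\s+{name}",
    ]

    def extract_candidates(name, snippets):
        """Stage 1: collect candidate aliases matched by any lexical pattern."""
        counts = Counter()
        for snippet in snippets:
            for pattern in PATTERNS:
                regex = pattern.format(name=re.escape(name))
                for match in re.finditer(regex, snippet):
                    counts[match.group(1).strip()] += 1
        return counts

    def rank_candidates(counts):
        """Stage 2: rank candidates by a simple support score (an assumption);
        the paper combines richer co-occurrence and anchor-text features."""
        return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

    snippets = ["Hideki Matsui, also known as Godzilla, played for the Yankees."]
    print(rank_candidates(extract_candidates("Hideki Matsui", snippets)))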

2018 ◽  
Vol 2018 ◽  
pp. 1-8 ◽  
Author(s):  
Yuntao Zhao ◽  
Wenbo Zhang ◽  
Yongxin Feng ◽  
Bo Yu

The application-layer distributed denial of service (AL-DDoS) attack poses a great threat to cyberspace security. Attack detection is an important part of security protection: it provides effective support for the defense system through the rapid and accurate identification of attacks. According to the URLs of the Web service that the attacker requests, AL-DDoS attacks are divided into three categories: random-URL, fixed-URL, and traversal attacks. In order to identify these attacks, a mapping matrix of the joint entropy vector is constructed. By defining and computing the values of EUPI and jEIPU, a visual coordinate discrimination diagram of the entropy vector is proposed, which also reduces the data dimension from N to two. Based on boundary discrimination and the region in which the entropy vectors fall, the class of AL-DDoS attack can be distinguished. Experiments on a training data set and its classification show that the novel algorithm can effectively distinguish Web server DDoS attacks from normal burst traffic.
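The following sketch shows the general idea of mapping a window of requests to a two-dimensional entropy vector and discriminating attack classes by the region the vector falls in. The concrete definitions of EUPI and jEIPU and the decision boundaries below are assumptions made for illustration, not the paper's exact formulas.

    import math
    from collections import Counter

    def entropy(counts):
        """Shannon entropy of a frequency distribution."""
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def entropy_features(requests):
        """Map a window of (src_ip, url) requests to a 2-D entropy vector.
        Here EUPI and jEIPU are approximated as URL entropy and joint
        (ip, url) entropy; the paper's definitions may differ."""
        url_entropy = entropy(Counter(url for _, url in requests))    # ~EUPI
        joint_entropy = entropy(Counter(requests))                    # ~jEIPU
        return url_entropy, joint_entropy

    def classify(url_h, joint_h, low=1.0, high=3.0):
        """Illustrative boundary discrimination in the 2-D entropy plane."""
        if url_h < low:
            return "fixed-URL attack"    # many requests, few distinct URLs
        if url_h > high and joint_h > high:
            return "random-URL attack"   # URLs look uniformly random
        if url_h > high:
            return "traversal attack"
        return "normal traffic"

    window = [("10.0.0.%d" % (i % 5), "/index.html") for i in range(100)]
    print(classify(*entropy_features(window)))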


2021 ◽  
Author(s):  
Louise Bloch ◽  
Christoph M. Friedrich

Abstract Background: Predicting whether subjects with Mild Cognitive Impairment (MCI) will prospectively develop Alzheimer's Disease (AD) is important for the recruitment and monitoring of subjects in therapy studies. Machine Learning (ML) is suitable for improving early AD prediction. The etiology of AD is heterogeneous, which leads to noisy data sets. Additional noise is introduced by multicentric study designs and varying acquisition protocols. This article examines whether an automatic and fair data valuation method based on Shapley values can identify subjects with noisy data. Methods: An ML workflow was developed and trained for a subset of the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort. The validation was executed for an independent ADNI test data set and for the Australian Imaging, Biomarker and Lifestyle Flagship Study of Ageing (AIBL) cohort. The workflow included volumetric Magnetic Resonance Imaging (MRI) feature extraction, subject sample selection using data Shapley values, Random Forest (RF) and eXtreme Gradient Boosting (XGBoost) for model training, and Kernel SHapley Additive exPlanations (SHAP) values for model interpretation. This model interpretation enables clinically relevant explanation of individual predictions. Results: The XGBoost models that excluded 116 of the 467 subjects from the training data set based on their Logistic Regression (LR) data Shapley values outperformed the models trained on the entire training data set, which reached a mean classification accuracy of 58.54 %, by 14.13 % (8.27 percentage points) on the independent ADNI test data set. The XGBoost models trained on the entire training data set reached a mean accuracy of 60.35 % for the AIBL data set. An improvement of 24.86 % (15.00 percentage points) could be reached for the XGBoost models if the 72 subjects with the smallest RF data Shapley values were excluded from the training data set. Conclusion: The data Shapley method was able to improve the classification accuracies for the test data sets. Noisy data was associated with the number of ApoEϵ4 alleles and volumetric MRI measurements. Kernel SHAP showed that the black-box models learned biologically plausible associations.
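The core idea of excluding low-valued training subjects can be sketched as follows: approximate per-subject data Shapley values with a Monte-Carlo procedure using a logistic-regression value function, drop the lowest-valued subjects, and retrain a boosted model on the remainder. The synthetic features, the quantile cut-off, and the use of scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost are illustrative assumptions, not the authors' exact workflow.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split

    def mc_data_shapley(X_tr, y_tr, X_val, y_val, n_perm=5, seed=0):
        """Monte-Carlo approximation of data Shapley values with a
        logistic-regression value function (a simplified sketch)."""
        rng = np.random.default_rng(seed)
        values = np.zeros(len(X_tr))
        for _ in range(n_perm):
            perm = rng.permutation(len(X_tr))
            prev_score = 0.5  # value of the empty set ~ chance level
            for end in range(1, len(perm) + 1):
                subset = perm[:end]
                if len(np.unique(y_tr[subset])) < 2:
                    continue  # cannot fit a classifier on a single class
                model = LogisticRegression(max_iter=1000).fit(X_tr[subset], y_tr[subset])
                score = model.score(X_val, y_val)
                values[perm[end - 1]] += score - prev_score
                prev_score = score
        return values / n_perm

    # Synthetic stand-in for volumetric MRI features (illustrative only).
    X, y = make_classification(n_samples=200, n_features=10, random_state=0)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

    shapley = mc_data_shapley(X_tr, y_tr, X_val, y_val)
    keep = shapley >= np.quantile(shapley, 0.25)  # drop the lowest-valued quarter

    model = GradientBoostingClassifier().fit(X_tr[keep], y_tr[keep])
    print("validation accuracy:", model.score(X_val, y_val))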


Author(s):  
N. Demir ◽  
S. Oy ◽  
F. Erdem ◽  
D. Z. Şeker ◽  
B. Bayram

Shorelines are complex ecosystems and highly important socio-economic environments. They may change rapidly due to both natural and human-induced effects. Determining movements along the shoreline and monitoring the changes are essential for coastline management, modeling of sediment transportation, and decision support systems. Remote sensing provides an opportunity to obtain rapid, up-to-date, and reliable information for shoreline monitoring. In this study, approximately 120 km of the Antalya-Kemer shoreline, which is under threat from erosion, deposition, population growth, urbanization, and touristic hotel development, has been selected as the study area. RASAT pansharpened and SENTINEL-1A SAR images have been used to implement the proposed shoreline extraction methods. The main motivation of this study is to combine the land/water body segmentation results of both RASAT MS and SENTINEL-1A SAR images to improve the quality of the results. The initial land/water body segmentation has been obtained from the RASAT image by means of the Random Forest classification method. This result has been used as a training data set to define fuzzy parameters for shoreline extraction from the SENTINEL-1A SAR image. The obtained results have been compared with a manually digitized shoreline. The accuracy assessment has been performed by calculating perpendicular distances between the reference data and the shoreline extracted by the proposed method. As a result, the mean difference has been calculated as around 1 pixel.
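A minimal sketch of the two-step idea follows: a Random Forest land/water segmentation of the optical image, whose output mask is then used to set a decision rule on the SAR backscatter before fusing the two masks into a shoreline. The synthetic rasters, band values, and the simple threshold standing in for the paper's fuzzy-parameter estimation are all assumptions.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic stand-ins for co-registered optical bands and SAR backscatter;
    # real inputs would be RASAT pansharpened bands and SENTINEL-1A sigma0.
    rng = np.random.default_rng(0)
    h, w = 128, 128
    water = np.zeros((h, w), bool)
    water[:, :60] = True                                   # "sea" on the left
    optical = rng.normal(water[..., None] * -1.0, 0.3, size=(h, w, 3))
    sar = rng.normal(np.where(water, -18.0, -8.0), 1.0)    # dB backscatter

    # Step 1: Random Forest land/water segmentation of the optical image,
    # trained here on a few labelled pixels (illustrative sampling).
    idx = rng.choice(h * w, 2000, replace=False)
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(optical.reshape(-1, 3)[idx], water.ravel()[idx])
    optical_mask = rf.predict(optical.reshape(-1, 3)).reshape(h, w)

    # Step 2: use the optical mask as training data to set a SAR decision
    # boundary (a crude stand-in for the fuzzy-parameter estimation).
    threshold = 0.5 * (sar[optical_mask].mean() + sar[~optical_mask].mean())
    sar_mask = sar < threshold

    # The shoreline is the land/water boundary of the fused mask.
    fused = optical_mask & sar_mask
    shoreline = fused ^ np.roll(fused, 1, axis=1)          # column-wise transitions
    print("shoreline pixels:", int(shoreline.sum()))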


2021 ◽  
Vol 87 (6) ◽  
pp. 445-455
Author(s):  
Yi Ma ◽  
Zezhong Zheng ◽  
Yutang Ma ◽  
Mingcang Zhu ◽  
Ran Huang ◽  
...  

Many manifold learning algorithms conduct an eigenvector analysis on a data-similarity matrix with a size of N×N, where N is the number of data points. Thus, the memory complexity of the analysis is no less than O(N²). We present in this article an incremental manifold learning approach to handle large hyperspectral data sets for land use identification. In our method, the number of dimensions for the high-dimensional hyperspectral-image data set is obtained with the training data set. A local curvature variation algorithm is utilized to sample a subset of data points as landmarks. Then a manifold skeleton is identified based on the landmarks. Our method is validated on three AVIRIS hyperspectral data sets, outperforming the comparison algorithms with a k-nearest-neighbor classifier and achieving the second-best performance with a support vector machine.
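The landmark idea can be sketched as follows: select a small set of landmarks, embed only the landmarks (the "manifold skeleton"), and place the remaining points relative to them, so the full N×N similarity matrix is never formed. Farthest-point sampling and the nearest-neighbor out-of-sample step below are illustrative stand-ins for the paper's local-curvature-variation sampling and skeleton construction.

    import numpy as np
    from sklearn.datasets import make_swiss_roll
    from sklearn.manifold import Isomap
    from sklearn.neighbors import KNeighborsRegressor

    X, _ = make_swiss_roll(n_samples=5000, random_state=0)

    def farthest_point_sampling(X, n_landmarks, seed=0):
        """Landmark selection by farthest-point sampling (an assumption,
        not the paper's local-curvature-variation criterion)."""
        rng = np.random.default_rng(seed)
        idx = [rng.integers(len(X))]
        d = np.linalg.norm(X - X[idx[0]], axis=1)
        for _ in range(n_landmarks - 1):
            idx.append(int(d.argmax()))
            d = np.minimum(d, np.linalg.norm(X - X[idx[-1]], axis=1))
        return np.array(idx)

    landmarks = farthest_point_sampling(X, n_landmarks=500)

    # Embed only the landmarks, so the eigen-analysis touches a 500x500
    # matrix rather than an N x N one.
    iso = Isomap(n_neighbors=12, n_components=2)
    skeleton = iso.fit_transform(X[landmarks])

    # Place the remaining points by regressing embedding coordinates on the
    # original features of nearby landmarks (an out-of-sample approximation).
    mapper = KNeighborsRegressor(n_neighbors=5).fit(X[landmarks], skeleton)
    embedding = mapper.predict(X)
    print(embedding.shape)   # (5000, 2)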


2021 ◽  
pp. 1-17
Author(s):  
Luis Sa-Couto ◽  
Andreas Wichert

Abstract Convolutional neural networks (CNNs) evolved from Fukushima's neocognitron model, which is based on the ideas of Hubel and Wiesel about the early stages of the visual cortex. Unlike other branches of neocognitron-based models, the typical CNN is based on end-to-end supervised learning by backpropagation and removes the focus from built-in invariance mechanisms, using pooling not as a way to tolerate small shifts but as a regularization tool that decreases model complexity. These properties of end-to-end supervision and flexibility of structure allow the typical CNN to become highly tuned to the training data, leading to extremely high accuracies on typical visual pattern recognition data sets. However, in this work, we hypothesize that there is a flip side to this capability, a hidden overfitting. More concretely, a supervised, backpropagation-based CNN will outperform a neocognitron/map transformation cascade (MTC) when trained and tested inside the same data set. Yet if we take both trained models and test them on the same task but on another data set (without retraining), the overfitting appears. Other neocognitron descendants like the What-Where model go in a different direction. In these models, learning remains unsupervised, but more structure is added to capture invariance to typical changes. Knowing that, we further hypothesize that if we repeat the same experiments with this model, the lack of supervision may make it worse than the typical CNN inside the same data set, but the added structure will make it generalize even better to another one. To put our hypotheses to the test, we choose the simple task of handwritten digit classification and take two well-known data sets for it: MNIST and ETL-1. To make the two data sets as similar as possible, we experiment with several types of preprocessing. However, regardless of the type in question, the results align exactly with expectation.
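The evaluation protocol behind the hypothesis (train on one data set, then test both within that data set and on a second data set without retraining) can be sketched in a few lines. The small classifier and the perturbed copy of the digits data set below are placeholders for the CNN/MTC models and the MNIST/ETL-1 pair.

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_digits(return_X_y=True)
    X_a_tr, X_a_te, y_a_tr, y_a_te = train_test_split(X, y, test_size=0.3, random_state=0)

    # A perturbed copy of the test split stands in for the "second data set".
    rng = np.random.default_rng(0)
    X_b, y_b = X_a_te + rng.normal(0, 4.0, X_a_te.shape), y_a_te

    model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
    model.fit(X_a_tr, y_a_tr)

    print("within-data-set accuracy :", model.score(X_a_te, y_a_te))
    print("cross-data-set accuracy  :", model.score(X_b, y_b))   # no retraining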


Author(s):  
Wei Shen ◽  
Jianyong Wang ◽  
Ping Luo ◽  
Min Wang

Relation extraction from Web data has attracted a lot of attention recently. However, little work has been done on enterprise data, despite the urgent need for such work in real applications (e.g., e-discovery). One distinct characteristic of enterprise data (in comparison with Web data) is its low redundancy. Previous work on relation extraction from Web data largely relies on the data's high redundancy and thus cannot be applied to enterprise data effectively. This chapter reviews related work on relation extraction and introduces REACTOR, an unsupervised hybrid framework for semantic relation extraction over enterprise data. REACTOR combines a statistical method, classification, and clustering to automatically identify various types of relations among entities appearing in enterprise data. REACTOR was evaluated over a real-world enterprise data set from HP that contains over three million pages, and the experimental results show its effectiveness.
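One common unsupervised ingredient of such frameworks is clustering the textual contexts of co-occurring entity pairs so that each cluster corresponds to a relation type. The sketch below illustrates that general strategy only; the entity pairs are invented, and REACTOR's actual statistical and classification components are not reproduced here.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    # Contexts between co-occurring entity pairs; in an enterprise corpus
    # these would be harvested from documents, not hard-coded.
    pair_contexts = [
        ("Alice", "HP Labs", "is a researcher at"),
        ("Bob", "HP Labs", "works at"),
        ("Project X", "Team A", "is developed by"),
        ("Project Y", "Team B", "is maintained by"),
    ]

    # Unsupervised step: cluster the textual contexts and treat each
    # cluster as one candidate relation type.
    texts = [ctx for _, _, ctx in pair_contexts]
    features = TfidfVectorizer().fit_transform(texts)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

    for (e1, e2, ctx), label in zip(pair_contexts, labels):
        print(f"relation cluster {label}: ({e1}, {e2}) via '{ctx}'")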


2018 ◽  
Vol 2018 ◽  
pp. 1-12 ◽  
Author(s):  
Suleman Nasiru

The need to develop generalizations of existing statistical distributions to make them more flexible in modeling real data sets is vital in parametric statistical modeling and inference. Thus, this study develops a new class of distributions called the extended odd Fréchet family of distributions for modifying existing standard distributions. Two special models named the extended odd Fréchet Nadarajah-Haghighi and extended odd Fréchet Weibull distributions are proposed using the developed family. The densities and the hazard rate functions of the two special distributions exhibit different kinds of monotonic and nonmonotonic shapes. The maximum likelihood method is used to develop estimators for the parameters of the new class of distributions. The application of the special distributions is illustrated by means of a real data set. The results revealed that the special distributions developed from the new family can provide reasonable parametric fit to the given data set compared to other existing distributions.
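For orientation, odd Fréchet-type generators pass an odd ratio of a baseline CDF G(x) through the Fréchet CDF. A hedged sketch of such a construction with shape parameters α and β (the exact parameterization of the extended family in the paper may differ) is

    F(x) = \exp\left\{ -\left[ \frac{1 - G(x)^{\alpha}}{G(x)^{\alpha}} \right]^{\beta} \right\}, \qquad \alpha, \beta > 0,

    f(x) = \frac{\alpha \beta \, g(x) \left( 1 - G(x)^{\alpha} \right)^{\beta - 1}}{G(x)^{\alpha \beta + 1}}
           \exp\left\{ -\left[ \frac{1 - G(x)^{\alpha}}{G(x)^{\alpha}} \right]^{\beta} \right\},

where G(x) is the baseline CDF with density g(x); choosing G as the Nadarajah-Haghighi or Weibull CDF would give families analogous to the two special models mentioned above.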


2008 ◽  
Vol 15 (6) ◽  
pp. 1013-1022 ◽  
Author(s):  
J. Son ◽  
D. Hou ◽  
Z. Toth

Abstract. Various statistical methods are used to process operational Numerical Weather Prediction (NWP) products with the aim of reducing forecast errors, and they often require sufficiently large training data sets. Generating such a hindcast data set for this purpose can be costly, and a well-designed algorithm should be able to reduce the required size of these data sets. This issue is investigated for the relatively simple case of bias correction, by comparing a Bayesian algorithm of bias estimation with the conventionally used empirical method. As the available forecast data sets are not large enough for a comprehensive test, synthetically generated time series representing the analysis (truth) and the forecast are used to increase the sample size. Since these synthetic time series retain the statistical characteristics of the observations and of the operational NWP model output, the results of this study can be extended to real observations and forecasts; this is confirmed by a preliminary test with real data. By using the climatological mean and standard deviation of the meteorological variable under consideration and the statistical relationship between the forecast and the analysis, the Bayesian bias estimator outperforms the empirical approach in terms of the accuracy of the estimated bias, and it can reduce the required size of the training sample by a factor of 3. This advantage of the Bayesian approach is due to the fact that it is less susceptible to sampling error in consecutive sampling. These results suggest that a carefully designed statistical procedure may reduce the need for the costly generation of large hindcast data sets.
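A minimal sketch of the contrast between the two estimators, assuming a conjugate Gaussian model in which prior information (e.g., from climatology) shrinks the sample-mean bias, is shown below; it is not the paper's exact estimator, and the numbers are synthetic.

    import numpy as np

    def bayesian_bias_estimate(errors, prior_mean, prior_var, obs_var):
        """Posterior mean of a constant forecast bias under a conjugate
        Gaussian model (illustrative sketch):
        bias ~ N(prior_mean, prior_var), each error ~ N(bias, obs_var)."""
        n = len(errors)
        post_var = 1.0 / (1.0 / prior_var + n / obs_var)
        return post_var * (prior_mean / prior_var + np.sum(errors) / obs_var)

    rng = np.random.default_rng(1)
    true_bias = 1.5
    # Synthetic forecast-minus-analysis errors for a short training sample.
    errors = rng.normal(true_bias, 2.0, size=10)

    empirical = errors.mean()   # conventional running-mean bias estimate
    bayesian = bayesian_bias_estimate(errors, prior_mean=0.0, prior_var=1.0, obs_var=4.0)
    print(f"empirical: {empirical:.2f}, Bayesian: {bayesian:.2f}, truth: {true_bias}")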


2019 ◽  
Vol 7 (3) ◽  
pp. SE113-SE122 ◽  
Author(s):  
Yunzhi Shi ◽  
Xinming Wu ◽  
Sergey Fomel

Salt boundary interpretation is important for the understanding of salt tectonics and for velocity model building for seismic migration. Conventional methods consist of computing salt attributes and extracting salt boundaries. We have formulated the problem as 3D image segmentation and evaluated an efficient approach based on deep convolutional neural networks (CNNs) with an encoder-decoder architecture. To train the model, we design a data generator that extracts randomly positioned subvolumes from a large-scale 3D training data set, followed by data augmentation, and then feeds a large number of subvolumes into the network, using salt/nonsalt binary labels generated by thresholding the velocity model as ground truth. We test the model on validation data sets and compare the blind test predictions with the ground truth. Our results indicate that our method is capable of automatically capturing subtle salt features from the 3D seismic image with little or no need for manual input. We further test the model on a field example to demonstrate the generalization of this deep CNN method across different data sets.
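The data-generator idea (random subvolume extraction, simple augmentation, and labels from a thresholded velocity model) can be sketched as follows. The patch size, velocity threshold, flip augmentation, and synthetic volumes are illustrative assumptions rather than the authors' training configuration.

    import numpy as np

    def subvolume_generator(seismic, velocity, patch=(64, 64, 64),
                            salt_velocity=4450.0, batch_size=8, seed=0):
        """Yield batches of randomly positioned subvolumes with binary salt
        labels made by thresholding the velocity model."""
        rng = np.random.default_rng(seed)
        dims = seismic.shape
        while True:
            images, labels = [], []
            for _ in range(batch_size):
                corner = [rng.integers(0, d - p + 1) for d, p in zip(dims, patch)]
                sl = tuple(slice(c, c + p) for c, p in zip(corner, patch))
                img = seismic[sl]
                lab = (velocity[sl] > salt_velocity).astype(np.float32)
                if rng.random() < 0.5:           # simple augmentation: flip
                    img, lab = img[::-1], lab[::-1]
                images.append(img)
                labels.append(lab)
            yield np.stack(images), np.stack(labels)

    # Synthetic stand-ins for a 3D seismic image and its velocity model.
    seismic = np.random.rand(128, 128, 128).astype(np.float32)
    velocity = np.random.uniform(1500, 5000, size=(128, 128, 128))

    images, labels = next(subvolume_generator(seismic, velocity))
    print(images.shape, labels.shape)   # (8, 64, 64, 64) twice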


Algorithms ◽  
2020 ◽  
Vol 13 (5) ◽  
pp. 126 ◽  
Author(s):  
Feiyang Chen ◽  
Ying Jiang ◽  
Xiangrui Zeng ◽  
Jing Zhang ◽  
Xin Gao ◽  
...  

Salient segmentation is a critical step in biomedical image analysis, aiming to cut out the regions that are most interesting to humans. Recently, supervised methods have achieved promising results in biomedical areas, but they depend on annotated training data sets, which require labor and proficiency in the related background knowledge. In contrast, unsupervised learning makes data-driven decisions by obtaining insights directly from the data themselves. In this paper, we propose a completely unsupervised self-aware network based on pre-training and attentional backpropagation for biomedical salient segmentation, named PUB-SalNet. First, we aggregate a new biomedical data set from several simulated Cellular Electron Cryo-Tomography (CECT) data sets featuring rich salient objects, different SNR settings, and various resolutions, which is called SalSeg-CECT. Based on the SalSeg-CECT data set, we then pre-train a model specially designed for biomedical tasks as a backbone module to initialize the network parameters. Next, we present a U-SalNet network that learns to selectively attend to salient objects. It includes two types of attention modules to facilitate learning saliency through global contrast and local similarity. Lastly, we jointly refine the salient regions together with feature representations from U-SalNet, with the parameters updated by self-aware attentional backpropagation. We apply PUB-SalNet to the analysis of 2D simulated and real images and achieve state-of-the-art performance on simulated biomedical data sets. Furthermore, our proposed PUB-SalNet can be easily extended to 3D images. The experimental results on the 2D and 3D data sets also demonstrate the generalization ability and robustness of our method.
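The two saliency cues mentioned above (global contrast and local similarity) can be illustrated with a very simple, non-learned sketch: score each pixel by how far its feature vector is from the image-wide mean, then smooth the score over a local neighborhood. This is only a conceptual stand-in for the learned attention modules, and the feature map below is synthetic.

    import numpy as np

    def global_contrast_saliency(features):
        """Saliency from global contrast: distance of each pixel's feature
        vector from the image-wide mean feature."""
        mean = features.mean(axis=(0, 1), keepdims=True)
        return np.linalg.norm(features - mean, axis=-1)

    def local_similarity_smoothing(saliency, radius=1):
        """Average the saliency over a local window, a crude stand-in for
        attention driven by local similarity."""
        padded = np.pad(saliency, radius, mode="edge")
        h, w = saliency.shape
        windows = [padded[dy:dy + h, dx:dx + w]
                   for dy in range(2 * radius + 1) for dx in range(2 * radius + 1)]
        return np.mean(windows, axis=0)

    # Synthetic feature map standing in for backbone features of a 2D slice.
    features = np.random.rand(64, 64, 16).astype(np.float32)
    saliency = local_similarity_smoothing(global_contrast_saliency(features))
    mask = saliency > saliency.mean()        # crude salient-region proposal
    print(mask.shape, float(mask.mean()))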

