Semi-supervised learning with summary statistics

2019 ◽  
Vol 17 (05) ◽  
pp. 837-851
Author(s):  
Huihui Qin ◽  
Xin Guo

Nowadays, the extensive collection and analysis of data is stimulating widespread privacy concerns, and is therefore increasing tensions between the potential sources of data and researchers. A privacy-friendly learning framework can help to ease these tensions and free up more data for research. We propose a new algorithm, LESS (Learning with Empirical feature-based Summary statistics from Semi-supervised data), which uses only summary statistics instead of raw data for regression learning. The selection of empirical features serves as a trade-off between prediction precision and the protection of privacy. We show that LESS achieves the minimax optimal rate of convergence in terms of the size of the labeled sample. LESS extends naturally to applications where data are held separately by different sources. Compared with the existing literature on distributed learning, LESS removes the restriction of a minimum sample size on single data sources.
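The abstract does not spell out the estimator, but the general mechanism — fitting a regularized regression from summary statistics rather than from raw data — can be sketched. In the toy below, the feature map (random Fourier features), the ridge penalty, and all names are illustrative assumptions, not the paper's exact construction; the point is that each source publishes only its feature Gram matrix and feature-label cross-product, and that sources of any size, however small, can contribute.

```python
import numpy as np

def summarize(phi_x, y):
    """Each data holder publishes only summary statistics of its
    local sample: the feature Gram matrix and the feature-label
    cross-product. Raw (phi_x, y) never leave the source."""
    return phi_x.T @ phi_x, phi_x.T @ y

def fit_from_summaries(summaries, lam):
    """Aggregate summaries from all sources and solve a ridge
    regression: (sum_k G_k + lam*I) w = sum_k b_k."""
    G = sum(g for g, _ in summaries)
    b = sum(b for _, b in summaries)
    return np.linalg.solve(G + lam * np.eye(G.shape[0]), b)

# Illustrative run: two sources, one of them tiny (no minimum
# sample size per source), random Fourier features standing in
# for the paper's empirical features.
rng = np.random.default_rng(0)
W = rng.normal(size=(1, 30))            # 30 illustrative features
phi = lambda x: np.cos(x @ W)           # hypothetical feature map
data = [rng.normal(size=(n, 1)) for n in (50, 8)]
summaries = [summarize(phi(x), np.sin(3 * x).ravel()) for x in data]
w = fit_from_summaries(summaries, lam=1e-2)
```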

2021 ◽  
Vol 25 (4) ◽  
pp. 1013-1029
Author(s):  
Zeeshan Zeeshan ◽  
Qurat ul Ain ◽  
Uzair Aslam Bhatti ◽  
Waqar Hussain Memon ◽  
Sajid Ali ◽  
...  

With the growth of online business, recommendation algorithms are being researched extensively to make better use of existing information. Multi-criteria recommendation systems (MCRS) help end-users attain results of interest under several selection criteria, such as combinations of implicit and explicit interest indicators in the form of rankings on different matched dimensions. Current approaches typically rely on label correlation, assuming that the label correlations are shared by all objects. In real-world tasks, however, different sources of information have different features. Recommendation systems are more effective when they combine multiple decision criteria, either exploiting the correlation between features and item content (the content-based approach) or finding similarly rating users to obtain targeted results (collaborative filtering). To combine these two filtering strategies in a multi-criteria model, we propose fb-knn, a feature-based multi-criteria hybrid recommendation algorithm that recommends items by using their multi-criteria features and integrating them with correlated items found in similar datasets. Ranks are assigned to each decision, and weights are then computed for each decision from the standard deviation of items to obtain the nearest result. For evaluation, we tested the proposed algorithm on several datasets with multiple information features. The results demonstrate that fb-knn is efficient across different types of datasets.
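As a hedged illustration of the described pipeline — per-criterion weights derived from rating standard deviations, followed by a nearest-neighbour step over the weighted ratings — consider the sketch below. The tensor layout, the cosine similarity, and all names are assumptions for illustration, not the authors' exact fb-knn.

```python
import numpy as np

def fbknn_scores(R, target_user, k=3):
    """Illustrative feature-based multi-criteria kNN.
    R: (users x items x criteria) rating tensor.
    Per-criterion weights come from the standard deviation of
    ratings, as the abstract describes; details are assumptions."""
    crit_std = np.nanstd(R, axis=(0, 1))
    w = crit_std / crit_std.sum()              # criterion weights
    overall = np.nansum(R * w, axis=2)         # weighted overall rating
    u = overall[target_user]
    # cosine similarity between the target user and all others
    sims = overall @ u / (np.linalg.norm(overall, axis=1)
                          * np.linalg.norm(u) + 1e-12)
    sims[target_user] = -np.inf                # exclude self
    nbrs = np.argsort(sims)[-k:]               # k nearest users
    return overall[nbrs].mean(axis=0)          # predicted item scores

R = np.random.default_rng(1).uniform(1, 5, size=(20, 10, 4))
print(fbknn_scores(R, target_user=0))
```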


2019 ◽  
Vol 490 (3) ◽  
pp. 4107-4120
Author(s):  
J Bentley ◽  
C G Tinney ◽  
S Sharma ◽  
D Wright

We present criteria for the selection of M-dwarfs down to G < 14.5 using all-sky survey data, with a view to identifying potential M-dwarfs to be confirmed spectroscopically by the FunnelWeb survey. Two sets of criteria were developed. The first, based on absolute magnitude in the Gaia G passband with MG > 7.7, selects 76,392 stars, of which 81.0 per cent are expected to be M-dwarfs, at a completeness of >97 per cent. The second is based on colour and uses Gaia, WISE, and 2MASS all-sky photometry. These criteria identify 94,479 candidate M-dwarfs, of which between 29.4 per cent and 47.3 per cent are expected to be true M-dwarfs, and which contain 99.6 per cent of expected M-dwarfs. Both sets of criteria were developed using synthetic galaxy model predictions and a previously spectroscopically classified set of M- and K-dwarfs, to evaluate both M-dwarf completeness and false-positive detections (i.e. the non-M-dwarf contamination rate). Used in combination, the two sets of criteria demonstrate how each excludes different sources of contamination. We therefore developed a final set of criteria that combines absolute-magnitude and colour selection to identify 74,091 stars. All these sets of criteria select numbers of objects feasible for confirmation via massively multiplexed spectroscopic surveys like FunnelWeb.
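The colour cuts themselves are not listed in the abstract, so the sketch below applies only the published absolute-magnitude criterion, using the standard distance-modulus conversion from apparent G magnitude and Gaia parallax; the function names are ours.

```python
import numpy as np

def abs_g(g_mag, parallax_mas):
    """Absolute Gaia G magnitude via the distance modulus,
    with distance d[pc] = 1000 / parallax[mas]:
    M_G = G + 5*log10(parallax_mas) - 10."""
    return g_mag + 5 * np.log10(parallax_mas) - 10

def mdwarf_candidate(g_mag, parallax_mas):
    """Apply the survey limit (G < 14.5) and the paper's
    absolute-magnitude criterion (M_G > 7.7)."""
    return (g_mag < 14.5) & (abs_g(g_mag, parallax_mas) > 7.7)

# e.g. a star with G = 12.0 at 50 mas (20 pc) has M_G ~ 10.5
print(mdwarf_candidate(np.array([12.0]), np.array([50.0])))
```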


2019 ◽  
Vol 16 (05) ◽  
pp. 1950029
Author(s):  
Mohammed Abdul Rahman AlShehri ◽  
Shailendra Mishra

Software-defined network (SDN) controller selection is a key challenge for the network administrator. In SDN, the control plane is an isolated process operating on the control layer. The controller provides a universal view of the entire network and supports applications and services. The three parameters considered for controller selection are productivity, campus-network suitability, and open-source availability. In SDN it is vital to have a good controller for efficient processing of all requests made by the switches and for good network behaviour. To select the best controller against the specified parameters, decision logic has to be developed that allows comparison of the available controllers. Therefore, in this research we suggest a methodology that uses the analytic hierarchy process (AHP) to find the best controller. The approach has been studied and verified on the large organizational network of Al-Majmaah University, Saudi Arabia, and is found to be effective, increasing network performance significantly.
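A minimal sketch of the AHP step, assuming the standard principal-eigenvector weighting and Saaty's consistency check; the pairwise judgements below are hypothetical placeholders, not those elicited for the Al-Majmaah network.

```python
import numpy as np

def ahp_weights(pairwise):
    """Derive criterion weights from an AHP pairwise-comparison
    matrix via its principal eigenvector, and report the
    consistency ratio (CR) against Saaty's random index."""
    A = np.asarray(pairwise, dtype=float)
    vals, vecs = np.linalg.eig(A)
    i = np.argmax(vals.real)
    w = np.abs(vecs[:, i].real)
    w /= w.sum()
    n = A.shape[0]
    ci = (vals[i].real - n) / (n - 1)              # consistency index
    ri = {3: 0.58, 4: 0.90, 5: 1.12}.get(n, 1.0)   # Saaty's RI
    return w, ci / ri

# Hypothetical judgements over the three parameters:
# productivity vs. campus-network suitability vs. open source.
A = [[1,   3,   5],
     [1/3, 1,   2],
     [1/5, 1/2, 1]]
w, cr = ahp_weights(A)
print(w, cr)   # weights sum to 1; CR < 0.1 counts as consistent
```

Each candidate controller would then be scored per criterion and ranked by the weighted sum of its scores.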


2013 ◽  
Vol 2013 ◽  
pp. 1-11 ◽  
Author(s):  
Jia-Rou Liu ◽  
Po-Hsiu Kuo ◽  
Hung Hung

Large-p-small-n datasets are commonly encountered in modern biomedical studies. To detect differences between two groups, conventional methods fail to apply, due to the instability of variance estimates in the t-test and the high proportion of tied values in AUC (area under the receiver operating characteristic curve) estimates. The significance analysis of microarrays (SAM) may also be unsatisfactory, since its performance is sensitive to the tuning parameter, whose selection is not straightforward. In this work, we propose a robust rerank approach to overcome these difficulties. In particular, we obtain a rank-based statistic for each feature based on the concept of “rank-over-variable.” Techniques of “random subset” and “rerank” are then iteratively applied to rank features, and the leading features are selected for further study. The proposed rerank approach is especially applicable to large-p-small-n datasets. Moreover, it is insensitive to the selection of tuning parameters, which is an appealing property for practical implementation. Simulation studies and real-data analysis of pooling-based genome-wide association (GWA) studies demonstrate the usefulness of our method.
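A hedged reading of the procedure: a variance-free rank statistic per feature, stabilized by averaging each feature's rank over random subsamples. The exact statistic and subset scheme below are illustrative, not the paper's.

```python
import numpy as np

def rank_stat(x, labels):
    """Rank-based two-group statistic for one feature: difference
    in mean rank between groups (a Wilcoxon-type score that needs
    no variance estimate; ties broken arbitrarily for simplicity)."""
    r = np.argsort(np.argsort(x)) + 1.0
    return abs(r[labels == 1].mean() - r[labels == 0].mean())

def rerank(X, labels, n_iter=100, frac=0.7, seed=0):
    """Score each feature on many random subsamples and average
    its rank across iterations, so leading features are those
    that rank highly in a stable way, not by a one-off fluke."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    avg_rank = np.zeros(p)
    for _ in range(n_iter):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        stats = np.array([rank_stat(X[idx, j], labels[idx])
                          for j in range(p)])
        avg_rank += np.argsort(np.argsort(-stats)) + 1.0
    return np.argsort(avg_rank)   # features, most significant first

# large-p-small-n toy data: 200 features, 20 samples, 5 with signal
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 10)
X = rng.normal(size=(20, 200))
X[y == 1, :5] += 2.0
print(rerank(X, y)[:10])
```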


Author(s):  
M. A. Abbas ◽  
H. Setan ◽  
Z. Majid ◽  
A. K. Chong ◽  
L. Chong Luh ◽  
...  

Like other electronic instruments, a terrestrial laser scanner (TLS) is subject to various systematic errors arising from different sources. Self-calibration, a technique adopted from photogrammetry, is a method available to investigate these errors for TLS. According to photogrammetry principles, the selection of datum constraints can cause different types of parameter correlations. However, the network configurations applied in TLS and photogrammetry calibrations are quite different; this study therefore investigated the significance of the photogrammetric datum-constraint principle in TLS self-calibration. To ensure a thorough assessment, the datum-constraint analyses were carried out using three variant network configurations: 1) minimum number of scan stations; 2) minimum number of surfaces for target distribution; and 3) minimum number of point targets. Based on graphical and statistical analyses, the datum-constraint selection indicated that the parameter correlations obtained are substantially similar. In addition, the analysis demonstrated that network configuration is a crucial factor in reducing the correlation between the calculated parameters.
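The parameter correlations at issue are read off the least-squares adjustment in the usual way; a minimal sketch, with a toy design matrix standing in for a real calibration network:

```python
import numpy as np

def param_correlations(J, weights=None):
    """Correlation matrix of least-squares parameter estimates.
    J is the design (Jacobian) matrix of the self-calibration
    adjustment; Cov = (J^T P J)^-1, and correlations follow by
    normalizing with the parameter standard deviations."""
    P = np.eye(J.shape[0]) if weights is None else np.diag(weights)
    cov = np.linalg.inv(J.T @ P @ J)
    s = np.sqrt(np.diag(cov))
    return cov / np.outer(s, s)

# toy design matrix standing in for a TLS calibration network
J = np.random.default_rng(2).normal(size=(100, 6))
C = param_correlations(J)
print(np.round(C, 2))   # off-diagonal values near +/-1 flag trouble
```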


2021 ◽  
pp. 1-48
Author(s):  
Zuchao Li ◽  
Hai Zhao ◽  
Shexia He ◽  
Jiaxun Cai

Semantic role labeling (SRL) is dedicated to recognizing the semantic predicate-argument structure of a sentence. Previous studies with traditional models have shown that syntactic information can make remarkable contributions to SRL performance; however, the necessity of syntactic information has been challenged by a few recent neural SRL studies that demonstrate impressive performance without syntactic backbones, suggesting that syntax becomes much less important for neural semantic role labeling, especially when paired with deep neural networks and large-scale pre-trained language models. Despite this notion, the neural SRL field still lacks a systematic and full investigation of the relevance of syntactic information to SRL, covering both dependency and span SRL in both monolingual and multilingual settings. This paper intends to quantify the importance of syntactic information for neural SRL in the deep learning framework. We introduce three typical SRL frameworks (baselines) — sequence-based, tree-based, and graph-based — which are accompanied by two categories of exploiting syntactic information: syntax pruning-based and syntax feature-based. Experiments are conducted on the CoNLL-2005, 2009, and 2012 benchmarks for all available languages, and results show that neural SRL models can still benefit from syntactic information under certain conditions. Furthermore, we show the quantitative significance of syntax to neural SRL models, together with a thorough empirical survey using existing models.
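As one concrete instance of the syntax pruning-based category, here is a hedged sketch of argument-candidate pruning over a dependency tree (keep only words within k hops of the predicate); this is a simplification for illustration, not the paper's exact pruning rule.

```python
from collections import deque

def prune_candidates(heads, predicate, k=2):
    """Keep only words within k hops of the predicate in the
    (undirected) dependency tree. heads[i] is the head index of
    word i (-1 for root); returns surviving candidate indices."""
    n = len(heads)
    adj = [[] for _ in range(n)]
    for i, h in enumerate(heads):
        if h >= 0:
            adj[i].append(h)
            adj[h].append(i)
    dist = {predicate: 0}
    q = deque([predicate])
    while q:                          # breadth-first search
        u = q.popleft()
        if dist[u] == k:
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return sorted(dist)

# "The cat chased the mouse": toy parse heads, predicate = "chased"
heads = [1, 2, -1, 4, 2]
print(prune_candidates(heads, predicate=2, k=1))  # -> [1, 2, 4]
```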


Author(s):  
Andrey Sergeevich Kopyrin ◽  
Irina Leonidovna Makarova

The subject of this research is the process of collecting and preliminarily preparing data from heterogeneous sources. Economic information is heterogeneous and semi-structured or unstructured in nature. Due to the heterogeneity of the primary documents, as well as the human factor, the initial statistical data may contain a large amount of noise, as well as records whose automatic processing may be very difficult. This makes preprocessing of dynamic input data an important precondition for discovering meaningful patterns and domain knowledge, and makes the research topic relevant.

Data preprocessing is a series of unique tasks that has led to the emergence of various algorithms and heuristic methods for solving such tasks as merging, cleanup, and identification of variables. In this work, a preprocessing algorithm is formulated that brings together into a single database, and structures, time-series information from different sources. The key modification of the preprocessing method proposed by the authors is a technology of automated data integration.

The proposed technology involves the combined use of methods for constructing fuzzy time series and machine lexical matching on a thesaurus network, as well as a universal database built using the MIVAR concept. The preprocessing algorithm forms a single data model with the ability to transform the periodicity and semantics of the data set, and integrates data from various sources into a single information bank.
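A minimal sketch of the integration step under stated assumptions: pandas stands in for the universal database, a simple column rename stands in for the thesaurus-based lexical matching, and resampling handles the periodicity transformation.

```python
import pandas as pd

def integrate(sources, freq="MS"):
    """Align time series from heterogeneous sources onto one
    periodicity and merge them into a single table. The rename
    map is a stand-in for lexical matching of variable names."""
    aligned = []
    for df, name_map in sources:
        df = df.rename(columns=name_map)           # unify semantics
        df.index = pd.to_datetime(df.index)
        aligned.append(df.resample(freq).mean())   # unify periodicity
    return pd.concat(aligned, axis=1)              # single data bank

# two hypothetical sources with different names and frequencies
a = pd.DataFrame({"umsatz": range(90)},
                 index=pd.date_range("2020-01-01", periods=90, freq="D"))
b = pd.DataFrame({"rev": range(12)},
                 index=pd.date_range("2020-01-01", periods=12, freq="MS"))
bank = integrate([(a, {"umsatz": "revenue"}), (b, {"rev": "revenue_alt"})])
print(bank.head())
```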


Author(s):  
Salvador Miranda Lima ◽  
José Moreira

The emergence of the World Wide Web made massive amounts of data available. This data, created and disseminated from many different sources, is prepared and linked in a way that is well suited for display purposes, but automation, integration, interoperability or context-oriented search can hardly be implemented. Hence, the Semantic Web aims at promoting global information integration and semantic interoperability through the use of metadata, ontologies and inference mechanisms. This chapter presents the Semantic Model for Tourism (SeMoT), designed for building Semantic Web-enabled applications for the planning and management of touristic itineraries, taking into account the new requirements of more demanding and culturally evolved tourists. It includes an introduction to relevant tourism concepts, an overview of current trends in Semantic Web research, and a presentation of the architecture, main features and a selection of representative ontologies that compose SeMoT.
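To make "Semantic Web-enabled" concrete, a tiny hedged example using rdflib; the namespace and vocabulary below are hypothetical stand-ins, since the abstract does not list SeMoT's actual terms.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

# Hypothetical SeMoT-style vocabulary, for illustration only.
SEMOT = Namespace("http://example.org/semot#")

g = Graph()
g.bind("semot", SEMOT)
trip = SEMOT["itinerary42"]
g.add((trip, RDF.type, SEMOT.Itinerary))
g.add((trip, SEMOT.includesAttraction, SEMOT["PortoCathedral"]))
g.add((trip, SEMOT.durationDays, Literal(3)))

# Machine-readable triples: ready for linking, SPARQL querying
# and inference, rather than for mere display.
print(g.serialize(format="turtle"))
```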


2004 ◽  
Vol 49 (1) ◽  
pp. 53-59 ◽  
Author(s):  
I. Moreno-Andrade ◽  
G. Buitrón

Five different sources of inocula were studied to determine their influence on biodegradability tests. The inocula were characterized by determining granulometry, specific methanogenic activity, solids content, and volumetric sludge index. The fermentative, aceticlastic, hydrogenophilic, OPHA, and sulfate-reducing groups were also quantified by the most-probable-number technique. Anaerobic biodegradability tests were conducted with two different substrates, one easy to degrade (glucose) and a toxic one (phenol). For both substrates, the best performance in terms of percentage of biodegradation and lag time was obtained with the inoculum from a brewery-industry UASB reactor. The results can be explained in terms of the initial activity of the inoculum. The influence of the significant variations found in the specific methanogenic activity of the five inocula studied is discussed in terms of the microbial composition of the samples. The results emphasize the importance of selecting an appropriate source of inoculum in order to obtain reliable results.

