Parameters Selection of LLE Algorithm for Classification Tasks

2014 ◽  
Vol 1037 ◽  
pp. 422-427 ◽  
Author(s):  
Feng Hu ◽  
Chuan Tong Wang ◽  
Yu Chuan Wu ◽  
Liang Zhi Fan

The crux of the locally linear embedding (LLE) algorithm is the selection of the embedding dimensionality and the neighborhood size. A method of parameter selection based on the normalized cut (Ncut) criterion for classification tasks is proposed. Unlike current techniques based on the neighborhood-topology-preservation criterion, the proposed method capitalizes on the class separability of the embedding result. By taking class separability into consideration, the intrinsic capability of LLE can be more faithfully reflected, and hence more rational features for classification in real-life applications can be offered. The theoretical argument is supported by experimental results on synthetic and real data sets.
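
A minimal sketch, not the authors' code, of what Ncut-driven parameter selection for LLE could look like: scikit-learn's LocallyLinearEmbedding is assumed as the LLE implementation, an RBF affinity is built on the embedded points, and the binary class labels define the graph partition whose normalized cut is minimized over a grid of (neighborhood size, dimensionality).

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.metrics.pairwise import rbf_kernel

def ncut(W, y):
    """Normalized cut of the affinity graph W induced by the binary labels y."""
    a, b = (y == 0), (y == 1)
    cut = W[np.ix_(a, b)].sum()
    return cut / W[a].sum() + cut / W[b].sum()

def select_lle_params(X, y, neighbor_grid, dim_grid):
    """Pick (n_neighbors, n_components) minimizing Ncut of the embedded data."""
    best, best_score = None, np.inf
    for k in neighbor_grid:
        for d in dim_grid:
            Z = LocallyLinearEmbedding(n_neighbors=k, n_components=d).fit_transform(X)
            score = ncut(rbf_kernel(Z), y)   # low Ncut = well-separated classes
            if score < best_score:
                best, best_score = (k, d), score
    return best, best_score

# toy usage with two Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 10)), rng.normal(3, 1, (50, 10))])
y = np.repeat([0, 1], 50)
print(select_lle_params(X, y, neighbor_grid=[5, 10, 15], dim_grid=[2, 3]))
```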

Complexity ◽  
2018 ◽  
Vol 2018 ◽  
pp. 1-16 ◽  
Author(s):  
Yiwen Zhang ◽  
Yuanyuan Zhou ◽  
Xing Guo ◽  
Jintao Wu ◽  
Qiang He ◽  
...  

The K-means algorithm is one of the ten classic algorithms in the area of data mining and has been studied by researchers in numerous fields for a long time. However, the value of the cluster number k in the K-means algorithm is not always easy to determine, and the selection of the initial centers is vulnerable to outliers. This paper proposes an improved K-means clustering algorithm called the covering K-means algorithm (C-K-means). The C-K-means algorithm can not only acquire efficient and accurate clustering results but also self-adaptively provide a reasonable number of clusters based on the data features. It comprises two phases: the initialization of the covering algorithm (CA) and the Lloyd iteration of K-means. The first phase executes the CA, which self-organizes and recognizes the number of clusters k based on the similarities in the data; it requires neither the number of clusters to be prespecified nor the initial centers to be manually selected. Therefore, it has a "blind" feature, that is, k is not preselected. The second phase performs the Lloyd iteration based on the results of the first phase. The C-K-means algorithm combines the advantages of CA and K-means. Experiments are carried out on the Spark platform, and the results verify the good scalability of the C-K-means algorithm, which can effectively solve the problem of large-scale data clustering. Extensive experiments on real data sets show that the C-K-means algorithm outperforms existing algorithms in accuracy and efficiency under both sequential and parallel conditions.
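
A minimal sketch of the two-phase idea, under stated assumptions and not the paper's implementation: a greedy covering pass discovers k and rough centers from a covering radius, and standard Lloyd iterations (here scikit-learn's KMeans) are seeded with those centers. The covering radius and the greedy seeding rule are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def covering_init(X, radius):
    """Greedy covering: repeatedly pick an uncovered point and cover its neighborhood."""
    uncovered = np.ones(len(X), dtype=bool)
    centers = []
    while uncovered.any():
        seed = X[uncovered][0]                       # first uncovered point as seed
        in_ball = np.linalg.norm(X - seed, axis=1) <= radius
        members = in_ball & uncovered
        centers.append(X[members].mean(axis=0))      # center = mean of the covered ball
        uncovered &= ~in_ball
    return np.asarray(centers)

def c_k_means_sketch(X, radius):
    centers = covering_init(X, radius)               # phase 1: k inferred from the data
    return KMeans(n_clusters=len(centers), init=centers, n_init=1).fit(X)  # phase 2: Lloyd

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, (60, 2)) for c in ([0, 0], [3, 0], [0, 3])])
model = c_k_means_sketch(X, radius=1.0)
print("clusters found:", model.n_clusters)
```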


2021 ◽  
Vol 15 ◽  
Author(s):  
Tianyu Liu ◽  
Zhixiong Xu ◽  
Lei Cao ◽  
Guowei Tan

Hybrid-modality brain-computer interfaces (BCIs), which combine motor imagery (MI) bio-signals and steady-state visual evoked potentials (SSVEPs), have attracted wide attention in the field of neural engineering. For real-life applications, the number of channels should be as small as possible. However, most recent work on channel selection focuses only on either the performance of the classification task or the effectiveness of device control; few works conduct channel selection for the MI and SSVEP classification tasks simultaneously. In this paper, a multitasking-based multiobjective evolutionary algorithm (EMMOA) was proposed to select appropriate channels for these two classification tasks at the same time. Moreover, a two-stage framework was introduced to balance the number of selected channels and the classification accuracy in the proposed algorithm. The experimental results verified the feasibility of the multiobjective optimization methodology for channel selection in hybrid BCI tasks.
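
A toy sketch, not EMMOA itself, of the multiobjective channel-selection idea: random binary channel masks are scored on three competing objectives (MI error, SSVEP error, number of channels) and the non-dominated (Pareto) subsets are kept. The data, labels, and LDA classifier are placeholders for real EEG features; an evolutionary search would replace the random candidate pool.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_trials, n_channels = 120, 16
X = rng.normal(size=(n_trials, n_channels))          # placeholder channel features
y_mi = rng.integers(0, 2, n_trials)                  # placeholder MI labels
y_ssvep = rng.integers(0, 4, n_trials)               # placeholder SSVEP labels

def objectives(mask):
    """(MI error, SSVEP error, channel count) for a binary channel mask -- all minimized."""
    sel = np.flatnonzero(mask)
    if sel.size == 0:
        return (1.0, 1.0, n_channels)
    err_mi = 1 - cross_val_score(LinearDiscriminantAnalysis(), X[:, sel], y_mi, cv=3).mean()
    err_sv = 1 - cross_val_score(LinearDiscriminantAnalysis(), X[:, sel], y_ssvep, cv=3).mean()
    return (err_mi, err_sv, sel.size)

def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

candidates = [rng.integers(0, 2, n_channels) for _ in range(40)]
scored = [(m, objectives(m)) for m in candidates]
pareto = [(m, s) for m, s in scored
          if not any(dominates(s2, s) for _, s2 in scored if s2 != s)]
print(len(pareto), "non-dominated channel subsets")
```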


Author(s):  
Wahid A. M. Shehata ◽  
Haitham Yousof ◽  
Mohamed Aboraya

This paper presents a novel two-parameter G family of distributions. Relevant statistical properties such as the ordinary moments, incomplete moments, and moment generating function are derived. Using common copulas, some new bivariate-type G families are derived. Special attention is devoted to the standard exponential baseline model. The density of the new exponential extension can be asymmetric right-skewed with no peak, asymmetric right-skewed with one peak, symmetric, or asymmetric left-skewed with one peak. The hazard rate of the new exponential distribution can be increasing, U-shaped, decreasing, or J-shaped. The usefulness and flexibility of the new family are illustrated by means of two applications to real data sets, in which the new family is compared with many common G families in modeling relief-time and survival-time data.
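
The abstract does not reproduce the family's functional form. As a generic, hedged illustration of what a "G family" construction looks like, the classical exponentiated-G family (not the specific two-parameter family proposed here) turns any baseline CDF G into a new model:

```latex
% Generic illustration only: the exponentiated-G construction,
% not the paper's proposed two-parameter family.
F(x; a) = \left[ G(x) \right]^{a}, \qquad
f(x; a) = a\, g(x)\, \left[ G(x) \right]^{a-1}, \qquad a > 0,
% with the standard exponential baseline G(x) = 1 - e^{-x},\ x > 0:
F(x; a) = \left( 1 - e^{-x} \right)^{a}.
```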


2021 ◽  
Author(s):  
Fatma Zohra Seghier ◽  
Halim Zeghdoudi

In this paper, a Poisson XLindley distribution (PXLD) is obtained by compounding the Poisson distribution (PD) with a continuous distribution. A general expression for its rth factorial moment about the origin is derived, and hence its raw and central moments are obtained. Expressions for its coefficient of variation, skewness, kurtosis, and index of dispersion are also given. In particular, the method of maximum likelihood and the method of moments for the estimation of its parameters are discussed. Finally, real-life data sets on Nipah virus infection, hemocytometer yeast cell counts, and epileptic seizure counts are analyzed to investigate the suitability of the proposed distribution.
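
A hedged sketch of the compounding step, with the continuous mixing density written generically as f(λ; θ) since the XLindley density is not reproduced here; the second line is the standard fact that the rth factorial moment of a mixed Poisson equals the rth raw moment of the mixing distribution, which is what makes the factorial-moment route convenient.

```latex
% General mixed-Poisson (compounding) construction; f(\lambda;\theta) stands in for the
% continuous mixing density (here the XLindley), not reproduced from the paper.
P(X = k) = \int_{0}^{\infty} \frac{e^{-\lambda}\lambda^{k}}{k!}\, f(\lambda;\theta)\, d\lambda,
\qquad k = 0, 1, 2, \dots
% r-th factorial moment of a mixed Poisson = r-th raw moment of the mixing distribution:
\mu'_{(r)} = \mathbb{E}\big[X(X-1)\cdots(X-r+1)\big] = \mathbb{E}\big[\Lambda^{r}\big].
```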


2022 ◽  
Vol 8 (1) ◽  
pp. 1-32
Author(s):  
Sajid Hasan Apon ◽  
Mohammed Eunus Ali ◽  
Bishwamittra Ghosh ◽  
Timos Sellis

Social networks with location-enabling technologies, also known as geo-social networks, allow users to share their location-specific activities and preferences through check-ins. A user in such a geo-social network is characterized by an associated location (spatial), her preferences expressed as keywords (textual), and her connectivity with friends (social). The fusion of the social, spatial, and textual data of a large number of users in these networks provides interesting insights for finding meaningful geo-social groups of users, supporting many real-life applications including activity planning and recommendation systems. In this article, we introduce a novel query, the Top-k Flexible Socio-Spatial Keyword-aware Group Query (SSKGQ), which finds the best k groups of varying sizes around different points of interest (POIs), where the groups are ranked based on the social and textual cohesiveness among members, the spatial closeness to the corresponding POI, and the number of members in the group. We develop an efficient approach to the SSKGQ problem based on our theoretical upper bounds on distance, social connectivity, and textual similarity. We prove that the SSKGQ problem is NP-hard and provide an approximate solution based on our derived relaxed bounds, which runs much faster than the exact approach while sacrificing group quality only slightly. Our extensive experiments on real data sets show the effectiveness of our approaches in different real-life settings.
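
A toy sketch of the ranking idea only; the exact SSKGQ scoring function and the bound-based pruning are not reproduced here. Candidate groups are scored by a weighted combination of social cohesiveness, textual similarity to the query keywords, spatial closeness to a POI, and group size, and the top-k are kept. The weights, Jaccard similarity, and data layout are illustrative assumptions.

```python
import heapq
import math

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def group_score(group, friendships, query_keywords, poi, weights=(0.4, 0.3, 0.2, 0.1)):
    w_soc, w_txt, w_spa, w_size = weights
    pairs = [(u, v) for i, u in enumerate(group) for v in group[i + 1:]]
    social = sum(1 for u, v in pairs if (u["id"], v["id"]) in friendships) / max(len(pairs), 1)
    textual = sum(jaccard(u["keywords"], query_keywords) for u in group) / len(group)
    spatial = 1 / (1 + sum(math.dist(u["loc"], poi) for u in group) / len(group))
    return w_soc * social + w_txt * textual + w_spa * spatial + w_size * len(group)

def top_k_groups(candidate_groups, friendships, query_keywords, poi, k):
    scored = [(group_score(g, friendships, query_keywords, poi), i, g)
              for i, g in enumerate(candidate_groups)]
    return [g for _, _, g in heapq.nlargest(k, scored)]

# toy usage: six users, two candidate groups, one POI
users = [{"id": i, "loc": (i % 3, i % 5), "keywords": ["food", "music"][: 1 + i % 2]}
         for i in range(6)]
friendships = {(0, 1), (1, 2), (3, 4)}
print(top_k_groups([users[:3], users[3:]], friendships, ["food"], poi=(0, 0), k=1))
```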


Mathematics ◽  
2021 ◽  
Vol 9 (11) ◽  
pp. 1261
Author(s):  
Hassan M. Okasha ◽  
Heba S. Mohammed ◽  
Yuhlong Lio

Given a progressively type-II censored sample, the E-Bayesian estimates, that is, the expected Bayesian estimates taken over the joint prior distribution of the hyper-parameters in the gamma prior of the unknown Weibull rate parameter, are developed for any given function of the unknown rate parameter under the squared error loss function. To study the impact of the selection of hyper-parameters for the prior, three different joint priors of the hyper-parameters are used to establish the theoretical properties of the E-Bayesian estimators for four functions of the rate parameter: the identity function (that is, the rate parameter itself) as well as the survival, hazard rate, and quantile functions. A simulation study is also conducted to compare the three E-Bayesian estimates, a Bayesian estimate, and the maximum likelihood estimate for each of the four functions considered. Moreover, two real data sets, from a medical study and an industrial life test, respectively, are used for illustration. Finally, concluding remarks are addressed.
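
As a hedged reminder of the general form only (not the paper's specific closed-form derivations): under squared error loss the Bayes estimate is the posterior mean, and the E-Bayesian estimate averages it over a joint prior π(a, b) on the gamma hyper-parameters (a, b) of the Weibull rate parameter θ:

```latex
\hat{\theta}_{B}(a, b) = \mathbb{E}\left[\theta \mid \text{data};\, a, b\right],
\qquad
\hat{\theta}_{EB} = \iint \hat{\theta}_{B}(a, b)\, \pi(a, b)\, da\, db .
```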


2021 ◽  
Vol 7 ◽  
pp. e652
Author(s):  
Diana Martinez-Mosquera ◽  
Rosa Navarrete ◽  
Sergio Luján-Mora

eXtensible Markup Language (XML) files are widely used in industry because of their flexibility in representing numerous kinds of data. Multiple applications, such as financial records, social networks, and mobile networks, use complex XML schemas with nested types, contents, and/or extension bases on existing complex elements, or large real-world files. A great number of these files are generated each day, and this has influenced the development of Big Data tools for their parsing and reporting, such as Apache Hive and Apache Spark. For these reasons, multiple studies have proposed new techniques for, and evaluated, the processing of XML files with Big Data systems. However, such works usually address only the simplest XML schemas, even though real data sets are composed of complex schemas. Therefore, to shed light on complex XML schema processing for real-life applications with Big Data tools, we present an approach that combines three methods for parsing XML files: cataloging, deserialization, and positional explode. In cataloging, the elements of the XML schema are mapped into root, arrays, structures, values, and attributes; based on these elements, deserialization and positional explode are straightforwardly implemented. To demonstrate the validity of our proposal, we develop a case study by implementing a test environment that illustrates the methods using real data sets from the performance management systems of two mobile network vendors. Our main results confirm the validity of the proposed method for different versions of Apache Hive and Apache Spark, report the query execution times for Apache Hive internal and external tables and Apache Spark data frames, and compare the query performance of Apache Hive with that of Apache Spark. A further contribution is a case study in which a novel solution is proposed for data analysis in the performance management systems of mobile networks.
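
A minimal PySpark sketch of the positional-explode step only, assuming the XML has already been catalogued and deserialized into a DataFrame with a nested array column; the column names and sample values are illustrative, not the vendors' actual schema.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import posexplode

spark = SparkSession.builder.appName("xml-posexplode-sketch").getOrCreate()

rows = [
    ("cell-001", ["10", "42", "7"]),     # stand-in for per-measurement counter lists
    ("cell-002", ["3", "55", "19"]),
]
df = spark.createDataFrame(rows, ["meas_obj", "counter_values"])

# positional explode: one row per array element, keeping the element's position so it
# can later be joined back to the counter names catalogued from the XML schema
flat = df.select("meas_obj", posexplode("counter_values").alias("pos", "value"))
flat.show()
```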


Author(s):  
Oseghale O. I. ◽  
Akomolafe A. A. ◽  
Gayawan E.

This work focuses on the four-parameter Exponentiated Cubic Transmuted Weibull distribution, which finds its main application in reliability analysis, especially for data that are non-monotone and bimodal. Structural properties such as the moments, moment generating function, quantile function, Rényi entropy, and order statistics were investigated. The maximum likelihood estimation technique was used to estimate the parameters of the distribution. Application to two real-life data sets shows the applicability of the distribution in modeling real data.
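
A hedged sketch of the maximum likelihood machinery only: the Exponentiated Cubic Transmuted Weibull density is not reproduced here, so a plain two-parameter Weibull log-likelihood stands in; the numerical-optimization pattern is what the sketch shows.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import weibull_min

def neg_log_lik(params, data):
    """Negative log-likelihood of a stand-in two-parameter Weibull model."""
    shape, scale = params
    if shape <= 0 or scale <= 0:
        return np.inf
    return -weibull_min.logpdf(data, c=shape, scale=scale).sum()

rng = np.random.default_rng(2)
data = weibull_min.rvs(c=1.5, scale=2.0, size=200, random_state=rng)  # synthetic lifetimes

result = minimize(neg_log_lik, x0=[1.0, 1.0], args=(data,), method="Nelder-Mead")
print("MLE (shape, scale):", result.x)
```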


Energies ◽  
2020 ◽  
Vol 13 (5) ◽  
pp. 1085
Author(s):  
Syed Naeem Haider ◽  
Qianchuan Zhao ◽  
Xueliang Li

Prediction of battery health in data centers plays a significant role in Battery Management Systems (BMS). Data centers use thousands of batteries, whose lifespan inevitably decreases over time. Predicting a battery's degradation status before the first failure is encountered during its discharge cycle is critical, and it also turns out to be a very difficult task in real life. Therefore, a framework is proposed to improve the accuracy of the Auto-Regressive Integrated Moving Average (ARIMA) model for forecasting battery health with clustered predictors. Clustering approaches, such as Dynamic Time Warping (DTW) or k-shape-based clustering, are useful for finding patterns in data sets with multiple time series. The large number of batteries in a data center is exploited to cluster the voltage patterns, which are then used to improve the accuracy of the ARIMA model. Our results show that the forecasting accuracy of the ARIMA model is significantly improved by applying the clustered predictor to batteries in a real data center. This paper uses the actual historical data of 40 batteries from a large-scale data center over one whole year to validate the effectiveness of the proposed methodology.
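
A minimal sketch of the overall idea, not the paper's exact pipeline: plain k-means on the raw voltage curves stands in for the DTW/k-shape clustering, and statsmodels' ARIMA takes the cluster-mean curve as an exogenous "clustered predictor". The synthetic series, cluster count, and ARIMA order are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(3)
n_batteries, horizon = 40, 120
# synthetic stand-in for per-battery voltage histories (rows = batteries)
series = np.cumsum(rng.normal(0, 0.05, (n_batteries, horizon)), axis=1) + 12.0

# cluster the voltage patterns; the cluster-mean curve acts as the clustered predictor
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(series)

battery = 0
exog = series[labels == labels[battery]].mean(axis=0)   # clustered predictor for this battery

train = horizon - 12
model = ARIMA(series[battery, :train], exog=exog[:train], order=(1, 1, 1)).fit()
forecast = model.forecast(steps=12, exog=exog[train:])
print(forecast[:3])
```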


1993 ◽  
Vol 18 (1) ◽  
pp. 41-68 ◽  
Author(s):  
Ratna Nandakumar ◽  
William Stout

This article provides a detailed investigation of Stout's statistical procedure (the computer program DIMTEST) for testing the hypothesis that an essentially unidimensional latent trait model fits observed binary item response data from a psychological test. One finding was that DIMTEST may fail to perform as desired in the presence of guessing when coupled with many highly discriminating items. A revision of DIMTEST is proposed to overcome this limitation. Also, an automatic approach is devised to determine the size of the assessment subtests. Further, an adjustment is made to the estimated standard error of the statistic on which DIMTEST depends. These three refinements have led to an improved procedure that is shown in simulation studies to adhere closely to the nominal level of significance while achieving considerably greater power. Finally, DIMTEST is validated on a selection of real data sets.

