Parameters Selection of LLE Algorithm for Classification Tasks

2014 ◽  
Vol 1037 ◽  
pp. 422-427 ◽  
Author(s):  
Feng Hu ◽  
Chuan Tong Wang ◽  
Yu Chuan Wu ◽  
Liang Zhi Fan

The crux of the locally linear embedding (LLE) algorithm is the selection of the embedding dimensionality and the neighborhood size. A method of parameter selection based on the normalized cut (Ncut) criterion for classification tasks is proposed. Unlike current techniques based on the neighborhood-topology-preservation criterion, the proposed method capitalizes on the class separability of the embedding result. By taking class separability into consideration, the intrinsic capability of LLE can be more faithfully reflected, and hence more rational features for classification in real-life applications can be offered. The theoretical argument is supported by experimental results on synthetic and real data sets.
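
A minimal sketch, not the authors' code, of what Ncut-driven parameter selection for LLE could look like: scikit-learn's LocallyLinearEmbedding is assumed as the LLE implementation, an RBF affinity is built on the embedded points, and the binary class labels define the graph partition whose normalized cut is minimized over a grid of (neighborhood size, dimensionality).

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.metrics.pairwise import rbf_kernel

def ncut(W, y):
    """Normalized cut of the affinity graph W induced by the binary labels y."""
    a, b = (y == 0), (y == 1)
    cut = W[np.ix_(a, b)].sum()
    return cut / W[a].sum() + cut / W[b].sum()

def select_lle_params(X, y, neighbor_grid, dim_grid):
    """Pick (n_neighbors, n_components) minimizing Ncut of the embedded data."""
    best, best_score = None, np.inf
    for k in neighbor_grid:
        for d in dim_grid:
            Z = LocallyLinearEmbedding(n_neighbors=k, n_components=d).fit_transform(X)
            score = ncut(rbf_kernel(Z), y)   # low Ncut = well-separated classes
            if score < best_score:
                best, best_score = (k, d), score
    return best, best_score

# toy usage with two Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 10)), rng.normal(3, 1, (50, 10))])
y = np.repeat([0, 1], 50)
print(select_lle_params(X, y, neighbor_grid=[5, 10, 15], dim_grid=[2, 3]))
```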

Complexity ◽  
2018 ◽  
Vol 2018 ◽  
pp. 1-16 ◽  
Author(s):  
Yiwen Zhang ◽  
Yuanyuan Zhou ◽  
Xing Guo ◽  
Jintao Wu ◽  
Qiang He ◽  
...  

The K-means algorithm is one of the ten classic algorithms in the area of data mining and has been studied by researchers in numerous fields for a long time. However, the value of the cluster number k in the K-means algorithm is not always easy to determine, and the selection of the initial centers is vulnerable to outliers. This paper proposes an improved K-means clustering algorithm called the covering K-means algorithm (C-K-means). The C-K-means algorithm can not only acquire efficient and accurate clustering results but also self-adaptively provide a reasonable number of clusters based on the data features. It comprises two phases: the initialization of the covering algorithm (CA) and the Lloyd iteration of K-means. The first phase executes the CA, which self-organizes and recognizes the number of clusters k based on the similarities in the data; it requires neither the number of clusters to be prespecified nor the initial centers to be manually selected. Therefore, it has a "blind" feature, that is, k is not preselected. The second phase performs the Lloyd iteration based on the results of the first phase. The C-K-means algorithm combines the advantages of CA and K-means. Experiments are carried out on the Spark platform, and the results verify the good scalability of the C-K-means algorithm, which can effectively solve the problem of large-scale data clustering. Extensive experiments on real data sets show that the C-K-means algorithm outperforms existing algorithms in accuracy and efficiency under both sequential and parallel conditions.
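
A minimal sketch of the two-phase idea, under stated assumptions and not the paper's implementation: a greedy covering pass discovers k and rough centers from a covering radius, and standard Lloyd iterations (here scikit-learn's KMeans) are seeded with those centers. The covering radius and the greedy seeding rule are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def covering_init(X, radius):
    """Greedy covering: repeatedly pick an uncovered point and cover its neighborhood."""
    uncovered = np.ones(len(X), dtype=bool)
    centers = []
    while uncovered.any():
        seed = X[uncovered][0]                       # first uncovered point as seed
        in_ball = np.linalg.norm(X - seed, axis=1) <= radius
        members = in_ball & uncovered
        centers.append(X[members].mean(axis=0))      # center = mean of the covered ball
        uncovered &= ~in_ball
    return np.asarray(centers)

def c_k_means_sketch(X, radius):
    centers = covering_init(X, radius)               # phase 1: k inferred from the data
    return KMeans(n_clusters=len(centers), init=centers, n_init=1).fit(X)  # phase 2: Lloyd

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, (60, 2)) for c in ([0, 0], [3, 0], [0, 3])])
model = c_k_means_sketch(X, radius=1.0)
print("clusters found:", model.n_clusters)
```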


2021 ◽  
Vol 15 ◽  
Author(s):  
Tianyu Liu ◽  
Zhixiong Xu ◽  
Lei Cao ◽  
Guowei Tan

Hybrid-modality brain-computer interfaces (BCIs), which combine motor imagery (MI) bio-signals and steady-state visual evoked potentials (SSVEPs), have attracted wide attention in the field of neural engineering. For real-life applications, the number of channels should be as small as possible. However, most recent work on channel selection focuses only on either the performance of the classification task or the effectiveness of device control; few works conduct channel selection for the MI and SSVEP classification tasks simultaneously. In this paper, a multitasking-based multiobjective evolutionary algorithm (EMMOA) was proposed to select appropriate channels for these two classification tasks at the same time. Moreover, a two-stage framework was introduced to balance the number of selected channels and the classification accuracy in the proposed algorithm. The experimental results verified the feasibility of the multiobjective optimization methodology for channel selection in hybrid BCI tasks.
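
A toy sketch, not EMMOA itself, of the multiobjective channel-selection idea: random binary channel masks are scored on three competing objectives (MI error, SSVEP error, number of channels) and the non-dominated (Pareto) subsets are kept. The data, labels, and LDA classifier are placeholders for real EEG features; an evolutionary search would replace the random candidate pool.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_trials, n_channels = 120, 16
X = rng.normal(size=(n_trials, n_channels))          # placeholder channel features
y_mi = rng.integers(0, 2, n_trials)                  # placeholder MI labels
y_ssvep = rng.integers(0, 4, n_trials)               # placeholder SSVEP labels

def objectives(mask):
    """(MI error, SSVEP error, channel count) for a binary channel mask -- all minimized."""
    sel = np.flatnonzero(mask)
    if sel.size == 0:
        return (1.0, 1.0, n_channels)
    err_mi = 1 - cross_val_score(LinearDiscriminantAnalysis(), X[:, sel], y_mi, cv=3).mean()
    err_sv = 1 - cross_val_score(LinearDiscriminantAnalysis(), X[:, sel], y_ssvep, cv=3).mean()
    return (err_mi, err_sv, sel.size)

def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

candidates = [rng.integers(0, 2, n_channels) for _ in range(40)]
scored = [(m, objectives(m)) for m in candidates]
pareto = [(m, s) for m, s in scored
          if not any(dominates(s2, s) for _, s2 in scored if s2 != s)]
print(len(pareto), "non-dominated channel subsets")
```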


Author(s):  
Wahid A. M. Shehata ◽  
Haitham Yousof ◽  
Mohamed Aboraya

This paper presents a novel two-parameter G family of distributions. Relevant statistical properties such as the ordinary moments, incomplete moments, and moment generating function are derived. Using common copulas, some new bivariate-type G families are derived. Special attention is devoted to the standard exponential baseline model. The density of the new exponential extension can be asymmetric right-skewed with no peak, asymmetric right-skewed with one peak, symmetric, or asymmetric left-skewed with one peak. The hazard rate of the new exponential distribution can be increasing, U-shaped, decreasing, or J-shaped. The usefulness and flexibility of the new family are illustrated by means of two applications to real data sets, in which the new family is compared with many common G families in modeling relief-time and survival-time data.
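
The abstract does not reproduce the family's functional form. As a generic, hedged illustration of what a "G family" construction looks like, the classical exponentiated-G family (not the specific two-parameter family proposed here) turns any baseline CDF G into a new model:

```latex
% Generic illustration only: the exponentiated-G construction,
% not the paper's proposed two-parameter family.
F(x; a) = \left[ G(x) \right]^{a}, \qquad
f(x; a) = a\, g(x)\, \left[ G(x) \right]^{a-1}, \qquad a > 0,
% with the standard exponential baseline G(x) = 1 - e^{-x},\ x > 0:
F(x; a) = \left( 1 - e^{-x} \right)^{a}.
```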


2021 ◽  
Author(s):  
Fatma Zohra Seghier ◽  
Halim Zeghdoudi

In this paper, a Poisson XLindley distribution (PXLD) is obtained by compounding the Poisson distribution (PD) with a continuous distribution. A general expression for its rth factorial moment about the origin is derived, and hence its raw and central moments are obtained. Expressions for its coefficient of variation, skewness, kurtosis, and index of dispersion are also given. In particular, the method of maximum likelihood and the method of moments for the estimation of its parameters are discussed. Finally, real-life data sets on Nipah virus infection, hemocytometer yeast cell counts, and epileptic seizure counts are analyzed to investigate the suitability of the proposed distribution.
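
A hedged sketch of the compounding step, with the continuous mixing density written generically as f(λ; θ) since the XLindley density is not reproduced here; the second line is the standard fact that the rth factorial moment of a mixed Poisson equals the rth raw moment of the mixing distribution, which is what makes the factorial-moment route convenient.

```latex
% General mixed-Poisson (compounding) construction; f(\lambda;\theta) stands in for the
% continuous mixing density (here the XLindley), not reproduced from the paper.
P(X = k) = \int_{0}^{\infty} \frac{e^{-\lambda}\lambda^{k}}{k!}\, f(\lambda;\theta)\, d\lambda,
\qquad k = 0, 1, 2, \dots
% r-th factorial moment of a mixed Poisson = r-th raw moment of the mixing distribution:
\mu'_{(r)} = \mathbb{E}\big[X(X-1)\cdots(X-r+1)\big] = \mathbb{E}\big[\Lambda^{r}\big].
```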


2022 ◽  
Vol 8 (1) ◽  
pp. 1-32
Author(s):  
Sajid Hasan Apon ◽  
Mohammed Eunus Ali ◽  
Bishwamittra Ghosh ◽  
Timos Sellis

Social networks with location-enabling technologies, also known as geo-social networks, allow users to share their location-specific activities and preferences through check-ins. A user in such a geo-social network is characterized by an associated location (spatial), her preferences expressed as keywords (textual), and her connectivity with friends (social). The fusion of the social, spatial, and textual data of a large number of users in these networks provides interesting insights for finding meaningful geo-social groups of users, supporting many real-life applications including activity planning and recommendation systems. In this article, we introduce a novel query, the Top-k Flexible Socio-Spatial Keyword-aware Group Query (SSKGQ), which finds the best k groups of varying sizes around different points of interest (POIs), where the groups are ranked based on the social and textual cohesiveness among members, the spatial closeness to the corresponding POI, and the number of members in the group. We develop an efficient approach to the SSKGQ problem based on our theoretical upper bounds on distance, social connectivity, and textual similarity. We prove that the SSKGQ problem is NP-hard and provide an approximate solution based on our derived relaxed bounds, which runs much faster than the exact approach while sacrificing group quality only slightly. Our extensive experiments on real data sets show the effectiveness of our approaches in different real-life settings.
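
A toy sketch of the ranking idea only; the exact SSKGQ scoring function and the bound-based pruning are not reproduced here. Candidate groups are scored by a weighted combination of social cohesiveness, textual similarity to the query keywords, spatial closeness to a POI, and group size, and the top-k are kept. The weights, Jaccard similarity, and data layout are illustrative assumptions.

```python
import heapq
import math

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def group_score(group, friendships, query_keywords, poi, weights=(0.4, 0.3, 0.2, 0.1)):
    w_soc, w_txt, w_spa, w_size = weights
    pairs = [(u, v) for i, u in enumerate(group) for v in group[i + 1:]]
    social = sum(1 for u, v in pairs if (u["id"], v["id"]) in friendships) / max(len(pairs), 1)
    textual = sum(jaccard(u["keywords"], query_keywords) for u in group) / len(group)
    spatial = 1 / (1 + sum(math.dist(u["loc"], poi) for u in group) / len(group))
    return w_soc * social + w_txt * textual + w_spa * spatial + w_size * len(group)

def top_k_groups(candidate_groups, friendships, query_keywords, poi, k):
    scored = [(group_score(g, friendships, query_keywords, poi), i, g)
              for i, g in enumerate(candidate_groups)]
    return [g for _, _, g in heapq.nlargest(k, scored)]

# toy usage: six users, two candidate groups, one POI
users = [{"id": i, "loc": (i % 3, i % 5), "keywords": ["food", "music"][: 1 + i % 2]}
         for i in range(6)]
friendships = {(0, 1), (1, 2), (3, 4)}
print(top_k_groups([users[:3], users[3:]], friendships, ["food"], poi=(0, 0), k=1))
```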


Mathematics ◽  
2021 ◽  
Vol 9 (11) ◽  
pp. 1261
Author(s):  
Hassan M. Okasha ◽  
Heba S. Mohammed ◽  
Yuhlong Lio

Given a progressively type-II censored sample, the E-Bayesian estimates, that is, the expected Bayesian estimates taken over the joint prior distribution of the hyper-parameters in the gamma prior of the unknown Weibull rate parameter, are developed for any given function of the unknown rate parameter under the squared error loss function. To study the impact of the selection of hyper-parameters for the prior, three different joint priors of the hyper-parameters are used to establish the theoretical properties of the E-Bayesian estimators for four functions of the rate parameter: the identity function (that is, the rate parameter itself) as well as the survival, hazard rate, and quantile functions. A simulation study is also conducted to compare the three E-Bayesian estimates, a Bayesian estimate, and the maximum likelihood estimate for each of the four functions considered. Moreover, two real data sets, from a medical study and an industrial life test, respectively, are used for illustration. Finally, concluding remarks are addressed.
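
As a hedged reminder of the general form only (not the paper's specific closed-form derivations): under squared error loss the Bayes estimate is the posterior mean, and the E-Bayesian estimate averages it over a joint prior π(a, b) on the gamma hyper-parameters (a, b) of the Weibull rate parameter θ:

```latex
\hat{\theta}_{B}(a, b) = \mathbb{E}\left[\theta \mid \text{data};\, a, b\right],
\qquad
\hat{\theta}_{EB} = \iint \hat{\theta}_{B}(a, b)\, \pi(a, b)\, da\, db .
```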


2021 ◽  
Vol 7 ◽  
pp. e652
Author(s):  
Diana Martinez-Mosquera ◽  
Rosa Navarrete ◽  
Sergio Luján-Mora

eXtensible Markup Language (XML) files are widely used in industry because of their flexibility in representing numerous kinds of data. Multiple applications, such as financial records, social networks, and mobile networks, use complex XML schemas with nested types, contents, and/or extension bases on existing complex elements, or large real-world files. A great number of these files are generated each day, and this has influenced the development of Big Data tools for their parsing and reporting, such as Apache Hive and Apache Spark. For these reasons, multiple studies have proposed new techniques for, and evaluated, the processing of XML files with Big Data systems. However, such works usually address only the simplest XML schemas, even though real data sets are composed of complex schemas. Therefore, to shed light on complex XML schema processing for real-life applications with Big Data tools, we present an approach that combines three methods for parsing XML files: cataloging, deserialization, and positional explode. In cataloging, the elements of the XML schema are mapped into root, arrays, structures, values, and attributes; based on these elements, deserialization and positional explode are straightforwardly implemented. To demonstrate the validity of our proposal, we develop a case study by implementing a test environment that illustrates the methods using real data sets from the performance management systems of two mobile network vendors. Our main results confirm the validity of the proposed method for different versions of Apache Hive and Apache Spark, report the query execution times for Apache Hive internal and external tables and Apache Spark data frames, and compare the query performance of Apache Hive with that of Apache Spark. A further contribution is a case study in which a novel solution is proposed for data analysis in the performance management systems of mobile networks.
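
A minimal PySpark sketch of the positional-explode step only, assuming the XML has already been catalogued and deserialized into a DataFrame with a nested array column; the column names and sample values are illustrative, not the vendors' actual schema.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import posexplode

spark = SparkSession.builder.appName("xml-posexplode-sketch").getOrCreate()

rows = [
    ("cell-001", ["10", "42", "7"]),     # stand-in for per-measurement counter lists
    ("cell-002", ["3", "55", "19"]),
]
df = spark.createDataFrame(rows, ["meas_obj", "counter_values"])

# positional explode: one row per array element, keeping the element's position so it
# can later be joined back to the counter names catalogued from the XML schema
flat = df.select("meas_obj", posexplode("counter_values").alias("pos", "value"))
flat.show()
```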


Author(s):  
Oseghale O. I. ◽  
Akomolafe A. A. ◽  
Gayawan E.

This work focuses on the four-parameter Exponentiated Cubic Transmuted Weibull distribution, which finds its main application in reliability analysis, especially for data that are non-monotone and bimodal. Structural properties such as the moments, moment generating function, quantile function, Rényi entropy, and order statistics were investigated. The maximum likelihood estimation technique was used to estimate the parameters of the distribution. Application to two real-life data sets shows the applicability of the distribution in modeling real data.
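
A hedged sketch of the maximum likelihood machinery only: the Exponentiated Cubic Transmuted Weibull density is not reproduced here, so a plain two-parameter Weibull log-likelihood stands in; the numerical-optimization pattern is what the sketch shows.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import weibull_min

def neg_log_lik(params, data):
    """Negative log-likelihood of a stand-in two-parameter Weibull model."""
    shape, scale = params
    if shape <= 0 or scale <= 0:
        return np.inf
    return -weibull_min.logpdf(data, c=shape, scale=scale).sum()

rng = np.random.default_rng(2)
data = weibull_min.rvs(c=1.5, scale=2.0, size=200, random_state=rng)  # synthetic lifetimes

result = minimize(neg_log_lik, x0=[1.0, 1.0], args=(data,), method="Nelder-Mead")
print("MLE (shape, scale):", result.x)
```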


Energies ◽  
2020 ◽  
Vol 13 (5) ◽  
pp. 1085
Author(s):  
Syed Naeem Haider ◽  
Qianchuan Zhao ◽  
Xueliang Li

Prediction of battery health in data centers plays a significant role in Battery Management Systems (BMS). Data centers use thousands of batteries, whose lifespan inevitably decreases over time. Predicting a battery's degradation status before the first failure is encountered during its discharge cycle is critical, and it also turns out to be a very difficult task in real life. Therefore, a framework is proposed to improve the accuracy of the Auto-Regressive Integrated Moving Average (ARIMA) model for forecasting battery health with clustered predictors. Clustering approaches, such as Dynamic Time Warping (DTW) or k-shape-based clustering, are useful for finding patterns in data sets with multiple time series. The large number of batteries in a data center is exploited to cluster the voltage patterns, which are then used to improve the accuracy of the ARIMA model. Our results show that the forecasting accuracy of the ARIMA model is significantly improved by applying the clustered predictor to batteries in a real data center. This paper uses the actual historical data of 40 batteries from a large-scale data center over one whole year to validate the effectiveness of the proposed methodology.
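
A minimal sketch of the overall idea, not the paper's exact pipeline: plain k-means on the raw voltage curves stands in for the DTW/k-shape clustering, and statsmodels' ARIMA takes the cluster-mean curve as an exogenous "clustered predictor". The synthetic series, cluster count, and ARIMA order are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(3)
n_batteries, horizon = 40, 120
# synthetic stand-in for per-battery voltage histories (rows = batteries)
series = np.cumsum(rng.normal(0, 0.05, (n_batteries, horizon)), axis=1) + 12.0

# cluster the voltage patterns; the cluster-mean curve acts as the clustered predictor
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(series)

battery = 0
exog = series[labels == labels[battery]].mean(axis=0)   # clustered predictor for this battery

train = horizon - 12
model = ARIMA(series[battery, :train], exog=exog[:train], order=(1, 1, 1)).fit()
forecast = model.forecast(steps=12, exog=exog[train:])
print(forecast[:3])
```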


1993 ◽  
Vol 18 (1) ◽  
pp. 41-68 ◽  
Author(s):  
Ratna Nandakumar ◽  
William Stout

This article provides a detailed investigation of Stout's statistical procedure (the computer program DIMTEST) for testing the hypothesis that an essentially unidimensional latent trait model fits observed binary item response data from a psychological test. One finding was that DIMTEST may fail to perform as desired in the presence of guessing when coupled with many highly discriminating items. A revision of DIMTEST is proposed to overcome this limitation. Also, an automatic approach is devised to determine the size of the assessment subtests. Further, an adjustment is made to the estimated standard error of the statistic on which DIMTEST depends. These three refinements have led to an improved procedure that is shown in simulation studies to adhere closely to the nominal level of significance while achieving considerably greater power. Finally, DIMTEST is validated on a selection of real data sets.

