High Dimensionality
Recently Published Documents


TOTAL DOCUMENTS: 313 (FIVE YEARS: 115)
H-INDEX: 20 (FIVE YEARS: 6)

2022, Vol 11 (2), pp. 1-22
Author(s): Abha Jain, Ankita Bansal

The need of customers to be connected to the network at all times has driven the evolution of mobile technology, and operating systems play a vital role in that technology. Android is now one of the most widely used mobile operating systems. The authors analysed three stable versions of Android: 6.0, 7.0, and 8.0. Incorporating a change into a version after it is released requires a great deal of rework, and thus substantial costs are incurred. In this paper, the aim is to reduce this rework by identifying, during the early phase of development, the parts of a version that need careful attention. Machine learning prediction models are developed to identify the parts that are more prone to change. The accuracy of such models should be high, as developers rely heavily on them, but the high dimensionality of the dataset may hamper it. The authors therefore explore four dimensionality reduction techniques that are unexplored in the field of network and communication. The results show that accuracy improves after reducing the features.
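As a rough illustration of this workflow (the abstract does not name the four reduction techniques), the sketch below puts PCA, one widely used technique, in front of a change-proneness classifier. The data and every parameter choice are synthetic placeholders, not the paper's setup.

```python
# Minimal sketch: dimensionality reduction before a change-proneness
# classifier. PCA stands in for the paper's (unnamed) reduction
# techniques; the synthetic data is purely illustrative.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 200))    # 200 class-level metrics (high-dimensional)
y = rng.integers(0, 2, size=500)   # 1 = change-prone, 0 = stable

baseline = RandomForestClassifier(random_state=0)
reduced = make_pipeline(PCA(n_components=20), RandomForestClassifier(random_state=0))

print("baseline accuracy:", cross_val_score(baseline, X, y, cv=5).mean())
print("reduced accuracy: ", cross_val_score(reduced, X, y, cv=5).mean())
```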


Author(s): Meng Yuan, Justin Zobel, Pauline Lin

Clustering of the contents of a document corpus is used to create sub-corpora that are expected to consist of documents related to each other. However, while clustering is used in a variety of ways in document applications such as information retrieval, and a range of methods has been applied to the task, there has been relatively little exploration of how well it works in practice. Indeed, given the high dimensionality of the data, clustering may not always produce meaningful outcomes. In this paper we use a well-known clustering method to explore a variety of techniques, existing and novel, for measuring clustering effectiveness. Results with our new extrinsic techniques, based on relevance judgements or retrieved documents, demonstrate that retrieval-based information can be used to assess the quality of clustering, and also show that clustering can succeed to some extent at gathering together similar material. Further, they show that intrinsic clustering measures that have been found informative in other domains do not work for information retrieval. Whether clustering is sufficiently effective to have a significant impact on practical retrieval is unclear, but, as the results show, our measurement techniques can effectively distinguish between clustering methods.
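A minimal sketch of the intrinsic-versus-extrinsic contrast, under assumed stand-ins: k-means over TF-IDF as the well-known clustering method, silhouette as an intrinsic measure, and adjusted Rand against known newsgroup labels standing in for relevance judgements. The paper's actual measures are not specified in the abstract.

```python
# Contrast an intrinsic measure (silhouette) with an extrinsic one
# (adjusted Rand against known category labels). k-means over TF-IDF
# is one common setup, not necessarily the paper's.
from sklearn.cluster import KMeans
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score, silhouette_score

data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
X = TfidfVectorizer(max_features=5000, stop_words="english").fit_transform(data.data)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("intrinsic (silhouette):", silhouette_score(X, km.labels_))
print("extrinsic (adj. Rand): ", adjusted_rand_score(data.target, km.labels_))
```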


PLoS ONE, 2021, Vol 16 (12), pp. e0255328
Author(s): Rolando Barajas, Brionna Hair, Gabriel Lai, Melissa Rotunno, Marissa M. Shams-White, ...

Systems epidemiology offers a more comprehensive and holistic approach to studies of cancer in populations by considering high-dimensional measures from multiple domains, assessing the inter-relationships among risk factors, and considering changes over time. These approaches offer a framework that accounts for the complexity of cancer and contributes to a broader understanding of the disease. The NCI therefore sponsored a workshop in February 2019 to facilitate discussion of the opportunities and challenges in applying systems epidemiology approaches to cancer research. Eight key themes emerged from the discussion: transdisciplinary collaboration and a problem-based approach; methods and modeling considerations; interpretation, validation, and evaluation of models; data needs and opportunities; sharing of data and models; enhanced training practices; dissemination of systems models; and building a systems epidemiology community. This manuscript summarizes these themes, highlights opportunities for cancer systems epidemiology research, outlines ways to foster this research area, and introduces a collection of papers, "Cancer System Epidemiology Insights and Future Opportunities," that highlight findings based on systems epidemiology approaches.


2021, Vol 7, pp. e832
Author(s): Barbara Pes, Giuseppina Lai

High dimensionality and class imbalance have been widely recognized as important issues in machine learning. A vast amount of literature has investigated suitable approaches to the multiple challenges that arise when dealing with high-dimensional feature spaces (where each problem instance is described by a large number of features). Likewise, several learning strategies have been devised to cope with the adverse effects of imbalanced class distributions, which may severely impact the generalization ability of the induced models. Nevertheless, although both issues have been studied for several years, they have mostly been addressed separately, and their combined effects are yet to be fully understood. Indeed, little research has so far been conducted into which approaches might be best suited to datasets that are, at the same time, high-dimensional and class-imbalanced. To make a contribution in this direction, our work presents a comparative study of learning strategies that leverage both feature selection, to cope with high dimensionality, and cost-sensitive learning, to cope with class imbalance. Specifically, different ways of incorporating misclassification costs into the learning process have been explored, and different feature selection heuristics, both univariate and multivariate, have been considered to comparatively evaluate their effectiveness on imbalanced data. The experiments have been conducted on three challenging benchmarks from the genomic domain, gaining interesting insight into the beneficial impact of combining feature selection and cost-sensitive learning, especially in the presence of highly skewed data distributions.
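A minimal sketch of one such combination, using assumed stand-ins for the paper's methods: a univariate ANOVA filter for feature selection and class weighting as a simple way to inject misclassification costs. The synthetic data merely mimics a high-dimensional, imbalanced genomic benchmark.

```python
# Combine univariate feature selection with cost-sensitive learning via
# class weights -- one simple cost-injection scheme among those the
# paper compares.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for a high-dimensional, imbalanced genomic dataset.
X, y = make_classification(n_samples=300, n_features=2000, n_informative=20,
                           weights=[0.9, 0.1], random_state=0)

model = make_pipeline(
    SelectKBest(f_classif, k=50),                                 # univariate filter
    LogisticRegression(class_weight="balanced", max_iter=1000),   # cost-sensitive
)
print("balanced accuracy:",
      cross_val_score(model, X, y, cv=5, scoring="balanced_accuracy").mean())
```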


2021, Vol 24, pp. 45-52
Author(s): Jana Busa, Inese Polaka

The study focuses on the analysis of biological data containing the numbers of genome sequences of intestinal microbiome bacteria before and after antibiotic use. The data have high dimensionality (bacterial taxa) and a small number of records, which is typical of bioinformatics data. Classification models induced on such data sets are usually not stable, and their accuracy metrics have high variance. The aim of the study is to create a preprocessing workflow and a classification model that classify the microbiome into before- and after-antibiotic groups as accurately as possible while lessening the variability of the classifier's accuracy measures. To evaluate the model, the area under the ROC curve (AUC) and the overall accuracy of the classifier were used. In the experiments, the authors examined how the classification results were affected by feature selection and by an increased size of the data set.
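A minimal sketch of how that variability can be quantified, assuming repeated stratified cross-validation and a mutual-information filter as stand-ins for the paper's (unspecified) workflow; the synthetic data mimics a small, high-dimensional taxa table.

```python
# Repeated cross-validation to expose the variance of AUC estimates on
# a small, high-dimensional dataset, with and without feature selection.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=60, n_features=500, n_informative=10,
                           random_state=0)   # few records, many taxa

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
for name, model in [
    ("all features   ", RandomForestClassifier(random_state=0)),
    ("selected (k=20)", make_pipeline(SelectKBest(mutual_info_classif, k=20),
                                      RandomForestClassifier(random_state=0))),
]:
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(f"{name}: AUC = {scores.mean():.2f} ± {scores.std():.2f}")
```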


2021
Author(s): A B Pawar, M A Jawale, Ravi Kumar Tirandasu, Saiprasad Potharaju

High dimensionality is a serious issue in the preprocessing stage of data mining. Having a large number of features in a dataset leads to several complications when classifying an unknown instance. The initial data space may contain redundant and irrelevant features, which cause high memory consumption and confuse the learning model built on them. It is therefore advisable to select the best features before generating the classification model, for better accuracy. In this research, we propose a novel feature selection approach based on Symmetrical Uncertainty and the Correlation Coefficient (SU-CCE) for reducing the high-dimensional feature space and increasing classification accuracy. The experiment is performed on a colon cancer microarray dataset with 2000 features, from which the proposed method derives the 38 best features. To measure the strength of the proposed method, the top 38 features extracted by four traditional filter-based methods are compared across various classifiers. After careful investigation of the results, the proposed approach is competitive with most of the traditional methods.
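A minimal sketch of the symmetrical-uncertainty half of the approach, since the abstract does not detail how SU is combined with the correlation coefficient. Features are discretised first, and the data are random placeholders with the colon-cancer dimensions.

```python
# Symmetrical uncertainty (SU) for filter-based feature ranking:
# SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), on discretised features.
import numpy as np
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score
from sklearn.preprocessing import KBinsDiscretizer

def symmetrical_uncertainty(x, y):
    """SU between a discretised feature x and class labels y."""
    mi = mutual_info_score(x, y)
    hx = entropy(np.bincount(x))
    hy = entropy(np.bincount(y))
    return 2.0 * mi / (hx + hy) if hx + hy > 0 else 0.0

rng = np.random.default_rng(0)
X = rng.normal(size=(62, 2000))    # e.g. microarray expression levels
y = rng.integers(0, 2, size=62)    # tumour vs. normal (illustrative)

Xd = KBinsDiscretizer(n_bins=5, encode="ordinal",
                      strategy="uniform").fit_transform(X).astype(int)
su = np.array([symmetrical_uncertainty(Xd[:, j], y) for j in range(X.shape[1])])
print("top 38 feature indices:", np.argsort(su)[::-1][:38])
```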


2021, Vol 922 (2), pp. 228
Author(s): Yu-Yang Songsheng, Yi-Qian Qian, Yan-Rong Li, Pu Du, Jie-Wen Chen, ...

Detecting continuous nanohertz gravitational waves (GWs) generated by individual close binaries of supermassive black holes (CB-SMBHs) is one of the primary objectives of pulsar timing arrays (PTAs). The detection sensitivity is slated to increase significantly as the number of well-timed millisecond pulsars grows by more than an order of magnitude with the advent of next-generation radio telescopes. Currently, a Bayesian analysis pipeline using parallel-tempering Markov chain Monte Carlo has been applied in multiple studies of CB-SMBH searches, but it may be challenged by the high dimensionality of the parameter space of future large-scale PTAs. One solution is to reduce the dimensionality by maximizing or marginalizing over uninformative parameters semianalytically, but it is not clear whether this approach can be extended to more complex signal models without making overly simplified assumptions. Recently, the method of diffusive nested (DNest) sampling has shown the capability to cope effectively with high dimensionality and multimodality in Bayesian analysis. In this paper, we apply DNest to search for continuous GWs in simulated pulsar timing residuals and find that it performs well in terms of accuracy, robustness, and efficiency for a PTA including O(10^2) pulsars. DNest also allows an elegant simultaneous search for multiple sources, which demonstrates its scalability and general applicability. Our results show that it is convenient and highly beneficial to include DNest in current toolboxes of PTA analysis.
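DNest itself is too involved for a short sketch, so the toy below shows only the object any such sampler explores: a log-likelihood for a single sinusoidal continuous-wave signal in simulated timing residuals, searched here by brute force over a frequency grid. The signal model and all numbers are illustrative assumptions, not the paper's pipeline.

```python
# Toy sketch (pure NumPy): the log-likelihood surface a sampler such as
# DNest would explore when searching timing residuals for a continuous
# wave. A single sinusoid stands in for the full CB-SMBH signal model,
# and a frequency grid stands in for the sampler itself.
import numpy as np

rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 10 * 365.25 * 86400, 200))   # 10 yr of TOAs (s)
f_true, amp, sigma = 3e-8, 1e-7, 5e-8                   # Hz, s, s
residuals = amp * np.sin(2 * np.pi * f_true * t) + rng.normal(0, sigma, t.size)

def log_likelihood(f, a, phi):
    model = a * np.sin(2 * np.pi * f * t + phi)
    return -0.5 * np.sum(((residuals - model) / sigma) ** 2)

freqs = np.linspace(1e-9, 1e-7, 2000)                   # nanohertz band
logL = [log_likelihood(f, amp, 0.0) for f in freqs]
print(f"recovered frequency: {freqs[np.argmax(logL)]:.2e} Hz (true {f_true:.2e})")
```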


Author(s): Maria Mohammad Yousef

Medical dataset classification has become one of the biggest problems in data mining research. Every dataset has a given number of features, but some of them may be redundant or even harmful, disrupting the classification process; this is known as the high dimensionality problem. Dimensionality reduction during data preprocessing is critical for increasing the performance of machine learning algorithms, and feature subset selection in particular contributes to dimensionality reduction while significantly improving classification accuracy. In this paper, we propose a new hybrid feature selection approach (GA assisted by KNN) to deal with high dimensionality in biomedical data classification. The proposed method first combines GA and KNN to find the optimal subset of features, using the classification accuracy of the k-Nearest Neighbor (KNN) method as the fitness function for the GA. After the best subset of features is selected, a Support Vector Machine (SVM) is used as the classifier. The proposed method is evaluated on five medical datasets from the UCI Machine Learning Repository. The suggested technique performs admirably on these datasets, achieving higher classification accuracy while using fewer features.
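A minimal sketch of the GA-assisted-by-KNN idea as described: binary feature masks evolved with KNN cross-validation accuracy as the fitness, then an SVM trained on the winning subset. The GA operators, rates, and the breast-cancer dataset are illustrative assumptions, not the paper's exact configuration.

```python
# GA-based feature selection with KNN accuracy as the fitness function,
# followed by an SVM on the chosen subset.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)   # a UCI medical dataset
rng = np.random.default_rng(0)
n_feat, pop_size, n_gen = X.shape[1], 20, 15

def fitness(mask):
    if not mask.any():
        return 0.0
    return cross_val_score(KNeighborsClassifier(5), X[:, mask], y, cv=3).mean()

pop = rng.random((pop_size, n_feat)) < 0.5   # random binary feature masks
for _ in range(n_gen):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-pop_size // 2:]]    # keep the fittest half
    cut = rng.integers(1, n_feat, pop_size // 2)          # one-point crossover
    kids = np.array([np.r_[parents[rng.integers(len(parents))][:c],
                           parents[rng.integers(len(parents))][c:]] for c in cut])
    kids ^= rng.random(kids.shape) < 0.02                 # bit-flip mutation
    pop = np.vstack([parents, kids])

best = pop[np.argmax([fitness(m) for m in pop])]
print("selected features:", best.sum(), "of", n_feat)
print("SVM accuracy:", cross_val_score(SVC(), X[:, best], y, cv=5).mean())
```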

