Robust Representation and Efficient Feature Selection Allows for Effective Clustering of SARS-CoV-2 Variants

Algorithms ◽  
2021 ◽  
Vol 14 (12) ◽  
pp. 348
Author(s):  
Zahra Tayebi ◽  
Sarwan Ali ◽  
Murray Patterson

The widespread availability of large amounts of genomic data on the SARS-CoV-2 virus, as a result of the COVID-19 pandemic, has created an opportunity for researchers to analyze the disease at a level of detail unlike that for any virus before it. On the one hand, this will help biologists, policymakers, and other authorities to make timely and appropriate decisions to control the spread of the coronavirus. On the other hand, such studies will help to deal more effectively with any possible future pandemic. Since the SARS-CoV-2 virus has different variants, each with different mutations, performing any analysis on such data becomes a difficult task, given the size of the data. It is well known that much of the variation in the SARS-CoV-2 genome happens disproportionately in the spike region of the genome sequence, the relatively short region that codes for the spike protein(s). In this paper, we propose a robust feature-vector representation of biological sequences that, when combined with the appropriate feature selection method, allows different downstream clustering approaches to perform well on a variety of different measures. We use the proposed approach with an array of clustering techniques to cluster spike protein sequences in order to study the behavior of different known variants that are increasing at a very high rate throughout the world. We first use a k-mers based approach to generate a fixed-length feature vector representation of the spike sequences. We then show that, with the appropriate feature selection, we can efficiently and effectively cluster the spike sequences based on the different variants. Using a publicly available set of SARS-CoV-2 spike sequences, we cluster these sequences using both hard and soft clustering methods and show that, with our feature selection methods, we can achieve higher F1 scores for the clusters and better clustering quality metrics compared to baselines.
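
As a rough illustration of the k-mers step described in this abstract, the sketch below counts overlapping k-mers in a sequence and lays the counts out over every possible k-mer, so that sequences of any length map to vectors of the same length. The function name `kmer_vector`, the 20-letter amino acid alphabet, and the default k = 3 are assumptions made for this example; the abstract does not state the alphabet or the value of k the authors used.

```python
from itertools import product

# Assumed alphabet for this sketch: the 20 standard amino acids, in alphabetical order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_vector(sequence, k=3, alphabet=AMINO_ACIDS):
    """Return a fixed-length list of k-mer counts for `sequence`.

    The vector has len(alphabet) ** k entries, one per possible k-mer,
    ordered as itertools.product enumerates the alphabet. k-mers that
    contain a character outside the alphabet (for example 'X' for an
    unknown residue) are skipped.
    """
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    vector = [0] * len(index)
    # Slide a window of width k over the sequence, one position at a time.
    for start in range(len(sequence) - k + 1):
        position = index.get(sequence[start:start + k])
        if position is not None:
            vector[position] += 1
    return vector
```

For example, `kmer_vector("MFVFLVLLPLVSSQ")` yields a vector with 20^3 = 8000 entries whose counts sum to 12, the number of overlapping 3-mers in a 14-residue sequence. A feature selection method would then operate on these vectors.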

2019 ◽  
Vol 47 (2) ◽  
pp. 76-83 ◽  
Author(s):  
Gabrijela Dimic ◽  
Dejan Rancic ◽  
Nemanja Macek ◽  
Petar Spalevic ◽  
Vida Drasute

Purpose This paper aims to deal with the previously unknown prediction accuracy of students’ activity patterns in a blended learning environment. Design/methodology/approach To extract the most relevant activity feature subset, different feature-selection methods were applied. For subsets of different cardinality, classification models were used in the comparison. Findings Experimental evaluation opposes the hypothesis that reducing feature vector dimensionality leads to an increase in prediction accuracy. Research limitations/implications Improving prediction accuracy in the described learning environment was based on applying the synthetic minority oversampling technique, which affected the results of the correlation-based feature-selection method. Originality/value The major contribution of the research is the proposed methodology for selecting the optimal low-cardinality subset of students’ activities and a significant improvement in prediction accuracy in a blended learning environment.


Author(s):  
Martina Mariki ◽  
Neema Mduma ◽  
Elizabeth Mkoba

Malaria remains an important cause of death, especially in sub-Saharan Africa, with about 228 million malaria cases worldwide and an estimated 405,000 deaths in 2019. Currently, malaria is diagnosed in health facilities using a microscope (BS) or a malaria rapid diagnostic test (MRDT), and in areas where these tools are inadequate, presumptive treatment is performed. In addition, self-diagnosis and treatment are practiced in some households. Given the high rate of self-medication with malaria drugs, this study aimed to compute the most significant features for predicting malaria in Tanzania using feature selection methods, so that they can be used in developing a machine learning model for malaria diagnosis. A dataset of malaria symptoms and clinical diagnoses was extracted from patients’ files at four (4) identified health facilities in the regions of Kilimanjaro and Morogoro. These regions were selected to represent the high endemic areas (Morogoro) and low endemic areas (Kilimanjaro) in the country. The dataset contained 2556 instances and 36 variables. The random forest classifier, a tree-based method, was used to select the most important features for malaria prediction. Region-based features were obtained to facilitate accurate prediction. The feature ranking indicated that fever is universally the most influential feature for predicting malaria, followed by general body malaise, vomiting, and headache. However, these features are ranked differently across the regional datasets. Subsequently, six predictive models, using the important features selected by the feature selection method, were used to evaluate the features’ performance. The features identified comply with the malaria diagnosis and treatment guidelines provided by WHO and Tanzania Mainland. This compliance is observed so as to produce a prediction model that will fit into the current health care provision system in Tanzania.


Author(s):  
Jing Wang ◽  
Xiaobin Cheng ◽  
Xun Wang ◽  
Yan Gao ◽  
Bin Liu ◽  
...  

t-distributed stochastic neighbour embedding (t-SNE) is of considerable interest in machining condition monitoring for feature selection. In this paper, neural networks are introduced to solidify the manifold of the t-SNE prior to classification. This leads to an improved feature selection method, namely the Net-SNE. Conventional statistical features are first extracted from vibration signals to form a high dimensional feature vector. The redundancies in the feature vector are subsequently removed by the t-SNE. The neural networks then build a mapping model between the high dimensional feature vector and the selected features, so that new data are computed directly with the mapping model. Experiments were conducted on a lathe and a milling machine to collect vibration signals under common working conditions. The K-nearest neighbour classifier is applied to a small sample case and a class-imbalance case to compare the classification performance with and without the Net-SNE. The results demonstrate that the Net-SNE has an advantage over the t-SNE, since it can mine the discriminative features and solidify the manifold when computing new data. Moreover, the Net-SNE significantly improves classification accuracy and gives better classification performance in data-limited situations.


2019 ◽  
Vol 9 (6) ◽  
pp. 1161 ◽  
Author(s):  
Xiaoyue Chen ◽  
Xiaoyan Zhang ◽  
Jian Zhou ◽  
Ke Zhou

Rolling element bearings (REB) are widely used in all walks of life, and they play an important role in the healthy operation of all kinds of rotating machinery. Therefore, the fault diagnosis of REB has attracted substantial attention. Fault diagnosis methods based on time-frequency signal analysis and intelligent classification are widely used for REB because of their effectiveness. However, these fault diagnosis methods still have two shortcomings: (1) A large amount of redundant information is difficult to identify and delete. (2) Aliasing patterns decrease the methods’ classification accuracy. To overcome these problems, this paper puts forward an improved fault diagnosis method based on tree heuristic feature selection (THFS) and the dependent feature vector combined with rough sets (RS-DFV). In the RS-DFV method, the feature set was optimized through the dependent feature vector (DFV). Furthermore, the DFV revealed the essential difference among different REB faults and improved the accuracy of fault description. Moreover, the rough set was utilized to reasonably describe the aliasing patterns and overcome the problem of abnormal termination in DFV extraction. In addition, the THFS method was devised to delete the redundant information and construct the structure of RS-DFV. Finally, a simulation, four other feature vectors, three other feature selection methods and four other fault diagnosis methods were applied to REB fault diagnosis to demonstrate the effectiveness of the RS-DFV method. RS-DFV obtained an effective subset of five features from 100 features and achieved very good diagnostic accuracy (100%, 100%, 99.51%, 100%, 99.47%, 100%), which is much higher than in all comparative tests. The results indicate that the RS-DFV method could select an appropriate feature set, fully exploit the effectiveness of the features and describe the aliasing patterns more exactly. Consequently, this method performs better in REB fault diagnosis than the original intelligent methods.


2021 ◽  
Vol 21 (3) ◽  
pp. 1-18
Author(s):  
Mehedi Masud ◽  
Parminder Singh ◽  
Gurjot Singh Gaba ◽  
Avinash Kaur ◽  
Roobaea Alrobaea Alghamdi ◽  
...  

Edge Artificial Intelligence (AI) is the latest trend in next-generation computing for data analytics, particularly in predictive edge analytics for high-risk diseases like Parkinson’s Disease (PD). Deep learning techniques facilitate edge AI applications for enhanced, real-time handling of data. Parkinson’s arises from damage to the brain cells that produce dopamine, a substance that regulates communication between brain cells. The brain cells responsible for generating dopamine support adaptation, control, and fluent movement. Parkinson’s motor symptoms appear after the loss of 60% to 80% of these cells, owing to insufficient dopamine production. Recent research found a close connection between speech impairment and PD. Many researchers have developed classification algorithms to identify PD from speech signals. In this article, an optimal feature selection method based on the Adaptive Crow Search Algorithm (ACSA) and Deep Learning (DL) is introduced. The proposed model, CROW Search and Deep learning (CROWD), combines crow search with a stacked sparse autoencoder neural network. The Parkinson’s dataset used in the experiment is taken from the dataset repository of the University of California, Irvine (UCI). In the first phase, dataset cleaning is performed to handle the missing values in the dataset. After that, the proposed ACSA algorithm is employed to find the reduced feature vector. Furthermore, a stacked sparse autoencoder with seven hidden layers is employed to generate the compressed feature vector. The performance of the proposed CROWD autoencoder model is compared with three feature selection approaches for six supervised classification techniques. The experimental results demonstrate that the proposed CROWD autoencoder feature selection model outperforms the benchmarked feature selection techniques, (i) minimum Redundancy Maximum Relevance (mRMR), (ii) Recursive Feature Elimination (RFE), and (iii) Correlation-based Feature Selection (CFS), in classifying Parkinson’s disease. This research has significance in the healthcare sector for the enhancement of classification accuracy up to 0.96%.


2009 ◽  
Vol 29 (10) ◽  
pp. 2812-2815
Author(s):  
Yang-zhu LU ◽  
Xin-you ZHANG ◽  
Yu QI

Author(s):  
Ashish Shah ◽  
Vaishali Patel ◽  
Bhumika Parmar

Background: The novel coronavirus is an enveloped virus with single-stranded RNA enclosed in a helical nucleocapsid. The envelope carries spikes on its surface, which are made up of proteins and through which the virus enters human cells. Until now, there has been no specific drug or vaccine available to treat COVID-19 infection. In this scenario, repurposing of drugs or active molecules may provide a rapid solution to fight this deadly disease. Objective: We selected 30 phytoconstituents from different plants that are reported to have antiviral activity against coronaviruses (CoVs) and performed in silico screening to find phytoconstituents with the potency to inhibit specific targets of the novel coronavirus. Methods: We performed molecular docking studies on three different proteins of the novel coronavirus, namely the COVID-19 main protease (3CLpro), the papain-like protease (PLpro) and the spike protein (S) attached to the ACE2 binding domain. The phytoconstituents were screened on the basis of binding affinity compared with standard drugs. The screened compounds were validated using ADMET and bioactivity prediction. Results: We screened five compounds, biscoclaurine, norreticuline, amentoflavone, licoricidin and myricetin, using an in silico approach. All compounds were found safe in in silico toxicity studies. Bioactivity prediction reveals that all of these compounds may act through protease or enzyme inhibition. The results for biscoclaurine and norreticuline were the most interesting: biscoclaurine had higher binding affinity for the 3CLpro and PLpro targets, and norreticuline had higher binding affinity for PLpro and the spike protein. Conclusion: Our study concludes that these compounds could be explored further rapidly, as they may have the potential to fight COVID-19.


2021 ◽  
Vol 15 (8) ◽  
pp. 912-926
Author(s):  
Ge Zhang ◽  
Pan Yu ◽  
Jianlin Wang ◽  
Chaokun Yan

Background: There have been rapid developments in various bioinformatics technologies, which have led to the accumulation of a large amount of biomedical data. However, these datasets usually involve thousands of features and include much irrelevant or redundant information, which leads to confusion during diagnosis. Feature selection is a solution that consists of finding the optimal feature subset, which is known to be an NP-hard problem because of the large search space. Objective: To address this issue, this paper proposes a hybrid feature selection method based on an improved chemical reaction optimization algorithm (ICRO) and an information gain (IG) approach, called IGICRO. Methods: IG is adopted to obtain some important features. The neighborhood search mechanism is combined with ICRO to increase the diversity of the population and improve the capacity of local search. Results: Experimental results on eight publicly available data sets demonstrate that our proposed approach outperforms the original CRO and other state-of-the-art approaches.
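
The information gain filter mentioned in this abstract can be sketched as follows for discrete features. This is only a minimal illustration of the standard IG definition, not the authors' IGICRO implementation; the function names and the dictionary-of-columns input format are assumptions made for the example, and the ICRO search stage is not shown.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Information gain of one discrete feature with respect to the labels.

    IG = H(labels) - sum over v of P(feature = v) * H(labels | feature = v)
    """
    total = len(labels)
    groups = {}
    # Group the class labels by the value the feature takes.
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def top_features_by_ig(columns, labels, n):
    """Names of the n features with the highest information gain.

    `columns` maps a feature name to its list of discrete values
    (a hypothetical input format chosen for this sketch).
    """
    ranked = sorted(
        columns,
        key=lambda name: information_gain(columns[name], labels),
        reverse=True,
    )
    return ranked[:n]
```

A feature that splits the classes perfectly gets an IG equal to the label entropy, while a feature independent of the labels gets an IG of zero, so ranking by IG gives a cheap first pass before a more expensive search over feature subsets.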


2019 ◽  
Vol 12 (4) ◽  
pp. 329-337 ◽  
Author(s):  
Venubabu Rachapudi ◽  
Golagani Lavanya Devi

Background: An efficient feature selection method for histopathological image classification plays an important role in eliminating irrelevant and redundant features. Therefore, this paper proposes a new Lévy flight salp swarm optimizer based feature selection method. Methods: The proposed Lévy flight salp swarm optimizer based feature selection method uses Lévy flight steps for each follower salp to move it away from local optima. The best solution returns the relevant and non-redundant features, which are fed to different classifiers for efficient and robust image classification. Results: The efficiency of the proposed Lévy flight salp swarm optimizer has been verified on 20 benchmark functions. The proposed scheme beats the other considered meta-heuristic approaches. Furthermore, the proposed feature selection method has shown a better reduction in SURF features than the other considered methods and performed well for histopathological image classification. Conclusion: This paper proposes an efficient Lévy flight salp swarm optimizer by modifying the step size of the follower salps. The proposed modification reduces the chances of getting stuck in local optima. Furthermore, the Lévy flight salp swarm optimizer has been utilized to select optimum features from SURF features for histopathological image classification. The simulation results validate that the proposed method provides optimal values and high classification performance in comparison to other methods.
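
To make the idea of a Lévy flight step concrete, the sketch below draws heavy-tailed step lengths with Mantegna's algorithm, one common way to approximate Lévy-distributed steps in metaheuristics. Whether the authors use this generator, and with which parameters, is not stated in the abstract; the function names and the stability index beta = 1.5 are assumptions for illustration.

```python
import math
import random


def levy_sigma(beta=1.5):
    """Standard deviation of the numerator Gaussian in Mantegna's algorithm."""
    numerator = math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
    denominator = math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2)
    return (numerator / denominator) ** (1 / beta)


def levy_step(beta=1.5, rng=random):
    """Draw one heavy-tailed step: u / |v|^(1/beta) with Gaussian u and v.

    Most draws are small, but occasional very large steps occur, which is
    what lets a search agent jump away from a local optimum.
    """
    u = rng.gauss(0.0, levy_sigma(beta))
    v = rng.gauss(0.0, 1.0)
    return u / abs(v) ** (1 / beta)
```

In a swarm optimizer, a step drawn this way would typically be multiplied by a small scale factor and added to a follower's position; the exact update rule used in the paper is not given in the abstract, so it is left out here.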


Author(s):  
Fatemeh Alighardashi ◽  
Mohammad Ali Zare Chahooki

Improving software product quality before release through periodic testing is one of the most expensive activities in software projects. Because the resources for testing modules in software projects are limited, it is important to identify fault-prone modules and direct the testing resources to these modules. Software fault predictors based on machine learning algorithms are effective tools for identifying fault-prone modules. Extensive studies are being done in this field to find the connection between the features of software modules and their fault-proneness. Some of the features used in predictive algorithms are ineffective and reduce the accuracy of the prediction process. Feature selection methods are therefore widely used to increase the performance of models that predict fault-prone modules. In this study, we propose a feature selection method that selects features effectively by combining filter feature selection methods. In the proposed method, several filter feature selection methods are combined into a fused weighted filter method. The proposed method improved both the convergence rate of feature selection and the accuracy. The results obtained on ten datasets from NASA and PROMISE indicate the effectiveness of the proposed method in improving the accuracy and convergence of software fault prediction.
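
One plausible reading of a fused weighted filter is sketched below: each filter method scores every feature, the scores are rescaled to a common range, and a weighted sum gives one fused score per feature. The abstract does not say how the filters are normalised or how the weights are chosen, so the min-max scaling, the externally supplied weights, and all function names here are assumptions rather than the authors' method.

```python
def min_max(scores):
    """Rescale a list of scores to [0, 1]; a constant list maps to all zeros."""
    low, high = min(scores), max(scores)
    if high == low:
        return [0.0 for _ in scores]
    return [(s - low) / (high - low) for s in scores]


def fuse_filter_scores(score_lists, weights):
    """Weighted sum of min-max-normalised scores, one fused score per feature.

    `score_lists` holds one list of per-feature scores per filter method;
    `weights` holds one weight per filter method (assumed to be given).
    """
    if len(score_lists) != len(weights):
        raise ValueError("need exactly one weight per filter method")
    normalised = [min_max(scores) for scores in score_lists]
    n_features = len(normalised[0])
    return [
        sum(w * scores[j] for w, scores in zip(weights, normalised))
        for j in range(n_features)
    ]


def select_top(fused, n):
    """Indices of the n features with the highest fused score."""
    return sorted(range(len(fused)), key=lambda j: fused[j], reverse=True)[:n]
```

Normalising before fusing keeps a filter with a large numeric range from dominating the sum, which is the main design choice in this sketch.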

