MeShClust: an intelligent tool for clustering DNA sequences

2017 ◽  
Author(s):  
Benjamin T. James ◽  
Brian B. Luczak ◽  
Hani Z. Girgis

ABSTRACT
Sequence clustering is a fundamental step in analyzing DNA sequences. Widely used software tools for sequence clustering rely on greedy approaches that are not guaranteed to produce the best results. These tools are sensitive to one parameter that determines the similarity among sequences in a cluster. Oftentimes, a biologist may not know the exact sequence similarity. Therefore, if the provided parameter is inaccurate, the clusters produced by these tools are unlikely to match the real clusters in the data. To overcome this limitation, we adapted the mean shift algorithm, an unsupervised machine-learning algorithm, which has been used successfully thousands of times in fields such as image processing and computer vision. The theory behind the mean shift algorithm, unlike the greedy approaches, guarantees convergence to the modes, i.e., the cluster centers. Here we describe the first application of the mean shift algorithm to clustering DNA sequences. MeShClust is one of the few applications of the mean shift algorithm in bioinformatics. Further, we applied supervised machine learning to predict the identity score produced by global alignment using alignment-free methods. We demonstrate MeShClust's ability to cluster DNA sequences with high accuracy even when the sequence similarity parameter provided by the user is not very accurate.
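The core idea can be sketched in a few lines: represent each sequence by an alignment-free k-mer count vector, then run mean shift, which moves every point to the weighted mean of its neighbors until it settles on a mode; points that reach the same mode form one cluster. This is a minimal illustration of the technique, not MeShClust's implementation; the toy sequences, the dimer features, and the bandwidth value are all assumptions chosen for demonstration.

```python
import numpy as np
from itertools import product

def kmer_counts(seq, k=2):
    """Alignment-free feature vector: normalized counts of all 4**k k-mers."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    v = np.zeros(len(kmers))
    for i in range(len(seq) - k + 1):
        v[index[seq[i:i + k]]] += 1
    return v / max(1, len(seq) - k + 1)

def mean_shift(points, bandwidth=0.15, iters=50):
    """Shift each point to the Gaussian-weighted mean of all points
    until it converges to a mode (a cluster center)."""
    modes = points.copy()
    for _ in range(iters):
        for i, m in enumerate(modes):
            w = np.exp(-np.sum((points - m) ** 2, axis=1) / (2 * bandwidth ** 2))
            modes[i] = (w[:, None] * points).sum(axis=0) / w.sum()
    return modes

# Two pairs of similar toy sequences; each pair should share a mode.
seqs = ["ACGTACGTACGT", "ACGTACGAACGT", "GGGGCCCCGGGG", "GGGGCCCAGGGG"]
X = np.array([kmer_counts(s) for s in seqs])
modes = mean_shift(X)
same01 = np.allclose(modes[0], modes[1], atol=1e-3)  # same cluster
same23 = np.allclose(modes[2], modes[3], atol=1e-3)  # same cluster
```

Note how the bandwidth plays the role of the similarity parameter: unlike a greedy identity cutoff, points within a bandwidth-sized basin converge to the same mode even if the value is only roughly right.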

Author(s):  
Ye Lv ◽  
Guofeng Wang ◽  
Xiangyun Hu

At present, remote sensing is among the best technologies for gathering information about the Earth's surface, and it is very useful in geo-information updating and related applications. Extracting roads from remote sensing images is one of the biggest demands of rapid city development and has therefore become a hot research issue. Roads in high-resolution images are complex and vary greatly in pattern, which makes road extraction difficult. In this paper, a machine-learning-based strategy is presented that combines geometric, radiometric, topological, and texture features. Because high-resolution remote sensing images cover a large area of landscape, road extraction is slow; therefore, road ROIs are first detected using Hough line detection and a buffering method to narrow down the search area. As roads in high-resolution images normally form ribbon shapes, mean-shift and watershed segmentation methods are used to extract road segments. Then, the Real AdaBoost supervised machine learning algorithm is used to pick out segments that match road patterns. Finally, geometric shape analysis and morphological methods are used to prune and restore the whole road area and to detect road centerlines.
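The classification step of such a pipeline can be sketched with boosted decision stumps over per-segment features. This is a hedged stand-in, not the paper's code: the feature names (elongation, mean intensity, texture variance) and the synthetic data are assumptions, and scikit-learn's discrete AdaBoost is used in place of Real AdaBoost for illustration.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)
# Hypothetical per-segment features: [elongation, mean_intensity, texture_variance].
# Ribbon-shaped road segments tend to be elongated, bright, and smooth.
road  = rng.normal([8.0, 0.6, 0.05], [1.5, 0.05, 0.02], size=(200, 3))
other = rng.normal([2.0, 0.4, 0.20], [1.0, 0.10, 0.05], size=(200, 3))
X = np.vstack([road, other])
y = np.array([1] * 200 + [0] * 200)   # 1 = road segment, 0 = background

clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
# A long, bright, smooth segment should be kept; a short, rough one dropped.
keep = clf.predict([[7.5, 0.62, 0.04], [1.8, 0.35, 0.25]])
```

Segments the classifier rejects would then be removed before the morphological pruning and centerline-detection stages.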


2018 ◽  
Author(s):  
Benjamin T. James ◽  
Hani Z. Girgis

ABSTRACT
Grouping sequences into similar clusters is an important part of sequence analysis. Widely used clustering tools sacrifice quality for speed. Previously, we developed MeShClust, which utilizes k-mer counts in an alignment-assisted classifier and the mean-shift algorithm for clustering DNA sequences. Although MeShClust outperformed related tools in terms of cluster quality, the alignment algorithm used for generating training data for the classifier was not scalable to longer sequences. In contrast, MeShClust2 generates semi-synthetic sequence pairs with known mutation rates, avoiding alignment algorithms. MeShClust2 clustered 3600 bacterial genomes, providing a utility for clustering long sequences using identity scores for the first time.
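The key trick is that a training pair with a known identity score can be manufactured rather than aligned: mutate a copy of a sequence at a chosen rate, and the resulting identity is known by construction. A minimal sketch of that idea (substitutions only; the function names and the toy sequence are illustrative assumptions, not MeShClust2's code):

```python
import random

def mutate(seq, rate, seed=0):
    """Make a semi-synthetic copy of `seq` by substituting bases at a known rate."""
    rng = random.Random(seed)
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)

def identity(a, b):
    """Fraction of matching positions in two equal-length sequences
    (no alignment needed, since the copy was mutated in place)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

seq = "ACGT" * 250                 # 1000 bp toy sequence
pair = mutate(seq, rate=0.1)
score = identity(seq, pair)        # close to 0.9 by construction
```

Because the mutation rate fixes the expected identity, pairs like this can label training data for a classifier without ever running an alignment, which is what removes the scalability bottleneck for long sequences.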


2020 ◽  
Vol 218 ◽  
pp. 03046
Author(s):  
Jianguo Zhou ◽  
Renyang Liu ◽  
Zifeng Wu ◽  
Jintao Zhang ◽  
Junhui Liu

Discriminating which distal regulatory elements target a given gene is challenging, both for understanding gene regulation and for illustrating the causes of complex diseases. Among known distal regulatory elements, enhancers interact with a target gene's promoter to regulate its expression. Although many machine learning approaches can predict enhancer-promoter interactions (EPIs), global and precise prediction of EPIs at the genomic level still requires further exploration. In this paper, we develop an integrated EPI prediction method, called EpPredictor, with improved performance. Using various features of histone modifications, transcription factor binding sites, and DNA sequences across the human genome, a robust supervised machine learning algorithm, LightGBM, is introduced to predict EPIs. Across six different cell lines, our method effectively predicts EPIs and achieves better F1-score and AUC than other methods such as TargetFinder and PEP.


2013 ◽  
Vol 660 ◽  
pp. 190-195
Author(s):  
Zi Cheng Ren ◽  
Jaeho Choi ◽  
M. Ahmed ◽  
Jae Ho Choi

Object tracking has been researched for many years as an important topic in machine learning, robot vision, and many other fields. Over the years, various tracking methods have been proposed and developed to achieve better tracking performance. Among them, the mean-shift algorithm has proven robust and accurate compared to other algorithms across many kinds of tests. Due to its limitations, however, changes in an object's scale and rotational motion cannot be handled effectively; this problem occurs when the object of interest moves towards or away from the video camera. Improving on previously proposed methods such as scale- and orientation-adaptive mean-shift tracking, which handles scale changes well but not rotation, the method proposed in this paper modifies the continuously adaptive mean-shift tracking method so that it can effectively handle simultaneous changes in size and rotation. The simulation results show successful tracking of moving objects, even when the object undergoes scaling and rotation, in comparison with the conventional methods.
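The basic mean-shift tracking step these methods build on is simple: within a search window, move the window center to the weighted centroid of a per-pixel weight image (e.g. a color back-projection) and repeat until it settles on the mode. A minimal sketch with a synthetic weight image (the blob position, window size, and Gaussian weights are illustrative assumptions):

```python
import numpy as np

def mean_shift_window(weights, cx, cy, half=10, iters=30):
    """Mean-shift search: repeatedly move the window center to the
    weighted centroid of `weights` inside the window until it converges."""
    ys, xs = np.mgrid[0:weights.shape[0], 0:weights.shape[1]]
    for _ in range(iters):
        mask = (np.abs(xs - cx) <= half) & (np.abs(ys - cy) <= half)
        w = weights * mask
        total = w.sum()
        if total == 0:
            break
        cx, cy = (w * xs).sum() / total, (w * ys).sum() / total
    return cx, cy

# Toy "back-projection": an object blob centered at (x=70, y=40).
ys, xs = np.mgrid[0:100, 0:100]
weights = np.exp(-((xs - 70) ** 2 + (ys - 40) ** 2) / (2 * 5.0 ** 2))
cx, cy = mean_shift_window(weights, cx=55.0, cy=55.0)
```

Continuously adaptive variants extend exactly this loop: after convergence, the window's size and orientation are re-estimated from the second-order moments of `w`, which is what allows scale and rotation to be tracked.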


2019 ◽  
Vol 23 (1) ◽  
pp. 12-21 ◽  
Author(s):  
Shikha N. Khera ◽  
Divya

The information technology (IT) industry in India has been facing a systemic issue of high attrition in the past few years, resulting in monetary and knowledge-based losses to companies. The aim of this research is to develop a model that predicts employee attrition and gives organizations the opportunity to address issues and improve retention. A predictive model was developed based on a supervised machine learning algorithm, the support vector machine (SVM). Archival employee data (consisting of 22 input features) were collected from the Human Resource databases of three IT companies in India, including each employee's employment status (the response variable) at the time of collection. Accuracy results from the confusion matrix showed that the SVM model has an accuracy of 85 per cent. The results also show that the model performs better at predicting who will leave the firm than at predicting who will stay.
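The evaluation described here reduces to reading accuracy and per-class recall off a 2x2 confusion matrix. A worked sketch with illustrative counts (the matrix below is hypothetical, chosen only so that overall accuracy is 85 per cent and recall is higher for the leaving class, matching the pattern reported):

```python
import numpy as np

# Hypothetical confusion matrix: rows = actual, cols = predicted.
# Class 0 = "will stay", class 1 = "will leave".
cm = np.array([[300,  80],    # actual stay:  300 correct, 80 flagged as leaving
               [ 25, 295]])   # actual leave: 25 missed,   295 caught

accuracy     = np.trace(cm) / cm.sum()   # (300 + 295) / 700 = 0.85
recall_stay  = cm[0, 0] / cm[0].sum()    # how well "stay" is predicted
recall_leave = cm[1, 1] / cm[1].sum()    # how well "leave" is predicted
```

With counts like these, overall accuracy is 85 per cent while recall for leavers exceeds recall for stayers, which is what "performs better at predicting who will leave" means in confusion-matrix terms.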


Friction ◽  
2021 ◽  
Author(s):  
Vigneashwara Pandiyan ◽  
Josef Prost ◽  
Georg Vorlaufer ◽  
Markus Varga ◽  
Kilian Wasmer

Abstract
Functional surfaces in relative contact and motion are prone to wear and tear, resulting in loss of efficiency and performance of the workpieces/machines. Wear occurs in the form of adhesion, abrasion, scuffing, galling, and scoring between contacts. However, the rate of the wear phenomenon depends primarily on the physical properties and the surrounding environment. Monitoring the integrity of surfaces by offline inspections leads to significant wasted machine time. A potential alternative to the offline inspection currently practiced in industries is the analysis of sensor signatures capable of capturing the wear state and correlating it with the wear phenomenon, followed by in situ classification using a state-of-the-art machine learning (ML) algorithm. Though this technique is better than offline inspection, it possesses inherent disadvantages for training the ML models. Ideally, supervised training of ML models requires the datasets considered for classification to be of equal weightage to avoid biasing. The collection of such a dataset is very cumbersome and expensive in practice, as in real industrial applications the malfunction period is minimal compared to normal operation. Furthermore, classification models would not recognize new wear phenomena outside the normal regime if they are unfamiliar with them. As a promising alternative, in this work, we propose a methodology able to differentiate the abnormal regimes, i.e., wear phenomenon regimes, from the normal regime. This is carried out by familiarizing the ML algorithms only with the distribution of the acoustic emission (AE) signals, captured using a microphone, that correspond to the normal regime. As a result, the ML algorithms are able to detect whether some overlap exists with the learnt distributions when a new, unseen signal arrives. To achieve this goal, a generative convolutional neural network (CNN) architecture based on a variational auto-encoder (VAE) is built and trained.
During the validation procedure of the proposed CNN architectures, we were capable of identifying acoustic signals corresponding to the normal and abnormal wear regimes with accuracies of 97% and 80%, respectively. Hence, our approach shows very promising results for in situ and real-time condition monitoring, or even wear prediction, in tribological applications.
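The decision rule behind this kind of one-class monitoring can be sketched without a neural network: learn a compact model of the normal regime only, set a threshold from the training distribution of reconstruction errors, and flag any unseen signal whose error exceeds it. The sketch below uses PCA (a linear auto-encoder) as a stand-in for the paper's VAE, on synthetic feature vectors; the subspace, noise level, and 99th-percentile threshold are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for features of normal-regime acoustic windows:
# normal signals lie near a low-dimensional subspace plus small noise.
basis = rng.normal(size=(2, 20))
normal = rng.normal(size=(500, 2)) @ basis + 0.05 * rng.normal(size=(500, 20))

# "Train" only on the normal regime: fit its principal subspace.
mean = normal.mean(axis=0)
_, _, Vt = np.linalg.svd(normal - mean, full_matrices=False)
W = Vt[:2]                                # top-2 principal directions

def recon_error(x):
    z = (x - mean) @ W.T                              # encode
    return np.linalg.norm((x - mean) - z @ W, axis=-1)  # decode + residual

# Threshold from the normal training distribution (99th percentile here).
thresh = np.percentile(recon_error(normal), 99)

unseen_abnormal = rng.normal(size=20)     # off-subspace: an unfamiliar regime
is_abnormal = recon_error(unseen_abnormal) > thresh
```

A VAE replaces the linear encode/decode with a learned generative model, but the monitoring logic is the same: signals the normal-regime model cannot reconstruct well are flagged, including wear phenomena never seen during training.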


Genes ◽  
2021 ◽  
Vol 12 (4) ◽  
pp. 527
Author(s):  
Eran Elhaik ◽  
Dan Graur

In the last 15 years or so, soft selective sweep mechanisms have been catapulted from a curiosity of little evolutionary importance to a ubiquitous mechanism claimed to explain most adaptive evolution and, in some cases, most evolution. This transformation was aided by a series of articles by Daniel Schrider and Andrew Kern. Within this series, a paper entitled “Soft sweeps are the dominant mode of adaptation in the human genome” (Schrider and Kern, Mol. Biol. Evolut. 2017, 34(8), 1863–1877) attracted a great deal of attention, in particular in conjunction with another paper (Kern and Hahn, Mol. Biol. Evolut. 2018, 35(6), 1366–1371), for purporting to discredit the Neutral Theory of Molecular Evolution (Kimura 1968). Here, we address an alleged novelty in Schrider and Kern’s paper, i.e., the claim that their study involved an artificial intelligence technique called supervised machine learning (SML). SML is predicated upon the existence of a training dataset in which the correspondence between the input and output is known empirically to be true. Curiously, Schrider and Kern did not possess a training dataset of genomic segments known a priori to have evolved either neutrally or through soft or hard selective sweeps. Thus, their claim of using SML is thoroughly and utterly misleading. In the absence of legitimate training datasets, Schrider and Kern used: (1) simulations that employ many manipulatable variables and (2) a system of data cherry-picking rivaling the worst excesses in the literature. These two factors, in addition to the lack of negative controls and the irreproducibility of their results due to incomplete methodological detail, lead us to conclude that all evolutionary inferences derived from so-called SML algorithms (e.g., S/HIC) should be taken with a huge shovel of salt.


2016 ◽  
Vol 348 ◽  
pp. 198-208 ◽  
Author(s):  
Youness Aliyari Ghassabeh ◽  
Frank Rudzicz
