DATA MINING TOOLS FOR BIOLOGICAL SEQUENCES

We describe a methodology, as well as some related data mining tools, for analyzing sequence data. The methodology comprises three steps: (a) generating candidate features from the sequences, (b) selecting relevant features from the candidates, and (c) integrating the selected features to build a system to recognize specific properties in sequence data. We also give relevant techniques for each of these three steps. For generating candidate features, we present various types of features based on the idea of k-grams. For selecting relevant features, we discuss signal-to-noise, t-statistics, and entropy measures, as well as a correlation-based feature selection method. For integrating selected features, we use machine learning methods, including C4.5, SVM, and Naive Bayes. We illustrate this methodology on the problem of recognizing translation initiation sites. We discuss how to generate and select features that are useful for understanding the distinction between ATG sites that are translation initiation sites and those that are not. We also discuss how to use such features to build reliable systems for recognizing translation initiation sites in DNA sequences.
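
Step (a) of the methodology, generating k-gram candidate features around each ATG, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `up_`/`down_` feature prefixes, and the window size are all assumptions introduced here.

```python
from collections import Counter
from itertools import product


def kgram_counts(seq, k):
    """Count every overlapping k-gram in seq (an uppercase DNA string)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def candidate_atg_sites(seq):
    """Return the start index of every ATG in seq."""
    return [i for i in range(len(seq) - 2) if seq[i:i + 3] == "ATG"]


def atg_features(seq, pos, k=2, window=6):
    """Hypothetical feature map for the ATG that starts at index pos.

    Features are k-gram counts in the `window` bases upstream and
    downstream of the ATG, prefixed with 'up_' or 'down_'.
    """
    upstream = seq[max(0, pos - window):pos]
    downstream = seq[pos + 3:pos + 3 + window]
    features = {}
    for prefix, region in (("up_", upstream), ("down_", downstream)):
        counts = kgram_counts(region, k)
        # Emit every possible k-gram so all sites share the same feature set.
        for gram in map("".join, product("ACGT", repeat=k)):
            features[prefix + gram] = counts.get(gram, 0)
    return features
```

Calling `atg_features(seq, pos)` for each index returned by `candidate_atg_sites(seq)` yields one fixed-length feature dictionary per candidate site, which could then be passed to a feature selection step and a classifier.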

Mapping Biomolecular Sequences: Graphical Representations - their Origins, Applications and Future Prospects

Combinatorial Chemistry & High Throughput Screening ◽

10.2174/1386207324666210510164743 ◽

2021 ◽

Vol 24 ◽

Author(s):

Ashesh Nandy

Keyword(s):

DNA Sequences ◽

Graphical Representation ◽

Sequence Data ◽

Basic Unit ◽

Graphical Representations ◽

Biological Sequences ◽

Biological Sequence ◽

New Approach ◽

3D Space ◽

2D And 3D

The exponential growth in the depositories of biological sequence data has generated an urgent need to store, retrieve and analyse the data efficiently and effectively, a need that the standard practice of using alignment procedures cannot adequately meet because of its high demand on computing resources and time. Graphical representation of sequences has become one of the most popular alignment-free strategies for analysing biological sequences: each basic unit of a sequence (the bases adenine, cytosine, guanine and thymine for DNA/RNA, and the 20 amino acids for proteins) is plotted on a multi-dimensional grid. The resulting curve in 2D and 3D space, and the implied graph in higher dimensions, provide a perception of the underlying information of the sequences through visual inspection; numerical analyses of the plots, in geometrical or matrix terms, provide a measure of comparison between sequences and thus enable the study of sequence hierarchies. The new approach has also enabled comparisons of DNA sequences over many thousands of bases and provided new insights into the structure of the base compositions of DNA sequences. In this article we briefly review the origins and applications of graphical representations and highlight future perspectives in this field.
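
To make the idea of a 2D graphical representation concrete, the sketch below plots a DNA sequence as a walk on a grid and compares two sequences through a simple geometric descriptor. The axis assignment (A and G on opposite directions of the x-axis, C and T on opposite directions of the y-axis) is one possible convention assumed here for illustration, and the function names are hypothetical.

```python
import math

# Assumed axis assignment: one unit step per base.
STEPS = {"A": (-1, 0), "G": (1, 0), "C": (0, 1), "T": (0, -1)}


def walk_2d(seq):
    """Return the list of (x, y) points visited while reading seq.

    Raises KeyError on any symbol other than A, C, G, T.
    """
    x = y = 0
    points = []
    for base in seq.upper():
        dx, dy = STEPS[base]
        x += dx
        y += dy
        points.append((x, y))
    return points


def centroid(points):
    """Mean x and mean y of the plotted points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def descriptor_distance(seq_a, seq_b):
    """Euclidean distance between the centroids of two sequence walks."""
    (ax, ay) = centroid(walk_2d(seq_a))
    (bx, by) = centroid(walk_2d(seq_b))
    return math.hypot(ax - bx, ay - by)
```

Because the centroid is computed without aligning the sequences, `descriptor_distance` can compare sequences of different lengths, which is the property that makes such descriptors alignment-free.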

Identification of important features and data mining classification techniques in predicting employee absenteeism at work

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v11i5.pp4587-4596 ◽

2021 ◽

Vol 11 (5) ◽

pp. 4587

Author(s):

Amal Al-Rasheed

Keyword(s):

Data Mining ◽

Feature Selection ◽

Prediction Models ◽

Information Gain ◽

Feature Selection Method ◽

Performance Measurements ◽

Data Mining Technique ◽

Classification Techniques ◽

Mining Technique ◽

Correlation Based Feature Selection

Employee absenteeism at work costs organizations billions a year. Predicting employees' absenteeism and the reasons behind their absence helps organizations reduce expenses and increase productivity. Data mining turns the vast volume of human resources data into information that can help in decision-making and prediction. Although feature selection is a critical step in data mining for enhancing the efficiency of the final prediction, it is not yet known which feature selection method is better. Therefore, this paper aims to compare the performance of three well-known feature selection methods in absenteeism prediction: relief-based feature selection, correlation-based feature selection and information-gain feature selection. In addition, this paper aims to find the combination of feature selection method and data mining technique that best enhances absenteeism prediction accuracy. Seven classification techniques were used as prediction models. Additionally, a cross-validation approach was used to assess the applied prediction models and obtain more realistic and reliable results. The dataset used was built at a courier company in Brazil with records of absenteeism at work. In the experimental results, correlation-based feature selection surpassed the other methods across the performance measurements. Furthermore, the bagging classifier was the best-performing data mining technique when features were selected using correlation-based feature selection, with an accuracy rate of 92%.
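
Of the three feature selection methods compared, information gain is the simplest to illustrate. The sketch below ranks categorical features by how much they reduce the entropy of the class label. It is a generic illustration and not the paper's code; the function names and the row format (one dictionary per record) are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy after splitting on a feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_features(rows, labels):
    """Rank features by information gain, highest first.

    rows is a list of dicts mapping feature name to a categorical value.
    """
    scores = {
        name: information_gain([row[name] for row in rows], labels)
        for name in rows[0]
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

A feature whose values separate the classes perfectly scores the full label entropy, while a feature that is independent of the label scores zero; numeric features would need to be discretized before this calculation applies.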

The economics of selection of mail orders

Drs. Zahavi and Levin are the masterminds behind the development of AMOS, a customized predictive modeling system for the Franklin Mint in Philadelphia, and GainSmarts, a general purpose data mining system that is the two-time winner of the KDD-CUP competition for the best data mining tools (1997 and 1998) sponsored by the American Association for Artificial Intelligence.

Journal of Interactive Marketing ◽

10.1002/dir.1016.abs ◽

2001 ◽

Vol 15 (3) ◽

pp. 53

Author(s):

Nissan Levin ◽

Jacob Zahavi

Keyword(s):

Artificial Intelligence ◽

Data Mining ◽

Predictive Modeling ◽

American Association ◽

General Purpose ◽

Mining System ◽

Data Mining System ◽

Mining Tools ◽

Selection Of

Characterization of squalene synthase gene from Gymnema sylvestre R. Br.

Beni-Suef University Journal of Basic and Applied Sciences ◽

10.1186/s43088-020-00094-4 ◽

2021 ◽

Vol 10 (1) ◽

Author(s):

Kuldeepsingh A. Kalariya ◽

Ram Prasnna Meena ◽

Lipi Poojara ◽

Deepa Shahi ◽

Sandip Patel

Keyword(s):

DNA Sequences ◽

Genomic DNA ◽

Competitive Inhibition ◽

Sequence Data ◽

Homology Model ◽

Squalene Synthase ◽

Gymnema Sylvestre ◽

Gardenia Jasminoides ◽

Ramachandran Plots ◽

Flanking Regions

Abstract

Background: Squalene synthase (SQS) is a rate-limiting enzyme necessary to produce pentacyclic triterpenes in plants. It is an important enzyme that produces the squalene molecules required to run the steroidal and triterpenoid biosynthesis pathways, which work in competitive inhibition mode. Reports are available on the SQS gene in several plants, but detailed information on the SQS gene in Gymnema sylvestre R. Br. is not available. G. sylvestre is a priceless rare vine of the central eco-region known for its medicinally important triterpenoids. Our work aims to characterize the GS-SQS gene in this high-value medicinal plant.

Results: A coding DNA sequence (CDS) of 1245 bp representing the GS-SQS gene, predicted from transcriptome data in G. sylvestre, was used for further characterization. The SWISS protein structure modeled for the GS-SQS amino acid sequence data had a MolProbity score of 1.44 and a clash score of 3.86. The quality estimates and the statistical score of the Ramachandran plot analysis indicated that the homology model was reliable. For full-length amplification of the gene, primers designed from the flanking regions of the CDS encoding GS-SQS were used with genomic DNA as the template, which resulted in a single-band product of approximately 6.2 kb. This product was sequenced through NGS, generating 2.32 Gb of data and 3347 scaffolds with an N50 value of 457 bp. These scaffolds were compared to identify similarity with other SQS genes as well as with the GS-SQSs of the transcriptome. Scaffold_3347, representing the GS-SQS gene, harbored two introns of 101 and 164 bp. Both intronic regions were validated with primers designed from the regions of the scaffold immediately outside the introns. Amplification succeeded when the template was genomic DNA and failed when the template was cDNA, confirming the presence of two introns in the GS-SQS gene in Gymnema sylvestre R. Br.

Conclusion: This study shows that the GS-SQS gene is very closely related to those of Coffea arabica and Gardenia jasminoides and that the gene harbors two introns of 101 and 164 bp.
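
The N50 value quoted for the scaffolds is a standard assembly statistic: the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the total assembled bases. A minimal sketch of the calculation follows; the function name is an assumption, and the input is simply a list of scaffold lengths.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold or contig lengths.

    Scaffolds are taken from longest to shortest until their cumulative
    length reaches at least half of the total; the length of the last
    scaffold taken is the N50.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
```

A low N50 relative to the expected product size, as reported here, indicates that most of the assembled bases lie in short scaffolds.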

Hot-spot analysis of motorcyclist crashes involving fixed objects using multinomial logit and data mining tools

Journal of Transportation Safety & Security ◽

10.1080/19439962.2021.1898070 ◽

2021 ◽

pp. 1-19

Author(s):

Bahar Dadashova ◽

Chiara Silvestri-Dobrovolny ◽

Jayveersinh Chauhan ◽

Marcie Perez ◽

Roger Bligh

Keyword(s):

Data Mining ◽

Hot Spot ◽

Multinomial Logit ◽

Hot Spot Analysis ◽

Mining Tools ◽

Spot Analysis

Extracting knowledge from building-related data — A data mining framework

Building Simulation ◽

10.1007/s12273-013-0117-8 ◽

2013 ◽

Vol 6 (2) ◽

pp. 207-222 ◽

Cited By ~ 36

Author(s):

Zhun Yu ◽

Benjamin C. M. Fung ◽

Fariborz Haghighat

Keyword(s):

Data Mining ◽

Related Data

Molecular evolution of homologous gene sequences in germline-limited and somatic chromosomes of Acricotopus

Genome ◽

10.1139/g04-026 ◽

2004 ◽

Vol 47 (4) ◽

pp. 732-741 ◽

Cited By ~ 2

Author(s):

Wolfgang Staiber

Keyword(s):

Molecular Evolution ◽

DNA Sequences ◽

Sequence Data ◽

Structural Evolution ◽

5S rDNA ◽

Oligonucleotide Primer ◽

Homologous Gene ◽

Gene Sequences ◽

Nucleotide Substitutions ◽

Degenerate Oligonucleotide

The origin of germline-limited chromosomes (Ks) as descendants of somatic chromosomes (Ss) and their structural evolution was recently elucidated in the chironomid Acricotopus. The Ks consist of large S-homologous sections and of heterochromatic segments containing germline-specific, highly repetitive DNA sequences. Less is known about the molecular evolution and features of the sequences in the S-homologous K sections. More information was obtained by comparing homologous gene sequences of Ks and Ss. Genes for 5.8S, 18S, 28S, and 5S ribosomal RNA were chosen for the comparison and were therefore first isolated by PCR from somatic DNA of Acricotopus and sequenced. Specific K DNA was collected by microdissection of monopolar moving K complements from differential gonial mitoses and was then amplified by degenerate oligonucleotide primer (DOP)-PCR. With the sequence data of the somatic rDNAs, the homologous 5.8S and 5S rDNA sequences were isolated by PCR from the DOP-PCR sequence pool of the Ks. In addition, a number of K DOP-PCR sequences were directly cloned and analysed. One K clone contained a section of a putative N-acetyltransferase gene. Compared with its homolog from the Ss, the sequence exhibited few nucleotide substitutions (99.2% sequence identity). The same was true for the 5.8S and 5S sequences from Ss and Ks (97.5%–100% identity). This supports the idea that the S-homologous K sequences may be conserved and do not evolve independently from their somatic homologs. Possible mechanisms effecting such conservation of S-derived sequences in the Ks are discussed.

Key words: microdissection, DOP-PCR, germline-limited chromosomes, molecular evolution.
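
Sequence identity figures such as those reported above are typically computed from a pairwise alignment. The sketch below shows one way to do so for two already-aligned sequences. Conventions differ on how gaps are treated; this version assumes that columns containing a gap are skipped, and the function name is hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical bases over the gap-free alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # assumed convention: ignore gapped columns
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return 100.0 * matches / compared
```

Counting gapped columns as mismatches instead would lower the reported identity, so the convention in use should be stated alongside any percentage.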
