Pattern Discovery in Biomolecular Data
Latest Publications



Published By Oxford University Press

9780195119404, 9780197561256

Author(s):  
Bruce A. Shapiro ◽  
Wojciech Kasprzak

Genomic information (nucleic acid and amino acid sequences) completely determines the characteristics of the nucleic acid and protein molecules that express a living organism’s function. One of the greatest challenges in which computation is playing a role is the prediction of higher order structure from the one-dimensional sequence of genes. Rules for determining macromolecule folding have been continually evolving. Specifically in the case of RNA (ribonucleic acid) there are rules and computer algorithms/systems (see below) that partially predict and can help analyze the secondary and tertiary interactions of distant parts of the polymer chain. These successes are very important for determining the structural and functional characteristics of RNA in disease processes and in the cell life cycle. It has been shown that molecules with the same function have the potential to fold into similar structures though they might differ in their primary sequences. This fact also illustrates the importance of secondary and tertiary structure in relation to function. Examples of such constancy in secondary structure exist in transfer RNAs (tRNAs), 5S RNAs, 16S RNAs, viroid RNAs, and portions of retroviruses such as HIV. The secondary and tertiary structure of tRNA Phe (Kim et al., 1974), of a hammerhead ribozyme (Pley et al., 1994), and of Tetrahymena (Cate et al., 1996a, 1996b) have been shown by their crystal structure. Currently little is known of tertiary interactions, but studies on tRNA indicate these are weaker than secondary structure interactions (Riesner and Romer, 1973; Crothers and Cole, 1978; Jaeger et al., 1989b). It is very difficult to crystallize large RNA molecules or to obtain nuclear magnetic resonance spectrum data for them. Therefore, a logical place to start in determining the 3D structure of RNA is computer prediction of the secondary structure. The sequence (primary structure) of an RNA molecule is relatively easy to produce.
Because experimental methods for determining RNA secondary and tertiary structure (when the primary sequence folds back on itself and forms base pairs) have not kept pace with the rapid discovery of RNA molecules and their function, use of and methods for computer prediction of secondary and tertiary structures have increasingly been developed.
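
As a rough illustration of what computer prediction of secondary structure involves, the sketch below counts the maximum number of nested base pairs a sequence can form. It uses Nussinov-style base-pair maximization, a classic textbook dynamic program, and not the algorithms described in this chapter. The function name `max_base_pairs`, the minimum hairpin loop of three unpaired bases, and the inclusion of G-U wobble pairs are assumptions made for the example.

```python
# Nussinov-style base-pair maximization (textbook method, swapped in for
# illustration; it ignores the energy rules real predictors rely on).
ALLOWED_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"),
                 ("G", "U"), ("U", "G")}  # assumption: wobble pairs allowed


def max_base_pairs(seq, min_loop=3):
    """Return the maximum number of nested base pairs in seq.

    min_loop is the assumed minimum number of unpaired bases that must
    separate the two partners of a pair (hairpin loop size).
    """
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] = best pair count for the subsequence seq[i..j]
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: position j stays unpaired
            # case 2: position j pairs with some k, leaving >= min_loop between
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in ALLOWED_PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]
```

For the toy hairpin `GGGAAACCC`, the three G-C pairs close a loop of three adenines, so the function returns 3. Counting pairs is only a crude stand-in for a thermodynamic score, and nested pairs alone cannot represent tertiary interactions.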


Author(s):  
Diane J. Cook ◽  
Lawrence B. Holder

The large amount of data collected today is quickly overwhelming researchers’ abilities to interpret the data and discover interesting patterns. In response to this problem, a number of researchers have developed techniques for discovering concepts in databases. These techniques work well for data expressed in a nonstructural, attribute-value representation and address issues of data relevance, missing data, noise and uncertainty, and utilization of domain knowledge (Fisher, 1987; Cheeseman and Stutz, 1996). However, recent data acquisition projects are collecting structural data describing the relationships among the data objects. Correspondingly, there exists a need for techniques to analyze and discover concepts in structural databases (Fayyad et al., 1996b). One method for discovering knowledge in structural data is the identification of common substructures. The goal is to find substructures capable of compressing the data and to identify conceptually interesting substructures that enhance the interpretation of the data. Substructure discovery is the process of identifying concepts describing interesting and repetitive substructures within structural data. Once discovered, the substructure concept can be used to simplify the data by replacing instances of the substructure with a pointer to the newly discovered concept. The discovered substructure concepts allow abstraction over detailed structure in the original data and provide new, relevant attributes for interpreting the data. Iteration of the substructure discovery and replacement process constructs a hierarchical description of the structural data in terms of the discovered substructures. This hierarchy provides varying levels of interpretation that can be accessed based on the goals of the data analysis. We describe a system called Subdue that discovers interesting substructures in structural data based on the minimum description length (MDL) principle. 
The Subdue system discovers substructures that compress the original data and represent structural concepts in the data. By replacing previously discovered substructures, multiple passes of Subdue produce a hierarchical description of the structural regularities in the data. Subdue uses a computationally bounded inexact graph match that identifies similar, but not identical, instances of a substructure and finds an approximate measure of closeness of two substructures when under computational constraints.


Author(s):  
Jorja G. Henikoff

A block is an ungapped local multiple alignment of amino acid sequences from a group of related proteins. Ideally, the contiguous stretch of residues represented by a block is conserved for biological function. Blocks have depth (the number of sequences) and width (the number of aligned positions). There are currently several useful programs for finding blocks in a group of related sequences that I do not discuss in detail here. Among these, Motif (Smith et al., 1990) and Asset (Neuwald and Green, 1994) both align blocks on occurrences of certain types of patterns found in the sequences; Gibbs (Lawrence et al., 1993; Neuwald et al., 1995) and MEME (Bailey and Elkan, 1994) both look for statistically optimal local alignments; and Macaw (Schuler et al., 1991) and Somap (Parry-Smith and Attwood, 1992) both give the user assistance in finding blocks interactively. After candidate blocks are identified by a block-finding method, they can be evaluated and assembled into a set representing the protein group, resulting in a multiple alignment consisting of ungapped regions separated by unaligned regions of variable length. The block assembly process is the subject of this chapter. Both the Blocks (Henikoff and Henikoff, 1996a) and Prints (Attwood and Beck, 1994) databases consist of such sets of blocks and between them currently represent 1,163 different protein groups. These collections of blocks are more sensitive and efficient for classifying new sequences into known protein groups than are collections of individual sequences, as demonstrated by comprehensive evaluations (Henikoff and Henikoff, 1994b, 1997), by genomic studies (Green et al., 1993), and by individual studies (Posfai et al., 1988; Henikoff, 1992, 1993; Attwood and Findlay, 1993; Pietrokovski, 1994; Brown, 1995). 
Issues that must be addressed during block assembly include the number of blocks provided to the assembly module by the block finders, block width, the number of times a block occurs in each sequence (zero to many), overlap of blocks, and the order of multiple blocks within each sequence. Once these issues are decided, it is necessary to score individual competing blocks and then competing sets of blocks.
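
The definition of a block given above (an ungapped alignment with a depth and a width) can be made concrete with a small data structure. This is a minimal sketch only: the class name `Block`, its methods, and the example segments are hypothetical and are not taken from the Blocks or Prints software.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """An ungapped local multiple alignment: one equal-length segment per sequence."""
    segments: List[str]

    def __post_init__(self):
        if not self.segments:
            raise ValueError("a block needs at least one segment")
        if len({len(s) for s in self.segments}) != 1:
            raise ValueError("ungapped block: all segments must have equal length")
        if any("-" in s for s in self.segments):
            raise ValueError("ungapped block: gap characters are not allowed")

    @property
    def depth(self) -> int:
        """Number of sequences represented in the block."""
        return len(self.segments)

    @property
    def width(self) -> int:
        """Number of aligned positions."""
        return len(self.segments[0])

    def column_counts(self) -> List[Counter]:
        """Residue counts per aligned position, a common starting point for scoring."""
        return [Counter(column) for column in zip(*self.segments)]
```

With the made-up segments `["GDSGG", "GDSGS", "GNSGG"]`, the block has depth 3 and width 5, and the first column contains three glycines. How competing blocks are actually scored is the subject of the chapter and is not reproduced here.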


Author(s):  
Janice Glasgow ◽  
Evan Steeg

The field of knowledge discovery is concerned with the theory and processes involved in the representation and extraction of patterns or motifs from large databases. Discovered patterns can be used to group data into meaningful classes, to summarize data, or to reveal deviant entries. Motifs stored in a database can be brought to bear on difficult instances of structure prediction or determination from X-ray crystallography or nuclear magnetic resonance (NMR) experiments. Automated discovery techniques are central to understanding and analyzing the rapidly expanding repositories of protein sequence and structure data. This chapter deals with the discovery of protein structure motifs. A motif is an abstraction over a set of recurring patterns observed in a dataset; it captures the essential features shared by a set of similar or related objects. In many domains, such as computer vision and speech recognition, there exist special regularities that permit such motif abstraction. In the protein science domain, the regularities derive from evolutionary and biophysical constraints on amino acid sequences and structures. The identification of a known pattern in a new protein sequence or structure permits the immediate retrieval and application of knowledge obtained from the analysis of other proteins. The discovery and manipulation of motifs—in DNA, RNA, and protein sequences and structures—is thus an important component of computational molecular biology and genome informatics. In particular, identifying protein structure classifications at varying levels of abstraction allows us to organize and increase our understanding of the rapidly growing protein structure datasets. Discovered motifs are also useful for improving the efficiency and effectiveness of X-ray crystallographic studies of proteins, for drug design, for understanding protein evolution, and ultimately for predicting the structure of proteins from sequence data. 
Motifs may be designed by hand, based on expert knowledge. For example, the Chou-Fasman protein secondary structure prediction program (Chou and Fasman, 1978), which dominated the field for many years, depended on the recognition of predefined, user-encoded sequence motifs for α-helices and β-sheets. Several hundred sequence motifs have been cataloged in PROSITE (Bairoch, 1992); the identification of one of these motifs in a novel protein often allows for immediate function interpretation.


Author(s):  
David P. Yee ◽  
Tim Hunkapiller

The Human Genome Project was launched in the early 1990s to map, sequence, and study the function of genomes derived from humans and a number of model organisms such as mouse, rat, fruit fly, worm, yeast, and Escherichia coli. This ambitious project was made possible by advances in high-speed DNA sequencing technology (Hunkapiller et al., 1991). To date, the Human Genome Project and other large-scale sequencing projects have been enormously successful. The genomes of several microbes (such as Hemophilus influenzae Rd, Mycoplasma genitalium, and Methanococcus jannaschii) have been completely sequenced. The genome of bacteriophage T4 is complete, and the 4.6-megabase sequence of E. coli and the 13-megabase genome of Saccharomyces cerevisiae have also recently been completed. Seventy-one megabases of sequence from the nematode Caenorhabditis elegans are available. Six megabases of mouse and 60 megabases of human genomic sequence have been finished, which represent 0.2% and 2% of their respective genomes. Finally, more than 1 million expressed sequence tags derived from human and mouse complementary DNA expression libraries are publicly available. These public data, in addition to private and proprietary DNA sequence databases, represent an enormous information-processing challenge and data-mining opportunity. The need for common interfaces and query languages to access heterogeneous sequence databases is well documented, and several good systems are well underway to provide those interfaces (Woodsmall and Benson, 1993; Marr, 1996). Our own work on database and program interoperability in this domain and in computational chemistry (Gushing, 1995) has shown, however, that providing the interface is but the first step toward making these databases fully useful to the researcher. (Here, the term “database” means a collection of data in electronic form, which may not necessarily be physically deposited in a database management system [DBMS]. 
A scientist’s database could thus be a collection of flat files; where the term “database” instead means “data stored in a DBMS,” this is clear from the context.) Deciphering the genomes of sequenced organisms falls into the realm of analysis; there is now plenty of sequence data. The most common form of sequence analysis involves the identification of homologous relationships among similar sequences.


Author(s):  
Kentaro Tomii ◽  
Minoru Kanehisa

It is widely believed that the prediction of the 3D structure of a protein from its amino acid sequence is important because the structure helps in understanding the function. As the number of resolved protein structures increases, most predictive methods have come to rely on knowledge of the repertoire of 3D folds taken by actual proteins. We must emphasize, however, that this type of structure prediction, or fold recognition, concerns the overall folding of the polypeptide chain. Since two similar folds could be due to entirely different sequences and even two similar sequences could have different functions, it is unlikely that successful fold recognition will uncover any functional clue that cannot otherwise be obtained by sequence analysis alone. In contrast to the global feature of 3D folds, the concept of structural motifs or local structures is far more important in understanding protein function. It has been revealed that there are common local folding patterns that appear in many proteins of globally different structures and that are involved in conserved function. In addition, the local sequence patterns associated with these local structures are also often conserved, though the whole sequences can be quite different. At the supersecondary structure level there are, for example, βαβ-unit, EF hand, and helix-turn-helix motifs. Various dehydrogenases have a common structural motif called the Rossmann fold, which is composed of two consecutive βαβ-units, and most of those proteins also have the sequence motif GxGxxG around the nucleotide binding region (Wierenga and Hol, 1983; Wierenga et al., 1986). The EF hand consisting of the helix-loop-helix structure (Tufty and Kretsinger, 1975) occurs in many calcium-binding domains, and the residues that participate in ligand binding are well conserved. 
The helix-turn-helix motif that involves about 20 residues appears in a class of DNA-binding domains, and glycine tends to be conserved at a special position in the turn whose conformation corresponds to the left-handed helix. A number of known sequence motifs are registered in the motif libraries such as PROSITE (Bairoch and Bucher, 1994) that compile the relationships between sequence patterns and functions.
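
A sequence motif such as GxGxxG can be searched for with an ordinary regular expression, as the sketch below shows. It assumes the common convention that a lowercase `x` stands for any residue. The helper names `motif_to_regex` and `find_motif` and the test sequences are hypothetical, and full PROSITE syntax (residue sets, repeat counts, anchors) is not handled.

```python
import re


def motif_to_regex(motif):
    """Translate a simple motif such as 'GxGxxG' into a regular expression.

    Assumption: lowercase 'x' means "any residue"; every other character
    is matched literally.
    """
    return "".join("." if ch == "x" else re.escape(ch) for ch in motif)


def find_motif(sequence, motif):
    """Return the 0-based start positions of all (possibly overlapping) matches."""
    # A lookahead lets matches that overlap each other all be reported.
    pattern = re.compile("(?=" + motif_to_regex(motif) + ")")
    return [m.start() for m in pattern.finditer(sequence)]
```

For the made-up sequence `AAGAGTTGAA`, `find_motif(sequence, "GxGxxG")` reports a single match starting at position 2. A hit of this kind only flags a candidate nucleotide-binding region; a six-residue pattern with three wildcards will also match by chance in unrelated proteins.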


Author(s):  
Jason T. L. Wang ◽  
Thomas G. Marr

As the amount of biosequence data grows, it becomes increasingly important to develop new techniques for finding “knowledge” in the data. Pattern discovery is a fundamental operation in such applications. It attempts to find patterns in biosequences that can help scientists to analyze the properties of a sequence or predict the function of a new entity. The discovered patterns may also help to classify an unknown sequence, that is, assign the sequence to an existing family. In this chapter, we show how to discover active patterns in a set of protein sequences and classify an unlabeled DNA sequence. We use protein sequences as an example to illustrate our discovery algorithm, though the algorithm applies to sequences of any sort, including both protein and DNA. The patterns we wish to discover within a set of sequences are regular expressions of the form *X1 * X2 * ..., where X1, X2, ... are segments of a sequence, that is, subsequences made up of consecutive letters, and * represents a variable length don’t care (VLDC). In matching the expression *X1 * X2 * ... with a sequence S, the VLDCs may substitute for zero or more letters in S at zero cost. The dissimilarity measure used in comparing two sequences is the edit distance, that is, the minimum cost of edit operations used to transform one subsequence to the other after an optimal and zero-cost substitution for the VLDCs, where the edit operations include insertion, deletion, and change of one letter to another (Wagner and Fischer, 1974; K. Zhang et al., 1994). 
That is, we find a one-to-one mapping from each VLDC to a subsequence of the data sequence and from each pattern subsequence to a subsequence of the data sequence such that the following two conditions are satisfied: (i) The mapping preserves the left-to-right ordering (if a VLDC at position i in the pattern maps to a subsequence starting at position i1 and ending at position i2 in the data sequence, and a VLDC at position j in the pattern maps to a subsequence starting at position j1 and ending at position j2 in the data sequence, and i < j, then i2 < j2).
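
For the simplest pattern of this form, *X*, the distance described above reduces to the smallest edit distance between the segment X and any substring of the data sequence, because the two VLDCs absorb the rest of the sequence at zero cost. The sketch below computes that value with a standard approximate-matching dynamic program. The function name `vldc_distance` and the unit cost for every insertion, deletion, and substitution are assumptions; patterns with several segments are not handled.

```python
def vldc_distance(segment, sequence):
    """Distance between the pattern *segment* and sequence.

    The leading and trailing VLDCs match any prefix and suffix of the
    sequence for free, so the result is the minimum unit-cost edit
    distance between segment and some substring of sequence.
    """
    n = len(sequence)
    # Row 0: an empty segment matches at any position at zero cost
    # (this is the leading VLDC).
    prev = [0] * (n + 1)
    for i in range(1, len(segment) + 1):
        cur = [i] + [0] * n  # i deletions if no sequence letters are used
        for j in range(1, n + 1):
            substitute = prev[j - 1] + (segment[i - 1] != sequence[j - 1])
            delete = prev[j] + 1      # drop a letter of the segment
            insert = cur[j - 1] + 1   # absorb an extra sequence letter
            cur[j] = min(substitute, delete, insert)
        prev = cur
    # Taking the minimum over all end positions is the trailing VLDC.
    return min(prev)
```

With the toy sequence `MTRM`, the pattern *RM* matches exactly (distance 0), while *RR* needs one substitution (distance 1).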


Author(s):  
Timothy L. Bailey

We are in the midst of an explosive increase in the number of DNA and protein sequences available for study, as various genome projects come on line. This wealth of information offers important opportunities for understanding many biological processes and developing new plant and animal models, and ultimately drugs, for human diseases, in addition to other applications of modern biotechnology. Unfortunately, sequences are accumulating at a pace that strains present methods for extracting significant biological information from them. A consequence of this explosion in the sequence databases is that there is much interest and effort in developing tools that can efficiently and automatically extract the relevant biological information in sequence data and make it available for use in biology and medicine. In this chapter, we describe one such method that we have developed based on algorithms from artificial intelligence research. We call this software tool MEME (Multiple Expectation-maximization for Motif Elicitation). It has the attractive property that it is an “unsupervised” discovery tool: it can identify motifs, such as regulatory sites in DNA and functional domains in proteins, from large or small groups of unaligned sequences. As we show below, motifs are a rich source of information about a dataset; they can be used to discover other homologs in a database, to identify protein subsets that contain one or more motifs, and to provide information for mutagenesis studies to elucidate structure and function in the protein family as well as its evolution. Learning tools are used to extract higher level biological patterns from lower level DNA and protein sequence data. In contrast, search tools such as BLAST (Basic Local Alignment Search Tool) take a given higher level pattern and find all items in a database that possess the pattern. 
Searching for items that have a certain pattern is a problem intrinsically easier than discovering what the pattern is from items that possess it. The patterns considered here are motifs, which for DNA data can be subsequences that interact with transcription factors, polymerases, and other proteins.


Author(s):  
Aleksandar Milosavljević

The parsimony method for reconstruction of evolutionary trees (Sober, 1988) and the minimal edit distance method for DNA sequence alignments (e.g., Waterman, 1984) are both based on the principle of Occam’s Razor (e.g., Losee, 1980; also known as the Parsimony principle). This principle favors the most concise theories among the multitudes that can possibly explain observed data. The conciseness may be measured by the number of postulated mutations within an evolutionary tree, by the number of edit operations that transform one DNA sequence into the other, or by another implicit or explicit criterion. A very general mathematical formulation of Occam’s Razor has been proposed via minimal length encoding by computer programs (for recent reviews, see Cover and Thomas, 1991; Li and Vitányi, 1993). Algorithmic significance is a general method for pattern discovery based on Occam’s Razor. The method measures parsimony in terms of encoding length, in bits, of the observed data. Patterns are defined as datasets that can be concisely encoded. The method is not limited to any particular class of patterns; the class of patterns is determined by specifying an encoding scheme. To illustrate the method, consider the following unusual discovery experiment:

1. Pick a simple pseudorandom generator for digits from the set {0, 1, 2, 3}.
2. Pick a seed value for the generator and run it to obtain a sequence of 1000 digits; convert the digits to a DNA sequence by replacing all occurrences of digit 0 by letter A, 1 by G, 2 by C, and 3 by T.
3. Submit the sequence to a similarity search against a database containing a completely sequenced genome of a particular organism.

Assume that after an unspecified number of iterations of the three steps, with each iteration involving a different random generator or seed value or both, the search in the third step finally results in a genomic sequence highly similar to the query sequence. Does the genomic sequence contain a pattern? 
To argue for the presence of a pattern, one may directly apply the algorithmic significance method.
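
The first two steps of the experiment, and the kind of encoding-length comparison the method rests on, can be sketched as follows. Several things here are assumptions rather than details from the abstract: the particular linear congruential generator and its constants, the null encoding of 2 bits per base, and the bound that data encoded d bits shorter than under the null model arise by chance with probability at most 2^-d, which is the form in which the algorithmic significance result is usually stated.

```python
# Digit-to-base mapping as given in step 2 of the experiment.
DIGIT_TO_BASE = {0: "A", 1: "G", 2: "C", 3: "T"}


def lcg_digits(seed, count, a=1103515245, c=12345, m=2 ** 31):
    """Generate digits in {0, 1, 2, 3} with a simple linear congruential
    generator. The constants are an arbitrary illustrative choice."""
    state = seed
    digits = []
    for _ in range(count):
        state = (a * state + c) % m
        digits.append((state >> 16) % 4)  # use higher-order bits of the state
    return digits


def digits_to_dna(digits):
    """Convert generator digits to a DNA string."""
    return "".join(DIGIT_TO_BASE[d] for d in digits)


def saved_bits(null_bits, pattern_bits):
    """Bits saved by the pattern-based encoding relative to the null encoding."""
    return null_bits - pattern_bits


def significance_bound(null_bits, pattern_bits):
    """Assumed bound 2^-d on the chance of saving d bits, capped at 1."""
    return min(1.0, 2.0 ** (-saved_bits(null_bits, pattern_bits)))
```

Under these assumptions, a 1000-base sequence costs 2000 bits under the null encoding, whereas naming the generator and its seed takes far fewer bits. That large saving is what would support the claim that a highly similar genomic sequence contains a pattern, however implausible its origin.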


Author(s):  
Bin Li ◽  
Dennis Shasha

Biological pattern discovery problems are computationally expensive. A possible technique for reducing the time to perform pattern discovery is parallelization. Since each task in a biological pattern discovery application is usually time-consuming by itself, we might be able to use networks of workstations (NOWs) that communicate infrequently. Persistent Linda (PLinda) is a distributed parallel computing system that runs on NOWs and automatically uses idle workstations (Anderson and Shasha, 1992; Jeong, 1996). This means that labs can do parallel pattern discovery without buying new hardware. We propose a directed acyclic graph structure, the exploration dag (E-dag for short), to characterize computational models of biological pattern discovery applications. An E-dag can first be constructively formed from specifications of a pattern discovery problem; then an E-dag traversal is performed on the fly to solve the problem. When done in parallel, the process of E-dag construction and traversal efficiently solves pattern discovery problems. Parallel E-dag construction and traversal can be easily programmed in PLinda. Finding active motifs in sets of protein sequences and in multiple RNA secondary structures are two examples of biological pattern discovery. Before discussing the framework, we introduce these two applications and briefly describe their computational models. Consider a database of imaginary protein sequences D = {FFRR, MRRM, MTRM, DPKY, AVLG} and the query “Find the patterns P of the form *X* where P occurs in at least two sequences in D and the size of P satisfies |P| ≥ 2.” (X can be a segment of a sequence of any length, and * represents a variable length don’t care [VLDC].) The good patterns are *RR* (which occurs in FFRR and MRRM) and *RM* (which occurs in MRRM and MTRM). Pattern discovery in sets of sequences concerns finding commonly occurring subsequences (sometimes called motifs). 
The structures of the motifs we wish to discover are regular expressions of the form *S1 * S2 * ... where S1,S2,… are segments of a sequence, that is, subsequences made up of consecutive letters, and * represents a VLDC. In matching the expression *S1 * S2 * … with a sequence S, the VLDCs may substitute for zero or more letters in S.
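
The example query over D = {FFRR, MRRM, MTRM, DPKY, AVLG} can be answered by brute force, as sketched below. This exhaustive, single-process enumeration is only meant to make the query concrete. It is not the E-dag construction and traversal, it requires exact occurrences, and the function name `frequent_segments` and its parameters are hypothetical.

```python
from collections import defaultdict


def frequent_segments(database, min_length=2, min_occurrence=2):
    """Find segments X such that the pattern *X* occurs exactly in at least
    min_occurrence sequences and len(X) >= min_length.

    Returns a dict mapping each qualifying segment to the number of
    sequences that contain it.
    """
    support = defaultdict(set)  # segment -> indices of sequences containing it
    for index, sequence in enumerate(database):
        for start in range(len(sequence)):
            for end in range(start + min_length, len(sequence) + 1):
                support[sequence[start:end]].add(index)
    return {segment: len(indices)
            for segment, indices in support.items()
            if len(indices) >= min_occurrence}
```

Run on the five imaginary sequences, the function returns `RR` and `RM`, each with two supporting sequences, in agreement with the good patterns named in the text. The work grows quadratically with sequence length for each sequence, which is one reason to look for a structure that organizes and parallelizes the exploration.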

