Computational Text Analysis
Latest Publications


TOTAL DOCUMENTS

11
(FIVE YEARS 0)

H-INDEX

0
(FIVE YEARS 0)

Published By Oxford University Press

9780198567400, 9780191916700

Author(s):  
Soumya Raychaudhuri

The genomics era has presented many new high throughput experimental modalities that are capable of producing large amounts of data on comprehensive sets of genes. In time there will certainly be many more new techniques that explore new avenues in biology. In any case, textual analysis will be an important aspect of the analysis. The body of the peer-reviewed scientific text represents all of our accomplishments in biology, and it plays a critical role in hypothesizing and interpreting any data set. To altogether ignore it is tantamount to reinventing the wheel with each analysis. The volume of relevant literature approaches proportions where it is all but impossible to manually search through all of it. Instead we must often rely on automated text mining methods to access the literature efficiently and effectively. The methods we present in this book provide an introduction to the avenues that one can employ to include text in a meaningful way in the analysis of these functional genomics data sets. They serve as a complement to the statistical methods such as classification and clustering that are commonly employed to analyze data sets. We are hopeful that this book will serve to encourage the reader to utilize and further develop text mining in their own analyses.


Author(s):  
Soumya Raychaudhuri

Successful use of text mining algorithms to facilitate genomics research hinges on the ability to recognize the names of genes in scientific text. In this chapter we address the critical issue of gene name recognition. Once gene names can be recognized in the scientific text, we can begin to understand what the text says about those genes. This is a much more challenging issue than one might appreciate at first glance. Gene names can be inconsistent and confusing; automated gene name recognition efforts have therefore turned out to be quite challenging to implement with high accuracy. Gene name recognition algorithms have a wide range of useful applications. Until this chapter we have been avoiding this issue and have been using only gene-article indices. In practice these indices are manually assembled. Gene name recognition algorithms offer the possibility of automating and expediting the laborious task of building reference indices. Article indices can be built that associate articles with genes based on whether or not the article mentions the gene by name. In addition, gene name recognition is the first step in doing more detailed sentence-by-sentence text analysis. For example, in Chapter 10 we will talk about identifying relationships between genes from text. Frequently, this requires identifying sentences referring to two gene names, and understanding what sort of relationship the sentence is describing between these genes. Sophisticated natural language processing techniques to parse sentences and understand gene function cannot be applied in a meaningful way without recognizing where the gene names are in the first place. The major concepts of this chapter are presented in the frame box. We begin by describing the commonly used strategies that can be used alone or in concert to identify gene names. At the end of the chapter we introduce one successful name finding algorithm that combines many of the different strategies.
There are several commonly used approaches that can be exploited to recognize gene names in text (Chang, Schutze et al. 2004). Often these approaches can be combined into even more effective multifaceted algorithms.


Author(s):  
Soumya Raychaudhuri

The most interesting and challenging gene expression data sets to analyze are large multidimensional data sets that contain expression values for many genes across multiple conditions. In these data sets the use of scientific text can be particularly useful, since there are a myriad of genes examined under vastly different conditions, each of which may induce or repress expression of the same gene for different reasons. There is an enormous complexity to the data that we are examining: each gene is associated with dozens if not hundreds of expression values as well as multiple documents built up from vocabularies consisting of thousands of words. In Section 2.4 we reviewed common gene expression strategies, most of which revolve around defining groups of genes based on common profiles. A limitation of many gene expression analytic approaches is that they do not incorporate comprehensive background knowledge about the genes into the analysis. We present computational methods that leverage the peer-reviewed literature in the automatic analysis of gene expression data sets. Including the literature in gene expression data analysis offers an opportunity to incorporate background functional information about the genes when defining expression clusters. In Chapter 5 we saw how literature-based approaches could help in the analysis of single condition experiments. Here we will apply the strategies introduced in Chapter 6 to assess the coherence of groups of genes to enhance gene expression analysis approaches. The methods proposed here could, in fact, be applied to any multivariate genomics data type. The key concepts discussed in this chapter are listed in the frame box. We begin with a discussion of gene groups and their role in expression analysis; we briefly discuss strategies to assign keywords to groups and strategies to assess their functional coherence.
We apply functional coherence measures to gene expression analysis; for examples we focus on a yeast expression data set. We first demonstrate how functional coherence can be used to focus in on the key biologically relevant gene groups derived by clustering methods such as self-organizing maps and k-means clustering.


Author(s):  
Soumya Raychaudhuri

Genes and proteins interact with each other in many complicated ways. For example, proteins can interact directly with each other to form complexes or to modify each other so that their function is altered. Gene expression can be repressed or induced by transcription factor proteins. In addition there are countless other types of interactions. They constitute the key physiological steps in regulating or initiating biological responses. For example, the binding of transcription factors to DNA triggers the assembly of the RNA assembly machinery that transcribes the mRNA that then is used as the template for protein production. Interactions such as these have been carefully elucidated and have been described in great detail in the scientific literature. Modern assays such as yeast-2-hybrid screens offer rapid means to ascertain many of the potential protein–protein interactions in an organism in a large-scale approach. In addition, other experimental modalities such as gene-expression array assays offer indirect clues about possible genetic interactions. One area that has been greatly explored in the bioinformatics literature is the possibility of learning genetic or protein networks, both from the scientific literature and from large-scale experimental data. Indeed, as we get to know more and more genes, it will become increasingly important to appreciate their interactions with each other. An understanding of the interactions between genes and proteins in a network allows for a meaningful global view of the organism and its physiology and is necessary to better understand biology. In this chapter we will explore methods either (1) to mine the scientific literature to identify documented genetic interactions and build networks of genes or (2) to confirm protein interactions that have been proposed experimentally.
Our focus here is on direct physical protein–protein interactions, though the techniques described could be extended to any type of biological interaction between genes or proteins. There are multiple steps that must be addressed in identifying genetic interaction information contained within the text. After compiling the necessary documents and text, the first step is to identify gene and protein names in the text.


Author(s):  
Soumya Raychaudhuri

The analysis of large-scale genomic data (such as sequences or expression patterns) frequently involves grouping genes based on common experimental features. The goal of manual or automated analysis of genomics data is to define groups of genes that have shared features within the data, and also have a common biological basis that can account for those commonalities. In utilizing algorithms that define groups of genes based on patterns in data it is critical to be able to assess whether the groups also share a common biological function. In practice, this goal is met by relying on biologists with an extensive understanding of diverse genes who decipher the biology accounting for genes with correlated patterns. They identify the relevant functions that account for experimental results. For example, experts routinely scan large numbers of gene expression clusters to see if any of the clusters are explained by a known biological function. Efficient definition and interpretation of these groups of genes is challenging because the number and diversity of genes exceed the ability of any single investigator to master. Here, we argue that computational methods can utilize the scientific literature to effectively assess groups of genes. Such methods can then be used to analyze groups of genes created by other bioinformatics algorithms, or actually assist in the definition of gene groups. In this chapter we explore statistical scoring methods that score the “coherence” of a gene group using only the scientific literature about the genes, that is, whether or not a common function is shared between the genes in the group. We propose and evaluate such a method, and compare it to some other possible methods. In the subsequent chapter, we apply these concepts to gene expression analysis. The major concepts of this chapter are described in the frame box. We begin by introducing the concept of functional coherence.
We describe four different strategies to assess the functional coherence of a group of genes. The final part of the chapter emphasizes the most effective of these methods, the neighbor divergence per gene. We present a discussion of its performance properties in general and on its robustness given imperfect groups. Finally we present an example of an application to gene expression array data.


Author(s):  
Soumya Raychaudhuri

Using algorithms to analyze natural language text is a challenging task. Recent advances in algorithms and increased availability of computational power and online text have resulted in incremental progress in text analysis (Rosenfeld 2000). For certain specific applications natural language processing algorithms can rival human performance. Even the simplest algorithms and approaches can glean information from the text and do it at a rate much faster than humans. In the case of functional genomics, where an individual assay might include thousands of genes, and tens of thousands of documents pertinent to those genes, the speed of text mining approaches offers a great advantage to investigators trying to understand the data. In this chapter, we will focus on techniques to convert text into simple numerical vectors to facilitate computation. Then we will go on to discuss how these vectors can be combined into textual profiles for genes; these profiles offer additional biologically meaningful information that can complement available genomics data sets. The previous chapter introduced methods to analyze gene expression data and sequence data. The focus of many analytical methods was comparing and grouping genes by similarity. Some sequence analysis methods like dynamic programming and BLAST offer opportunities to compare two sequences, while multiple sequence alignment and weight matrices provide a means to compare families of sequences. Similarly, gene expression array analysis approaches are mostly contingent on distance metrics that compare gene expression profiles to each other; clustering and classification algorithms provide a means to group similar genes. The primary goal of applying these methods was to transfer knowledge between similar genes. We can think of the scientific literature as yet another data type and define document similarity metrics.
Algorithms that tap the knowledge locked in the scientific literature require sophisticated natural language processing approaches. On the other hand, assessing document similarity is a comparatively easier task. A measure of document similarity that corresponds to semantic similarity between documents can also be powerful. For example, we might conclude that two genes are related if documents that refer to them are semantically similar.
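
The idea of converting text into numerical vectors and comparing documents can be illustrated with a minimal sketch. It assumes a plain word-count representation and cosine similarity, which is one common choice and not necessarily the weighting or metric used in the chapter; the function names `word_vector`, `cosine_similarity`, and `gene_profile` are hypothetical.

```python
import math
import re
from collections import Counter


def word_vector(text):
    """Turn text into a sparse word-count vector (lowercased, letters only)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def gene_profile(documents):
    """Assumed profile: the sum of the word vectors of a gene's documents."""
    profile = Counter()
    for doc in documents:
        profile.update(word_vector(doc))
    return profile
```

Under this sketch, two genes would be judged related when `cosine_similarity` of their profiles is high; in practice, stop-word removal and term weighting would usually be applied first.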


Author(s):  
Soumya Raychaudhuri

The overarching purpose of this chapter is to introduce the reader to some of the essential elements of biology, genomics, and bioinformatics. It is by no means a comprehensive description of these fields, but rather the bare minimum that will be necessary to understand the remainder of the book. In the first section we introduce the primary biological molecules: nucleic acids and proteins. We discuss genetic information flow in living beings and how genetic material in DNA is translated into functional proteins. In the second section we present a short primer on probability theory; we review some of the basic concepts. In the third section we describe how biological sequences are obtained and the common strategies employed to analyze them. In the fourth section, we describe the methods used to collect high throughput gene expression data. We also review the popular methods used to analyze gene expression data. There are many other important areas of functional genomics that we do not address at all in this chapter. New experimental and analytical methods are constantly emerging. For the sake of brevity we focused our discussion on the areas that are most applicable to the remainder of the book. But we note that many of the analytical methods presented here can be applied widely and without great difficulty to data types other than the ones they have been presented with. Here we present a focused review of molecular biology designed to give the reader a sufficient background to comprehend the remainder of the book. A thorough discussion is beyond the scope of this book and the interested reader is referred to other textbooks (Alberts, Bray et al. 1994; Stryer 1995; Nelson, Lehninger et al. 2000). The central dogma of molecular biology is a paradigm of information flow in living organisms (see Plate 2.1). Information is stored in the genomic deoxyribonucleic acid (DNA).
DNA polymerase, a protein that synthesizes DNA, can replicate DNA so that it can be passed on to progeny after cell division.


Author(s):  
Soumya Raychaudhuri

Recognizing specific biological concepts described in text is an important task that is receiving increasing attention in bioinformatics. To leverage the literature effectively, sophisticated data analysis algorithms must be able to identify key biological concepts and functions in text. However, biomedical text is complex and diverse in subject matter and lexicon. Very specialized vocabularies have been developed to describe biological complexity. In addition, using computational approaches to understand text in general has been a historically challenging subject (Rosenfeld 2000). In this chapter we will focus on the basics of understanding the content of biological text. We will describe common text classification algorithms. We demonstrate how these algorithms can be applied to the specific biological problem of gene annotation. But text classification is also potentially instrumental to many other areas of bioinformatics; we will see other applications in Chapter 10. There is great interest in assigning functional annotations to genes from the scientific literature. In one recent symposium 33 groups proposed and implemented classification algorithms to identify articles that were specifically relevant for gene function annotation (Hersh, Bhupatiraju et al. 2004). In another recent symposium, seven groups competed to assign Gene Ontology function codes to genes from primary text (Valencia, Blaschke et al. 2004). In this chapter we assign biological function codes to genes automatically to investigate the extent to which computational approaches can be applied to identify relevant biological concepts in text about genes directly. Each code represents a specific biological function such as “signal transduction” or “cell cycle”. The key concepts in this chapter are presented in the frame box. We introduce three text classification methods that can be used to associate functional codes to a set of literature abstracts.
We describe and test maximum entropy modeling, naive Bayes classification, and nearest neighbor classification. Maximum entropy modeling outperforms the other methods, and assigns appropriate functions to articles with an accuracy of 72%. The maximum entropy method provides confidence measures that correlate well with performance.
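
Of the three classifiers named above, naive Bayes is the simplest to sketch. The following is a minimal illustration, assuming a multinomial word model with add-one (Laplace) smoothing; it is not the chapter's implementation. The class name `NaiveBayesText`, the smoothing parameter `alpha`, and the toy training abstracts are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesText:
    """Multinomial naive Bayes with additive smoothing (assumed alpha=1)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_docs = Counter()              # documents seen per label
        self.word_counts = defaultdict(Counter)  # word counts per label
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_docs[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        return self

    def log_scores(self, text):
        """Unnormalized log posterior for each label."""
        total_docs = sum(self.class_docs.values())
        # Words never seen in training carry no evidence and are skipped.
        words = [w for w in tokenize(text) if w in self.vocab]
        scores = {}
        for label, n_docs in self.class_docs.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in words:
                score += math.log((counts[word] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical toy abstracts, for illustration only.
train_texts = [
    "receptor kinase signaling cascade",
    "kinase phosphorylation signaling",
    "cyclin mitosis checkpoint",
    "mitosis spindle cyclin division",
]
train_labels = ["signal transduction", "signal transduction",
                "cell cycle", "cell cycle"]
model = NaiveBayesText().fit(train_texts, train_labels)
```

Calling `model.predict(...)` on a new abstract returns the function code with the highest score; the gap between the top two scores gives a rough, uncalibrated indication of confidence.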


Author(s):  
Soumya Raychaudhuri

In this chapter we begin to address the issue of the analysis of gene expression data with the scientific literature. Here we describe methods for the analysis of a single experiment, one where a single expression measurement has been made for many genes within the same organism. In Chapter 7 we will address the analysis of larger data sets with multiple expression measurements for each of the genes; the questions that occur in that setting are often more complex and utilization of scientific text in that setting can be more useful. But focusing on a single series of expression measurements is an effective starting point in understanding the scientific literature and how it can be used with experimental data. The lessons here can be applied to a wide array of genomic assays besides gene arrays. These methods can be applied to any assay that assigns a single value to each gene. In addition, many investigators generate single-condition expression data sets, and these methods are widely applicable. One of the great difficulties in analyzing a single expression series is that context is lacking. That is, we have a large set of isolated measurements. Each measurement corresponds to the log of the relative ratio of a single gene’s expression in an experimental condition compared to its expression in a control condition. These measurements represent a single snapshot of a cell’s physiologic status. One of the great challenges is sorting out the physiologically important expression changes compared to random experimental and physiologic aberrations and fluctuations. Gene expression measurements are subject to a great amount of noise and distinguishing true positives from genes that are not truly induced or repressed is a great challenge. Typically, investigators use their knowledge of biology to prioritize likely positives. In this chapter we argue that text-mining approaches can be used to help prioritize these genes instead.
Another equally important challenge is to discern broadly what biological functions are active in a given experiment.
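
The log-ratio measurement described above can be made concrete with a small sketch. It assumes base-2 logarithms (a common convention, not stated in the text) and strictly positive intensities; the names `log_ratio` and `rank_by_change` and the gene labels in the usage note are hypothetical.

```python
import math


def log_ratio(experimental, control, base=2.0):
    """Log of the experimental/control expression ratio (assumed base 2)."""
    if experimental <= 0 or control <= 0:
        raise ValueError("intensities must be positive")
    return math.log(experimental / control, base)


def rank_by_change(intensities):
    """Sort genes by the magnitude of their log ratio, largest first.

    `intensities` maps a gene name to an (experimental, control) pair.
    """
    ratios = {gene: log_ratio(exp, ctl)
              for gene, (exp, ctl) in intensities.items()}
    return sorted(ratios.items(), key=lambda item: abs(item[1]), reverse=True)
```

For instance, `rank_by_change({"geneA": (8, 2), "geneB": (1, 8), "geneC": (3, 3)})` places the strongly repressed `geneB` first and the unchanged `geneC` last. Ranking by magnitude alone does not separate real changes from noise, which is the gap the text-based prioritization is meant to address.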


Author(s):  
Soumya Raychaudhuri

Text about genes can be effectively leveraged to enhance sequence analysis (MacCallum, Kelley et al. 2000; Chang, Raychaudhuri et al. 2001; McCallum and Ganesh 2003; Eskin and Agichtein 2004; Tu, Tang et al. 2004). Most of the emerging methods utilize textual representations similar to the one we introduced in the previous chapter. To analyze sequences, a numeric vector that contains information about the counts of different words in references about that sequence can be used in conjunction with the actual sequence information. Experienced biologists understand the value of using the information in scientific text during sequence searches, and commonly use scientific text and annotations to guide their intuition. For example, after a quick BLAST search, a trained expert might quickly look over the hits and their associated annotations and literature references and assess the validity of the hits. The apparently valid sequence hits can then be used to draw conclusions about the query sequence by transferring information from the hits. In most cases, the text serves as a proxy for structured functional information. High quality functional annotations that succinctly and thoroughly describe the function of a protein are often unavailable. Defining appropriate keywords for a protein requires a considerable amount of effort and expertise, and in most cases, the results are incomplete as there is an ever-growing collection of knowledge about proteins. So, one option is to use text to compare the biological function of different sequences instead. There are different ways in which the functional information in text could be used in the context of sequence analysis. One possibility is to first run a sequence analysis algorithm, and then to use text profiles to summarize or organize results. Functional keywords can be assigned to the whole group of hit sequences. Additionally, given a series of sequences, they can be grouped according to like function.
In either case, quick assessment of the content of text associated with sequences offers insight about exactly what we are seeing. These approaches are particularly useful if we are querying a large database of sequences with a novel sequence that we have very little information about.

