Text analytics involves extracting features of meaning from natural language texts and making them explicit, much as markup does. It uses linguistics, AI, and statistical methods to get at a level of "meaning" that markup generally does not: down in the leaves of what to XML may be unanalyzed "content". This suggests potential for new kinds of error, consistency, and quality checking. However, text analytics can also discover features that markup is used for; this suggests that text analytics can also contribute to the markup process itself.
Perhaps the simplest example of text analytics' potential for checking, is xml:lang. Language identification is well-developed technology, and xml:lang attributes "in the wild" could be much improved. More interestingly, the distribution of named entities (people, places, organizations, etc.), topics, and emphasis interacts closely with documents' markup structures. Summaries, abstracts, conclusions, and the like all have distinctive features which can be measured.
This paper provides an overview of how text analytics works, what it can do, and how that relates to the things we typically mark up in XML. It also discuss the trade-offs and decisions involved in just what we choose to mark up, and how that interacts with automation. It presents several specific ways that text analytics can help create, check, and enhance XML components, and exemplifies some cases using a high-volume analytics tool.