Text Analytics

Text Analytics is the process of converting unstructured text data into meaningful data for analysis. It is used to measure customer opinions, ratings, product reviews and feedback, and to provide search facilities, sentiment analysis and entity modeling in support of fact-based decision making. Text analysis uses many linguistic, statistical, and machine learning techniques. It involves retrieving information from unstructured data, structuring the input text to derive patterns and trends, and evaluating and interpreting the output data. It also involves lexical analysis, categorization, clustering, pattern recognition, tagging, annotation, information extraction, link and association analysis, visualization, and predictive analytics. Text Analytics determines keywords, topics, categories, semantics and tags from the millions of text documents available in an organization in different files and formats. The term Text Analytics is roughly synonymous with text mining.

Text analytics software solutions provide tools, servers, analytic algorithm-based applications, and data mining and extraction tools for converting unstructured data into meaningful data for analysis. The outputs (extracted entities, facts and relationships) are generally stored in relational, XML, or other data warehousing applications for analysis by other tools, such as business intelligence, big data analytics or predictive analytics tools.



A piece of text which has been rearranged into a visual pattern of words. Below is a wordle of the Wikipedia article about Oral History.


A wordle is a visual depiction of the words contained in a piece of text, as exemplified above. Generated by a web-based tool of the same name, a wordle is created by manipulating the words of an input text and arranging them into a kind of graphic. The more frequently a particular word occurs in the source text, the bigger it is displayed in the wordle. Font and colour variation, besides adding visual appeal, may also give weight to particular words, which are positioned vertically as well as horizontally.
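The relation between word frequency and display size can be illustrated with a minimal Python sketch. It is not the Wordle tool's actual code: the function name word_sizes, the font size range and the linear scaling are all assumptions made for illustration.

```python
from collections import Counter
import re


def word_sizes(text, min_size=10, max_size=60):
    """Map each word to a font size that grows with its frequency.

    Hypothetical helper: linear scaling between min_size and max_size
    is an assumption, not the scaling used by the Wordle tool.
    """
    # Split into lowercase words made of letters only.
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(words)
    if not counts:
        return {}
    low = min(counts.values())
    top = max(counts.values())
    # Avoid division by zero when every word occurs equally often.
    span = (top - low) or 1
    return {
        word: min_size + (count - low) * (max_size - min_size) / span
        for word, count in counts.items()
    }


sizes = word_sizes("new nation new america new nation")
# "new" occurs most often and gets the largest size,
# "america" occurs least often and gets the smallest.
```

A layout step would then place the words on a canvas using these sizes; that part is left out here.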

Wordles, sometimes also referred to as word clouds or text clouds, have become popular for visualizing the topical content of textual information such as the transcripts of (oral history) interviews, political speeches and other spoken content. For an example, check out a 2009 article in the New York Times, which features a wordle of President Obama's inaugural speech, and note the emphasis on words such as America, new, nation and every. If you want to have a go at creating your own wordles, check out www.wordle.net.

Background – wordle

The Wordle tool, and thereby the word wordle, is the brainchild of Jonathan Feinberg, a senior software engineer at IBM. The tool's popularity and usefulness have led to its nomination for a 'Webby' (an international award given to people involved in web design and web-based media).

Although a Wordle is in principle language independent, it makes sense to pre-process the input text according to language-dependent rules. Stop words like "is", "the", "a", "an", "then", "to", "have", etc. are used in many phrases and, in general, do not carry much interesting information. To avoid boring Wordles, stop words need to be removed for each recognized language. Either the Wordle algorithm needs to identify the language, or this information needs to be provided by the user.

To identify the language, the Wordle algorithm selects the 50 most frequent words from the text and counts how many of them appear in each language’s list of stop words. Whichever stop word list has the highest hit count is considered to be the text’s language. The stop word list of that language can then be used to remove the stop words.
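The language guess described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Wordle source code: the names STOP_WORDS, guess_language and remove_stop_words are hypothetical, and the two stop word lists are tiny samples, whereas real lists are much longer.

```python
from collections import Counter
import re

# Tiny sample stop word lists (assumption: real lists are far longer).
STOP_WORDS = {
    "english": {"the", "is", "a", "an", "then", "to", "have", "and", "of", "in"},
    "german": {"der", "die", "das", "und", "ist", "ein", "eine", "zu", "nicht", "mit"},
}


def tokenize(text):
    """Split text into lowercase words made of letters only."""
    return re.findall(r"[^\W\d_]+", text.lower())


def guess_language(text, stop_words=STOP_WORDS, top_n=50):
    """Pick the language whose stop word list matches the most
    of the top_n most frequent words in the text."""
    counts = Counter(tokenize(text))
    frequent = {word for word, _ in counts.most_common(top_n)}
    hits = {lang: len(frequent & words) for lang, words in stop_words.items()}
    # On a tie, max() returns the first language in the dictionary.
    return max(hits, key=hits.get)


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop the stop words of the guessed language from the token list."""
    language = guess_language(text, stop_words)
    return [word for word in tokenize(text) if word not in stop_words[language]]


english = guess_language("The cat is on the mat and the dog is in the house")
german = guess_language("Der Hund ist nicht in dem Haus und die Katze ist mit der Maus")
remaining = remove_stop_words("The cat is on the mat")
```

The remaining words would then be counted and laid out as the wordle. Very short texts can yield few or no stop word hits, in which case letting the user supply the language is the safer choice.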

A good explanation of how his software works can be found on Feinberg's homepage.