Text Analytics

Text Analytics is the process of converting unstructured text data into meaningful data for analysis: to measure customer opinions, product reviews and ratings, and feedback; to provide search facilities; and to support sentiment analysis and entity modeling for fact-based decision making. Text analysis uses many linguistic, statistical, and machine learning techniques. Text Analytics involves retrieving information from unstructured data, structuring the input text to derive patterns and trends, and evaluating and interpreting the output data. It also involves lexical analysis, categorization, clustering, pattern recognition, tagging, annotation, information extraction, link and association analysis, visualization, and predictive analytics. Text Analytics extracts keywords, topics, categories, semantics, and tags from the millions of text records that an organization holds in different files and formats. The term Text Analytics is roughly synonymous with text mining.

Text analytics software solutions provide tools, servers, analytic-algorithm-based applications, and data mining and extraction tools for converting unstructured data into meaningful data for analysis. The outputs, which are extracted entities, facts, and relationships, are generally stored in relational, XML, or other data warehousing applications for analysis by other tools, such as business intelligence, big data analytics, or predictive analytics tools.

Text Mining and NLP

Text mining is the process of converting unstructured or semi-structured oral and written data into structured data for exploration, analysis, and interpretation.

NLP, or Natural Language Processing, is the overall term for the processing, exploration, and analysis of data using computational linguistic tools and approaches. Other methodologies involve statistical tools and machine learning. NLP tools that are commonly used to facilitate text processing and information extraction include spell checkers/correctors, tokenizers, stop word removers, stemmers, lemmatizers, POS taggers, chunkers and sentence splitters, syntactic parsers, thesauri and ontologies, and keyword extractors; a minimal sketch of such a pipeline follows below.
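To illustrate how several of these tools fit together, here is a minimal sketch using the Python NLTK library. The sample sentence is invented, and the resource names passed to nltk.download are NLTK's standard identifiers (they may vary slightly across NLTK versions):

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    # One-time downloads of the required NLTK models and word lists.
    nltk.download("punkt", quiet=True)
    nltk.download("stopwords", quiet=True)
    nltk.download("wordnet", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    text = "The interviews were transcribed and the transcripts were analysed."

    # Tokenization: split the raw text into individual word tokens.
    tokens = nltk.word_tokenize(text)

    # Stop word removal: drop high-frequency function words.
    stops = set(stopwords.words("english"))
    content = [t for t in tokens if t.isalpha() and t.lower() not in stops]

    # Stemming and lemmatization: reduce each token to a base form.
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()
    print([stemmer.stem(t) for t in content])        # e.g. 'transcribed' -> 'transcrib'
    print([lemmatizer.lemmatize(t) for t in content])

    # POS tagging: label each token with its part of speech.
    print(nltk.pos_tag(tokens))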

Automated speech recognition (ASR) tools partially replace manual transcription. They can mark silences and recognize different speakers in, for example, an interview. An important benefit is that ASR converts spoken language into text that can be aligned with the spoken fragments. Emotion recognition detects positive and negative feelings, taking into account silences, tone/pitch, and role-taking.

Named entity recognizers (NER) extract named entities, i.e. proper nouns such as person names, names of organisations, and geographical terms, but also dates, percentages, and numbers.
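As a sketch of what NER output looks like in practice, here is a small example using the spaCy library. It assumes the small English model has been installed beforehand (python -m spacy download en_core_web_sm); the sample sentence is invented:

    import spacy

    # Load spaCy's small English pipeline, which includes an NER component.
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("Jonathan Feinberg joined IBM in New York in 2008, "
              "where roughly 25% of his time went into side projects.")

    # Each recognized entity carries a label such as PERSON, ORG, GPE,
    # DATE, or PERCENT.
    for ent in doc.ents:
        print(ent.text, ent.label_)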

Other information extraction (IE) tools detect keywords, generate frequency lists and summaries, produce word clouds, and categorise documents. There are also excellent tools for discovering (unexpected) patterns, such as concordances (KWIC, Keywords in Context) and correlations; a small sketch of both follows below.
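Frequency lists and KWIC concordances are simple enough to sketch in plain Python. The helper names and the sample text below are illustrative only:

    from collections import Counter

    def frequency_list(tokens, n=10):
        """Return the n most frequent tokens with their counts."""
        return Counter(t.lower() for t in tokens).most_common(n)

    def kwic(tokens, keyword, window=4):
        """Keywords in Context: each hit shown with its surrounding words."""
        hits = []
        for i, t in enumerate(tokens):
            if t.lower() == keyword.lower():
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                hits.append(f"{left:>30} | {t} | {right}")
        return hits

    tokens = "the history of oral history is itself an oral tradition".split()
    print(frequency_list(tokens))
    for line in kwic(tokens, "oral"):
        print(line)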

Wordle

A piece of text that has been rearranged into a visual pattern of words. Below is a wordle of the Wikipedia article about Oral History.

[Figure: wordle of the Wikipedia article about Oral History]

A wordle is a visual depiction of the words contained in a piece of text, as exemplified by the figure above. Generated by a web-based tool of the same name, a wordle is created by manipulating the words of an input text and arranging them into a kind of graphic. The more frequent a particular word is within the source text, the bigger it is displayed in the wordle. Font and colour variation, as well as adding visual appeal, may also give weight to particular words, which are positioned vertically as well as horizontally.
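Wordle's exact scaling is not documented here, but the core idea, mapping each word's frequency to a display size, can be sketched with a simple linear mapping. The function name, size range, and sample text are assumptions for illustration:

    from collections import Counter

    def font_sizes(text, min_size=10, max_size=72):
        """Map each word's frequency to a font size, linearly between
        min_size and max_size."""
        counts = Counter(w.lower() for w in text.split() if w.isalpha())
        lo, hi = min(counts.values()), max(counts.values())
        span = hi - lo or 1  # avoid division by zero if all counts are equal
        return {w: min_size + (c - lo) * (max_size - min_size) / span
                for w, c in counts.items()}

    sizes = font_sizes("oral history oral tradition history history")
    print(sizes)  # 'history' gets the largest size, unique words the smallest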

Wordles, also sometimes referred to as word clouds or text clouds, have recently become a popular way to visualize the topical content of textual information such as the transcripts of (OH) interviews, political speeches, and other spoken content. For an example, see a 2009 article in the New York Times, which features a wordle of President Obama's inaugural speech; note the emphasis on words such as America, new, nation, and every. If you want to have a go at creating your own wordles, check out www.wordle.net.

Background – wordle

The Wordle tool, and thereby the word wordle, is the brainchild of Jonathan Feinberg, a senior software engineer at IBM. The tool's popularity and usefulness have led to its nomination for a 'Webby' (an international award given to people involved in web design and web-based media).

Although a Wordle is in principle language independent, it makes sense to pre-process the input text according to language-dependent rules. Stop words like "is", "the", "a", "an", "then", "to", "have", etc. are used in many phrases and, in general, do not contain a lot of interesting information. To avoid boring Wordles, stop words need to be removed for each recognized language. The Wordle algorithm therefore needs to identify the language, or this information needs to be provided by the user.

To identify the language, the Wordle algorithm selects the 50 most frequent words from the text and counts how many of them appear in each language's list of stop words. Whichever stop word list has the highest hit count is considered to be the text's language. Then, the language-dependent stop word list can be used to remove the stop words, as sketched below.
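A minimal sketch of this detection-and-removal step in Python, assuming toy stop word lists (real lists are far longer) and plain whitespace tokenization:

    from collections import Counter

    # Toy stop word lists for illustration; real lists contain hundreds of words.
    STOP_WORDS = {
        "english": {"the", "is", "a", "an", "to", "have", "and", "of"},
        "dutch":   {"de", "het", "een", "en", "van", "is", "te", "dat"},
    }

    def detect_language(text, top_n=50):
        """Pick the language whose stop word list overlaps most with the
        text's top_n most frequent words."""
        counts = Counter(w for w in text.lower().split() if w.isalpha())
        frequent = {w for w, _ in counts.most_common(top_n)}
        return max(STOP_WORDS, key=lambda lang: len(frequent & STOP_WORDS[lang]))

    def remove_stop_words(text, lang):
        """Drop the detected language's stop words before drawing the wordle."""
        return [w for w in text.split() if w.lower() not in STOP_WORDS[lang]]

    text = "The history of oral history is a history of the spoken word"
    lang = detect_language(text)
    print(lang, remove_stop_words(text, lang))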

A good explanation of the way his software works can be found on his homepage.