What is text analytics / text mining?
While most of our data mining work still relates to things that can be represented by numbers, an increasing amount also requires text mining using natural language processing (NLP). But what does that mean and what’s involved?
Text mining involves applying analytics to the understanding of text. Typically we are interested in identifying things like:
- the number of times a particular concept (Kim Kardashian, terrorism, fraud…) occurs in a particular body of text and what this may show us about the level of interest in certain subjects among the text authors,
- the correlation between certain concepts in the text and what this may mean about the opinions of the text authors,
- the sentiment of the words used and what this may show us about the attitudes of the text authors towards the subjects being discussed.
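As a rough illustration of the first two points, here is a minimal Python sketch that counts how often each concept occurs and how many texts mention two concepts together. The concept list and the sample texts are made up for the example, and matching on single lower-cased words is a simplifying assumption; a real project would work with proper concept extraction rather than exact word matches.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical concept list, for illustration only.
CONCEPTS = ["fraud", "terrorism", "signal"]

def concepts_in(text):
    """Return {concept: count} for the concepts that occur in one text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return {c: counts[c] for c in CONCEPTS if counts[c] > 0}

def summarise(texts):
    """Total mentions per concept, plus the number of texts mentioning each pair."""
    totals = Counter()
    pairs = Counter()
    for text in texts:
        found = concepts_in(text)
        totals.update(found)  # adds the per-text counts to the running totals
        for a, b in combinations(sorted(found), 2):
            pairs[(a, b)] += 1
    return totals, pairs

# Made-up sample texts.
texts = [
    "Fraud is rising and fraud teams are stretched.",
    "No signal again. Is this fraud?",
    "The signal is fine today.",
]
totals, pairs = summarise(texts)
print(totals)
print(pairs)
```

The pair counts are only a starting point for looking at correlation between concepts; they say nothing on their own about what the authors think.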
How is text mining carried out?
This diagram outlines the text mining process and illustrates it using the example of a large telecoms company scanning social media feeds to identify incoming messages and classify them so they can be routed to the most relevant part of the organisation for action.
In this case the incoming message is from Facebook. Two actions are applied to it: sentiment analysis, to assess whether it is positive or negative, and parsing with a “part of speech” (POS) tagger. These tools may be proprietary or open source depending on the software being used. The tagger adds structure to the text, which is then used during the analysis phase. In this case the sentiment is strongly negative.
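To give a feel for the sentiment step, the sketch below scores a message against a small word list. This is only one simple approach (a lexicon lookup with basic negation handling), not necessarily what any particular product does. The lexicon, the thresholds for "strongly" negative or positive, and the sample message are all assumptions made for the example.

```python
import re

# Hypothetical sentiment lexicon: word -> score. Real lexicons are far larger.
LEXICON = {"great": 2, "good": 1, "poor": -1, "terrible": -2, "awful": -2}
NEGATORS = {"not", "no", "never"}

def sentiment_score(text):
    """Sum the lexicon scores of the words, flipping a score after a negator."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        value = LEXICON.get(token, 0)
        if i > 0 and tokens[i - 1] in NEGATORS:
            value = -value  # "not good" counts as negative
        score += value
    return score

def sentiment_label(score):
    """Map a numeric score to a label; the thresholds are arbitrary."""
    if score <= -2:
        return "strongly negative"
    if score < 0:
        return "negative"
    if score == 0:
        return "neutral"
    if score < 2:
        return "positive"
    return "strongly positive"

# Made-up message, standing in for the incoming Facebook post.
message = "Terrible signal at home again, awful service."
print(sentiment_label(sentiment_score(message)))
```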
The next stage is “concept extraction” in which concepts are extracted from the tagged text.
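One simple way to picture concept extraction is to keep only the nouns from the tagged text. The sketch below assumes the tagger has already run and uses hand-written Penn Treebank style tags (NN for a singular noun, NNS for a plural noun) in place of real tagger output; the sample tokens are invented.

```python
# Hand-tagged tokens standing in for the output of a POS tagger.
tagged_message = [
    ("No", "DT"), ("signal", "NN"), ("in", "IN"),
    ("my", "PRP$"), ("street", "NN"), ("again", "RB"),
]

def extract_concepts(tagged_tokens):
    """Return the lower-cased nouns, treating any NN* tag as a noun."""
    return [word.lower() for word, tag in tagged_tokens if tag.startswith("NN")]

concepts = extract_concepts(tagged_message)
print(concepts)
```

Treating every noun as a candidate concept is a deliberate simplification; real extractors typically also handle multi-word phrases and names.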
A dictionary can then be used to match a particular word or concept to a particular subject deemed to be of interest, for example network, product…
These dictionaries are very domain-specific, and it is in the iterative creation of a dictionary that expertise in the subject matter of the text is critical. In this case the concept “signal” has been linked with the subject “network” and so this message can be routed to the network team within the organisation for action.
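The dictionary lookup and routing step can be sketched in a few lines of Python. Only the link from “signal” to “network” comes from the example above; the other dictionary entries, the "general" fallback and the sample messages are assumptions added for illustration.

```python
import re
from collections import Counter

# Concept -> subject. Only "signal" -> "network" comes from the example above;
# the remaining entries are hypothetical.
SUBJECT_DICTIONARY = {
    "signal": "network",
    "coverage": "network",
    "handset": "product",
}

def route(text, default="general"):
    """Pick the subject whose concepts appear most often in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    subjects = Counter(
        SUBJECT_DICTIONARY[token] for token in tokens if token in SUBJECT_DICTIONARY
    )
    if not subjects:
        return default  # nothing matched, so fall back to a catch-all queue
    return subjects.most_common(1)[0][0]

print(route("No signal in my street again"))
print(route("Hello there"))
```

In practice the dictionary would be built up iteratively, with unmatched messages reviewed to decide which new concepts to add.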
If you’d like to discuss how Red Olive can help you with your text mining goals, please contact us here or by calling us on +44 1256 831100.
In the next text mining posting we will look at a real example taken from publicly available data: the UK government’s Hansard data relating to two politicians, one from the governing Conservative party (right of centre) and one from the opposing Labour party (left of centre).