Taming the text analytics challenge

Date: Jun 30, 2016

Text analytics: what is it, and why are we even discussing it? If we have written feedback about poor customer service, sent a mail to a bank asking for assistance, written a post on a social networking site, or reviewed a product on a blog, then we have provided fodder for analysis, albeit without structure. Analysing text means extracting high-quality information from this unstructured content, which is quite different from analysing structured data with well-defined attributes. It is generally believed that only 20% or less of the data in any organization is structured, the remaining 80% being unstructured. If businesses are getting value from analysing just the 20% that is structured using BI tools, imagine the potential the remaining 80% holds.

By ‘analysis’ of textual data, we mean any of the following (though the list is not exhaustive):

  • Information Retrieval
  • Sentiment analysis, also known as opinion mining
  • Text categorization / classification / topic identification
  • Text summarization
  • Named Entity extraction
  • Language detection
  • Speech Recognition
  • Analysis of the combination of structured and unstructured data

In each of these tasks, the textual content is converted into a quantifiable form, which in turn can be used for predictive modelling. Dictionaries also help with prediction. Some dictionaries are standard and easily available on the web (e.g. lists of positive and negative words for sentiment analysis, or dictionaries of parts of speech and common short forms); others are domain-specific and need to be built. Text analytics works best when statistical modelling is used in conjunction with dictionaries.
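To make ‘quantifiable form’ concrete, here is a minimal sketch of two common representations: a bag-of-words count and a dictionary-based sentiment score. The word lists and function names are made up for illustration; real sentiment dictionaries are far larger.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon, for illustration only.
POSITIVE = {"good", "great", "helpful", "quick"}
NEGATIVE = {"poor", "slow", "rude", "late"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def bag_of_words(text):
    """Map each token to the number of times it occurs in the text."""
    return Counter(tokenize(text))

def lexicon_sentiment(text):
    """Number of positive-dictionary hits minus negative-dictionary hits."""
    counts = bag_of_words(text)
    pos = sum(n for word, n in counts.items() if word in POSITIVE)
    neg = sum(n for word, n in counts.items() if word in NEGATIVE)
    return pos - neg

feedback = "Poor service, slow replies, but helpful staff"
features = bag_of_words(feedback)    # numeric features a model could consume
score = lexicon_sentiment(feedback)  # one positive hit, two negative hits
```

The counts in `features` could be fed to a statistical model, while `score` relies on the dictionary alone; the two approaches are usually combined.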

Current applications of text analytics include, among others:

  • Voice of Customer analysis / customer insights
  • Social media text analysis
  • Recruitment
  • Lead generation
  • Review sites including product, customer service, movie reviews
  • Identification of top trending topics

It is a wonder, at times, how simple text analytics techniques yield meaningful insights that would not be possible otherwise.


Let us now go through an in-house case study on the intelligent routing of emails to the appropriate customer care executive, titled ‘Mail categorization using Text Analytics’.

Background and Problem statement

The client is a provider of financial services to consumer and wholesale businesses. It has a comprehensive product suite that can cater to multiple financial needs of customers. Earlier, any email sent by the customer was manually categorized and handled by a customer care representative. The challenge was to replace this manual intervention with automatic routing of mails to the correct customer care representative.

The incoming text from the mail server had to be read and the relevant keywords identified. A corpus of keywords was built from a year’s worth of historical mails. The keywords were primarily used for classifying the mails. An algorithm was also developed to handle mails that did not contain any of the phrases or keywords in this dictionary. After classification, each category was to be handled by an assigned customer care executive through a CRM tool.


Theoretically, it was a straightforward process: read the textual data from the mails, convert it to a format usable by a classification algorithm suited to textual information, and obtain the required categories. However, as often happens, reality was very different from theory.

Some of the other points which needed to be considered upfront were:

  • Since this was supervised learning, the historical training data provided by the client needed to be correctly classified for the model to train well. However, this was not the case, so even the best model would not have produced the correct classification. One task, therefore, was to find the correct category for each of the 50,000 mails.
  • The text in the customers’ mails was not in an ideal format: it contained spelling errors, grammatical errors, short forms, etc. It needed thorough cleaning before it could be of any use to the algorithms being considered. Since the quality of the input determines the quality of the output, the input to the algorithm had to be made as clean as possible. A lot of time and energy was spent cleaning the data and getting it into a format the models could use to generate valid categories.
  • If the mail being classified was part of a mail trail (a chain of replies), the sender could have changed the topic during the exchange. Using such mails for training would dilute the categorization algorithm, since the same chain could contain multiple requests belonging to different categories.
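The cleaning and mail-trail points above can be sketched roughly as follows. This is not the code used in the project: the short-form dictionary and the markers used to detect quoted replies are assumptions made for illustration, and spelling correction is left out.

```python
import re

# Hypothetical short-form dictionary; a real one would be built from the mail corpus.
SHORT_FORMS = {
    "pls": "please",
    "acct": "account",
    "stmt": "statement",
    "asap": "as soon as possible",
}

# Assumed heuristic: lines that typically start the quoted part of a reply chain.
QUOTE_MARKERS = re.compile(
    r"^(>|-----Original Message-----|On .+ wrote:|From:)", re.IGNORECASE
)

def strip_quoted_trail(body):
    """Keep only the text above the first line that looks like quoted history."""
    kept = []
    for line in body.splitlines():
        if QUOTE_MARKERS.match(line.strip()):
            break
        kept.append(line)
    return "\n".join(kept)

def clean_mail(body):
    """Drop the quoted trail, lowercase, remove punctuation, expand short forms."""
    text = strip_quoted_trail(body).lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    words = [SHORT_FORMS.get(word, word) for word in text.split()]
    return " ".join(words)

raw = (
    "Pls send my acct stmt ASAP!!\n\n"
    "On Mon, Jun 6, 2016 Support wrote:\n"
    "> Your card has been dispatched."
)
cleaned = clean_mail(raw)
# cleaned == "please send my account statement as soon as possible"
```

Keeping only the newest message is one simple way to stop an older topic in the chain from influencing the category; whether that is appropriate depends on how customers actually reply.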

The client shared the categories into which the mails were to be classified, along with sample emails. Standard classification algorithms such as Support Vector Machines (SVM), Classification and Regression Trees (CART), and Random Forests, applied to the whole training dataset, gave an accuracy of between 55% and 65%, which was clearly not acceptable.

It was therefore necessary to look beyond the standard models and come up with a customized algorithm, used in conjunction with dictionaries, that suited the client’s needs.

Methodology – Creation of the algorithm

The end objective was to classify the mails into the correct category, each category being handled by a customer care representative. To streamline this task, the mails were first categorized into high-level categories and then segregated further into specific second-level sub-categories.

A subset of the training mails in each category was taken, and the most frequently occurring words or phrases (technically referred to as n-grams) were identified. These keywords helped identify which category a mail belonged to. In other words, dictionaries were created for the dataset, which helped the algorithms classify the mails on the basis of the specific ‘identified’ keywords present in each category.
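As a rough illustration of this dictionary-building step, the sketch below counts unigrams and bigrams per category and keeps the most frequent ones. The sample mails, the category names, the stop-word list and the `top_k` cut-off are all assumptions, not details from the project.

```python
import re
from collections import Counter, defaultdict

# Assumed stop-word list; common words like these carry no category signal.
STOP_WORDS = {"my", "the", "a", "to", "is", "of"}

def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_dictionaries(labelled_mails, max_n=2, top_k=3):
    """For each category, keep the top_k most frequent 1..max_n-grams with counts."""
    counts = defaultdict(Counter)
    for category, text in labelled_mails:
        tokens = [t for t in re.findall(r"[a-z]+", text.lower())
                  if t not in STOP_WORDS]
        for n in range(1, max_n + 1):
            counts[category].update(ngrams(tokens, n))
    return {cat: dict(c.most_common(top_k)) for cat, c in counts.items()}

# Made-up training sample: (category, mail text) pairs.
sample = [
    ("cards", "lost my credit card"),
    ("cards", "block my credit card"),
    ("loans", "home loan interest rate"),
    ("loans", "home loan statement"),
]
dictionaries = build_dictionaries(sample)
# dictionaries["cards"] holds "credit", "card" and "credit card", each with count 2
```

In practice the candidate keywords would still be reviewed by someone who knows the domain, since frequency alone can surface words that are common to every category.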

However, it was not possible to identify all the keywords that customers would use when sending mails. Hence, a model was also built to classify those mails that could not be classified using dictionaries alone.

Algorithm: A score was given to each keyword/phrase present in a category. The more often a word appeared in a category, the stronger its association with that category and the higher its score for that category. For an incoming mail, the sum of the association scores of the relevant words present in the mail was calculated for each category. The mail was assigned to the category with the highest score. This enabled the algorithm to classify emails with high accuracy.
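The scoring rule can be sketched in a few lines. The scores below are invented, and using raw keyword frequencies as association scores is an assumption; the exact scoring formula used in the project is not described here.

```python
import re

# Hypothetical association scores, e.g. how often each keyword occurred
# in the training mails of each category.
SCORES = {
    "cards": {"credit card": 40, "card": 55, "block": 12},
    "loans": {"home loan": 35, "loan": 50, "interest": 20},
}

def category_scores(mail_text, scores=SCORES):
    """Sum, per category, the scores of that category's keywords found in the mail."""
    # Normalise to space-separated lowercase words so phrases match on word boundaries.
    text = " " + " ".join(re.findall(r"[a-z]+", mail_text.lower())) + " "
    return {
        category: sum(score for phrase, score in keywords.items()
                      if f" {phrase} " in text)
        for category, keywords in scores.items()
    }

def classify(mail_text, scores=SCORES):
    """Return the highest-scoring category, or None if no keyword matched."""
    totals = category_scores(mail_text, scores)
    best = max(totals, key=totals.get)
    return best if totals[best] > 0 else None

totals = category_scores("Please block my credit card, it was stolen.")
# totals == {"cards": 107, "loans": 0}, so the mail goes to "cards"
label = classify("Please block my credit card, it was stolen.")
```

Returning `None` when nothing matches leaves room for a separate model to handle mails whose wording the dictionary does not cover.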

The end product was a combination of a dictionary and a model resulting in highly accurate classification.
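One way to picture the combination is a router that tries the dictionary first and hands the mail to a model only when no keyword matches. The kind of model used in the project is not stated here, so the sketch substitutes a deliberately simple nearest-neighbour rule based on Jaccard word overlap; the keywords and training mails are made up.

```python
import re

def tokens(text):
    """Set of lowercase alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def dictionary_classify(text, keywords):
    """Category with the most keyword hits, or None when nothing matches."""
    found = tokens(text)
    hits = {cat: len(found & words) for cat, words in keywords.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None

def nearest_neighbour_classify(text, labelled_mails):
    """Stand-in model: label of the training mail with the highest Jaccard overlap."""
    found = tokens(text)

    def jaccard(other):
        union = found | other
        return len(found & other) / len(union) if union else 0.0

    best_category, _ = max(labelled_mails, key=lambda pair: jaccard(tokens(pair[1])))
    return best_category

def route(text, keywords, labelled_mails):
    """Try the dictionary first; fall back to the model only if it abstains."""
    category = dictionary_classify(text, keywords)
    if category is not None:
        return category, "dictionary"
    return nearest_neighbour_classify(text, labelled_mails), "model"

# Hypothetical dictionary and training mails.
KEYWORDS = {"cards": {"card", "pin"}, "loans": {"loan", "emi"}}
TRAINING = [
    ("cards", "my plastic was swallowed by the atm"),
    ("loans", "need to change my monthly repayment date"),
]

by_dictionary = route("Please reset my card PIN", KEYWORDS, TRAINING)
by_model = route("The ATM swallowed my plastic", KEYWORDS, TRAINING)
```

Recording which path produced the label (`"dictionary"` or `"model"`) makes it easy to see which mails the dictionary misses and where new keywords may be needed.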

Maintenance: Every quarter, the data collected in the previous quarter (including the misclassified mails) is fed into the algorithm, which refines the model further and leads to better predictions. In other words, as more correctly classified data becomes available, the model improves and the machine (model) learns.

Benefits from this project

The project eliminated errors induced by bias, as there was no longer any dependence on manual categorization of emails. The time needed to reply to customers also came down, which improved customer service. The dashboards helped management identify which categories were the customers’ primary pain points. This made the company more competitive, reduced customer churn and increased overall purchases.

Happy to hear your thoughts and views.



