In recent months we have seen growing interest in predictive analytics from our clients, so we have decided to run a series of blog postings to explain what predictive analytics is, how it is used and so on. This posting is the fifth in the series and the first to look in more detail at some of the techniques involved in predictive analytics. It examines the question “What are the main steps involved in carrying out predictive analytics?” and contains more technical content than the previous postings in this series, so it is likely to be of more interest to analysts than to business managers.
A sound methodology is one of the critical components of a successful predictive analytics project. There are a number of recognised approaches, some of which are proprietary. This posting explores the CRISP-DM process and its phases because, although it has its limitations, it is one of the most widely deployed approaches, and following it has been shown to increase the likely success of a predictive analytics or data mining project. In a later blog posting we will go on to explain how Red Olive’s methodology for predictive analytics further enhances CRISP-DM.
CRISP-DM (CRoss Industry Standard Process for Data Mining) helps analysts focus on solving specific business problems with measurable goals.
It is a non-proprietary approach developed by a large consortium of organisations including DaimlerChrysler, Teradata and SPSS, with contributions from over 300 other companies. CRISP-DM was designed to be applicable in many different circumstances, and its structured phases help to ensure that a project is conducted in a reliable and repeatable way.
The overall CRISP-DM process outlined:
The diagram below outlines the overall process, which involves six main phases:
The blue outer circle illustrates the iterative and incremental nature of predictive analytics itself: over time the models need to be refreshed to take into account changes in the business environment, and they can be further enhanced as greater insight is gained.
The six main phases within CRISP-DM are:
1 Business Understanding involves clarifying the business aims for the project, converting these into a predictive analytics problem definition and designing a preliminary plan to achieve the objectives.
2 Data Understanding involves the analyst carrying out an initial data collection, becoming familiar with the data collected and identifying any data quality problems that need to be addressed.
3 Data Preparation involves selecting the specific modelling techniques to be applied and then getting the data into a form suitable for the modelling to be carried out. For example, different techniques such as CHAID or logistic regression can be used to solve different problems (there will be more about this in a future posting in this series). The preparation steps are then to select a subset of the data for analysis, cleanse the data to address quality issues and transform the data into a usable form.
4 Modelling involves applying the selected modelling techniques.
5 Evaluation involves checking whether the model properly achieves the business objectives and using the results to identify whether some important business issue has been missed.
6 Deployment involves generating model reports and releasing the tested model into the organisation’s decision-making process.
The two-way arrows between certain phases in the diagram show particularly close relationships where some iteration may be required.
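To make the Data Preparation phase a little more concrete, here is a minimal sketch in plain Python of the three preparation steps described above: selecting a subset, cleansing and transforming. The customer records, field names and function names are all hypothetical and invented for illustration. Filling gaps with the mean and scaling to a 0 to 1 range are just one possible choice of cleansing and transformation; the right choices depend on the data and on the modelling technique selected.

```python
from statistics import mean

# Hypothetical raw customer records; None marks a missing value.
raw_records = [
    {"customer_id": 1, "region": "north", "age": 34, "spend": 120.0, "churned": 0},
    {"customer_id": 2, "region": "south", "age": None, "spend": 80.0, "churned": 1},
    {"customer_id": 3, "region": "north", "age": 51, "spend": None, "churned": 0},
    {"customer_id": 4, "region": "east", "age": 42, "spend": 200.0, "churned": 1},
    {"customer_id": 5, "region": "south", "age": 29, "spend": 40.0, "churned": None},
]


def select_subset(records, fields, target):
    """Keep only the chosen fields, and only rows where the target is known."""
    return [
        {name: row[name] for name in fields + [target]}
        for row in records
        if row[target] is not None
    ]


def cleanse(records, numeric_fields):
    """Fill missing numeric values with the mean of the observed values."""
    cleaned = [dict(row) for row in records]  # copy so the input is untouched
    for name in numeric_fields:
        observed = [row[name] for row in cleaned if row[name] is not None]
        fill_value = mean(observed)
        for row in cleaned:
            if row[name] is None:
                row[name] = fill_value
    return cleaned


def transform(records, numeric_fields, categorical_field):
    """Scale numeric fields to the range 0..1 and one-hot encode one field."""
    categories = sorted({row[categorical_field] for row in records})
    bounds = {
        name: (min(row[name] for row in records), max(row[name] for row in records))
        for name in numeric_fields
    }
    prepared = []
    for row in records:
        out = {key: value for key, value in row.items() if key != categorical_field}
        for name in numeric_fields:
            low, high = bounds[name]
            out[name] = (row[name] - low) / (high - low) if high > low else 0.0
        for category in categories:
            column = f"{categorical_field}_{category}"
            out[column] = 1 if row[categorical_field] == category else 0
        prepared.append(out)
    return prepared


subset = select_subset(raw_records, ["region", "age", "spend"], "churned")
cleaned = cleanse(subset, ["age", "spend"])
prepared = transform(cleaned, ["age", "spend"], "region")
```

Keeping each step as its own function mirrors the way CRISP-DM separates the tasks, and makes it easier to repeat a single step when a later phase sends you back to this one.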
Things to watch out for when using the process in practice – what practical difference does this all make?
It is all too easy to jump into a project without a clear understanding of the business problem to be addressed, and to end up answering a question that is not the one the business had in mind and that is of little business value. Symptoms include the “discovery” of relationships in the data which are trivial and were already well understood by the business. A better understanding of the business problem enables variables to be chosen in a way that helps to avoid these problems.
Once understood, the business problem has to be translated into a predictive analytics problem. Using a structured process helps because it breaks the problem down into clear, bite-sized steps.
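As an illustration of what that translation might look like, the sketch below records a problem definition with a measurable goal and then checks a set of model scores against it, as would happen in the Evaluation phase. This is not part of CRISP-DM itself: the `ProblemDefinition` class, the `precision_in_top_fraction` helper, the churn scenario, the threshold and the scores are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProblemDefinition:
    """Hypothetical record linking a business aim to a measurable goal."""
    business_aim: str
    target: str
    population: str
    success_metric: str
    success_threshold: float

    def is_met(self, measured_value: float) -> bool:
        """Return True when the measured value reaches the agreed threshold."""
        return measured_value >= self.success_threshold


def precision_in_top_fraction(scores, outcomes, fraction):
    """Share of positive outcomes among the highest-scoring fraction of cases."""
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, int(len(ranked) * fraction))
    return sum(outcome for _, outcome in ranked[:top_n]) / top_n


# Business Understanding: agree the aim and how success will be measured.
definition = ProblemDefinition(
    business_aim="Reduce customer churn through targeted retention offers",
    target="churned_within_90_days",
    population="Active customers with at least six months of history",
    success_metric="precision among the top 20% of scored customers",
    success_threshold=0.5,
)

# Evaluation: hypothetical model scores and observed outcomes (1 = churned).
scores = [0.91, 0.15, 0.78, 0.40, 0.66, 0.05, 0.83, 0.22, 0.57, 0.31]
outcomes = [1, 0, 1, 0, 0, 0, 1, 0, 1, 0]

measured = precision_in_top_fraction(scores, outcomes, fraction=0.2)
objective_met = definition.is_met(measured)
```

Writing the definition down before any modelling starts gives the Evaluation phase something concrete to test against, rather than leaving the judgement of success until the results are in.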
The range of techniques available is extensive and, as with formulating the correct question, choosing the correct technique in a particular situation takes experience. Using a clear process helps particularly where the business question is complex, because multiple models may be involved and the process helps to keep them from being confused.
The industry standard process outlined in this blog posting contains a number of weaknesses. In the next posting we will highlight those weaknesses and explain how Red Olive overcomes them to increase productivity.