Monday, May 25, 2015

Lifecycle of the data analytics project

Data Analytics Project Lifecycle


What is different about data analytics projects


How are data analytics projects (those concerned with building the models used for prediction, decision making, classification, etc.) different from traditional software development projects? While the final deliverable is still typically some form of automated software system, the project stages are different.

First of all, there is a new essential role in such projects - the Data Scientist, a specialist whose skillset is not found on a common software development project team. This person is not a Software Engineer, a Business/System Analyst, or a System Architect. This is a professional who can make sense of data arrays, apply statistical and mathematical methods to datasets to identify the hidden relations within them, and finally validate the candidate models.

Since the success of the whole project depends heavily on the result of the Data Scientist's work, that work largely determines the lifecycle of the project.

Most popular models for data analytics projects lifecycle


There are several existing project lifecycle models for data science/analytics projects; I see the most significant of them as these:

1. CRISP-DM (CRoss Industry Standard Process for Data Mining), widely adopted by big players such as IBM Corporation with its SPSS Modeler product.
2. EMC Data Analytics Lifecycle developed by EMC Corporation.
3. SAS Analytical Lifecycle developed by SAS Institute Inc.

Generic data science related project lifecycle


While the popular models mentioned above use somewhat different terminology and propose different numbers of lifecycle phases, they have much in common. In general, the phases can be described as follows:

1. Everything begins with business domain analysis.
2. Datasets accumulated as a result of business operations are understood and prepared (extracted, transformed, normalized, cleaned up, etc.).
3. A model based on the datasets is planned and built.
4. The model is evaluated/validated (including communicating the results to upper management, as this is a business-value validation).
5. Operationalization and deployment of the model, including all required software development.

The lifecycle is iterative, and adjacent phases may themselves go through several iterations.
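The iterative flow of the five phases can be sketched in code. This is a minimal illustration only: every function and variable name below is hypothetical, standing in for real project activities, not any actual framework or API.

```python
# Hypothetical sketch of the generic lifecycle; all names are illustrative.

def understand_business_domain():
    # Phase 1: capture business goals and success criteria
    return {"goal": "reduce churn", "metric": "AUC"}

def prepare_data(raw_rows):
    # Phase 2: extract/transform/normalize/clean the accumulated datasets
    return [r for r in raw_rows if r is not None]

def build_model(dataset):
    # Phase 3: plan and build a candidate model (a trivial stand-in here)
    return {"trained_on": len(dataset)}

def evaluate_model(model, criteria):
    # Phase 4: validate the model against the business success criteria
    return model["trained_on"] > 0  # placeholder business-value check

def operationalize(model):
    # Phase 5: deploy the model; in practice this is where
    # traditional software development happens
    return f"deployed model trained on {model['trained_on']} rows"

def run_lifecycle(raw_rows, max_iterations=3):
    criteria = understand_business_domain()
    for _ in range(max_iterations):          # the lifecycle is iterative
        dataset = prepare_data(raw_rows)
        model = build_model(dataset)
        if evaluate_model(model, criteria):  # loop back if validation fails
            return operationalize(model)
    return None
```

The loop makes the iterative nature explicit: if validation fails, the project returns to data preparation and modeling rather than proceeding to deployment.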

As you can see, only the operationalization phase involves software development in its traditional form, while all the preceding phases belong to data science.
