"Will Hadoop replace my current data warehouse if I adopt a Big Data strategy?" It's a question I've been asked before. The answer is no. Here’s one reason why. The BI analyst plays a different but complementary role to the data scientist.
According to Wikipedia:
In Chinese philosophy, the concept of yin-yang, literally meaning "shadow and light", is used to describe how polar opposites or seemingly contrary forces are interconnected and interdependent in the natural world. Yin and yang are not opposing forces (dualities), but complementary forces, that interact to form a greater whole, as part of a dynamic system.
I'd like to suggest that the functions of the traditional BI Analyst and Data Scientist are at opposite ends of the spectrum, and yet they are complementary, interconnected and interdependent in the business world. Yes, I’m going out on a bit of a limb here. Bear with me. The primary focus of the traditional BI analyst is reporting for decision support in business operations, and the primary focus of the data scientist is insight discovery for business transformation. Yes, I am making the roles black and white when it’s really a spectrum of gray, but I think the exercise has merit.
The traditional BI analyst is focused on reporting the 'one and only truth'. They support operational business decisions, where it's all about statements of fact, available instantly. Normally, the information analyzed comes from data generated in transactional systems; data that is highly structured. Think about it for a second. When reporting on earnings, a publicly traded company can't make a statement like the following:
"There's an 80% probability that we closed $3 billion in revenue last quarter, with a 30% chance that expenses were under $2 billion."
You have to report cold hard facts.
Data warehousing practices are designed to support the traditional BI analyst's role. Think of all the time spent to prepare data - cleanse it, structure it, document it, validate it. Data warehousing defines strict (I won't say rigid) practices so that users can be confident in the resulting 'truths'. Consider the fact that for years the BI community has debated the importance of a 'single version of the truth'. BI analysts and the data warehousing community have built a process and mindset around fact-based answers in order to best support decision making in business operations.
The data scientist, however, is focused on a different but interconnected problem and embraces the philosophy that there are 'many paths to wisdom'. There may not always be one right answer, and the answer usually comes down to statistical probabilities that are hard to pin down. Her goal is to find ways to transform the business, or parts of it, by discovering new insights from data. Her work follows an approach in which questions are asked, hypotheses are formulated, and then "proven" right or wrong. There are no guarantees that the question will be answered satisfactorily, and no guarantees about how long it will take.
A good data scientist approaches the problem with an open mind about what data, analytic techniques, or tools should be used, and even how to interpret data and its meaning. While a good data scientist also spends a lot of time preparing data (cleansing it, structuring it, documenting it, validating it), the strategies used may vary each time. In essence, the mindset of a data scientist combines a focus on questions, an open mind about where the answer lies, and flexibility in the approaches used.
Companies have invested heavily in data warehousing processes because they effectively ensure BI analysts have the ‘facts’ and can confidently report against them. They don’t need to throw out those processes and tools just because the underlying database cannot handle Big Data. They simply need to replace their traditional database with one designed for Big Data analytics.
On the other hand, Hadoop was designed to give users the flexibility to store and analyze all of their data any way they like. That is too much flexibility for a traditional BI analyst, but it makes Hadoop an ideal tool in the data scientist's tool belt, right beside the analytic database.
The BI analyst and data scientist complement each other. Hadoop and data warehousing environments do too. Together they form a 'greater whole'.