In the last couple of years “Big Data” has become the buzzword of the worldwide Business Intelligence scene. And not only there: top management consultancies have joined the IT crowd in praising the relevance of Big Data, pointing out the untapped strategic potential of Big Data processing and analysis.

Having spent more than half of my professional career on business intelligence topics, many times even evangelizing them, I am somewhat enthused by these positive headlines on Big Data and analytics. But looking at them from a more sober perspective, I strongly feel that this new hype needs to be put into perspective.

Yes, it is true that we are facing an unprecedented data deluge, caused by the democratization of computing within and outside businesses and the resulting digitization of information and communication in the upcoming “All-IP world”. The worldwide data volume is predicted to increase ninefold over the next five years – mainly due to the continued expansion of Social Media and Rich Media (especially video streams) as well as geospatial and real-time sensor data feeds as the “Internet of Things” evolves.

As I will argue in this post, however, most of these seemingly infinite data streams can be safely ignored from an analytical point of view. “Sometimes a cigar is just a cigar,” Sigmund Freud once said. With regard to our topic we might continue: “and data are just manifestations of operational functions without further business value”. From a processing or storage perspective the picture of course looks different. New types of data management infrastructure are in fact needed to deal with Big Data. But these technical requirements need to be evaluated in light of their potential business impact. The real challenge is to identify those Big Data use cases that significantly improve business propositions and processes. In other words: we need to move from a technical to a business-driven view of the Big Data challenge.

Before explaining this conclusion in more detail, I would like to take a step back and have a look at the definition and the history of Big Data.

The McKinsey Global Institute suggests using the expression Big Data for “datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze.” One might find it surprising that the research institute of a business consulting firm uses such a technical explanation. According to this definition, Big Data disappears as soon as databases can handle larger datasets.

Gartner offers a more fruitful definition. For them the term Big Data is “used to acknowledge the exponential growth, availability and use of information in the data-rich landscape of tomorrow”. According to Gartner the key dimensions of this trend are increasing data volume, variety (data from different sources) and velocity (data describing movements in time). These three dimensions are in fact very useful when it comes to determining Big Data-related business requirements.

IDC introduces data quality as an additional characteristic of Big Data: besides the rise of new types of structured data (geospatial information, data feeds from sensors), unstructured data (e.g. video streams) and semi-structured data (such as texts in blog posts or on Social Media platforms) will drive the data explosion in the coming years.
In order to fully understand the Big Data phenomenon, I suggest adding the “data generation context” as an explanatory dimension. Most of what we call Big Data is an operational byproduct of digitization. This kind of data might be turned into information through analysis, but its primary raison d’être is operational. Most digital communication (person-to-machine, machine-to-machine), for example, is being stored without any clear intention to analyze it – if we leave players like Google aside for a moment. Together with “intentionally” created data points, e.g. in the context of customer profiling (for instance through registration) and KPI measurement, it becomes the object of analysis. The same is true for the digital traces of business processes and applications, e.g. in CRM or sales support systems. These data mostly reside in various operational databases and applications. Making them accessible for analytical purposes is a challenge in its own right (we will come back to this aspect later).

Another implication of my proposed definition is that Big Data should not be seen merely as a phenomenon of the present and future. Digitization, which is almost ubiquitous today, did not happen overnight. Leading retailers, for instance, reached data volumes of petabytes before the “Big Data” label had even been coined. The increase of enterprise data volumes certainly accelerated at the beginning of the millennium with the take-off of the Internet in general and Social Media specifically. But the Internet is not the only Big Data driver. From the 90s onwards, IT systems started to produce unprecedented amounts of digital data. Consequently, many large brick-and-mortar corporations commenced experiments with high-volume processing and analysis in this period.

In those early days many of these initiatives were labeled “data mining”. They usually started as proof-of-concept projects in which the theoretical business case had to be confirmed with real-world data. Early adopters were large mail order, telecommunications and finance companies. This first generation of projects centered pretty much on basic transactional and contractual customer data. Early business case successes included:

  • Churn analysis: identifying customer profiles with a high affinity towards contract termination (see the sketch after this list)
  • One-to-one marketing and micro-segmentation in the context of customer acquisition and retention (incl. cross- and upselling), based on customer profile and preference analysis
  • Fraud detection based on typical fraud profiles
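
To make the churn example concrete, here is a minimal propensity-scoring sketch using logistic regression; the feature names and the toy data are purely illustrative assumptions, not drawn from any of the projects mentioned above.

```python
# Minimal churn-propensity sketch (illustrative only): score customers by
# their likelihood to terminate a contract, then rank them for retention offers.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical customer attributes; real projects would pull these from CRM/billing systems.
customers = pd.DataFrame({
    "tenure_months":     [3, 48, 7, 60, 12, 2, 36, 5],
    "support_calls":     [4, 0, 5, 1, 2, 6, 0, 3],
    "monthly_spend_eur": [20, 55, 25, 70, 30, 15, 60, 22],
    "churned":           [1, 0, 1, 0, 0, 1, 0, 1],   # historical outcome
})

features = ["tenure_months", "support_calls", "monthly_spend_eur"]
model = LogisticRegression().fit(customers[features], customers["churned"])

# Churn propensity for the existing base; the top of this ranking is the
# candidate list for retention campaigns -- if acting on it is economical.
customers["churn_propensity"] = model.predict_proba(customers[features])[:, 1]
print(customers.sort_values("churn_propensity", ascending=False))
```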

This first generation of Big Data analysis led to multiple learnings:

  • Not every analytical insight is actionable – identifying individuals with a high churn propensity, for example, does not necessarily mean that contract termination can be prevented or that it is economical to do so
  • Big Data based customer profiles do not always add value to existing, market-research-based segmentations
  • Data analysis requires inputs from heterogeneous data sources, a minimum level of data quality and a logically consistent data model. As a consequence, a lot of manual effort usually went into data preprocessing (building data models, handling missing values, data cleansing, consistency checks etc.) and modeling – see the small sketch after this list
  • The integration of analytics with the existing IT environment is often complex and costly
  • The application of data mining methods goes beyond the capabilities of ordinary business users, even if they are analytically minded – thus, expensive analytics experts are needed to perform sophisticated analyses with high-end tools (e.g. SPSS, SAS etc.)
  • Setting up analytics processes is a cross-functional endeavor – besides analytics experts, stakeholders from IT and the business lines need to be involved
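
As a hint of what the preprocessing effort looked like in practice, here is a minimal sketch of missing-value handling, cleansing and a consistency check; the column names and rules are assumed for illustration only.

```python
# Illustrative preprocessing sketch: impute missing values, normalize labels,
# and enforce a simple consistency rule before any modeling starts.
import pandas as pd

raw = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "segment":     ["Retail", "retail ", None, "Business"],
    "revenue_eur": [1200.0, None, 430.0, -50.0],
})

clean = raw.copy()
clean["segment"] = clean["segment"].str.strip().str.title().fillna("Unknown")
clean["revenue_eur"] = clean["revenue_eur"].fillna(clean["revenue_eur"].median())

# Consistency check: revenue must not be negative; flag rather than silently fix.
violations = clean[clean["revenue_eur"] < 0]
print(clean)
print("Consistency violations:", len(violations))
```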

In view of these challenges, “Big Data” analysis by and large remained an “ivory tower” discipline with a limited organizational footprint. This first wave of Big Data analytics was focused on big business; small and medium-sized enterprises shied away from investments in analytics infrastructure and overhead.

While analysts gathered their first experience with digital enterprise data, corporate IT departments started to replace project-related, ad hoc data extraction from operational systems with a systematic approach to storing and analyzing data. The goal was to create an integrated enterprise data view. Enterprise data warehouses (EDWs) with structured ETL (extract, transform, load) routines for further analytical processing were supposed to do the job.

The key characteristic of the EDW concept is the separation of the operational (also called transactional or OLTP) layer from the informational environment. When this concept was specified in the mid-80s, two main reasons motivated the approach: the existing productive systems did not have the capacity to be enriched with further data sources, and conducting analyses on productive data was regarded as a potential risk to data quality. In addition, EDWs had the function of enabling historical trend analysis and structured business reporting along relevant dimensions. At that time this simply could not be realized on the OLTP layer (for the history of the EDW’s emergence see Barry Devlin).

EDWs have been a major step forward in dealing with the increasing amount of enterprise data, but the downsides of this approach became obvious very soon:

  • EDWs duplicate data, thereby driving up data volumes. The productive data are copied and then normalized in the EDW environment. From there they are copied into data marts so that departmental analysts can satisfy their specific information needs.
  • Extracting data from the productive systems has been a challenge in itself and a real capacity burner for IT departments and system integrators.
  • Since ETL usually happens in batch processing mode, the EDW and data mart view is rarely real time.
  • EDW queries require SQL statements and a good understanding of the underlying data model. Thus, specialized professional expertise is mandatory to operate an EDW or even a data mart. As a result, database personnel have increasingly become an operational bottleneck, often discouraging business users from initiating data queries and analyses (see the query sketch below).
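
To illustrate the last point, here is a hypothetical sketch of what a seemingly simple business question looks like against a star schema; the schema, table names and data are invented, and SQLite merely stands in for a real EDW database.

```python
# Illustrative sketch of why EDW querying needs specialist knowledge: even a
# simple "revenue by region" question spans a fact table plus dimension tables.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, quarter TEXT);
    CREATE TABLE fact_sales   (customer_key INTEGER, date_key INTEGER, revenue REAL);

    INSERT INTO dim_customer VALUES (1, 'EMEA'), (2, 'Americas');
    INSERT INTO dim_date     VALUES (20110930, '2011-Q3'), (20111231, '2011-Q4');
    INSERT INTO fact_sales   VALUES (1, 20111231, 1200.0), (2, 20111231, 800.0), (1, 20110930, 500.0);
""")

# The business question "revenue by region in Q4" becomes a multi-table join
# that presupposes knowledge of keys and grain -- hence the reliance on DB staff.
rows = con.execute("""
    SELECT c.region, SUM(f.revenue) AS revenue
    FROM   fact_sales f
    JOIN   dim_customer c ON c.customer_key = f.customer_key
    JOIN   dim_date d     ON d.date_key     = f.date_key
    WHERE  d.quarter = '2011-Q4'
    GROUP  BY c.region
""").fetchall()
print(rows)
```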

These downsides explain why EDW projects tend to last years rather than months and inflate IT budgets significantly, while business users complain about unsatisfactory EDW performance.

As it turned out, there is also another, quite fundamental issue hampering the EDW approach: EDWs do not control the quality of their data inputs. If the various productive systems are built on different concepts of a user or a customer, the resulting confusion is mirrored in the EDW. This is why Master Data Management (MDM) systems have evolved over the past few years. They address the root causes of data quality issues already on the OLTP level, where so-called master data repositories are created with clear definitions, standards and data quality rules for key business objects such as customers, accounts, suppliers, products, locations etc. The resulting master data hub provides authoritative data or – to put it in colloquial terms – only one version of the truth (see the small consolidation sketch below). The master data are then synchronized via a SOA layer with the underlying productive systems, potentially correcting and augmenting them, as well as with EDWs and BI systems. On top of these technical MDM features, business processes and data governance need to be defined, otherwise MDM will not work.
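
The core consolidation idea can be sketched in a few lines: two productive systems describe the same customer differently, and the hub derives one authoritative “golden record”. The records, fields and the naive survivorship rule below are simplified assumptions; real MDM hubs apply far richer matching and governance rules.

```python
# Illustrative "golden record" sketch: two productive systems describe the same
# customer differently; the master data hub keeps one authoritative version.
crm_record     = {"customer_id": "C-100", "name": "ACME Corp.", "email": "info@acme.example", "phone": None}
billing_record = {"customer_id": "C-100", "name": "Acme Corp",  "email": None,                "phone": "+49 30 123456"}

def merge_golden_record(*sources):
    """Naive survivorship rule: the first non-empty value per field wins."""
    golden = {}
    for record in sources:
        for field, value in record.items():
            if golden.get(field) in (None, "") and value not in (None, ""):
                golden[field] = value
    return golden

# The hub would then push this record back to CRM, billing, EDW and BI systems.
print(merge_golden_record(crm_record, billing_record))
```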

This does not only sound complicated, it is complicated. Top management needs a lot of strategic intent to make MDM happen. In view of the EDW and MDM challenges it becomes clearer why Big Data analysis has not been progressing faster at the corporate level of many enterprises.

At the departmental level an entirely different trend took place during the past decade. More and more employees learned to work with simple spreadsheet tools, especially Excel, in order to build models of low to medium complexity such as business cases, reports, calculations etc. At the same time, the “democratization” of business intelligence tools took place: formerly expensive database servers for data marts as well as stand-alone OLAP and reporting services were integrated and offered at unprecedentedly low prices. Microsoft with its ubiquitous SQL Server has been a major driver behind this trend. Microsoft even incorporated advanced data mining services into SQL Server, thereby competing head-to-head with expensive premium tools (although data mining as a methodology is too complicated to become a massively used application in the foreseeable future).

What does this mean for Big Data? Two opposing effects can be identified. The increased access to BI tools at the departmental level has certainly increased the ability to analyze data independently. The results of these analyses, however, can be viewed as a bottom-up contribution to the data deluge. Unfortunately, most of these data never make it into the central EDWs. They reside on departmental servers or – even more commonly – on the computers of the middle managers who generated them. The term “Excel silos” captures the downside of this approach: a lot of knowledge is not accessible to the entire corporation although it might be highly valuable.

BI vendors address these effects by embedding spreadsheet functionality into comprehensive analytics platforms. The currently most popular label for these platforms is Corporate Performance Management (CPM). The adoption rate (understood as actual usage) of these BI suites is limited. This does not come as a surprise, since the upside for the individual user is limited (this will be the topic of a separate blog post).

An entirely different Big Data use case is the automation of business processes. The idea is to replace human decision-making with the permanent measurement of process conditions in combination with business rules. Human intelligence is not entirely removed, of course, since the business rules need to be defined beforehand. In this use case Big Data event streams function as inputs for a rules engine, which then triggers processes and thus the generation of new data points for further decisions. In combination with the now more affordable distributed BI tools, this type of embedded, actionable analytics, also called Complex Event Processing (CEP), can be tremendously useful in stable business environments. As long as processes are subject to frequent changes, however, the setup costs for intelligent process automation are simply too high.
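
A minimal sketch of this pattern: a predefined business rule is evaluated against an incoming event stream and triggers a follow-up process. The sensor events, threshold and action below are assumptions for illustration; production CEP engines evaluate far more complex patterns over time windows.

```python
# Illustrative complex-event-processing sketch: a predefined business rule is
# evaluated against an incoming event stream and triggers a follow-up process.
from dataclasses import dataclass

@dataclass
class Event:
    sensor_id: str
    temperature_c: float

def overheating_rule(event: Event) -> bool:
    """Business rule defined up front by humans: alert above 80 degrees C."""
    return event.temperature_c > 80.0

def trigger_maintenance(event: Event) -> None:
    # In a real system this would start a workflow and create new data points.
    print(f"Maintenance order created for {event.sensor_id} at {event.temperature_c} C")

# Simulated event stream (in practice: machine-to-machine data feeds).
stream = [Event("pump-7", 65.2), Event("pump-7", 82.4), Event("pump-9", 79.9)]
for event in stream:
    if overheating_rule(event):
        trigger_maintenance(event)
```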

The most recent and most important driver of Big Data volumes has not yet been described specifically: the rise of the Internet. The Internet is a huge digital touchpoint where individuals and companies leave all kinds of traces. A whole industry has emerged with the goal of turning these data into valuable business information. Web monitoring and analytics systems help to optimize the user experience on websites. Cookie-based tracking systems have led to detailed customer profiling, giving advertisers the opportunity for personalized product offerings and advertising (see my older, but still relevant blog posts on this topic). Apart from Google it was especially Amazon with its personalization system that proved the value of Big Data. Many smaller players have adopted these best practices in recent years. Amazon’s methodology is not unheard of, though. The key item-to-item collaborative filtering algorithms of its patented recommendation engine, for instance, are in fact quite simple (see the sketch below). What makes Amazon really innovative is its ability to handle this massive amount of data under real-time requirements. The same is true for Facebook, which has also developed innovative methods to handle massive real-time data analysis.
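
To show how simple the core idea is, here is a stripped-down item-to-item similarity sketch over an assumed toy purchase history. This is the textbook co-occurrence/cosine version of the technique, not Amazon’s actual implementation; the hard part, as argued above, is running it on massive data in real time.

```python
# Illustrative item-to-item collaborative filtering: items are "similar" when
# they are bought by overlapping sets of customers (cosine over co-occurrence).
from math import sqrt
from collections import defaultdict

purchases = {                      # customer -> set of purchased items (toy data)
    "anna":  {"book_a", "book_b", "dvd_x"},
    "ben":   {"book_a", "book_b"},
    "clara": {"book_b", "dvd_x"},
    "dave":  {"book_a", "dvd_y"},
}

buyers = defaultdict(set)          # invert: item -> set of customers
for customer, items in purchases.items():
    for item in items:
        buyers[item].add(customer)

def similarity(item1, item2):
    overlap = len(buyers[item1] & buyers[item2])
    return overlap / (sqrt(len(buyers[item1])) * sqrt(len(buyers[item2])))

# "Customers who bought book_a also bought ..." -- rank all other items.
recommendations = sorted(
    ((other, similarity("book_a", other)) for other in buyers if other != "book_a"),
    key=lambda pair: pair[1], reverse=True)
print(recommendations)
```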

It is in this context that the so-called NoSQL (mostly translated as “Not only SQL”) movement came into existence as an alternative to the Relational Database Management Systems (RDBMS) used in EDWs. Traditional EDW environments with their batch ETL processes, data quality checks and normalization procedures have simply not been fast enough to satisfy the specific needs of these huge online platforms.

NoSQL databases use neither table schemas nor join operations. Consequently, they do not require SQL for queries. This deliberately less sophisticated approach relaxes the strict data quality and normalization requirements of RDBMS for the sake of better performance. Werner Vogels, the CTO of Amazon, even avoids the term database for its NoSQL Dynamo system and instead speaks of a “highly available key-value storage system”.

Key-value stores like Amazon’s Dynamo or Facebook’s Cassandra carry only the value of interest per key. Unlike in an RDBMS, all other attributes of that key are neglected, so reading and calculating is much faster. In addition, most of these key-value stores are column-oriented, which also has a positive effect on performance.
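
A minimal sketch of the key-value access pattern: everything is stored and fetched by key in a single lookup, with no schema and no joins. The in-process dictionary below is only a toy stand-in, not Dynamo or Cassandra.

```python
# Illustrative key-value access pattern: one opaque value per key, retrieved
# with a single lookup -- no table schema, no joins, no SQL.
session_store = {}   # toy in-process stand-in for a distributed key-value store

def put(key, value):
    session_store[key] = value      # the store does not interpret the value

def get(key):
    return session_store.get(key)   # single lookup by key, nothing else

put("cart:user:4711", {"items": ["B000123", "B000456"], "currency": "EUR"})
print(get("cart:user:4711"))
```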

Facebook’s databases are also graph-based, which is another form of NoSQL. Business objects are represented as nodes to which attributes can be flexibly assigned. Transferred to a social network, nodes are persons and they are linked with each other. Thus, graph databases are a highly efficient way to store social graphs. If, for instance, the relationship path between two persons on Facebook is to be determined, the system just needs to follow the connected nodes (see the sketch below). In a relational database, identifying the connecting path would require significant computing power and time, because the system would have to work through the friend relationships of all user IDs between these two nodes in order to determine the path. Thanks to graph databases such as Neo4j, FlockDB, AllegroGraph, GraphDB or InfiniteGraph this effort can be minimized. The principle of graph databases can be applied to all kinds of network-like or hierarchical structures. For small to medium-sized datasets, however, the advantage over relational databases is less relevant.
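
A minimal sketch of the path-finding idea: a breadth-first search over a friend graph stored as adjacency lists finds the shortest connection by simply following connected nodes. The names are invented, and real graph databases such as Neo4j add persistence, indexing and a query language on top of this principle.

```python
# Illustrative shortest-path search over a social graph stored as adjacency
# lists -- the access pattern graph databases are optimized for.
from collections import deque

friends = {                                   # toy social graph
    "alice": ["bob", "carol"],
    "bob":   ["alice", "dan"],
    "carol": ["alice", "dan"],
    "dan":   ["bob", "carol", "erin"],
    "erin":  ["dan"],
}

def relationship_path(start, goal):
    """Breadth-first search: follow connected nodes until the goal is reached."""
    queue, visited = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for friend in friends.get(path[-1], []):
            if friend not in visited:
                visited.add(friend)
                queue.append(path + [friend])
    return None

print(relationship_path("alice", "erin"))     # e.g. ['alice', 'bob', 'dan', 'erin']
```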

Real-time analysis of Big Data requires massive processing power. Google was one of the first Internet giants to face this problem. Its answer was MapReduce, a software framework for processing large datasets on distributed clusters of computers. A similar approach has been taken by the open source project Hadoop, which big online players like Yahoo! and Facebook have deployed successfully. A specific strength of Hadoop is the efficient processing of unstructured data, as is common in Social Media.
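
A minimal, single-machine sketch of the MapReduce programming model using the canonical word-count example; real MapReduce or Hadoop jobs distribute the map and reduce phases across a cluster and handle shuffling, fault tolerance and data locality, none of which is shown here.

```python
# Illustrative MapReduce-style word count: map emits (word, 1) pairs, the
# shuffle groups them by key, and reduce sums each group. On a cluster the
# same logic would run in parallel across many machines.
from collections import defaultdict

documents = [
    "big data is just data",
    "data warehouses meet big data",
]

def map_phase(doc):
    for word in doc.split():
        yield word, 1

def reduce_phase(word, counts):
    return word, sum(counts)

# Shuffle: group intermediate pairs by key before reducing.
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        grouped[word].append(count)

results = dict(reduce_phase(word, counts) for word, counts in grouped.items())
print(results)   # e.g. {'big': 2, 'data': 4, ...}
```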

The fact that the big online players are quite successful with NoSQL databases and distributed application frameworks does not mean the end of EDWs. Firstly, real-time performance is not always required. Secondly, the processing of unstructured data is becoming more important, especially in light of the Social Media and semantic technology revolution, but it is still not a key requirement in many industry sectors. Thirdly, depending on the type of request, EDW databases cannot easily be substituted for complex queries on structured data.

Thus, Big Data technologies like Hadoop are, for the time being, supplements rather than substitutes for corporate Big Data storage and processing environments. Companies like Cloudera are actively pushing the commercialization of Hadoop, and Oracle and IBM have already launched Big Data suites with Hadoop as one key component.

These developments can be interpreted as the first steps of a major paradigm shift in enterprise data management. According to Barry Devlin, a seasoned database expert, the EDW of the future will integrate the operational and informational environments, thereby avoiding the duplication of data. The idea is to have just one layer for all conceivable data types: from structured data such as transactions, measurements and spreadsheet analyses, over semi-structured data such as emails or call center protocols, to unstructured data such as video feeds. All of these data would be accessible in real time via metadata, which are also part of this unified data layer. The metadata qualify the data logically so that they can be accessed in an action-oriented (case-by-case) way or integrated into workflows. A separate interface layer would reflect the different use case needs from an actor’s perspective, spanning from a real-time view of data entries to in-depth, tool-supported analysis of historical data records.

Technology-wise, Devlin’s visionary BI architecture has started to materialize. SAP’s HANA (High-Performance Analytic Appliance) platform is currently the most ambitious initiative in this respect. At its core, HANA is an in-memory column- and row-based database, which will eventually become the one and only data layer for SAP applications. The first step on this strategic architectural roadmap lets HANA function as an additional layer on top of the existing OLTP and business warehouse layers for agile modeling and reporting. The promise here is to speed up data analysis by a three- to four-digit factor. The second step is to integrate the data management function of the EDW so that the EDW layer just manages the HANA metadata. At this stage HANA will already have a built-in application engine so that applications can send data-intensive operations to HANA. The first two steps are already underway. In the medium term HANA will replace the OLTP layer. All applications will run on data residing in memory. The whole solution will also be moved to the cloud, where it will be the basis for multitenant Java and ABAP applications (see Gartner, “SAP Throws Down the Next-Generation Architecture Gauntlet With HANA”, Oct 13, 2011).

No doubt, the bandwagon is rolling, but its speed and reach largely depend on concrete business cases. Alternatively, you can of course become a true believer right away, as a former SAP employee and current consultant for SAP seems to suggest: “Once HANA is in place as the database under SAP BW, customers will find many more ways to use HANA to transform their enterprises to much higher levels of performance, much as word processors, e-mail, and browsers are transforming business and society.” How can you respond to this one? Maybe with a slightly modified proverb: “In God we trust, all others bring business cases”.

The current business case arguments for the new data management revolution can be clustered into six types:

  • “Analytical land grabbing”: greater analytical power enables the calculation of previously unanalyzable data volumes
  • “Deeper actionable insights”: faster analysis leads to more scenario iterations and thus to better planning quality at a granular level
  • “Higher data reliability”: time savings from accelerated analytical processing can be used for more detailed data quality checks and on-the-fly corrections
  • “Bigger bang for the buck”: higher analytical power translates into a better price-performance ratio with regard to direct software costs
  • “Platform consolidation gains”: lower software infrastructure CAPEX and OPEX thanks to the consolidation of data management layers
  • “Data mashup”: enriching the enterprise information landscape by merging enterprise data with unstructured data, e.g. from Social Media

How convincing are these benefits in the short and medium term? Which use cases are most affected? Which industries and lines of business could be first movers? These questions will be discussed in my next post.

This content is published under the Attribution-Noncommercial-No Derivative Works 3.0 Unported license.


3 Responses to “Putting Big Data into perspective – Part 1: Big Data roots and vision”

  1. Dennis Moore says:

    I enjoyed your blog.

    You referred to a blog post I wrote ( http://www.enterpriseirregulars.com/44086/what-are-the-killer-apps-for-sap-hana-and-other-in-memory-computing-systems/ ) in your analysis. The reason I wrote that post in the first place is that, at the time, there was a heated discussion in the SAP ecosystem community about what were (and whether there were) “killer apps” for SAP HANA.

    Killer apps, by consensus, are apps that have such a great business case and such transformative value that they justify the disruption and cost of bringing in a new platform. The archetypal example is the spreadsheet, which had such a high ROI that it ushered in the purchase of personal computers (beginning with the Apple II and continuing to the PC) in businesses.

    The blog you linked to has as a main point that only a business case can drive the adoption of a complex and costly platform like SAP HANA. After all, SAP HANA is not a fully proven technology at this point, has a very high price tag (but perhaps not compared to its business value), and will require the development of new skills in the enterprise.

    The performance improvements SAP HANA offers compared to traditional disk-based databases are so dramatic that it is not hard to imagine scenarios where the speed improvements will lead to business model disruptions (and thus business cases). But each business will have to determine whether the return would justify the investment and risk.

    On a separate note, I am not aware of how HANA is suitable as a “big data” system, nor how HANA is suitable as a multi-tenant platform. I am not saying that SAP HANA is unsuitable for these use cases, but I am not aware that it is or isn’t.

    My full blog, including all the example “killer apps” that could potentially drive adoption of SAP HANA and in-memory computing, is available at http://www.bluefinsolutions.com/insights/guest_blog/what_are_the_killer_apps_for_sap_hana_and_other_in-memory_computing/ .

    Thanks for furthering the debate!

    - Dennis

  2. Thanks for your comments, Dennis. I am very much in line with your killer app argument and I also believe that the SAP HANA roadmap looks quite promising. The point I am trying to make here is that we should be as specific as possible when we talk about short-term business cases. Some of the use cases that SAP has published recently I do not find very convincing. I will explain my thoughts about this in my next post.

    The reason why I think that HANA is relevant for Big Data follows from my understanding of the topic. I am aware that some people restrict this term to unstructured data such as Social Media content and structured data from machine-to-machine communication. As explained at the beginning of my article, I think this definition is too narrow. I define Big Data as a byproduct of the computerization and digitization of business and consumer communication, following the wider definition of Gartner (volume, velocity, variety).

    As far as the multi-tenant capability of HANA is concerned, I refer to an insightful white paper from Gartner (“SAP Throws Down the Next-Generation Architecture Gauntlet With HANA”, Oct 13). But of course, this is still a high-level scenario and the boundaries between actual capabilities and roadmap plans are often unclear.

    Hope this clarifies a bit my key messages.

  3. [...] pressure to act. The analysis of mass data (in the current IT debate also called “Big Data”) is likewise at a comparatively early stage of development. The same applies to [...]
