In the previous post, “Two Sides of ‘Big?’ Data”, we took a simplified view of “Big?” data, framing the challenge as two facets: storage and retrieval. This post examines the storage facet to uncover a few key realities of data accumulation. The basic premise of the discussion is that the functional use cases for storage and retrieval are different and independent of each other.
Data has grown “Big” on us unknowingly, inadvertently, and unintentionally as a function of its size. Otherwise it is an age-old reality that exists irrespective of the medium or form of its persistence, and its growth is directly proportional to the sophistication of the medium of capture. When it comes to storing or archiving data, we face challenges like the following:
In the digital world, data is captured as bits and bytes without exception. This is much different from pre-technology storage like scrolls, paper or other methods of capturing data. The digitized data thus stored may not exactly match its source form due to conversion inefficiencies in certain cases. While the gap is narrowing, a variance still exists.
As we expanded the categories of data captured, we added new formats corresponding to the category and/or the media used for distribution. This progression has run from punch cards, tapes and floppies to hard disks, CDs, DVDs and beyond. Irrespective of the media, data manifests in a format and a structure.
We have seen the same subject data stored in one media format yet using different structures across its lifecycle. This is further complicated by data with variable meaning stored in similar structures, where the meaning depends on when the data was created. Multiple permutations of content, format and structure are inherent characteristics of data.
With every new channel of engagement, we build new funnels to absorb as much data as possible. The challenges are how quickly we capture and how much. We are interacting continuously, instantly, from everywhere and with everyone. What if we miss a bit? Is it even possible to capture every bit of information?
Our data is locked in the platforms we use to capture and/or store it (platformity*). This is true of every single platform deployed to store data, and we then expend much effort assimilating the data so it means and looks the same everywhere. Vendors reluctant to offer easy integration and standards complicate this further.
The above list is representative at best, yet these realities can constrain a user’s ability to consume data depending on the method and purpose of use. There have been attempts to tweak storage capabilities to support ease of access and retrieval, and our legacy data landscape reflects the consequences of those attempts. This accentuates the need to separate capabilities on the storage and retrieval sides. We will continue to leverage legacy systems as we add new ones, leaving us little room to fix such legacy issues. Past solutions manipulated storage using extract, transform and load (ETL) processes, where the transform made data appear as the consumers would like it. This pursuit consumed much effort, time and cost, adding to our storage nightmare.
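The cost of the "transform" step above can be illustrated with a minimal sketch. All field names here are illustrative assumptions, not a real schema; the point is only that reshaping data for one consumer at load time silently discards the source form.

```python
# Hypothetical ETL "transform" step that reshapes a raw record for a single
# reporting consumer. Everything that consumer did not ask for is lost.

def transform_for_reporting(record: dict) -> dict:
    """Keep only the fields the reporting consumer wants, in its preferred shape."""
    return {
        "customer": record["cust_name"].title(),   # reformatted for display
        "amount": round(float(record["amt"]), 2),  # coerced to a number
        # every other field in the raw record is silently dropped
    }

raw = {
    "cust_name": "ada lovelace",
    "amt": "19.999",
    "channel": "mobile",              # lost after transform
    "ts": "2015-03-01T10:00:00Z",     # lost after transform
}
loaded = transform_for_reporting(raw)
```

A second consumer who needs `channel` or `ts` now requires a second pipeline against the same source, which is exactly how duplicate repositories and extra ETL work accumulate.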
How can we solve the data storage problem?
A simple, straightforward answer is to store data in its original form, without burdening it with retrieval or consumption demands.
Today, technology has much to offer: data management capabilities have progressed rapidly, breaking the relational database barrier. Rich, sophisticated new-age databases address individual needs or combinations of needs such as performance, scalability, and flexibility of format and structure. These technologies offer the variety and versatility to capture data as close to its source form as possible and store it without additional process steps to fine-tune it for end-user consumption.
This brings us to the “Big” data conversation fueled by the emergence of new database technologies. Such a platform resolves the data storage conundrum by offering a foundational storage layer capable of holding any type of source data. It can act as a catch-all repository that stores data “as is”, in its native form, regardless of how it is captured. Eliminating the steps that transform or massage data during storage lends speed and efficiency, and reduces redundancy, duplication and silo’ing.
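What "as is" storage could look like can be sketched in a few lines. This is a hedged illustration, not a real product API: the directory layout, sidecar format and function name are assumptions. The essential move is that the payload is written byte-for-byte, with structure applied only later, at read time.

```python
# Minimal sketch of a catch-all landing zone: payloads are stored untouched,
# partitioned by source and arrival date, with a small metadata sidecar.
import datetime
import hashlib
import json
import pathlib
import tempfile

def land_raw(root: pathlib.Path, source: str, payload: bytes) -> pathlib.Path:
    """Persist a payload exactly as received: no parsing, no transformation."""
    digest = hashlib.sha256(payload).hexdigest()
    day = datetime.date.today().isoformat()
    target = root / source / day                  # partition by source and arrival date
    target.mkdir(parents=True, exist_ok=True)
    path = target / f"{digest[:16]}.raw"          # content-addressed name deduplicates replays
    path.write_bytes(payload)                     # stored byte-for-byte; schema applied on read
    sidecar = target / f"{digest[:16]}.meta.json"
    sidecar.write_text(json.dumps({"source": source, "bytes": len(payload), "sha256": digest}))
    return path

# Usage: any format lands unchanged, whether JSON, CSV or binary.
root = pathlib.Path(tempfile.mkdtemp())
stored = land_raw(root, "crm", b'{"cust":"ada","amt":"19.999"}')
```

Because nothing is transformed on the way in, every future consumer starts from the same unaltered source bytes.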
Building such a catch-all repository will enrich and extend prevailing storage capabilities, leading to possibilities such as –
- Reducing and/or eliminating upstream extract, transform and load (ETL) work
- Making data available in “as is” form
- Removing duplicate, redundant and disparate repositories
- Capturing all types of data regardless of how it is stored
Realizing such possibilities requires careful planning to minimize disruption and maximize benefits through the right solution architecture.
With new technologies such as NoSQL databases and the Apache Hadoop ecosystem, the possibilities are vast, with many features driving enrichment of data. These capabilities address the key big data challenges of volume, velocity, variety and veracity. The toolset can help weave an enterprise information fabric as the enterprise data backdrop. Such an undertaking is complex and could be disruptive. A few questions to ponder are –
- How to deploy new technologies in the existing technology landscape?
- What is the scope of disruption this may cause?
- Will this impact existing data storage capabilities and how?
- How to pick the right repository based on source data?
- What are known risks and challenges?
It is possible to integrate new technologies into existing systems in a seamless manner with the right plan. The key is to identify the right data capture platform to populate a Hadoop-like data lake. Can this lake, acting as the enterprise data backdrop, serve as the single version of truth? It can, with an incremental approach that onboards individual data sources while archiving legacy data into the Hadoop data lake. It takes careful planning to set up the right solution architecture based on data priorities. As always, there are multiple paths to a single destination. We have a high-level proposal to resolve the storage dilemma, though further elaboration is needed.
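The incremental approach can be sketched as follows. The class, source names and record shapes are illustrative assumptions standing in for a real lake: the point is that sources are onboarded one at a time, each keeping its native record format, and all land in a single backdrop store.

```python
# Hedged sketch of incremental onboarding into a single catch-all lake.
from typing import Iterable, List

class DataLake:
    """Toy stand-in for an enterprise data lake; records kept in native form."""
    def __init__(self) -> None:
        self._records: List[dict] = []

    def ingest(self, source: str, records: Iterable) -> int:
        """Land records as-is, tagged only with their source. No transform."""
        count = 0
        for rec in records:
            self._records.append({"source": source, "payload": rec})
            count += 1
        return count

    def by_source(self, source: str) -> list:
        """Schema-on-read: consumers interpret payloads per source."""
        return [r["payload"] for r in self._records if r["source"] == source]

# Sources are migrated in priority order, one phase at a time.
lake = DataLake()
lake.ingest("billing", [{"invoice": 1}, {"invoice": 2}])      # phase 1: structured records
lake.ingest("weblogs", ["GET /home 200", "POST /cart 302"])   # phase 2: a very different shape
```

Each new phase adds a source without touching those already onboarded, which is what keeps the migration low-disruption.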
In the next post we will focus on addressing the “retrieval” challenge.
*Platformity – describes the constraints imposed by the deployed platform