Big Data Needs Solid Requirements


Big data is too big for most organizations. It’s not because there is too much raw material to deal with, nor is it because of a lack of applicable tools. It doesn’t even have anything to do with the inclusion of unstructured data (which doesn’t really exist, but that’s a topic for blog posts I’ve already written). No, big data’s too big because of what’s possible.

The possibilities to slice and dice are virtually endless. Numerical data is bombarding organizations from within and without. Text-based data (I’m calling it data because until it’s put in context it’s not really information) is being generated almost at the speed of thought, in quantities heretofore unimagined. Every transaction, every search request, every Tweet, and every Like generates an entry in some repository that organizations may or may not be aware of, have control of, or have access to.

That all sorta sucks, but …

But what really sucks is that organizations jump onto the Big Data bandwagon with not an iota of a clue as to what they want to do with it. Boys and girls, Uncle Chris is gonna try and wise you up.

Like any other project that involves expending time and money, you need to know what you want to achieve when you’re done. For example: many years ago I worked on a service management reporting project for a big organization providing managed network services to an even bigger organization. It came down to this: approximately 17,000,000 rows/day of raw data were collected, $pooploads/month of revenue depended on meeting SLA targets, and a missed SLA meant pooploads-a-lot got lost. The specific metrics and their data sources were identified before a dime was spent on tools. My point is, you need to figure out what the business requirements are first. That was true in 2002 and it is true today.
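To make that concrete, here’s a rough sketch of what “metrics and sources before tools” can look like when it’s actually written down. Everything in it — the metric names, the thresholds, the source systems — is invented for illustration, not lifted from that project.

    from dataclasses import dataclass

    @dataclass
    class SlaMetric:
        name: str      # what gets reported
        source: str    # where the raw data comes from
        target: float  # contractual threshold
        window: str    # measurement period

    # Made-up examples of requirements captured before any tooling is chosen.
    requirements = [
        SlaMetric("circuit_availability_pct", "network_element_polls", 99.5, "monthly"),
        SlaMetric("mean_time_to_restore_hrs", "trouble_ticket_system", 4.0, "monthly"),
    ]

    for m in requirements:
        print(f"{m.name}: target {m.target} per {m.window} window, fed by {m.source}")

The tool you eventually buy matters a lot less than having this list agreed to first.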

Much has stayed the same, and some has changed over the last 10+ years. We’ve even got some new stuff we can play around with thanks to social media, text analytics, sentiment analysis, etc. But knowing at least a few of the questions we’re trying to answer, before actually doing something, is still a valid and necessary first step.

It’s really cool that we can now ask “How does [demographic of choice] feel about our support organization?” in addition to asking how many units of blue-widget-A we sold last quarter in the mountain time zone north of the 49th parallel. But before we ask the question we need to know enough to ask it, and we need to know what we’re gonna do if we don’t like the answer (we should also have a social media strategy in place). We also need to know niggling little details like where the data is, whether or not we can access it, and whether or not it’s reliable (whatever that means). Oh, and we should have some sort of governance in place to deal with all that personal and payment data we’re collecting, storing, massaging, analyzing, and interpreting to generate more profits than ever before.
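Just to show what that first kind of question can look like in practice, here’s a toy sketch of scoring sentiment by demographic. The word lists, field names, and sample mentions are all made up, and a real text-analytics pipeline would do far more than count words — this is only the shape of the thing.

    from collections import defaultdict

    # Crude word-list scoring; real sentiment analysis would use a proper NLP model.
    POSITIVE = {"great", "helpful", "fast"}
    NEGATIVE = {"slow", "rude", "useless"}

    def score(text):
        words = text.lower().split()
        return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

    # Invented sample mentions tagged with a demographic of choice.
    mentions = [
        {"demographic": "18-24", "text": "Support was fast and helpful"},
        {"demographic": "18-24", "text": "The chat bot was useless"},
        {"demographic": "45-54", "text": "The agent was rude and slow"},
    ]

    by_demo = defaultdict(list)
    for m in mentions:
        by_demo[m["demographic"]].append(score(m["text"]))

    for demo, scores in by_demo.items():
        print(demo, sum(scores) / len(scores))

The hard part isn’t the arithmetic; it’s deciding, before you run it, what you’ll actually do with a low score.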

I’d like to end today’s sermon with another little story …

Back in 2004 I was a project manager at a municipality. One of my periodic tasks was to compile the results detailing uptake of certain web-enabled municipal services related to planning and development. Each month I would get the results from the various sub-departments, enter them into my fancy-schmancy reporting tool, compare the numbers against the projections, and then present them at a monthly meeting. We used a standard red-black-green thingy and it was all so easy. Easy, that is, until the dude in charge asked me whether they were supposed to do anything about the red (bad) numbers. My question to him, and the take-away from this anecdote, was “If you’re not going to address the issues highlighted, why are you spending time and money on this?”
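For what it’s worth, the mechanics of that monthly exercise were trivial — which was exactly the point. A rough sketch of the compare-against-projections step might look like the following; the service names, the numbers, and the 90% cutoff are assumptions on my part, not the municipality’s actual rules.

    # One plausible reading of a red-black-green status: green if the projection
    # is met, black if we're close, red if somebody needs to make a decision.
    def status(actual, projected, close_enough=0.9):
        if actual >= projected:
            return "green"
        if actual >= projected * close_enough:
            return "black"
        return "red"

    # Invented uptake numbers: (actual, projected) per web-enabled service.
    uptake = {
        "online_permit_applications": (412, 500),
        "development_enquiries": (198, 180),
    }

    for service, (actual, projected) in uptake.items():
        print(service, status(actual, projected))

Producing the colours is cheap. Deciding what the reds mean for the business is the part that actually costs something, and it’s the part that was missing.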

Big data is full of big possibilities. However, before you jump in, make sure you have a plan. Understand what it is you’re trying to achieve. Have a plan for how you’re going to react to negative results as well as positive ones. Know that you won’t figure it all out on your first attempt, but that’s okay, because a cool thing about analytics is that the more you play, the more you learn, and the more possibilities you discover.

The bottom line is that if you don’t know what you’re trying to accomplish or what questions to ask, it makes no difference whether you’ve got a couple of gigs of data or multiple petabytes.

Big Data? Big Whoop.


Big Data? Big Whoop!

Over the past couple of days I’ve been seeing a number of posts and tweets about Biiiiig Dataaaaaaa (ring announcer voice in my head)! What is “Big Data”? Check out the definition in this executive summary; or as I and others like to say, “It’s as big as a piece of string is long”. I certainly understand the idea behind “Big Data”, but do we really need a new term for something that, let’s face it, isn’t new at all?

In a comment to this post I used the phrase “E2.0 meets BI”. To be more accurate I should have said “E2.0 fuels BI”. This whole “Big Data” thing is nothing more than reporting and analytics, but with more data than we had before. Those of us who have a stake in the BI domain have often wished for more raw data on which to base our decisions. Now that we have it, and are getting more of it every second, we’re freaking out and giving one or more major vendors in the space an opportunity to define something new. Two things, and only two things, have really changed:

  1. The available amount of raw data is way beyond what it was only a short time ago;
  2. The Cloud and SaaS jeopardize access to some of the raw data.

If you’ve got the resources (i.e.: $’s), dealing with #1 is a matter of scaling. Dealing with #2 is tougher, especially if any of your data sources are not entirely under your control (Cloud, SaaS). The challenge, however, is not insurmountable:

  • Rationalize your requirements and identify what is absolutely critical to your business (i.e.: leave the “it’s just cool” stuff out or defer it for later);
  • If you rely on hosted data sources negotiate appropriate access and up-time agreements;
  • Find out if your hosted providers can provide some of the ETL for you;
  • Trim your datasets where possible (there’s a rough sketch of this, and of the retention point below, after the list);
  • Identify your true timing requirements (real-time, near-real-time, periodic);
  • If you have retention / disposition policies on your data sources, enforce them; if not, define and enforce them.
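As promised above, here’s a minimal sketch of the trimming and retention points together. The field names, the 400-day window, and the sample rows are all assumptions for illustration; the idea is simply to keep only what the requirements call for and drop what’s past its disposition date.

    from datetime import datetime, timedelta, timezone

    # Hypothetical requirements: the fields we actually need and how long we keep rows.
    REQUIRED_FIELDS = {"ts", "customer_id", "event"}
    RETENTION = timedelta(days=400)

    def trim(rows):
        cutoff = datetime.now(timezone.utc) - RETENTION
        for row in rows:
            if datetime.fromisoformat(row["ts"]) >= cutoff:
                yield {k: v for k, v in row.items() if k in REQUIRED_FIELDS}

    recent = (datetime.now(timezone.utc) - timedelta(days=1)).isoformat()
    sample = [
        {"ts": recent, "customer_id": 42, "event": "login", "just_cool": "drop me"},
        {"ts": "2011-05-01T10:00:00+00:00", "customer_id": 7, "event": "login"},
    ]
    # Keeps the recent row (trimmed to the required fields) and drops the stale one.
    print(list(trim(sample)))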

The funny thing is, when I made my living in BI projects, 5 of the 6 points noted above were standard things we did. Maybe nothing really has changed all that much, other than my segue into RIM. Oh well.
