Disclaimer: I am not a big data specialist. Not even close. So please do not use this blog as a source for understanding big data or big data analytics.
What I am, though, is a big data aficionada. And you can use this blog as a source for understanding how pumped anyone can get about big data!
I had dinner the other night with a longtime friend who has been out of the IT industry for a decade or so. She asked me what big data is (I so want to say "what big data are", but I resist…), and before I was a few minutes into it, her eyes opened wide and she said, "So instead of starting with a hypothesis and collecting data to test it, you look at huge masses of data for hypotheses that can emerge from it." She also said, "It's like the difference between how adults learn language from grammar books and how children learn it from everything around them." Then she started riffing on all the potential applications. It was a blast to watch her mind just take flight.
A somewhat more formal, but no less energized, conversation took place at VERGE DC earlier this month. I'd been graciously invited by the folks at GreenBiz to host a guru session on big data and sustainability. Clearly, it made sense to have someone from EMC, given what EMC is doing with our Greenplum and Isilon products. For the sustainability aspect, it made clear sense for the facilitator to be me. But for big data, not so much. On the other hand, the whole domain of sustainability involves so many diverse data sources that it inherently seemed like a match, so I figured - what the heck!
It turned out wonderfully, as it happens, because I had an awesome set of participants. They came from private industry - large and startup, IT and not - as well as government, academia, and non-profits. And they brought exceptional expertise, passion, and ideas. This was a great example of how we're going to have to collaborate going forward, with subject matter experts and data scientists working together. I wished I could have told them about EMC Greenplum Chorus, but that was still a couple of days away from announcement.
Most of what we talked about was the definition of "big data". We didn’t come to a definitive answer (we only had an hour, after all, not including the cocktail-fueled discussion that followed), but we did compile a list of attributes that one or another of the attendees felt were typical for big data:
- There's a lot of it (duh) - too much for traditional tools.
- It's growing fast - again, too fast for traditional tools.
- "It's noise", one person said. Which was interpreted two different ways - one to mean that it has a low signal-to-noise ratio, the other (more inspiring) that the patterns in the noise itself may turn out to be signal when looked at in sufficient quantity.
- Algorithms can be created from the data, rather than the other way 'round.
- Unexpected things can emerge from big data (as our own guru Chuck points out, "if it's done right!").
- It comes from disparate sources and disparate kinds of sources.
- It's largely unstructured, and generally not cleansed, harmonized, or normalized. Messy.
- Access is potentially global.
- It is cross-domain and multi-domain in nature.
- It doesn't necessarily contain answers; it describes or even recreates the context of the problem and solution space. That is, it exposes correlations that can lead to answers, before we figure out causality.
- It is context-dependent and has potentially narrow temporal thresholds.
Then, we started to talk about applications. We spent most of the remaining time on a particular challenge: how to stop the largest epidemic of cholera in the world. As it was described to me, traditional methods of tracking let us find people who have died and those who are infected so we can treat them. But with big data, we could be looking at everything from social networks to migration patterns from cell phone roaming to water conditions to figure out where the epidemic is heading and get there before big outbreaks become inevitable. There's plenty of work to do here, but it is really starting to happen! And yes, it is another public/private/academic collaboration.
Meanwhile, back at the ranch, we're now looking at several opportunities to apply big data analytics to our own sustainability initiatives, from understanding impacts to measuring performance. Stay tuned - I will certainly share what we learn.
Students always ask me what they should study if they want to make the world a better place. How about data science? Anyone who reads my blog knows how much I love my job, and how much satisfaction I get from working with the team at EMC to drive change. I wouldn’t trade it for anything.
But nothing has me as pumped as big data. It can change the world!