Archive for the ‘Technology Bytes’ Category
"Simplicity is the ultimate sophistication." The words are Leonardo da Vinci's, uttered a very long time ago, but they should be the mantra for every CIO. Any company's first step in managing Big Data must be to focus relentlessly on reducing complexity rather than increasing it. That doesn't sound terribly profound, except that most IT organizations respond to requests from the business by doing exactly the opposite: something that increases complexity. (If this were a Star Wars movie, this would be known as taking your first step toward the Dark Side.)
In previous blog posts we talked about the Big Data version of the Alignment Trap (“Another Alignment Trap?”) and how IT organizations pursuing information management projects can avoid it (“Avoiding Big Data’s Alignment Trap – Part 1”). This post moves beyond perspective to the implementation of the solution and focuses on the elegance and power of simplicity.
Embracing simplicity, it turns out, is actually a hard thing to do. It means:
- Replacing legacy systems where possible.
- Eliminating add-ons.
- Driving consistency and standardization wherever possible.
- Building new solutions on simplified, standardized infrastructure rather than through extensive customization or by layering more on top of whatever happens to be there.
In the Sloan study, all of the companies caught in the Alignment Trap had made this same mistake: they repeatedly took steps that created an enormously complex IT environment. Only when they realized what they had done, and took steps to unravel the complexity they had created, were they finally able to extract themselves from the Alignment Trap and move toward a more efficient and effective IT environment.
In a previous blog post “Another Alignment Trap?” we talked about the Big Data version of the Alignment Trap (first documented in an MIT Sloan School study). We generated the Big Data version of the diagram presented in the Sloan study, and described Big Data variants of the four quadrants of that diagram:
Status Quo: Little to no value to the business, and the storage and management of the data is expensive. Worse, the data is risky to hold: too much of it is kept, and it isn't secure.
Well-Oiled Data: Little to no value to the business, but at least it's well managed. The data sits on cheap storage, IT is good at enforcing a document retention policy and getting rid of old data, and what remains is well secured.
The Data Trap: The data is very important to the business, but it's not well managed. It's used in many different areas of the business, but it's duplicated, moved, left unsecured, and stored and managed in lots of different places, so it costs a lot to manage.
Data-Driven Business Growth: The data is very important to the business, and it's applied well to critical business problems and opportunities. It sits on storage whose cost is appropriate to its value and use, old data is deleted, active high-value data is retained, and the data that's kept is well secured.
To get to the ultimate goal of Data-Driven Business Growth, we agreed it’s important not only to optimize the business value of data, but also to have an IT organization effective at supporting the infrastructure required by these business-oriented information use cases.
Back in 2007, MIT’s Sloan School decided to ask what seems like a basic and reasonable question about IT: Do those organizations that spend more on IT have better business results to show for it?
Of course, the devil is entirely in the details, but they decided to make it simple. To make “spend more on IT” objective and measurable, they used IT budget as a percentage of annual sales. To make “better business results” objective and measurable, they used compound annual growth rate of sales over three years. Then, for the 500-odd companies they surveyed, they plotted the results on a chart.
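For readers who want to see the arithmetic behind that "better business results" metric, compound annual growth rate is a one-line calculation. Here's a quick sketch in Python with made-up numbers (the study's actual data isn't reproduced here):

```python
def cagr(start_sales: float, end_sales: float, years: int = 3) -> float:
    """Compound annual growth rate of sales over the given number of years."""
    return (end_sales / start_sales) ** (1 / years) - 1

# Hypothetical company: sales grow from $400M to $530M over three years
print(f"Three-year CAGR: {cagr(400, 530):.1%}")  # ~9.8%
```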
What they expected to see (and what you would expect to see) was a high degree of correlation: the companies that spent more on IT (compared to the average) would have better business results (compared to THAT average), and those that spent less would have worse. But no. Instead, the results were all over the map. They then wondered whether other factors needed to be considered.
After much additional head-scratching, they concluded that the two missing factors were how well IT was aligned with the business (or not), and how effective (i.e., how efficient) IT was. Only then did a clear pattern emerge.
The Sloan study concluded that companies with efficient and aligned IT organizations did in fact outperform their peers, achieving spectacular revenue growth while underspending on IT. This quadrant they called "IT-Enabled Growth." They also found that ¾ of those surveyed had inefficient and unaligned IT organizations, with the mediocre business results to show for it; they called this quadrant the "Maintenance Zone." These are the companies Nicholas Carr was referring to when he wrote "IT Doesn't Matter," because to these organizations, it doesn't.
Above and below this correlation, they found two more results that tell an even more interesting story. The "Well-Oiled IT" organizations, being less aligned with the business, didn't get the business results the "IT-Enabled Growth" companies did, but they spent 15% less on IT than the average because their IT organizations were effective at what they did.
The most interesting quadrant is the "Alignment Trap": these companies spent the most on IT relative to the average but had the worst business results to show for it.
They concluded that the problem was attempting to align IT with the business before the IT organization had its own house in order; these businesses failed precisely because they tied themselves to an ineffective IT organization. They felt this result was so important that they titled their report "How to Avoid the Alignment Trap."
Enter Big Data
Many are familiar with the old story about the Blind Men and the Elephant. In various versions of the tale, a group of blind men (or men in the dark) touch an elephant to learn what it is like. Each one feels a different part, but only one part, such as the side or the tusk. They then compare notes.
They conclude that the elephant is like a wall, snake, spear, tree, fan or rope, depending upon where they touch. They have a heated debate that does not come to physical violence, but they learn they are in complete disagreement, and the conflict is never resolved.
It turns out the IT industry had a problem very similar to that of the six men above. Each was responsible for a different part of an organization's elephant, er, IT environment. One of their biggest problems was that each of them, using different tools, having a different focus, or being responsible for a different part of the process, would end up with a different and inconsistent view of what the IT environment really looked like.
So the guys who invented ITIL figured out (correctly, I might add) that the only way out of the problem was to include something called the CMDB (the configuration management database). Without getting too technical, a true CMDB is a representation of the current and historical relationships between configuration items (the "atoms" of an IT environment). As long as everyone keeps the CMDB up to date, nobody ends up confused.
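To make that concrete, here's a minimal sketch of what a CMDB boils down to conceptually: configuration items plus the dated relationships between them. This is an illustration, not any vendor's actual schema, and all the names are invented:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class ConfigurationItem:
    ci_id: str    # e.g. "server-042", "app-payroll"
    ci_type: str  # e.g. "server", "application", "database"

@dataclass
class Relationship:
    source: str                      # ci_id of the dependent item
    target: str                      # ci_id of the item it depends on
    kind: str                        # e.g. "runs_on", "connects_to"
    valid_from: date
    valid_to: Optional[date] = None  # None means the relationship is still current

# A two-item "environment": the payroll app runs on server-042, and everyone
# who touches the environment reads and updates this one shared record.
cmdb = {
    "items": [ConfigurationItem("app-payroll", "application"),
              ConfigurationItem("server-042", "server")],
    "relationships": [Relationship("app-payroll", "server-042", "runs_on",
                                   valid_from=date(2011, 3, 1))],
}
```

The particular fields don't matter; what matters is that every tool and every team works from the same shared record of what exists and how it is connected.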
As readers of this blog will know by now, it is my strong belief that the solutions to most of the problems associated with Big Data will need to look the way Big Data looks. For example, the solutions themselves should be distributed in the same way Big Data itself is distributed, not unlike the way we network large organizations or even the Internet itself.
By contrast, several players in the big data game seem to think we can wrap up solutions in a single container or application, which brings to mind the famous H. L. Mencken quote:
“For every problem, there is one solution which is simple, neat, and wrong.”
The Hindenburg is a very famous example. Its designers did a Mencken. Hydrogen is simple (hey, it's the first element on the Periodic Table), neat (certainly quite easy to manufacture), and highly flammable, unlike helium, which is what we use today. The "wrong" part didn't become clear until the airship got too close to its mooring mast at Lakehurst NAS in 1937.
So, what’s the hydrogen in the world of Big Data?
I could give you an answer that’s simple and neat, but it would be wrong.
In 2003, The Economist quoted part-time futurist and full-time visionary William Gibson: “The future is already here – it’s just not evenly distributed.”
Gibson was saying that many of the things we long for in some yet-unrealized future are in fact already here. But they're here in such limited quantities, and in such little-known places, that most of us don't know about them yet.
For example, a recent satellite picture of the Korean Peninsula at night shows that the particular future that involves electric lights (so we don't bump into stuff at night) clearly hasn't been evenly distributed to the northern portion of that peninsula yet.
The reason for starting today’s post this way is to try once again to make a similarly visionary statement:
“Big Data is already here – it’s just not evenly distributed.”
In the most recent edition of Executive Counsel Magazine, Matthew Nelson has an article titled "Preventing an E-Discovery Cost Cascade." In it, attorney Nelson outlines the limitations of three traditional collection tools: manual collection, network collection, and "index-everything" network collection. Regarding "index everything," Nelson says:
“Index-everything” network-based collection tools, sometimes called enterprise search technologies, create a searchable database in advance of a search over an organization’s entire environment. However, the search can take months or even years and is never complete, and thus is both costly and risky.
Sometimes I pine for the ability to say in public what Dan Aykroyd used to say to Jane Curtin weekly on Saturday Night Live to point out her relatively naïve view of the world. Instead, let's just say Nelson couldn't be more wrong. He has evidently limited his thinking to the capabilities of his own company's tools. On the subject of "index everything," I can personally vouch for the fact that not only does it work, it is fast becoming recognized as the only way to address large eDiscovery matters (along with a whole host of other problems associated with Big Data). Let me give one example to drive the point home:
We worked with a customer who started with 5 petabytes of data that needed to be reviewed in an extremely large legal matter. Using an "index everything" approach, that 5 petabytes was reduced to 132 terabytes of potentially relevant data (roughly 2.6% of the initial data set) and ultimately to 1 terabyte that was turned over as relevant (0.02% of the initial data set). This was all done in two weeks without moving any data anywhere, and the company documented savings in the tens of millions of dollars in avoided review costs.
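For anyone who wants to check the arithmetic, here's the reduction funnel in round numbers (treating a petabyte as 1,000 terabytes for simplicity):

```python
initial_tb = 5 * 1000          # 5 petabytes of data in scope
potentially_relevant_tb = 132  # after "index everything" culling
produced_tb = 1                # ultimately turned over as relevant

print(f"Culled to {potentially_relevant_tb / initial_tb:.1%} of the original")  # ~2.6%
print(f"Produced {produced_tb / initial_tb:.2%} of the original")               # 0.02%
```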
Some of you reading this will recall the day we first learned that connecting things together, and making them easier to get to, was preferable to moving or copying things around to put them in a convenient place. Examples include client/server (which weaned us off the mainframe as the dominant way business computing occurred), networked PCs (now so taken for granted that just about every device we can imagine wants to be networked, including my car), hyperlinks in web pages (i.e., don't copy, link), mobile devices, and so on.
Just as we networked desktops, web pages, mobile devices, and applications, we must now seriously contemplate networking data. I will assert that networking lots of data together is what will make it Big Data, not analytics platforms like Hadoop, and not large databases. And this is a really big idea.
“Manage in Place” is why we’ll do it; “Data Networks” is how we’ll do it.
So, if the goal is to create data networks, what would such a data network infrastructure look like? While this is all still fairly new, I think we can safely make a few predictions about what a big data network would look like.
First, working from the ground up, there should be a standard way to access any data source that falls under the heading of Big Data. Think of it as the Ethernet connector for Big Data. This doesn't exist yet, but it needs to, so that anyone who builds a data repository of any kind (structured or unstructured) can link it into the big data network (and have it accessed remotely) in a consistent, open, but secure way.
Until such a standard exists, data network solutions will need custom connectors to each data source that has to be managed, which means vendors of data network components will have to build them. From an industry perspective this is rather inefficient (vendors will end up writing the same interface code over and over again), but so were proprietary networks until the networking industry evolved. Let's hope, as an industry, we don't have to live with this proprietary technology for too long.
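Purely as a sketch of what such a standard contract might eventually look like (it doesn't exist yet, and every name below is invented), a connector could be as simple as this:

```python
from abc import ABC, abstractmethod
from typing import Iterator

class DataSourceConnector(ABC):
    """Hypothetical uniform contract for linking any repository into a data network."""

    @abstractmethod
    def enumerate_items(self) -> Iterator[dict]:
        """Yield metadata records (path, owner, size, dates) without moving content."""

    @abstractmethod
    def read_item(self, item_id: str) -> bytes:
        """Fetch a single item's content on demand over a secure, authenticated channel."""

# Until a real standard emerges, each vendor hand-writes one of these per data
# source: a SharePoint connector, a file-share connector, a database connector...
```

The point isn't the particular method names; it's that the contract stays the same no matter what repository sits behind it.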
Space, you see, is just enormous — just enormous. Let’s imagine, for purposes of edification and entertainment, that we are about to go on a journey by rocketship. We won’t go terribly far — just to the edge of our own solar system — but we need to get a fix on how big a place space is and what a small part of it we occupy.
Now the bad news, I’m afraid, is that we won’t be home for supper. Even at the speed of light, it would take seven hours to get to Pluto.
The reason for starting today's post this way is to try to make a similarly understated but unassailable statement (and to try to get credit for saying it first, for all you Google-trained Internet archeologists out there, if the truth needs to be told):
“Big Data, you see, is big – just big.”
Okay, now that I’ve launched myself into Internet immortality, what’s the point of this?
It’s actually pretty simple.
Early in his career, science fiction writer Theodore Sturgeon used to have to defend his choice to write science fiction against critics who didn't believe science fiction was real literature. Finally, in reply to the accusation that 90% of science fiction is crud, he retorted, "Sure, 90% of science fiction is crud. That's because 90% of everything is crud." The reply has become so famous that it's now called Sturgeon's Law.
Research also shows that stale, unstructured data runs rampant, and the problem only escalates the longer it goes unresolved. We also know that at least 80% of business data is unstructured, and that 70% of this unstructured data becomes stale within 90 days of its creation and is never touched or used again.
In addition to these staggering statistics, most conservative estimates show that data grows 50% to 100% annually. Compounded at even the lower rate, your data will grow to nearly 58 times its current size over a 10-year cycle, an increase of more than 5,600%.
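The compounding is easy to verify; a quick sketch:

```python
# At 50% annual growth, data multiplies by 1.5 each year.
for years in (5, 10):
    factor = 1.5 ** years
    print(f"After {years} years: {factor:.1f}x the data ({factor - 1:.0%} growth)")
# After 5 years: 7.6x the data (659% growth)
# After 10 years: 57.7x the data (5667% growth)
```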
What, you might reasonably ask, does all this have to do with Information Management or (as we prefer to call it here) Big Data?
Everything, as it turns out, because 90% of your Big Data is crud. And that observation is both quite daunting and quite liberating.
Now, we can do a lot with these observations as a starting point (and I will in future blogs), but my topic here is to contrast two popular approaches to information management:
Move to Manage: This is the idea that, in order to do all the things we want to do to this mountain of data (manage it), we have to move it somewhere. We bring the data to the processor, as we did with mainframes.
Manage in Place: This is the idea that we can, in fact, do all the things we want to do with this mountain of data right where it was born, lives, and dies. We move the processor to the data, as we do in the networked, mobile, interconnected, Web 2.0 world.

We've actually gone through a bit of an evolution in where and how we manage information, especially as it relates to where data has to (or can) reside in order to be managed. In the first generation, we had outsourced solutions, where we actually had to move boxes of paper by hand to large warehouses; think Iron Mountain in its first incarnation.
In the next, we had specialty, hardware-based storage solutions. These were very large, very complex, and very expensive; think FileNet, Documentum, or Interwoven.
In the third generation, we evolved away from physical constraints (warehouses and hardware), but still had the idea of a mega-repository, because data’s native environment wouldn’t support the necessary operations; think Autonomy, Enterprise Vault, Simpana, NearPoint. These were far simpler solutions than the first two generations, but still required data to be moved to be managed.
Now, as we’ve already observed, any approach where you have to move data to manage it means that 90% of what you’re moving is crud, and most of that data won’t ever be used again after 90 days. So, why are we moving all this crud, when modern networking technology and protocols make that unnecessary? Good question.
Stop Moving the Data to Manage It
I think it's a pretty safe conclusion that what we really want to do is not move the data at all, but manage it in place, where it was created, lives, and dies, unless, of course, there are specific reasons to move it (legal retention, disaster recovery, or backup/restore, for example).
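As a toy illustration of the manage-in-place idea (not any particular product's implementation; the path and the 90-day threshold below are just placeholders), you can catalog and classify data right where it sits without copying a single byte of it:

```python
import os
from datetime import datetime, timedelta

STALE_AFTER = timedelta(days=90)  # per the "stale after 90 days" observation above

def index_in_place(root: str) -> list:
    """Walk a file share where it lives and build a lightweight metadata index."""
    index = []
    now = datetime.now()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            stat = os.stat(path)
            modified = datetime.fromtimestamp(stat.st_mtime)
            index.append({
                "path": path,                    # the data itself never moves
                "size_bytes": stat.st_size,
                "last_modified": modified,
                "stale": (now - modified) > STALE_AFTER,
            })
    return index

# Usage: catalog = index_in_place("/departmental/file/share")
```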
Modern Information Management solutions will manage data in place, where it is supposed to be managed. There's no need to spend more on storage than you already do. It is, after all, your data; don't you deserve to choose where it resides?