Archive for the ‘Technology Bytes’ Category

The Blind Men and the Elephant

Many are familiar with the old story about the Blind Men and the Elephant.  In various versions of the tale, a group of blind men (or men in the dark) touch an elephant to learn what it is like. Each one feels a different part, but only one part, such as the side or the tusk. They then compare notes.

They conclude that the elephant is like a wall, snake, spear, tree, fan or rope, depending upon where they touch. They have a heated debate that does not come to physical violence, but they learn they are in complete disagreement, and the conflict is never resolved.

There’s an even more recent version of this story, but it involves IT Service Management.  This story ends happily because the six men decide to rely on a Configuration Management Database, or CMDB.

Turns out the IT industry had a problem very similar to that of the six guys above.  They were each responsible for different parts of an organization’s elephant, er, IT environment.  One of their biggest problems was that each guy, using different tools, or having a different focus, or being responsible for different parts of the process, would end up with a different and inconsistent view of what the IT environment really looked like.

So the guys who invented ITIL figured out (correctly, I might add) that the only way out of the problem was to include something called the CMDB.  Without getting too technical, a true CMDB is a representation of a set of current and historical relationships between configuration items (the “atoms” of an IT environment).   And as long as each of the guys keeps the CMDB up-to-date, nobody ends up being confused.
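
To make that a little more concrete, here is a minimal Python sketch of a CMDB as a set of current and historical relationships between configuration items. The names (TinyCMDB, Relationship, relate, retire, as_of) and the integer timestamps are my own illustrative assumptions, not anything defined by ITIL or by a particular product.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Relationship:
    source: str                      # configuration item, e.g. "web-app"
    kind: str                        # e.g. "runs_on"
    target: str                      # e.g. "server-01"
    valid_from: int                  # hypothetical timestamp (any increasing int)
    valid_to: Optional[int] = None   # None means the relationship is still current


class TinyCMDB:
    """Hypothetical in-memory CMDB: keeps current and historical relationships."""

    def __init__(self) -> None:
        self.relationships: List[Relationship] = []

    def relate(self, source, kind, target, at):
        self.relationships.append(Relationship(source, kind, target, at))

    def retire(self, source, kind, target, at):
        # Close the relationship instead of deleting it, so history survives.
        for r in self.relationships:
            if (r.source, r.kind, r.target) == (source, kind, target) and r.valid_to is None:
                r.valid_to = at

    def as_of(self, at):
        # Everyone asking about the same moment gets the same answer.
        return [r for r in self.relationships
                if r.valid_from <= at and (r.valid_to is None or at < r.valid_to)]


cmdb = TinyCMDB()
cmdb.relate("web-app", "runs_on", "server-01", at=1)
cmdb.retire("web-app", "runs_on", "server-01", at=5)
cmdb.relate("web-app", "runs_on", "server-02", at=5)
```

Asking `cmdb.as_of(3)` and `cmdb.as_of(5)` returns the old and the new server respectively, which is the point: one shared record of the elephant, past and present.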


“Simple, neat, and wrong.”

As readers of this blog will know by now, it is my strong belief that we’ll need to adopt solutions to most of the problems associated with Big Data that look the way Big Data looks.  For example, the solutions themselves should be distributed in the same way that Big Data itself is distributed, not unlike the way we network large organizations or even the Internet itself.

By contrast, several players in the big data game seem to think we can wrap up solutions in a single container or application, which brings to mind the famous H. L. Mencken quote:

“For every problem, there is one solution which is simple, neat, and wrong.”

A very famous example is shown in the photo above.  Hydrogen is simple, neat, and highly flammable — unlike helium, which is what we use today.  So, the designers of the Hindenburg did a Mencken.   Hydrogen is simple (hey, it’s the first element on the Periodic Table) and neat (certainly quite easy to manufacture).  The “wrong” part didn’t become clear until it got too close to its mooring mast at Lakehurst NAS in 1937.

So, what’s the hydrogen in the world of Big Data?

I could give you an answer that’s simple and neat, but it would be wrong.

Rather, there are several different types of hydrogen in use today in the world of Big Data, so let’s talk about several of them.


Not Evenly Distributed

“The future has already arrived. It’s just not evenly distributed yet.”

In 2003, The Economist quoted part-time futurist and full-time visionary William Gibson: “The future is already here – it’s just not evenly distributed.”

William was saying that many of the things we often long for in some yet-unrealized future are in fact already here.  But they’re here in such limited quantities and such poorly understood locations that most of us don’t know about them yet.

For example, the picture above is a recent satellite picture of the Korean Peninsula at night – the particular future that involves electric lights (so we don’t bump into stuff at night) clearly hasn’t been evenly distributed to the northern portion of that peninsula yet.

The reason for starting today’s post this way is to try once again to make a similarly visionary statement:

“Big Data is already here – it’s just not evenly distributed.”

What’s the point of this? There are two points, and they’re both actually pretty simple. 


Index Everything or Guess

In the most recent edition of Executive Counsel Magazine, Matthew Nelson authors an article titled “Preventing an E-Discovery Cost Cascade”.  In his article, attorney Nelson outlines the limitations of three traditional collection tools: manual collection, network collection, and “index-everything” network collection.  Regarding “index everything,” Nelson says:

“Index-everything” network-based collection tools, sometimes called enterprise search technologies, create a searchable data base in advance of a search over an organization’s entire environment. However, the search can take months or even years and is never complete, and thus is both costly and risky.

Sometimes I pine for the ability to tell people in public what Dan Aykroyd used to say to Jane Curtin weekly on Saturday Night Live to point out her relatively naïve view of the world.  Instead, let’s just say Nelson couldn’t be more wrong.  He has obviously confined his thinking to the limited capabilities of his own company’s tools. On the subject of “index everything,” I can personally vouch for the fact that not only does it work, it is fast becoming recognized as the only way to address large eDiscovery matters (along with a whole host of other problems associated with Big Data).  Let me give one example to drive the point home:

We worked with a customer who started with 5 petabytes of data that needed to be reviewed in an extremely large legal matter.  Using an “index everything” approach, that 5 petabytes was reduced to 132 terabytes of potentially relevant data (i.e. reduced to about 2.6% of the initial data set) and ultimately 1 terabyte of data that was turned over as relevant (down to .02% of the initial data set). This was all done in TWO weeks without moving any data anywhere, and the company documented savings in the tens of millions of dollars in avoided review costs.
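
For anyone who wants to check the reduction for themselves, here is a small Python sketch that works the percentages from the three volumes quoted above. It assumes decimal units (1 PB = 1,000 TB); the variable and function names are mine.

```python
PB_IN_TB = 1000  # assumption: decimal units (1 PB = 1,000 TB)

initial_tb = 5 * PB_IN_TB        # the 5 petabytes the customer started with
potentially_relevant_tb = 132    # after the "index everything" pass
produced_tb = 1                  # what was finally turned over


def share(part_tb, whole_tb):
    """Return part as a percentage of whole."""
    return 100.0 * part_tb / whole_tb


print(f"potentially relevant: {share(potentially_relevant_tb, initial_tb):.2f}%")
print(f"turned over:          {share(produced_tb, initial_tb):.2f}%")
```

With these assumptions the two lines print 2.64% and 0.02%.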


The Rise of the Data Network

You mean, we can learn a lesson about Big Data from the Internet?

Some of you reading this will recall the day when we first learned that connecting things together, and making them easier to get to, was preferable to moving or copying things about to put them in a convenient place.  Examples include client/server (which weaned us from the mainframe as the dominant way that business computing occurred), networked PCs (which are now so taken for granted that just about every device we imagine wants to be networked, including my car), hyperlinks in web pages (i.e. don’t copy, link), mobile devices, and so on.

Just as we networked desktops, web pages, mobile devices, and applications, we must now seriously contemplate networking data.  I will assert that networking lots of data together is what will make it Big Data, not analytics platforms like Hadoop, and not large databases.  And this is a really big idea.

“Manage in Place” is why we’ll do it; “Data Networks” is how we’ll do it.

So, if the goal is to create data networks, what would such a data network infrastructure look like?  While this is all still fairly new, I think we can safely make a few predictions about what a big data network would look like.

First, working from the ground up, there should be a standard way to access any data source that falls under the heading of Big Data.  Think Ethernet/Ethernet connector for Big Data.  This doesn’t exist yet, but needs to, so that anyone who builds a data repository of any kind (whether structured or unstructured) can link it into the big data network (and access it remotely) in a consistent, open, but secure way.

Until such a standard way exists, data network solutions will have to have (so vendors of data network components will have to build) custom connectors to each data source that needs to be managed.  From an industry perspective this is rather inefficient (since vendors are going to end up writing the same interface code over and over again), but so were proprietary networks until the network industry evolved.  But let’s hope, as an industry, we don’t have to live with this proprietary technology for too long.
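
Since no such standard exists yet, the following Python sketch is purely hypothetical: it shows what a common connector interface might look like, with two stand-in connectors (one unstructured, one structured) behind it. Every name here (DataSourceConnector, list_items, describe, and the two connector classes) is an assumption made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator


class DataSourceConnector(ABC):
    """Hypothetical common interface every repository would expose."""

    @abstractmethod
    def list_items(self) -> Iterator[str]:
        """Yield an identifier for each item in the repository."""

    @abstractmethod
    def describe(self, item_id: str) -> Dict[str, object]:
        """Return metadata for one item without moving its content."""


class FileShareConnector(DataSourceConnector):
    # Stand-in for an unstructured source; backed by a plain dict here.
    def __init__(self, files: Dict[str, bytes]) -> None:
        self._files = files

    def list_items(self):
        return iter(sorted(self._files))

    def describe(self, item_id):
        return {"id": item_id, "size": len(self._files[item_id]), "kind": "file"}


class TableConnector(DataSourceConnector):
    # Stand-in for a structured source: each row is one item.
    def __init__(self, rows: Dict[str, dict]) -> None:
        self._rows = rows

    def list_items(self):
        return iter(sorted(self._rows))

    def describe(self, item_id):
        return {"id": item_id, "size": len(self._rows[item_id]), "kind": "row"}


def inventory(sources):
    """Walk any mix of sources through the same two calls."""
    return [src.describe(item) for src in sources for item in src.list_items()]


sources = [
    FileShareConnector({"memo.txt": b"hello"}),
    TableConnector({"cust-1": {"name": "Acme", "tier": "gold"}}),
]
report = inventory(sources)
```

The `inventory` function never needs to know which kind of repository it is talking to, which is exactly what the custom, per-source connectors of today cannot offer.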


“Space, you see, is just enormous – just enormous.”

In Bill Bryson’s wonderful book A Short History of Nearly Everything, he writes in his usual understated but unassailable style:

Space, you see, is just enormous — just enormous. Let’s imagine, for purposes of edification and entertainment, that we are about to go on a journey by rocketship. We won’t go terribly far — just to the edge of our own solar system —but we need to get a fix on how big a place space is and what a small part of it we occupy.

Now the bad news, I’m afraid, is that we won’t be home for supper. Even at the speed of light, it would take seven hours to get to Pluto.

The reason for starting today’s post this way is to try to make a similarly understated but equally unassailable statement (and to try to get credit for saying it first, for all you Google-trained Internet archeologists out there, if the truth needs to be told):

“Big Data, you see, is big – just big.”

Okay, now that I’ve launched myself into Internet immortality, what’s the point of this?

It’s actually pretty simple.


90% of Everything is Crud

Early in his career, science fiction writer Theodore Sturgeon used to have to defend his choice to write science fiction against critics who didn’t believe science fiction to be real literature.  Finally, in reply to the accusation that 90% of science fiction is crud, he retorted, “Sure, 90% of science fiction is crud. That’s because 90% of everything is crud.” This reply has become so famous that it’s now called Sturgeon’s Law.

Research also shows that stale and unstructured data runs rampant and the problem only escalates the longer it remains unresolved.  We also know that at least 80% of business data is unstructured and 70% of this unstructured data becomes stale 90 days after its creation, and is never touched or used again.

In addition to these staggering statistics, most conservative estimates show that data grows 50% – 100% annually.   So, over a 10-year cycle, even at the low end of that range, your data will grow to more than 57 times its current size (an increase of roughly 5,670%).
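
The compounding behind a figure like this is easy to check. Here is a small Python sketch, assuming growth compounds once a year at a constant rate (the function name is my own). At 50% a year, nine compounding periods give about 38.44 times the starting size and ten give about 57.67 times; at 100% a year, ten periods give 1,024 times.

```python
def growth_multiple(annual_rate, years):
    """Size after `years` of compounding, as a multiple of the starting size."""
    return (1 + annual_rate) ** years


for rate in (0.5, 1.0):
    for years in (9, 10):
        multiple = growth_multiple(rate, years)
        increase_pct = (multiple - 1) * 100
        print(f"{rate * 100:.0f}%/yr for {years} yrs: "
              f"{multiple:,.2f}x the original (+{increase_pct:,.0f}%)")
```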

Who Cares? 

What, you might reasonably ask, does all this have to do with Information Management or (as we prefer to call it here) Big Data? 

Everything, as it turns out, because 90% of your Big Data is crud.  And that observation can be both quite daunting and quite liberating.

Now, we can do a lot with these observations as a starting point (and I will in future blogs), but my topic here is to contrast two popular approaches to information management:

Move to Manage:  This is the idea that, in order to do all the things we want to do to this mountain of data (manage it), we have to move it somewhere. We bring the data to the processor, as we did with mainframes.

Manage in Place:  This is the idea that we can, in fact, do all the things we want to do with this mountain of data right where it was born, lives, and dies.  We move the processor to the data, as we do in the networked, mobile, interconnected, Web 2.0 world.

We’ve actually gone through a bit of an evolution in where and how we manage information, especially as it relates to where it has to (or can) reside in order to manage it.  In the first generation, we had outsourced solutions, where we actually had to manually move boxes of paper to large warehouses; think Iron Mountain in its first incarnation.

In the next, we had specialty, hardware-based storage solutions.  These were very large, very complex, and very expensive; think FileNet, Documentum, or Interwoven.

 
In the third generation, we evolved away from physical constraints (warehouses and hardware), but still had the idea of a mega-repository, because data’s native environment wouldn’t support the necessary operations; think Autonomy, Enterprise Vault, Simpana, NearPoint.  These were far simpler solutions than the first two generations, but still required data to be moved to be managed.

Now, as we’ve already observed, any approach where you have to move data to manage it means that 90% of what you’re moving is crud, and most of that data won’t ever be used again after 90 days.  So, why are we moving all this crud, when modern networking technology and protocols make that unnecessary?  Good question.

Stop Moving the Data to Manage It 

I think it’s a pretty safe conclusion that what we really want to do is not move the data at all, but manage that data in place, where it was created, lives, and dies.  Unless, of course, there are specific reasons to move it (like legal retention, disaster recovery, or backup/restore).

Modern Information Management solutions will manage data-in-place, where it is  supposed to be managed.  No need to spend more money for storage than you already have.  It is, after all, your data; don’t you deserve to choose where it resides?


Seven strategies for efficiently managing ‘millions of moving parts’

When people talk about petabytes or exabytes of data, they make it sound like one big homogeneous collection of data.  This misunderstanding is further advanced by the idea (being floated by many) that the problem will be solved with a ‘mega-repository’ approach of collecting it all and putting it in one place. 

Centralization strategies like this are doomed to fail. 

We have a long history in IT of watching new categories of management technologies emerge.  We always follow the same course:  it starts with centralization, evolves to mass anarchy, finally culminating in a distributed management solution.  We saw this in databases, network/systems management, cloud-based services, and CRM.  We will see it again in Big Data Management.  

For companies wanting to get out in front of the Big Data problem, the key to success is to implement a ‘smart switch’ solution, like Cisco brought to the network world.  The net effect will be to have an ‘always on’ infrastructure capable of discovering, analyzing, and acting on data in real-time. 

Companies looking to embrace this approach will need to employ 7 new strategies for dealing with the realities of managing and using Big Data.  These are the principal design goals for our “Big Data Switch”:

  1. Understand your data topology
  2. Leave data where it is
  3. Employ real-time indexing of the full environment
  4. Store the intelligence about your data (metadata)
  5. Create an Information Intelligence Service center
  6. Classify automatically and continuously; enforce policies proactively
  7. Employ change management disciplines to stay current

Understanding your data topology just means you need to understand where important data is created and resides in your environment.  For many just starting out, this will be challenge enough (as most organizations have no idea where their data actually is).

Once you know where your data is, you might decide to try to centralize it.  And while you might decide to move some of it around, in most cases you’ll want to leave it where it is – stale data is useless, and the most effective way to keep data fresh and relevant is to keep it close to where it’s created and used.

You’ll then need to index that data so you know what you have.  Since data is almost continuously being created, deleted, and updated, you’ll need to index your data in real-time – stale metadata is as useless as stale data.
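
As a rough illustration of what “index in real-time” means in practice, here is a minimal Python sketch of an index that is kept current by applying change events as they happen, rather than by periodic re-crawls. The event format and the LiveIndex class are hypothetical; a real system would receive events from the storage platforms themselves.

```python
class LiveIndex:
    """Hypothetical metadata index kept current by change events."""

    def __init__(self):
        self.entries = {}   # path -> metadata dict

    def apply(self, event):
        kind, path = event["kind"], event["path"]
        if kind in ("created", "updated"):
            # Overwrite so the index always reflects the latest known state.
            self.entries[path] = {"size": event["size"], "seen_at": event["at"]}
        elif kind == "deleted":
            self.entries.pop(path, None)   # ignore deletes for unknown paths
        else:
            raise ValueError(f"unknown event kind: {kind}")


index = LiveIndex()
events = [
    {"kind": "created", "path": "/hr/plan.doc", "size": 10, "at": 1},
    {"kind": "updated", "path": "/hr/plan.doc", "size": 12, "at": 2},
    {"kind": "created", "path": "/tmp/scratch", "size": 1, "at": 3},
    {"kind": "deleted", "path": "/tmp/scratch", "at": 4},
]
for e in events:
    index.apply(e)
```

After the four events, the index holds a single entry for `/hr/plan.doc` with its latest size, and no trace of the deleted scratch file.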

You’ll then need to store this “data about data” someplace where it can be most useful.  You can expect to centralize this metadata at first, but long-term we’ll figure out how to federate it (as we did with DNS and SS7).

The best way to make that metadata useful is to put an Information Intelligence Service (IIS) in front of it – that way, you can hide the actual topology of your metadata from the various uses of that metadata as that topology evolves.
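
A minimal Python sketch of that idea, assuming metadata stores can be modeled as simple dictionaries (the MetadataService class and its lookup method are hypothetical names): callers ask the service a question and never learn whether the answer came from one central store or from several federated ones.

```python
class MetadataService:
    """Hypothetical facade: callers ask questions, not where metadata lives."""

    def __init__(self, stores):
        self._stores = list(stores)   # one central store today, several later

    def lookup(self, item_id):
        # Try each store in turn; the caller never sees which one answered.
        for store in self._stores:
            if item_id in store:
                return store[item_id]
        return None


# Day one: all metadata sits in a single central store.
central = {"a.doc": {"owner": "legal"}, "b.xls": {"owner": "finance"}}
service = MetadataService([central])
before = service.lookup("b.xls")

# Later: the metadata is split across two stores; callers do not change.
service = MetadataService([
    {"a.doc": {"owner": "legal"}},
    {"b.xls": {"owner": "finance"}},
])
after = service.lookup("b.xls")
```

The lookup returns the same answer before and after the split, which is all the service has to guarantee.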

The most common application of our IIS is classification and policy.  The volumes of data will require that classification be continuous and automated, and that policies applied to the data being classified be proactive – data will evolve too quickly to try to do it after-the-fact.
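
Here is one way that might look, as a small Python sketch. The rules, labels, and policy actions are invented for illustration; the point is only that classification runs the moment an item is seen, and the policy decision comes back with it rather than in a later clean-up pass.

```python
import re

# Hypothetical rules: (label, pattern matched against a file path).
RULES = [
    ("legal", re.compile(r"contract|nda", re.IGNORECASE)),
    ("finance", re.compile(r"invoice|ledger", re.IGNORECASE)),
]

# Hypothetical policies: label -> action to take as soon as an item is classified.
POLICIES = {"legal": "retain", "finance": "retain", "unclassified": "review"}


def classify(path):
    """Return the first matching label, or "unclassified" if no rule matches."""
    for label, pattern in RULES:
        if pattern.search(path):
            return label
    return "unclassified"


def on_item_seen(path):
    """Classify an item the moment it is indexed and return the policy action."""
    label = classify(path)
    return label, POLICIES[label]
```

Calling `on_item_seen("/share/NDA-acme.pdf")` returns `("legal", "retain")`, while a path that matches no rule falls through to `("unclassified", "review")`.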

Because the data guiding these indexing, classification, and policy mechanisms are critical to managing Big Data, this so-called “control data” must be placed under change control – stale control data is even more useless than stale data and metadata.

We’ll be talking more about each of these strategies in future posts, but you can rest assured that the only way to effectively manage millions of moving parts of data is to apply these seven strategies.
