Index Everything or Guess
In the most recent edition of Executive Counsel Magazine, Matthew Nelson has an article titled “Preventing an E-Discovery Cost Cascade”. In the article, attorney Nelson outlines the limitations of three traditional collection tools: manual collection, network collection, and “index-everything” network collection. Regarding “index everything,” Nelson says:
“Index-everything” network-based collection tools, sometimes called enterprise search technologies, create a searchable data base in advance of a search over an organization’s entire environment. However, the search can take months or even years and is never complete, and thus is both costly and risky.
Sometimes I pine for the ability to tell people in public what Dan Aykroyd used to say to Jane Curtin every week on Saturday Night Live to point out her relatively naïve view of the world. Instead, let’s just say Nelson couldn’t be more wrong. He has obviously confined his thinking to the limited capabilities of his own company’s tools. On the subject of ‘index everything’, I can personally vouch for the fact that not only does it work, it is fast becoming recognized as the only way to address large eDiscovery matters (along with a whole host of other problems associated with Big Data). Let me give one example to drive the point home:
We worked with a customer who started with 5 petabytes of data that needed to be reviewed in an extremely large legal matter. Using an “index everything” approach, that 5 petabytes was reduced to 132 terabytes of potentially relevant data (roughly 2.6% of the initial data set) and ultimately to 1 terabyte that was turned over as relevant (just 0.02% of the initial data set). This was all done in TWO weeks without moving any data anywhere, and the company documented savings in the tens of millions of dollars in avoided review costs.
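To keep the math honest, here is the reduction expressed as a quick back-of-the-envelope calculation (a simple sketch assuming decimal units, where 1 petabyte = 1,000 terabytes):

    # Back-of-the-envelope reduction math for the matter described above
    # (assumes decimal units: 1 PB = 1,000 TB)
    initial_tb = 5 * 1000          # 5 PB starting corpus
    potentially_relevant_tb = 132  # after "index everything" culling
    produced_tb = 1                # ultimately turned over as relevant

    print(f"Potentially relevant: {potentially_relevant_tb / initial_tb:.2%}")  # 2.64%
    print(f"Produced as relevant: {produced_tb / initial_tb:.3%}")              # 0.020%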
Does that sound like a costly and risky approach? The customer knew everything had been examined electronically, had spent significantly less money than initially budgeted, and was ready for negotiations with opposing counsel before they had even started contracting with a legal service provider to do it the old-fashioned way.
Somewhat in contradiction to his own warnings about the risks and pitfalls of an “index everything” approach, Nelson then actually supports the notion that an index is the only way to go, even while exposing the ridiculous nature of his preferred solution.
Specifically, he says:
Next-generation technology solves all these problems by using search indices already created by commonly used business applications. This enables rapid targeted information collection at the point of file creation, thus reducing both the number of irrelevant files collected and later downstream costs of close examination.
The reason this approach fails is completely self-evident: most of the unstructured data that makes up the bulk of an eDiscovery collection is in fact not stored in a form where there is a useful, pre-defined index.
But let’s not jump to conclusions here. Since this is an objection we hear frequently from folks who haven’t taken the time, or made the investment, to actually architect solutions for Big Data problems like eDiscovery, maybe we should start at the beginning and take this apart one step at a time, by agreeing on some basic requirements.
Goldilocks was right about collecting the right amount of data.
Even Nelson agrees with Goldilocks about collecting the right amount of data. Collecting too little of the right data is bad, because an incomplete collection and production exposes you to the risk of “sanctions and penalties.”
Similarly, collecting too much is bad, because it results in costlier attorney document review.
In fact, I think most of us would agree that you want to collect exactly the right amount, so that no false representations are made concerning the completeness of the data produced, and no additional, unnecessary expense is incurred for downstream processing and review.
Goldilocks was wrong about when.
Nelson tells us the best time to collect the right amount of data:
The number of irrelevant files that are collected can be drastically reduced at the beginning of cases, thus significantly reducing downstream processing and attorney review.
In other words, the sooner in the process you discover that you don’t need a file, the more of the potential excess cost you can avoid. See “saved tens of millions of dollars” above.
Goldilocks was right about collecting the right data.
As it happens, the only way to satisfy Goldilocks here is to know as much as you can about all your data, including where it is, who owns it, and what’s in it.
And not just some of your data (i.e., the data in index-friendly applications like Exchange and SharePoint), but all your data, including those pesky laptops, desktops, and mobile devices. And the only way to know what’s in that data is to have a full-text index, not just a metadata index.
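To make the distinction concrete, here is a minimal toy sketch in Python (an illustration only, not any vendor’s actual implementation) of the difference between a metadata index and a full-text inverted index over a set of files:

    import os
    import re
    from collections import defaultdict

    def metadata_index(paths):
        """Index only what the filesystem already tells you: name, size, last modified."""
        return {
            p: {
                "name": os.path.basename(p),
                "size": os.path.getsize(p),
                "modified": os.path.getmtime(p),
            }
            for p in paths
        }

    def full_text_index(paths):
        """Inverted index: map every term to the files whose contents contain it."""
        index = defaultdict(set)
        for p in paths:
            with open(p, errors="ignore") as f:
                for term in re.findall(r"\w+", f.read().lower()):
                    index[term].add(p)
        return index

    # Only the full-text index can answer a content question such as
    # "which files mention 'settlement'?" -- the metadata index simply doesn't know.
    # hits = full_text_index(file_list).get("settlement", set())

The point of the sketch is simply that the metadata index never opens the files, so any question about what is inside them goes unanswered.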
The piece of attorney Nelson’s article that frankly left me bewildered was his implied assertion that any judge or opposing counsel would find a collection and production of data performed by “targeting key data sources by leveraging an application’s native index” to be comprehensive or legally defensible.
Moreover, he seems content to understand very little about the data he’s supposed to produce:
The best next generation collection tools should enable targeted collection by file type, date range, custodian, and keywords.
The simple fact is that there is far too much potentially responsive data across the average enterprise currently stored in unindexed form. Moreover, there is far too much potentially responsive data that requires much more than metadata to fully understand. And none of Nelson’s key data sources does an adequate job of producing full-text indexes across the scope of data required.
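To see where the wheels come off, consider a hypothetical sketch of the kind of targeted-collection filter Nelson describes (every name and parameter here is an illustrative assumption, not anyone’s actual product). The file-type, date-range, and custodian tests are cheap metadata lookups, but the keyword test cannot be satisfied without reading, and in practice indexing, the content of every candidate file:

    import os
    from datetime import datetime, timezone

    def targeted_collection(paths, file_types, date_range, custodians, keywords):
        """Hypothetical 'targeted collection' filter in the spirit of Nelson's criteria."""
        start, end = date_range
        hits = []
        for p in paths:
            # File type, date range, and custodian are cheap metadata checks...
            if os.path.splitext(p)[1].lower() not in file_types:
                continue
            modified = datetime.fromtimestamp(os.path.getmtime(p), tz=timezone.utc)
            if not (start <= modified <= end):
                continue
            if not any(c in p for c in custodians):  # crude: custodian inferred from path
                continue
            # ...but the keyword test forces us to read (i.e., index) the content itself.
            with open(p, errors="ignore") as f:
                text = f.read().lower()
            if any(k.lower() in text for k in keywords):
                hits.append(p)
        return hits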
Unless you create a full-text index of that data and keep that index up to date, there is responsive data you’re not collecting, producing, or reviewing. Count on it.