
About the Bloggers

Brad Jenkins

Brad Jenkins, President and CEO of CloudNine Discovery, has over 20 years of experience leading customer-focused companies in the litigation support arena. Brad has authored many articles on litigation support issues, and has spoken before national audiences on document management practices and solutions.

Doug Austin

Doug Austin, Professional Services Manager for CloudNine Discovery, has over 20 years' experience providing legal technology consulting and technical project management services to numerous commercial and government clients. Doug has also authored several articles on eDiscovery best practices.

Jane Gennarelli

Jane Gennarelli is a principal of Magellan’s Law Corporation and has been assisting litigators in effectively handling discovery materials for over 30 years. She authored the company’s Best Practices in a Box™ content product and assists firms in applying technology to document handling tasks. She is a known expert and often does webinars and presentations for litigation support professionals around the country. Jane can be reached by email at jane@litigationbestpractices.com.

eDiscovery Best Practices: For Successful Predictive Coding, Start Randomly

August 20, 2012

By Doug Austin


Predictive coding is the hot eDiscovery topic of 2012, with three significant cases (Da Silva Moore v. Publicis Groupe, Global Aerospace v. Landow Aviation and Kleen Products v. Packaging Corp. of America) either approving or considering the use of predictive coding for eDiscovery.  So, how should your organization begin when preparing a collection for predictive coding discovery?  For best results, start randomly.

If that statement seems odd, let me explain. 

Predictive coding is the use of machine learning technologies to categorize an entire collection of documents as responsive or non-responsive, based on human review of only a subset of the document collection.  That subset of the collection is often referred to as the “seed” set of documents.  How the seed set of documents is derived is important to the success of the predictive coding effort.

Random Sampling, It’s Not Just for Searching

In our series of posts (available here, here and here) on best practices for random sampling to test search results, we focused on searching; however, searching is not the only eDiscovery activity where sampling a set of documents is a good practice. It's also a vitally important step in deriving the seed set of documents upon which the predictive coding software's learning decisions will be made. As with any random sampling methodology, you begin by determining the appropriate sample size to represent the collection, based on your desired confidence level and an acceptable margin of error (as noted here). To ensure that the sample properly represents the collection, it must be drawn from the entire collection to be predictively coded.
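The post doesn't include code, but the sample-size calculation it describes (pick a confidence level and margin of error, then solve for the number of documents) can be sketched in Python. This is an illustrative sketch only, not part of any eDiscovery product; the function name and the z-score table are my own, and it uses the standard conservative assumption of 50% prevalence plus a finite population correction:

```python
import math

# Two-tailed z-scores for common confidence levels
Z_SCORES = {0.90: 1.645, 0.95: 1.960, 0.99: 2.576}

def sample_size(population, confidence=0.95, margin_of_error=0.05, p=0.5):
    """Estimate how many documents to sample from `population` at the
    given confidence level and margin of error. p=0.5 (50% assumed
    responsiveness prevalence) is the most conservative choice: it
    maximizes the required sample size."""
    z = Z_SCORES[confidence]
    # Infinite-population sample size
    n = (z ** 2) * p * (1 - p) / margin_of_error ** 2
    # Finite population correction for real-world collections
    n_adj = n / (1 + (n - 1) / population)
    return math.ceil(n_adj)

# e.g., a 100,000-document collection at 95% confidence, +/-2% margin
print(sample_size(100_000, confidence=0.95, margin_of_error=0.02))  # -> 2345
```

Note how weakly the result depends on collection size: at 95% confidence and a 5% margin, a 100,000-document collection needs only a few hundred sampled documents, which is what makes sampling practical at eDiscovery scale.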

Given the debate in the above cases regarding the acceptability of the proposed predictive coding approaches (especially Da Silva Moore), it's important to be prepared to defend your predictive coding approach, and conducting a random sample to generate the seed documents is a key step toward the defensibility of that approach.

Then, once the sample is generated, the next key to success is the use of a subject matter expert (SME) to make responsiveness determinations.  And, it's important to conduct a sample (there's that word again!) of the result set after the predictive coding process to determine whether the process achieved sufficient quality in automatically coding the remainder of the collection.
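That validation step, having the SME re-review a random sample of the machine-coded documents, boils down to putting a confidence interval around the observed error rate. Here is a minimal sketch using the normal approximation to the binomial; the function name and the choice of interval method are my own illustrative assumptions, not a prescribed protocol:

```python
import math

def validation_interval(errors, sample_size, confidence=0.95):
    """Given the number of miscoded documents an SME found when
    re-reviewing a random sample of the machine-coded result set,
    return a (low, high) confidence interval for the true error rate,
    using the normal approximation to the binomial proportion."""
    z = {0.90: 1.645, 0.95: 1.960, 0.99: 2.576}[confidence]
    p = errors / sample_size
    half_width = z * math.sqrt(p * (1 - p) / sample_size)
    return max(0.0, p - half_width), min(1.0, p + half_width)

# e.g., 12 miscoded documents found in a 400-document random sample
low, high = validation_interval(12, 400)
print(f"error rate between {low:.1%} and {high:.1%} at 95% confidence")
```

If the upper bound of that interval is below whatever quality threshold the parties have agreed on, the automated coding of the remainder of the collection is defensible on the numbers; if not, more training and another validation pass are in order.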

So, what do you think?  Do you start your predictive coding efforts "randomly"?  You should.  Please share any comments you might have, or let us know if you'd like to hear more about a particular topic.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine Discovery. eDiscoveryDaily is made available by CloudNine Discovery solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscoveryDaily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.



  • August 22, 2012 Craig Ball

    "How would you handle large collections with unevenly distributed content?"

I'm not certain, which is one reason I raise the issue here. As a statistician, I'm a good lawyer. ;-) Modeling what I'd do from a blog post in the abstract is going to be pretty hard to defend on the stand--hard enough for me, absurd for the reader.

    On the sampling front, a prominent commentator recently shared the following (I don't have permission to name this person, so please forgive the failure to attribute properly):

    "A...source of confusion appears to be the distinction between random seed set construction and random sampling for validation of effectiveness. If you see some blogger say "I chose a seed set of 1507 documents, because that gave me plus-or-minus 3% with 95% confidence," you can be sure they have no idea what they are talking about. "Plus-or-minus" (margin of error) and confidence [level] are terms of art that pertain to validation, not training set selection. If you are going to use sampling for validation, it absolutely has to be random. And it absolutely has to be separated by a firewall from any sampling you do for the purpose of creating a training set. Otherwise any validation derived from the sample is potentially biased and therefore unsound."

    I thought this powerfully put.

  • August 20, 2012 Doug Austin

    Craig, thanks for the comment. If the post is an oversimplification, I apologize, but I was basically talking about using confidence level and margin of error to determine an appropriate sample size in each case (whether securing the training set or testing the quality of the outcome).

    As for sampling in a large collection with skewed or unevenly distributed content (whether due to selective culling or not), I can relate one case I worked on as an example (not a predictive coding scenario, but a collection testing scenario). We had three distinctly different groups of documents where the sizes were very disproportionate (i.e., approximately 60%, 30% and 10% of the collection respectively, with the 10% group deemed to be the most likely responsive group based on source). In that case, we decided it was appropriate to perform a stratified sample where we sampled each group independently to ensure that each group was sufficiently covered (especially the most likely responsive 10% group) and make appropriate decisions on each group.

    That’s what we did in that case. How would you handle large collections with unevenly distributed content?

  • August 20, 2012 Craig Ball

    When you speak in terms of confidence levels and error rates in the context of securing the training set (as distinguished from testing the quality of the outcome), are you conflating two very different things?

    Also, in terms of skewing the results, how do you address the impact of using selective culling based on keywords to populate the collection being sampled (or preserved)?

    Though I agree with the core principle you put forward ("start randomly"), we need to address (in plain language) how random sampling works in practice in very large collections with unevenly distributed responsive content.
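The stratified approach Doug describes in the exchange above (sampling each group independently so the small "most likely responsive" group isn't swamped by the larger ones) can be sketched in Python. Everything here is hypothetical: the group names, sizes, and the 60/30/10 split simply mirror the example in the comment, and the function is my illustration rather than anyone's actual workflow:

```python
import math
import random

def stratified_sample(strata, confidence=0.95, margin_of_error=0.05):
    """Draw an independent random sample from each stratum, sizing each
    sample for the stated confidence level and margin of error, so that
    small-but-important groups are covered on their own terms.
    `strata` maps a group name to its list of document IDs."""
    z = {0.90: 1.645, 0.95: 1.960, 0.99: 2.576}[confidence]
    samples = {}
    for name, docs in strata.items():
        n = (z ** 2) * 0.25 / margin_of_error ** 2       # p = 0.5, conservative
        n = math.ceil(n / (1 + (n - 1) / len(docs)))     # finite population correction
        samples[name] = random.sample(docs, min(n, len(docs)))
    return samples

# Hypothetical collection mirroring the 60% / 30% / 10% split above
strata = {
    "group_a": [f"a{i}" for i in range(60_000)],
    "group_b": [f"b{i}" for i in range(30_000)],
    "group_c": [f"c{i}" for i in range(10_000)],  # most likely responsive
}
samples = stratified_sample(strata)
```

The point of stratifying is visible in the numbers: each group gets a few hundred sampled documents regardless of its share of the collection, so the 10% group is reviewed just as rigorously as the 60% group instead of contributing only a handful of documents to one pooled sample.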
