eDiscovery Searching: Proximity, Not Absence, Makes the Heart Grow Fonder
January 14, 2011
Recently, I assisted a large corporate client with several searches across the company’s enterprise-wide document management systems (DMS) for ESI potentially responsive to the litigation. Some of the individual searches on these systems retrieved over 200,000 files on their own!
DMS platforms are great for what they are intended to do: provide a storage archive for documents generated within the organization, track versions of those documents and enable individuals to locate specific documents for reference or modification (among other things). However, few of them are developed with litigation retrieval in mind. Sure, they have search capabilities, but using them can sometimes be like using a sledgehammer to hammer a thumbtack into the wall: advanced features to increase the precision of those searches are often lacking.
Let’s say that, at an oil company, you’re looking for documents related to “oil rights” (such as “oil rights”, “oil drilling rights”, “oil production rights”, etc.). You could perform phrase searches, but any variations that you didn’t think of would be missed (e.g., “rights to drill for oil”). You could perform an AND search (i.e., “oil” AND “rights”), and that could very well retrieve all of the files related to “oil rights”, but it would also retrieve a lot of files where “oil” and “rights” both appear but have nothing to do with each other. A search for “oil” AND “rights” in an oil company’s DMS may retrieve every published and copyrighted document in the system that mentions the word “oil”. Why? Because almost every published and copyrighted document will contain the phrase “All Rights Reserved”.
That’s an example of the type of issue we were encountering with some of those searches that yielded 200,000 files with hits. And, that’s where proximity searching comes in. Proximity searching is simply looking for two or more words that appear close to each other in the document (e.g., “oil within 5 words of rights”) – the search will only retrieve the file if those words are as close as specified to each other, in either order. Proximity searching helped us reduce that collection to a more manageable number for review, even though the enterprise-wide document management system didn’t have a proximity search feature.
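To make the idea concrete, here is a minimal Python sketch of an “oil within 5 words of rights” test. It is only an illustration, not how any particular review tool implements proximity searching: the function names are hypothetical, words are split on anything that isn’t a letter or digit, and “within 5 words” is assumed to mean the two word positions differ by no more than 5.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and digits only)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def within(text, term_a, term_b, max_distance):
    """Return True if term_a and term_b occur within max_distance words
    of each other, in either order (assumed definition of proximity)."""
    tokens = tokenize(text)
    positions_a = [i for i, t in enumerate(tokens) if t == term_a.lower()]
    positions_b = [i for i, t in enumerate(tokens) if t == term_b.lower()]
    return any(abs(a - b) <= max_distance
               for a in positions_a for b in positions_b)

# Made-up sample text: one relevant sentence, one copyright-notice false hit.
relevant = "The lease grants the rights to drill for oil on the property."
boilerplate = "Oil price outlook for 2011. Copyright 2011. All Rights Reserved."

print(within(relevant, "oil", "rights", 5))
print(within(boilerplate, "oil", "rights", 5))
```

Both sample texts would be hits for “oil” AND “rights”, but only the first passes the proximity test; in the second, “Oil” and “Rights” are eight words apart.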
How? We wound up taking a two-step approach to get the collection to a more likely responsive set. First, we did the “AND” search in the DMS, understanding that we would retrieve a large number of files, and exported those results. After indexing them with a first pass review tool that has more precise search alternatives (at Trial Solutions, we use FirstPass™, powered by Venio FPR™, for first pass review), we performed a second search on the set using proximity searching to limit the result set to only files where the terms were near each other. Then, we tested the results and revised the search where necessary to retrieve a result set that maximized both recall and precision.
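The two-step approach can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the actual DMS or FirstPass™ workflow: the `documents` dictionary stands in for the files in the DMS, the file names and text are invented, and `and_search` and `proximity_search` are hypothetical helpers.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def and_search(documents, terms):
    """Step 1: broad AND search. Keep files containing every term anywhere.
    Terms are expected in lowercase."""
    return {name: text for name, text in documents.items()
            if all(term in tokenize(text) for term in terms)}

def proximity_search(documents, term_a, term_b, max_distance):
    """Step 2: keep only files where the two terms appear within
    max_distance words of each other, in either order."""
    hits = {}
    for name, text in documents.items():
        tokens = tokenize(text)
        pos_a = [i for i, t in enumerate(tokens) if t == term_a]
        pos_b = [i for i, t in enumerate(tokens) if t == term_b]
        if any(abs(a - b) <= max_distance for a in pos_a for b in pos_b):
            hits[name] = text
    return hits

# Invented stand-ins for files stored in the DMS.
documents = {
    "lease.txt": "Agreement covering oil drilling rights on the north tract.",
    "newsletter.txt": "Oil prices rose again this quarter. "
                      "Copyright 2011. All Rights Reserved.",
    "memo.txt": "Lunch menu for the cafeteria next week.",
}

exported = and_search(documents, ["oil", "rights"])        # broad first pass
narrowed = proximity_search(exported, "oil", "rights", 5)  # precise second pass

print(sorted(exported))
print(sorted(narrowed))
```

In this toy set, the AND search exports both lease.txt and newsletter.txt, and the proximity pass keeps only lease.txt. The testing and revising step would then mean sampling both the kept and the dropped files and adjusting the distance or the terms.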
The result? We were able to reduce an initial result set of 200,000 files to just over 5,000 likely responsive files by applying the proximity search to the first result set. And, we probably saved $50,000 to $100,000 in review costs – on a single search.
I also often use proximity searches as alternatives to phrase searches to broaden the recall of those searches to identify additional potentially responsive hits. For example, a search for “Doug Austin” doesn’t retrieve “Austin, Doug” and a search for “Dye 127” doesn’t retrieve “Dye #127”. One character difference is all it takes for a phrase search to miss a potentially responsive file. With proximity searching, you can look for these terms close to each other and catch those variations.
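Here is a small Python sketch of why that works, using the examples above. It assumes a strict phrase search behaves like a case-insensitive exact match on the raw text (real search engines vary in how they treat punctuation), and that the proximity search splits words on punctuation and spaces. Both functions are hypothetical illustrations.

```python
import re

def phrase_match(text, phrase):
    """Strict phrase search: case-insensitive exact match on the raw text
    (an assumption; some engines ignore punctuation)."""
    return phrase.lower() in text.lower()

def near(text, term_a, term_b, max_distance):
    """Proximity search: both terms within max_distance words, either order."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    pos_a = [i for i, t in enumerate(tokens) if t == term_a.lower()]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b.lower()]
    return any(abs(a - b) <= max_distance for a in pos_a for b in pos_b)

# Each row: (text found in a file, phrase searched, the two proximity terms)
cases = [
    ("Prepared by Doug Austin", "Doug Austin", ("doug", "austin")),
    ("Austin, Doug",            "Doug Austin", ("doug", "austin")),
    ("See Dye 127 for details", "Dye 127",     ("dye", "127")),
    ("See Dye #127 for details", "Dye 127",    ("dye", "127")),
]

for text, phrase, (a, b) in cases:
    print(f"{text!r}: phrase={phrase_match(text, phrase)}, "
          f"proximity={near(text, a, b, 2)}")
```

Under these assumptions, the phrase search misses “Austin, Doug” and “Dye #127”, while a proximity search within two words catches all four. The trade-off is that a wider window also lets in more unrelated hits, so the distance is worth testing.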
So, what do you think? Do you use proximity searching in your culling for review? Please share any comments you might have, or let us know if you’d like to know more about a particular topic.