Proximity Searches Can Be the Right Balance of Recall and Precision – eDiscovery Best Practices
October 08, 2012
When performing keyword searches, the challenge is to balance recall (retrieving as many of the responsive documents as possible) and precision (retrieving as few non-responsive documents as possible). A search with 100% precision returns only responsive documents; however, that does not mean that all of the responsive documents have been retrieved. A search with 100% recall returns all of the responsive documents in the collection; however, it may also return a large number of non-responsive documents, which can drive up review costs. So, how do you perform searches that effectively balance recall and precision?
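To make the two measures concrete, here is a minimal sketch that computes them for a set of retrieved documents. The function name `precision_recall` and the document IDs are made up for illustration; the sketch also assumes you already know which documents are responsive, which in practice you only learn through review or sampling.

```python
def precision_recall(retrieved, responsive):
    """Return (precision, recall) for a set of retrieved document IDs."""
    retrieved = set(retrieved)
    responsive = set(responsive)
    true_hits = retrieved & responsive  # retrieved documents that are responsive
    precision = len(true_hits) / len(retrieved) if retrieved else 0.0
    recall = len(true_hits) / len(responsive) if responsive else 0.0
    return precision, recall


# Hypothetical collection: documents 1-4 are the responsive ones.
responsive_docs = {1, 2, 3, 4}

narrow_search = {1, 2}                    # only responsive hits, but misses two
broad_search = {1, 2, 3, 4, 5, 6, 7, 8}   # finds everything, plus four extras

print(precision_recall(narrow_search, responsive_docs))  # (1.0, 0.5)
print(precision_recall(broad_search, responsive_docs))   # (0.5, 1.0)
```

The narrow search has perfect precision but only half the recall; the broad search has perfect recall, but half of what it returns would be wasted review effort.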
One way is through proximity searching. Proximity searching is simply looking for two or more words that appear close to each other in the document. It’s more precise than an AND search (e.g., termA AND termB) and has better recall than a phrase search (e.g., “termA termB”). Let’s look at an example.
You’re working for an oil company and you’re looking for documents related to “oil rights” (such as “oil rights”, “oil drilling rights”, “oil production rights”, etc.). You could perform phrase searches, but any variations that you didn’t think of would be missed (e.g., “rights to drill for oil”). You could perform an AND search (i.e., “oil” AND “rights”), and that could very well retrieve all of the files related to “oil rights”, but it would also retrieve a lot of files where “oil” and “rights” both appear but have nothing to do with each other. A search for “oil” AND “rights” across an oil company’s various data stores may retrieve many published and copyrighted documents that mention the word “oil” but have nothing to do with “oil rights”. Why? Because almost every published and copyrighted document contains the phrase “All Rights Reserved”, so those documents will be retrieved, even though many of them will likely be non-responsive.
A proximity search like “oil within 5 words of rights” will only retrieve the document if those words are as close as specified to each other, in either order. Proximity searching helps reduce the result set to a more manageable number for review by eliminating the files that happen to mention “oil” and “rights” somewhere in the document, but not in context with each other. Yet it still catches variations of phrases containing “oil” and “rights” that you might not think to search for, as long as the two words fall within the specified distance.
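The difference between the three search types can be sketched in a few lines of Python. This is an illustration only, not how any particular review platform implements its search: the helper names (`tokenize`, `phrase_search`, `and_search`, `within`) and the sample documents are hypothetical, and “within 5 words” is assumed to mean that the word positions differ by at most 5. Real search engines may count distance differently and typically also handle stemming, wildcards and stop words.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def phrase_search(text, phrase):
    """True if the words of the phrase appear consecutively, in order."""
    words, target = tokenize(text), tokenize(phrase)
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


def and_search(text, term_a, term_b):
    """True if both terms appear anywhere in the document."""
    words = set(tokenize(text))
    return term_a.lower() in words and term_b.lower() in words


def within(text, term_a, term_b, max_distance):
    """True if the terms appear at most max_distance words apart, in either order."""
    words = tokenize(text)
    pos_a = [i for i, w in enumerate(words) if w == term_a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == term_b.lower()]
    return any(abs(i - j) <= max_distance for i in pos_a for j in pos_b)


# Hypothetical sample documents.
docs = {
    "lease": "The lease grants oil drilling rights to the company.",
    "memo": "We hold the rights to drill for oil in the region.",
    "report": ("Oil prices rose sharply this quarter according to the annual "
               "industry report. Copyright 2012. All Rights Reserved."),
}

for name, text in docs.items():
    print(name,
          phrase_search(text, "oil rights"),
          and_search(text, "oil", "rights"),
          within(text, "oil", "rights", 5))
# lease False True True
# memo False True True
# report False True False
```

In this small sample, the phrase search misses both relevant documents, the AND search retrieves the “All Rights Reserved” report along with them, and the proximity search retrieves only the two documents in which “oil” and “rights” appear in context with each other.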
Proximity searches are great for searching people’s names, as well. For example, a phrase search for “John Adams” won’t retrieve “Adams, John”, but a proximity search for “John within 3 words of Adams” will retrieve “John Adams”, “Adams, John”, and even “John Q. Adams”.
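The name example can be sketched the same way. The function below, `proximity_spans`, is a hypothetical helper that returns the matched stretch of words rather than a simple yes or no, so you can see what each hit looks like. It again assumes that “within 3 words” means the word positions differ by at most 3, and it lowercases the text and strips punctuation while splitting it into words.

```python
import re


def proximity_spans(text, term_a, term_b, max_distance):
    """Return the word spans where the two terms occur within max_distance words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    a, b = term_a.lower(), term_b.lower()
    spans = []
    for i, word in enumerate(words):
        if word != a:
            continue
        for j, other in enumerate(words):
            if other == b and i != j and abs(i - j) <= max_distance:
                lo, hi = min(i, j), max(i, j)
                spans.append(" ".join(words[lo:hi + 1]))
    return spans


# Hypothetical sample text snippets.
samples = [
    "John Adams",
    "Adams, John",
    "John Q. Adams",
    "John Smith forwarded the signed lease to Samuel Adams",
]

for text in samples:
    print(repr(text), proximity_spans(text, "John", "Adams", 3))
# 'John Adams' ['john adams']
# 'Adams, John' ['adams john']
# 'John Q. Adams' ['john q adams']
# 'John Smith forwarded the signed lease to Samuel Adams' []
```

The last snippet contains both names, but they belong to different people and sit eight words apart, so it is not retrieved.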
When developing a search of two or more related words that effectively balances recall and precision, consider using a proximity search. It just might be the right search for the situation.
So, what do you think? Do you use proximity searching to make your searches more effective? Please share any comments you might have, or let us know if you’d like to know more about a particular topic.
Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine Discovery. eDiscoveryDaily is made available by CloudNine Discovery solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscoveryDaily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.