Proper Wildcard Searching: Why You Should Give a Dam* – eDiscovery Best Practices
September 25, 2012
When we launched eDiscoveryDaily over two years ago, I relayed a story where I provided search strategy assistance to a client that had already agreed upon several searches with opposing counsel. One search related to mining activities, so the attorney decided to use a wildcard of “min*” to retrieve variations like “mine”, “mines” and “mining”. That one search retrieved over 300,000 files with hits.
Why? Because there are 269 words in the English language that begin with the letters “min”. Words like “mind”, “mingle”, “minimal”, “minuscule” and “minutia” were all being retrieved in this search for files related to “mining”. We ultimately had to go back to opposing counsel and negotiate a revised search that was more appropriate.
Recently, I encountered another client who was trying to use “dam*” to retrieve variations of “damage” and “damages”. Unfortunately, they also retrieved “dame”, “damp” and, well, “damn”. There are 86 total words in the English language that begin with the letters “dam”. Darn it!
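To see why a trailing wildcard over-retrieves, here is a minimal Python sketch of how such a search expands against the words in a collection. The vocabulary list and the expand_wildcard function are hypothetical, made up for illustration; a real search engine would do this against its own index.

```python
# Hypothetical vocabulary standing in for the distinct words in a collection.
vocabulary = ["damage", "damages", "damaged", "dame", "damp", "damn", "dam", "data", "mining"]


def expand_wildcard(prefix, words):
    """Return every word that a trailing-wildcard search on `prefix` would match."""
    prefix = prefix.lower()
    return sorted(w for w in words if w.lower().startswith(prefix))


# "dam*" picks up the intended terms along with several unintended ones.
matches = expand_wildcard("dam", vocabulary)
print(matches)
```

Only three of the seven words returned relate to “damage”; the rest are noise that someone would have to review.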
Methods to Retrieve the Correct Wildcard Variations
In that blog post, I talked about the benefits of stem searching (if your application’s search engine supports stem searches) to capture the specific variations of a word (like “mine” or “damage”), and about Morewords.com, which shows a list of words that begin with your search string. For example, to get all 269 words beginning with “min”, go here. Substitute any characters for “min” in the URL to see the words that start with those characters. Choose the variations you want and incorporate them into the search instead of the wildcard – i.e., use “(mine or mines or mining)” instead of “min*” to retrieve a more precise result set without sacrificing recall. Personally, I almost never use wildcards – I prefer to identify the variations and just use them, because it’s more precise.
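As a rough illustration of swapping the wildcard for chosen variations, the Python sketch below builds the explicit “or” search and checks sample text against it. The function names are hypothetical, and the lowercase “or” operator simply mirrors the example above; your search tool’s syntax may differ.

```python
import re


def build_or_query(variations):
    """Join the chosen variations into one explicit search expression."""
    return "(" + " or ".join(variations) + ")"


def matches_any(text, variations):
    """True if the text contains any chosen variation as a whole word."""
    pattern = r"\b(?:" + "|".join(re.escape(v) for v in variations) + r")\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


chosen = ["mine", "mines", "mining"]
print(build_or_query(chosen))
print(matches_any("The mining permit was renewed.", chosen))
print(matches_any("Keep in mind the minimal cost.", chosen))
```

Because the match is on whole words, “mind” and “minimal” no longer produce hits, while every variation you selected still does.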
Introducing Spelling Variations into the Mix
The above approaches assume that words are spelled correctly in the collection – if they are not, those misspellings won’t be retrieved. Misspellings can include Optical Character Recognition (OCR) errors, where the OCR application fails to render all words read from an image file with 100% accuracy (this is common, especially when the resolution of the image is less than optimal). So, you can get “words” in the collection such as “min1ng” or “MININ6”.
To combat this, you’ll need to identify the variations of the terms you wish to use. Then you can use a search tool like CloudNine Discovery’s Early Case Assessment application (FirstPass®, powered by Venio FPR™) that supports “fuzzy” searching, which is a mechanism for finding alternate words that are close in spelling to the word you’re looking for (usually one or two characters off). FirstPass will display all of the words – in the collection – close to the word you’re looking for, so if you’re looking for “mining”, you can find variations such as “min1ng”, “MININ6” or even “minig” – that could be relevant. Then, simply select the variations you wish to include in the search. You’ll need to repeat this for each of the variations of the terms you wish to use, but it will enable you to pick up those misspellings and OCR errors to ensure completeness.
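For a sense of what “one or two characters off” can mean in practice, here is a minimal Python sketch that uses edit distance (the number of single-character insertions, deletions or substitutions needed to turn one word into another). This is a generic technique chosen for illustration, not a description of how FirstPass implements fuzzy searching, and the function names and word list are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def find_close_words(target, words, max_distance=2):
    """Return words within max_distance edits of target, ignoring case and exact matches."""
    target = target.lower()
    return [w for w in words
            if 0 < edit_distance(target, w.lower()) <= max_distance]


# Hypothetical list of distinct words found in a collection.
collection_words = ["mining", "min1ng", "MININ6", "minig", "minimal", "mind"]
print(find_close_words("mining", collection_words))
```

The candidates returned are only suggestions; as described above, you would still review the list and select the ones that belong in the search.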
So, what do you think? Do you use wildcards in your searches? Are you sure you’re getting just the terms you want? Please share any comments you might have or if you’d like to know more about a particular topic.
Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine Discovery. eDiscoveryDaily is made available by CloudNine Discovery solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscoveryDaily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.