eDiscovery Best Practices: Perspective on the Amount of Data Contained in 1 Gigabyte
March 05, 2012
Often, the picture used to introduce the blog post is a whimsical (but public domain!) representation of the topic at hand. However, today’s picture is intended to be a bit instructional.
As we work with more data daily and we keep buying larger hard drives to store that data, one gigabyte (GB) of data seems smaller and smaller. Today, you can buy a portable 1 terabyte (TB) drive for less than $100 in some places. Is the GB smaller than it used to be? Last I checked, it’s still about a billion bytes (1024 x 1024 x 1024 or 1,073,741,824 bytes, to be exact).
From a page standpoint, most estimates that I’ve heard put 1 GB at 50,000 to 75,000 pages. Of course, that can vary widely, depending on the file types comprising that GB. With one-page, high-resolution image files of 1 megabyte (MB) each, it takes only about 1,000 pages to equal a GB, whereas with 5 kilobyte (KB) text files and small emails (with minimal attachments), it could take as many as 200,000 pages. So, 50,000 to 75,000 is probably a good average.
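For anyone who wants to check the two extremes above, here is a rough sketch in Python. It assumes every file is exactly one page and exactly the stated size, which real collections never are; the function name `pages_per_gb` is just something made up for illustration.

```python
GB_BYTES = 1024 ** 3  # 1,073,741,824 bytes


def pages_per_gb(avg_bytes_per_page: int) -> int:
    """Number of whole one-page files of the given size that fit in 1 GB.

    Assumption: one file equals one page, and all files are the same size.
    """
    return GB_BYTES // avg_bytes_per_page


image_pages = pages_per_gb(1024 ** 2)  # 1 MB high-resolution image pages
text_pages = pages_per_gb(5 * 1024)    # 5 KB text files and small emails

print(f"1 MB pages per GB: {image_pages:,}")
print(f"5 KB pages per GB: {text_pages:,}")
```

The results land close to the round numbers quoted above (about 1,000 and about 200,000), and the 50,000 to 75,000 estimate sits between them.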
A ream of copy paper is 500 pages and a case holds 10 reams (5,000 pages). So, a GB is the equivalent of 100 to 150 reams of paper (10 to 15 cases), which is enough paper to fill a small truck. Hence, today’s picture shows a truck full of paper.
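The paper math can be sketched the same way. This is only the arithmetic from the paragraph above (500 pages per ream, 10 reams per case); the helper name `paper_equivalent` is hypothetical.

```python
PAGES_PER_REAM = 500
REAMS_PER_CASE = 10


def paper_equivalent(pages: int) -> tuple[float, float]:
    """Convert a page count into (reams, cases) of copy paper."""
    reams = pages / PAGES_PER_REAM
    cases = reams / REAMS_PER_CASE
    return reams, cases


for pages in (50_000, 75_000):
    reams, cases = paper_equivalent(pages)
    print(f"{pages:,} pages = {reams:.0f} reams = {cases:.0f} cases")
```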
There was a Gartner report that re-published Anne Kershaw’s analysis on the cost to manually review 1 TB of data. Quoting from the report, as follows:
“Considering that one terabyte is generally estimated to contain 75 million pages, a one-terabyte case could amount to 18,750,000 documents, assuming an average of four pages per document. Further assuming that a lawyer or paralegal can review 50 documents per hour (a very fast review rate), it would take 375,000 hours to complete the review. In other words, it would take more than 185 reviewers working 2,000 hours each per year to complete the review within a year. Assuming each reviewer is paid $50 per hour (a bargain), the cost could be more than $18,750,000.”
If it costs $18.75 million to review 1 TB, one could extrapolate that to approximately $18,750 to review each GB. Dividing the terabyte figures by 1,000 (ignoring the extra 24, since a TB is really 1,024 GB), the math works out like this: 75,000 pages / 4 pages per document = 18,750 documents; 18,750 documents / 50 documents reviewed per hour = 375 review hours; 375 review hours x $50 per hour = $18,750. I’ve mentioned that figure to clients and prospects, and they almost always seem surprised that it is so high. Then, I ask them: how many hours does it take to review a truckload of paper to determine relevancy to the case? ;-)
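The per-GB extrapolation above can also be written as a small Python sketch, which makes it easy to plug in your own assumptions. The defaults are the figures from the quoted report (4 pages per document, 50 documents per hour, $50 per hour); the function name `manual_review_cost` and its parameters are illustrative, not taken from any tool.

```python
def manual_review_cost(
    pages: int,
    pages_per_doc: float = 4,     # assumption from the quoted report
    docs_per_hour: float = 50,    # "a very fast review rate"
    rate_per_hour: float = 50.0,  # dollars per reviewer hour
) -> tuple[float, float, float]:
    """Return (documents, review hours, cost in dollars) for a page count."""
    docs = pages / pages_per_doc
    hours = docs / docs_per_hour
    cost = hours * rate_per_hour
    return docs, hours, cost


# One GB at 75,000 pages, then one TB at 75 million pages.
for label, pages in (("1 GB", 75_000), ("1 TB", 75_000_000)):
    docs, hours, cost = manual_review_cost(pages)
    print(f"{label}: {docs:,.0f} docs, {hours:,.0f} hours, ${cost:,.0f}")
```

Changing any one input (say, a slower review rate or fewer pages per GB) moves the result proportionally, which is worth keeping in mind when quoting the $18,750 figure.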
Bottom line: each GB effectively culled out through technology (such as early case assessment, first pass review tools like FirstPass™, powered by Venio) can save approximately $18,750 in review costs. That’s why technology-assisted review approaches have become so popular and why it’s important to remember how expensive each additional GB can be.
So, what do you think? Did you realize that each GB was so large or so expensive? Please share any comments you might have or if you’d like to know more about a particular topic.
Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine Discovery. eDiscoveryDaily is made available by CloudNine Discovery solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscoveryDaily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.



Craig, I appreciate the comments. My goal in writing this post was to simply make people aware that each GB could be more data than they think and pure manual review (with no technology culling) of each GB could cost quite a bit more than they think. I used numbers published in a Gartner report from 3 or 4 years ago and extrapolated downward (actually, I think 50,000 to 75,000 pages in a GB is more likely than 75 million pages in a TB). Unless that TB consists entirely of spreadsheets, maybe. :-)
More and more attorneys have come to realize the benefits of technology in assisting with review (regardless of what acronym you wish to use), but many still have not, even with regard to straightforward culling techniques such as hash-based de-duping. This post was written for those people, to hopefully give some idea how expensive pure manual review could be.
My intent was to illustrate that, even on a per GB basis, manual review is expensive and each GB saved makes a significant difference. I tried to use some numbers to do so, though I recognize those numbers are subject to debate. If there are any other studies you know of that reflect different thoughts on pages per GB, review rates or any part of the equation, let me know. I would love to do a follow up post with input from other sources! Thanks!
My apologies. I had already found the picture, but once I enlarged it, I realized it wasn't reams of paper (like I thought it was). At 9:30 last night, when I was getting ready to publish this post, I was unable to find a suitable replacement. Maybe I need to get about 10-15 cases of paper and a pickup truck and make my own picture!
Every time I see one of these extrapolations, it gives me a knot in my stomach, because I know that some poor soul will read it and think there's a possibility it may be accurate, when these sorts of inference-stacked-upon-inference calculations always lead to irrational outcomes. No matter how easily the numbers dance, they are still more fiction than fact.
Let's start with the proposition that we face cases every day where the volume of digital data encountered is larger than a gigabyte or a terabyte. Yet, sane people are NOT spending 375,000 hours or $18.75 million to get through the data. That should serve as reality check #1.
Reality check #2 is that a one gigabyte e-mail container file doesn't routinely throw off 75,000 pages of unique messaging. Could it? Sure. But does it? Not in reality.
Reality check #3 is the need to keep in mind that the level of individual human productivity has not grown by leaps and bounds in the last decade despite the explosion in digital storage volume. That volume is mostly due to exponential replication and fragmentation (i.e., yesteryear's single business letter is now twenty messages copied to five people). If the reply simply says, "OK," we count it as a page in analyses like this one.
So, the next time someone says, "I have a terabyte of stuff on my machine," ask yourself, "Have they really generated 75 million pages of documents demanding review?" The more sensible question might be, "Does anyone generate 75,000,000 pages of documents in a lifetime?"
"Hence, today’s picture shows a truck full of paper." Really, looks more like a truck full of dry chemicals based on the industrial background and packing method of the pallets...