Friday, Nov 16th

Why Government Agencies Use Ugly, Difficult to Use Scanned PDFs - There's More Than Meets the Eye

Sometimes, a government agency will post a PDF that doesn't contain searchable text. Most often, it's a scan of a printout. Why? Don't the NSA, the Department of Justice, etc., know how to convert Word (or whatever) directly to PDF? It turns out that they know more than some of their critics do. The reason? With a piece of paper, you know much more about what you're actually disclosing.

It's tempting to think of a PDF file as a simple image of a page, or maybe a simple page image with (somehow!) embedded text that you can search for. In fact, PDFs are far more complex than that. A PDF file (like more or less any modern document file) is a container that can hold many different types of things: text, images, font definitions, JavaScript programs (yes, you can embed JavaScript in a PDF), and much more. If you release a PDF produced by a text formatter, do you really know what you're releasing?
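To make the container point concrete, here is a minimal Python sketch that counts a few PDF name tokens in the raw bytes of a file. The function name `count_pdf_markers` and the marker list are illustrative assumptions, not part of any official tool. It is only a rough triage heuristic: names inside compressed object streams, or names written with hex escapes, will be missed, so a count of zero proves nothing.

```python
import re

# Illustrative (not exhaustive) list of PDF name tokens that hint at
# content beyond the visible page. This selection is an assumption.
MARKERS = (b"/JavaScript", b"/JS", b"/EmbeddedFile", b"/Metadata", b"/Font")


def count_pdf_markers(data: bytes) -> dict[str, int]:
    """Count occurrences of each marker name in the raw bytes of a PDF."""
    counts = {}
    for marker in MARKERS:
        # The lookahead stops a short name from matching as the prefix
        # of a longer one (for example /Font inside /FontDescriptor).
        pattern = re.escape(marker) + rb"(?![A-Za-z0-9])"
        counts[marker.decode("ascii")] = len(re.findall(pattern, data))
    return counts


if __name__ == "__main__":
    sample = (
        b"%PDF-1.4\n1 0 obj\n"
        b"<< /Type /Catalog /OpenAction << /S /JavaScript /JS (app.alert(1)) >>"
        b" /Metadata 2 0 R >>\nendobj\n%%EOF\n"
    )
    print(count_pdf_markers(sample))
```

A nonzero count is a reason to look closer, not a verdict, which is part of why the paper route is attractive to a cautious agency.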

It may be possible to strip all of the metadata safely. The NSA, in fact, has a guide on how to do it. (N.B. You'll get a certificate error: many US government agencies have certificates from a US government-specific certificate authority, and outside browsers do not trust it by default. If you do not want to click through the warning messages (if you even can), I've created a mirror of it. And that's legal: by law, US government-created documents are in the public domain.) But the complexity is worrisome — and the list of things that "Sanitize Document" can delete (page 10) is quite amazing. (Sanitizing Word is harder.)

So why is this an issue? Well, people still get it wrong. And it's not a new problem; Bruce Schneier wrote about it years ago and said it was barely newsworthy then. Yes, even federal prosecutors can get it wrong.

Printing a document onto paper and scanning it is ugly and less functional, but it does prevent this sort of error.

And there are two more subtle points. First, sensitive networks are often air-gapped from the Internet. Air-gapping (having no physical connection whatsoever to the outside world) is a strong defense, though far from perfect. Getting a PDF file from an air-gapped network to the Internet can be done, but it's painstaking and, if done incorrectly, can expose the sensitive network to attack from the outside. Again, we know how to do this: follow NSA procedures on the sensitive network, burn a CD-R (not a CD-RW) with just the PDF, and carry that to an outside machine. But there's still a chance of human error. And there's one more threat…

What is really in a PDF, and how do you know? Is it just what you see on the screen? Even apart from malice or stupidity, e.g., setting the font color to white, there's a hidden danger: what did the PDF creation or redaction program actually write out? Remember that PDFs are containers; there can be nominally empty sections of the file. What fills those bytes? How do you know, and what is your assurance?
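One concrete way a PDF can hold more than what appears on screen is the format's incremental-update mechanism: a program may append changes to the end of the file instead of rewriting it, so earlier content can remain in the bytes. Each revision normally ends with an `%%EOF` marker. The following Python sketch (the function name `revision_summary` is a hypothetical helper, not an established tool) counts those markers and measures any non-whitespace bytes after the last one. It is a heuristic under those assumptions and not a substitute for a real sanitizer.

```python
def revision_summary(data: bytes) -> tuple[int, int]:
    """Return (count of %%EOF markers, non-whitespace bytes after the last one).

    More than one marker suggests the file was saved incrementally, so
    older revisions may still be present in the bytes. Trailing bytes
    after the final marker are not part of the visible document.
    """
    marker = b"%%EOF"
    last = data.rfind(marker)
    if last == -1:
        # No marker at all: not a complete PDF, nothing to summarize.
        return 0, 0
    trailing = data[last + len(marker):].strip()
    return data.count(marker), len(trailing)


if __name__ == "__main__":
    sample = b"%PDF-1.4\n(first save)\n%%EOF\n(later edit)\n%%EOF\nleftover"
    print(revision_summary(sample))
```

A result other than `(1, 0)` does not prove a leak, but it does mean the file contains more than a single clean revision.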

Many years ago, while I was at AT&T, I was working on an important internal project. Someone sent out a Word document with some very sensitive details. Unlike everyone else on the project, I was running an open source OS instead of Windows, so I couldn't just fire up Word. Instead, I used an open source tool to view the file, and I saw something different. The person who created the file had two documents open in Word, and what was nominally empty space was filled with whatever garbage was lying around in RAM at the time: in this case, the body of an unrelated letter he was sending to someone outside the company. The tool I used to view the file wasn't perfect, so it printed the wrong part of the Word document. The odds are high, of course, that the recipient of that letter received some of our project plans, but if that person did the usual thing, running Windows and Word, the stray text would never have appeared, and our corporate secrets would have stayed safe.
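The kind of view that exposes such leftovers can be approximated by ignoring the file format entirely and pulling printable text out of the raw bytes, much as the Unix `strings` utility does. This is a minimal Python sketch under stated assumptions: the name `printable_runs` and the minimum length of 6 are illustrative choices, and it only finds 8-bit ASCII text, so anything stored as UTF-16 or compressed would be missed.

```python
import re


def printable_runs(data: bytes, min_len: int = 6) -> list[str]:
    """Return runs of printable ASCII at least min_len bytes long."""
    # 0x20-0x7e covers space through tilde, the printable ASCII range.
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [run.decode("ascii") for run in re.findall(pattern, data)]


if __name__ == "__main__":
    # A made-up blob standing in for a document with stray memory contents.
    blob = (
        b"\x00\x01Dear Bob, lunch Friday?\x00\xff\xfe"
        b"abc\x00Project plan v2\x00"
    )
    for run in printable_runs(blob):
        print(run)
```

The format-aware application shows only what the file's structure points to; a byte-level scan like this one shows whatever happens to be there.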

The NSA and the Department of Justice, of course, have serious adversaries, ones who won't take a file at face value. Unless you have a lot of confidence in the PDF redaction program, you're much better off scanning a printed version. Sure, there are still some risks, e.g., steganography based on kerning or the like, but they're much smaller than with a PDF generated directly from the source document.

So: DoJ has its reasons for sending out these difficult-to-use PDFs. You may not like it — I don't like it — but they're doing it out of caution, not ignorance or stupidity.

Written by Steven Bellovin, Professor of Computer Science at Columbia University

More under: Cybersecurity, Web
