Why isn't my PDF file being imported?


PDF has been referred to as a "write-only" file format [1], since it's so difficult for applications to extract useful information from it. A PDF file is only required to contain what's necessary to reproduce the visual appearance of a document; it is notably not required to keep any orderly record of what's in the document.

PDF documents are also, by design, not particularly editable, because nothing is ever deleted from them: even tools that are able to edit documents can only do so by adding new information. The old information (text, images, structure, etc.) is still retained within the file. And it can get to be a bit of a mess in there.

Fade In is nevertheless able to import a wide range of PDF screenplays. Sometimes it may run into problems importing, and there are at least a couple of reasons why that might be.

The first reason is related to screenplays that were scanned from a printed copy. Scanned PDFs generally consist only of images of the text rather than the text itself. The exception is when the scanning process included the additional step of OCR (Optical Character Recognition), which adds a separate, invisible layer of actual text reflecting the contents of the images of text.
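As a rough illustration of the scanned-versus-text distinction, here is a minimal sketch of a check you could run on text already extracted from a PDF by some other tool. The function name looks_scanned, the page_texts input, and the 20-character threshold are all assumptions made for this example; this is not how Fade In decides anything.

```python
def looks_scanned(page_texts, min_chars_per_page=20):
    """Guess whether a PDF is image-only from its extracted text.

    page_texts: one string per page, as produced by whatever text
    extraction tool you use (hypothetical input for this sketch).
    min_chars_per_page: arbitrary threshold chosen for illustration.
    """
    if not page_texts:
        # No pages yielded any text at all.
        return True
    # Count only non-whitespace characters across all pages.
    total = sum(1 for text in page_texts for ch in text if not ch.isspace())
    return total / len(page_texts) < min_chars_per_page


# A scanned screenplay without OCR typically yields empty pages.
print(looks_scanned(["", "", ""]))
# A PDF with a real text layer yields plenty of characters.
print(looks_scanned(["INT. HOUSE - DAY\nJohn enters the room and looks around."]))
```

If a check like this reports a scan, running the file through OCR software first is the usual way to give it a text layer.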

Another reason that a PDF file may not be imported correctly is this:

[Image: PDF insides]

That's what the inside of a PDF document looks like.

There really are no requirements for structure or layout within PDF: it's essentially just a container for an arbitrary collection of "objects". These objects can be stored in any number of ways, and in any order. For instance, a PDF document may store the words "quick", "The", "fox", and "brown", in that order, along with instructions to display them onscreen as: "The" "quick" "brown" "fox". (And some PDF creators basically do that.)
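To make the "quick", "The", "fox", "brown" example concrete, here is a minimal sketch of how an importer might rebuild reading order from fragments that are stored in arbitrary order but carry page positions. The Fragment type, the coordinates, and the line_tolerance value are invented for illustration, and real PDFs are considerably messier than this.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    text: str
    x: float  # left edge of the fragment on the page
    y: float  # height on the page, measured from the bottom


def reading_order(fragments, line_tolerance=2.0):
    """Group fragments into lines by y, then sort each line left to right."""
    lines = []  # list of (y, [fragments on that line])
    # Highest y first, so lines come out top to bottom.
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0] - frag.y) <= line_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append((frag.y, [frag]))
    return [
        " ".join(f.text for f in sorted(frags, key=lambda f: f.x))
        for _, frags in lines
    ]


# Stored order is scrambled; positions say where each word is drawn.
stored = [
    Fragment("quick", 100, 700),
    Fragment("The", 72, 700),
    Fragment("fox", 172, 700),
    Fragment("brown", 134, 700),
]
print(reading_order(stored))  # ['The quick brown fox']
```

Sorting by position recovers the sentence here, but it says nothing about what the text means (character name, dialogue, action), which is the harder part.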

The problem then comes with trying to figure out exactly what the layout of the objects is supposed to represent in logical terms. (This is one of the things that human beings are way better at than computers. For now.)

There's also no requirement that PDF store the word "fox" as "fox". It could store it as "D&!" or almost anything else, as long as it also stores the — sometimes byzantine — rules of transliteration.
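As a sketch of what those transliteration rules amount to, the example below decodes the stored string "D&!" through a lookup table. The table itself is made up to match the example above; in a real PDF this role is played by the font's encoding or its ToUnicode mapping, which can be far more convoluted or missing altogether.

```python
# Hypothetical code table for one font: stored code -> real character.
CODE_TO_CHAR = {"D": "f", "&": "o", "!": "x"}


def decode(stored, table):
    """Translate stored codes to characters.

    Codes with no entry in the table become U+FFFD (the Unicode
    replacement character), since there is no way to know what
    they were meant to be.
    """
    return "".join(table.get(code, "\ufffd") for code in stored)


print(decode("D&!", CODE_TO_CHAR))  # fox
```

When the table is incomplete or absent, the words still display correctly onscreen but come out as gibberish when extracted.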

The greatest obstacle to reliable PDF import is the wide variety of PDF creators and methods of creation, which introduces all sorts of variables that have to be taken into account when trying to extract the content. There may still be a few instances where things don't work perfectly.

In those cases, you may have to fall back on selecting the text in a PDF viewer and pasting it into Fade In, where the PDF allows that. With that method, however, a certain amount of reformatting is almost guaranteed.

 

[1] By this author

 

Tags: import, pdf
