Converting to Tagged PDF

dlz613
Hi.
Is there any way to use iText to convert a non-tagged PDF to tagged?

Re: Converting to Tagged PDF

blowagie
A Tagged PDF is a PDF that contains semantic information.

Suppose that you have a table in a PDF that is not tagged.

* To the human eye, that table consists of columns and rows. Maybe there is a header row and a footer row, and so on.
* To a machine, that table consists of nothing more than lines, shapes, and snippets of text placed at arbitrary positions. A machine doesn't know which text belongs to which cell. A machine doesn't know whether a row is a header row, a body row, or a footer row.

In a tagged PDF, you add special marks (we talk about "marked content") to indicate what is what. When a machine is presented with a tagged PDF that contains a table, it knows which parts of the table are the header, which parts are the body, and which parts are the footer. Usually, the content will also be added in the logical reading order. That is not the case for a PDF that isn't tagged. In a PDF that isn't tagged, you can add all the regular text first, then all the bold text, then all the italic text, and so on. It really doesn't matter in which order the text is added, since all text is placed at absolute positions anyway.
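To make the difference concrete, here is a minimal sketch in plain Java (no iText involved) that builds the content stream text for a tiny table, once without and once with marked content. The class `MarkedContentSketch`, the `Snippet` type, the font resource name `/F1`, and the coordinates are all made up for illustration. Only the operators (`BT`, `Tf`, `Td`, `Tj`, `ET`, `BDC`, `EMC`), the `MCID` property, and the structure types `TH` and `TD` come from the PDF specification.

```java
import java.util.List;

class MarkedContentSketch {

    /** One snippet of text placed at an absolute page position (hypothetical helper type). */
    static final class Snippet {
        final String tag;  // structure type chosen by a human, e.g. "TH", "TD" or "P"
        final String text;
        final int x;
        final int y;

        Snippet(String tag, String text, int x, int y) {
            this.tag = tag;
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Escapes the characters that are special inside a PDF literal string. */
    static String escape(String text) {
        return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)");
    }

    /** Text-showing operators for one snippet; the font resource /F1 is assumed to exist. */
    static String showText(Snippet s) {
        return "BT /F1 12 Tf " + s.x + " " + s.y + " Td (" + escape(s.text) + ") Tj ET\n";
    }

    /** Untagged: positions and text only, in whatever order the producer drew them. */
    static String untagged(List<Snippet> drawingOrder) {
        StringBuilder sb = new StringBuilder();
        for (Snippet s : drawingOrder) {
            sb.append(showText(s));
        }
        return sb.toString();
    }

    /** Marked: each snippet is wrapped in BDC ... EMC with a tag and a running MCID. */
    static String tagged(List<Snippet> readingOrder) {
        StringBuilder sb = new StringBuilder();
        int mcid = 0;
        for (Snippet s : readingOrder) {
            sb.append("/").append(s.tag).append(" <</MCID ").append(mcid++).append(">> BDC\n");
            sb.append(showText(s));
            sb.append("EMC\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Snippet> table = List.of(
                new Snippet("TH", "Item", 72, 700),
                new Snippet("TH", "Price", 200, 700),
                new Snippet("TD", "Coffee", 72, 680),
                new Snippet("TD", "2.50", 200, 680));
        System.out.print(untagged(table));
        System.out.println();
        System.out.print(tagged(table));
    }
}
```

Note that the `tag` value of every snippet has to be supplied by whoever creates the list: the untagged stream carries no hint about it. The marks alone also don't make a file a Tagged PDF; a real document additionally needs a structure tree in which structure elements refer back to these `MCID` values, and that part is left out of this sketch.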

If you understand everything I wrote above, you should realize that software that can turn a PDF without tags into a tagged PDF without human interaction either doesn't exist, or it fools the customer.

It doesn't exist, because it takes a human to teach the software which parts of the content are table headers, table cells, titles, paragraphs, and so on. It takes a human to teach the software what is shown in images so that "Alt text" can be provided. Tagging a PDF correctly requires human intelligence.

If you do find software that converts a PDF without tags to a PDF that is tagged from a technical point of view, that software fools the customer. It could, for instance, tag all the content as one big paragraph and add Alt text for images that says nothing more than "This is an image."

Since I assume that you want to do a good job, I am confident that you are not asking us to help you fool your customer. We hope that you will tell your customer in all honesty that he is asking for something that can't be done.