Convert PDF to text

Daniel Haglund
Hi.

I tried searching the archive but could not find a suitable answer.

I would like to know if it is possible to convert a simple (i.e. no images)
PDF file to text? I have tried using a utility called pdftotext. It does a
pretty good job when invoked with the -layout switch. That switch preserves
the document layout. However, pdftotext produces garbage characters for some
fonts, it seems.
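
For anyone reproducing this, the invocation described above could be scripted roughly as follows. This is only a sketch: it assumes the pdftotext binary (shipped with Xpdf or Poppler) is on the PATH, and the function names and file names are made up for illustration.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Build the argument list for pdftotext (hypothetical helper)."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to keep the physical layout of the page.
        cmd.append("-layout")
    cmd += [pdf_path, txt_path]
    return cmd


def convert_pdf_to_text(pdf_path, txt_path, keep_layout=True):
    """Run pdftotext; raises if the tool is missing or exits with an error."""
    if shutil.which("pdftotext") is None:
        raise FileNotFoundError("pdftotext was not found on the PATH")
    subprocess.run(
        build_pdftotext_command(pdf_path, txt_path, keep_layout),
        check=True,
    )


# Example call (file names are placeholders):
# convert_pdf_to_text("input.pdf", "output.txt")
```

Whether the output is readable still depends on how the fonts in the PDF are encoded, so a clean exit does not guarantee clean text.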

Anyway, iText looks like a great product and maybe it can convert all PDFs
to text regardless of fonts used? If anyone would have some sample code for
this it would be even better.

Best regards,

Daniel Haglund



_______________________________________________
iText-questions mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/itext-questions

Re: Convert PDF to text

blowagie
Daniel Haglund wrote:

> Hi.
>
> I tried searching the archive but could not find a suitable answer.

That's because there is none.

> I would like to know if it is possible to convert a simple (i.e. no
> images) PDF file to text? I have tried using a utility called
> pdftotext. It does a pretty good job when invoked with the -layout
> switch. That switch preserves the document layout. However, pdftotext
> produces garbage characters for some fonts, it seems.

That's very normal. It's 'in the nature' of PDF.
PDF is a one-way process. The PDF is the end product.
You are not supposed to convert it to text.

> Anyway, iText looks like a great product and maybe it can convert all
> PDFs to text regardless of fonts used? If anyone would have some
> sample code for this it would be even better.

You need an OCR tool.
br,
Bruno



RE: Convert PDF to text

Mark Storer-3
In reply to this post by Daniel Haglund
> I have tried using a utility called pdftotext. It does a pretty good
> job when invoked with the -layout switch. That switch preserves the
> document layout. However, pdftotext produces garbage characters for
> some fonts, it seems.

There are at least three ways of including text in a PDF:

1) Standard encoding:  The bytes in the PDF to draw the text conform to some known encoding, WinAnsi, UTF-16, whatever.  PDF->Text programs have no trouble with this sort of text.

2) Custom encoding: A byte[s]->characters mapping was generated for this PDF.  This is relatively common in subsetted fonts.  The first character used might be 0x01, the next 0x02, and so on, regardless of what those characters might be.  PDF->Text programs must exert a little extra effort to decipher this kind of text.  Some do, some don't.  It's hard to tell whether or not "pdftotext" does, based on your description.

3) Glyph indexes:  The bytes to draw the text directly index 'glyphs' in the font.  These indexes may have been modified in the case of a subsetted font.  The only way to extract information from this sort of text is through OCR (Optical Character Recognition).

4) Paths: There isn't any actual text in the PDF, just curves and straight lines.  Illustrator can convert text to paths, and I'm sure there are other programs out there with the same capability.  This results in a larger file, but you can do Cool Things to paths that you can't with regular text.  OCR is the only way to get information out of this kind of "text"... and because paths have often been through processes to do Cool Things, the OCR can have trouble with them, depending on what was done.

5) Images: A pixel map of the text.  OCR is again the only hope.  Some scanned PDFs and company logos are images.

6) Our 3 weapons are fear, surprise, ruthless efficiency, an almost fanatical devotion to the Pope, and nice red uniforms.
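
As a rough illustration of cases 1 to 3 above, here is a minimal Python sketch. It is a toy model, not how pdftotext or iText work internally: the helper names are hypothetical, cp1252 is assumed as a stand-in for WinAnsi, and a one-byte-per-character subset encoding is a simplifying assumption.

```python
# Toy model of cases 1-3; all helper names here are hypothetical.

def build_subset_encoding(text):
    """Number characters 0x01, 0x02, ... in order of first use (case 2)."""
    char_to_code = {}
    for ch in text:
        if ch not in char_to_code:
            char_to_code[ch] = len(char_to_code) + 1
    return char_to_code


def encode_with_subset(text, char_to_code):
    """Produce the bytes a PDF writer might emit for a subsetted font."""
    return bytes(char_to_code[ch] for ch in text)


def decode_with_map(data, code_to_char):
    """Recover text from a code -> character map; unknown codes become U+FFFD."""
    return "".join(code_to_char.get(code, "\ufffd") for code in data)


text = "Hello"

# Case 1: a known encoding, so decoding is a plain codec lookup.
standard_bytes = text.encode("cp1252")
assert standard_bytes.decode("cp1252") == text

# Case 2: custom encoding. The bytes mean nothing without the map.
char_to_code = build_subset_encoding(text)
subset_bytes = encode_with_subset(text, char_to_code)
code_to_char = {code: ch for ch, code in char_to_code.items()}

# An extractor that ignores the map sees control characters ("garbage").
naive = subset_bytes.decode("cp1252")
# An extractor that uses the map gets the original text back.
recovered = decode_with_map(subset_bytes, code_to_char)

# Case 3: no usable map is available, so every code is unknown.
unmapped = decode_with_map(subset_bytes, {})

print(repr(naive), recovered, repr(unmapped))
```

The point of the sketch is that the same bytes give garbage or clean text depending only on whether the extractor has, and uses, the mapping that goes with the font.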

--Mark Storer
  Senior Software Engineer
  Cardiff Software

#include <disclaimer>
typedef std::Disclaimer<Cardiff> DisCard;



Re: Convert PDF to text

JonyGreen
This post has NOT been accepted by the mailing list yet.
In reply to this post by Daniel Haglund
I found a free online PDF-to-text converter that converts a PDF to editable text.