iText help resources?


Steve Garcia

Hi,

 

I am trying to pull table data out of PDF files that contain non-tabular text as well as tables.  I've successfully parsed the non-tabular text using PdfTextExtractor.GetTextFromPage(), but the resulting text stream is empty at each table location.

 

I'm sure there's a way to do what I need, but I can't find documentation for iText.  Any suggestions for learning my way out of this dilemma?

 

I've attached a sample PDF.  The tables are in the latter part of the file.

 

Thanks,

Steve

 

 


_______________________________________________
iText-questions mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/itext-questions

iText(R) is a registered trademark of 1T3XT BVBA.
Many questions posted to this list can (and will) be answered with a reference to the iText book: http://www.itextpdf.com/book/
Please check the keywords list before you ask for examples: http://itextpdf.com/themes/keywords.php

Attachment: testin.pdf (3M)
Re: iText help resources?

mkl
Steve,
Steve Garcia wrote:
> Am trying to pull table data out of PDF files that contain non tabular text as well as the tables.  I've successfully parsed the non tabled text using PdfTextExtractor.GetTextFromPage(), but the resulting text stream is empty at each table location.

The text in the tables cannot be extracted without OCR.

The text in the tables is drawn using Type 3 fonts with an ad-hoc encoding: the first glyph drawn on the page is encoded as 0, the second distinct glyph as 1, and so on.

E.g. on page 11 the first text drawn is "B6 Summary (Official Form 6 - Summary) (12/14)" and is encoded as 00, 01, 02, 03, 04, 05, 05, 06, 07, 08, 02, 09, 0A, 0B, 0B, 0C, 0D, 0C, 06, 0E, 02, 0F, ...
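
To make the numbering scheme concrete, here is a minimal Python sketch (not iText code; the function name `adhoc_encode` is made up for illustration) that assigns codes in order of first appearance, assuming that is all the producing software did. Applied to the text above, it yields the same code sequence:

```python
def adhoc_encode(text):
    """Give each distinct character a code in order of first appearance."""
    table = {}      # character -> code, built up while "drawing" the text
    encoded = []
    for ch in text:
        if ch not in table:
            table[ch] = len(table)  # next unused code
        encoded.append(table[ch])
    return encoded, table


text = "B6 Summary (Official Form 6 - Summary) (12/14)"
encoded, table = adhoc_encode(text)

# Print the first 22 codes as two-digit hex, as they appear in the content stream.
print(", ".join(f"{code:02X}" for code in encoded[:22]))
# 00, 01, 02, 03, 04, 05, 05, 06, 07, 08, 02, 09, 0A, 0B, 0B, 0C, 0D, 0C, 06, 0E, 02, 0F
```

The codes only record the order in which glyphs were first used; they say nothing about which character each glyph represents.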

Furthermore, the font has no mapping to Unicode.

Thus, automated text extraction without some kind of OCR is impossible.
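
A small Python illustration of why the codes alone are not enough (again a sketch, with a hypothetical helper `first_appearance_codes`, assuming codes are assigned purely by first appearance): two entirely different strings produce identical code sequences, so without a code-to-Unicode table an extractor cannot tell them apart.

```python
def first_appearance_codes(text):
    """Encode text by numbering distinct characters in order of first use."""
    table = {}
    return [table.setdefault(ch, len(table)) for ch in text]


# Different texts, same shape of repetition, identical codes.
codes_a = first_appearance_codes("abca")
codes_b = first_appearance_codes("xyzx")

print(codes_a)             # [0, 1, 2, 0]
print(codes_a == codes_b)  # True
```

Only the glyph shapes themselves identify the characters, which is why recognition of the drawn glyphs (OCR) is needed.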

Regards,   Michael