
iTextSharp extract text

Paul Durrant

 

 

I'm trying to use

iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, 1);

on the attached PDF, but I don't get the text back. If I take the byte array and look at the contents, the text block is not in ASCII form, although all the co-ordinate structure is correct; e.g. anything between the () is not in ASCII form. How is it possible to get the text from this PDF?

 

 

 

thanks Paul

 

 

 




This message is private and confidential. If you have received it in error, you are on notice of its status. Please notify us immediately by reply email and then delete this message from your system. Please do not copy it or use it for any purposes, or disclose its contents to any other person: to do so could be a breach of confidence.

Emails may be monitored.

Details of Clarkson group companies and their regulators (where applicable) can be found at this url: Disclosure



------------------------------------------------------------------------------
This SF.net Dev2Dev email is sponsored by:

Show off your parallel programming skills.
Enter the Intel(R) Threading Challenge 2010.
http://p.sf.net/sfu/intel-thread-sfd
_______________________________________________
iText-questions mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/itext-questions

Buy the iText book: http://www.itextpdf.com/book/
Check the site with examples before you ask questions: http://www.1t3xt.info/examples/
You can also search the keywords list: http://1t3xt.info/tutorials/keywords/

Attachment: Grains 08-02-2010.pdf (1M)
Re: iTextSharp extract text

Leonard Rosenthol-3

No, it’s not possible (short of OCR) as the software that produced this PDF didn’t encode any useful text information – only displayable glyphs.

 

Leonard

 

Re: iTextSharp extract text

iText mailing list
In reply to this post by Paul Durrant
Paul Durrant wrote:
> I'm trying to use
>  iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, 1);
>
> on the attached PDF but I don't get the text back, if I take the byte
> array and look at the contents then
> the text block is not in ASCII form although all the co-ordinate
> structure is correct eg anything between the () is not in ASCII form,
> how is it possible to get the text from this pdf

Open the document and look at File > Document Properties > Fonts.
You'll see that a font TTE... was used, with "Built-in" as its encoding.
Read chapters 11 and 15 of the second edition of "iText in Action"
and you should understand that this is an example where it's extremely
difficult to extract the text.

In any case: this is NOT a bug in iText.
This is a nice example of a PDF that can't be parsed with iText.

The encoding of a simple font is a sort of table in which up to
256 character codes are mapped to glyphs. For standard encodings,
the character 'a' corresponds to a glyph that looks like a, /a/ or *a*.

But anyone can use any other encoding, where the character 'a'
corresponds to the glyph 'b', the character 'z' corresponds to
the glyph 'a', etc.

That's why you get stuff like this when you parse your file:
!" !"
  &$ ’ () (")
#$ $%
* +
  !"#$%&" ’())$ ’"* ++ + !","-!"’) ’ (.+ (’"’!&/
)(++$00() .+)$ ’(!"1
2 ’ (34 $).
, -- % ! -

These characters correspond to glyphs, but '!' doesn't correspond
to the glyph for '!'.
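
The mapping described above can be sketched in a few lines of Python. This is only a toy model, not iText code: the STANDARD and CUSTOM tables, the function names and the sample word are all made up for the illustration, and a real font stores glyph outlines rather than letters.

```python
# Illustration only: a toy model of a simple-font encoding.
# STANDARD and CUSTOM are hypothetical tables, not taken from the PDF in this thread.
import string

LETTERS = string.ascii_lowercase

# Standard-like encoding: the code for 'a' selects the glyph that looks like 'a'.
STANDARD = {ord(c): c for c in LETTERS}

# Custom encoding: every code selects the glyph of the *next* letter,
# so the code for 'a' draws 'b' and the code for 'z' draws 'a'.
CUSTOM = {ord(c): LETTERS[(i + 1) % 26] for i, c in enumerate(LETTERS)}


def render(codes: bytes, encoding: dict) -> str:
    """What a viewer draws: look up each code in the font's encoding."""
    return "".join(encoding.get(b, " ") for b in codes)


def naive_extract(codes: bytes) -> str:
    """What a parser returns when it has no mapping: the raw codes as text."""
    return codes.decode("latin-1")


def codes_for(visible_text: str, encoding: dict) -> bytes:
    """Codes a producer must write so that visible_text shows on the page."""
    glyph_to_code = {glyph: code for code, glyph in encoding.items()}
    return bytes(glyph_to_code.get(ch, ord(" ")) for ch in visible_text)


content = codes_for("wheat", CUSTOM)
print(render(content, CUSTOM))   # what the page shows
print(naive_extract(content))    # what text extraction returns
```

With the custom table the page shows "wheat" while the extracted string is "vgdzs": the viewer and the parser read the same bytes, but only the viewer has the glyphs to make sense of them.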

Re: iTextSharp extract text

Kevin Day
BTW - you can confirm the behavior described above by opening the file in Acrobat and doing a text search.  If Acrobat can't find the text, neither can iText.  If Acrobat *can* find the text, then we have something to learn :-)
Re: iTextSharp extract text

Mark Storer-2
In reply to this post by Paul Durrant
Ouch.  If you cannot copy and paste the text from Reader successfully, that shows that it is Very Hard or impossible.
 
In your case, it is probably impossible.  The font used is a subset, and Reader's failure to translate the glyph indexes into characters leads me to believe that the subset doesn't contain character mapping information (quite legal, just a royal pain).
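
As a rough sketch of what "character mapping information" means here: the Python below models a font dictionary as a plain dict and reports where a parser could get character information from. The key names (Encoding, ToUnicode) follow PDF font dictionary entries, but the function, the sample fonts and the simplification of ignoring encoding dictionaries with Differences are assumptions made for the illustration, not how iText represents fonts.

```python
# Illustration only: font dictionaries modelled as plain Python dicts.
# The sample fonts below are invented for the example.

# Predefined encoding names a simple font may reference.
PREDEFINED_ENCODINGS = {"WinAnsiEncoding", "MacRomanEncoding", "MacExpertEncoding"}


def mapping_source(font: dict) -> str:
    """Return where character information could come from, or 'none'."""
    if "ToUnicode" in font:
        return "ToUnicode"          # explicit code-to-Unicode map
    encoding = font.get("Encoding")
    if isinstance(encoding, str) and encoding in PREDEFINED_ENCODINGS:
        return "Encoding"           # codes follow a well-known table
    return "none"                   # only the font's built-in encoding is left


well_behaved = {"BaseFont": "ABCDEF+SomeFont", "Encoding": "WinAnsiEncoding"}
with_cmap = {"BaseFont": "ABCDEF+SomeFont", "ToUnicode": "<stream>"}
built_in_only = {"BaseFont": "ABCDEF+SomeFont"}  # subset, nothing else to go on

for font in (well_behaved, with_cmap, built_in_only):
    print(font["BaseFont"], "->", mapping_source(font))
```

The last case, where nothing but the built-in encoding is available, is presumably close to the situation in this thread, and it is the one where OCR is the remaining option.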
 
Your only real recourse in cases like this is OCR (optical character recognition).  Fortunately, such cases aren't all that common.  It's entirely possible however, that you're working with nothing but this type of PDF, so that may be small consolation.
 
I wish you luck.
 
--Mark Storer
  Senior Software Engineer
  Cardiff.com
 
import legalese.Disclaimer;
Disclaimer<Cardiff> DisCard = null;
 
 


Re: iTextSharp extract text

JonyGreen
This post has NOT been accepted by the mailing list yet.
In reply to this post by Paul Durrant
You can try a free online OCR tool to extract text from an image.