Possible bug in PdfTextExtractor.GetTextFromPage [iTextSharp]

Keith O-2
Hi,

I've seen a few threads on stackoverflow.com reporting the same problem: an
"Index was outside the bounds of the array" exception when parsing certain
PDFs. I've attached the smallest sample PDF I could find that reproduces the
problem. I hit the same issue with a few other large PDFs
(electronics owner's manuals).

Thanks!

_______________________________________________
iText-questions mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/itext-questions

iText(R) is a registered trademark of 1T3XT BVBA.
Many questions posted to this list can (and will) be answered with a reference to the iText book: http://www.itextpdf.com/book/
Please check the keywords list before you ask for examples: http://itextpdf.com/themes/keywords.php

Attachments: PdfTextExtractorTest.cs (960 bytes), ADAC.pdf (85K)

Re: Possible bug in PdfTextExtractor.GetTextFromPage [iTextSharp]

Kevin Day
If you can provide the full stack trace, it would be a big help.  Links to the SO articles would also be useful if you still have them handy.

I did find a problem with memory mapped files this morning - will be committing a fix in a few minutes, but I can't tell you for sure if it's related.

Re: Possible bug in PdfTextExtractor.GetTextFromPage [iTextSharp]

Keith O-2
On Mon, Jan 30, 2012 at 10:56 PM, Kevin Day <[hidden email]> wrote:
> If you can provide the full stack trace, it would be a big help.  Links to
> the SO articles would also be useful if you still have them handy.

================
[IndexOutOfRangeException: Index was outside the bounds of the array.]
   iTextSharp.text.pdf.parser.LocationTextExtractionStrategy.GetResultantText() +505
   iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(PdfReader reader, Int32 pageNumber, ITextExtractionStrategy strategy) +52
   iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(PdfReader reader, Int32 pageNumber) +40
================

Here are a couple of links to the SO questions:

http://stackoverflow.com/questions/8951408/index-was-outside-the-bounds-of-the-array-while-reading-a-pdf-using-itextsharp

http://stackoverflow.com/questions/8578793/itextsharp-v5-gettextfrompage-throws-indexoutofrangeexception

Here is the most interesting one; it also has the most information on
what the user tried while extracting images from a PDF:

http://stackoverflow.com/questions/8493559/why-is-my-image-distorted-when-decoding-as-flatedecode-using-itextsharp/8511314#8511314

None of them have links to example PDFs, however...

Thanks!

Re: Possible bug in PdfTextExtractor.GetTextFromPage [iTextSharp]

Kevin Day
ok - I tested your PDF in the Java version of iText (latest code from HEAD) and it does *not* fail.  Given the stack trace, I'm pretty sure that this is an issue that has been fixed - basically, if the text render operation had an empty string, we were winding up with an index out of bounds exception.  Latest Java code definitely fixes that issue - I'm not sure where things are at with rolling that into the C# code base.

Re: Possible bug in PdfTextExtractor.GetTextFromPage [iTextSharp]

Paulo Soares-3
I suspect that this is also fixed in the iTextSharp HEAD. In any case, the Java and C# versions will be synchronized this weekend.

Paulo

Re: Possible bug in PdfTextExtractor.GetTextFromPage [iTextSharp]

Keith O-2
On Tue, Jan 31, 2012 at 4:22 PM, Paulo Soares <[hidden email]> wrote:
> I suspect that this is also fixed in the iTextSharp HEAD. In any case, the Java and C# versions will be synchronized this weekend.

Yes, there's no problem when building from the latest SVN source code
with the test file I had attached, thank you!

While browsing sourceforge to get the code from SVN,  I noticed
someone else had submitted a similar bug report, ID #3474281. I tested
with that file. Here's the stacktrace (using latest build):

========================================
Unhandled Exception: System.IndexOutOfRangeException: Index was outside the bounds of the array.
   at iTextSharp.text.pdf.CMapAwareDocumentFont.GetWidth(Int32 char1)
   at iTextSharp.text.pdf.parser.TextRenderInfo.GetStringWidth(String str)
   at iTextSharp.text.pdf.parser.TextRenderInfo.GetUnscaledBaselineWithOffset(Single yOffset)
   at iTextSharp.text.pdf.parser.LocationTextExtractionStrategy.RenderText(TextRenderInfo renderInfo)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.DisplayPdfString(PdfString str)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.ShowTextArray.Invoke(PdfContentStreamProcessor processor, PdfLiteral oper, List`1 operands)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.InvokeOperator(PdfLiteral oper, List`1 operands)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.ProcessContent(Byte[] contentBytes, PdfDictionary resources)
   at iTextSharp.text.pdf.parser.PdfReaderContentParser.ProcessContent[E](Int32 pageNumber, E renderListener)
   at iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(PdfReader reader, Int32 pageNumber)
   at PdfTextExtractorTest.Main(String[] args)
========================================

Thanks Paulo!

Re: Possible bug in PdfTextExtractor.GetTextFromPage [iTextSharp]

Keith O-2
In reply to this post by Kevin Day
On Tue, Jan 31, 2012 at 4:16 PM, Kevin Day <[hidden email]> wrote:
> ok - I tested your PDF in the Java version of iText (latest code from HEAD)
> and it does *not* fail.  Given the stack trace, I'm pretty sure that this is
> an issue that has been fixed - basically, if the text render operation had
> an empty string, we were winding up with an index out of bounds exception.
> Latest Java code definitely fixes that issue - I'm not sure where things are
> at with rolling that into the C# code base.

Thank you - I downloaded the latest from SVN, and indeed the test file
I submitted previously works without problem too.

Another file I tested, but didn't submit due to its large size, still
doesn't work with the latest C# build, though. If you have time, it's
located here:

http://www.navigon.com/export/sites/default/common/Download/Manual/PNA/NAVIGON70/English_manual.pdf

stacktrace from that file with latest SVN C# build:

=============================================
Unhandled Exception: System.IndexOutOfRangeException: Index was outside the bounds of the array.
   at iTextSharp.text.pdf.CMapAwareDocumentFont.GetWidth(Int32 char1)
   at iTextSharp.text.pdf.parser.TextRenderInfo.GetStringWidth(String str)
   at iTextSharp.text.pdf.parser.TextRenderInfo.GetUnscaledBaselineWithOffset(Single yOffset)
   at iTextSharp.text.pdf.parser.LocationTextExtractionStrategy.RenderText(TextRenderInfo renderInfo)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.DisplayPdfString(PdfString str)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.InvokeOperator(PdfLiteral oper, List`1 operands)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.ProcessContent(Byte[] contentBytes, PdfDictionary resources)
   at iTextSharp.text.pdf.parser.PdfReaderContentParser.ProcessContent[E](Int32 pageNumber, E renderListener)
   at iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(PdfReader reader, Int32 pageNumber)
   at PdfTextExtractorTest.Main(String[] args)
=============================================

Thank you!

Re: Possible bug in PdfTextExtractor.GetTextFromPage [iTextSharp]

Kevin Day
ok - that one is going to be caused by something deeper in the unicode, font metrics, etc... area of iText - I'll need to rely on the others for digging into that.

Are you able to isolate which page is causing this issue?  It really should be possible to get a single page that causes the problem, and having that will help quite a bit in getting a fix.
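
One way to find the failing pages is to extract each page separately and record which ones throw. The following is only a minimal sketch: `PageIsolation`, `FindFailingPages` and the `extractPage` delegate are hypothetical names used for illustration, not part of iTextSharp.

```csharp
using System;
using System.Collections.Generic;

public static class PageIsolation
{
    // Calls extractPage for pages 1..pageCount and returns the numbers of the
    // pages for which it threw. extractPage is a hypothetical delegate; with
    // iTextSharp it would typically wrap
    // PdfTextExtractor.GetTextFromPage(reader, page).
    public static List<int> FindFailingPages(int pageCount, Func<int, string> extractPage)
    {
        var failing = new List<int>();
        for (int page = 1; page <= pageCount; page++)
        {
            try
            {
                extractPage(page);
            }
            catch (Exception ex)
            {
                // Log the exception type and message, then keep going so that
                // every page is tried.
                Console.Error.WriteLine("Page {0}: {1}: {2}", page, ex.GetType().Name, ex.Message);
                failing.Add(page);
            }
        }
        return failing;
    }
}
```

Assuming `reader` is an open PdfReader, a call could look like `PageIsolation.FindFailingPages(reader.NumberOfPages, p => PdfTextExtractor.GetTextFromPage(reader, p))`. Passing the extraction step as a delegate keeps the loop independent of the library.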

Re: Possible bug in PdfTextExtractor.GetTextFromPage [iTextSharp]

Keith O-2
On Tue, Jan 31, 2012 at 5:42 PM, Kevin Day <[hidden email]> wrote:
> Are you able to isolate which page is causing this issue?  It really should
> be possible to get a single page that causes the problem, and having that
> will help quite a bit in getting a fix.

For the link posted earlier (136 pages):

http://www.navigon.com/export/sites/default/common/Download/Manual/PNA/NAVIGON70/English_manual.pdf

The following pages throw an exception:

9,15,17,18,19,21,23,24,25,26,27,28,29,31,32,34,35,37,38,39,40,41,42,
43,44,45,46,47,48,49,50,51,52,53,54,58,60,61,62,63,64,65,66,67,68,69,
70,71,72,73,77,78,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,99,
100,101,102,104,105,106,107,108,109,110,111,112,113,114,115,117,118,
119,120,121,122,123,129,130,131,132

For the other post in reply to Paulo (sourceforge bug report) all six
pages throw an exception.

Re: Possible bug in PdfTextExtractor.GetTextFromPage [iTextSharp]

newton
In reply to this post by Kevin Day
I have the same problem with index outside the bounds...

I was converting the following PDF:

http://zbierka.sk/ov/kapitoly/default.aspx?KapitolaID=64396&FileName=ov2012-018-01&Rocnik=2012&TypKapitolyID=1

I have downloaded your source code...

The FIRST ERROR was in CMapAwareDocumentFont.GetWidth(int char1)
- the input was 327 (representing the Slovak character Ň), which was transformed by

char1 = uni2cid[char1];

into 277. Accessing widths[char1] then threw the exception, because the widths array was initialized with 256 items... I had to increase the size of that array to avoid the error...

The SECOND ERROR was in LocationTextExtractionStrategy.GetResultantText()

where the following check was missing => !string.IsNullOrEmpty(lastChunk.text)
in this condition:
else if (dist > chunk.charSpaceWidth / 2.0f && chunk.text[0] != ' ' && lastChunk.text[lastChunk.text.Length - 1] != ' ')
                            sb.Append(' ');

After repairing these errors and rebuilding the dll, the conversion worked perfectly... Would you be so kind as to take a look at our Slovak diacritics and also fix this in your official release? If you already did, just ignore my message... Thanks a lot...
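
The first error above comes down to indexing a 256-entry table with the value 277. As a rough illustration only, a guarded lookup could fall back to a default width instead of throwing. `WidthLookup`, `GetWidthOrDefault` and the `defaultWidth` parameter are hypothetical names; this is not the actual iTextSharp code or the fix that went into SVN, and whether a default width is acceptable for a given font is an assumption.

```csharp
using System;

public static class WidthLookup
{
    // Returns the width stored for a character code, or defaultWidth when the
    // code does not fit into the table. Hypothetical helper for illustration,
    // not the iTextSharp implementation.
    public static int GetWidthOrDefault(int[] widths, int code, int defaultWidth)
    {
        if (widths == null || code < 0 || code >= widths.Length)
        {
            return defaultWidth;
        }
        return widths[code];
    }
}
```

With a 256-entry `widths` array, `GetWidthOrDefault(widths, 277, 0)` returns the default instead of raising IndexOutOfRangeException. Enlarging the array, as described above, avoids the exception for this file but still leaves larger codes unguarded.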

Re: Possible bug in PdfTextExtractor.GetTextFromPage [iTextSharp]

Kevin Day
I believe that the bug in LocationTextExtractionStrategy.GetResultantText() was fixed some time ago - did you experience this problem with the latest code in HEAD?

For reference, the line in question in SVN is the following (startsWithSpace and endsWithSpace cover the null and empty cases):

                    else if (dist > chunk.charSpaceWidth/2.0f && !startsWithSpace(chunk.text) && !endsWithSpace(lastChunk.text))
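
For readers without the source at hand, helpers of this kind could look roughly like the sketch below. It is an assumption about their shape, not a copy of the iTextSharp code: the class name `ChunkSpacing`, the PascalCase method names and `NeedsSpaceBetween` are all made up for illustration.

```csharp
using System;

public static class ChunkSpacing
{
    // True only when s is non-null, non-empty and begins with a space.
    public static bool StartsWithSpace(string s)
    {
        return !string.IsNullOrEmpty(s) && s[0] == ' ';
    }

    // True only when s is non-null, non-empty and ends with a space.
    public static bool EndsWithSpace(string s)
    {
        return !string.IsNullOrEmpty(s) && s[s.Length - 1] == ' ';
    }

    // Mirrors the condition quoted above: insert a space when the gap is
    // wider than half a space width and neither side already supplies one.
    public static bool NeedsSpaceBetween(float dist, float charSpaceWidth, string lastText, string text)
    {
        return dist > charSpaceWidth / 2.0f
            && !StartsWithSpace(text)
            && !EndsWithSpace(lastText);
    }
}
```

Because the empty-string check happens before any indexing, a text render operation with an empty string no longer reaches `s[0]` or `s[s.Length - 1]`, which is where the original exception came from.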



I will need to ask Paulo to look at this - I don't quite know the full implications of the uni2cid array. Trying to maintain an array that is the length of the full Unicode set isn't practical. Increasing the array to 512 or so may address the situation you currently find yourself in, but this seems to me like something that needs a more robust fix, and the whole Unicode/CID transformation area is outside of my expertise.

Re: Possible bug in PdfTextExtractor.GetTextFromPage [iTextSharp]

Paulo Soares-3
In reply to this post by newton

Fixed in the SVN.
 
Paulo
----- Original Message -----
Sent: Saturday, February 04, 2012 5:43 PM
Subject: Re: [iText-questions] Possible bug inPdfTextExtractor.GetTextFromPage [iTextSharp]

Re: Possible bug in PdfTextExtractor.GetTextFromPage [iTextSharp]

JonyGreen
This post has NOT been accepted by the mailing list yet.
In reply to this post by Keith O-2
I'm not a developer; I always use this free online PDF-to-text converter to extract text from PDF pages online.