How can i extract text from pdf including white spaces ??

shailendra3009
I used iTextSharp to extract text from a PDF, using the code below to extract the text line by line. It extracts the text perfectly, except that it does not read the white spaces in the PDF, and I specifically need to read those white spaces.

Can I do this using the iTextSharp library?

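        // Requires: using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
        PdfReader reader = new PdfReader(filename);

        string text = string.Empty;

        // Only the first page is read here; loop up to reader.NumberOfPages to read every page.
        for (int page = 1; page <= 1; page++)
        {
            text += PdfTextExtractor.GetTextFromPage(reader, page, new LocationTextExtractionStrategy());
        }
        reader.Close();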

In this code I have also tried SimpleTextExtractionStrategy in place of LocationTextExtractionStrategy, but that does not work either.

Re: How can i extract text from pdf including white spaces ??

iText mailing list
On 7/05/2013 8:08, shailendra3009 wrote:
> I used iTextSharp to extract text from a PDF, using the code below to extract
> the text line by line. It extracts the text perfectly, except that it does not
> read the white spaces in the PDF, and I specifically need to read those.
Your question sounds like "I have no money in my wallet; how can I fetch
the zero dollar notes from my wallet?"

In a PDF, all text is added at absolute positions.
For instance: one word is added at position x = 36, y = 806; another
word is added at position x = 300, y = 806. Some other text is added at
position x = 36, y = 790; x = 36, y = 774; x = 36, y = 742; ...

Where are the spaces? There are none!

But by doing the math, you can see that there's a gap between the text
that starts at position x = 36 and the one that starts at position x = 300.

Also, you see a pattern in the y positions: 806 - 16 = 790; 790 - 16 =
774; 774 - 16 = 758; 758 - 16 = 742; ...
This looks like a line was skipped at position 758.
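
To make that arithmetic concrete, here is a minimal sketch in plain Java. It makes no iText calls. It turns the positions from the example back into padding spaces and an empty line. The class names PositionedText and LayoutSketch, the sample words, and the fixed character width of 6 points are assumptions made only for this illustration; real text rarely has a single fixed glyph width, so treat this as a rough approximation, not as what iText does internally.

```java
// Stand-in for a piece of text placed at an absolute position (not an iText class).
final class PositionedText {
    final float x, y;
    final String text;

    PositionedText(float x, float y, String text) {
        this.x = x;
        this.y = y;
        this.text = text;
    }
}

final class LayoutSketch {
    static final float LEFT_MARGIN = 36f; // x of the leftmost text in the example
    static final float CHAR_WIDTH = 6f;   // assumed average glyph width in points
    static final float LINE_PITCH = 16f;  // y distance between lines in the example

    // Chunks must be sorted top to bottom (descending y), then left to right.
    static String render(PositionedText[] chunks) {
        StringBuilder out = new StringBuilder();
        StringBuilder line = new StringBuilder();
        float currentY = chunks.length > 0 ? chunks[0].y : 0f;
        for (PositionedText c : chunks) {
            if (Math.abs(c.y - currentY) > 0.5f) {
                // A new y position starts a new output line.
                out.append(line).append('\n');
                line.setLength(0);
                // Emit one empty line for every line pitch that was skipped.
                int skipped = Math.round((currentY - c.y) / LINE_PITCH) - 1;
                for (int i = 0; i < skipped; i++) {
                    out.append('\n');
                }
                currentY = c.y;
            }
            // Pad with spaces up to the column implied by the x position.
            int column = Math.round((c.x - LEFT_MARGIN) / CHAR_WIDTH);
            while (line.length() < column) {
                line.append(' ');
            }
            line.append(c.text);
        }
        return out.append(line).toString();
    }
}
```

With the positions listed above, the word at x = 300 lands in column (300 - 36) / 6 = 44, and the jump from y = 774 to y = 742 produces one empty line between the two.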

However, as explained multiple times, the concept of a line doesn't
exist in PDF.

See for instance:
http://stackoverflow.com/questions/16392886/need-to-extract-text-line-by-line-from-pdf-using-itextsharp-and-put-enter-at-eve


Re: How can i extract text from pdf including white spaces ??

mkl
In reply to this post by shailendra3009
shailendra3009,

shailendra3009 wrote:
> I used the code below to extract text line by line. It extracts the text perfectly, except that it does not read the white spaces in the PDF.
As already pointed out in a comment on your parallel post on StackOverflow, the answer

http://stackoverflow.com/questions/13644419/itext-java-pdf-to-text-creation/13645183#13645183

illustrates the reason and hints at a solution: copy the text extraction strategy (either the SimpleTextExtractionStrategy or the LocationTextExtractionStrategy) and tweak its internal parameters. In your case, that is the minimum width a gap must have to be recognized as a space, which is renderInfo.getSingleSpaceWidth()/2f by default. The person who asked that question got improved results using renderInfo.getSingleSpaceWidth()/4f.
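
As a rough illustration of what changing that divisor does, here is a small sketch in plain Java. It is not the iText source: SpaceThreshold and isSpace are hypothetical names, and singleSpaceWidth stands in for the value that renderInfo.getSingleSpaceWidth() would supply.

```java
final class SpaceThreshold {
    // Returns true when the gap between the end of the previous chunk and the
    // start of the next one is wider than singleSpaceWidth / divisor.
    static boolean isSpace(float previousEndX, float nextStartX,
                           float singleSpaceWidth, float divisor) {
        float gap = nextStartX - previousEndX;
        return gap > singleSpaceWidth / divisor;
    }

    public static void main(String[] args) {
        // A gap of 1.0 with a space width of 3.0:
        // divisor 2 gives a threshold of 1.5, so the gap is ignored;
        // divisor 4 gives a threshold of 0.75, so the gap counts as a space.
        System.out.println(isSpace(100f, 101f, 3f, 2f));
        System.out.println(isSpace(100f, 101f, 3f, 4f));
    }
}
```

A larger divisor makes the extraction more willing to insert spaces, at the risk of splitting words whose glyphs are merely spaced a little wider than usual.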

Regards,   Michael

Re: How can i extract text from pdf including white spaces ??

Kevin Day
Michael-

Do you think I need to change the default behavior to /4? Or should I add a parameter to the location-aware strategy, such as setHeuristicSpaceSensitivity(float)?

Open to opinions...


Re: How can i extract text from pdf including white spaces ??

mkl
Kevin Day wrote:
> Do you think I need to change the default behavior to /4?  Or add a parameter to the location aware strategy (setHeuristicSpaceSensitivity(float) )?
More likely the latter option. Maybe even a callback that is given multiple parameters and decides, based on them, whether there is a space or not.

regards,   Michael

Re: How can i extract text from pdf including white spaces ??

Kevin Day
I just committed a change that adds an isChunkAtWordBoundary() method to LocationTextExtractionStrategy.

Subclasses can override this to fine-tune the space determination algorithm.
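
The override pattern could look roughly like the sketch below. It is written in plain Java with stand-in classes, so Chunk, BaseJoiner, and SensitiveJoiner are hypothetical, and the parameters of the real isChunkAtWordBoundary() in iText may differ; check the LocationTextExtractionStrategy source of your version before subclassing it.

```java
// Stand-in for a positioned piece of text (not an iText class).
final class Chunk {
    final String text;
    final float startX, endX, spaceWidth;

    Chunk(String text, float startX, float endX, float spaceWidth) {
        this.text = text;
        this.startX = startX;
        this.endX = endX;
        this.spaceWidth = spaceWidth;
    }
}

// Hypothetical base strategy that exposes the word-boundary decision as a hook.
class BaseJoiner {
    // Default rule: a gap wider than half a space width is a word boundary.
    protected boolean isChunkAtWordBoundary(Chunk chunk, Chunk previous) {
        return chunk.startX - previous.endX > chunk.spaceWidth / 2f;
    }

    // Concatenates chunks on one line, inserting a space at each word boundary.
    final String join(Chunk[] chunks) {
        StringBuilder sb = new StringBuilder();
        Chunk previous = null;
        for (Chunk c : chunks) {
            if (previous != null && isChunkAtWordBoundary(c, previous)) {
                sb.append(' ');
            }
            sb.append(c.text);
            previous = c;
        }
        return sb.toString();
    }
}

// Subclass that treats narrower gaps (a quarter of a space width) as spaces.
class SensitiveJoiner extends BaseJoiner {
    @Override
    protected boolean isChunkAtWordBoundary(Chunk chunk, Chunk previous) {
        return chunk.startX - previous.endX > chunk.spaceWidth / 4f;
    }
}
```

Because only the hook is overridden, the rest of the joining logic stays in the base class, which is the advantage of this approach over copying the whole strategy.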

Re: How can i extract text from pdf including white spaces ??

pittypan
This post has NOT been accepted by the mailing list yet.
In reply to this post by shailendra3009
I think this post is closer to your question. There are also other posts discussing this issue, which you can find by searching Google.

http://stackoverflow.com/questions/16398483/using-itextsharp-with-spaces-to-extract-text-from-pdf/

Re: How can i extract text from pdf including white spaces ??

Chirsyu
This post has NOT been accepted by the mailing list yet.
Hi all,

Thanks for the great information. This is exactly what I need. Before viewing this post, I also found a useful tutorial on this topic via Google. Here is a sample on MSDN that may be helpful:

http://code.msdn.microsoft.com/Extracting-text-and-image-d47ac957

Regards,
Chris

Re: How can i extract text from pdf including white spaces ??

Partho_here
This post has NOT been accepted by the mailing list yet.
In reply to this post by shailendra3009
Hi Shailendra,
I am having the same issue.
Did you find a solution? I want to read the PDF with its spaces exactly as they appear.
I don't want the extra spaces to be trimmed automatically.
Please let me know if you have found a solution.
Thanks in advance.

Re: How can i extract text from pdf including white spaces ??

JonyGreen
This post has NOT been accepted by the mailing list yet.
In reply to this post by shailendra3009
I'm not a developer; I always use this free online PDF text extractor to extract text from PDFs.