How can I extract text from a PDF, including white space?
I used iTextSharp to extract text from a PDF, using the code below to extract it line by line. It extracts the text perfectly, except that it does not read the white space in the PDF, and reading that white space is specifically what I need.
Can I do this with the iTextSharp library?
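// Requires: using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
PdfReader reader = new PdfReader(filename);
string text = string.Empty;
// Only page 1 is read here; use reader.NumberOfPages as the upper bound to read every page.
for (int page = 1; page <= 1; page++)
    text += PdfTextExtractor.GetTextFromPage(reader, page, new LocationTextExtractionStrategy());
reader.Close();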
I have also tried SimpleTextExtractionStrategy in place of LocationTextExtractionStrategy in this code, but that does not work either.
Re: How can I extract text from a PDF, including white space?
On 7/05/2013 8:08, shailendra3009 wrote:
> I used iTextSharp to extract text from a PDF, using the code below to extract
> it line by line. It extracts the text perfectly, except that it does not read
> the white space in the PDF, and reading that white space is what I need.
Your question sounds like "I have no money in my wallet; how can I fetch
the zero dollar notes from my wallet?"
In a PDF, all text is added at absolute positions.
For instance: one word is added at position x = 36, y = 806; another
word is added at position x = 300, y = 806. Some other text is added at
positions x = 36, y = 790; x = 36, y = 774; x = 36, y = 742; ...
Where are the spaces? There are none!
But by doing the math, you can see that there's a gap between the text
that starts at position x = 36 and the one that starts at position x = 300.
Also, you see a pattern in the y positions: 806 - 16 = 790; 790 - 16 =
774; 774 - 16 = 758; 758 - 16 = 742; ...
This looks like a line was skipped at position 758.
However, as explained multiple times, the concept of a line doesn't
exist in PDF.
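
The arithmetic above can be sketched in code. This is only an illustration under loud assumptions: Chunk and LayoutSketch are hypothetical names (not iTextSharp types), every character is assumed to be charWidth points wide, rows are assumed to sit exactly lineHeight points apart, and chunks on the same row are assumed to share exactly the same y value. A real PDF will need tolerances and real font metrics.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical container for one positioned piece of text; not an iTextSharp type.
public sealed class Chunk
{
    public float X { get; }
    public float Y { get; }
    public string Text { get; }
    public Chunk(float x, float y, string text) { X = x; Y = y; Text = text; }
}

public static class LayoutSketch
{
    // Rebuilds a plain-text layout from absolute positions.
    // charWidth and lineHeight are assumed values supplied by the caller.
    public static string Rebuild(IEnumerable<Chunk> chunks, float charWidth, float lineHeight)
    {
        // Chunks with the same y form one row; PDF y grows upwards, so sort descending.
        var rows = chunks.GroupBy(c => c.Y).OrderByDescending(g => g.Key).ToList();
        if (rows.Count == 0) return string.Empty;

        float leftMargin = rows.Min(g => g.Min(c => c.X));
        var sb = new StringBuilder();
        float? previousY = null;

        foreach (var row in rows)
        {
            if (previousY.HasValue)
            {
                // A vertical gap of n * lineHeight becomes n line breaks,
                // so a skipped row shows up as an empty line.
                int steps = Math.Max(1, (int)Math.Round((previousY.Value - row.Key) / lineHeight));
                sb.Append('\n', steps);
            }

            var line = new StringBuilder();
            foreach (var c in row.OrderBy(c => c.X))
            {
                // Turn the x offset into a column index and pad with spaces up to it.
                int column = (int)Math.Round((c.X - leftMargin) / charWidth);
                if (column > line.Length) line.Append(' ', column - line.Length);
                line.Append(c.Text);
            }

            sb.Append(line);
            previousY = row.Key;
        }
        return sb.ToString();
    }
}
```

Fed the positions from the example (with "Hello" at x = 36 and "World" at x = 300 on the row at y = 806, a charWidth of 6 and a lineHeight of 16), this puts 39 spaces between the two words and leaves one empty row before the text at y = 742.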
iText-questions mailing list
https://lists.sourceforge.net/lists/listinfo/itext-questions
The mailing list reply above illustrates the reason and hints at a solution: copy the text extraction strategy (either the SimpleTextExtractionStrategy or the LocationTextExtractionStrategy) and tweak its internal parameters. In your case, that is the minimum width of a gap to be recognized as a space, which is renderInfo.getSingleSpaceWidth()/2f by default; the person who asked back there got improved results using renderInfo.getSingleSpaceWidth()/4f.
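
To make the effect of that threshold concrete, here is a minimal sketch of the decision itself, written without any iTextSharp dependency. Run and GapSketch are hypothetical names invented for this illustration, and the divisor parameter stands in for the 2f or 4f mentioned above; the real strategies carry more logic than this (line detection, overlapping text, and so on).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a run of text on one line; not an iTextSharp type.
public sealed class Run
{
    public float StartX { get; }
    public float EndX { get; }
    public string Text { get; }
    public Run(float startX, float endX, string text) { StartX = startX; EndX = endX; Text = text; }
}

public static class GapSketch
{
    // Joins runs that are already sorted left to right on a single line.
    // A space is emitted only when the horizontal gap between two runs
    // is wider than singleSpaceWidth / divisor.
    public static string Join(IList<Run> runs, float singleSpaceWidth, float divisor)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < runs.Count; i++)
        {
            if (i > 0)
            {
                float gap = runs[i].StartX - runs[i - 1].EndX;
                if (gap > singleSpaceWidth / divisor) sb.Append(' ');
            }
            sb.Append(runs[i].Text);
        }
        return sb.ToString();
    }
}
```

With a single-space width of 3 and a gap of 1 between two runs, a divisor of 2 gives a threshold of 1.5 and the runs are glued together, while a divisor of 4 gives a threshold of 0.75 and a space is emitted. Note that this only ever emits one space per gap; keeping wide gaps as several spaces needs the gap width to be converted into a space count as well.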
I am having the same issue.
Did you find a solution? I want to read the PDF with its spaces exactly as they appear.
I don't want the extra spaces to be trimmed.
Please let me know if you found a solution.
Thanks in advance.