Extracting word and finding co-ordinates from pdf (.Net Framework)


Debasis Mandal
Hello,

I am working on extracting text from a PDF and want to get the exact position (coordinates) of every word using the iTextSharp DLL on the .NET Framework. My problem is that the extracted words come out wrong: a word is split into several parts. For example, for the word "PAGE", the first render call delivers "PAG" and the next one delivers "E". I have the same problem when finding the coordinates of a word.

Can you help me extract words together with their positions (coordinates) from a PDF in the .NET Framework?



Thanks,
Debasis Mandal
 

_______________________________________________
iText-questions mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/itext-questions

iText(R) is a registered trademark of 1T3XT BVBA.
Many questions posted to this list can (and will) be answered with a reference to the iText book: http://www.itextpdf.com/book/
Please check the keywords list before you ask for examples: http://itextpdf.com/themes/keywords.php

Re: Extracting word and finding co-ordinates from pdf (.Net Framework)

Kalani Bright
I ended up finding that it was better to use pdfbox for this piece of functionality, though not for whole words, since PDFs aren't structured that way. Instead, I got the positions of individual characters and compared them, together with their relative positions, against a known piece of plain text for the page to work out which word each character belonged to and where it was.

pdfbox gave me each character and its position, though that's Java :(
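
One possible reading of the character-matching idea above, as a minimal C# sketch rather than the code actually used: it assumes the per-character positions have already been obtained in reading order from some extractor and that the plain text of the page is known. The types PositionedChar and WordBox and the class KnownTextAligner are hypothetical names invented for this illustration; they belong to neither pdfbox nor iTextSharp.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical holder for one extracted character and its bounding box.
public sealed class PositionedChar
{
    public char Value;
    public float Left, Bottom, Right, Top;
}

// Hypothetical holder for one reconstructed word and its bounding box.
public sealed class WordBox
{
    public string Text;
    public float Left, Bottom, Right, Top;
}

public static class KnownTextAligner
{
    // Walks the known page text word by word and consumes the extracted
    // characters in order, growing one bounding box per word.
    // Assumes the characters arrive in the same order as the known text.
    public static List<WordBox> Align(string knownText, IList<PositionedChar> chars)
    {
        var result = new List<WordBox>();
        int i = 0; // index of the next unconsumed extracted character

        // Passing null as the separator list splits on any whitespace.
        foreach (var word in knownText.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
        {
            WordBox box = null;
            foreach (char expected in word)
            {
                // Extracted whitespace carries no word content, so skip it.
                while (i < chars.Count && char.IsWhiteSpace(chars[i].Value)) i++;

                if (i >= chars.Count || chars[i].Value != expected)
                    throw new InvalidOperationException("Extracted characters do not match the known text.");

                var c = chars[i++];
                if (box == null)
                {
                    box = new WordBox { Text = "", Left = c.Left, Bottom = c.Bottom, Right = c.Right, Top = c.Top };
                }
                box.Text += c.Value;
                // Enlarge the word box to cover this character.
                box.Left = Math.Min(box.Left, c.Left);
                box.Bottom = Math.Min(box.Bottom, c.Bottom);
                box.Right = Math.Max(box.Right, c.Right);
                box.Top = Math.Max(box.Top, c.Top);
            }
            result.Add(box);
        }
        return result;
    }
}
```

The sketch fails loudly on the first mismatch instead of trying to resynchronise; a real page with ligatures or characters out of reading order would need something more forgiving.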



On 1/8/13 9:53 PM, Debasis Mandal wrote:
Hello,

I am working on extracting text from a PDF and want to get the exact position (coordinates) of every word using the iTextSharp DLL on the .NET Framework. My problem is that the extracted words come out wrong: a word is split into several parts. For example, for the word "PAGE", the first render call delivers "PAG" and the next one delivers "E". I have the same problem when finding the coordinates of a word.

Can you help me extract words together with their positions (coordinates) from a PDF in the .NET Framework?



Thanks,
Debasis Mandal
 



Re: Extracting word and finding co-ordinates from pdf (.Net Framework)

mkl
Debasis Mandal,
Debasis Mandal wrote
I am working on extracting text from a PDF and want to get the exact position (coordinates) of every word using the iTextSharp DLL on the .NET Framework. My problem is that the extracted words come out wrong: a word is split into several parts. For example, for the word "PAGE", the first render call delivers "PAG" and the next one delivers "E". I have the same problem when finding the coordinates of a word.
Unfortunately you have not told us how you try to extract words from your PDF. Thus, I have to guess what you are doing.

I assume you have implemented your own RenderListener and process each TextRenderInfo immediately when you receive it, starting from the premise that each word is completely contained in one TextRenderInfo.

This premise is wrong. PDF page content contains numerous groups of glyphs, each of which is to be displayed starting from its own starting position. These groups of glyphs may be anything: whole text lines, multiple words, single words, word parts, individual characters. A group may even contain the end of one word and the start of the next without containing either word completely. Furthermore, the groups don't even have to appear in reading order.

And each TextRenderInfo represents one such glyph group.

Thus, to find the coordinates of the words, you have to collect all glyph groups / TextRenderInfos that may make up your word and only then determine the coordinates.

The source of the LocationTextExtractionStrategy shows you how to collect and sort the text render information objects.
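
To illustrate the collect-and-sort idea in isolation, here is a minimal C# sketch that deliberately avoids any iTextSharp dependency. It is not the LocationTextExtractionStrategy source. GlyphGroup is a hypothetical stand-in for the text and bounding box taken from one TextRenderInfo, and WordAssembler, lineTolerance and gapTolerance are names and thresholds invented for this example. It also assumes that a glyph group holds no whitespace in its interior; groups spanning several words would first have to be split into smaller pieces.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the data taken from one TextRenderInfo:
// its text plus a bounding box in PDF user space (y grows upwards).
public sealed class GlyphGroup
{
    public string Text;
    public float Left, Bottom, Right, Top;

    public GlyphGroup(string text, float left, float bottom, float right, float top)
    {
        Text = text; Left = left; Bottom = bottom; Right = right; Top = top;
    }
}

public static class WordAssembler
{
    // Collects all groups first, then sorts and merges them into words.
    // lineTolerance and gapTolerance are guesses that need tuning per document.
    public static List<GlyphGroup> BuildWords(IEnumerable<GlyphGroup> groups,
                                              float lineTolerance = 2f,
                                              float gapTolerance = 1f)
    {
        // Step 1: bucket the groups into lines, top of the page first.
        var lines = new List<List<GlyphGroup>>();
        foreach (var g in groups.OrderByDescending(x => x.Bottom))
        {
            var last = lines.LastOrDefault();
            if (last == null || Math.Abs(last[0].Bottom - g.Bottom) > lineTolerance)
            {
                last = new List<GlyphGroup>();
                lines.Add(last);
            }
            last.Add(g);
        }

        // Step 2: within each line, go left to right and merge neighbours.
        var words = new List<GlyphGroup>();
        foreach (var line in lines)
        {
            GlyphGroup current = null; // word being assembled
            foreach (var g in line.OrderBy(x => x.Left))
            {
                bool isSpace = string.IsNullOrWhiteSpace(g.Text);
                bool gapTooWide = current != null && g.Left - current.Right > gapTolerance;

                // Whitespace or a wide gap ends the word being assembled.
                if (current != null && (isSpace || gapTooWide))
                {
                    words.Add(current);
                    current = null;
                }
                if (isSpace) continue;

                if (current == null)
                {
                    current = new GlyphGroup(g.Text, g.Left, g.Bottom, g.Right, g.Top);
                }
                else
                {
                    // Append the text and grow the box to cover both parts.
                    current.Text += g.Text;
                    current.Right = Math.Max(current.Right, g.Right);
                    current.Bottom = Math.Min(current.Bottom, g.Bottom);
                    current.Top = Math.Max(current.Top, g.Top);
                }
            }
            if (current != null) words.Add(current);
        }
        return words;
    }
}
```

With the example from the question, a group "PAG" ending at x = 31 and a group "E" starting at x = 31 on the same baseline would be merged into one word "PAGE" whose rectangle covers both, regardless of the order in which the two groups were received.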

Some more hints can be found, for example, in this Stack Overflow answer: http://stackoverflow.com/questions/13714605/retrieve-the-respective-coordinates-of-all-words-on-the-page-with-itextsharp/13719947

Regards,   Michael

Re: Extracting word and finding co-ordinates from pdf (.Net Framework)

karthi_beit
This post has NOT been accepted by the mailing list yet.
Hi,

I am facing the same issue, please help me. I am not getting the full word. My code:
using System.Collections.Generic;
using iTextSharp.text.pdf.parser;

// RectAndText is not shown in this post; it is assumed to be a simple
// holder for a rectangle and the text found inside it.
public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy
{
    // Holds one rectangle/text pair per rendered chunk
    public List<RectAndText> myPoints = new List<RectAndText>();

    // Automatically called for each chunk of text in the PDF;
    // a chunk is not necessarily a whole word
    public override void RenderText(TextRenderInfo renderInfo)
    {
        base.RenderText(renderInfo);

        // Get the bounding box for the chunk of text
        var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
        var topRight = renderInfo.GetAscentLine().GetEndPoint();

        // Create a rectangle from it
        var rect = new iTextSharp.text.Rectangle(bottomLeft[Vector.I1], bottomLeft[Vector.I2], topRight[Vector.I1], topRight[Vector.I2]);

        // Add this to our main collection
        this.myPoints.Add(new RectAndText(rect, renderInfo.GetText()));
    }
}

Thanks

Re: Extracting word and finding co-ordinates from pdf (.Net Framework)

blowagie
This post has NOT been accepted by the mailing list yet.
Please read http://itextpdf.com/nabble

If you look at your question, you will see the notice "This post has NOT been accepted by the mailing list yet", and it never will be. Please use the appropriate channel to post a question: http://itextpdf.com/support