adding text to page stops acrobat OCR.. is there a workaround?


Nicholas Mistry

I am using iText to merge a series of TIFF documents into a PDF.  After the
merge we use Acrobat Professional 7's OCR so that we can search the entire
document.  This works great.

Recently, I added some page marks (text) at the top of each page (using iText),
denoting page numbers, etc.  This broke the ability to have Acrobat OCR the
PDF, since it now contains rendered text.

My question: is there a way to add some annotations to the page, but still
allow Acrobat to OCR the page?

My initial gut feeling is to render the text as images and place them on the
doc, but I wanted to ask if there was an easier way.

Thanks

-N





-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast
and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642
_______________________________________________
iText-questions mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/itext-questions

Re: adding text to page stops acrobat OCR.. is there a workaround?

Leonard Rosenthol
At 01:51 PM 3/23/2006, Nicholas Mistry wrote:
>I am using iText to merge a series of tiff documents into
>PDF.  After the merge
>we use acrobat professional 7's OCR to allow us to search the entire
>document.
>  This works great.

         OK.


>Recently, i added some page marks (text) at the top of each page
>(using iText),
>denoting page numbers, etc..     This now broke the ability to have
>Acrobat OCR
>the PDF, since it now contains rendered text.

         That is correct, as the Acrobat OCR engine will ONLY process
"image only" documents.


>My question, is there a way to add some annotations to the page, but
>still allow
>acrobat to OCR the page?

         I would recommend that you OCR first and THEN apply your
"annotations".


>My intial gut feeling is to render the text as images, and place them on the
>doc... but i wanted ask if there was an easier way.

         That won't help either, as you can only have a SINGLE image
on the page for Acrobat to OCR it.  It won't do it for multiple ones, IIRC.


Leonard

---------------------------------------------------------------------------
Leonard Rosenthol                            <mailto:[hidden email]>
Chief Technical Officer                      <http://www.pdfsages.com>
PDF Sages, Inc.                              215-938-7080 (voice)
                                              215-938-0880 (fax)




Re: adding text to page stops acrobat OCR.. is there a workaround?

Nicholas Mistry
First off, thanks for the reply!

Leonard Rosenthol wrote:

>> My question, is there a way to add some annotations to the page, but
>> still allow
>> acrobat to OCR the page?
>
>
>         I would recommend that you OCR first and THEN apply your
> "annotations".

I agree, but unfortunately it is not possible to OCR first in Acrobat at
this point.  I perform a bunch of operations on the TIFFs and bookmark
them nicely.

Is there a way to OCR the TIFFs and put the text under the image like Adobe
does?  From what I have read, iText does not easily support this, and I
have not come across a product that does this on the cheap (under
$2000).  Any suggestions?

Has anyone integrated gocr output into a searchable image PDF?

Honestly, I don't know how much effort is required to replicate the
"document capture" feature of Acrobat.

>> My intial gut feeling is to render the text as images, and place them
>> on the
>> doc... but i wanted ask if there was an easier way.
>
>
>         That won't help either, as you can only have a SINGLE image on
> the page for Acrobat to OCR it.  It won't do it for multiple ones, IIRC.

Correct.  What I was actually referring to was rendering the text on
the image file itself.  I guess I was misleading with the term "doc".
The idea is to take the TIFF, stretch the canvas, and render the text
directly on the image, then import it into the PDF using iText.
(Sounds like digital fax technology.)
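
For anyone who wants to try the "stretch the canvas" idea, here is a minimal
sketch using only the Java standard library (java.awt and BufferedImage), not
iText.  The class name PageMarkStamper, the method addHeaderBand, and the band
height are made up for illustration.  Reading the TIFF and handing the result
to iText are left out, and a real version would probably want to keep the
scan's resolution and color model.

```java
import java.awt.Color;
import java.awt.Font;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

public class PageMarkStamper {

    /**
     * Returns a new image that is bandHeight pixels taller than the scan.
     * The extra band sits above the scan, is filled with white, and carries
     * the label as rasterized text, so the result is still a single image.
     */
    public static BufferedImage addHeaderBand(BufferedImage scan, String label, int bandHeight) {
        int width = scan.getWidth();
        int height = scan.getHeight() + bandHeight;
        BufferedImage out = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);

        Graphics2D g = out.createGraphics();
        try {
            // White background, so the new band at the top is blank paper.
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, width, height);

            // Copy the original scan below the band.
            g.drawImage(scan, 0, bandHeight, null);

            // Draw the page mark inside the band (font size is an arbitrary choice).
            g.setColor(Color.BLACK);
            g.setFont(new Font(Font.SANS_SERIF, Font.PLAIN, Math.max(8, bandHeight / 2)));
            int baseline = bandHeight - g.getFontMetrics().getDescent() - 2;
            g.drawString(label, 10, baseline);
        } finally {
            g.dispose();
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-in for a scanned page; a real program would load the TIFF here.
        BufferedImage scan = new BufferedImage(850, 1100, BufferedImage.TYPE_INT_RGB);
        BufferedImage marked = addHeaderBand(scan, "Page 1 of 12", 60);
        System.out.println(marked.getWidth() + "x" + marked.getHeight());
    }
}
```

Since the mark ends up as pixels in the same image, the page should still look
image-only to Acrobat.  Presumably the OCR pass will then pick up the page mark
as text too.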


-N












Re: adding text to page stops acrobat OCR.. is there a workaround?

Nicholas Mistry
In reply to this post by Leonard Rosenthol
Leonard Rosenthol wrote:

> At 01:51 PM 3/23/2006, Nicholas Mistry wrote:
>
>> I am using iText to merge a series of tiff documents into PDF.  After
>> the merge
>> we use acrobat professional 7's OCR to allow us to search the entire
>> document.
>>  This works great.
>
>
>         OK.
>
>
>> Recently, i added some page marks (text) at the top of each page
>> (using iText),
>> denoting page numbers, etc..     This now broke the ability to have
>> Acrobat OCR
>> the PDF, since it now contains rendered text.
>
>
>         That is correct, as the Acrobat OCR engine will ONLY process
> "image only" documents.
>
>
>> My question, is there a way to add some annotations to the page, but
>> still allow
>> acrobat to OCR the page?
>
>
>         I would recommend that you OCR first and THEN apply your
> "annotations".
>
>
>> My intial gut feeling is to render the text as images, and place them
>> on the
>> doc... but i wanted ask if there was an easier way.
>
>
>         That won't help either, as you can only have a SINGLE image on
> the page for Acrobat to OCR it.  It won't do it for multiple ones, IIRC.
>

Well, I just wrote a test program that inserts multiple TIFF files on a
page.  Surprisingly, Acrobat actually OCR'd it.  Watching the status bar
closely, Acrobat first rasterizes the entire page and then passes it to
the OCR engine.

This was tested on Acrobat 7 Professional; I'm not sure about previous
versions.

Now this leads me to another question: why couldn't they have
rasterized the text as well?  Or better yet, ignored the text portion
completely?

Anyway, it's a workaround for now.  I am still interested in learning
how to create searchable images, and may add that to the app later.

Thanks again!

-N




Re: adding text to page stops acrobat OCR.. is there a workaround?

Leonard Rosenthol
In reply to this post by Nicholas Mistry
At 01:00 AM 3/24/2006, Nicholas Mistry wrote:
>Is there a way to OCR the tiffs and put the text under image like
>adobe does.  From what i have read, iText does not easly support
>this, and i have not come across a product that does this on the
>cheap.   (under $2000)    Any suggestions?

         I don't know about pricing but there are numerous alternatives
to Acrobat Capture out there - in fact, most are MUCH BETTER and more flexible.


>Has anyone integrated gocr output into a searchable image PDF?

         Not that I know of - but it could probably be done...


>Honestly, I dont know how much effort is required to replecate the
>"document capture" feature of acrobat..

         You need an OCR engine and a PDF library.  iText is the
latter - you just need to find the former.   I will note, however,
that NONE of the open source/free ones will give you anywhere near
the accuracy/quality of Capture.  And Capture SUCKS compared to the
serious commercial applications.


>Correct...   What I actually was referring to was rendering the text
>on the image file.  I guess i was missleading w/ the term
>"doc".      So, what i was referring to was taking the tiff,
>stretching the canvas, and rendering the text directly on the
>image.    Then importing it into PDF using iText.   (sounds like
>digital fax technology).

         Sure - that will work.


Leonard

---------------------------------------------------------------------------
Leonard Rosenthol                            <mailto:[hidden email]>
Chief Technical Officer                      <http://www.pdfsages.com>
PDF Sages, Inc.                              215-938-7080 (voice)
                                              215-938-0880 (fax)




Re: adding text to page stops acrobat OCR.. is there a workaround?

mikelilin
In reply to this post by Nicholas Mistry
You can try this free online PDF OCR to convert PDF to text online.