software for reading sequence from PDF file

Don Gilbert gilbertd at bio.indiana.edu
Fri Mar 19 17:54:47 EST 1999

Certainly what you ask can be done (extract specific text from PDF),
if the PDF docs are not encrypted/secured by the creators.  Text is
stored as text in PDF, not as bitmap images (unless the PDF was created
from a bitmap image) so you can pull out the text with the right
tool.  PDF format is well documented by Adobe.  

Here are some PDF links
See esp. here for extraction tools

I've written software to create PDF from various graphics/text.
It wasn't too hard.  If you need to write it, software to 
extract text should be a straightforward programming project 
for a software engineer.  Java is a great match for PDF, since
Java's standard ZIP libraries (java.util.zip) work on PDF compressed data.
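
To illustrate the point about the ZIP libraries: PDF content streams
compressed with the FlateDecode filter use the same zlib/deflate format
that java.util.zip.Inflater reads.  The sketch below is only a rough
illustration, not a text extractor.  The class name FlateDemo, its
helper methods, and the sample content stream are made up for the
example, and it assumes you have already located the raw bytes of one
stream in the file.  Streams using other filters (LZW, for instance) and
the mapping from font encodings to readable characters are not covered.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class: round-trips a PDF-style content stream
// through the deflate/inflate routines in java.util.zip.
class FlateDemo {

    // Compress bytes into zlib format, standing in for what a PDF
    // writer stores in a FlateDecode stream.
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress the bytes of one FlateDecode stream (the data found
    // between the "stream" and "endstream" keywords of a PDF object).
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                // No progress and nothing left to read: the data is
                // truncated or needs a preset dictionary.
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported stream");
                }
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid zlib data", e);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        // Made-up content stream: PDF operators that draw one string.
        String content = "BT /F1 12 Tf 72 720 Td (ATGGCCAAGT) Tj ET";
        byte[] packed = deflate(content.getBytes(StandardCharsets.US_ASCII));
        byte[] unpacked = inflate(packed);
        System.out.println(new String(unpacked, StandardCharsets.US_ASCII));
    }
}
```

Once a stream is inflated, the text shows up as the string operands of
operators such as Tj, which is where an extractor would start looking.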

-- Don

In article <717801BBC100D211B89500805F6FAD93047D56 at snap01.synapticcorp.com>,
 <Tvenkatesh at synapticcorp.com> wrote:
>I would like to know if there is software that can convert PDF file into
>text files.
>Specifically we want to extract  sequences from patent documents which are
>stored as images in PDF
>format. We tried Acrobat Reader, but it did not help.
>I appreciate your help.
>T. V. (Venky) Venkatesh, Ph D
>Senior Scientist (Bioinformatics and Molecular Biology)
>Synaptic Pharmaceutical Corporation
>215 College Road
>Paramus NJ 07652 - 1431
>201-261-1331x720 (Phone)
>Tvenkatesh at synapticcorp.com

-- d.gilbert--biocomputing--indiana-u--bloomington-in-47405
-- gilbertd at bio.indiana.edu

More information about the Bio-soft mailing list