[magick-users] Extract PDF Text
Joseph Kolibal
joseph.kolibal at usm.edu
Sat Nov 24 05:47:24 PST 2007
There are several alternatives under Linux; if there are other options,
I am not aware of them. I use pdftotext, i.e.,
pdftotext FILE.pdf
to extract to FILE.txt. Alternatively, if that fails, I convert the PDF to
a PostScript file using
pdf2ps FILE.pdf
creating FILE.ps and then I try
ps2ascii FILE.ps
or
pstotext FILE.ps
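The fallback sequence above could be wrapped in a small shell function. This
is only a sketch: extract_text is a hypothetical helper name, and it assumes
that pdftotext, pdf2ps, ps2ascii and pstotext are installed and that an empty
output file should count as a failure.

```shell
# Sketch: try pdftotext first, then fall back to the PostScript route.
# extract_text is a hypothetical helper name, not part of any package.
extract_text() {
    pdf=$1
    base=${pdf%.pdf}

    # First choice: pdftotext writes the text to the named output file.
    if pdftotext "$pdf" "$base.txt" && [ -s "$base.txt" ]; then
        return 0
    fi

    # Fallback: convert to PostScript, then extract the text from that.
    pdf2ps "$pdf" "$base.ps" || return 1

    # ps2ascii prints to stdout when given only an input file.
    if ps2ascii "$base.ps" > "$base.txt" && [ -s "$base.txt" ]; then
        return 0
    fi

    # Last resort: pstotext, which also prints to stdout by default.
    pstotext "$base.ps" > "$base.txt" && [ -s "$base.txt" ]
}
```

Calling extract_text FILE.pdf would leave the text in FILE.txt and return a
non-zero status if none of the tools produced any output.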
I have also done the conversion from PDF to PostScript using Adobe Acrobat
Reader (acroread), i.e.,
cat FILE.pdf|acroread -toPostScript -start PAGESTART -end PAGEEND > NEWFILE.ps
and obtained different results. In some cases I have also found it convenient
to use the pdftk package, using
pdftk FILE.pdf output NEWFILE.pdf uncompress
which writes an uncompressed PDF that can be opened in a text editor to make
corrections directly. pdftk can also repair a corrupted PDF, and in some cases
it may be necessary to try this.
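As a rough illustration of that edit cycle, the uncompress step can be paired
with pdftk's compress output option to pack the file again afterwards. The
function name edit_pdf_raw and the intermediate file name are hypothetical,
and the sketch assumes $EDITOR (or vi) is a suitable editor.

```shell
# Sketch: uncompress a PDF, edit it by hand, then write a compressed copy.
# edit_pdf_raw is a hypothetical helper name.
edit_pdf_raw() {
    src=$1
    dst=$2
    work=${src%.pdf}.uncompressed.pdf

    # Expand the page streams so they are readable in a text editor.
    pdftk "$src" output "$work" uncompress || return 1

    # Make the corrections by hand.
    "${EDITOR:-vi}" "$work" || return 1

    # Pass the edited file back through pdftk and compress it again.
    pdftk "$work" output "$dst" compress
}
```

For a damaged file, the plain form pdftk FILE.pdf output NEWFILE.pdf, with no
further options, is the repair attempt.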
Finally, it is sometimes convenient to grab the text from the page display
of the PDF file by selecting it with the mouse in kpdf or xpdf. Needless to
say, extracting non-text data such as tables and mathematics does not always
succeed as well as desired.
Joseph
On Fri, 23 Nov 2007 18:27:10 -0500
Ben Marchbanks <ben at alQemy.com> wrote:
> Is there a way to dump the text from a PDF independent of the PDF to
> image conversion process ?
>
----------------------------------
Joseph Kolibal
The University of Southern Mississippi
Department of Mathematics
118 College Drive 5045
Hattiesburg, MS 39406-0001
E-mail: joseph.kolibal at usm.edu, kolibal at delphi.st.usm.edu
Office: Room 207 Southern Hall, PH: 601-266-4301, FX: 601-266-5818
Web Links:
http://www.math.usm.edu/kolibal (Home pages)
http://www.math.usm.edu/cmi CMI (Computational Mathematics Information)
Further contact:
Department of Mathematics
PH: 601-266-4289/FX: 601-266-5818
http://www.usm.edu/math
Sent: From athena
----------------------------------