Multi-page pdf to single tiff or jpg file

Posted: 2007-10-18T08:04:39-07:00
by BarryII
I have to turn a multi-page pdf file that contains images of text into a jpg or tiff file so the OCR software in HP Document Viewer can read it. The OCR software can't batch-process so I want to feed it a single file containing all pages from the pdf file.

I'm not sure how to determine the proper resolution when converting and I don't know how to determine the resolution of the original pdf file. I'm also not sure if the conversion that I tried from pdf to tiff (using Ghostscript) was lossy, or if the appending that I did with ImageMagick to create a single file was lossy. I want lossless. The OCR might have actually been more accurate on the jpg than the tiff based on a spell check that found 509 misspellings in the OCR translation of the jpg ( http://www.polisource.com/misc/ocr-test.txt ) and 540 in the OCR translation of the tiff ( http://www.polisource.com/misc/ocr-test-2.txt ), so I was wondering if I need to add something to the commands to make it non-lossy.

The source pdf document is a local copy of http://democrats.science.house.gov/Medi ... A%20IG.pdf

Here are the commands I used (first I used Ghostscript to convert, then I discovered that I need ImageMagick to join the pages):

Code:

gswin32c -sOutputFile=PCIE-Report-%03d.tif -r150 -sDEVICE=tiff32nc -dBATCH -dNOPAUSE "document.pdf"

convert PCIE-Report-001.tif PCIE-Report-002.tif PCIE-Report-003.tif PCIE-Report-004.tif PCIE-Report-005.tif PCIE-Report-006.tif PCIE-Report-007.tif PCIE-Report-008.tif PCIE-Report-009.tif PCIE-Report-010.tif PCIE-Report-011.tif PCIE-Report-012.tif PCIE-Report-013.tif -append PCIE-Report-Image.tif
Then I changed the resolution of PCIE-Report-Image.tif to 300 dpi using XNview before I used OCR, otherwise the OCR software would reject the file because of low resolution. In the first command I use a resolution of 150 because 300 pretty much crashed my computer.
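
The trouble at 300 dpi can be sanity-checked with some arithmetic. The sketch below estimates the uncompressed size of a rendered page. It assumes a US Letter page (8.5 x 11 in), which the thread does not confirm, and 4 bytes per pixel, which is what the 32-bit CMYK tiff32nc device writes. The function name page_bytes is made up for illustration.

```shell
#!/bin/sh
# Estimate the uncompressed size of one rendered page, in bytes.
# Arguments: width in inches x100, height in inches x100, dpi, bytes per pixel.
# (Inches are passed as hundredths so plain integer arithmetic is enough.)
page_bytes() {
    w_px=$(( $1 * $3 / 100 ))   # page width in pixels
    h_px=$(( $2 * $3 / 100 ))   # page height in pixels
    echo $(( w_px * h_px * $4 ))
}

# Assumed US Letter page (8.50 x 11.00 in); tiff32nc stores 4 bytes per pixel.
echo "150 dpi, one page:  $(page_bytes 850 1100 150 4) bytes"
echo "300 dpi, one page:  $(page_bytes 850 1100 300 4) bytes"
echo "300 dpi, 13 pages:  $(( $(page_bytes 850 1100 300 4) * 13 )) bytes"
```

Under these assumptions one page is about 8.4 million bytes at 150 dpi and about 33.7 million at 300 dpi. Doubling the resolution quadruples the pixel count, so thirteen pages appended into one image come to roughly 438 million bytes before any compression. That may be why the 300 dpi run overwhelmed the machine.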

When I made the jpg to compare to the tiff I just changed the extension of PCIE-Report-Image at the end of the second command.

Sorry this turned out to be so long... basically I'm asking how to convert a pdf document into a single tiff file for use by OCR software.

Re: Multi-page pdf to single tiff or jpg file

Posted: 2007-10-18T11:08:46-07:00
by Bonzo
To convert the pdf to tiff:

Code:

convert input.pdf output-%d.tiff
This should give you multiple TIFFs with names in the format output-0.tiff, output-1.tiff, and so on.

You then need to join them together with append

Code:

convert -background none -gravity Center output-0.tiff output-1.tiff output-2.tiff output-3.tiff -append vertical.tiff
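
Note that -append stacks the pages into one very tall image. A possible alternative is a multi-page TIFF, where each page stays a separate image inside a single file. The sketch below only prints a Ghostscript command for that and does not run it. The function name gs_multipage_cmd and the file names are invented for illustration, and "gs" is an assumed executable name (the thread uses gswin32c on Windows). Whether the OCR software in question accepts multi-page TIFFs is not established in this thread.

```shell
#!/bin/sh
# Print (do not run) a Ghostscript command that renders every page of a PDF
# into ONE multi-page TIFF. The output name has no %d, so pages are not split
# into separate files.
# Arguments: input PDF, output TIFF, resolution in dpi.
# GS is the Ghostscript executable; "gs" is an assumed default.
gs_multipage_cmd() {
    printf '%s -dBATCH -dNOPAUSE -sDEVICE=tiffg4 -r%s -sOutputFile="%s" "%s"\n' \
        "${GS:-gs}" "$3" "$2" "$1"
}

# Hypothetical file names, for illustration only.
gs_multipage_cmd document.pdf PCIE-Report-all.tif 300
```

The tiffg4 device writes 1-bit black-and-white pages with CCITT Group 4 compression, which is lossless and usually compact for scanned text. For grey or colour pages, tiffgray or tiff24nc could be substituted. With ImageMagick, listing the page files without -append should likewise produce a multi-page TIFF.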

Re: Multi-page pdf to single tiff or jpg file

Posted: 2007-10-19T09:23:22-07:00
by BarryII
Converting from pdf to tiff using your ImageMagick command produces pages that look like this - 59.7 K files that are unsuitable for OCR due to bad quality. This is the pdf file that I'm converting to tiff. I want the resolution of the tiff to be the same as the pdf file, which is an image of text. I want the full sized (100%) view of the pdf version to look the same as the same size tiff.

When I used the Ghostscript command:

Code:

gswin32c -sOutputFile=PCIE-Report-%03d.tif -r150 -sDEVICE=tiff32nc -dBATCH -dNOPAUSE "document.pdf"
I got pages with good resolution, like this, which is a 7.7 megabyte file, but the 150 resolution that I specified in the command is arbitrary. I don't know if it's lower or higher than necessary for the best quality conversion.
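
One way to pick a less arbitrary value is to work out the resolution of the scans embedded in the PDF. PDF page sizes are measured in points at 72 per inch, so an image's effective dpi is its pixel width times 72, divided by the width in points it occupies on the page. The sketch below does only that arithmetic. The function name effective_dpi and the sample numbers are hypothetical. The pixel dimensions have to come from another tool, for example pdfimages -list from Poppler, assuming it is available.

```shell
#!/bin/sh
# Effective resolution of an image placed on a PDF page:
#   dpi = pixels across the image * 72 / points across the area it covers
# (PDF user space defaults to 72 points per inch.)
# Arguments: image width in pixels, placed width in points.
# Integer division is used, so the result is rounded down.
effective_dpi() {
    echo $(( $1 * 72 / $2 ))
}

# Hypothetical numbers: scans 1700 and 2550 pixels wide, each filling a
# 612-point-wide (8.5 in) page.
effective_dpi 1700 612
effective_dpi 2550 612
```

Rendering with -r set to the value found this way should keep resampling of the scanned image to a minimum. A lower value discards detail, and a higher one only enlarges the file.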

Re: Multi-page pdf to single tiff or jpg file

Posted: 2007-10-19T09:52:52-07:00
by Bonzo
I know nothing about TIFF; all I can suggest is that you search the forum and see what you find. I did a quick scan and found a comment about using -density 300x300 in the command, but in my test it did not have much effect.
You could try sharpening, and as you are working in black and white you may be able to do other things.

Re: Multi-page pdf to single tiff or jpg file

Posted: 2007-10-21T16:47:04-07:00
by anthony
OCR software generally expects the text to have been scanned in from a scanner, at a density anywhere from 300 to 2400 dpi. Anti-aliasing in the text is sometimes desirable, sometimes not, depending on the OCR software.

Now this basically means that appending all your PDF pages together at this density is going to produce a MASSIVE image that the OCR software may not be able to handle on its own. The better idea is to process the pages one at a time, then append the resulting text as appropriate.
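
The page-at-a-time approach could be sketched as below. This only applies if an OCR program with a command-line interface is available, which the thread does not establish for the software in question. OCR_CMD is a placeholder for any command that takes an image path and writes the recognised text to standard output. The function name ocr_pages and the wrapper script name are likewise invented for illustration.

```shell
#!/bin/sh
# Run an OCR command on each page image in turn and concatenate the text.
# Arguments: output text file, then the page images in reading order.
# OCR_CMD is a placeholder (an assumption): a command that takes one image
# path and prints the recognised text on standard output.
ocr_pages() {
    out=$1
    shift
    : > "$out"                       # start with an empty output file
    for page in "$@"; do
        "$OCR_CMD" "$page" >> "$out" || return 1
        printf '\f\n' >> "$out"      # form feed marks the page boundary
    done
}

# Usage sketch (hypothetical wrapper script name):
#   OCR_CMD=./my-ocr-wrapper.sh
#   ocr_pages report.txt PCIE-Report-*.tif
```

Zero-padded page names such as PCIE-Report-001.tif sort correctly in a shell glob, so the text comes out in page order.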

An alternative to actual text conversion is the use of an image file format such as DjVu. This basically scans the text for sub-images that are duplicates (letters) on a fixed color background. That is, it replaces the image with a form that is extremely compressed, easy to process later for actual OCR, and still preserves graphics and the fonts that were used in the text. This format is quickly becoming the ideal scanned format. (IM has not yet been updated to handle this format though :( )