ccitt files from pdfimages

Posted: 2020-05-09T03:54:52-07:00
by muccigrosso
I frequently use pdfimages to get the images out of PDFs. I like to use the "-all" switch to make sure that the images are extracted in their original format. Often I get pairs of files for CCITT images, with one file containing the data and the other the parameters, like "test-001.ccitt" and "test-001.params". This is described in the pdfimages man page.

My question is how to deal with these files. Can IM read them? Is there another way to convert them?

Re: ccitt files from pdfimages

Posted: 2020-05-09T09:37:12-07:00
by fmw42
ImageMagick will use Ghostscript to rasterize the PDF. You are better off using pdfimages.

Formats that ImageMagick supports are listed for your computer using

Code: Select all

convert -list format
A generic list is at

https://imagemagick.org/script/formats.php

CCITT seems to be a TIFF fax compression format. So perhaps your image is a binary compressed TIFF file. See https://www.leadtools.com/help/sdk/v20/ ... rmats.html

Have you tried opening those files with ImageMagick? Best to simply try.

Re: ccitt files from pdfimages

Posted: 2020-05-09T15:57:07-07:00
by muccigrosso
Thanks for the reply.

Indeed I'm not looking to convert the pdf with IM, just work with the resulting ccitt files from pdfimages. And yes, I've tried to just open the .ccitt file with IM and get the following error:

Code: Select all

magick: no decode delegate for this image format `CCITT' @ error/constitute.c/ReadImage/562.
An example param file reads, in its entirety:

Code: Select all

-4 -P -X 3450 -B -M
which according to the man page means that this is a Group 4 encoded image, 3,450 px wide, using 0 for black and 1 for white, with data filled from most significant bit to least significant bit, and with the beginning of each line not aligned on a byte boundary. And, yes, the ccitt file looks like binary data.
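
For scripting, the width can be pulled out of a params file like the one above. This is only a sketch, assuming the file holds a single line of whitespace-separated flags with the width following `-X`; `params_width` is a hypothetical helper name, not part of pdfimages or IM.

```shell
# Print the image width stored after the -X flag in a pdfimages .params file.
# Assumes whitespace-separated flags, e.g. "-4 -P -X 3450 -B -M".
params_width() {
    # Unquoted on purpose: split the file contents into separate words.
    set -- $(cat "$1")
    while [ $# -gt 0 ]; do
        if [ "$1" = "-X" ]; then
            echo "$2"
            return 0
        fi
        shift
    done
    return 1    # no -X flag found
}

# Usage (file name is illustrative):
#   params_width test-001.params
```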

I'm not sure what to do with the page you linked.

I do create ccitt files (tiffs) all the time with IM.

Version: ImageMagick 7.0.10-10 Q16 x86_64 2020-05-01 https://imagemagick.org on MacOS 10.13

Re: ccitt files from pdfimages

Posted: 2020-05-09T16:06:11-07:00
by fmw42
Looks like pdfimages separated the binary image from its header and made two files in place of just one. Sorry, I do not know how to recombine them. But perhaps there are flags in pdfimages to set the output format, say, to TIFF, and perhaps that will keep them together. This is more of a question for the pdfimages developers than ImageMagick.

Re: ccitt files from pdfimages

Posted: 2020-05-09T16:21:47-07:00
by snibgo
@muccigrosso: Can you link to a sample PDF with embedded CCITT image?

Perhaps the binary ccitt file can be read with IM's raw facility, if you supply the image "-size" and "-depth".

Re: ccitt files from pdfimages

Posted: 2020-05-11T06:02:07-07:00
by muccigrosso

Re: ccitt files from pdfimages

Posted: 2020-05-11T08:03:17-07:00
by snibgo
It has 11841 bytes, or 94728 bits. If the width is 3450 pixels and the data were uncompressed at 1 bit per pixel, then the height would be about 27 pixels, which I suppose is wrong. I conclude that alex-038.ccitt is compressed, and IM's raw reader can't read it.
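
The arithmetic behind that estimate can be checked quickly in the shell. A minimal sketch, assuming 1 bit per pixel and no row padding; `raw_rows` is a hypothetical helper name.

```shell
# Rows an uncompressed 1-bit image would have for a given file size and width.
# Integer division, so a partial last row is dropped.
raw_rows() {
    bytes=$1
    width=$2
    echo $(( bytes * 8 / width ))
}

raw_rows 11841 3450   # 94728 bits at 3450 px per row: prints 27

# With a real file, the byte count could come from wc:
#   raw_rows "$(wc -c < alex-038.ccitt)" 3450
```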

Re: ccitt files from pdfimages

Posted: 2020-05-12T06:16:25-07:00
by muccigrosso
A little hunting on StackOverflow turned up this question which provided a solution: fax2tiff

You can feed the ccitt file to fax2tiff, using the contents of the param file as the options for the command (I'm doing this on the command line), and throwing in -8 to make sure the output tiff is Group 4 compressed. Something like this:

Code: Select all

fax2tiff `cat extracted_image.params` -8 -o output.tiff extracted_image.ccitt
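
To run the same command over every extracted pair, a loop along these lines might do. It is a sketch only: it assumes each NAME.ccitt sits next to a NAME.params, and `convert_ccitt_pairs` and the `FAX2TIFF` override variable are hypothetical names added here for illustration (the override makes dry runs possible).

```shell
# Convert every NAME.ccitt / NAME.params pair in a directory to NAME.tiff.
# FAX2TIFF is a hypothetical override, e.g. FAX2TIFF=echo for a dry run.
convert_ccitt_pairs() {
    dir=${1:-.}
    for data in "$dir"/*.ccitt; do
        [ -e "$data" ] || continue      # no matches: the glob stays literal
        base=${data%.ccitt}
        params="$base.params"
        if [ ! -r "$params" ]; then
            echo "skipping $data: no $params" >&2
            continue
        fi
        # Unquoted on purpose: the params file holds several separate flags.
        "${FAX2TIFF:-fax2tiff}" $(cat "$params") -8 -o "$base.tiff" "$data"
    done
}

# Usage:
#   convert_ccitt_pairs /path/to/extracted/images
```

Running it first with `FAX2TIFF=echo` prints the commands instead of executing them.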

Re: ccitt files from pdfimages

Posted: 2020-05-13T03:37:20-07:00
by muccigrosso
A little follow up.

If I make sure that fax2tiff creates an output file with the same parameters as the input, which in this case means Group 4 compression and forcing output data to have bits filled from most significant bit (MSB) to least significant bit (LSB), the tiff is nearly identical to the input ccitt. Only a little data at the start and at the end differs. xxd shows this for the start of one tiff:

Code: Select all

00000000: 4949 2a00 4a2e 0000 ffff ffff ffff ffff  II*.J...........
00000010: ffff ffff ffff fffe 5811 99c6 192e 642e  ........X.....d.
vs this for the ccitt:

Code: Select all

00000000: ffff ffff ffff ffff ffff ffff ffff fffe  ................
00000010: 5811 99c6 192e 642e 9603 0cb4 4329 0fd7  X.....d.....C)..
So only the first eight octets are prefixed. Limited testing (on seven files from the same PDF) suggests that only the second half of those varies. The start is always

Code: Select all

4949 2a00
Here are the first lines from those seven files:

Code: Select all

00000000: 4949 2a00 2c45 0400  II*.,E..
00000000: 4949 2a00 864a 0400  II*..J..
00000000: 4949 2a00 341f 0400  II*.4...
00000000: 4949 2a00 76b6 0000  II*.v...
00000000: 4949 2a00 f298 0000  II*.....
00000000: 4949 2a00 6898 0000  II*.h...
00000000: 4949 2a00 1ac2 0000  II*.....
Appended to the end of the tiff is more data that replaces a small amount from the ccitt. This varies by file (except for the "fax2tiff" at the very end). The appendix starts at that "f0" octet:

Code: Select all

00002e38: ffff ffff ffff ffff ffff ffff fff0 0100  ................
00002e48: 1000 1200 0001 0300 0100 0000 7a0d 0000  ............z...
00002e58: 0101 0300 0100 0000 5114 0000 0201 0300  ........Q.......
00002e68: 0100 0000 0100 0000 0301 0300 0100 0000  ................
00002e78: 0400 0000 0601 0300 0100 0000 0000 0000  ................
00002e88: 0a01 0300 0100 0000 0100 0000 1101 0400  ................
00002e98: 0100 0000 0800 0000 1201 0300 0100 0000  ................
00002ea8: 0100 0000 1501 0300 0100 0000 0100 0000  ................
00002eb8: 1601 0400 0100 0000 ffff ffff 1701 0400  ................
00002ec8: 0100 0000 412e 0000 1a01 0500 0100 0000  ....A...........
00002ed8: 282f 0000 1b01 0500 0100 0000 302f 0000  (/..........0/..
00002ee8: 1c01 0300 0100 0000 0100 0000 2501 0400  ............%...
00002ef8: 0100 0000 0000 0000 2801 0300 0100 0000  ........(.......
00002f08: 0200 0000 2901 0300 0200 0000 0000 0100  ....)...........
00002f18: 3101 0200 0900 0000 382f 0000 0000 0000  1.......8/......
00002f28: cc00 0000 0100 0000 c400 0000 0100 0000  ................
00002f38: 6661 7832 7469 6666 00                   fax2tiff.
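
For what it's worth, the eight prefixed octets match the standard TIFF header layout: a two-byte byte-order mark (`II` for little-endian), the magic number 42 (`2a00`), and a four-byte offset to the first image file directory (IFD). That would explain why only the second half varies, since the IFD sits after the image data and its offset depends on the data length. A minimal sketch that reads the offset, assuming a little-endian file; `tiff_ifd_offset` is a hypothetical helper name.

```shell
# Print the offset of the first IFD in a little-endian ("II") TIFF file.
# Bytes 4-7 of the header hold the offset, least significant byte first.
tiff_ifd_offset() {
    # od prints the four bytes as unsigned decimals; split them into $1..$4.
    set -- $(od -An -tu1 -j4 -N4 "$1")
    echo $(( $1 + ($2 << 8) + ($3 << 16) + ($4 << 24) ))
}

# Usage (file name is illustrative):
#   tiff_ifd_offset output.tiff
```

For a header of `4949 2a00 4a2e 0000` this prints 11850 (0x2e4a), which appears to be where the block of tag entries begins in the dump of the file's end.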

Re: ccitt files from pdfimages

Posted: 2020-05-13T04:04:07-07:00
by magick
Try this command:

Code: Select all

convert -size 3450x3450 g4:alex-039.ccitt alex-039.png

Re: ccitt files from pdfimages

Posted: 2020-05-13T10:37:58-07:00
by muccigrosso
magick wrote:
2020-05-13T04:04:07-07:00
Try this command:

Code: Select all

convert -size 3450x3450 g4:alex-039.ccitt alex-039.png
This works to create a viable png, except that the image height is incorrect, so the bottom of the image is missing. In this particular case the height is 5201, according to the output tiff from fax2tiff. IM actually gives an error if I use that as the height:

Code: Select all

magick: Premature EOL at line 5200 of strip 0 (got 0, expected 3450). `Fax4Decode' @ warning/tiff.c/TIFFWarnings/1037.
Which is interesting because if I put fax2tiff in verbose mode it reports a similar error:

Code: Select all

Fax4Decode: Warning, Premature EOL at line 5200 of strip 4294967295 (got 0, expected 3450).
alex-039.ccitt:
5201 rows in input
0 total bad rows
0 max consecutive bad rows

Re: ccitt files from pdfimages

Posted: 2020-05-17T02:06:03-07:00
by muccigrosso
magick wrote:
2020-05-13T04:04:07-07:00
Try this command:

Code: Select all

convert -size 3450x3450 g4:alex-039.ccitt alex-039.png
I'm trying to work with this further so that I can handle these files with IM alone, partly because fax2tiff errs by creating a file with an extra row in it (in this case, 5201 rows instead of 5200).

If I give "convert" an absurdly large number for the image height, it reports a bunch of errors as it tries to read beyond the file (I guess), and eventually reaches its limit (128) and outputs a file with the dimensions I gave it. Here are the errors:

Code: Select all

convert: Premature EOL at line 5200 of strip 0 (got 0, expected 3450). `Fax4Decode' @ warning/tiff.c/TIFFWarnings/896.
convert: Premature EOF at line 5200 of strip 0 (x 0). `Fax4Decode' @ warning/tiff.c/TIFFWarnings/896.
If I use "magick" instead, it just outputs one error message, but produces the same file:

Code: Select all

magick: Premature EOL at line 5200 of strip 0 (got 0, expected 3450). `Fax4Decode' @ warning/tiff.c/TIFFWarnings/1037
Is there a way to make either command simply stop the first time it hits the error and therefore not produce an image of the height requested?

Alternatively I suppose I could process the error output from something like the following, where I give an absurdly large height (100x the width):

Code: Select all

magick -size 3450x345000 g4:alex-039.ccitt info:
which starts with

Code: Select all

magick: Premature EOL at line 5200
giving me the correct height.
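
Pulling the number out of that message could look roughly like this. It is a sketch that assumes the warning text keeps the wording shown above, which may differ between IM or libtiff versions; `height_from_warning` is a hypothetical helper name.

```shell
# Read warning text on stdin and print the line number from the first
# "Premature EOL at line N" message, i.e. the number of rows decoded.
height_from_warning() {
    sed -n 's/.*Premature EOL at line \([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Usage: the warnings go to stderr, so redirect them into the pipe.
#   magick -size 3450x345000 g4:alex-039.ccitt info: 2>&1 | height_from_warning
```

The result could then be fed back into a second run as the height in "-size".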