
[RESOLVED] ImageMagick identify crashes with pdf

Posted: 2015-02-06T05:55:42-07:00
by lost_in_binary
Hello and thank you for this wonderful tool,

I seem to be having a problem opening a PDF file with ImageMagick.
The file is here:
https://www.dropbox.com/s/nzox2h7khutw2 ... 2.pdf?dl=0

I first noticed the problem when trying to split the PDF into one image per page, but it also shows up with a plain "identify form_advanced2.pdf" command, which crashes the system.

I have tried other forms as well and they all work fine. I have tried ImageMagick 6.7.9-10 on Windows; on Ubuntu 14.04 I tried both the default version from the official repositories and the latest ImageMagick 6.9.0-4 Q16 x86_64 (installed from source). In all of them 'identify' crashes on this specific PDF.
Is there any solution to my problem?

Thank you

Re: ImageMagick identify crashes with pdf

Posted: 2015-02-06T07:43:40-07:00
by pipitas
ImageMagick is entirely the wrong tool for investigating a PDF file. All ImageMagick commands employ 'delegates' to handle input PDFs: they first use Ghostscript to convert all PDF pages to raster images, and only after that step do they look at those raster images and do something with them. Not even `identify` looks at the PDF directly.
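To see which external program ImageMagick hands PDFs to, the delegate table can be listed. A minimal sketch, assuming an ImageMagick 6 installation with `identify` on the PATH; the helper name `show_pdf_delegates` is made up for illustration:

```shell
# Hypothetical helper: print the delegate entries that mention PDF or PostScript.
show_pdf_delegates() {
    if ! command -v identify >/dev/null 2>&1; then
        echo "identify not found on PATH" >&2
        return 127
    fi
    # "-list delegate" prints ImageMagick's delegate table;
    # keep only the PDF/PostScript related lines.
    identify -list delegate | grep -iE 'pdf|ps'
}

# Only run when ImageMagick is actually installed.
if command -v identify >/dev/null 2>&1; then
    show_pdf_delegates || true
fi
```

On a typical installation the matching lines should show a Ghostscript command line, which is the program that actually reads the PDF.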

To investigate a problematic PDF, use tools that are designed for the job. My first choices are some utilities from 'Poppler', a fork of XPDF. When these do not lead to conclusive results, I employ others. But let's start with these three:
  • pdfinfo
  • pdfimages
  • pdffonts
'pdfinfo' tells you everything about the file's metadata, and does so very quickly:

Code:

pdfinfo -meta -box -js form_advanced2.pdf 
Producer:       iPhone OS 8.0 Quartz PDFContext
CreationDate:   Wed Jan 28 17:28:28 2015
ModDate:        Fri Feb  6 14:38:00 2015
Tagged:         no
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          1
Encrypted:      no
Page size:      768 x 1526 pts
Page rot:       0
MediaBox:           0.00     0.00   768.00  1526.00
CropBox:            0.00     0.00   768.00  1526.00
BleedBox:           0.00     0.00   768.00  1526.00
TrimBox:            0.00     0.00   768.00  1526.00
ArtBox:             0.00     0.00   768.00  1526.00
File size:      142299 bytes
Optimized:      yes
PDF version:    1.6
Metadata:
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.4-c005 78.147326, 2012/08/23-13:03:03        ">
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about=""
            xmlns:xmp="http://ns.adobe.com/xap/1.0/"
            xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
            xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
            xmlns:dc="http://purl.org/dc/elements/1.1/">
         <xmp:CreateDate>2015-01-28T17:28:28Z</xmp:CreateDate>
         <xmp:ModifyDate>2015-02-06T14:38+02:00</xmp:ModifyDate>
         <xmp:MetadataDate>2015-02-06T14:38+02:00</xmp:MetadataDate>
         <pdf:Producer>iPhone OS 8.0 Quartz PDFContext</pdf:Producer>
         <xmpMM:DocumentID>uuid:d192522a-d623-4c39-b6ec-6728337325bd</xmpMM:DocumentID>
         <xmpMM:InstanceID>uuid:a0dc7e34-37e3-4541-a2c5-383a4c5e6602</xmpMM:InstanceID>
         <dc:format>application/pdf</dc:format>
      </rdf:Description>
   </rdf:RDF>
</x:xmpmeta>


<?xpacket end="w"?>
This already has me wondering a bit: how can iOS 8.0 Quartz PDFContext create a PDF-1.6 file, when even Mac OS X Mavericks 10.9.5 is unable to produce PDF-1.4 and still remains on PDF-1.3? (It may be legit; I'm neither a developer nor an iOS expert, but it still leaves me wondering...)

Since the result line 'JavaScript: no' indicates that the PDF contains no JavaScript (unless it is a malicious file, where the JavaScript is hidden and obfuscated!), we can rule this out as a cause of the crashes you observed.

'pdffonts' gives us some hints about the fonts (if any) used inside the PDF:

Code:

pdffonts   form_advanced2.pdf 
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
This PDF uses no fonts, which means its page contents can only consist of vector shapes or pixel graphics (if the page is not empty).

'pdfimages -list' will report details about all (raster) images contained in the PDF file:

Code:

pdfimages -list form_advanced2.pdf 
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1440   124  rgb     3   8  image  yes       50  0   144   144 2663B 0.5%
   1     1 image     602    62  rgb     3   8  image  yes       52  0   144   144  510B 0.5%
   1     2 smask     602    62  gray    1   8  image  yes       52  0   144   144  186B 0.5%
   1     3 image      32    32  rgb     3   8  image  yes       19  0   144   144  401B  13%
   1     4 smask      32    32  gray    1   8  image  yes       19  0   144   144  222B  22%
   1     5 image      32    32  rgb     3   8  image  yes       19  0 0.505   144  401B  13%
   1     6 smask      32    32  gray    1   8  image  yes       19  0 0.505   144  222B  22%
   1     7 image      32    32  rgb     3   8  image  yes       19  0   144   144  401B  13%
   1     8 smask      32    32  gray    1   8  image  yes       19  0   144   144  222B  22%
   1     9 image      32    32  rgb     3   8  image  yes       19  0   144 0.596  401B  13%
   1    10 smask      32    32  gray    1   8  image  yes       19  0   144 0.596  222B  22%
   1    11 image      32    32  rgb     3   8  image  yes       19  0 0.505 0.596  401B  13%
   1    12 smask      32    32  gray    1   8  image  yes       19  0 0.505 0.596  222B  22%
   1    13 image      32    32  rgb     3   8  image  yes       19  0   144 0.596  401B  13%
   1    14 smask      32    32  gray    1   8  image  yes       19  0   144 0.596  222B  22%
   1    15 image      32    32  rgb     3   8  image  yes       19  0   144   144  401B  13%
   1    16 smask      32    32  gray    1   8  image  yes       19  0   144   144  222B  22%
   1    17 image      32    32  rgb     3   8  image  yes       19  0 0.505   144  401B  13%
   1    18 smask      32    32  gray    1   8  image  yes       19  0 0.505   144  222B  22%
   1    19 image      32    32  rgb     3   8  image  yes       19  0   144   144  401B  13%
   1    20 smask      32    32  gray    1   8  image  yes       19  0   144   144  222B  22%
   1    21 image     242    42  rgb     3   8  image  yes       72  0   144   144  157B 0.5%
   1    22 smask     242    42  gray    1   8  image  yes       72  0   144   144 1871B  18%
   1    23 image     226    62  rgb     3   8  image  yes       88  0   144   144  205B 0.5%
   1    24 smask     226    62  gray    1   8  image  yes       88  0   144   144   85B 0.6%
   1    25 image      32    32  rgb     3   8  image  yes       19  0   144   144  401B  13%
   1    26 smask      32    32  gray    1   8  image  yes       19  0   144   144  222B  22%
  [....]
   1   685 smask      32    32  gray    1   8  image  yes       19  0   145   144  222B  22%
   1   686 image     420    42  rgb     3   8  image  yes      148  0   144   144  254B 0.5%
   1   687 smask     420    42  gray    1   8  image  yes      148  0   144   144 2229B  13%
   1   688 image     172    42  rgb     3   8  image  yes      124  0   144   144  117B 0.5%
   1   689 smask     172    42  gray    1   8  image  yes      124  0   144   144 1499B  21%
   1   690 image    1536    88  rgb     3   8  image  yes      150  0   144   144 1813B 0.4%
   1   691 smask    1536    88  gray    1   8  image  yes      150  0   144   144  636B 0.5%
   1   692 image    1440   120  rgb     3   8  image  yes      152  0   144   144 2284B 0.4%
   1   693 smask    1440   120  gray    1   8  image  yes      152  0   144   144 3041B 1.8%
   1   694 image    1536    88  rgb     3   8  image  yes      150  0   144   144 1813B 0.4%
   1   695 smask    1536    88  gray    1   8  image  yes      150  0   144   144  636B 0.5%
   1   696 image    1440   120  rgb     3   8  image  yes      154  0   144   144 2284B 0.4%
   1   697 smask    1440   120  gray    1   8  image  yes      154  0   144   144 9.78K 5.8%
   1   698 image    1536    48  rgb     3   8  image  yes      156  0   144   144  988B 0.4%
   1   699 smask    1536    48  gray    1   8  image  yes      156  0   144   144  344B 0.5%
   1   700 image    1536    88  rgb     3   8  image  yes      150  0   144   144 1813B 0.4%
   1   701 smask    1536    88  gray    1   8  image  yes      150  0   144   144  636B 0.5%
   1   702 image    1440   120  rgb     3   8  image  yes      158  0   144   144 2284B 0.4%
   1   703 smask    1440   120  gray    1   8  image  yes      158  0   144   144 2957B 1.7%
   1   704 image    1536    48  rgb     3   8  image  yes      156  0   144   144  988B 0.4%
   1   705 smask    1536    48  gray    1   8  image  yes      156  0   144   144  344B 0.5%
   1   706 image    1536    88  rgb     3   8  image  yes      150  0   144   144 1813B 0.4%
   1   707 smask    1536    88  gray    1   8  image  yes      150  0   144   144  636B 0.5%
   1   708 image    1448   134  rgb     3   8  image  yes      160  0   144   144 2562B 0.4%
   1   709 smask    1448   134  gray    1   8  image  yes      160  0   144   144 4645B 2.4%
OK now! This one-page PDF seems to contain the stupid/insane number of 710 different images (some used as soft masks, some being actually visible images)!

This first impression is a bit misleading if we look at the first 3 columns only. We also have to take into account the columns headed `object ID`. If these show a different PDF object ID for each instance of an image, then the internal construction of the PDF is indeed 'stupid' (or indicates that the developer of the PDF-generating application was at the beginning of that part of his professional career which deals with PDF-related tasks...)

So let's count how many different object IDs there are, and their respective frequencies:

Code:

pdfimages -list  form_advanced2.pdf | grep -vE '(object ID|---)' | awk '{print $11, $12}' | sort | uniq  -c | sort -g 
   2 100 0
   2 102 0
   2 104 0
   2 106 0
   2 108 0
   2 110 0
   2 116 0
   2 122 0
   2 126 0
   2 128 0
   2 130 0
   2 132 0
   2 134 0
   2 136 0
   2 138 0
   2 140 0
   2 142 0
   2 144 0
   2 146 0
   2 148 0
   2 152 0
   2 154 0
   2 158 0
   2 160 0
   2  52 0
   2  54 0
   2  56 0
   2  58 0
   2  62 0
   2  64 0
   2  66 0
   2  68 0
   2  70 0
   2  72 0
   2  74 0
   2  76 0
   2  78 0
   2  84 0
   2  86 0
   2  90 0
   2  92 0
   2  94 0
   2  98 0
   4 112 0
   4 114 0
   4 118 0
   4 120 0
   4 124 0
   4 156 0
   4  96 0
   6  60 0
   6  82 0
   6  88 0
   8 150 0
  14  50 0
  18  80 0
 538  19 0
The last line indicates that PDF object number 19 is embedded only once in the file, but accounts for 538 of the 710 entries in the list (images and their soft masks are listed as separate rows). At least it's not embedded 538 times then!
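The per-object counts above mix image rows and soft-mask rows. A small sketch, assuming the `pdfimages -list` column layout shown above (type in field 3, object number in field 11), that counts only the `image` rows per object; `count_image_rows` is a made-up helper name:

```shell
# Hypothetical helper: read `pdfimages -list` output on stdin and count,
# per object number, only the rows whose type is "image" (smask rows are skipped).
count_image_rows() {
    # Data rows start with a numeric page number; header, separator
    # and elision lines do not, so they are ignored.
    awk '$1 ~ /^[0-9]+$/ && $3 == "image" { n[$11]++ }
         END { for (id in n) print n[id], id }' | sort -n
}

# Only run against the real file when it and pdfimages are available.
if [ -f form_advanced2.pdf ] && command -v pdfimages >/dev/null 2>&1; then
    pdfimages -list form_advanced2.pdf | count_image_rows
fi
```

Other Poppler versions may print different columns, so the field numbers would need checking against the header line first.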

Anyway, I'll not continue to analyse this PDF file. Just a few more hints:

Code:

mkdir form_adv2

pdfimages -j form_advanced2.pdf form_adv2/form_adv2
These commands create a subdirectory named 'form_adv2' and extract all instances of images found in the PDF into it. Attention: pdfimages extracts multiple copies when an image is reused multiple times inside the PDF! The filenames will be 'form_adv2-000.*', 'form_adv2-001.*', 'form_adv2-002.*', ... (matching the image numbers from the previously printed list). The '-j' parameter requests extraction as JPEG files, which however is not always possible. If JPEGs cannot be extracted, the file name suffixes will not be *.jpg but *.ppm or *.pbm, and the files will be uncompressed rasters. In this case you can still use ImageMagick to convert them to JPEGs for further analysis, if you want.
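Because reused images are extracted once per use, many of the extracted files will be byte-identical. A rough sketch for counting how many distinct files the 'form_adv2' directory really holds; the helper name `count_unique_files` is hypothetical, and `cksum` is only a CRC, so treat the result as an estimate:

```shell
# Hypothetical helper: count groups of byte-identical files in a directory.
count_unique_files() {
    dir="$1"
    for f in "$dir"/*; do
        if [ -f "$f" ]; then
            # Reading from stdin makes cksum print "checksum size" without a file name,
            # so identical files produce identical lines.
            cksum < "$f"
        fi
    done | sort -u | wc -l | tr -d ' '
}

# Only run when the extraction directory exists.
if [ -d form_adv2 ]; then
    echo "distinct files: $(count_unique_files form_adv2)"
fi
```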

----

I guess that your version of Ghostscript is just not able to handle that PDF correctly...
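One way to check this guess is to feed the PDF to Ghostscript directly, without ImageMagick in between. A minimal sketch, assuming the executable is called `gs` (on Windows it is usually `gswin32c` or `gswin64c`); `gs_check_pdf` is a made-up helper name:

```shell
# Hypothetical helper: let Ghostscript interpret the PDF without ImageMagick.
gs_check_pdf() {
    pdf="$1"
    if [ ! -f "$pdf" ]; then
        echo "no such file: $pdf" >&2
        return 2
    fi
    if ! command -v gs >/dev/null 2>&1; then
        echo "gs not found on PATH" >&2
        return 127
    fi
    # Print the Ghostscript version that is actually being used.
    gs --version
    # The nullpage device interprets every page but writes no output file.
    gs -q -dNOPAUSE -dBATCH -dSAFER -sDEVICE=nullpage "$pdf"
}

# Only run when the PDF from this thread is in the current directory.
if [ -f form_advanced2.pdf ]; then
    gs_check_pdf form_advanced2.pdf || echo "Ghostscript exited with status $?"
fi
```

If Ghostscript itself hangs or aborts here, the problem lies in the delegate rather than in ImageMagick.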

Re: ImageMagick identify crashes with pdf

Posted: 2015-02-09T03:50:14-07:00
by lost_in_binary
Thank you. This must be it then: even though I am using the latest version of Ghostscript (9.3), I still get this error. I switched over to Java with Apache PDFBox, and there the file opens and saves fine.
Thanks for your time.

Edit: pipitas is right. I meant gs 9.15 instead of 9.3.

Re: ImageMagick identify crashes with pdf

Posted: 2015-02-09T07:30:50-07:00
by pipitas
lost_in_binary wrote: Thank you. This must be it then, even though I am using the latest version of Ghostscript (9.3) I still get this error.
There is no Ghostscript 9.3. The latest released version is v9.15; the upcoming version will likely be named v9.16.