PDF to image causes 1st page margin

BigLittle
Posts: 13
Joined: 2013-08-08T18:32:14-07:00
Authentication code: 6789

Post by BigLittle »

I am trying to convert a PDF to a jpg/png. They need to be exact every time, and the first page of the PDF is producing a margin on the right as shown in the attached picture. Anyone have any ideas why this might be happening?

[php]
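// Read the whole PDF once, only to count its pages.
$Pdf = new Imagick();
$Pdf -> setResolution(450,450);
$Pdf -> readImage("temp2.pdf");
$Count = $Pdf -> getNumberImages();
for($C = 0; $C < $Count; $C++) {
	// Use a fresh object per page so earlier pages do not pile up in it.
	$Page = new Imagick();
	$Page -> setResolution(450,450);
	$Page -> readImage("temp2.pdf[$C]");
	$Page -> setImageFormat('png');
	$Page -> writeImage("Results-$C.png");
	$Page -> clear();
}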
[/php]

I also used the below code first, but neither works.
[php]
$Page = new Imagick();
$Page -> setResolution(450,450);
$Page -> readImage("temp2.pdf");
$Page -> setImageFormat('png');
$Page -> writeImages("Results.png", false);
[/php]
Attachments
margin.jpg
fmw42
Posts: 25562
Joined: 2007-07-02T17:14:51-07:00
Authentication code: 1152
Location: Sunnyvale, California, USA

Post by fmw42 »

Just some guesses. The pdf may have a larger canvas than the image it contains. If this is a virtual canvas, then it can be removed by adding +repage after reading the image. see http://www.imagemagick.org/script/comma ... php#repage

However, I am more inclined to believe that the pdf has a clip path that IM is not using. See http://www.imagemagick.org/script/comma ... #clip-path and http://www.imagemagick.org/script/comma ... s.php#clip

Can you post either your pdf file or the results from the command line command of:

identify -verbose image.pdf

You should be able to get that from PHP exec() on the command.
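
For the Imagick side of the +repage idea, here is a minimal sketch. It assumes that `setImagePage(0, 0, 0, 0)` is the Imagick counterpart of `+repage`, and the helper names `hasVirtualCanvas` and `repageAll` are made up for illustration. It only helps if the margin really comes from a virtual canvas, which is just a guess at this point.

```php
<?php
// Hypothetical helper: true when the page geometry reported by
// Imagick::getImagePage() differs from the actual pixel size,
// i.e. the image sits on a larger or offset virtual canvas.
function hasVirtualCanvas(array $page, int $width, int $height): bool
{
    // A page size of 0x0 means no virtual canvas is recorded.
    $sized   = $page['width'] > 0 && $page['height'] > 0;
    $differs = $page['width'] != $width || $page['height'] != $height;
    $offset  = $page['x'] != 0 || $page['y'] != 0;

    return ($sized && $differs) || $offset;
}

// Hypothetical helper: drop the virtual canvas on every page that has one.
// Assumption: setImagePage(0, 0, 0, 0) behaves like +repage.
function repageAll(Imagick $im): void
{
    foreach ($im as $page) {
        $geometry = $page->getImagePage();
        if (hasVirtualCanvas($geometry, $page->getImageWidth(), $page->getImageHeight())) {
            $page->setImagePage(0, 0, 0, 0);
        }
    }
}
```

Calling `repageAll($Page);` right after `readImage()` and before writing the PNGs would be the place to try it.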

Post by BigLittle »

Thanks for the help! I have attached the PDF and the result PNGs. The PDF is zipped to allow for upload.

I'm not real sure how to apply repage or clip with the PHP I'm using. Just run an exec right after reading the image? If either of those is the issue, could you show me example code to mix in with mine?
Attachments
Examples.zip
Page 2.png
Page 1.png

Post by fmw42 »

I do not see any extra margins in the files you have posted. I also downloaded your PDF and converted to PNG and I do not see any margins in my conversion either. The verbose information from your file shows that both pages of the pdf are size 614x1008 and there is a pdf:HiResBoundingBox: 614x1008. So they both match.

Please clarify about the margins.

When converting PDF to any other format, ImageMagick uses the Ghostscript delegate. You may want to try upgrading that if it is old and try again. Also, if going to PNG, you may want to upgrade the libpng delegate as well.

On my system:

libpng @1.4.11_0 (which is not the most current)
ghostscript @9.06_1


Try doing the conversion from the command line through PHP exec() and see whether you get the same issue. If so, then ImageMagick or one of its delegates may need upgrading. If not, that would point to Imagick.

What version of ImageMagick are you using and what version of Imagick?

Post by BigLittle »

You'll have to save the image as the margin doesn't show in the browser. I saved it via browser and opened it in Fireworks and I can see it.

I just installed it yesterday with yum, so I believe they are up to date. I'll check on the GS too, and try out the exec method.

Post by BigLittle »

Attached is another example of what I'm seeing in Fireworks. The canvas sizes are the same, but the first page's document is a different size.
Attachments
example.jpg

Post by BigLittle »

And to add a bit more... I did try : exec('gs -q -dNOPAUSE -sDEVICE=tiffg4 -sOutputFile=temp3.tif temp2.pdf -c quit');

This converted the PDF to the same exact size. I really don't care how it's done as long as I get the same results every time. This works, although I don't know how to change the resolution. It would still be interesting to know why the margin was produced with IM.
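
On changing the resolution: Ghostscript takes the device resolution through its `-r` switch (for example `-r450`). A small sketch that builds the same command with a DPI value added; the helper name `buildGsCommand` and its parameters are made up for illustration, and everything else is the command from above.

```php
<?php
// Hypothetical helper: build the Ghostscript command used above,
// with the output resolution (in DPI) passed through the -r switch.
function buildGsCommand(string $pdf, string $out, int $dpi, string $device = 'tiffg4'): string
{
    return sprintf(
        'gs -q -dNOPAUSE -r%d -sDEVICE=%s -sOutputFile=%s %s -c quit',
        $dpi,
        escapeshellarg($device), // quote every value that reaches the shell
        escapeshellarg($out),
        escapeshellarg($pdf)
    );
}
```

Usage would be `exec(buildGsCommand('temp2.pdf', 'temp3.tif', 450));`. Note that `tiffg4` writes 1-bit black-and-white output, so a different device such as `png16m` may be worth trying if color or grayscale matters.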

Post by fmw42 »

[quote="BigLittle"]
You'll have to save the image as the margin doesn't show in the browser. I saved it via browser and opened it in Fireworks and I can see it.

I just installed it yesterday with yum, so I believe they are up to date. I'll check on the GS too, and try out the exec method.
[/quote]

I did download the images and they looked fine for me. Perhaps it is your Fireworks viewer.

Post by fmw42 »

I looked again in another viewer and it does show the margin. What I think is happening from looking at the verbose information is that your pdf has a section of transparency in it. So when you convert it you want to flatten it against a white background.

convert image.pdf -background white -flatten image.png

See if that helps.
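
In Imagick, something along these lines might do the same per page. This is only a sketch: the helper name `flattenPagesOnWhite` is made up, and it assumes your Imagick build exposes `Imagick::ALPHACHANNEL_REMOVE` (older ImageMagick versions do not have it). It removes transparency page by page rather than using a flatten, because `-flatten` merges all pages of a multipage PDF into a single image.

```php
<?php
// Hypothetical helper: replace transparency with white on every page,
// keeping the pages separate instead of merging them into one image.
// Assumption: Imagick::ALPHACHANNEL_REMOVE is available in this build.
function flattenPagesOnWhite(Imagick $pdf): void
{
    foreach ($pdf as $page) {
        // The background color is what shows through once alpha is removed.
        $page->setImageBackgroundColor('white');
        $page->setImageAlphaChannel(Imagick::ALPHACHANNEL_REMOVE);
    }
}
```

You would call `flattenPagesOnWhite($Page);` after `readImage()` and before `writeImages()`.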

Post by BigLittle »

I attached the result. It looks like it replaces the margin with a white background. I still can't work with this, since I need the first page to convert exactly the same way as the second. If all the pages had the same margin, it wouldn't matter. It's really odd that this is happening, because it didn't happen before. I'm going to try to determine what changed, but if I can't, I may just use Ghostscript.
Attachments
image.tiff

Post by BigLittle »

Well I did determine that it is something unique about the PDF that's causing it. Attached are two PDFs. Dumb.pdf is one that produces a margin, and Dumb2.pdf does not. It does it with this conversion: gs -q -dNOPAUSE -sDEVICE=pngalpha -sOutputFile=O4.png dumb.pdf -c quit

Any ideas why?

Post by fmw42 »

I think it is because in the bad one, there is transparency (alpha channel) and perhaps in the other there is not.

I would think that both pages from your first example would end up the same size, with white filled in for the transparency. The IM verbose information shows the PDF pages and the resulting PNG images being the same size, so I am not sure what the issue still is.

Note that at one time, I was told that IM could not handle multipage images with transparency. You had to choose an sDEVICE (pngalpha or pnmraw) in the delegates.xml file to suit one situation or the other: pngalpha if you had only one page with transparency, pnmraw if you wanted to process multipage PDFs without transparency. This may have changed, but that is the device passed to Ghostscript to do the processing, so it was likely a Ghostscript issue. Perhaps with more recent versions of Ghostscript that is no longer the case.

Post by BigLittle »

My goal with this was to overlay an inverted watermark and drown the original out enough that I could OCR a bunch of documents for research. The watermark is there to discourage printing the documents, which I don't care about.

Most people run from "remove watermark", but I'm not using the documents for the purpose the watermark is meant to prevent. I may have to try to find better OCR software or something, but if there is a way to drown out the watermark enough to OCR the text, that would probably be the better solution.

Post by fmw42 »

Do you want to enhance or remove the watermark?

If you want to remove it, try using -morphology operators if the watermark dots are small enough. The image you show is such a low resolution that I doubt you could OCR the text even if the watermark were removed. Do you have a higher-resolution image or the source file? If so, post a link to it, not inside <code>...</code> tags, so it is easy to download.


I tried working with your first example, but got nowhere. Do you have a blank page with the watermark on it? If so you can use that to remove (via -compose subtract or -compose divide ) the watermark in the text image.

Post by BigLittle »

I'm trying to remove it enough for OCR to read the text correctly. I uploaded a compressed folder of all the documents at http://www.filedropper.com/files_2
Sorry for the download site, but it's 9MB and I couldn't upload it here. *Be sure to click the gray "Download this file" and not the big green "Start Download" ad button.


Result2-0.jpg is a completed merge and shows how it's supposed to look. It OCRs perfectly. The other two were off a bit, which just makes things worse: the OCR is horrible wherever the watermark passes through text. This is the code I used to do it:

[php]
$Page = new Imagick();
$Page -> setResolution(500,500);
$Page -> readImage("temp2.pdf");
$Count = $Page -> getNumberImages();
$Page -> setImageFormat('png');
// With adjoin set to false, the pages are written as Result-0.png, Result-1.png, ...
$Page -> writeImages("Result.png", false);

for($C = 0; $C < $Count; $C++) {
	// $Page -> contrastImage(5);
	$img = imagecreatefrompng('Result-'.$C.'.png');
	$img2 = imagecreatefrompng('PNGMasterOverlay.png');
	// Blend the overlay onto the page at 50% using GD.
	imagecopymerge($img, $img2, 0, 0, 0, 0, imagesx($img), imagesy($img), 50);
	imagejpeg($img, 'Result2-'.$C.'.jpg');
}
[/php]
I used GD because I couldn't get it to work with IM.
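
For reference, a rough Imagick counterpart of the `imagecopymerge()` call might look like the sketch below. It is untested against these files, the helper name `blendOverlay` is made up, and it is only an approximation: unlike `imagecopymerge()`, it respects any transparency already present in the overlay.

```php
<?php
// Hypothetical helper: blend $overlay onto the current image of $page
// at the given opacity (0.0 to 1.0), roughly like imagecopymerge().
function blendOverlay(Imagick $page, Imagick $overlay, float $opacity): void
{
    // Work on a copy so the caller's overlay can be reused for other pages.
    $layer = clone $overlay;

    // Make sure an alpha channel exists, then scale it down to the opacity.
    $layer->setImageAlphaChannel(Imagick::ALPHACHANNEL_SET);
    $layer->evaluateImage(Imagick::EVALUATE_MULTIPLY, $opacity, Imagick::CHANNEL_ALPHA);

    // Composite the semi-transparent copy over the page at the top-left corner.
    $page->compositeImage($layer, Imagick::COMPOSITE_OVER, 0, 0);
    $layer->clear();
}
```

Inside the loop this would be called as `blendOverlay($Page, $Overlay, 0.5);`, with `$Overlay` read once from PNGMasterOverlay.png. It still depends on every page coming out at exactly the same size as the overlay.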

I know it's possible to remove a watermark, but since the original is a secured PDF I can't find a way to do it.