OCR image preprocessing with ImageMagick

Posted: 2019-01-09T10:26:10-07:00
by milosbre
I am trying to find the best way to clean up an image with ImageMagick before I send it to Tesseract.

So far, the best result came from this combination:

Code:

convert test.tif -fill black -fuzz 30% +opaque "#FFFFFF" result.tif
But the results from tesseract aren't so good.

How would you guys do it?

Example image:
Image


Here are the images
https://www.dropbox.com/sh/jyrd58nbrava ... RRQRa?dl=0

Re: OCR image preprocessing with ImageMagick

Posted: 2019-01-09T11:15:44-07:00
by fmw42
What is different about your various tests? Doesn't test6.tif work for OCR? If not, have you tried making the background white? You have not posted your original TIFF file; the file above has been converted to a JPG.

Re: OCR image preprocessing with ImageMagick

Posted: 2019-01-09T11:26:43-07:00
by milosbre
Yes, test6.tif works.
The original images are in that Dropbox link (test4.tif).

With the code I provided I get some results, but often '5' is mistaken for 'S', ':' is mistaken for 'i' or '?', and so on.
I'm very new to image preprocessing, so I assume I'm doing something wrong: I can clean up the image just fine, but Tesseract still misses obvious characters.

Re: OCR image preprocessing with ImageMagick

Posted: 2019-01-09T11:33:16-07:00
by milosbre
I have done a lot of searching and testing. As far as the numbers are concerned, I get perfect results, but I completely lose the . and :

Code:

convert test.tif -brightness-contrast -40x10 -units pixelsperinch -density 300 -negate -noise 10 -threshold 70% result.tif
converts this
Image
into this:
Image

And I get perfect number detection, but as you can see "," and ":" are lost.
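
One thing that might be worth testing, offered as a guess rather than a diagnosis: `-noise 10` runs ImageMagick's noise-reduction filter with a radius of 10, and a window that large could plausibly wipe out small marks such as dots along with the speckle. Below is a minimal sketch for comparing radii, using the same switches as the command above. The helper names `run` and `preprocess_variant` are hypothetical, and the file names are placeholders.

```shell
# Run a command, or just print it when DRY_RUN=1.
# The printed form is for inspection only; shell quoting is not preserved.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Same pipeline as the command above, with the -noise radius as a parameter.
# Usage: preprocess_variant INPUT OUTPUT RADIUS
preprocess_variant() {
    run convert "$1" -brightness-contrast -40x10 \
        -units pixelsperinch -density 300 \
        -negate -noise "$3" -threshold 70% "$2"
}

# Example (not run here): write one result per radius for comparison.
# for r in 1 2 5 10; do preprocess_variant test.tif "result_r$r.tif" "$r"; done
```

Feeding each variant to tesseract should show whether the punctuation survives at a smaller radius while the digits stay clean.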

Re: OCR image preprocessing with ImageMagick

Posted: 2019-01-09T12:25:42-07:00
by fmw42
Looks like the results at https://stackoverflow.com/questions/541 ... magemagick are better than you show here.

Re: OCR image preprocessing with ImageMagick

Posted: 2019-01-09T16:30:02-07:00
by milosbre
fmw42 wrote: 2019-01-09T12:25:42-07:00 Looks like the results at https://stackoverflow.com/questions/541 ... magemagick are better than you show here.
To the eye it looks better, but it actually isn't.
With that first command, the numbers are often misread.

Sorry for double posting, but I'm just trying to find someone experienced in this to point me in the right direction.
Why doesn't Tesseract read it correctly when it looks almost perfect to the eye?

The second command produces bad-looking results, but it works magic on the numbers.

Re: OCR image preprocessing with ImageMagick

Posted: 2019-01-09T16:51:08-07:00
by milosbre
fmw42 wrote: 2019-01-09T12:25:42-07:00 Looks like the results at https://stackoverflow.com/questions/541 ... magemagick are better than you show here.
If you have any advice on which switches I should use, let me know.

For now I have solved the issue by using the Tesseract 4.0 alpha. It uses deep learning, and so far it works perfectly with my first command:

Code:

convert test.tif -fill black -fuzz 30% +opaque "#FFFFFF" result.tif
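
For anyone wanting to reproduce this, the two steps could be chained roughly as follows. This is a sketch under assumptions, not the exact invocation used above: it assumes a Tesseract 4 build where `--oem 1` selects the LSTM engine, and the helper names `run` and `clean_and_ocr` are hypothetical.

```shell
# Run a command, or just print it when DRY_RUN=1.
# The printed form is for inspection only; shell quoting is not preserved.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Clean the image with the command above, then OCR the result to stdout.
# Usage: clean_and_ocr INPUT CLEANED_OUTPUT
clean_and_ocr() {
    # Paint everything that is not within 30% of white as black.
    run convert "$1" -fill black -fuzz 30% +opaque "#FFFFFF" "$2" &&
    # Assumption: --oem 1 requests the neural-network (LSTM) engine.
    run tesseract "$2" stdout --oem 1
}

# Example (not run here):
# clean_and_ocr test.tif result.tif
```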

Re: OCR image preprocessing with ImageMagick

Posted: 2019-01-14T22:27:40-07:00
by anthony
If the image is from a screen dump... Make the image BIGGER....

tesseract is designed for scanned documents at about 600dpi, but displays are typically only 90 to 100dpi so scaling by 600% often works wonders.
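
A minimal sketch of that upscaling step with ImageMagick, assuming a screen capture as input. The helper names `run` and `upscale_for_ocr` are hypothetical and the file names are placeholders; 600% is simply the figure mentioned above, not a tuned value.

```shell
# Run a command, or just print it when DRY_RUN=1.
# The printed form is for inspection only; shell quoting is not preserved.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Enlarge an image before OCR.
# Usage: upscale_for_ocr INPUT OUTPUT [PERCENT]   (PERCENT defaults to 600)
upscale_for_ocr() {
    run convert "$1" -resize "${3:-600}%" "$2"
}

# Example (not run here):
# upscale_for_ocr screenshot.png scaled.png
# tesseract scaled.png stdout
```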


I also find some fonts make go bad. For example a 'serif f' will often be thought of as a P
Then there is confusion about Il1 or QO0 which can be solve by limiting the character set tesseract is using.

PS: the last paragraph when screen captured from my web browser display produced...
I also find some fonts make go bad. For example a 'serif f' will often be thought of as a P
Then there is confusion about III or Q00 which can be solve by limiting the character set tesseract is using.
PS: note that all the bars came out as the letter 'I' and the letter O's as the digit zero '0'. In another run the 'I's came out as '1'.
The 'f' had no problem as my web browser is not using a serif font.
Basically tesseract results can need some luck to work well. I am certainly no expert in its use.
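
Limiting the character set, as mentioned above, can be sketched with Tesseract's `tessedit_char_whitelist` config variable. The helper names `run` and `ocr_whitelisted` are hypothetical, and the default character list is only an example for numeric data. As far as I know, the LSTM engine in Tesseract 4.0 ignores the whitelist (support returned in 4.1), so treat this as something to verify against your version.

```shell
# Run a command, or just print it when DRY_RUN=1.
# The printed form is for inspection only; shell quoting is not preserved.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# OCR an image to stdout, allowing only the listed characters.
# Usage: ocr_whitelisted IMAGE [CHARS]   (CHARS defaults to digits and . , :)
ocr_whitelisted() {
    run tesseract "$1" stdout -c "tessedit_char_whitelist=${2:-0123456789.,:}"
}

# Example (not run here):
# ocr_whitelisted result.tif
```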

Some links I have
http://community.aiim.org/blogs/richard ... d-indexing
https://mathieularose.com/decoding-captchas/