Grayscale background removal for OCR

SteelMassimo
Posts: 16
Joined: 2019-02-06T11:29:18-07:00

Post by SteelMassimo »

Hello everyone!

I'm having the following issue:

I need to prepare a large number of images for OCR. The problem is that the information is written in a part of the document that has a gray background. The scanner also produced the image in color, so when the OCR engine tries to read it, it makes a mess of it.

I tried the following command, but the result has a lot of noise where the gray background used to be.

Code:

convert 0068_example.jpg -type Grayscale -brightness-contrast +15x100 0068_result.jpg

The original image: https://drive.google.com/open?id=1ugHSQ ... zOYsahLVE9

The result: https://drive.google.com/open?id=1P1j9B ... k3284sFBVk

I've also tried converting to B&W and then blurring a bit to avoid overly pixelated images, but the results were inconsistent.

Any suggestions on how to make it cleaner for OCR?

Thanks!

IM version: 7.0.8-27-Q16-x64
OS: Windows 7 Pro 64-bit.
Version date: 2019-01-27.
fmw42
Posts: 25562
Joined: 2007-07-02T17:14:51-07:00
Location: Sunnyvale, California, USA

Re: Grayscale background removal for OCR

Post by fmw42 »

Perhaps something like this will help. https://stackoverflow.com/questions/530 ... 9#53016979
SteelMassimo

Re: Grayscale background removal for OCR

Post by SteelMassimo »

Hello again fmw42!

Here's what I did, since my problem differs a little:

I removed the lines that deal with trimming, thinning, and border color. It ended up like this:

Code:

convert 0068_example.jpg -type Grayscale ^
-fuzz 22% ^
-define connected-components:remove=0 ^
-define connected-components:mean-color=true ^
-connected-components 4 ^
-background white -flatten ^
result_example.jpg

And here's the output: https://drive.google.com/file/d/1S7TdmN ... sp=sharing

The list of objects reported by -connected-components is far too long to be of any use. Still, here's the link for it:

https://drive.google.com/file/d/1Jhp8ps ... sp=sharing

The image still has a lot of noise from the gray background strip. What should I change so that it better recognizes the objects I want to remove?
fmw42

Re: Grayscale background removal for OCR

Post by fmw42 »

Try something like this:

Code:

convert 0068_example.jpg -type Grayscale -threshold 25% ^
-define connected-components:area-threshold=5 ^
-define connected-components:mean-color=true ^
-connected-components 4 ^
result.png

You may have to remove the long horizontal lines and the table lines. See https://stackoverflow.com/questions/540 ... 6#54044746

I strongly suggest that you not save to JPG. JPG compression is lossy, so it will just degrade your image further and make OCR harder. Also, if possible, do not scan to JPG (or PDF). Use TIFF (LZW compressed if possible) or PNG instead.
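
To make the idea behind the command above easier to follow, here is a minimal pure-Python sketch of the two steps: binarize with a percentage threshold, then drop connected regions smaller than a given area. It is only an illustration and not ImageMagick's implementation. The function names `threshold` and `remove_small_components` are hypothetical, the image is assumed to be a list of rows of 0-255 gray values, and "merging a small region into its neighbor" is simplified to flipping it to the opposite color, which is a reasonable stand-in only because the image is already black and white.

```python
from collections import deque


def threshold(gray, percent):
    """Binarize: values above `percent` of 255 become 1 (white), others 0 (black)."""
    cutoff = 255 * percent / 100.0
    return [[1 if px > cutoff else 0 for px in row] for row in gray]


def remove_small_components(binary, min_area, connectivity=4):
    """Flip every connected region with fewer than `min_area` pixels.

    Simplified stand-in for an area threshold on a black-and-white image:
    a small region is absorbed by its surroundings by taking the other color.
    """
    if connectivity == 4:
        steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    else:  # 8-connectivity also follows diagonal neighbors
        steps = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                 if (dy, dx) != (0, 0)]

    height = len(binary)
    width = len(binary[0]) if height else 0
    out = [row[:] for row in binary]
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if seen[y][x]:
                continue
            color = binary[y][x]
            region = []
            queue = deque([(y, x)])
            seen[y][x] = True
            # Breadth-first flood fill collects one same-colored region.
            while queue:
                cy, cx = queue.popleft()
                region.append((cy, cx))
                for dy, dx in steps:
                    ny, nx = cy + dy, cx + dx
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx]
                            and binary[ny][nx] == color):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(region) < min_area:
                for ry, rx in region:
                    out[ry][rx] = 1 - color
    return out
```

With a low threshold, most of the gray background turns white and only isolated dark specks survive; the area step then removes those specks while keeping larger shapes such as text strokes. The choice between 4 and 8 connectivity matters for thin diagonal strokes: with 4, pixels that touch only at the corners count as separate one-pixel regions and can be removed as noise.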
SteelMassimo

Re: Grayscale background removal for OCR

Post by SteelMassimo »

Hey fmw42!

Worked like a charm. After some tweaks to the threshold and to the area threshold for connected components, the output is good enough for OCR.

There is no need to remove the black line; it does not interfere. Thanks for the help!

Final code:

Code:

convert 0001.jpg -type Grayscale -threshold 35% -define connected-components:area-threshold=3 -define connected-components:mean-color=true -connected-components 8 0001.jpg
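
Since the goal is to process a large number of images, the final command could be wrapped in a small batch script. The following is a rough Python sketch, not something posted in this thread: the helper names `build_command` and `process_folder` are hypothetical, it assumes that `convert` on the PATH resolves to ImageMagick, and it writes PNG files to a separate folder (following the advice above about avoiding JPG) instead of overwriting the input.

```python
import subprocess
from pathlib import Path


def build_command(src, dst, threshold_pct=35, min_area=3,
                  connectivity=8, executable="convert"):
    """Return the argument list for the final command from this thread."""
    return [
        executable, str(src),
        "-type", "Grayscale",
        "-threshold", f"{threshold_pct}%",
        "-define", f"connected-components:area-threshold={min_area}",
        "-define", "connected-components:mean-color=true",
        "-connected-components", str(connectivity),
        str(dst),
    ]


def process_folder(in_dir, out_dir):
    """Run the command on every .jpg in in_dir, writing .png files to out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(in_dir).glob("*.jpg")):
        dst = out_dir / (src.stem + ".png")
        # check=True raises CalledProcessError if ImageMagick reports a failure.
        subprocess.run(build_command(src, dst), check=True)
```

Calling `process_folder("scans", "cleaned")` would then convert every JPG in a `scans` folder (both folder names are placeholders). Passing the arguments as a list avoids shell quoting problems with the `%` sign on Windows.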