How to remove MS Word Spellcheck wavy lines

Post by gddolph »

Hello all,

I am using ImageMagick and the textcleaner script to preprocess image files for Tesseract OCR. I'm having good success so far, but I've run into one problem that I need some help to solve. One of my use cases is processing text from screenshots, and I'm finding that Tesseract gets confused by the red or blue wavy underlines MS Word adds for spelling and grammar errors. Here's an example:
Image

I am looking for a way to get rid of the red and blue lines, and I have had some success with the following command set:

[devbox@fraitcf1vd1998 images]$ convert 20170110/deliberate_mistakes1.png -sharpen 0x1.0 -fuzz 30% -fill white -opaque 'rgb(255,0,0)' \
> -opaque 'rgb(0,0,255)' -scale 200% miff:- |\
> ./textcleaner -g -e stretch -f 50 -o 10 -s 1 - png:- |\
> tesseract - stdout
This is some random text with a missspelt word im it and, grammar mistake.
So it works, apart from getting the "n" in "in" wrong. Looking at it incrementally, the convert command gives me this:
Image
And the textcleaner section gives me this:
Image

The problem is that this is a blunt instrument: it changes all instances of red or blue into white. If all I wanted to do was read black text on white backgrounds I'd be overjoyed with this solution, but I will have images with text in multiple colors.
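
To see why the colour replacement is a blunt instrument, here is a minimal Python sketch of a fuzzy colour swap. It is a simplified model, not ImageMagick's implementation: the function name fuzzy_opaque, the sample colours, and the use of plain Euclidean RGB distance scaled to the largest possible distance are all assumptions made for illustration.

```python
import math

WHITE = (255, 255, 255)

def fuzzy_opaque(pixels, target, fuzz_percent, fill=WHITE):
    """Replace every pixel within the fuzz distance of `target` by `fill`.

    Assumption: Euclidean RGB distance, with the fuzz taken as a percentage
    of the largest possible RGB distance. ImageMagick's metric may differ.
    """
    max_dist = math.sqrt(3 * 255 ** 2)
    limit = max_dist * fuzz_percent / 100.0
    return [[fill if math.dist(p, target) <= limit else p for p in row]
            for row in pixels]

BLACK, RED, BLUE = (0, 0, 0), (255, 0, 0), (0, 0, 255)
DARK_RED_TEXT = (200, 30, 30)   # hypothetical colour of intentionally red text

# One row: black text, a red squiggle pixel, red text, a blue squiggle pixel.
row = [[BLACK, RED, DARK_RED_TEXT, BLUE]]
cleaned = fuzzy_opaque(fuzzy_opaque(row, RED, 30), BLUE, 30)
print(cleaned[0])
```

In this model the intentionally red text pixel is wiped along with both squiggle pixels, while the black pixel survives, which is the problem described above for documents with coloured text.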

One approach I've thought about is trying to detect the wavy line shape as it is very distinctive, and I'm thinking morphology might do it for me, but I have to confess I'm lost with the documentation.

Does anyone have a suggested approach and/or code?
Last edited by gddolph on 2017-01-11T03:09:55-07:00, edited 5 times in total.

Post by fmw42 »

Your images are too small, or the font size is too small, so we cannot even read them. I doubt that you could OCR such images. Can you provide an image that has better resolution?

Post by snibgo »

Clicking on the image, then the download button, improves the size, but not by much. The capital letter height (e.g. of "T") is 10 pixels, which in my experience is unlikely to give good results from Tesseract. It reads your first image as:

ms I5 some rzndum text with 2 ymsssgen ward m n and grammar mistake.
Sadly, the blue wavy line beneath "grammar" overwrites part of the base of the "g". If you removed the blue line, you would need to "know" that some of it should be replaced with black or gray instead of white.

Ignoring those problems, the task is fairly simple. Words have wide gaps between them, and letters have small gaps. The wavy lines mostly (always?) span entire words, so are mostly wider than individual letters. So the task is to isolate graphic objects that are wider than a certain width, and paint them over with white.
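
The idea of isolating wide objects can be sketched in Python as a connected-component pass over a binarised image. This is an illustration under assumptions, not snibgo's method: the grid, the helper names and min_width are hypothetical, ink is 1 and background is 0, and pixels touching diagonally count as connected.

```python
from collections import deque

def remove_wide_components(grid, min_width):
    """Return a copy of grid in which every 8-connected ink component whose
    bounding box is at least min_width pixels wide is set to background."""
    h, w = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    seen = [[False] * w for _ in range(h)]
    for y0 in range(h):
        for x0 in range(w):
            if grid[y0][x0] != 1 or seen[y0][x0]:
                continue
            # Breadth-first flood fill collects one component.
            seen[y0][x0] = True
            queue, pixels = deque([(y0, x0)]), []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny in (y - 1, y, y + 1):
                    for nx in (x - 1, x, x + 1):
                        if (0 <= ny < h and 0 <= nx < w
                                and grid[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            xs = [x for _, x in pixels]
            if max(xs) - min(xs) + 1 >= min_width:
                for y, x in pixels:
                    out[y][x] = 0
    return out

def parse(art):
    return [[1 if c == "#" else 0 for c in line] for line in art]

def show(grid):
    return ["".join("#" if v else "." for v in row) for row in grid]

page = parse([
    "#.#..#.#....",   # two narrow glyphs, 3 pixels wide each
    "###..#.#....",
    "#.#..###....",
    "............",
    ".#.#.#.#.#..",   # zigzag underline, 11 pixels wide
    "#.#.#.#.#.#.",
])
cleaned = remove_wide_components(page, min_width=6)
print("\n".join(show(cleaned)))
```

A squiggle that touches a descender, like the "g" mentioned above, would merge with that letter into one wide component, so in this simple form the letter would be removed too.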
snibgo's IM pages: im.snibgo.com

Post by gddolph »

Sorry about that. I've used a different image upload site, and now the images are scaled properly on the page.

Post by gddolph »

Hi @snibgo, thanks for your response. The images I originally had in the post didn't scale; I've fixed that by using a different site. I've had no problems using tesseract on the properly scaled image. I've edited my main post so that the images are scaled correctly, and included my complete command line, including tesseract, and the results.

I like your idea of isolating objects that are wider, but I'm not sure how to do that. I've tried using morphology, but everything I do ends up being a mess or deleting everything! Do you have any suggestions on how to do it?

Post by snibgo »

The following removes the worst of the lines, where they are darker in any channel than 77% of maximum, and 20 or more pixels wide. (Letters are about 10 pixels wide.) Textcleaner may take care of the remaining bits.

Windows BAT syntax. For Bash, change ^ to \, change %% to %, escape the parentheses as \( and \), and change the syntax of the environment variables.

set SRC=doXrJR.png

set LEN=20

%IM%convert ^
  %SRC% -write mpr:ORG ^
  -channel RGB ^
  -threshold 77%% ^
  +channel ^
  -write mpr:MSK ^
  -colorspace Gray -threshold 50%% ^
  ( +clone ^
    -negate ^
    -morphology Erode rectangle:%LEN%x1 ^
    -mask mpr:MSK -morphology Dilate rectangle:%LEN%x1 ^
    +mask ^
    -threshold 0 ^
  ) ^
  -delete 0 ^
  mpr:ORG ^
  +swap ^
  -compose Lighten -composite ^
  out.png
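
For readers who find the pipeline hard to follow, the core erode, masked dilate and lighten steps can be modelled in Python on a tiny grayscale grid. This is a rough sketch under assumptions rather than a port of the command: the function names and dark_below are hypothetical, the mask is simply the binarised ink, and pixels outside the image count as background, which may not match ImageMagick's edge handling.

```python
def erode_h(ink, length):
    """Keep a pixel only if the `length`-wide horizontal window around it is all ink."""
    left = length // 2
    right = length - left - 1
    width = len(ink[0])
    out = []
    for row in ink:
        out.append([
            1 if x - left >= 0 and x + right < width
                 and all(row[x - left:x + right + 1]) else 0
            for x in range(width)
        ])
    return out

def dilate_h(seed, length, mask):
    """Spread each seed pixel over the same window, but only where mask is 1."""
    left = length // 2
    right = length - left - 1
    width = len(seed[0])
    out = [[0] * width for _ in seed]
    for y, row in enumerate(seed):
        for x, value in enumerate(row):
            if value:
                for i in range(max(0, x - left), min(width, x + right + 1)):
                    if mask[y][i]:
                        out[y][i] = 1
    return out

def remove_long_lines(gray, length, dark_below=128):
    """Paint white every horizontal run of dark pixels at least `length` wide."""
    ink = [[1 if v < dark_below else 0 for v in row] for row in gray]
    found = dilate_h(erode_h(ink, length), length, mask=ink)
    # "Lighten": keep the lighter of the original pixel and the white overlay.
    return [[max(v, 255 * f) for v, f in zip(row, frow)]
            for row, frow in zip(gray, found)]

page = [
    [0, 0, 255, 0, 255, 0, 0, 255, 255, 255],          # short runs: "letters"
    [255, 40, 40, 40, 40, 40, 40, 40, 255, 255],       # 7-pixel run: "underline"
]
result = remove_long_lines(page, length=5)
for row in result:
    print(row)
```

Because the mask here is just the ink itself, the erode and masked dilate pair behaves like a plain opening with the same window; the mask only matters when it differs from the image being eroded.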

Post by gddolph »

Thanks @snibgo. I've tried that code; it removes some of the line but leaves some traces. Reducing the LEN value to 12 made it work better, but still leaves some. I'll play around with this and see if I can get it to work.

Post by anthony »

Nice use of morphology... Exactly what it is meant for ;-)

Use Erode to locate image parts that match the given kernel,
set a conditional dilation mask and dilate the result back to the matching lines,
then remove the found lines.

It might be improved by replacing the kernel with a DIY kernel of the 'wavy line', so that it better matches the lines MS Word adds.

If the kernel matches the lines more closely, you will get a tighter match, and perhaps avoid the use of conditional (masked) dilation. That means the erode-dilate steps will become the simpler 'Open' morphology equivalent.

You can generate a DIY kernel from an image using the "image2kernel" script I wrote for another problem (see Drawing Symbols).

I have updated the Morphology DIY user kernels section to demonstrate using that script to generate morphological kernels.
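
As an illustration of the Open idea with a shaped kernel, here is a small Python sketch on a white-on-black binary grid (ink is 1). The kernel text, the grid and the helper names are made up for the example, the kernel origin is assumed to be its centre cell, and '-' marks cells that are ignored; ImageMagick's own morphology may treat origins and image edges differently.

```python
KERNEL = [          # hypothetical zigzag shape, '-' = not part of the kernel
    "-1-1-",
    "1-1-1",
]

def kernel_offsets(rows):
    """(dy, dx) offsets of the '1' cells relative to the centre cell."""
    cy, cx = len(rows) // 2, len(rows[0]) // 2
    return [(y - cy, x - cx)
            for y, row in enumerate(rows)
            for x, cell in enumerate(row) if cell == "1"]

def erode(img, offsets):
    """1 where the whole kernel shape, placed here, lies on ink."""
    h, w = len(img), len(img[0])
    def has_ink(y, x):
        return 0 <= y < h and 0 <= x < w and img[y][x] == 1
    return [[1 if all(has_ink(y + dy, x + dx) for dy, dx in offsets) else 0
             for x in range(w)] for y in range(h)]

def dilate(img, offsets):
    """Stamp the kernel shape at every set pixel."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if img[y][x]:
                for dy, dx in offsets:
                    if 0 <= y + dy < h and 0 <= x + dx < w:
                        out[y + dy][x + dx] = 1
    return out

def parse(art):
    return [[1 if c == "#" else 0 for c in line] for line in art]

def show(grid):
    return ["".join("#" if v else "." for v in row) for row in grid]

image = parse([
    "##..#.......",   # text-like blobs
    "##..#.......",
    "............",
    ".#.#.#.#.#..",   # zigzag underline
    "#.#.#.#.#.#.",
])
offsets = kernel_offsets(KERNEL)
opened = dilate(erode(image, offsets), offsets)      # Open = Erode then Dilate
cleaned = [[a & (1 - b) for a, b in zip(ra, rb)] for ra, rb in zip(image, opened)]
print("\n".join(show(cleaned)))
```

Here the opening recovers exactly the zigzag and none of the text blobs, so subtracting it leaves only the text.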
Anthony Thyssen -- Webmaster for ImageMagick Example Pages
https://imagemagick.org/Usage/

Post by gddolph »

Thanks Anthony, that image2kernel script is exactly what I needed. I've created a grayscale kernel file using that script and called it ms_wavy_kernel.dat. I've plugged it into my command line, similar to the syntax in the Alternatives to Symbols section, as follows, but I get an error and I'm not sure what I'm doing wrong:

$   convert 20170110/word_fonts_1.png -write mpr:ORG -channel RGB -threshold 77% \
>   +channel -write mpr:MSK -colorspace Gray -threshold 50% +clone -negate \
>     -morphology Erode @ms_wavy_kernel.dat \
> -mask mpr:MSK -morphology Dilate @ms_wavy_kernel.dat \
>     +mask -threshold 0  -delete 0 mpr:ORG +swap -compose Lighten -composite \
>   word_fonts_1_conv1.png
Failed to parse kernel number #0
convert: invalid argument for option `-morphology': @ms_wavy_kernel.dat @ error/convert.c/ConvertImageCommand/2045.
The kernel file is in the same directory as the script, and I have tried giving it the full path as well. Taking the @ symbols off doesn't help either. It's probably something simple. What am I missing?

Post by snibgo »

The error is "Failed to parse kernel number #0".

That would be the first kernel in ms_wavy_kernel.dat. Which you might show us.

Post by gddolph »

Hi snibgo, can do. It's pretty long. I did a grayscale kernel, so it's one file:

24x8:
0.59375
0.59375
0.5
0.5
0.99609375
0.99609375
0.99609375
0.99609375
0.5
0.5
0.5
0.5
0.99609375
0.99609375
0.99609375
0.99609375
0.5
0.5
0.5
0.5
0.99609375
0.99609375
0.99609375
0.99609375
0.59375
0.59375
0.5
0.5
0.99609375
0.99609375
0.99609375
0.99609375
0.5
0.5
0.5
0.5
0.99609375
0.99609375
0.99609375
0.99609375
0.5
0.5
0.5
0.5
0.99609375
0.99609375
0.99609375
0.99609375
0.5
0.5
0.07421875
0.07421875
0.421875
0.421875
0.421875
0.421875
0.07421875
0.07421875
0.07421875
0.07421875
0.421875
0.421875
0.421875
0.421875
0.07421875
0.07421875
0.07421875
0.07421875
0.421875
0.421875
0.421875
0.421875
0.5
0.5
0.07421875
0.07421875
0.421875
0.421875
0.421875
0.421875
0.07421875
0.07421875
0.07421875
0.07421875
0.421875
0.421875
0.421875
0.421875
0.07421875
0.07421875
0.07421875
0.07421875
0.421875
0.421875
0.421875
0.421875
0.99609375
0.99609375
0.421875
0.421875
0.07421875
0.07421875
0.07421875
0.07421875
0.421875
0.421875
0.421875
0.421875
0.07421875
0.07421875
0.07421875
0.07421875
0.421875
0.421875
0.421875
0.421875
0.07421875
0.07421875
0.07421875
0.07421875
0.99609375
0.99609375
0.421875
0.421875
0.07421875
0.07421875
0.07421875
0.07421875
0.421875
0.421875
0.421875
0.421875
0.07421875
0.07421875
0.07421875
0.07421875
0.421875
0.421875
0.421875
0.421875
0.07421875
0.07421875
0.07421875
0.07421875
0.99609375
0.99609375
0.99609375
0.99609375
0.5
0.5
0.5
0.5
0.99609375
0.99609375
0.99609375
0.99609375
0.5
0.5
0.5
0.5
0.99609375
0.99609375
0.99609375
0.99609375
0.5
0.5
0.5
0.5
0.99609375
0.99609375
0.99609375
0.99609375
0.5
0.5
0.5
0.5
0.99609375
0.99609375
0.99609375
0.99609375
0.5
0.5
0.5
0.5
0.99609375
0.99609375
0.99609375
0.99609375
0.5
0.5
0.5
0.5
I used

convert -scale 200% -type grayscale
on one of my images, then copied a 24x8 pixel sample of the wavy line to a separate PNG and ran image2kernel -g on it to create the kernel above. Here's the image I used.
Image
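
A quick way to sanity-check a kernel file before handing it to -morphology is to parse it and confirm that the value count matches the WxH header. The Python sketch below assumes only the simple "WxH:" header followed by whitespace- or comma-separated values, as in the listing above; parse_kernel and format_kernel are hypothetical helpers, and other header forms are not handled.

```python
def parse_kernel(text):
    """Parse 'WxH: v v v ...' into (width, height, rows of value strings)."""
    header, _, body = text.partition(":")
    width, height = (int(n) for n in header.strip().lower().split("x"))
    values = body.replace(",", " ").split()
    if len(values) != width * height:
        raise ValueError(f"expected {width * height} values, found {len(values)}")
    rows = [values[r * width:(r + 1) * width] for r in range(height)]
    return width, height, rows

def format_kernel(width, height, rows):
    """Write the kernel back with one image row per text line."""
    return "\n".join([f"{width}x{height}:"] + [" ".join(row) for row in rows])

# Tiny stand-in for a one-value-per-line file such as ms_wavy_kernel.dat.
sample = "3x2:\n0.5\n1\n0\n1\n0.5\n0\n"
width, height, rows = parse_kernel(sample)
print(format_kernel(width, height, rows))
```

Laying the values out row by row also makes it easier to see at a glance whether the kernel resembles the shape it was cut from.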

Post by snibgo »

That parses without error for me, v6.9.5-3 and v7.0.3-5.

Post by gddolph »

Embarrassingly, the problem was my version. The system had 6.7.8-9 installed; once I upgraded to 7.0.4-3 the command worked. I say worked, as in it didn't fail, but it didn't remove anything. The reason is that it wasn't matching anything, because the morphology was being run on a negated image. I negated the image, re-ran image2kernel, and then used that kernel, which worked, giving me this image:
Image

The only thing is it's slow: on a larger screenshot it took several seconds to run, which could be a problem given that I need to automate this to process 100k images per day. Still, it's a great start!
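
To put the speed concern in numbers, here is a small Python calculation of the time budget. The 3-second figure is a hypothetical stand-in for "several seconds", and the calculation assumes images arrive evenly over 24 hours.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60

def time_budget_per_image(images_per_day, workers=1):
    """Seconds each image may take if `workers` processes run in parallel."""
    return SECONDS_PER_DAY * workers / images_per_day

def workers_needed(images_per_day, seconds_per_image):
    """Smallest number of parallel workers that keeps up with the daily load."""
    return math.ceil(images_per_day * seconds_per_image / SECONDS_PER_DAY)

budget = time_budget_per_image(100_000)
needed = workers_needed(100_000, 3.0)   # 3.0 s per image is an assumed figure
print(f"budget per image with one worker: {budget:.3f} s")
print(f"workers needed at 3 s per image: {needed}")
```

With one worker the budget is under a second per image, so at a few seconds each the job would need to be spread over several parallel processes or made faster.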

Post by fmw42 »

You could try -connected-components and throw out any regions that are not a shade of gray. See http://magick.imagemagick.org/script/co ... onents.php
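
One way to read this suggestion, sketched in plain Python rather than with -connected-components itself: group touching non-background pixels into regions, then paint over any region whose colour is not close to gray. The function name, the chroma measure (largest minus smallest channel, averaged over the region) and the limit of 40 are assumptions for illustration.

```python
WHITE = (255, 255, 255)

def remove_coloured_regions(img, chroma_limit=40, background=WHITE):
    """Paint over 8-connected non-background regions that are not gray-ish."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    seen = set()
    for y0 in range(h):
        for x0 in range(w):
            if (y0, x0) in seen or img[y0][x0] == background:
                continue
            # Depth-first flood fill collects one region.
            seen.add((y0, x0))
            stack, region = [(y0, x0)], []
            while stack:
                y, x = stack.pop()
                region.append((y, x))
                for ny in (y - 1, y, y + 1):
                    for nx in (x - 1, x, x + 1):
                        if (0 <= ny < h and 0 <= nx < w
                                and (ny, nx) not in seen
                                and img[ny][nx] != background):
                            seen.add((ny, nx))
                            stack.append((ny, nx))
            # Gray pixels have equal channels, so their chroma is 0.
            chroma = sum(max(img[y][x]) - min(img[y][x]) for y, x in region) / len(region)
            if chroma > chroma_limit:
                for y, x in region:
                    out[y][x] = background
    return out

K, G, R, W = (0, 0, 0), (90, 90, 90), (255, 0, 0), WHITE
img = [
    [K, W, G, W, W],   # black and gray "text"
    [K, W, G, W, W],
    [W, W, W, W, W],
    [R, R, R, R, W],   # red "underline"
]
cleaned = remove_coloured_regions(img)
print(cleaned[3])
```

This also removes coloured text, so for documents with text in multiple colours it would probably need combining with a shape or width test.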

Post by anthony »

Hmmm ... The kernel generated is for a convolution, not a dilation. I am also not sure whether it is inverted.

After making the conversions, and reformatting so as to make it more 'human readable' the resulting kernel did not look much like a wavy line, but more like a hash pattern!

24x8:
1 1 1 1 - - - - 1 1 1 1 - - - - 1 1 1 1 - - - -
1 1 1 1 - - - - 1 1 1 1 - - - - 1 1 1 1 - - - -
1 1 - - 1 1 1 1 - - - - 1 1 1 1 - - - - 1 1 1 1
1 1 - - 1 1 1 1 - - - - 1 1 1 1 - - - - 1 1 1 1
- - 1 1 - - - - 1 1 1 1 - - - - 1 1 1 1 - - - - 
- - 1 1 - - - - 1 1 1 1 - - - - 1 1 1 1 - - - -
- - - - 1 1 1 1 - - - - 1 1 1 1 - - - - 1 1 1 1
- - - - 1 1 1 1 - - - - 1 1 1 1 - - - - 1 1 1 1
This is obviously not a wavy line.


Grab the newer version of the image2kernel script and use the flags -gm on a white-on-black copy of the image, so as to generate the right type of kernel.

The 'm' flag has the script convert the image into a morphological kernel (thresholded values of '1' and '-', the latter meaning not part of the neighbourhood). It should work better in matching the edges of the wavy lines.

See the new 'flag' examples at the bottom of...
http://www.imagemagick.org/Usage/morphology/#user
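
The thresholding that the 'm' flag is described as doing can be sketched in Python as follows. This is a guess at the idea, not the script's actual code: the function name, the 0.5 threshold and the invert option (for a kernel cut from a black-on-white image) are assumptions.

```python
def to_morphology_kernel(rows, threshold=0.5, invert=False):
    """Turn grayscale kernel values into '1' (in the shape) or '-' (ignored)."""
    result = []
    for row in rows:
        values = [1.0 - float(v) if invert else float(v) for v in row]
        result.append(["1" if v > threshold else "-" for v in values])
    return result

gray_rows = [
    [0.99609375, 0.5, 0.07421875],
    [0.421875, 0.99609375, 0.59375],
]
for row in to_morphology_kernel(gray_rows):
    print(" ".join(row))
```

Which polarity is right depends on whether the lines are light or dark in the image the kernel was cut from, which is why the white-on-black copy matters.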