GOCR Tute

This is a quick tutorial on how to use GOCR.

What is GOCR?

GOCR is an OCR (Optical Character Recognition) program released under the GNU General Public License. It converts scanned images of text into text files.

Windows Tutorial

Disclaimer: I'm writing this tute from the perspective of a Windows user. If someone adds a Linux version too, that'd be great. I don't claim to offer an exhaustive or authoritative text, only that I had to fiddle for 5-20 minutes to get this program working, so I might as well share what I learnt. This may not be the easiest or best way of doing things, but it's the one I've discovered.

  1. Download GOCR for Windows. (It's probably worth saving it in a new directory such as c:\gocr.)
  2. The only real complication is getting the image into a format that GOCR can read. The formats it likes most are those of the pnm family, such as pbm. Here's one way of going about it:
    1. Convert the image to jpg format using whatever you can get your hands on. Even better, save it in that format when you first scan it in (at a resolution of at least 200 dpi, preferably 300 dpi).
    2. Download djpeg, a tool for converting JPEGs to other formats (save it in the same directory as gocr.exe, for convenience).
    3. Use djpeg to convert the jpg to pnm format:
      1. Start the command prompt (Start > Run, then type cmd or command).
      2. Change to the directory djpeg is in, by typing cd\gocr (if c:\gocr is the directory you stored it in).
      3. Type djpeg -greyscale -dither none c:\pix\picture.jpg c:\pix\picture.pnm (where c:\pix\picture.jpg is your scanned document in jpg format). This should create a pnm version of the picture, called picture.pnm, in c:\pix.
  3. Use GOCR to convert the pnm image to text:
    1. Start the command prompt as above, unless it's already running.
    2. Change to the directory gocr is in as above, unless you're already there.
    3. Type gocr -i c:\pix\picture.pnm -o c:\text\stuff.txt (where c:\pix\picture.pnm is your scanned document in pnm format, and c:\text\stuff.txt is the text file you want the recognised text written to; the c:\text directory must already exist). This should create a text file c:\text\stuff.txt containing the results.
  4. Delete the image files picture.pnm and picture.jpg (unless you want to keep them for anything, that is).
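
If GOCR refuses the file from step 2, it can help to check that djpeg really produced a pnm image. The following is only a minimal sketch in Python (an assumption on my part; the steps above need nothing but the command prompt). It relies on the fact that pnm files start with a two-character magic number, P1 to P6. The helper name pnm_kind is hypothetical.

```python
import sys

# Map each PNM magic number to the format it identifies.
PNM_MAGIC = {
    b"P1": "pbm (plain)",
    b"P2": "pgm (plain)",
    b"P3": "ppm (plain)",
    b"P4": "pbm (raw)",
    b"P5": "pgm (raw)",
    b"P6": "ppm (raw)",
}


def pnm_kind(path):
    """Return a description of the PNM variant stored at `path`,
    or None if the file does not start with a PNM magic number."""
    with open(path, "rb") as f:
        magic = f.read(2)  # the magic number is the first two bytes
    return PNM_MAGIC.get(magic)


if __name__ == "__main__":
    # Report the detected format for every file named on the command line.
    for name in sys.argv[1:]:
        print(name, "->", pnm_kind(name) or "not a PNM file")
```

Saved as, say, check_pnm.py, it could be run as python check_pnm.py c:\pix\picture.pnm. A greyscale conversion would normally be expected to show up as one of the pgm variants.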

If you want, you can speed the whole process up by writing a batch file to do both stages at once and then clean up afterwards too.
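
As a rough illustration of that idea, here is a sketch that uses a small Python script in place of a batch file (Python is my substitution, not something this tute requires). It runs the same two commands as steps 2 and 3, with the same switches and argument order, then deletes the image files as in step 4. It assumes djpeg and gocr can be found in the current directory or on the PATH. The function names build_commands and ocr_jpeg, and the script name ocr_jpeg.py, are hypothetical.

```python
import os
import subprocess
import sys


def build_commands(jpg_path, txt_path):
    """Return (pnm_path, djpeg_command, gocr_command) for one scanned image.

    The command lines mirror the ones typed by hand in the tutorial.
    """
    pnm_path = os.path.splitext(jpg_path)[0] + ".pnm"
    djpeg_cmd = ["djpeg", "-greyscale", "-dither", "none", jpg_path, pnm_path]
    gocr_cmd = ["gocr", "-i", pnm_path, "-o", txt_path]
    return pnm_path, djpeg_cmd, gocr_cmd


def ocr_jpeg(jpg_path, txt_path, keep_images=False):
    """Convert the jpg to pnm, run GOCR on it, then optionally clean up."""
    pnm_path, djpeg_cmd, gocr_cmd = build_commands(jpg_path, txt_path)
    # check=True raises an error if either program fails, so nothing is
    # deleted unless both stages finished successfully.
    subprocess.run(djpeg_cmd, check=True)
    subprocess.run(gocr_cmd, check=True)
    if not keep_images:
        os.remove(pnm_path)
        os.remove(jpg_path)


if __name__ == "__main__":
    if len(sys.argv) != 3:
        sys.exit("usage: python ocr_jpeg.py picture.jpg stuff.txt")
    ocr_jpeg(sys.argv[1], sys.argv[2])
```

It could then be run as python ocr_jpeg.py c:\pix\picture.jpg c:\text\stuff.txt. Note that by default it deletes the original jpg as well as the pnm, following step 4; pass keep_images=True to ocr_jpeg if you want to keep them.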

-- TWikiGuest - 14 May 2004