GOCR Tute
This is a quick tutorial on how to use GOCR.
What is GOCR?
GOCR is an OCR (Optical Character Recognition) program, developed under the GNU General Public License.
It converts scanned images of text back to text files.
Windows Tutorial
Disclaimer: I'm writing this tute from the perspective of a Windows user. If
someone adds a Linux version too, that'd be great. I don't claim
to offer an exhaustive or authoritative text, only that I had to
fiddle for 5-20 minutes to get this program working, so I might as well share
what I learnt. This may not be the easiest or best way of doing
things, but it's the one I've discovered.
- Download GOCR for Windows. (BTW it's probably worth saving it in a new directory like c:\gocr.)
- The only real complication is getting the image into a format which GOCR can read. The formats it likes most are pnm and pbm. Here's one way of going about it:
- Convert the image to jpg format using whatever you can get your hands on. Even better, save it in that format when you first scan it in (at a resolution of at least 200 dpi, preferably 300 dpi).
- Download djpeg, a tool for converting jpegs to other formats (save it in the same directory as gocr.exe, for convenience)
- Use djpeg to convert the jpg to pnm format:
- Start the command prompt. (Start/Run, then type cmd or command.)
- Change to the directory djpeg is in, by typing cd\gocr (if c:\gocr is the directory you stored it in).
- Type djpeg -greyscale -dither none c:\pix\picture.jpg c:\pix\picture.pnm (where c:\pix\picture.jpg is your scanned document in jpg format). This should create a pnm version of the picture called picture.pnm.
- Use GOCR to convert the pnm image to text:
- Start the command prompt as above, unless it's already running.
- Change to the directory gocr is in as above, unless you're already there.
- Type gocr -i c:\pix\picture.pnm -o c:\text\stuff.txt (where c:\pix\picture.pnm is your scanned document in pnm format, and c:\text\stuff.txt is the text file you want it to dump the text in). This should create a text file c:\text\stuff.txt containing the results.
- Delete the image files picture.pnm and picture.jpg (unless you want to keep them for anything, that is).
If you want, you can speed the whole process up by writing a batch file to do
both stages at once and then clean up afterwards too.
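
Here's a rough sketch of what such a batch file might look like. I haven't tested it against every build of these tools, so treat it as a starting point. The file name ocr.bat and the TOOLS and PNM variables are made up for illustration. It assumes you're running cmd (Windows NT/2000/XP or later, not the older command), that djpeg.exe and gocr.exe both live in c:\gocr, that the input really is a jpg, and that djpeg returns a nonzero exit code when it fails.

```bat
@echo off
rem ocr.bat (hypothetical name)
rem Usage: ocr.bat c:\pix\picture.jpg c:\text\stuff.txt
setlocal

rem Assumption: djpeg.exe and gocr.exe were both saved in c:\gocr.
set TOOLS=c:\gocr

if "%~1"=="" goto usage
if "%~2"=="" goto usage

rem The temporary pnm goes next to the jpg: same drive, folder and base name.
set PNM=%~dpn1.pnm

rem Stage 1: jpg to greyscale pnm, using the same djpeg options as above.
"%TOOLS%\djpeg" -greyscale -dither none "%~f1" "%PNM%"
if errorlevel 1 goto failed

rem Stage 2: pnm to text.
"%TOOLS%\gocr" -i "%PNM%" -o "%~f2"

rem Clean up the intermediate pnm. The jpg is kept by default;
rem remove the "rem" from the second line to delete it as well.
del "%PNM%"
rem del "%~f1"
goto :eof

:usage
echo Usage: ocr.bat input.jpg output.txt
goto :eof

:failed
echo djpeg could not convert %~f1
```

With that saved as c:\gocr\ocr.bat, typing c:\gocr\ocr c:\pix\picture.jpg c:\text\stuff.txt should do both stages in one go. It doesn't check whether gocr itself succeeded, so have a look at the text file before you delete the original scan.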
--
TWikiGuest - 14 May 2004