sk-spell

Slovak language support in Open Source programs

list of filenames as tesseract-ocr input   

last modified: 2 July 2013



While playing with the tesseract-ocr API, I found that it is possible to use a text file containing image filenames as input instead of a single image file. This feature is useful if you want to OCR several images in one run.

Unfortunately, this is not accessible from the tesseract-ocr 3.02.02 executable, because the executable first tests whether the input is an image type supported by leptonica, and a text file fails that test. This should be fixed by svn revision 855, so if you build tesseract-ocr from svn you can use a command like this:

tesseract my_list_of_files output

where my_list_of_files could look like this:

eurotext.tif
phototest.tif

The output for both images will be in one file, output.txt. That means you can use my_list_of_files instead of a multipage TIFF (e.g. if you prefer PNG or have trouble generating a TIFF suitable for leptonica).
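If you have a directory full of images, the list file can be generated rather than typed by hand. The following is only a minimal sketch: the function name write_image_list and the default "*.png" pattern are my own choices for illustration, not part of tesseract-ocr.

```python
from pathlib import Path


def write_image_list(image_dir, list_path, pattern="*.png"):
    """Write one image path per line to list_path and return the paths written.

    Hypothetical helper for illustration; not part of tesseract-ocr.
    """
    # Sort so the order of pages in the combined output file is predictable.
    images = sorted(str(p) for p in Path(image_dir).glob(pattern))
    # One filename per line, the same layout as my_list_of_files above.
    Path(list_path).write_text(
        "".join(name + "\n" for name in images), encoding="utf-8"
    )
    return images
```

Calling write_image_list("scans", "my_list_of_files") would then produce a file that can be passed to tesseract in place of an image, assuming a build that accepts list files as described above.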

Of course, you can write a script with a loop (e.g. call the tesseract executable once for each image file), but it will be slower because tesseract has to be initialized on every call.
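To make the difference concrete, here is a rough sketch of the two approaches side by side. The helper names (per_image_commands, batch_command, run) are hypothetical, and the batch form assumes a tesseract build that accepts a list file as described above. Only the plain "tesseract input outputbase" form of the command line is used.

```python
import os
import subprocess


def per_image_commands(images):
    """One tesseract call per image; every call pays the start-up cost again."""
    # The output base is the image path without its extension,
    # so eurotext.tif ends up in eurotext.txt.
    return [["tesseract", img, os.path.splitext(img)[0]] for img in images]


def batch_command(list_file, output_base):
    """A single tesseract call that reads the image names from list_file."""
    return ["tesseract", list_file, output_base]


def run(commands):
    """Run each command in turn; raises CalledProcessError if tesseract fails."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

With the loop variant, run(per_image_commands(["eurotext.tif", "phototest.tif"])) starts tesseract twice and writes two output files. With run([batch_command("my_list_of_files", "output")]) it is started once and everything goes to output.txt.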


© sk-spell project
