sk-spell

Slovak language support in open source programs

tesseract-ocr-en: language training 3.00   

Last change: 23 April 2010

Because I was not satisfied with the results I got from my data trained for tesseract 2.04, I decided to test the training process in tesseract 3.00 and compare it with the training process for tesseract 2.04.

make box files

I used a training image (test-0001-arial.tif) to create the box file.

$ /usr/local/bin/tesseract test-0001-arial.tif test-0001-arial \
   batch.nochop makebox
$ cp test-0001-arial.txt test-0001-arial.box

The box file was then checked in tesseractTrainer.py, and some symbols were split based on the box file created for tesseract 2.04.

run tesseract for training

I started training as described:

$ /usr/local/bin/tesseract test-0001-arial.tif junk nobatch \
   box.train.stderr

but it crashed with a short message:
Segmentation fault

After spending some time with gdb, I found that it crashes at classify/blobclass.cpp:94. As far as I understand the source code, tesseract expects an input name of the form [lang].[fontname].exp[num] (the [lang], [fontname] and [num] fields must not contain ‘.’ characters), e.g. slk.arial.001. If you use anything else, it will crash. So I renamed my files:

$ mv test-0001-arial.tif  slk.arial.001.tif
$ mv test-0001-arial.box  slk.arial.001.box
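
The naming rule above can be checked before a training run is started. The following is only a minimal sketch, assuming a three-field base name such as slk.arial.001 as used here; the function name check_training_name is hypothetical and is not part of tesseract.

```shell
# Hypothetical helper (not part of tesseract): succeed only if the base
# name, without directory and .tif extension, has exactly three
# non-empty dot-separated fields, e.g. slk.arial.001.
check_training_name() {
    name=$(basename "$1")
    name=${name%.tif}
    case "$name" in
        *.*.*.*)    return 1 ;;  # more than three fields
        .*|*.|*..*) return 1 ;;  # an empty field
        *.*.*)      return 0 ;;  # exactly three fields
        *)          return 1 ;;  # fewer than three fields
    esac
}

check_training_name test-0001-arial.tif || echo "rename before training"
check_training_name slk.arial.001.tif && echo "name looks fine"
```

The check only looks at the shape of the name. It does not verify that the language code or the font name mean anything to tesseract.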

and ran the training process once again:

$ /usr/local/bin/tesseract slk.arial.001.tif junk nobatch \
   box.train.stderr

It crashed again, but with more information:

Tesseract Open Source OCR Engine with Leptonica
APPLY_BOXES: boxfile 10/1/t ((90,278),(107,316)): FAILURE! box overlaps blob in labelled word
APPLY_BOXES: ALSO ignoring corrupted char blk:1 row:3 “y”
APPLY_BOXES: boxfile 16/1/ý ((301,212),(326,262)): FAILURE! box overlaps blob in labelled word
APPLY_BOXES: ALSO ignoring corrupted char blk:1 row:4 “r”
APPLY_BOXES: boxfile 20/17/ý ((433,156),(458,205)): FAILURE! box overlaps blob in labelled word
APPLY_BOXES: ALSO ignoring corrupted char blk:1 row:5 “v”
APPLY_BOXES: boxfile 23/1// ((965,169),(982,209)): FAILURE! box overlaps blob in labelled word
APPLY_BOXES: ALSO ignoring corrupted char blk:1 row:5 “k”
APPLY_BOXES: boxfile 28/17/} ((755,42),(771,93)): FAILURE! box overlaps blob in labelled word
APPLY_BOXES: ALSO ignoring corrupted char blk:1 row:7 “{”
APPLY_BOXES: Unlabelled word blk:1 row:3 allrows:3
APPLY_BOXES: Unlabelled word blk:1 row:4 allrows:4
APPLY_BOXES: Unlabelled word blk:1 row:5 allrows:5
APPLY_BOXES: Unlabelled word blk:1 row:5 allrows:5
APPLY_BOXES: Unlabelled word blk:1 row:7 allrows:7
APPLY_BOXES: REBALANCE REQD “r [72 ]” – target of 6 from 5 labelled samples
APPLY_BOXES: REBALANCE REQD “k [6b ]” – target of 9 from 8 labelled samples
APPLY_BOXES: REBALANCE REQD “v [76 ]” – target of 7 from 6 labelled samples
APPLY_BOXES: REBALANCE REQD “ý [fd ]” – target of 4 from 2 labelled samples
APPLY_BOXES: REBALANCE REQD “t [74 ]” – target of 6 from 5 labelled samples
APPLY_BOXES: FATALITY – 0 labelled samples of “y [79 ]” – target is 1:
APPLY_BOXES: FATALITY – 1 labelled samples of “{ [7b ]” – target is 2:
APPLY_BOXES: FATALITY – 1 labelled samples of “} [7d ]” – target is 2:
APPLY_BOXES: FATALITY – 0 labelled samples of “/ [2f ]” – target is 1:
APPLY_BOXES: Boxes read from boxfile: 239 Initially labelled blobs: 229 in 7 rows Box failures detected: 10 Duped blobs for rebalance: 6 “y” has fewest samples: 0 Total unlabelled words: 5 Final labelled words: 235
Generating training data
Segmentation fault

Again, gdb showed that it crashed on line 94 of classify/blobclass.cpp. After many tests and some time spent reading the source code, I found out that tesseract should be run this way for training:

$ /usr/local/bin/tesseract slk.arial.001.tif ./slk.arial.001 \
   nobatch box.train.stderr

The second argument (./slk.arial.001) must contain at least one "/". I do not know the reason for this, nor how it behaves on other platforms such as Windows.
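
A small wrapper can guard against forgetting the slash. This is a sketch under the assumption that prefixing "./" is enough, as it was in the run above; with_slash is a hypothetical helper name, and the tesseract call is left commented out so the snippet runs without tesseract installed.

```shell
# Hypothetical helper: print the output base unchanged if it already
# contains a "/", otherwise prefix it with "./".
with_slash() {
    case "$1" in
        */*) printf '%s\n' "$1" ;;
        *)   printf './%s\n' "$1" ;;
    esac
}

outbase=$(with_slash slk.arial.001)
echo "$outbase"    # prints ./slk.arial.001
# tesseract slk.arial.001.tif "$outbase" nobatch box.train.stderr
```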

Tesseract then creates the file slk.arial.001.tr and prints these messages:

Tesseract Open Source OCR Engine with Leptonica
APPLY_BOXES: boxfile 10/1/t ((90,278),(107,316)): FAILURE! box overlaps blob in labelled word
APPLY_BOXES: ALSO ignoring corrupted char blk:1 row:3 “y”
APPLY_BOXES: boxfile 16/1/ý ((301,212),(326,262)): FAILURE! box overlaps blob in labelled word
APPLY_BOXES: ALSO ignoring corrupted char blk:1 row:4 “r”
APPLY_BOXES: boxfile 20/17/ý ((433,156),(458,205)): FAILURE! box overlaps blob in labelled word
APPLY_BOXES: ALSO ignoring corrupted char blk:1 row:5 “v”
APPLY_BOXES: boxfile 23/1// ((965,169),(982,209)): FAILURE! box overlaps blob in labelled word
APPLY_BOXES: ALSO ignoring corrupted char blk:1 row:5 “k”
APPLY_BOXES: boxfile 28/17/} ((755,42),(771,93)): FAILURE! box overlaps blob in labelled word
APPLY_BOXES: ALSO ignoring corrupted char blk:1 row:7 “{”
APPLY_BOXES: Unlabelled word blk:1 row:3 allrows:3
APPLY_BOXES: Unlabelled word blk:1 row:4 allrows:4
APPLY_BOXES: Unlabelled word blk:1 row:5 allrows:5
APPLY_BOXES: Unlabelled word blk:1 row:5 allrows:5
APPLY_BOXES: Unlabelled word blk:1 row:7 allrows:7
APPLY_BOXES: REBALANCE REQD “r [72 ]” – target of 6 from 5 labelled samples
APPLY_BOXES: REBALANCE REQD “k [6b ]” – target of 9 from 8 labelled samples
APPLY_BOXES: REBALANCE REQD “v [76 ]” – target of 7 from 6 labelled samples
APPLY_BOXES: REBALANCE REQD “ý [fd ]” – target of 4 from 2 labelled samples
APPLY_BOXES: REBALANCE REQD “t [74 ]” – target of 6 from 5 labelled samples
APPLY_BOXES: FATALITY – 0 labelled samples of “y [79 ]” – target is 1:
APPLY_BOXES: FATALITY – 1 labelled samples of “{ [7b ]” – target is 2:
APPLY_BOXES: FATALITY – 1 labelled samples of “} [7d ]” – target is 2:
APPLY_BOXES: FATALITY – 0 labelled samples of “/ [2f ]” – target is 1:
APPLY_BOXES: Boxes read from boxfile: 239 Initially labelled blobs: 229 in 7 rows Box failures detected: 10 Duped blobs for rebalance: 6 “y” has fewest samples: 0 Total unlabelled words: 5 Final labelled words: 235
Generating training data
TRAINING … Font name = arialwnFont
Generated training data for 235 blobs

As the message TRAINING … Font name = arialwnFont shows, tesseract tries to identify the font name from the input filename. Unfortunately there is a bug in blobclass.cpp, so the name is not used correctly. I created a patch that fixes this issue and also keeps tesseract from crashing when the filename does not follow the [lang].[font].[exp] pattern.
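
Under the naming scheme described earlier, the font name is the second dot-separated field of the file name, so the expected value here would be arial rather than arialwnFont. The sketch below shows that extraction in shell. It does not reproduce tesseract's C++ code, and font_of is a hypothetical helper name.

```shell
# Hypothetical helper: print the second dot-separated field of the
# file's base name, i.e. the [font] part of [lang].[font].[exp].
font_of() {
    name=$(basename "$1")
    rest=${name#*.}               # drop the [lang] field and its dot
    printf '%s\n' "${rest%%.*}"   # keep everything up to the next "."
}

font_of ./slk.arial.001.tif      # prints arial
```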

© sk-spell project