Slovak language support in Open Source programs

First notes on tesseract-ocr 3.02 training

last change: 4 June 2012


The tesseract-ocr 3.02 code has been available for some time, but there is no information about changes in its training process. In my experience the process has changed in places. Here are my notes.

Expected files

The filenames/suffixes expected when creating a ‘traineddata’ file are defined in ccutil/tessdatamanager.h. Short descriptions of these components can be found in the manual page of combine_tessdata.

offset | filename | type of file | created by | description
0 | config | text | user | (Optional) Language-specific overrides to default config variables.
1 | unicharset | text | unicharset_extractor | (Required) The list of symbols that Tesseract recognizes, with properties.
2 | unicharambigs | text | user | (Optional) This file contains information on pairs of recognized symbols which are often confused.
3 | inttemp | binary | mftraining | (Required) Character shape templates for each unichar.
4 | pffmtable | binary/text | mftraining | (Required) The number of features expected for each unichar.
5 | normproto | text | cntraining | (Required) Character normalization prototypes.
6 | punc-dawg | dawg | wordlist2dawg | (Optional) A dawg made from punctuation patterns found around words. The “word” part is replaced by a single space.
7 | word-dawg | dawg | wordlist2dawg | (Optional) A dawg made from dictionary words from the language.
8 | number-dawg | dawg | wordlist2dawg | (Optional) A dawg made from tokens which originally contained digits. Each digit is replaced by a space character.
9 | freq-dawg | dawg | wordlist2dawg | (Optional) A dawg made from the most frequent words which would have gone into word-dawg.
10 | fixed-length-dawgs | dawg | wordlist2dawg | (Optional) Several dawgs of different fixed lengths, useful for languages like Chinese.
11 | cube-unicharset | text | unknown | (Optional) A unicharset for cube, if cube was trained on a different set of symbols.
12 | cube-word-dawg | dawg | wordlist2dawg | (Optional) A word dawg for cube’s alternate unicharset. Not needed if Cube was trained with Tesseract’s unicharset.
13 | shapetable | binary | shapeclustering | (Optional) When present, a shapetable is an extra layer between the character classifier and the word recognizer that allows the character classifier to return a collection of unichar ids and fonts instead of a single unichar-id and font.
14 | bigram-dawg | dawg | wordlist2dawg | (Optional) A dawg of word bigrams where the words are separated by a space and each digit is replaced by a ?.
15 | unambig-dawg | dawg | wordlist2dawg | (Optional)
16 | params-training-model | unknown | unknown | (Optional)

I was able to create wordlists from the dawg files with the dawg2wordlist tool, except for fixed-length-dawgs (present in chi_sim, chi_tra, jpn).

For cube-word-dawg (present in eng, fra) I needed to use cube-unicharset.

cube-unicharset looks like unicharset_extractor v3.00 output.
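
The extraction described above can be sketched as a small Python helper that only builds the command lines. It assumes the invocations `combine_tessdata -u <traineddata> <output prefix>` and `dawg2wordlist <unicharset> <dawg> <wordlist>` as I understand them from the tools' manual pages. The function name `extraction_commands`, the `tessdata` directory and the `.txt` output names are my own choices, not part of Tesseract.

```python
def extraction_commands(lang, tessdata_dir="tessdata"):
    """Build command lines that unpack <lang>.traineddata and dump its dawgs.

    Nothing is executed here; pass each list to subprocess.run() to run it.
    File names and the tessdata_dir default are illustrative assumptions.
    """
    prefix = lang + "."
    commands = [
        # Unpack all components into files named <lang>.<component>.
        ["combine_tessdata", "-u", f"{tessdata_dir}/{lang}.traineddata", prefix],
    ]
    # Ordinary dawgs are decoded with the language's own unicharset.
    for dawg in ("punc-dawg", "word-dawg", "number-dawg", "freq-dawg", "bigram-dawg"):
        commands.append(
            ["dawg2wordlist", prefix + "unicharset", prefix + dawg, f"{lang}.{dawg}.txt"]
        )
    # The cube word dawg is decoded with cube's unicharset instead.
    commands.append(
        ["dawg2wordlist", prefix + "cube-unicharset", prefix + "cube-word-dawg",
         f"{lang}.cube-word-dawg.txt"]
    )
    return commands
```

Not every language file contains every dawg, so some of these commands will fail on missing inputs; fixed-length-dawgs are left out on purpose because they could not be converted.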

I did not find unambig-dawg or params-training-model in any language data file, and there is no description of them.
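
To check which components a given traineddata file contains, one option is to read its header directly. The sketch below rests on an assumption about the file layout, based on my reading of the TessdataManager code: a 32-bit entry count, followed by that many 64-bit file offsets, with -1 marking an absent component and little-endian byte order. The name `present_components` is hypothetical, and the layout should be verified against ccutil/tessdatamanager.h before relying on it.

```python
import struct

# Component names in the order of the table above (list index = "offset" column).
COMPONENTS = [
    "config", "unicharset", "unicharambigs", "inttemp", "pffmtable",
    "normproto", "punc-dawg", "word-dawg", "number-dawg", "freq-dawg",
    "fixed-length-dawgs", "cube-unicharset", "cube-word-dawg", "shapetable",
    "bigram-dawg", "unambig-dawg", "params-training-model",
]


def present_components(header_bytes):
    """Return the names of components whose offset entry is not negative.

    ASSUMED layout: int32 entry count, then that many int64 file offsets,
    little-endian, with -1 marking a component that is not in the file.
    """
    (count,) = struct.unpack_from("<i", header_bytes, 0)
    if not 0 <= count <= len(COMPONENTS):
        raise ValueError("unexpected entry count: %d" % count)
    offsets = struct.unpack_from("<%dq" % count, header_bytes, 4)
    return [name for name, off in zip(COMPONENTS, offsets) if off >= 0]


# Usage (the path is an example):
# with open("tessdata/eng.traineddata", "rb") as f:
#     print(present_components(f.read(4 + 8 * len(COMPONENTS))))
```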


Training process

Let's assume training for the language “mic” with only one font, “nice”, and the input image “mic.nice.exp1.tif”. Here are the steps we need to take:

  1. echo nice 0 0 0 0 0 >> font_properties – this adds information about the font to the file font_properties
  2. tesseract mic.nice.exp1.tif mic.nice.exp1 batch.nochop makebox – this creates a box file that needs to be checked/edited (e.g. in QT Box Editor)
  3. tesseract mic.nice.exp1.tif mic.nice.exp1 nobatch box.train – this creates the files ‘’ and ‘mic.nice.exp1.txt’
  4. unicharset_extractor – this creates the file unicharset
  5. shapeclustering -F font_properties -U unicharset – this creates the file shapetable
  6. mftraining -F font_properties -U unicharset – this creates the files ‘pffmtable’ and ‘inttemp’
  7. cntraining – this creates the file normproto
  8. rename the files:
    • mv unicharset mic.unicharset
    • mv shapetable mic.shapetable
    • mv normproto mic.normproto
    • mv pffmtable mic.pffmtable
    • mv inttemp mic.inttemp
  9. create dictionaries (optional):
    • wordlist2dawg punc_wordlist mic.punc-dawg mic.unicharset
    • wordlist2dawg words_wordlist mic.word-dawg mic.unicharset
    • wordlist2dawg number_wordlist mic.number-dawg mic.unicharset
    • wordlist2dawg frequent_wordlist mic.freq-dawg mic.unicharset
    • wordlist2dawg bigram_wordlist mic.bigram-dawg mic.unicharset
  10. combine_tessdata mic. – creates the language data file mic.traineddata, which can be used by tesseract for OCR.
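
The steps above can be collected into a short Python sketch that builds the command lines in order (the optional dictionaries of step 9 are left out). Two things here are assumptions of mine and not taken from the list: the `.box` and `.tr` input arguments passed to unicharset_extractor, shapeclustering, mftraining and cntraining, which follow my understanding of how these tools are normally invoked, and the helper names `font_properties_line` and `training_commands`. Nothing is executed; the box file still has to be checked by hand between the first and second tesseract call.

```python
def font_properties_line(font):
    """Line for step 1; the five zeros are the font flags from the example."""
    return f"{font} 0 0 0 0 0"


def training_commands(lang="mic", font="nice", exp="exp1"):
    """Return the commands of steps 2-8 and 10 as argument lists."""
    base = f"{lang}.{font}.{exp}"
    tif = base + ".tif"
    box = base + ".box"   # ASSUMED input name for unicharset_extractor
    tr = base + ".tr"     # ASSUMED name of the training file from box.train
    components = ["unicharset", "shapetable", "normproto", "pffmtable", "inttemp"]
    cmds = [
        ["tesseract", tif, base, "batch.nochop", "makebox"],   # step 2
        ["tesseract", tif, base, "nobatch", "box.train"],      # step 3
        ["unicharset_extractor", box],                         # step 4
        ["shapeclustering", "-F", "font_properties", "-U", "unicharset", tr],  # step 5
        ["mftraining", "-F", "font_properties", "-U", "unicharset", tr],       # step 6
        ["cntraining", tr],                                    # step 7
    ]
    # Step 8: prefix every generated component with the language code.
    cmds += [["mv", name, f"{lang}.{name}"] for name in components]
    # Step 10: pack everything named "<lang>.*" into <lang>.traineddata.
    cmds.append(["combine_tessdata", lang + "."])
    return cmds


# Usage: print the plan, or hand each list to subprocess.run(cmd, check=True).
# for cmd in training_commands():
#     print(" ".join(cmd))
```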


I tested only with Latin script, so for e.g. Cyrillic or Asian writing systems there could be other findings…

Several language files in 3.02 include (optional) config files. They suggest a few ways to improve OCR (in the case of custom training):

The ara, hin, kor, chi_tra, chi_sim and jpn files have more complex configs. They contain groups of parameters for the new segmentation search, for turning off dictionary-based penalties, for blob filtering thresholds, and for forcing word segmentation to reduce the length of blob sequences. IMO these can also be useful for tuning non-Asian languages.

unicharset_extractor does not fill in several pieces of information in unicharset:

After investigating the available traineddata files I found that ‘glyph_metrics’, ‘script’ and ‘direction’ are the same for a given unichar regardless of language, so it is possible to correct this information with a script. ‘direction’ could also be derived from ICU’s enum UCharDirection.
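
As a rough illustration of deriving ‘direction’ without ICU itself, the sketch below maps the Unicode bidi class reported by Python's `unicodedata` module to the numeric values of ICU's UCharDirection enum (values 0 to 18; newer isolate classes are left out). It assumes that the ‘direction’ field stores exactly these enum values and that the first code point of a multi-character unichar is representative; the function name `direction_code` is my own.

```python
import unicodedata

# Unicode bidi class -> numeric value in ICU's UCharDirection enum.
ICU_DIRECTION = {
    "L": 0, "R": 1, "EN": 2, "ES": 3, "ET": 4, "AN": 5, "CS": 6, "B": 7,
    "S": 8, "WS": 9, "ON": 10, "LRE": 11, "LRO": 12, "AL": 13, "RLE": 14,
    "RLO": 15, "PDF": 16, "NSM": 17, "BN": 18,
}


def direction_code(unichar):
    """Return the assumed 'direction' value for a unichar, or None if unknown.

    Only the first code point is inspected (an assumption for ligatures
    and other multi-character unichars).
    """
    bidi = unicodedata.bidirectional(unichar[0])
    return ICU_DIRECTION.get(bidi)
```

For example, `direction_code("a")` gives 0 (left-to-right) and `direction_code("7")` gives 2 (European number).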

‘mirror’ seems to be related to ‘other_case’: e.g. if “i” has ‘other_case’ = 26 and ‘mirror’ = 15, then “I” has ‘other_case’ = 15 and ‘mirror’ = 26. This should be possible to fix.
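
One possible reading of this example is that ‘other_case’ holds the id of the opposite-case unichar and ‘mirror’ holds the unichar's own id. The sketch below fills both fields under that assumption, taking the position of a unichar in the list as its id. This is my interpretation, not documented behaviour: truly mirrored pairs such as brackets are not handled, and `fill_case_and_mirror` is a hypothetical helper name.

```python
def fill_case_and_mirror(unichars):
    """Return (unichar, other_case, mirror) tuples for a list of unichars.

    ASSUMPTIONS: the list index is the unichar id, other_case points to the
    opposite-case unichar when it is in the set (otherwise to itself), and
    mirror is the unichar's own id.
    """
    ids = {u: i for i, u in enumerate(unichars)}
    result = []
    for i, u in enumerate(unichars):
        other_case = ids.get(u.swapcase(), i)  # fall back to own id
        mirror = i                             # brackets would need a real table
        result.append((u, other_case, mirror))
    return result
```

With the input `["a", "I", "i", "."]` this yields other_case 2 for “I” and 1 for “i”, while “a” and “.” point to themselves.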

For shapeclustering and mftraining you can add the option -X xheights. I tried using an xheights file, but I did not find any difference in the shapeclustering and mftraining outputs…

strace shows that these tools also look for a file mic.nice.exp1.fontinfo, whose structure is not documented at the moment.

© sk-spell project
