sk-spell

Slovak language support in Open Source programs

tesseract-ocr-en: creating dictionaries

last change: 18 May 2010

dawg files

As explained in my article “test – what is eng.traineddata?”, tesseract 3.00 expects several dawg (Directed Acyclic Word Graph) dictionaries:

lang.punc-dawg – dawg with punctuation patterns
lang.number-dawg – dawg with number patterns
lang.freq-dawg – frequent word dawg
lang.word-dawg – system word dawg

These files are created from simple UTF-8 text files (one word per line) by the program wordlist2dawg. Besides the word list and the name of the output dawg file, it needs the unicharset file as its last parameter. So for Slovak I run:

$ /usr/src/tesseract-ocr-r319/training/wordlist2dawg number \
   slk.number-dawg slk.unicharset
$ /usr/src/tesseract-ocr-r319/training/wordlist2dawg punc \
   slk.punc-dawg slk.unicharset
$ /usr/src/tesseract-ocr-r319/training/wordlist2dawg word_list \
   slk.word-dawg slk.unicharset
$ /usr/src/tesseract-ocr-r319/training/wordlist2dawg frequency_list \
   slk.freq-dawg slk.unicharset

A dictionary helps to improve the OCR result. For example, in some fonts it is difficult for OCR software to distinguish between “l” and “1”. In such cases a dictionary can help: the OCR result will be “all” rather than “a11” (provided “all” is in the dictionary and “a11” is not).
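
The idea can be illustrated with a small sketch. This is not how tesseract works internally; the confusion table and the function name below are assumptions made up for the example. It only shows how a word list lets a program prefer “all” over “a11”.

```python
from itertools import product

# Hypothetical confusion table (not taken from tesseract): each character
# maps to the characters an OCR engine might have mistaken it for.
CONFUSABLE = {"1": "1l", "l": "l1", "0": "0O", "O": "O0"}


def pick_with_dictionary(ocr_word, dictionary):
    """Return a dictionary word reachable by swapping confusable
    characters, or the raw OCR result if no such word exists."""
    if ocr_word in dictionary:
        return ocr_word
    # For every position, list the characters it could really be.
    options = [CONFUSABLE.get(ch, ch) for ch in ocr_word]
    for candidate in product(*options):
        word = "".join(candidate)
        if word in dictionary:
            return word
    return ocr_word


words = {"all", "ill", "10"}
print(pick_with_dictionary("a11", words))  # all
print(pick_with_dictionary("xyz", words))  # xyz
```

Trying every combination is fine for short words, but the number of candidates grows quickly with word length, so this is only a toy.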

In tesseract 3.00 the dawg dictionaries are optional files (in version 2.04 the dictionary files are mandatory; without them tesseract does not work).

If you decide to create a dictionary, the input file must contain at least one word. An input file can easily be created from Wikipedia. Other good sources are spellcheckers, translation dictionaries or other open linguistic projects, but pay attention to the license conditions of the data.
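
A minimal sketch of how such input files could be produced from plain text (for example, text already extracted from Wikipedia articles). The function names, the lower-casing and the cut-off `top_n` are my assumptions, not something tesseract prescribes; lower-casing in particular loses proper names, so drop it if that matters.

```python
import re
from collections import Counter


def build_word_lists(text, top_n=100):
    """Return (all_words_sorted, most_frequent_words) for a plain text."""
    # [^\W\d_] keeps word characters that are neither digits nor "_",
    # i.e. letters; in Python 3 this is Unicode-aware, so "č" or "š" work.
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(words)
    word_list = sorted(counts)
    frequent = [w for w, _ in counts.most_common(top_n)]
    return word_list, frequent


def write_list(path, words):
    """Write one word per line as UTF-8, ending every line with "\n"."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for w in words:
            f.write(w + "\n")


sample = "Pes a mačka. Pes a myš. Pes."
all_words, frequent = build_word_lists(sample, top_n=2)
print(all_words)  # ['a', 'mačka', 'myš', 'pes']
print(frequent)   # ['pes', 'a']
# write_list("word_list", all_words)
# write_list("frequency_list", frequent)
```

The two commented-out calls would produce files named like the word_list and frequency_list inputs used with wordlist2dawg above.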

If you need to turn off a dawg file, or to increase the verbosity of lang.traineddata handling, you can use the following variables:

variable	default setting	comment
global_load_punc_dawg	true	Load dawg with punctuation patterns.
global_load_number_dawg	true	Load dawg with number patterns.
global_load_freq_dawg	true	Load frequent word dawg.
global_load_system_dawg	true	Load system word dawg.
global_tessdata_manager_debug_level	0	Debug level for TessdataManager functions.

ambiguity file – lang.unicharambigs

According to Training Tesseract 2.04, this file is created manually. It represents the intrinsic ambiguity between characters or sets of characters. It is an optional file (i.e. you can skip it when creating lang.traineddata).

Here is an example of a few lines from eng.unicharambigs:

v1
2	' '	1	"	1
2	` ’	1	"	1
2	’ `	1	"	1
2	‘ ‘	1	“	1
2	‘ ’	1	"	1
2	’ ‘	1	"	1
2	’ ’	1	”	1
2	, ,	1	„	1
1	m	2	r n	0
2	r n	1	m	0
1	m	2	i n	0

For tesseract 3.00 there are some changes:

the first line determines the version of the ambigs file
there are 5 columns of information (instead of 4 as in tesseract 2.04):

the number of character(s) in the 2nd field
the character(s) as they were recognized (i.e. the incorrect form)
the number of character(s) in the 4th field
the character(s) as they should be recognized (i.e. the correct form)
a new field which, if I understood the comment in the source file (ccutil/ambigs.cpp) correctly, indicates whether the ambiguity should always be substituted (e.g. '' should always be changed to ")
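
To make the column layout concrete, here is a small sketch that splits one data line into its five fields. It assumes tab-separated columns and space-separated characters inside the 2nd and 4th column; the class and function names are mine, and the meaning of the fifth field follows my reading of the source comment, so treat it as an illustration only.

```python
from typing import List, NamedTuple


class Ambiguity(NamedTuple):
    wrong: List[str]      # characters as recognized (incorrect form)
    correct: List[str]    # characters as they should be (correct form)
    always_replace: bool  # fifth field, as interpreted in the text above


def parse_ambig_line(line):
    """Parse one data line (not the version line) of a v1 ambigs file."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError("expected 5 tab-separated fields, got %d" % len(fields))
    n_wrong, wrong, n_correct, correct, flag = fields
    wrong_chars = wrong.split(" ")
    correct_chars = correct.split(" ")
    # The 1st and 3rd field must agree with the number of characters given.
    if len(wrong_chars) != int(n_wrong) or len(correct_chars) != int(n_correct):
        raise ValueError("character count does not match its count field")
    return Ambiguity(wrong_chars, correct_chars, flag == "1")


print(parse_ambig_line("2\tr n\t1\tm\t0\n"))
# Ambiguity(wrong=['r', 'n'], correct=['m'], always_replace=False)
```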

There are several rules for this file:

all characters used in the second and fourth column must be present in the lang.unicharset file
a tab (\t) is the separator between columns
a space is the separator between characters within the second and fourth column
each line (including the last line!) must end with an (unix?) end-of-line (you must press “ENTER”), otherwise combine_tessdata will produce an error (last_char == '\n':Error:Assert failed:in file tessdatamanager.cpp, line 92) (updated on 18.05.2010)
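
These rules can be checked before running combine_tessdata. The following is a rough sketch, not a tool shipped with tesseract: the function name is hypothetical, and `known_chars` is assumed to be a set of characters you have already read from lang.unicharset by other means. Since I am not sure whether Unix line ends are strictly required, carriage returns are only reported as a possible problem.

```python
def check_ambigs(data, known_chars):
    """Return a list of problems found in the text of an ambigs file.

    data        -- whole file content as a str
    known_chars -- set of characters assumed to come from lang.unicharset
    """
    problems = []
    if not data.endswith("\n"):
        problems.append("last line does not end with a newline")
    if "\r" in data:
        problems.append("file contains carriage returns; Unix line ends may be required")
    lines = data.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # empty piece after the final newline
    # Line 1 is the version marker (e.g. "v1"); data starts on line 2.
    for number, line in enumerate(lines[1:], start=2):
        fields = line.split("\t")
        if len(fields) != 5:
            problems.append("line %d: expected 5 tab-separated columns" % number)
            continue
        for column in (fields[1], fields[3]):
            for ch in column.split(" "):
                if ch not in known_chars:
                    problems.append("line %d: %r is not in unicharset" % (number, ch))
    return problems


good = "v1\n2\tr n\t1\tm\t0\n"
print(check_ambigs(good, {"r", "n", "m"}))
# []
print(check_ambigs(good.rstrip("\n"), {"r", "n", "m"}))
# ['last line does not end with a newline']
```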

If you are interested in the development of lang.unicharambigs, please have a look at the unicharambigs files extracted from the tesseract 3.00 lang.traineddata files. Files for the following languages are present in this package:

deu – German
ell – Modern Greek
eng – English
fra – French
ita – Italian
nld – Dutch, Flemish
rus – Russian
spa – Spanish


© sk-spell project