There was a huge update of the tesseract-ocr language files on 24.06.2015: 98 traineddata files were updated or uploaded for the first time. At the moment 105 languages or language versions are supported (plus 2 special modules, osd and equ). The corresponding source training data were committed to the langdata repository.
The ara, eng, hin, kor, osd and equ traineddata files were NOT updated because of regressions. The other regressions are mostly fixed, with some dramatic improvements, particularly for Indic languages (for example, around 20% for kan).
There was no update for the cube files, because cube is a dead end and will be removed once the new classifier is implemented.
The language files are located in the separate tessdata repository on github.com. The total file size is 1.2 GiB at the moment. If you plan to clone the whole repository, you need to allow for much more space, because git keeps the full history (i.e. previous versions of the files) in the local copy.
Therefore it is more efficient to download only the language files you need. I created a small program, get_tessdata.cpp, that downloads and installs the desired file for you. It is not smart (e.g. there could be problems with a proxy, and you cannot choose the file version), but I hope it helps. Usage is simple (after compilation ;-) ), e.g.:
sudo ./get_tessdata -f fra.traineddata
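If you would rather not compile anything, the same idea (fetch one file and place it in the tessdata directory) can be sketched in a few lines of Python. This is only an illustration, not the logic of get_tessdata.cpp: the URL layout (raw files served from the master branch of the tessdata repository) and the default destination directory are assumptions, and the names tessdata_url and fetch are made up for this example.

```python
import os
import shutil
import urllib.request

# Assumption: raw files are served from the master branch of the tessdata repository.
BASE_URL = "https://github.com/tesseract-ocr/tessdata/raw/master/"
# Assumption: a typical tessdata location; adjust it to your installation.
DEFAULT_DIR = "/usr/share/tesseract-ocr/tessdata"


def tessdata_url(filename):
    """Build the download URL for one file, e.g. 'fra.traineddata'."""
    return BASE_URL + filename


def fetch(filename, dest_dir=DEFAULT_DIR):
    """Download one file into dest_dir and return the path that was written."""
    target = os.path.join(dest_dir, filename)
    # urlopen follows HTTP redirects; the body is streamed to disk in chunks.
    with urllib.request.urlopen(tessdata_url(filename)) as response, \
            open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

Calling fetch("fra.traineddata") would then download the French file. Like the compiled tool, it needs write permission for the destination directory, and it has no proxy handling beyond what urllib picks up from the environment.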
Here is the list of files available in the tessdata repository as of 29.06.2015:
lang code | lang name | file size | link |
afr | Afrikaans | 5.0 MiB | afr.traineddata |
amh | Amharic | 2.8 MiB | amh.traineddata |
ara | Arabic | 99.5 KiB | ara.cube.bigrams |
ara | Arabic | 4.0 B | ara.cube.fold |
ara | Arabic | 241.0 B | ara.cube.lm |
ara | Arabic | 820.7 KiB | ara.cube.nn |
ara | Arabic | 251.0 B | ara.cube.params |
ara | Arabic | 19.1 MiB | ara.cube.size |
ara | Arabic | 1.2 MiB | ara.cube.word-freq |
ara | Arabic | 6.0 MiB | ara.traineddata |
asm | Assamese | 15.1 MiB | asm.traineddata |
aze | Azerbaijani | 6.3 MiB | aze.traineddata |
aze_cyrl | Azerbaijani – Cyrillic | 2.7 MiB | aze_cyrl.traineddata |
bel | Belarusian | 6.5 MiB | bel.traineddata |
ben | Bengali | 14.8 MiB | ben.traineddata |
bod | Tibetan | 24.1 MiB | bod.traineddata |
bos | Bosnian | 5.2 MiB | bos.traineddata |
bul | Bulgarian | 5.7 MiB | bul.traineddata |
cat | Catalan; Valencian | 5.1 MiB | cat.traineddata |
ceb | Cebuano | 1.6 MiB | ceb.traineddata |
ces | Czech | 11.3 MiB | ces.traineddata |
chi_sim | Chinese – Simplified | 40.1 MiB | chi_sim.traineddata |
chi_tra | Chinese – Traditional | 54.1 MiB | chi_tra.traineddata |
chr | Cherokee | 1.0 MiB | chr.traineddata |
cym | Welsh | 3.6 MiB | cym.traineddata |
dan | Danish | 7.0 MiB | dan.traineddata |
dan_frak | Danish – Fraktur | 1.5 MiB | dan_frak.traineddata |
deu | German | 12.7 MiB | deu.traineddata |
deu_frak | German – Fraktur | 1.9 MiB | deu_frak.traineddata |
dzo | Dzongkha | 3.2 MiB | dzo.traineddata |
ell | Greek, Modern (1453-) | 5.2 MiB | ell.traineddata |
eng | English | 167.9 KiB | eng.cube.bigrams |
eng | English | 38.0 B | eng.cube.fold |
eng | English | 181.0 B | eng.cube.lm |
eng | English | 837.2 KiB | eng.cube.nn |
eng | English | 254.0 B | eng.cube.params |
eng | English | 12.4 MiB | eng.cube.size |
eng | English | 2.3 MiB | eng.cube.word-freq |
eng | English | 996.0 B | eng.tesseract_cube.nn |
eng | English | 20.9 MiB | eng.traineddata |
enm | English, Middle (1100-1500) | 2.0 MiB | enm.traineddata |
epo | Esperanto | 6.3 MiB | epo.traineddata |
equ | Math / equation detection module | 2.1 MiB | equ.traineddata |
est | Estonian | 9.2 MiB | est.traineddata |
eus | Basque | 4.7 MiB | eus.traineddata |
fas | Persian | 4.6 MiB | fas.traineddata |
fin | Finnish | 12.7 MiB | fin.traineddata |
fra | French | 127.0 KiB | fra.cube.bigrams |
fra | French | 59.0 B | fra.cube.fold |
fra | French | 301.0 B | fra.cube.lm |
fra | French | 949.5 KiB | fra.cube.nn |
fra | French | 242.0 B | fra.cube.params |
fra | French | 18.4 MiB | fra.cube.size |
fra | French | 2.8 MiB | fra.cube.word-freq |
fra | French | 660.0 B | fra.tesseract_cube.nn |
fra | French | 13.4 MiB | fra.traineddata |
frk | Frankish | 15.7 MiB | frk.traineddata |
frm | French, Middle (ca.1400-1600) | 15.1 MiB | frm.traineddata |
gle | Irish | 3.3 MiB | gle.traineddata |
glg | Galician | 5.3 MiB | glg.traineddata |
grc | Greek, Ancient (to 1453) | 4.9 MiB | grc.traineddata |
guj | Gujarati | 10.1 MiB | guj.traineddata |
hat | Haitian; Haitian Creole | 1.3 MiB | hat.traineddata |
heb | Hebrew | 4.1 MiB | heb.traineddata |
hin | Hindi | 67.4 KiB | hin.cube.bigrams |
hin | Hindi | 1.0 B | hin.cube.fold |
hin | Hindi | 211.0 B | hin.cube.lm |
hin | Hindi | 6.9 MiB | hin.cube.nn |
hin | Hindi | 262.0 B | hin.cube.params |
hin | Hindi | 1.2 MiB | hin.cube.word-freq |
hin | Hindi | 660.0 B | hin.tesseract_cube.nn |
hin | Hindi | 13.5 MiB | hin.traineddata |
hrv | Croatian | 8.7 MiB | hrv.traineddata |
hun | Hungarian | 11.6 MiB | hun.traineddata |
iku | Inuktitut | 971.9 KiB | iku.traineddata |
ind | Indonesian | 6.2 MiB | ind.traineddata |
isl | Icelandic | 5.8 MiB | isl.traineddata |
ita | Italian | 119.8 KiB | ita.cube.bigrams |
ita | Italian | 51.0 B | ita.cube.fold |
ita | Italian | 257.0 B | ita.cube.lm |
ita | Italian | 872.1 KiB | ita.cube.nn |
ita | Italian | 314.0 B | ita.cube.params |
ita | Italian | 13.3 MiB | ita.cube.size |
ita | Italian | 3.4 MiB | ita.cube.word-freq |
ita | Italian | 660.0 B | ita.tesseract_cube.nn |
ita | Italian | 13.6 MiB | ita.traineddata |
ita_old | Italian – Old | 13.4 MiB | ita_old.traineddata |
jav | Javanese | 4.2 MiB | jav.traineddata |
jpn | Japanese | 31.5 MiB | jpn.traineddata |
kan | Kannada | 34.0 MiB | kan.traineddata |
kat | Georgian | 5.9 MiB | kat.traineddata |
kat_old | Georgian – Old | 643.9 KiB | kat_old.traineddata |
kaz | Kazakh | 4.3 MiB | kaz.traineddata |
khm | Central Khmer | 46.6 MiB | khm.traineddata |
kir | Kirghiz; Kyrgyz | 5.2 MiB | kir.traineddata |
kor | Korean | 12.7 MiB | kor.traineddata |
kur | Kurdish | 1.9 MiB | kur.traineddata |
lao | Lao | 20.1 MiB | lao.traineddata |
lat | Latin | 5.7 MiB | lat.traineddata |
lav | Latvian | 7.4 MiB | lav.traineddata |
lit | Lithuanian | 8.5 MiB | lit.traineddata |
mal | Malayalam | 8.4 MiB | mal.traineddata |
mar | Marathi | 13.6 MiB | mar.traineddata |
mkd | Macedonian | 3.7 MiB | mkd.traineddata |
mlt | Maltese | 4.9 MiB | mlt.traineddata |
msa | Malay | 6.2 MiB | msa.traineddata |
mya | Burmese | 66.5 MiB | mya.traineddata |
nep | Nepali | 15.1 MiB | nep.traineddata |
nld | Dutch; Flemish | 16.3 MiB | nld.traineddata |
nor | Norwegian | 7.9 MiB | nor.traineddata |
ori | Oriya | 7.5 MiB | ori.traineddata |
osd | Orientation and script detection module | 10.1 MiB | osd.traineddata |
pan | Panjabi; Punjabi | 9.7 MiB | pan.traineddata |
pol | Polish | 13.3 MiB | pol.traineddata |
por | Portuguese | 12.3 MiB | por.traineddata |
pus | Pushto; Pashto | 2.4 MiB | pus.traineddata |
ron | Romanian; Moldavian; Moldovan | 7.6 MiB | ron.traineddata |
rus | Russian | 139.0 B | rus.cube.fold |
rus | Russian | 278.0 B | rus.cube.lm |
rus | Russian | 891.4 KiB | rus.cube.nn |
rus | Russian | 317.0 B | rus.cube.params |
rus | Russian | 14.5 MiB | rus.cube.size |
rus | Russian | 6.7 MiB | rus.cube.word-freq |
rus | Russian | 15.4 MiB | rus.traineddata |
san | Sanskrit | 21.7 MiB | san.traineddata |
sin | Sinhala; Sinhalese | 6.5 MiB | sin.traineddata |
slk | Slovak | 8.7 MiB | slk.traineddata |
slk_frak | Slovak – Fraktur | 825.4 KiB | slk_frak.traineddata |
slv | Slovenian | 6.5 MiB | slv.traineddata |
spa | Spanish; Castilian | 128.9 KiB | spa.cube.bigrams |
spa | Spanish; Castilian | 76.0 B | spa.cube.fold |
spa | Spanish; Castilian | 248.0 B | spa.cube.lm |
spa | Spanish; Castilian | 887.5 KiB | spa.cube.nn |
spa | Spanish; Castilian | 243.0 B | spa.cube.params |
spa | Spanish; Castilian | 18.1 MiB | spa.cube.size |
spa | Spanish; Castilian | 3.1 MiB | spa.cube.word-freq |
spa | Spanish; Castilian | 15.2 MiB | spa.traineddata |
spa_old | Spanish; Castilian – Old | 16.0 MiB | spa_old.traineddata |
sqi | Albanian | 6.3 MiB | sqi.traineddata |
srp | Serbian | 4.4 MiB | srp.traineddata |
srp_latn | Serbian – Latin | 5.8 MiB | srp_latn.traineddata |
swa | Swahili | 3.7 MiB | swa.traineddata |
swe | Swedish | 9.0 MiB | swe.traineddata |
syr | Syriac | 2.6 MiB | syr.traineddata |
tam | Tamil | 4.9 MiB | tam.traineddata |
tel | Telugu | 37.5 MiB | tel.traineddata |
tgk | Tajik | 1.1 MiB | tgk.traineddata |
tgl | Tagalog | 3.9 MiB | tgl.traineddata |
tha | Thai | 12.9 MiB | tha.traineddata |
tir | Tigrinya | 1.7 MiB | tir.traineddata |
tur | Turkish | 13.4 MiB | tur.traineddata |
uig | Uighur; Uyghur | 1.9 MiB | uig.traineddata |
ukr | Ukrainian | 7.7 MiB | ukr.traineddata |
urd | Urdu | 4.6 MiB | urd.traineddata |
uzb | Uzbek | 4.1 MiB | uzb.traineddata |
uzb_cyrl | Uzbek – Cyrillic | 3.2 MiB | uzb_cyrl.traineddata |
vie | Vietnamese | 5.8 MiB | vie.traineddata |
yid | Yiddish | 4.0 MiB | yid.traineddata |
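
To estimate how much disk space a chosen set of languages needs, the listing above can be parsed and the sizes summed. The following is a minimal sketch, assuming the listing is available as plain text in the same "code | name | size | file |" layout; the function names are illustrative, and SAMPLE holds only three rows copied from the list.

```python
# Binary size units as used in the listing.
UNITS = {"B": 1, "KiB": 1024, "MiB": 1024 ** 2, "GiB": 1024 ** 3}

SAMPLE = """
lang code | lang name | file size | link |
afr | Afrikaans | 5.0 MiB | afr.traineddata |
hin | Hindi | 1.0 B | hin.cube.fold |
hin | Hindi | 13.5 MiB | hin.traineddata |
"""


def parse_size(text):
    """Convert a size such as '5.0 MiB' into a number of bytes."""
    value, unit = text.split()
    return round(float(value) * UNITS[unit])


def parse_rows(table_text):
    """Return (code, name, size_in_bytes, filename) for every data row."""
    rows = []
    for line in table_text.strip().splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        cells = [cell for cell in cells if cell]  # drop the empty trailing cell
        if len(cells) != 4 or cells[0] == "lang code":
            continue  # skip the header and anything malformed
        code, name, size, filename = cells
        rows.append((code, name, parse_size(size), filename))
    return rows


def total_size(rows, codes):
    """Sum the sizes (in bytes) of all files belonging to the given codes."""
    return sum(size for code, _, size, _ in rows if code in codes)
```

For example, total_size(parse_rows(SAMPLE), {"hin"}) adds up both Hindi files in the sample. Since the listed sizes are rounded to one decimal place, the result is only an approximation of the real download size.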