sk-spell

Slovak language support in open source programs

Building a minimalistic tesseract

Last modified: 7 January 2021


If you only plan to run OCR in your project (no training or OCR debugging), a minimal build of tesseract (plus leptonica) is all you need. To open image files you can use other libraries that are already available, e.g. OpenCV, Python PIL, or the native functions of Qt (or another framework).

A minimal tesseract distribution avoids shipping redundant libraries and makes upgrades and distribution easier.

Here is an example of a Windows build.

Requirements

Setup

NOTE: mkdir here is the tool shipped with Git (located e.g. at c:\Program Files\Git\usr\bin\mkdir.exe); the built-in Windows mkdir does not understand the -p option.

You can adjust the installation path to your needs.
mkdir -p F:\win64_msvc_min\share\tessdata_best\tessdata
set INSTALL_DIR=F:\win64_msvc_min
set PATH=%PATH%;%INSTALL_DIR%\bin;
set TESSDATA_PREFIX=F:\win64_msvc_min\share\tessdata_best\tessdata
"c:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Auxiliary\Build\vcvars64.bat" x64

Build

mkdir minimalistic && cd minimalistic

Leptonica

git clone --depth 1 https://github.com/DanBloomberg/leptonica.git
cd leptonica
mkdir build64.msvc && cd build64.msvc
cmake .. -DCMAKE_INSTALL_PREFIX=%INSTALL_DIR% -DCMAKE_PREFIX_PATH=%INSTALL_DIR% -DCMAKE_BUILD_TYPE=Release -DBUILD_PROG=OFF -DSW_BUILD=OFF -DBUILD_SHARED_LIBS=ON
A successful configuration looks like this:
...
-- General configuration for Leptonica 1.81.0
-- --------------------------------------------------------
-- Build type: Release
-- Compiler: MSVC
-- C compiler options:  /DWIN32 /D_WINDOWS /W3
-- Linker options: /machine:x64
-- Install directory: F:/win64_msvc_min
--
-- Build with sw [SW_BUILD]: OFF
-- Build utility programs [BUILD_PROG]: OFF
-- Used ZLIB library: ZLIB_LIBRARY-NOTFOUND
-- Used PNG library:
-- Used JPEG library: JPEG_LIBRARY-NOTFOUND
-- Used JP2K library:
-- Used TIFF library: TIFF_LIBRARY-NOTFOUND
-- Used GIF library:
-- Used WEBP library:
-- --------------------------------------------------------
As you can see, no external dependency was found (which is what we want). Now we can build and install leptonica with the command:
cmake --build . --config Release --target install
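
If you script the build, the "no external dependency" check can be automated by scanning the configuration summary. A minimal sketch, assuming the summary keeps the "Used ... library:" line format shown above; the function name found_image_libraries is hypothetical:

```python
def found_image_libraries(cmake_summary):
    """Return {name: value} for each 'Used ... library' line naming a real library."""
    found = {}
    for line in cmake_summary.splitlines():
        text = line.lstrip("- ").strip()
        if not text.startswith("Used ") or ":" not in text:
            continue
        label, value = text.split(":", 1)
        value = value.strip()
        # A failed lookup shows up as an empty value or as <VAR>-NOTFOUND.
        if value and not value.endswith("-NOTFOUND"):
            name = label[len("Used "):].replace(" library", "").strip()
            found[name] = value
    return found


summary = """\
-- Used ZLIB library: ZLIB_LIBRARY-NOTFOUND
-- Used PNG library:
-- Used JPEG library: JPEG_LIBRARY-NOTFOUND
-- Used TIFF library: TIFF_LIBRARY-NOTFOUND
"""
print(found_image_libraries(summary))  # {}
```

An empty dictionary means the build is as minimal as intended; any entry names a library that CMake picked up from the system.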

Tesseract

cd ..\..\ 
git clone -b 5.0.0-alpha-20201224 --depth 1 https://github.com/tesseract-ocr/tesseract.git
cd tesseract
mkdir build64.msvc && cd build64.msvc
cmake .. -DCMAKE_INSTALL_PREFIX=%INSTALL_DIR% -DCMAKE_PREFIX_PATH=%INSTALL_DIR% -DCMAKE_BUILD_TYPE=Release  -DSW_BUILD=OFF -DBUILD_TRAINING_TOOLS=OFF -DGRAPHICS_DISABLED=ON -DENABLE_LTO=ON
A successful configuration looks like this:
...
-- General configuration for Tesseract 5.0.0-alpha
-- --------------------------------------------------------
-- Build type: Release
-- Compiler: MSVC
-- Used standard: C++17
-- CXX compiler options: /DWIN32 /D_WINDOWS /W3 /GR /EHsc /utf-8 /MP /O2 /Ob2 /DNDEBUG /wd4244 /wd4305 /wd4267
-- Compile definitions = HAVE_AVX;HAVE_AVX2;HAVE_FMA;HAVE_SSE4_1;_CRT_SECURE_NO_WARNINGS;HAVE_CONFIG_H
-- Linker options: /machine:x64
-- Install directory: F:/win64_msvc_min
-- Architecture flags: /arch:AVX2
-- Vector unit list: sse2;sse3;ssse3;sse4.1;sse4.2;avx;fma;bmi2;avx2
-- HAVE_AVX: ON
-- HAVE_AVX2: ON
-- HAVE_FMA: ON
-- HAVE_SSE4_1: ON
-- MARCH_NATIVE_OPT:
-- HAVE_NEON:
-- Link-time optimization: TRUE
-- --------------------------------------------------------
-- Build with sw [SW_BUILD]: OFF
-- Build with openmp support [OPENMP_BUILD]: OFF
-- Disable disable graphics (ScrollView) [GRAPHICS_DISABLED]: ON
-- Disable the legacy OCR engine [DISABLED_LEGACY_ENGINE]: OFF
-- Build training tools [BUILD_TRAINING_TOOLS]: OFF
-- Build tests [BUILD_TESTS]: OFF
-- Use system ICU Library [USE_SYSTEM_ICU]: OFF
-- --------------------------------------------------------
Now we can build and install tesseract with the command:
cmake --build . --config Release --target install

For an unknown reason the MSVC build fails the first time it is run (error MSB3073: The command "setlocal"…). If you run the command above once more, it succeeds:

 -- Install configuration: "Release"
  -- Installing: F:/win64_msvc_min/lib/pkgconfig/tesseract.pc
  -- Installing: F:/win64_msvc_min/bin/tesseract.exe
  -- Installing: F:/win64_msvc_min/lib/tesseract50.lib
  -- Installing: F:/win64_msvc_min/bin/tesseract50.dll
  -- Installing: F:/win64_msvc_min/lib/cmake/tesseract/TesseractTargets.cmake
  -- Installing: F:/win64_msvc_min/lib/cmake/tesseract/TesseractTargets-release.cmake
  -- Up-to-date: F:/win64_msvc_min/lib/cmake
  -- Installing: F:/win64_msvc_min/lib/cmake/TesseractConfig.cmake
  -- Installing: F:/win64_msvc_min/lib/cmake/TesseractConfigVersion.cmake
  -- Installing: F:/win64_msvc_min/include/tesseract/apitypes.h
  -- Installing: F:/win64_msvc_min/include/tesseract/baseapi.h
  -- Installing: F:/win64_msvc_min/include/tesseract/capi.h
  -- Installing: F:/win64_msvc_min/include/tesseract/renderer.h
  -- Installing: F:/win64_msvc_min/include/tesseract/version.h
  -- Installing: F:/win64_msvc_min/include/tesseract/thresholder.h
  -- Installing: F:/win64_msvc_min/include/tesseract/ltrresultiterator.h
  -- Installing: F:/win64_msvc_min/include/tesseract/pageiterator.h
  -- Installing: F:/win64_msvc_min/include/tesseract/resultiterator.h
  -- Installing: F:/win64_msvc_min/include/tesseract/osdetect.h
  -- Installing: F:/win64_msvc_min/include/tesseract/publictypes.h
  -- Installing: F:/win64_msvc_min/include/tesseract/genericvector.h
  -- Installing: F:/win64_msvc_min/include/tesseract/helpers.h
  -- Installing: F:/win64_msvc_min/include/tesseract/ocrclass.h
  -- Installing: F:/win64_msvc_min/include/tesseract/platform.h
  -- Installing: F:/win64_msvc_min/include/tesseract/serialis.h
  -- Installing: F:/win64_msvc_min/include/tesseract/strngs.h
  -- Installing: F:/win64_msvc_min/include/tesseract/unichar.h
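
In an automated build the "run it a second time" workaround can be wrapped in a small retry loop. This is only a sketch that repeats the same command; it does not address the unknown cause of the first failure. The helper name build_with_retry and its parameters are hypothetical:

```python
import subprocess

BUILD_COMMAND = ["cmake", "--build", ".", "--config", "Release", "--target", "install"]


def build_with_retry(command=BUILD_COMMAND, attempts=2, run=subprocess.run):
    """Run the build command, repeating it on failure.

    Returns the number of the attempt that succeeded.
    """
    for attempt in range(1, attempts + 1):
        result = run(command)
        if result.returncode == 0:
            return attempt
    raise RuntimeError(f"build still failing after {attempts} attempts")
```

Call build_with_retry() from the build64.msvc directory, in a prompt where vcvars64.bat has already been run. The run parameter exists only so that the function can be exercised without invoking cmake.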

You now have a minimalistic tesseract build that you can use in your project, for example through a simple tesseract wrapper in Python.

Remark

If you plan to use tesseract's PDF output, you will need leptonica built with libpng, zlib, libjpeg and libtiff support.

Tests

Test the tesseract version:
tesseract --version
tesseract 5.0.0-alpha
 leptonica-1.81.0 (Nov  1 2020, 19:13:26) [MSC v.1927 LIB Release x64]
  (null)
 Found AVX2
 Found AVX
 Found FMA
 Found SSE 

As you can see, there is no support for any (external) image libraries. Tesseract is nevertheless able to read some simple image formats such as pnm, ppm, bmp and spix.

So let's try a simple image:
tesseract line.ppm - --dpi 300
It produces:
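
Because PPM is such a simple format, a test image can be produced without any imaging library. A minimal sketch that encodes raw 8-bit RGB data as a binary PPM (P6); the function name encode_ppm is hypothetical:

```python
def encode_ppm(width, height, rgb_bytes):
    """Return raw 8-bit RGB pixel data wrapped as a binary PPM (P6) file."""
    if len(rgb_bytes) != width * height * 3:
        raise ValueError("rgb_bytes must hold exactly width * height * 3 bytes")
    # P6 header: magic number, dimensions, maximum channel value.
    header = f"P6\n{width} {height}\n255\n".encode("ascii")
    return header + bytes(rgb_bytes)


# A 2x1 image: one red pixel followed by one white pixel.
ppm = encode_ppm(2, 1, bytes([255, 0, 0, 255, 255, 255]))
print(len(ppm))  # 17
```

Writing the returned bytes to a file opened in "wb" mode gives an image the minimal build can read. If Pillow is available anyway, saving an RGB image under a .ppm file name should produce the same kind of file.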
Error in pixReadMemTiff: function not present
Error in pixReadMem: tiff: no pix returned
Error in pixaGenerateFontFromString: pix not made
Error in bmfCreate: font pixa not made
N27 26 10 04 03 01

The error messages are produced by leptonica, because tesseract tries to use some functions that require the (unavailable) external libraries. These messages can be suppressed with the leptonica function setMsgSeverity. The OCR result is on the last line.
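
When the command-line output is captured by a script, the leptonica messages can also be filtered out afterwards. A minimal sketch, assuming every message starts with the "Error in " prefix seen above; the function name split_leptonica_messages is hypothetical:

```python
def split_leptonica_messages(output, prefix="Error in "):
    """Separate leptonica messages from the recognized text."""
    messages, text = [], []
    for line in output.splitlines():
        (messages if line.startswith(prefix) else text).append(line)
    return messages, "\n".join(text).strip()


sample = """\
Error in pixReadMemTiff: function not present
Error in pixReadMem: tiff: no pix returned
Error in pixaGenerateFontFromString: pix not made
Error in bmfCreate: font pixa not made
N27 26 10 04 03 01
"""
messages, text = split_leptonica_messages(sample)
print(text)  # N27 26 10 04 03 01
```

The filter would also drop a recognized line that happens to begin with the same prefix, so treat it as a convenience rather than a replacement for setMsgSeverity.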

Simple Python usage example

The code can be downloaded from PasteBin.


#!/usr/bin/env python3

import ctypes
import locale
import os
import platform
from ctypes.util import find_library

import cffi
from PIL import Image, ImageDraw, ImageFont

ffi = cffi.FFI()
ffi.cdef("""
typedef signed char l_int8;
typedef unsigned char l_uint8;
typedef short l_int16;
typedef unsigned short l_uint16;
typedef int l_int32;
typedef unsigned int l_uint32;
typedef float l_float32;
typedef double l_float64;
typedef long long l_int64;
typedef unsigned long long l_uint64;
typedef int l_ok; /*!< return type 0 if OK, 1 on error */

struct Pix;
typedef struct Pix PIX;
typedef enum lept_img_format { IFF_UNKNOWN = 0, IFF_BMP = 1, IFF_JFIF_JPEG = 2, IFF_PNG = 3, IFF_TIFF = 4, IFF_TIFF_PACKBITS = 5, IFF_TIFF_RLE = 6, IFF_TIFF_G3 = 7, IFF_TIFF_G4 = 8, IFF_TIFF_LZW = 9, IFF_TIFF_ZIP = 10, IFF_PNM = 11, IFF_PS = 12, IFF_GIF = 13, IFF_JP2 = 14, IFF_WEBP = 15, IFF_LPDF = 16, IFF_TIFF_JPEG = 17, IFF_DEFAULT = 18, IFF_SPIX = 19
};

typedef enum newsev { L_SEVERITY_EXTERNAL = 0, /* Get the severity from the environment */ L_SEVERITY_ALL = 1, /* Lowest severity: print all messages */ L_SEVERITY_DEBUG = 2, /* Print debugging and higher messages */ L_SEVERITY_INFO = 3, /* Print informational and higher messages */ L_SEVERITY_WARNING = 4, /* Print warning and higher messages */ L_SEVERITY_ERROR = 5, /* Print error and higher messages */ L_SEVERITY_NONE = 6 /* Highest severity: print no messages */
};

char * getLeptonicaVersion ( );
PIX * pixRead ( const char *filename );
PIX * pixCreate ( int width, int height, int depth );
PIX * pixEndianByteSwapNew(PIX *pixs);
l_int32 pixSetData ( PIX *pix, l_uint32 *data );
l_ok pixSetPixel ( PIX *pix, l_int32 x, l_int32 y, l_uint32 val );
l_ok pixWrite ( const char *fname, PIX *pix, l_int32 format );
l_int32 pixFindSkew ( PIX *pixs, l_float32 *pangle, l_float32 *pconf );
PIX * pixDeskew ( PIX *pixs, l_int32 redsearch );
void pixDestroy ( PIX **ppix );
l_ok pixGetResolution ( const PIX *pix, l_int32 *pxres, l_int32 *pyres );
l_ok pixSetResolution ( PIX *pix, l_int32 xres, l_int32 yres );
l_int32 pixGetWidth ( const PIX *pix );
l_int32 setMsgSeverity ( l_int32 newsev );

typedef struct TessBaseAPI TessBaseAPI;
typedef struct ETEXT_DESC ETEXT_DESC;
typedef struct TessPageIterator TessPageIterator;
typedef struct TessResultIterator TessResultIterator;
typedef int BOOL;

typedef enum TessOcrEngineMode { OEM_TESSERACT_ONLY = 0, OEM_LSTM_ONLY = 1, OEM_TESSERACT_LSTM_COMBINED = 2, OEM_DEFAULT = 3} TessOcrEngineMode;

typedef enum TessPageSegMode { PSM_OSD_ONLY = 0, PSM_AUTO_OSD = 1, PSM_AUTO_ONLY = 2, PSM_AUTO = 3, PSM_SINGLE_COLUMN = 4, PSM_SINGLE_BLOCK_VERT_TEXT = 5, PSM_SINGLE_BLOCK = 6, PSM_SINGLE_LINE = 7, PSM_SINGLE_WORD = 8, PSM_CIRCLE_WORD = 9, PSM_SINGLE_CHAR = 10, PSM_SPARSE_TEXT = 11, PSM_SPARSE_TEXT_OSD = 12, PSM_COUNT = 13} TessPageSegMode;

typedef enum TessPageIteratorLevel { RIL_BLOCK = 0, RIL_PARA = 1, RIL_TEXTLINE = 2, RIL_WORD = 3, RIL_SYMBOL = 4} TessPageIteratorLevel;

TessPageIterator* TessBaseAPIAnalyseLayout(TessBaseAPI* handle);
TessPageIterator* TessResultIteratorGetPageIterator(TessResultIterator* handle);

BOOL TessPageIteratorNext(TessPageIterator* handle, TessPageIteratorLevel level);
BOOL TessPageIteratorBoundingBox(const TessPageIterator* handle, TessPageIteratorLevel level, int* left, int* top, int* right, int* bottom);

const char* TessVersion();

TessBaseAPI* TessBaseAPICreate();
int TessBaseAPIInit3(TessBaseAPI* handle, const char* datapath, const char* language);
int TessBaseAPIInit2(TessBaseAPI* handle, const char* datapath, const char* language, TessOcrEngineMode oem);
void TessBaseAPISetPageSegMode(TessBaseAPI* handle, TessPageSegMode mode);
void TessBaseAPISetImage(TessBaseAPI* handle, const unsigned char* imagedata, int width, int height, int bytes_per_pixel, int bytes_per_line);
void TessBaseAPISetImage2(TessBaseAPI* handle, struct Pix* pix);

BOOL TessBaseAPISetVariable(TessBaseAPI* handle, const char* name, const char* value);
BOOL TessBaseAPIDetectOrientationScript(TessBaseAPI* handle, char** best_script_name, int* best_orientation_deg, float* script_confidence, float* orientation_confidence);
int TessBaseAPIRecognize(TessBaseAPI* handle, ETEXT_DESC* monitor);
TessResultIterator* TessBaseAPIGetIterator(TessBaseAPI* handle);
BOOL TessResultIteratorNext(TessResultIterator* handle, TessPageIteratorLevel level);
char* TessResultIteratorGetUTF8Text(const TessResultIterator* handle, TessPageIteratorLevel level);
float TessResultIteratorConfidence(const TessResultIterator* handle, TessPageIteratorLevel level);
char* TessBaseAPIGetUTF8Text(TessBaseAPI* handle);
const char* TessResultIteratorWordFontAttributes(const TessResultIterator* handle, BOOL* is_bold, BOOL* is_italic, BOOL* is_underlined, BOOL* is_monospace, BOOL* is_serif, BOOL* is_smallcaps, int* pointsize, int* font_id);
void TessBaseAPIEnd(TessBaseAPI* handle);
void TessBaseAPIDelete(TessBaseAPI* handle);
l_ok pixWriteMem ( l_uint8 **pdata, size_t *psize, PIX *pix, l_int32 format ); /* used by img_lepto_to_pil */
"""
)


def get_abs_path_of_library(library):
    """Get absolute path of library."""
    abs_path = None
    lib_name = find_library(library)
    if lib_name is None:
        return abs_path  # library was not found
    if os.path.exists(lib_name):
        abs_path = os.path.abspath(lib_name)
        return abs_path
    libdl = ctypes.CDLL(lib_name)
    if not libdl:
        return abs_path  # None
    try:
        dlinfo = libdl.dlinfo
    except AttributeError as err:
        # Workaround for Linux: the error message starts with the full path
        # of the library. Splitting on ":" is an assumption, the separator
        # is not readable in the listing.
        abs_path = str(err).split(":")[0]
    return abs_path


def pil2PIX32(im, leptonica):
    """Convert PIL to leptonica PIX."""
    # At the moment we handle everything as RGBA image
    if im.mode != "RGBA":
        im = im.convert("RGBA")
    depth = 32
    width, height = im.size
    data = im.tobytes("raw", "RGBA")
    pixs = leptonica.pixCreate(width, height, depth)
    leptonica.pixSetData(pixs, ffi.from_buffer("l_uint32[]", data))

    # pixSetResolution expects integers; PIL may report fractional values.
    try:
        resolutionX = im.info["resolution"][0]
        resolutionY = im.info["resolution"][1]
        leptonica.pixSetResolution(pixs, int(resolutionX), int(resolutionY))
    except KeyError:
        pass
    try:
        resolutionX = im.info["dpi"][0]
        resolutionY = im.info["dpi"][1]
        leptonica.pixSetResolution(pixs, int(resolutionX), int(resolutionY))
    except KeyError:
        pass
    return leptonica.pixEndianByteSwapNew(pixs)


def img_lepto_to_pil(pix, leptonica):
    """Convert leptonica pix to PIL.

    Source: https://stackoverflow.com/questions/55195932/typeerror-initializer-for-ctype-unsigned-int-must-be-a-cdata-pointer-not-b/57776268#57776268
    """
    from io import BytesIO

    cdata_ptr = ffi.new("l_uint8**")
    size_ptr = ffi.new("size_t*")
    # BMP rather than the TIFF used in the source answer: the minimal
    # leptonica build has no libtiff, but BMP needs no external library.
    leptonica.pixWriteMem(cdata_ptr, size_ptr, pix, leptonica.IFF_BMP)
    cdata = cdata_ptr[0]
    size = size_ptr[0]

    image_bytes = bytes(ffi.buffer(cdata, size))
    with BytesIO(image_bytes) as bytesio:
        pilimage = Image.open(bytesio).copy()
    return pilimage

def main():
    """Main loop."""
    # Settings
    tess_libname = r"F:/win64_msvc_min/bin/tesseract50.dll"
    lept_libname = r"F:/win64_msvc_min/bin/leptonica-1.81.0.dll"
    filename = "line.ppm"
    lang = "eng"

    tessdata = os.environ.get("TESSDATA_PREFIX")
    if not tessdata:
        # Use project tessdata
        tessdata = os.path.join(os.getcwd(), "tessdata")
        os.environ["TESSDATA_PREFIX"] = tessdata

    # Load libraries in ABI mode
    if os.path.exists(tess_libname):
        tesseract = ffi.dlopen(tess_libname)
    else:
        print(f"'{tess_libname}' does not exist!")
        return
    tesseract_version = ffi.string(tesseract.TessVersion())
    print("Tesseract-ocr version", tesseract_version.decode("utf-8"))

    if os.path.exists(lept_libname):
        leptonica = ffi.dlopen(lept_libname)
    else:
        print(f"'{lept_libname}' does not exist!")
        return
    leptonica_version = ffi.string(leptonica.getLeptonicaVersion())
    print(leptonica_version.decode("utf-8"))

    api = None
    # Read image to pix
    im = Image.open(filename)
    pix = pil2PIX32(im, leptonica)

    # Turn off leptonica messages (L_SEVERITY_NONE: print no messages)
    leptonica.setMsgSeverity(leptonica.L_SEVERITY_NONE)

    # Create tesseract API
    if api:
        tesseract.TessBaseAPIEnd(api)
        tesseract.TessBaseAPIDelete(api)
    api = tesseract.TessBaseAPICreate()
    oem = tesseract.OEM_DEFAULT
    tesseract.TessBaseAPIInit2(api, tessdata.encode(), lang.encode(), oem)
    tesseract.TessBaseAPISetPageSegMode(api, tesseract.PSM_AUTO)
    tesseract.TessBaseAPISetImage2(api, pix)
    # recognize is needed to get result iterator
    tesseract.TessBaseAPIRecognize(api, ffi.NULL)
    utf8_text = ffi.string(tesseract.TessBaseAPIGetUTF8Text(api)).decode("utf-8")
    print(utf8_text)

    # Delete api and pix
    if api:
        tesseract.TessBaseAPIEnd(api)
        tesseract.TessBaseAPIDelete(api)
    result = ffi.new("PIX**")
    result[0] = pix
    leptonica.pixDestroy(result)
    del pix
    del result
    api = None


if __name__ == "__main__":
    main()
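
pil2PIX32 finishes with pixEndianByteSwapNew because leptonica keeps a 32-bit pixel as a single word with red in the most significant byte, while PIL hands over the bytes in R, G, B, A order. The following sketch illustrates the difference on one pixel using only the standard library; compose_rgba is a hypothetical helper written for this illustration, not a binding to leptonica:

```python
import struct


def compose_rgba(r, g, b, a=255):
    """Pack one pixel with red in the most significant byte."""
    return (r << 24) | (g << 16) | (b << 8) | a


# One pixel as laid out by tobytes("raw", "RGBA"): R, G, B, A.
pil_bytes = bytes([0x11, 0x22, 0x33, 0xFF])

# What a little-endian CPU sees when it reads those bytes as one 32-bit word.
little = struct.unpack("<I", pil_bytes)[0]
# The same word after its bytes are swapped.
swapped = struct.unpack(">I", pil_bytes)[0]

print(hex(little))   # 0xff332211
print(hex(swapped))  # 0x112233ff
print(swapped == compose_rgba(0x11, 0x22, 0x33, 0xFF))  # True
```

Without the swap, red and alpha (and green and blue) would trade places on a little-endian machine, which is the usual case on Windows x64.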

© sk-spell project
