tesseract-ocr is an OCR engine originally developed by Hewlett-Packard and now sponsored by Google. It is highly accurate, reading a binary, grayscale, or color image and outputting text.
| Tags | OCR, Library, CLI |
|---|---|
| Licenses | Apache 2.0 |
| Operating Systems | Windows, Mac OS X, Linux |
| Implementation | C++ |
Recent releases


Release Notes: This release adds a C API, a new Visual Studio 2008 solution, right-to-left/Bidi capability in the output iterators for Hebrew and Arabic, paragraph detection in layout analysis and post-OCR processing, fixes for inconsistent x-height during training and for over-chopping, simultaneous multi-language capability, a refactored top-level word recognition module, an experimental equation detector, improved handling of resolution from input images, and a blamer module for error analysis. It also cleans up the externally used namespace by removing includes from baseapi.h.


Release Notes: This release adds thread safety, a recognizer for Arabic, PageIterator and ResultIterator, and more.


Release Notes: Preparations were made for thread safety. A major new page layout analysis module was added. hOCR output was added. Many more languages were added. Most function headers now carry Doxygen comments. Leptonica was added for main image I/O and handling.
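
The language and hOCR features mentioned in these notes are reachable from the command-line tool. The sketch below only assembles an argument list for the `tesseract` executable; it assumes the usual `tesseract imagename outputbase -l lang [configfile]` form, the `+` separator for combining languages, and the `hocr` config name. The helper `build_tesseract_command` and the file names `page.png` and `page` are hypothetical, chosen for illustration, and the `eng` and `heb` language data are assumed to be installed.

```python
def build_tesseract_command(image_path, output_base, languages=("eng",), hocr=False):
    """Build an argv list for the tesseract command-line tool (illustrative helper)."""
    if not languages:
        raise ValueError("at least one language code is required")
    # Several languages are combined with '+', for example "eng+heb".
    command = ["tesseract", image_path, output_base, "-l", "+".join(languages)]
    if hocr:
        # 'hocr' names a config file that switches the output to hOCR.
        command.append("hocr")
    return command


if __name__ == "__main__":
    # Hypothetical input and output names; nothing is executed here.
    cmd = build_tesseract_command("page.png", "page", languages=("eng", "heb"), hocr=True)
    print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run` on a machine where tesseract is installed; keeping command construction separate from execution makes the arguments easy to inspect first.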