@comment $NetBSD: PLIST,v 1.12 2019/07/08 18:37:03 adam Exp $
Changes 3.02.02:
* Moved ResultIterator/PageIterator to ccmain.
* Added Right-to-left/Bidi capability in the output iterators for Hebrew/Arabic.
* Added paragraph detection in layout analysis/post OCR.
* Fixed inconsistent xheight during training and over-chopping.
* Added simultaneous multi-language capability.
* Refactored top-level word recognition module.
* Added experimental equation detector.
* Improved handling of resolution from input images.
* Blamer module added for error analysis.
* Cleaned up externally used namespace by removing includes from baseapi.h.
* Removed dead memory management code.
* Tidied up constraints on control parameters.
* Added support for ShapeTable in classifier and training.
* Refactored class pruner.
* Fixed training leaks and randomness.
* Major improvements to layout analysis for better image detection, diacritic detection, better textline finding, better tabstop finding.
* Improved line detection and removal.
* Added fixed pitch chopper for CJK.
* Added UNICHARSET to WERD_CHOICE to make multi-language handling easier.
* Fixed problems with internally scaled images.
* Added page and bbox to string in tr files to identify source of training data better.
* Fixes to Hindi Shiroreka splitter.
* Added word bigram correction.
* Reduced stack memory consumption and eliminated some ugly typedefs.
* Added new uniform classifier API.
* Added new training error counter.
* Fixed endian bug in dawg reader.
* Many other fixes, including the way in which the chopper finds chops and messes with the outline while it does so.
2014-10-02 18:06:02 +02:00
bin/ambiguous_words
bin/classifier_tester
bin/cntraining
tesseract: updated to 4.0.0

V4.0.0:

New OCR engine
- Added a new OCR engine that uses neural network system based on LSTMs, with major accuracy gains.
- This includes new training tools for the LSTM OCR engine. A new model can be trained from scratch or by fine tuning an existing model.
- Added trained data that includes LSTM models to 123 languages.
- Added optional accelerated code paths for the LSTM recognizer:
  * Using OpenMP
  * Using SIMD: AVX2 / AVX / SSE4.1
- Added a new parameter lstm_choice_mode that allows to include alternative symbol choices in the hOCR output.
- The new LSTM engine still does not support all features from the old legacy engine (see missing features).

Other OCR engines
- The pattern matching OCR engine that was the primary OCR engine in previous versions is still available in this version.
- Removed the 'Cube' OCR engine from the codebase. It was used for Hindi and for Arabic. The New LSTM engine performs much better, thus the Cube engine was no longer needed.

Updated build system
- Tesseract now uses semantic versioning.
- Tesseract now requires Leptonica 1.74.0 or a higher version.
- For building Tesseract from source code, a compiler with good C++ 11 support is required. See here for a list of officially supported compilers.
- Added unit tests to the main repo. The unit tests require Git submodules and the code for training.
- Added an option to compile Tesseract without the code of the legacy OCR engine.
- Update minimum required autoconf version to 2.63.
- Training tools dependencies - Update minimum required versions: ICU 52.1, Pango 1.22.0.
- Reorganized Tesseract's source tree. Most sources are now below the src directory.

Bug fixes and enhancements
- Fixed many issues that triggered compiler warnings.
- Fixed many issues reported by Coverity Scan or LGTM.
- Fixes to trainingdata rendering.
- Fixed damage to binary images when processing PDFs.
- Don't trigger a deliberate segmentation fault for fatal errors in release code.
- Fixed some issues in OpenCL code. OpenCL now works for the legacy Tesseract OCR engine, but does not improve the performance. It is not implemented for the LSTM OCR engine.
- Improved multi-page TIFF handling.
- Improvements to PDF rendering.
- Added version information and improved help texts to the training tools.
- Added faster version of log2().
- Documented in tesseract man page the option to use an input text file which contains lists of images.
- Made 'osd' the default traineddata when psm 0 is requested (currently this feature is only implemented in the command line interface, but not in the API).
- Removed tessedit_pageseg_mode 1 from hocr, pdf, and tsv config files. The user should explicitly use --psm 1 if that is desired.
- The list of available languages and scripts is now sorted alphabetically.
- Parameter unlv_tilde_crunching changed to false, because of default values cause issues in cases of unlv output in Tesseract 4.
- Removed obsolete code.
2018-11-03 10:13:07 +01:00
bin/combine_lang_model
bin/combine_tessdata
bin/dawg2wordlist
bin/language-specific.sh
bin/lstmeval
bin/lstmtraining
bin/merge_unicharsets
bin/mftraining
Update graphics/tesseract to 3.04.01. Move to new home at Github. Clean up.

2015-02-17 - V3.04.01
- Added OSD renderer for psm 0. Works for single page and multi-page images.
- Improve tesstrain.sh script.
- Simplify build and run of ScrollView.
- Improved PDF output for OS X Preview utility.
- INCOMPATIBLE fix to hOCR line height information - commit 134ebc3.
- Added option to build Tesseract without Cube OCR engine (-DNO_CUBE_BUILD).
- Enable OpenMP support.
- Many bug fixes.

2015-07-11 - V3.04.00
- Tesseract development is now done with Git and hosted at github.com (Previously we used Subversion as a VCS and code.google.com for hosting).
- Tesseract now requires leptonica 1.71 or a higher version.
- Removed official support for VS 2008.
- Added support for 39 additional scripts/languages, including: amh, asm, aze_cyrl, bod, bos, ceb, cym, dzo, fas, gle, guj, hat, iku, jav, kat, kat_old, kaz, khm, kir, kur, lao, lat, mar, mya, nep, ori, pan, pus, san, sin, srp_latn, syr, tgk, tir, uig, urd, uzb, uzb_cyrl, yid
- Major updates to training system as a result of extensive testing on 100 languages.
- New training data for over 100 languages
- Improved performance with PIC compilation option.
- Significant change to invisible font system in pdf output to improve correctness and compatibility with external programs, particularly ghostscript.
- Improved font identification.
- Major change to improve layout analysis for heavily diacritic languages: Thai, Vietnamese, Kannada, Telugu etc.
- Fixed problems with shifted baselines so recognition can recover from layout analysis errors.
- Major refactor to improve speed on difficult images, especially when running a heap checker.
- Moved params from global in page layout to tesseractclass.
- Improved single column layout analysis.
- Allow ocr output to multiple formats using tesseract command line executable.
- Fixed issues with mixed eng+ara scripts.
- Improved script consistency in numbers.
- Major refactor of control.cpp to enable line recognition.
- Added tesstrain.sh - a master training script.
- Added ability to text2image training tool to just list available fonts.
- Added ability to text2image to underline words.
- Improved efficiency of image processing for PDF output.
- Added parameter description for each parameter listed with 'print-parameters' command line option.
- Added font info to hOCR output.
- Enabled streaming input and output of multi-page documents.
- Many bug fixes.

2014-02-04 - V3.03(rc1)
- Added new training tool text2image to generate box/tif file pairs from text and truetype fonts.
- Added support for PDF output with searchable text.
- Removed entire IMAGE class and all code in image directory.
- Tesseract executable: support for output to stdout; limited support for one page images from stdin (especially on Windows)
- Added Renderer to API to allow document-level processing and output of document formats, like hOCR, PDF.
- Major refactor of word-level recognition, beam search, eliminating dead code.
- Refactored classifier to make it easier to add new ones.
- Generalized feature extractor to allow feature extraction from greyscale.
- Improved sub/superscript treatment.
- Improved baseline fit.
- Added set_unicharset_properties to training tools.
- Many bug fixes.
- More training source data included.
2016-03-17 13:51:14 +01:00
bin/set_unicharset_properties
bin/shapeclustering
bin/tesseract
bin/tesstrain.sh
bin/tesstrain_utils.sh
bin/text2image
bin/unicharset_extractor
bin/wordlist2dawg
include/tesseract/apitypes.h
include/tesseract/baseapi.h
include/tesseract/capi.h
include/tesseract/genericvector.h
include/tesseract/helpers.h
include/tesseract/ltrresultiterator.h
include/tesseract/ocrclass.h
include/tesseract/osdetect.h
include/tesseract/pageiterator.h
include/tesseract/platform.h
include/tesseract/publictypes.h
include/tesseract/renderer.h
include/tesseract/resultiterator.h
include/tesseract/serialis.h
include/tesseract/strngs.h
include/tesseract/tess_version.h
include/tesseract/tesscallback.h
include/tesseract/thresholder.h
include/tesseract/unichar.h
lib/libtesseract.la
lib/pkgconfig/tesseract.pc
man/man1/ambiguous_words.1
man/man1/classifier_tester.1
man/man1/cntraining.1
man/man1/combine_lang_model.1
man/man1/combine_tessdata.1
man/man1/dawg2wordlist.1
man/man1/lstmeval.1
man/man1/lstmtraining.1
man/man1/merge_unicharsets.1
man/man1/mftraining.1
man/man1/set_unicharset_properties.1
man/man1/shapeclustering.1
man/man1/tesseract.1
man/man1/text2image.1
man/man1/unicharset_extractor.1
man/man1/wordlist2dawg.1
man/man5/unicharambigs.5
man/man5/unicharset.5
share/tessdata/afr.traineddata
share/tessdata/amh.traineddata
share/tessdata/ara.traineddata
share/tessdata/asm.traineddata
share/tessdata/aze.traineddata
share/tessdata/aze_cyrl.traineddata
share/tessdata/bel.traineddata
share/tessdata/ben.traineddata
share/tessdata/bod.traineddata
share/tessdata/bos.traineddata
share/tessdata/bre.traineddata
share/tessdata/bul.traineddata
share/tessdata/cat.traineddata
Update graphics/tesseract to 3.04.01. Move to new home at Github. Clean up. 2015-02-17 - V3.04.01 - Added OSD renderer for psm 0. Works for single page and multi-page images. - Improve tesstrain.sh script. - Simplify build and run of ScrollView. - Improved PDF output for OS X Preview utility. - INCOMPATIBLE fix to hOCR line height information - commit 134ebc3. - Added option to build Tesseract without Cube OCR engine (-DNO_CUBE_BUILD). - Enable OpenMP support. - Many bug fixes. 2015-07-11 - V3.04.00 - Tesseract development is now done with Git and hosted at github.com (Previously we used Subversion as a VCS and code.google.com for hosting). - Tesseract now requires leptonica 1.71 or a higher version. - Removed official support for VS 2008. - Added support for 39 additional scripts/languages, including: amh, asm, aze_cyrl, bod, bos, ceb, cym, dzo, fas, gle, guj, hat, iku, jav, kat, kat_old, kaz, khm, kir, kur, lao, lat, mar, mya, nep, ori, pan, pus, san, sin, srp_latn, syr, tgk, tir, uig, urd, uzb, uzb_cyrl, yid - Major updates to training system as a result of extensive testing on 100 languages. - New training data for over 100 languages - Improved performance with PIC compilation option. - Significant change to invisible font system in pdf output to improve correctness and compatibility with external programs, particularly ghostscript. - Improved font identification. - Major change to improve layout analysis for heavily diacritic languages: Thai, Vietnamese, Kannada, Telugu etc. - Fixed problems with shifted baselines so recognition can recover from layout analysis errors. - Major refactor to improve speed on difficult images, especially when running a heap checker. - Moved params from global in page layout to tesseractclass. - Improved single column layout analysis. - Allow ocr output to multiple formats using tesseract command line executable. - Fixed issues with mixed eng+ara scripts. - Improved script consistency in numbers. 
- Major refactor of control.cpp to enable line recognition. - Added tesstrain.sh - a master training script. - Added ability to text2image training tool to just list available fonts. - Added ability to text2image to underline words. - Improved efficiency of image processing for PDF output. - Added parameter description for each parameter listed with 'print-parameters' command line option. - Added font info to hOCR output. - Enabled streaming input and output of multi-page documents. - Many bug fixes. 2014-02-04 - V3.03(rc1) - Added new training tool text2image to generate box/tif file pairs from text and truetype fonts. - Added support for PDF output with searchable text. - Removed entire IMAGE class and all code in image directory. - Tesseract executable: support for output to stdout; limited support for one page images from stdin (especially on Windows) - Added Renderer to API to allow document-level processing and output of document formats, like hOCR, PDF. - Major refactor of word-level recognition, beam search, eliminating dead code. - Refactored classifier to make it easier to add new ones. - Generalized feature extractor to allow feature extraction from greyscale. - Improved sub/superscript treatment. - Improved baseline fit. - Added set_unicharset_properties to training tools. - Many bug fixes. - More training source data included.
2016-03-17 13:51:14 +01:00
share/tessdata/ceb.traineddata
share/tessdata/ces.traineddata
share/tessdata/chi_sim.traineddata
share/tessdata/chi_sim_vert.traineddata
share/tessdata/chi_tra.traineddata
share/tessdata/chi_tra_vert.traineddata
share/tessdata/chr.traineddata
share/tessdata/configs/alto
share/tessdata/configs/ambigs.train
share/tessdata/configs/api_config
share/tessdata/configs/bigram
share/tessdata/configs/box.train
share/tessdata/configs/box.train.stderr
share/tessdata/configs/digits
share/tessdata/configs/get.images
share/tessdata/configs/hocr
share/tessdata/configs/inter
share/tessdata/configs/kannada
share/tessdata/configs/linebox
share/tessdata/configs/logfile
share/tessdata/configs/lstm.train
share/tessdata/configs/lstmbox
share/tessdata/configs/lstmdebug
share/tessdata/configs/makebox
Update graphics/tesseract to 3.04.01. Move to new home at Github. Clean up. 2015-02-17 - V3.04.01 - Added OSD renderer for psm 0. Works for single page and multi-page images. - Improve tesstrain.sh script. - Simplify build and run of ScrollView. - Improved PDF output for OS X Preview utility. - INCOMPATIBLE fix to hOCR line height information - commit 134ebc3. - Added option to build Tesseract without Cube OCR engine (-DNO_CUBE_BUILD). - Enable OpenMP support. - Many bug fixes. 2015-07-11 - V3.04.00 - Tesseract development is now done with Git and hosted at github.com (Previously we used Subversion as a VCS and code.google.com for hosting). - Tesseract now requires leptonica 1.71 or a higher version. - Removed official support for VS 2008. - Added support for 39 additional scripts/languages, including: amh, asm, aze_cyrl, bod, bos, ceb, cym, dzo, fas, gle, guj, hat, iku, jav, kat, kat_old, kaz, khm, kir, kur, lao, lat, mar, mya, nep, ori, pan, pus, san, sin, srp_latn, syr, tgk, tir, uig, urd, uzb, uzb_cyrl, yid - Major updates to training system as a result of extensive testing on 100 languages. - New training data for over 100 languages - Improved performance with PIC compilation option. - Significant change to invisible font system in pdf output to improve correctness and compatibility with external programs, particularly ghostscript. - Improved font identification. - Major change to improve layout analysis for heavily diacritic languages: Thai, Vietnamese, Kannada, Telugu etc. - Fixed problems with shifted baselines so recognition can recover from layout analysis errors. - Major refactor to improve speed on difficult images, especially when running a heap checker. - Moved params from global in page layout to tesseractclass. - Improved single column layout analysis. - Allow ocr output to multiple formats using tesseract command line executable. - Fixed issues with mixed eng+ara scripts. - Improved script consistency in numbers. 
- Major refactor of control.cpp to enable line recognition. - Added tesstrain.sh - a master training script. - Added ability to text2image training tool to just list available fonts. - Added ability to text2image to underline words. - Improved efficiency of image processing for PDF output. - Added parameter description for each parameter listed with 'print-parameters' command line option. - Added font info to hOCR output. - Enabled streaming input and output of multi-page documents. - Many bug fixes. 2014-02-04 - V3.03(rc1) - Added new training tool text2image to generate box/tif file pairs from text and truetype fonts. - Added support for PDF output with searchable text. - Removed entire IMAGE class and all code in image directory. - Tesseract executable: support for output to stdout; limited support for one page images from stdin (especially on Windows) - Added Renderer to API to allow document-level processing and output of document formats, like hOCR, PDF. - Major refactor of word-level recognition, beam search, eliminating dead code. - Refactored classifier to make it easier to add new ones. - Generalized feature extractor to allow feature extraction from greyscale. - Improved sub/superscript treatment. - Improved baseline fit. - Added set_unicharset_properties to training tools. - Many bug fixes. - More training source data included.
2016-03-17 13:51:14 +01:00
share/tessdata/configs/pdf
share/tessdata/configs/quiet
share/tessdata/configs/rebox
share/tessdata/configs/strokewidth
share/tessdata/configs/tsv
share/tessdata/configs/txt
share/tessdata/configs/unlv
share/tessdata/configs/wordstrbox
share/tessdata/cos.traineddata
share/tessdata/cym.traineddata
share/tessdata/dan.traineddata
share/tessdata/dan_frak.traineddata
share/tessdata/deu.traineddata
share/tessdata/deu_frak.traineddata
share/tessdata/div.traineddata
share/tessdata/dzo.traineddata
share/tessdata/ell.traineddata
share/tessdata/eng.traineddata
share/tessdata/eng.user-patterns
share/tessdata/eng.user-words
share/tessdata/enm.traineddata
share/tessdata/epo.traineddata
share/tessdata/equ.traineddata
share/tessdata/est.traineddata
share/tessdata/eus.traineddata
share/tessdata/fao.traineddata
share/tessdata/fas.traineddata
share/tessdata/fil.traineddata
share/tessdata/fin.traineddata
share/tessdata/fra.traineddata
share/tessdata/frk.traineddata
share/tessdata/frm.traineddata
share/tessdata/fry.traineddata
share/tessdata/gla.traineddata
share/tessdata/gle.traineddata
share/tessdata/glg.traineddata
share/tessdata/grc.traineddata
share/tessdata/guj.traineddata
share/tessdata/hat.traineddata
share/tessdata/heb.traineddata
share/tessdata/hin.traineddata
share/tessdata/hrv.traineddata
share/tessdata/hun.traineddata
share/tessdata/hye.traineddata
Update graphics/tesseract to 3.04.01. Move to new home at Github. Clean up. 2015-02-17 - V3.04.01 - Added OSD renderer for psm 0. Works for single page and multi-page images. - Improve tesstrain.sh script. - Simplify build and run of ScrollView. - Improved PDF output for OS X Preview utility. - INCOMPATIBLE fix to hOCR line height information - commit 134ebc3. - Added option to build Tesseract without Cube OCR engine (-DNO_CUBE_BUILD). - Enable OpenMP support. - Many bug fixes. 2015-07-11 - V3.04.00 - Tesseract development is now done with Git and hosted at github.com (Previously we used Subversion as a VCS and code.google.com for hosting). - Tesseract now requires leptonica 1.71 or a higher version. - Removed official support for VS 2008. - Added support for 39 additional scripts/languages, including: amh, asm, aze_cyrl, bod, bos, ceb, cym, dzo, fas, gle, guj, hat, iku, jav, kat, kat_old, kaz, khm, kir, kur, lao, lat, mar, mya, nep, ori, pan, pus, san, sin, srp_latn, syr, tgk, tir, uig, urd, uzb, uzb_cyrl, yid - Major updates to training system as a result of extensive testing on 100 languages. - New training data for over 100 languages - Improved performance with PIC compilation option. - Significant change to invisible font system in pdf output to improve correctness and compatibility with external programs, particularly ghostscript. - Improved font identification. - Major change to improve layout analysis for heavily diacritic languages: Thai, Vietnamese, Kannada, Telugu etc. - Fixed problems with shifted baselines so recognition can recover from layout analysis errors. - Major refactor to improve speed on difficult images, especially when running a heap checker. - Moved params from global in page layout to tesseractclass. - Improved single column layout analysis. - Allow ocr output to multiple formats using tesseract command line executable. - Fixed issues with mixed eng+ara scripts. - Improved script consistency in numbers. 
- Major refactor of control.cpp to enable line recognition. - Added tesstrain.sh - a master training script. - Added ability to text2image training tool to just list available fonts. - Added ability to text2image to underline words. - Improved efficiency of image processing for PDF output. - Added parameter description for each parameter listed with 'print-parameters' command line option. - Added font info to hOCR output. - Enabled streaming input and output of multi-page documents. - Many bug fixes. 2014-02-04 - V3.03(rc1) - Added new training tool text2image to generate box/tif file pairs from text and truetype fonts. - Added support for PDF output with searchable text. - Removed entire IMAGE class and all code in image directory. - Tesseract executable: support for output to stdout; limited support for one page images from stdin (especially on Windows) - Added Renderer to API to allow document-level processing and output of document formats, like hOCR, PDF. - Major refactor of word-level recognition, beam search, eliminating dead code. - Refactored classifier to make it easier to add new ones. - Generalized feature extractor to allow feature extraction from greyscale. - Improved sub/superscript treatment. - Improved baseline fit. - Added set_unicharset_properties to training tools. - Many bug fixes. - More training source data included.
2016-03-17 13:51:14 +01:00
share/tessdata/iku.traineddata
share/tessdata/ind.traineddata
share/tessdata/isl.traineddata
share/tessdata/ita.traineddata
share/tessdata/ita_old.traineddata
share/tessdata/jav.traineddata
share/tessdata/jpn.traineddata
share/tessdata/jpn_vert.traineddata
share/tessdata/kan.traineddata
share/tessdata/kat.traineddata
share/tessdata/kat_old.traineddata
share/tessdata/kaz.traineddata
share/tessdata/khm.traineddata
share/tessdata/kir.traineddata
share/tessdata/kor.traineddata
share/tessdata/kor_vert.traineddata
share/tessdata/kur.traineddata
share/tessdata/kur_ara.traineddata
share/tessdata/lao.traineddata
share/tessdata/lat.traineddata
share/tessdata/lav.traineddata
share/tessdata/lit.traineddata
share/tessdata/ltz.traineddata
share/tessdata/mal.traineddata
share/tessdata/mar.traineddata
share/tessdata/mkd.traineddata
share/tessdata/mlt.traineddata
share/tessdata/mon.traineddata
share/tessdata/mri.traineddata
share/tessdata/msa.traineddata
share/tessdata/mya.traineddata
share/tessdata/nep.traineddata
share/tessdata/nld.traineddata
share/tessdata/nor.traineddata
share/tessdata/oci.traineddata
share/tessdata/ori.traineddata
share/tessdata/osd.traineddata
share/tessdata/pan.traineddata
share/tessdata/pdf.ttf
share/tessdata/pol.traineddata
share/tessdata/por.traineddata
share/tessdata/pus.traineddata
share/tessdata/que.traineddata
share/tessdata/ron.traineddata
share/tessdata/rus.traineddata
share/tessdata/san.traineddata
share/tessdata/sin.traineddata
share/tessdata/slk.traineddata
share/tessdata/slk_frak.traineddata
share/tessdata/slv.traineddata
share/tessdata/snd.traineddata
share/tessdata/spa.traineddata
share/tessdata/spa_old.traineddata
share/tessdata/sqi.traineddata
share/tessdata/srp.traineddata
share/tessdata/srp_latn.traineddata
share/tessdata/sun.traineddata
share/tessdata/swa.traineddata
share/tessdata/swe.traineddata
share/tessdata/syr.traineddata
share/tessdata/tam.traineddata
share/tessdata/tat.traineddata
share/tessdata/tel.traineddata
share/tessdata/tessconfigs/batch
share/tessdata/tessconfigs/batch.nochop
share/tessdata/tessconfigs/matdemo
share/tessdata/tessconfigs/msdemo
share/tessdata/tessconfigs/nobatch
share/tessdata/tessconfigs/segdemo
share/tessdata/tgk.traineddata
share/tessdata/tgl.traineddata
share/tessdata/tha.traineddata
share/tessdata/tir.traineddata
share/tessdata/ton.traineddata
share/tessdata/tur.traineddata
share/tessdata/uig.traineddata
share/tessdata/ukr.traineddata
share/tessdata/urd.traineddata
share/tessdata/uzb.traineddata
share/tessdata/uzb_cyrl.traineddata
share/tessdata/vie.traineddata
share/tessdata/yid.traineddata
share/tessdata/yor.traineddata