pkgsrc/graphics/tesseract/PLIST


@comment $NetBSD: PLIST,v 1.9 2017/02/21 17:51:18 fhajny Exp $
Changes 3.02.02: * Moved ResultIterator/PageIterator to ccmain. * Added Right-to-left/Bidi capability in the output iterators for Hebrew/Arabic. * Added paragraph detection in layout analysis/post OCR. * Fixed inconsistent xheight during training and over-chopping. * Added simultaneous multi-language capability. * Refactored top-level word recognition module. * Added experimental equation detector. * Improved handling of resolution from input images. * Blamer module added for error analysis. * Cleaned up externally used namespace by removing includes from baseapi.h. * Removed dead memory management code. * Tidied up constraints on control parameters. * Added support for ShapeTable in classifier and training. * Refactored class pruner. * Fixed training leaks and randomness. * Major improvements to layout analysis for better image detection, diacritic detection, better textline finding, better tabstop finding. * Improved line detection and removal. * Added fixed pitch chopper for CJK. * Added UNICHARSET to WERD_CHOICE to make multi-language handling easier. * Fixed problems with internally scaled images. * Added page and bbox to string in tr files to identify source of training data better. * Fixes to Hindi Shiroreka splitter. * Added word bigram correction. * Reduced stack memory consumption and eliminated some ugly typedefs. * Added new uniform classifier API. * Added new training error counter. * Fixed endian bug in dawg reader. * Many other fixes, including the way in which the chopper finds chops and messes with the outline while it does so.
2014-10-02 18:06:02 +02:00
bin/ambiguous_words
bin/classifier_tester
bin/cntraining
bin/combine_tessdata
bin/dawg2wordlist
bin/mftraining
Update graphics/tesseract to 3.04.01. Move to new home at Github. Clean up. 2015-02-17 - V3.04.01 - Added OSD renderer for psm 0. Works for single page and multi-page images. - Improve tesstrain.sh script. - Simplify build and run of ScrollView. - Improved PDF output for OS X Preview utility. - INCOMPATIBLE fix to hOCR line height information - commit 134ebc3. - Added option to build Tesseract without Cube OCR engine (-DNO_CUBE_BUILD). - Enable OpenMP support. - Many bug fixes. 2015-07-11 - V3.04.00 - Tesseract development is now done with Git and hosted at github.com (Previously we used Subversion as a VCS and code.google.com for hosting). - Tesseract now requires leptonica 1.71 or a higher version. - Removed official support for VS 2008. - Added support for 39 additional scripts/languages, including: amh, asm, aze_cyrl, bod, bos, ceb, cym, dzo, fas, gle, guj, hat, iku, jav, kat, kat_old, kaz, khm, kir, kur, lao, lat, mar, mya, nep, ori, pan, pus, san, sin, srp_latn, syr, tgk, tir, uig, urd, uzb, uzb_cyrl, yid - Major updates to training system as a result of extensive testing on 100 languages. - New training data for over 100 languages - Improved performance with PIC compilation option. - Significant change to invisible font system in pdf output to improve correctness and compatibility with external programs, particularly ghostscript. - Improved font identification. - Major change to improve layout analysis for heavily diacritic languages: Thai, Vietnamese, Kannada, Telugu etc. - Fixed problems with shifted baselines so recognition can recover from layout analysis errors. - Major refactor to improve speed on difficult images, especially when running a heap checker. - Moved params from global in page layout to tesseractclass. - Improved single column layout analysis. - Allow ocr output to multiple formats using tesseract command line executable. - Fixed issues with mixed eng+ara scripts. - Improved script consistency in numbers. 
- Major refactor of control.cpp to enable line recognition. - Added tesstrain.sh - a master training script. - Added ability to text2image training tool to just list available fonts. - Added ability to text2image to underline words. - Improved efficiency of image processing for PDF output. - Added parameter description for each parameter listed with 'print-parameters' command line option. - Added font info to hOCR output. - Enabled streaming input and output of multi-page documents. - Many bug fixes. 2014-02-04 - V3.03(rc1) - Added new training tool text2image to generate box/tif file pairs from text and truetype fonts. - Added support for PDF output with searchable text. - Removed entire IMAGE class and all code in image directory. - Tesseract executable: support for output to stdout; limited support for one page images from stdin (especially on Windows) - Added Renderer to API to allow document-level processing and output of document formats, like hOCR, PDF. - Major refactor of word-level recognition, beam search, eliminating dead code. - Refactored classifier to make it easier to add new ones. - Generalized feature extractor to allow feature extraction from greyscale. - Improved sub/superscript treatment. - Improved baseline fit. - Added set_unicharset_properties to training tools. - Many bug fixes. - More training source data included.
2016-03-17 13:51:14 +01:00
bin/set_unicharset_properties
bin/shapeclustering
bin/tesseract
bin/text2image
bin/unicharset_extractor
bin/wordlist2dawg
include/tesseract/apitypes.h
include/tesseract/baseapi.h
include/tesseract/basedir.h
include/tesseract/capi.h
include/tesseract/errcode.h
include/tesseract/fileerr.h
include/tesseract/genericvector.h
include/tesseract/helpers.h
include/tesseract/host.h
include/tesseract/ltrresultiterator.h
include/tesseract/memry.h
include/tesseract/ndminx.h
include/tesseract/ocrclass.h
include/tesseract/osdetect.h
include/tesseract/pageiterator.h
include/tesseract/params.h
include/tesseract/platform.h
include/tesseract/publictypes.h
include/tesseract/renderer.h
include/tesseract/resultiterator.h
include/tesseract/serialis.h
include/tesseract/strngs.h
include/tesseract/tesscallback.h
include/tesseract/thresholder.h
include/tesseract/unichar.h
include/tesseract/unicharmap.h
include/tesseract/unicharset.h
lib/libtesseract.la
lib/pkgconfig/tesseract.pc
man/man1/ambiguous_words.1
man/man1/cntraining.1
man/man1/combine_tessdata.1
man/man1/dawg2wordlist.1
man/man1/mftraining.1
man/man1/shapeclustering.1
man/man1/tesseract.1
man/man1/unicharset_extractor.1
man/man1/wordlist2dawg.1
man/man5/unicharambigs.5
man/man5/unicharset.5
share/tessdata/afr.traineddata
Update graphics/tesseract to 3.04.01. Move to new home at Github. Clean up. 2015-02-17 - V3.04.01 - Added OSD renderer for psm 0. Works for single page and multi-page images. - Improve tesstrain.sh script. - Simplify build and run of ScrollView. - Improved PDF output for OS X Preview utility. - INCOMPATIBLE fix to hOCR line height information - commit 134ebc3. - Added option to build Tesseract without Cube OCR engine (-DNO_CUBE_BUILD). - Enable OpenMP support. - Many bug fixes. 2015-07-11 - V3.04.00 - Tesseract development is now done with Git and hosted at github.com (Previously we used Subversion as a VCS and code.google.com for hosting). - Tesseract now requires leptonica 1.71 or a higher version. - Removed official support for VS 2008. - Added support for 39 additional scripts/languages, including: amh, asm, aze_cyrl, bod, bos, ceb, cym, dzo, fas, gle, guj, hat, iku, jav, kat, kat_old, kaz, khm, kir, kur, lao, lat, mar, mya, nep, ori, pan, pus, san, sin, srp_latn, syr, tgk, tir, uig, urd, uzb, uzb_cyrl, yid - Major updates to training system as a result of extensive testing on 100 languages. - New training data for over 100 languages - Improved performance with PIC compilation option. - Significant change to invisible font system in pdf output to improve correctness and compatibility with external programs, particularly ghostscript. - Improved font identification. - Major change to improve layout analysis for heavily diacritic languages: Thai, Vietnamese, Kannada, Telugu etc. - Fixed problems with shifted baselines so recognition can recover from layout analysis errors. - Major refactor to improve speed on difficult images, especially when running a heap checker. - Moved params from global in page layout to tesseractclass. - Improved single column layout analysis. - Allow ocr output to multiple formats using tesseract command line executable. - Fixed issues with mixed eng+ara scripts. - Improved script consistency in numbers. 
- Major refactor of control.cpp to enable line recognition. - Added tesstrain.sh - a master training script. - Added ability to text2image training tool to just list available fonts. - Added ability to text2image to underline words. - Improved efficiency of image processing for PDF output. - Added parameter description for each parameter listed with 'print-parameters' command line option. - Added font info to hOCR output. - Enabled streaming input and output of multi-page documents. - Many bug fixes. 2014-02-04 - V3.03(rc1) - Added new training tool text2image to generate box/tif file pairs from text and truetype fonts. - Added support for PDF output with searchable text. - Removed entire IMAGE class and all code in image directory. - Tesseract executable: support for output to stdout; limited support for one page images from stdin (especially on Windows) - Added Renderer to API to allow document-level processing and output of document formats, like hOCR, PDF. - Major refactor of word-level recognition, beam search, eliminating dead code. - Refactored classifier to make it easier to add new ones. - Generalized feature extractor to allow feature extraction from greyscale. - Improved sub/superscript treatment. - Improved baseline fit. - Added set_unicharset_properties to training tools. - Many bug fixes. - More training source data included.
2016-03-17 13:51:14 +01:00
share/tessdata/amh.traineddata
share/tessdata/ara.cube.bigrams
share/tessdata/ara.cube.fold
share/tessdata/ara.cube.lm
share/tessdata/ara.cube.nn
share/tessdata/ara.cube.params
share/tessdata/ara.cube.size
share/tessdata/ara.cube.word-freq
share/tessdata/ara.traineddata
share/tessdata/asm.traineddata
share/tessdata/aze.traineddata
share/tessdata/aze_cyrl.traineddata
share/tessdata/bel.traineddata
share/tessdata/ben.traineddata
share/tessdata/bod.traineddata
share/tessdata/bos.traineddata
share/tessdata/bul.traineddata
share/tessdata/cat.traineddata
share/tessdata/ceb.traineddata
share/tessdata/ces.traineddata
share/tessdata/chi_sim.traineddata
share/tessdata/chi_tra.traineddata
share/tessdata/chr.traineddata
share/tessdata/configs/ambigs.train
share/tessdata/configs/api_config
share/tessdata/configs/bigram
share/tessdata/configs/box.train
share/tessdata/configs/box.train.stderr
share/tessdata/configs/digits
share/tessdata/configs/hocr
share/tessdata/configs/inter
share/tessdata/configs/kannada
share/tessdata/configs/linebox
share/tessdata/configs/logfile
share/tessdata/configs/makebox
share/tessdata/configs/pdf
share/tessdata/configs/quiet
share/tessdata/configs/rebox
share/tessdata/configs/strokewidth
share/tessdata/configs/tsv
share/tessdata/configs/txt
share/tessdata/configs/unlv
share/tessdata/cym.traineddata
share/tessdata/dan.traineddata
share/tessdata/dan_frak.traineddata
share/tessdata/deu.traineddata
share/tessdata/deu_frak.traineddata
share/tessdata/dzo.traineddata
share/tessdata/ell.traineddata
share/tessdata/eng.cube.bigrams
share/tessdata/eng.cube.fold
share/tessdata/eng.cube.lm
share/tessdata/eng.cube.nn
share/tessdata/eng.cube.params
share/tessdata/eng.cube.size
share/tessdata/eng.cube.word-freq
share/tessdata/eng.tesseract_cube.nn
share/tessdata/eng.traineddata
share/tessdata/eng.user-patterns
share/tessdata/eng.user-words
share/tessdata/enm.traineddata
share/tessdata/epo.traineddata
share/tessdata/equ.traineddata
share/tessdata/est.traineddata
share/tessdata/eus.traineddata
share/tessdata/fas.traineddata
share/tessdata/fin.traineddata
share/tessdata/fra.cube.bigrams
share/tessdata/fra.cube.fold
share/tessdata/fra.cube.lm
share/tessdata/fra.cube.nn
share/tessdata/fra.cube.params
share/tessdata/fra.cube.size
share/tessdata/fra.cube.word-freq
share/tessdata/fra.tesseract_cube.nn
share/tessdata/fra.traineddata
share/tessdata/frk.traineddata
share/tessdata/frm.traineddata
share/tessdata/gle.traineddata
share/tessdata/glg.traineddata
share/tessdata/grc.traineddata
share/tessdata/guj.traineddata
share/tessdata/hat.traineddata
share/tessdata/heb.traineddata
share/tessdata/hin.cube.bigrams
share/tessdata/hin.cube.fold
share/tessdata/hin.cube.lm
share/tessdata/hin.cube.nn
share/tessdata/hin.cube.params
share/tessdata/hin.cube.word-freq
share/tessdata/hin.tesseract_cube.nn
share/tessdata/hin.traineddata
share/tessdata/hrv.traineddata
share/tessdata/hun.traineddata
share/tessdata/iku.traineddata
share/tessdata/ind.traineddata
share/tessdata/isl.traineddata
share/tessdata/ita.cube.bigrams
share/tessdata/ita.cube.fold
share/tessdata/ita.cube.lm
share/tessdata/ita.cube.nn
share/tessdata/ita.cube.params
share/tessdata/ita.cube.size
share/tessdata/ita.cube.word-freq
share/tessdata/ita.tesseract_cube.nn
share/tessdata/ita.traineddata
share/tessdata/ita_old.traineddata
share/tessdata/jav.traineddata
share/tessdata/jpn.traineddata
share/tessdata/kan.traineddata
share/tessdata/kat.traineddata
share/tessdata/kat_old.traineddata
share/tessdata/kaz.traineddata
share/tessdata/khm.traineddata
share/tessdata/kir.traineddata
share/tessdata/kor.traineddata
share/tessdata/kur.traineddata
share/tessdata/lao.traineddata
share/tessdata/lat.traineddata
share/tessdata/lav.traineddata
share/tessdata/lit.traineddata
share/tessdata/mal.traineddata
share/tessdata/mar.traineddata
share/tessdata/mkd.traineddata
share/tessdata/mlt.traineddata
share/tessdata/msa.traineddata
share/tessdata/mya.traineddata
share/tessdata/nep.traineddata
share/tessdata/nld.traineddata
share/tessdata/nor.traineddata
share/tessdata/ori.traineddata
share/tessdata/osd.traineddata
share/tessdata/pan.traineddata
share/tessdata/pdf.ttf
share/tessdata/pol.traineddata
share/tessdata/por.traineddata
share/tessdata/pus.traineddata
share/tessdata/ron.traineddata
share/tessdata/rus.cube.fold
share/tessdata/rus.cube.lm
share/tessdata/rus.cube.nn
share/tessdata/rus.cube.params
share/tessdata/rus.cube.size
share/tessdata/rus.cube.word-freq
share/tessdata/rus.traineddata
share/tessdata/san.traineddata
share/tessdata/sin.traineddata
share/tessdata/slk.traineddata
share/tessdata/slk_frak.traineddata
share/tessdata/slv.traineddata
share/tessdata/spa.cube.bigrams
share/tessdata/spa.cube.fold
share/tessdata/spa.cube.lm
share/tessdata/spa.cube.nn
share/tessdata/spa.cube.params
share/tessdata/spa.cube.size
share/tessdata/spa.cube.word-freq
share/tessdata/spa.traineddata
share/tessdata/spa_old.traineddata
share/tessdata/sqi.traineddata
share/tessdata/srp.traineddata
share/tessdata/srp_latn.traineddata
share/tessdata/swa.traineddata
share/tessdata/swe.traineddata
share/tessdata/syr.traineddata
share/tessdata/tam.traineddata
share/tessdata/tel.traineddata
share/tessdata/tessconfigs/batch
share/tessdata/tessconfigs/batch.nochop
share/tessdata/tessconfigs/matdemo
share/tessdata/tessconfigs/msdemo
share/tessdata/tessconfigs/nobatch
share/tessdata/tessconfigs/segdemo
share/tessdata/tgk.traineddata
share/tessdata/tgl.traineddata
share/tessdata/tha.traineddata
share/tessdata/tir.traineddata
share/tessdata/tur.traineddata
share/tessdata/uig.traineddata
share/tessdata/ukr.traineddata
share/tessdata/urd.traineddata
share/tessdata/uzb.traineddata
share/tessdata/uzb_cyrl.traineddata
share/tessdata/vie.traineddata
share/tessdata/yid.traineddata