pkgsrc

Author	SHA1	Message	Date
adam	ccf89ed63f	Changes 3.02.02: * Moved ResultIterator/PageIterator to ccmain. * Added Right-to-left/Bidi capability in the output iterators for Hebrew/Arabic. * Added paragraph detection in layout analysis/post OCR. * Fixed inconsistent xheight during training and over-chopping. * Added simultaneous multi-language capability. * Refactored top-level word recognition module. * Added experimental equation detector. * Improved handling of resolution from input images. * Blamer module added for error analysis. * Cleaned up externally used namespace by removing includes from baseapi.h. * Removed dead memory management code. * Tidied up constraints on control parameters. * Added support for ShapeTable in classifier and training. * Refactored class pruner. * Fixed training leaks and randomness. * Major improvements to layout analysis for better image detection, diacritic detection, better textline finding, better tabstop finding. * Improved line detection and removal. * Added fixed pitch chopper for CJK. * Added UNICHARSET to WERD_CHOICE to make multi-language handling easier. * Fixed problems with internally scaled images. * Added page and bbox to string in tr files to identify source of training data better. * Fixes to Hindi Shiroreka splitter. * Added word bigram correction. * Reduced stack memory consumption and eliminated some ugly typedefs. * Added new uniform classifier API. * Added new training error counter. * Fixed endian bug in dawg reader. * Many other fixes, including the way in which the chopper finds chops and messes with the outline while it does so.	2014-10-02 16:06:02 +00:00
joerg	3769fa0bfc	Add a number of includes hidden by libstdc++'s name space pollution.	2013-04-29 21:31:09 +00:00
marino	f50901c296	graphics/tesseract: #include <unistd.h> Fixes out-of-scope errors seen on gcc 4.7.x	2012-11-23 23:52:33 +00:00
dholland	2313ef244d	Add missing <stdio.h>, should fix or improve linux build	2011-11-14 02:44:40 +00:00
wiz	d417e89789	Update to 2.04. Set LICENSE. June 30 2009 - V2.04 Integrated bug fixes and patches and misc changes for portability. Integrated a patch to remove some of the "access" macros. Removed dependence on lua from the viewer, speeding it up dramatically. Fixed the viewer so it compiles and runs properly! Specifically fixing issues: 1, 63, 67, 71, 76, 81, 82, 106, 111, 112, 128, 129, 130, 133, 135, 142, 143, 145, 147, 153, 154, 160, 165, 170, 175, 177, 187, 192, 195, 199, 201, 205, 209, 108, 169	2009-07-22 20:57:47 +00:00
brook	716ab7596a	Add language-specific data sets distributed by the project. The tesseract distribution itself just creates dummy, placeholder data sets that cannot be used.	2009-07-21 16:00:19 +00:00
wiz	4358c8cac0	Replace patch-ab with a post-extract rule. No change to the binary package, just one file less in pkgsrc ;)	2008-10-30 22:12:59 +00:00
wiz	b4a554e958	Update to 2.03: January 23 2008 - V2.02 Improvements to clustering, training and classifier. Major internationalization improvements for large-character-set languages, eg Kannada. Removed some compiler warnings. Added multipage tiff support for training and running. Updated graphics output to talk to new java-based viewer. Added ability to save n-best lists. Added leptonica support for more file types. Improved Init/End to make them safe. Reduced memory use of dictionaries. Added some new APIs to TessBaseAPI. April 21 2008 - V2.02 (again) Fixed namespace collisions with jpeg library (INT32). Portability fixes for Windows for new code. Updates to autoconf system for new code. April 22 2008 - V2.03 Fixed crash introduced in 2.02. Fixed lack of tessembedded.cpp in distribution. Added test for leptonica header files and conditional test for lib.	2008-05-30 13:06:26 +00:00
wiz	06d626133c	Update to 2.01: August 27 2007 - V2.01 Fixed UTF8 input problems with box file reader. Fixed various infinite loops and crashes in dawg code. Removed include of config_auto.h from host.h. Added automatic wctype encoding to unicharset_extractor. Fixed dawg table too full error. Removed svn files from tarball. Added new functions to tessdll. Increased maximum utf8 string in a classification result to 8.	2007-11-29 16:42:08 +00:00
wiz	1da043e250	Update to 2.00, provided by Rumko on pkgsrc-users. July 02 2007 - V2.00 Converted internal character handling to UTF8. Trained with 6 languages. Added unicharset_extractor, wordlist2dawg. Added boxfile creation mode. Added UNLV regression test capability. Fixed problems with copyright and registered symbols. Fixed extern "C" declarations problem.	2007-07-28 01:02:14 +00:00
wiz	e899e6021c	Initial import of tesseract-1.04b from pkgsrc-wip (packaged by heinz@ and myself): This code is a raw OCR engine. It has NO PAGE LAYOUT ANALYSIS, NO OUTPUT FORMATTING, and NO UI. It can only process an image of a single column and create text from it. It can detect fixed pitch vs proportional text. Having said that, in 1995, this engine was in the top 3 in terms of character accuracy, and it compiles and runs on both Linux and Windows. Another current limitation is that it only recognizes English and its character set is only US-ASCII. Training code IS included in the open source release however, and will be included in a future release.	2007-05-18 06:39:27 +00:00

11 commits