Commit graph

27 commits

Author SHA1 Message Date
ryoon
36ed025474 Recursive revbump from textproc/icu 58.1 2016-12-04 05:17:03 +00:00
ryoon
ac20a93574 Recursive revbump from textproc/icu 57.1 2016-04-11 19:01:33 +00:00
joerg
80dd09bca7 Needs pkg-config. 2016-04-03 12:46:18 +00:00
fhajny
55a7b53210 Make sure leptonica is detected properly 2016-03-30 11:38:59 +00:00
fhajny
d46db02864 Update graphics/tesseract to 3.04.01.
Move to new home at GitHub. Clean up.

2016-02-17 - V3.04.01
- Added OSD renderer for psm 0. Works for single page and
  multi-page images.
- Improve tesstrain.sh script.
- Simplify build and run of ScrollView.
- Improved PDF output for OS X Preview utility.
- INCOMPATIBLE fix to hOCR line height information - commit
  134ebc3.
- Added option to build Tesseract without Cube OCR engine
  (-DNO_CUBE_BUILD).
- Enable OpenMP support.
- Many bug fixes.

2015-07-11 - V3.04.00
- Tesseract development is now done with Git and hosted at
  github.com (Previously we used Subversion as a VCS and
  code.google.com for hosting).
- Tesseract now requires leptonica 1.71 or a higher version.
- Removed official support for VS 2008.
- Added support for 39 additional scripts/languages, including:
  amh, asm, aze_cyrl, bod, bos, ceb, cym, dzo, fas, gle, guj, hat,
  iku, jav, kat, kat_old, kaz, khm, kir, kur, lao, lat, mar, mya,
  nep, ori, pan, pus, san, sin, srp_latn, syr, tgk, tir, uig, urd,
  uzb, uzb_cyrl, yid
- Major updates to training system as a result of extensive
  testing on 100 languages.
- New training data for over 100 languages
- Improved performance with PIC compilation option.
- Significant change to invisible font system in pdf output to
  improve correctness and compatibility with external programs,
  particularly ghostscript.
- Improved font identification.
- Major change to improve layout analysis for heavily diacritic
  languages: Thai, Vietnamese, Kannada, Telugu etc.
- Fixed problems with shifted baselines so recognition can recover
  from layout analysis errors.
- Major refactor to improve speed on difficult images, especially
  when running a heap checker.
- Moved params from global in page layout to tesseractclass.
- Improved single column layout analysis.
- Allow ocr output to multiple formats using tesseract command
  line executable.
- Fixed issues with mixed eng+ara scripts.
- Improved script consistency in numbers.
- Major refactor of control.cpp to enable line recognition.
- Added tesstrain.sh - a master training script.
- Added ability to text2image training tool to just list available
  fonts.
- Added ability to text2image to underline words.
- Improved efficiency of image processing for PDF output.
- Added parameter description for each parameter listed with
  'print-parameters' command line option.
- Added font info to hOCR output.
- Enabled streaming input and output of multi-page documents.
- Many bug fixes.

2014-02-04 - V3.03(rc1)
- Added new training tool text2image to generate box/tif file
  pairs from text and truetype fonts.
- Added support for PDF output with searchable text.
- Removed entire IMAGE class and all code in image directory.
- Tesseract executable: support for output to stdout; limited
  support for one page images from stdin (especially on Windows)
- Added Renderer to API to allow document-level processing and
  output of document formats, like hOCR, PDF.
- Major refactor of word-level recognition, beam search,
  eliminating dead code.
- Refactored classifier to make it easier to add new ones.
- Generalized feature extractor to allow feature extraction from
  greyscale.
- Improved sub/superscript treatment.
- Improved baseline fit.
- Added set_unicharset_properties to training tools.
- Many bug fixes.
- More training source data included.
2016-03-17 12:51:14 +00:00
adam
011bef3059 Revbump after updating graphics/libwebp 2016-01-06 10:46:49 +00:00
agc
7f810a359f Add SHA512 digests for distfiles for graphics category
Problems found with existing digests:
	Package fotoxx distfile fotoxx-14.03.1.tar.gz
	ac2033f87de2c23941261f7c50160cddf872c110 [recorded]
	118e98a8cc0414676b3c4d37b8df407c28a1407c [calculated]
	Package ploticus-examples distfile ploticus-2.00/plnode200.tar.gz
	34274a03d0c41fae5690633663e3d4114b9d7a6d [recorded]
	da39a3ee5e6b4b0d3255bfef95601890afd80709 [calculated]

Problems found locating distfiles:
	Package AfterShotPro: missing distfile AfterShotPro-1.1.0.30/AfterShotPro_i386.deb
	Package pgraf: missing distfile pgraf-20010131.tar.gz
	Package qvplay: missing distfile qvplay-0.95.tar.gz

Otherwise, existing SHA1 digests verified and found to be the same on
the machine holding the existing distfiles (morden).  All existing
SHA1 digests retained for now as an audit trail.
2015-11-03 21:33:50 +00:00
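
The digest check described in the commit above can be sketched as follows. This is a minimal illustration, not the tooling actually used for the commit; the helper name `file_digests` and its parameters are assumptions made for this example. It also shows that the "calculated" value recorded for plnode200.tar.gz is the SHA1 of zero bytes of input, which suggests (but does not prove) that the file on disk was empty.

```python
import hashlib


def file_digests(path, algorithms=("sha1", "sha512"), chunk_size=1 << 16):
    """Return {algorithm: hex digest} for the file at `path`.

    Hypothetical helper for illustration; reads the file in chunks so
    large distfiles do not need to fit in memory.
    """
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


# SHA1 of empty input; identical to the [calculated] digest listed
# for plnode200.tar.gz in the commit message.
EMPTY_SHA1 = hashlib.sha1(b"").hexdigest()
```

Comparing the `sha1` entry of the result against the recorded digest reproduces the [recorded]/[calculated] check; the `sha512` entry is the kind of value that commit adds alongside it.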
fhajny
504dbb14c0 Network libs still needed, fix build on SunOS. 2015-10-07 11:26:22 +00:00
adam
243c29c4cc Revbump after updating libwebp and icu 2014-10-07 16:47:10 +00:00
adam
ccf89ed63f Changes 3.02.02:
* Moved ResultIterator/PageIterator to ccmain.
* Added Right-to-left/Bidi capability in the output iterators for Hebrew/Arabic.
* Added paragraph detection in layout analysis/post OCR.
* Fixed inconsistent xheight during training and over-chopping.
* Added simultaneous multi-language capability.
* Refactored top-level word recognition module.
* Added experimental equation detector.
* Improved handling of resolution from input images.
* Blamer module added for error analysis.
* Cleaned up externally used namespace by removing includes from baseapi.h.
* Removed dead memory management code.
* Tidied up constraints on control parameters.
* Added support for ShapeTable in classifier and training.
* Refactored class pruner.
* Fixed training leaks and randomness.
* Major improvements to layout analysis for better image detection, diacritic detection, better textline finding, better tabstop finding.
* Improved line detection and removal.
* Added fixed pitch chopper for CJK.
* Added UNICHARSET to WERD_CHOICE to make multi-language handling easier.
* Fixed problems with internally scaled images.
* Added page and bbox to string in tr files to identify source of training data better.
* Fixes to Hindi Shirorekha splitter.
* Added word bigram correction.
* Reduced stack memory consumption and eliminated some ugly typedefs.
* Added new uniform classifier API.
* Added new training error counter.
* Fixed endian bug in dawg reader.
* Many other fixes, including the way in which the chopper finds chops and messes with the outline while it does so.
2014-10-02 16:06:02 +00:00
jperkin
4fff5a02b8 SunOS needs -lsocket -lnsl. 2014-09-23 18:55:24 +00:00
joerg
3769fa0bfc Add a number of includes hidden by libstdc++'s name space pollution. 2013-04-29 21:31:09 +00:00
adam
f4c3b89da7 Revbump after graphics/jpeg and textproc/icu 2013-01-26 21:36:13 +00:00
marino
f50901c296 graphics/tesseract: #include <unistd.h>
Fixes out-of-scope errors seen on gcc 4.7.x
2012-11-23 23:52:33 +00:00
asau
08f35c7155 Drop superfluous PKG_DESTDIR_SUPPORT, "user-destdir" is default these days. 2012-10-06 14:10:39 +00:00
wiz
5a1e8b0499 Revbump for
a) tiff update to 4.0 (shlib major change)
b) glib2 update 2.30.2 (adds libffi dependency to buildlink3.mk)

Enjoy.
2012-02-06 12:40:37 +00:00
dholland
2313ef244d Add missing <stdio.h>, should fix or improve linux build 2011-11-14 02:44:40 +00:00
wiz
91871f449e Second try at jpeg-8 recursive PKGREVISION bump. 2010-01-18 09:58:37 +00:00
sno
6f7368d4db bump revision because of graphics/jpeg update 2009-08-26 19:56:37 +00:00
wiz
d417e89789 Update to 2.04. Set LICENSE.
June 30 2009 - V2.04
	  Integrated bug fixes and patches and misc changes for portability.
	  Integrated a patch to remove some of the "access" macros.
	  Removed dependence on lua from the viewer, speeding it up
	  dramatically.
	  Fixed the viewer so it compiles and runs properly!
	  Specifically fixing issues: 1, 63, 67, 71, 76, 81, 82, 106, 111,
	  112, 128, 129, 130, 133, 135, 142, 143, 145, 147, 153, 154, 160,
	  165, 170, 175, 177, 187, 192, 195, 199, 201, 205, 209, 108, 169
2009-07-22 20:57:47 +00:00
brook
716ab7596a Add language-specific data sets distributed by the project. The tesseract
distribution itself just creates dummy, placeholder data sets that cannot
be used.
2009-07-21 16:00:19 +00:00
joerg
3a3c07bc30 Remove @dirrm entries from PLISTs 2009-06-14 17:59:04 +00:00
wiz
4358c8cac0 Replace patch-ab with a post-extract rule. No change to the binary package,
just one file less in pkgsrc ;)
2008-10-30 22:12:59 +00:00
wiz
b4a554e958 Update to 2.03:
January 23 2008 - V2.02
          Improvements to clustering, training and classifier.
          Major internationalization improvements for large-character-set
          languages, eg Kannada.
          Removed some compiler warnings.
          Added multipage tiff support for training and running.
          Updated graphics output to talk to new java-based viewer.
          Added ability to save n-best lists.
          Added leptonica support for more file types.
          Improved Init/End to make them safe.
          Reduced memory use of dictionaries.
          Added some new APIs to TessBaseAPI.
April 21 2008 - V2.02 (again)
          Fixed namespace collisions with jpeg library (INT32).
          Portability fixes for Windows for new code.
          Updates to autoconf system for new code.
April 22 2008 - V2.03
          Fixed crash introduced in 2.02.
          Fixed lack of tessembedded.cpp in distribution.
          Added test for leptonica header files and conditional test for lib.
2008-05-30 13:06:26 +00:00
wiz
06d626133c Update to 2.01:
August 27 2007 - V2.01
	  Fixed UTF8 input problems with box file reader.
	  Fixed various infinite loops and crashes in dawg code.
	  Removed include of config_auto.h from host.h.
	  Added automatic wctype encoding to unicharset_extractor.
	  Fixed dawg table too full error.
	  Removed svn files from tarball.
	  Added new functions to tessdll.
	  Increased maximum utf8 string in a classification result to 8.
2007-11-29 16:42:08 +00:00
wiz
1da043e250 Update to 2.00, provided by Rumko on pkgsrc-users.
July 02 2007 - V2.00
	  Converted internal character handling to UTF8.
	  Trained with 6 languages.
	  Added unicharset_extractor, wordlist2dawg.
	  Added boxfile creation mode.
	  Added UNLV regression test capability.
	  Fixed problems with copyright and registered symbols.
	  Fixed extern "C" declarations problem.
2007-07-28 01:02:14 +00:00
wiz
e899e6021c Initial import of tesseract-1.04b from pkgsrc-wip (packaged by heinz@
and myself):

This code is a raw OCR engine. It has NO PAGE LAYOUT ANALYSIS, NO
OUTPUT FORMATTING, and NO UI. It can only process an image of a
single column and create text from it. It can detect fixed pitch
vs proportional text.  Having said that, in 1995, this engine was
in the top 3 in terms of character accuracy, and it compiles and
runs on both Linux and Windows. Another current limitation is that
it only recognizes English and its character set is only US-ASCII.
Training code IS included in the open source release however, and
will be included in a future release.
2007-05-18 06:39:27 +00:00