Commit graph

69 commits

Author SHA1 Message Date
adam
f5e35d538b revbump for textproc/icu update 2022-04-18 19:09:40 +00:00
adam
b6d9bd86bc revbump for icu and libffi 2021-12-08 16:01:42 +00:00
nia
f8331b5844 graphics: Replace RMD160 checksums with BLAKE2s checksums
All checksums have been double-checked against existing RMD160 and
SHA512 hashes
2021-10-26 10:45:53 +00:00
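The cross-check described in this commit can be sketched roughly as follows. This is a minimal illustration and not the tooling actually used for the change: the helper name `file_digests` and the chunk size are assumptions, and it computes only BLAKE2s and SHA512 with Python's standard `hashlib`.

```python
import hashlib


def file_digests(path, chunk_size=65536):
    """Return hex BLAKE2s and SHA512 digests of the file at `path`.

    Hypothetical helper for illustration only; pkgsrc uses its own tools.
    """
    blake2s = hashlib.blake2s()
    sha512 = hashlib.sha512()
    with open(path, "rb") as handle:
        # Read in chunks so large distfiles need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            blake2s.update(chunk)
            sha512.update(chunk)
    return {"BLAKE2s": blake2s.hexdigest(), "SHA512": sha512.hexdigest()}
```

Comparing the SHA512 value against the already recorded one before trusting the new BLAKE2s value is the kind of double-check the message refers to.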
nia
84d3786e88 graphics: Remove SHA1 hashes for distfiles 2021-10-07 14:11:55 +00:00
jperkin
cd160ad5c5 tesseract: Avoid C++ <version> issue on macOS. 2021-07-16 09:16:27 +00:00
adam
9d0e79c401 revbump for textproc/icu 2021-04-21 11:40:12 +00:00
ryoon
2831546220 *: Recursive revbump from textproc/icu-68.1 2020-11-05 09:07:25 +00:00
leot
953ab724e1 *: revbump after fontconfig bl3 changes (libuuid removal) 2020-08-17 20:19:01 +00:00
jperkin
38fe454b9c *: Apply revbump for graphics/giflib API change. 2020-06-05 12:48:58 +00:00
adam
6bd0c30da6 Revbump for icu 2020-06-02 08:22:31 +00:00
adam
24daafa112 Recursive revision bump after textproc/icu update 2020-04-12 08:27:48 +00:00
wiz
4e3b1b97c2 librsvg: update bl3.mk to remove libcroco in rust case
recursive bump for the dependency change
2020-03-10 22:08:37 +00:00
wiz
f669fda471 *: recursive bump for libffi 2020-03-08 16:47:24 +00:00
adam
ed48d34938 tesseract: updated to 4.1.1
4.1.1 Release:
Implemented sw build (cppan is deprecated)
Improved cmake build
Code cleanup and optimization
A lot of bug fixes...
2019-12-29 16:44:12 +00:00
adam
8990085c82 tesseract: updated to 4.1.0
4.1.0 Release
Added new renderers: Alto, LSTMBox, WordStrBox.
Added character boxes in hOCR output.
Added Python training scripts (experimental) as an alternative to the shell scripts.
Better support AVX / AVX2 / SSE.
Disable OpenMP support by default.
Fix for bounding box problem.
Implemented support for whitelist/blacklist in LSTM engine.
Improved cmake configuration.
Code modernization and improvements.
A lot of bug fixes...
2019-07-08 18:37:03 +00:00
leot
92ed1bcdec tesseract: Avoid unportable `==' test(1) operator
PKGREVISION++

(There should be no change, i.e. the test(1) code path still seems never
to be reached, but bump it for extra paranoia.)
2019-05-04 16:05:33 +00:00
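For context, POSIX test(1) specifies `=` for string equality, while `==` is an extension that some shells reject. A rough sketch of how such uses could be flagged in a script is shown below. The function name and both regular expressions are assumptions made for illustration; the patterns are deliberately simplified (they ignore quoting, comments, and multi-line constructs) and are not the check that pkgsrc itself runs.

```python
import re

# `==` inside a single-bracket test such as: [ "$a" == "b" ]
# The lookbehind skips bash-style `[[ ... ]]`, where `==` is intentional.
_BRACKET = re.compile(r"(?<!\[)\[\s[^\]]*\s==\s")
# `==` after the `test` command, such as: test "$a" == "b"
_TEST = re.compile(r"(?:^|[\s;&|(])test\s[^;&|]*\s==\s")


def find_unportable_tests(script_text):
    """Return 1-based line numbers that appear to use `==` with test(1)."""
    hits = []
    for lineno, line in enumerate(script_text.splitlines(), start=1):
        if _BRACKET.search(line) or _TEST.search(line):
            hits.append(lineno)
    return hits
```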
ryoon
6fc378bce9 Recursive revbump from textproc/icu 2019-04-03 00:32:25 +00:00
gutteridge
9fcf300adf graphics/tesseract: update DESCR
The DESCR was about a decade out of date, revise to reflect 4.0.
2019-01-16 00:07:49 +00:00
adam
16dd5de231 revbump after updating textproc/icu 2018-12-09 18:51:58 +00:00
adam
ae4086588c tesseract: fix manpage formatting 2018-11-29 09:15:22 +00:00
adam
3f1bb6b94c tesseract: build depends on asciidoc 2018-11-28 12:04:20 +00:00
adam
2cbce7e2fa tesseract: use REPLACE_BASH; fix building man-pages; courtesy of Mustafa D. :) 2018-11-18 18:07:20 +00:00
kleink
f1a683c990 Revbump after cairo 1.16.0 update. 2018-11-14 22:20:58 +00:00
ryoon
b86dfe6873 Recursive revbump from harfbuzz-2.1.1 2018-11-12 03:51:07 +00:00
adam
c7ab36c3d8 tesseract: updated to 4.0.0
V4.0.0:
New OCR engine
- Added a new OCR engine that uses neural network system based on LSTMs, with major accuracy gains.
- This includes new training tools for the LSTM OCR engine. A new model can be trained from scratch or by fine tuning an existing model.
- Added trained data that includes LSTM models for 123 languages.
- Added optional accelerated code paths for the LSTM recognizer:
  * Using OpenMP
  * Using SIMD: AVX2 / AVX / SSE4.1
- Added a new parameter lstm_choice_mode that allows including alternative symbol choices in the hOCR output.
- The new LSTM engine still does not support all features from the old legacy engine (see missing features).

Other OCR engines
- The pattern matching OCR engine that was the primary OCR engine in previous versions is still available in this version.
- Removed the 'Cube' OCR engine from the codebase. It was used for Hindi and for Arabic. The New LSTM engine performs much better, thus the Cube engine was no longer needed.

Updated build system
- Tesseract now uses semantic versioning.
- Tesseract now requires Leptonica 1.74.0 or a higher version.
- For building Tesseract from source code, a compiler with good C++ 11 support is required. See here for a list of officially supported compilers.
- Added unit tests to the main repo. The unit tests require Git submodules and the code for training.
- Added an option to compile Tesseract without the code of the legacy OCR engine.
- Update minimum required autoconf version to 2.63.
- Training tools dependencies - Update minimum required versions: ICU 52.1, Pango 1.22.0.
- Reorganized Tesseract's source tree. Most sources are now below the src directory.

Bug fixes and enhancements
- Fixed many issues that triggered compiler warnings.
- Fixed many issues reported by Coverity Scan or LGTM.
- Fixes to trainingdata rendering.
- Fixed damage to binary images when processing PDFs.
- Don't trigger a deliberate segmentation fault for fatal errors in release code.
- Fixed some issues in OpenCL code. OpenCL now works for the legacy Tesseract OCR engine, but does not improve the performance. It is not implemented for the LSTM OCR engine.
- Improved multi-page TIFF handling.
- Improvements to PDF rendering.
- Added version information and improved help texts to the training tools.
- Added faster version of log2().
- Documented in tesseract man page the option to use an input text file which contains lists of images.
- Made 'osd' the default traineddata when psm 0 is requested (currently this feature is only implemented in the command line interface, but not in the API).
- Removed tessedit_pageseg_mode 1 from hocr, pdf, and tsv config files. The user should explicitly use --psm 1 if that is desired.
- The list of available languages and scripts is now sorted alphabetically.
- Parameter unlv_tilde_crunching changed to false, because the previous default value caused issues with unlv output in Tesseract 4.
- Removed obsolete code.
2018-11-03 09:13:07 +00:00
ryoon
b9c1e1d533 Recursive revbump from textproc/icu-62.1 2018-07-20 03:33:47 +00:00
adam
3b3b118437 tesseract: updated to 3.05.02
V3.05.02
* Fixed linking with Leptonica
* Fix build for Mingw-w64
* Fix Training error "Couldn't find a matching blob"
* Fix unterminated string
2018-06-22 09:50:16 +00:00
fhajny
6b471d1791 graphics/tesseract: Revert update to data version 4.00. Using version 4 data with version 3 program is not supported. Fixes https://github.com/joyent/pkgsrc/issues/113. 2018-06-11 15:01:49 +00:00
adam
7e8d537bd1 tesseract: added buildlink3; fixed COMMENT and HOMEPAGE 2018-04-29 10:16:20 +00:00
wiz
8ee21bdcf0 Recursive bump for new fribidi dependency in pango. 2018-04-16 14:33:44 +00:00
adam
299d329d51 revbump after icu update 2018-04-14 07:33:52 +00:00
wiz
c57215a7b2 Recursive bumps for fontconfig and libzip dependency changes. 2018-03-12 11:15:24 +00:00
adam
42bdd074eb tesseract: updated tessdata to 4.00 2018-01-25 11:30:34 +00:00
adam
8977d31a36 Revbump after textproc/icu update 2017-11-30 16:45:00 +00:00
maya
33ebf687dc revbump for requiring ICU 59.x 2017-09-18 09:52:56 +00:00
wiz
5d86518619 Switch github HOMEPAGEs to https. 2017-07-30 22:32:10 +00:00
fhajny
9c6a529594 Update graphics/tesseract to 3.05.01.
- Fixed several build issues
- Fixed C-API
- Backport pdfrenderer changes
- Code clean up
2017-06-14 14:41:26 +00:00
adam
75a9285105 Revbump after icu update 2017-04-22 21:03:07 +00:00
ryoon
50aefac5f6 Recursive revbump from graphics/libwebp 2017-02-28 15:19:58 +00:00
fhajny
f9aeccef57 Update graphics/tesseract to 3.05.00
- Made some fine tuning to the hOCR output.
- Added TSV as another optional output format.
- Fixed ABI break introduced in 3.04.00 with the AnalyseLayout()
  method.
- text2image tool - Enable all OpenType ligatures available in a font.
  This feature requires Pango 1.38 or newer.
- Training tools - Replaced asserts with tprintf() and exit(1).
- Fixed Cygwin compatibility.
- Improved multipage tiff processing.
- Improved the embedded pdf font (pdf.ttf).
- Enable selection of OCR engine mode from command line.
- Changed tesseract command line parameter '-psm' to '--psm'.
- Added new C API for orientation and script detection, removed the
  old one.
- Increased minimum autoconf version to 2.59.
- Removed dead code.
- Fixed many compiler warnings.
- Fixed memory and resource leaks.
- Fixed some issues with the 'Cube' OCR engine.
- Fixed some OpenCL issues.
- Added option to build Tesseract with CMake build system.
- Implemented CPPAN support for easy Windows building.
2017-02-21 17:51:18 +00:00
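The change of the page segmentation flag from '-psm' to '--psm' matters to scripts that must work with releases on both sides of 3.05.00. A minimal sketch of handling it is given below. The helper names are hypothetical, the version parser accepts only purely numeric strings such as "3.04.01" or "4.1.1", and the assumption that the spelling switches exactly at 3.05.00 is taken from the release notes above.

```python
def parse_version(text):
    """Turn '3.05.00' or '4.1.1' into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.split("."))


def tesseract_command(image, output_base, psm, version):
    """Build an argv list, choosing the page segmentation flag by version.

    Assumes the flag spelling changed at 3.05.00, per the release notes.
    """
    flag = "--psm" if parse_version(version) >= (3, 5, 0) else "-psm"
    return ["tesseract", image, output_base, flag, str(psm)]
```

The returned list can be passed to `subprocess.run` on a system where the matching tesseract release is installed.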
ryoon
72c3cb198b Recursive revbump from fonts/harfbuzz 2017-02-12 06:24:36 +00:00
wiz
7ac05101c6 Recursive bump for harfbuzz's new graphite2 dependency. 2017-02-06 13:54:36 +00:00
ryoon
36ed025474 Recursive revbump from textproc/icu 58.1 2016-12-04 05:17:03 +00:00
ryoon
ac20a93574 Recursive revbump from textproc/icu 57.1 2016-04-11 19:01:33 +00:00
joerg
80dd09bca7 Needs pkg-config. 2016-04-03 12:46:18 +00:00
fhajny
55a7b53210 Make sure leptonica is detected properly 2016-03-30 11:38:59 +00:00
fhajny
d46db02864 Update graphics/tesseract to 3.04.01.
Move to new home at Github. Clean up.

2016-02-17 - V3.04.01
- Added OSD renderer for psm 0. Works for single page and
  multi-page images.
- Improve tesstrain.sh script.
- Simplify build and run of ScrollView.
- Improved PDF output for OS X Preview utility.
- INCOMPATIBLE fix to hOCR line height information - commit
  134ebc3.
- Added option to build Tesseract without Cube OCR engine
  (-DNO_CUBE_BUILD).
- Enable OpenMP support.
- Many bug fixes.

2015-07-11 - V3.04.00
- Tesseract development is now done with Git and hosted at
  github.com (Previously we used Subversion as a VCS and
  code.google.com for hosting).
- Tesseract now requires leptonica 1.71 or a higher version.
- Removed official support for VS 2008.
- Added support for 39 additional scripts/languages, including:
  amh, asm, aze_cyrl, bod, bos, ceb, cym, dzo, fas, gle, guj, hat,
  iku, jav, kat, kat_old, kaz, khm, kir, kur, lao, lat, mar, mya,
  nep, ori, pan, pus, san, sin, srp_latn, syr, tgk, tir, uig, urd,
  uzb, uzb_cyrl, yid
- Major updates to training system as a result of extensive
  testing on 100 languages.
- New training data for over 100 languages
- Improved performance with PIC compilation option.
- Significant change to invisible font system in pdf output to
  improve correctness and compatibility with external programs,
  particularly ghostscript.
- Improved font identification.
- Major change to improve layout analysis for heavily diacritic
  languages: Thai, Vietnamese, Kannada, Telugu etc.
- Fixed problems with shifted baselines so recognition can recover
  from layout analysis errors.
- Major refactor to improve speed on difficult images, especially
  when running a heap checker.
- Moved params from global in page layout to tesseractclass.
- Improved single column layout analysis.
- Allow OCR output in multiple formats using the tesseract command
  line executable.
- Fixed issues with mixed eng+ara scripts.
- Improved script consistency in numbers.
- Major refactor of control.cpp to enable line recognition.
- Added tesstrain.sh - a master training script.
- Added ability to text2image training tool to just list available
  fonts.
- Added ability to text2image to underline words.
- Improved efficiency of image processing for PDF output.
- Added parameter description for each parameter listed with
  'print-parameters' command line option.
- Added font info to hOCR output.
- Enabled streaming input and output of multi-page documents.
- Many bug fixes.

2014-02-04 - V3.03(rc1)
- Added new training tool text2image to generate box/tif file
  pairs from text and truetype fonts.
- Added support for PDF output with searchable text.
- Removed entire IMAGE class and all code in image directory.
- Tesseract executable: support for output to stdout; limited
  support for one-page images from stdin (especially on Windows)
- Added Renderer to API to allow document-level processing and
  output of document formats, like hOCR, PDF.
- Major refactor of word-level recognition, beam search,
  eliminating dead code.
- Refactored classifier to make it easier to add new ones.
- Generalized feature extractor to allow feature extraction from
  greyscale.
- Improved sub/superscript treatment.
- Improved baseline fit.
- Added set_unicharset_properties to training tools.
- Many bug fixes.
- More training source data included.
2016-03-17 12:51:14 +00:00
adam
011bef3059 Revbump after updating graphics/libwebp 2016-01-06 10:46:49 +00:00
agc
7f810a359f Add SHA512 digests for distfiles for graphics category
Problems found with existing digests:
	Package fotoxx distfile fotoxx-14.03.1.tar.gz
	ac2033f87de2c23941261f7c50160cddf872c110 [recorded]
	118e98a8cc0414676b3c4d37b8df407c28a1407c [calculated]
	Package ploticus-examples distfile ploticus-2.00/plnode200.tar.gz
	34274a03d0c41fae5690633663e3d4114b9d7a6d [recorded]
	da39a3ee5e6b4b0d3255bfef95601890afd80709 [calculated]

Problems found locating distfiles:
	Package AfterShotPro: missing distfile AfterShotPro-1.1.0.30/AfterShotPro_i386.deb
	Package pgraf: missing distfile pgraf-20010131.tar.gz
	Package qvplay: missing distfile qvplay-0.95.tar.gz

Otherwise, existing SHA1 digests verified and found to be the same on
the machine holding the existing distfiles (morden).  All existing
SHA1 digests retained for now as an audit trail.
2015-11-03 21:33:50 +00:00
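One detail in the digest report above can be checked directly: the "[calculated]" value listed for plnode200.tar.gz equals the SHA1 of zero bytes of input. That the file on hand was empty is an inference from this match and is not stated in the commit message. A small sketch, with a hypothetical helper name:

```python
import hashlib

# Digest listed as "[calculated]" for plnode200.tar.gz in the message above.
CALCULATED = "da39a3ee5e6b4b0d3255bfef95601890afd80709"


def is_empty_input_sha1(hex_digest):
    """Return True if `hex_digest` is the SHA1 of zero bytes of input."""
    return hex_digest.lower() == hashlib.sha1(b"").hexdigest()


print(is_empty_input_sha1(CALCULATED))  # prints True
```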
fhajny
504dbb14c0 Network libs still needed, fix build on SunOS. 2015-10-07 11:26:22 +00:00