py-snowballstemmer: updated to 2.2.0

Snowball 2.2.0 (2021-11-10)
===========================

New Code Generators
-------------------

* Add Ada generator from Stephane Carrez

Javascript
----------

* Fix generated code to use integer division rather than floating point
  division.

  Noted by David Corbett.

Pascal
------

* Fix code generated for division.  Previously real division was used and the
  generated code would fail to compile with an "Incompatible types" error.

  Noted by David Corbett.

* Fix code generated for Snowball's `minint` and `maxint` constants.

Python
------

* Python 2 is no longer actively supported, as proposed on the mailing list:
  https://lists.tartarus.org/pipermail/snowball-discuss/2021-August/001721.html

* Fix code generated for division.  Previously the Python code we generated
  used integer division but rounded negative fractions towards negative
  infinity rather than zero under Python 2, and under Python 3 used floating
  point division.

  Noted by David Corbett.
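
  To illustrate the three behaviours this entry distinguishes, here is a small
  Python 3 sketch.  The helper `trunc_div` is a hypothetical name introduced
  only for illustration and is not the code the Snowball generator emits; it
  shows division that rounds towards zero, next to Python's floor division
  (`//`) and floating point division (`/`).

```python
def trunc_div(a, b):
    """Integer division that rounds towards zero (hypothetical helper)."""
    q = abs(a) // abs(b)
    # The quotient is negative only when the operands differ in sign.
    return q if (a < 0) == (b < 0) else -q


print(-7 // 2)           # floor division rounds towards negative infinity
print(-7 / 2)            # Python 3 true division yields a float
print(trunc_div(-7, 2))  # rounds towards zero
```

  Working on absolute values keeps the arithmetic in integers, which avoids
  the precision loss that converting a float quotient back to an integer can
  introduce for large operands.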

Code Quality Improvements
-------------------------

* C#: An `among` without functions is now generated as `static` and groupings
  are now generated as constant.

Code generation improvements
----------------------------

* General:

  + Constant numeric subexpressions and constant numeric tests are now
    evaluated at Snowball compile time.
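
  As a rough illustration of what evaluating constant subexpressions at
  compile time means, the following Python sketch folds constants in a toy
  expression tree.  The tuple representation, the `OPS` table and the `fold`
  function are assumptions made for this example; the Snowball compiler's
  actual data structures are not shown in this changelog.

```python
import operator

# Hypothetical expression form: an int, a variable name (str),
# or a tuple (op, left, right).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def fold(expr):
    """Fold constant subexpressions, leaving variables untouched."""
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr
    left, right = fold(left), fold(right)
    # Only evaluate when both operands are known at "compile time".
    if isinstance(left, int) and isinstance(right, int):
        return OPS[op](left, right)
    return (op, left, right)


print(fold(("+", "x", ("*", 3, 4))))
```

  Here the constant product is computed once during folding, while the
  addition involving the variable `x` is left in the tree.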

Behavioural changes to existing algorithms
------------------------------------------

* german2: Fix handling of `qu` to match algorithm description.  Previously
  the implementation erroneously did `skip 2` after `qu`.  We suspect this was
  intended to skip the `qu` but that's already been done by the substring/among
  matching, so it actually skips an extra two characters.

  The implementation has always differed in this way, but there's no good
  reason to skip two extra characters here so overall it seems best to change
  the code to match the description.  This change only affects the stemming of
  a single word in the sample vocabulary - `quae` which seems to actually be
  Latin rather than German.

Optimisations to existing algorithms
------------------------------------

* arabic: Handle exception cases in the among they're exceptions to.

* greek: Remove unused slice setting, handle exception cases in the among
  they're exceptions to, and turn `substring ... among ...  or substring ...
  among ...` into a single `substring ... among ...` in cases where it is
  trivial to do so.

* hindi: Eliminate the need for variable `p`.

* irish: Minor optimisation in setting `pV` and `p1`.

* yiddish: Make use of `among` more.

Compiler
--------

* Fix handling of `len` and `lenof` being declared as names.

  For compatibility with programs written for older Snowball versions,
  `len` and `lenof` stop being tokens if declared as names.  However, this
  code didn't work correctly if the tokeniser's name buffer needed to
  be enlarged to hold the token name (i.e. 3 or 5 elements respectively).

* Report a clearer error if `=` is used instead of `==` in an integer test.

* Replace a single entry command list with its contents in the internal syntax
  tree.  This puts things in a more canonical form, which helps subsequent
  optimisations.
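
  The single entry command list rewrite can be pictured with a toy model.  In
  this Python sketch a command list is a Python list and a command is a
  string; both choices, and the name `canonicalise`, are assumptions for
  illustration only, since the real compiler uses its own syntax tree.

```python
def canonicalise(node):
    """Replace any command list holding one command with that command."""
    if isinstance(node, list):
        children = [canonicalise(child) for child in node]
        # A list with exactly one entry collapses to the entry itself.
        return children[0] if len(children) == 1 else children
    return node


print(canonicalise([["a"], "b"]))
```

  After the rewrite, equivalent programs share one tree shape, so a later
  pass only has to recognise that one shape.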

Build system
------------

* Support building on Microsoft Windows (using mingw+msys or a similar
  Unix-like environment).

* Split out INCLUDES from CPPFLAGS so that CPPFLAGS can now be overridden by
  the user if required.

* Regenerate algorithms.mk only when needed rather than on every `make` run.

libstemmer
----------

* The libstemmer static library now has a `.a` extension, rather than `.o`.

Testsuite
---------

* stemtest: Test that numbers and numeric codes aren't damaged by any of the
  algorithms.

* ada: Fix ada tests to fail if output differs.  There was an extra `| head
  -300` compared to other languages, which meant that the exit code of `diff`
  was ignored.  It seems more helpful (and is more consistent) not to limit how
  many differences are shown so just drop this addition.

* go: Stop thinning testdata.  It looks like we were only doing so because the
  test harness code was based on that for rust, which was based on that for
  javascript, which was only thinning because it was reading everything into
  memory and the larger vocabulary lists were resulting in out of memory
  issues.

* javascript: Speed up stemwords.js.  Process input line-by-line rather than
  reading the whole file into memory, splitting, iterating, and creating an
  array with all the output, joining and writing out a single huge string.
  This also means we can stop thinning the test data for javascript, which we
  were only doing because the huge arabic test data file was causing out of
  memory errors.  Also drop the -p option, which isn't useful here and
  complicates the code.

* rust: Turn on optimisation in the makefile rather than the CI config.  This
  makes the tests run in about 1/5 of the time and there's really no reason to
  be thinning the testdata for rust.
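
  The line-by-line approach described in the javascript entry above can be
  sketched as follows.  This is Python rather than the JavaScript of
  stemwords.js, and `stem_lines` and its `stem` callback are hypothetical
  names; the point is only that each word is read, stemmed and written
  before the next is read, so memory use does not grow with the input size.

```python
import io


def stem_lines(src, dst, stem):
    """Stem one word per line, streaming from src to dst."""
    for line in src:  # file objects yield one line at a time
        dst.write(stem(line.rstrip("\n")) + "\n")


# Usage with in-memory streams; str.lower stands in for a real stemmer.
src = io.StringIO("Foo\nBar\n")
dst = io.StringIO()
stem_lines(src, dst, str.lower)
print(dst.getvalue(), end="")
```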

Documentation
-------------

* CONTRIBUTING.rst: Improve documentation for adding a new stemming algorithm.

* Improve wording of Python docs.
adam 2021-11-18 19:38:01 +00:00
parent d84fa9b51d
commit 574dd30ecc
2 changed files with 6 additions and 6 deletions

@@ -1,6 +1,6 @@
-# $NetBSD: Makefile,v 1.5 2021/02/09 10:28:26 adam Exp $
+# $NetBSD: Makefile,v 1.6 2021/11/18 19:38:01 adam Exp $
-DISTNAME= snowballstemmer-2.1.0
+DISTNAME= snowballstemmer-2.2.0
 PKGNAME= ${PYPKGPREFIX}-${DISTNAME}
 CATEGORIES= textproc python
 MASTER_SITES= ${MASTER_SITE_PYPI:=s/snowballstemmer/}

@@ -1,5 +1,5 @@
-$NetBSD: distinfo,v 1.7 2021/10/26 11:23:13 nia Exp $
+$NetBSD: distinfo,v 1.8 2021/11/18 19:38:01 adam Exp $
-BLAKE2s (snowballstemmer-2.1.0.tar.gz) = cc580da7781577e95be41df302c6ba5650f18e0d6d09a527870b1c8c1351aee3
-SHA512 (snowballstemmer-2.1.0.tar.gz) = e0550d3389074d7686d26397ff2289519cd8b26cf7090fe781d6407d1c2b95f912347d70cd25e02d6016c454ad6c5cf6d648e54ef87161328ac57bc1ceaf7826
-Size (snowballstemmer-2.1.0.tar.gz) = 85674 bytes
+BLAKE2s (snowballstemmer-2.2.0.tar.gz) = 7003153e7592ed98d73f2748d7b7103568a53acfc6367ace7568e5103005ac7a
+SHA512 (snowballstemmer-2.2.0.tar.gz) = f1dee83e06fc79ffb250892fe62c75e3393b9af07fbf7cde413e6391870aa74934302771239dea5c9bc89806684f95059b00c9ffbcf7340375c9dd8f1216cd37
+Size (snowballstemmer-2.2.0.tar.gz) = 86699 bytes