Changes:
* MimeLineReader.cc: 1.0 branch - fixed MBX record header regex
* spamprobe.cc (main): Added exec and exec-shared commands.
(import_words): modified import command to allow negative values
to be specified in the import file.
* Applied patches for configure.in and aclocal.m4 contributed by
Siggy Brentrup for debian compatibility.
* FrequencyDBImpl_pbl.cc: Invokes new WordData methods to allow
storing data in big endian format.
* WordData.h: Added optional support for storing counts/flags
in big endian order for data portability.
* MimeLineReader.cc (readMBXFileHeader): UW IMAP MBX file format
is now auto detected from the first line of the mailbox file.
* spamprobe.cc (process_extended_options): Removed -o imap-mbx
option.
* spamprobe.cc (process_extended_options): Added -o imap-mbx
option to process files as UW IMAP MBX files rather than mbox
files.
* MimeLineReader.cc (readLine): Added support for UW IMAP MBX file
format.
* spamprobe.cc (process_stream): Added -o tokenized option
to allow people to use an external tokenizer with spamprobe.
* SpamFilter.cc (scoreToken): Reduced sorting overhead by
pre-computing an integer sort value with sorting priorities
reflected in the value. This eliminates several calculations
inside of the sort routine.
* SpamFilter.cc (computeRatio): Capped ratios in calculations to
within MIN_PROB and MAX_PROB. Widened that range. This avoids
problems with div/0 and makes it easier to sort terms.
* spamprobe.cc (dump_words): dump command can now optionally
accept a regular expression as an argument and will only dump
terms matching the regular expression.
(purge_terms): Added purge-terms command to purge from the
database all terms matching a regular expression.
* spamprobe.cc (main): Fixed bug in command line processing.
Thanks to Jem for bug report.
* spamprobe.cc (train_on_message): Code simplified. Eliminated
redundant recalculation of scores.
(train_on_message): Timestamps are no longer updated by
train-spam and train-good commands. They are still updated by
train command.
(main): Fixed assertion if -P option is specified in a read only
operation.
* spamprobe.cc (main): Added -C command line option to allow users
to specify their own min word count.
* SpamFilter.cc (SpamFilter): Set default minimum word count back
to 5 (was 3).
* spamprobe.cc (process_extended_options): Removed "alt-score"
from -o options list because it distributes scores poorly. New
formula achieves the same end with better accuracy. Added
"orig-score" option to allow people to continue using the old
formula. Added "honor-xstatus-header" option for people whose
mail server uses X-Status: rather than Status: for the deleted
flag.
(main): Added -l command line option to allow people to set
their own spam threshold if they don't like the default value.
* SpamFilter.cc (scoreMessage): Added a new scoring formula based
on Paul's but taking the nth root of spam and good probabilities
to produce more evenly distributed scores. Lowered the spam
threshold to 0.6 to keep accuracy about the same as the original
formula. Highest score seen for a ham so far in tests is 0.44
so 0.6 seems safe. Made the new formula the default instead of
Paul's.
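The scoreMessage change above can be illustrated with a minimal sketch. It assumes "Paul's" formula is Paul Graham's combining rule, prod(p) / (prod(p) + prod(1 - p)), and that the new formula takes the nth root of each product before combining them. The function names are hypothetical and the exact form used in SpamFilter.cc may differ.

```cpp
#include <cmath>
#include <vector>

// Original combining rule: prod(p) / (prod(p) + prod(1 - p)).
// Expects a non-empty vector of probabilities strictly between 0 and 1.
double orig_score(const std::vector<double> &probs)
{
  double spam = 1.0, good = 1.0;
  for (double p : probs) {
    spam *= p;
    good *= 1.0 - p;
  }
  return spam / (spam + good);
}

// Assumed variant: take the nth root (geometric mean) of the spam and
// good products before combining. Logs are summed to avoid underflow
// when many terms are multiplied together.
double nth_root_score(const std::vector<double> &probs)
{
  double log_spam = 0.0, log_good = 0.0;
  for (double p : probs) {
    log_spam += std::log(p);
    log_good += std::log(1.0 - p);
  }
  const double n = static_cast<double>(probs.size());
  const double spam = std::exp(log_spam / n);
  const double good = std::exp(log_good / n);
  return spam / (spam + good);
}
```

For three terms that each have probability 0.99, orig_score gives a value above 0.9999 while nth_root_score gives 0.99, so scores cluster less tightly at the extremes.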
in the process. (More information on tech-pkg.)
Bump PKGREVISION and BUILDLINK_DEPENDS of all packages using libtool and
installing .la files.
Bump PKGREVISION (only) of all packages depending directly on the above
via a buildlink3 include.
* New manual page
Patch submitted by Ossi Herrala <PGP: 0x78CD0337> in private email.
Patch provided by Ossi Herrala <$MAINTAINER> in private mail.
CHANGED:
* This release moves from Paul's original formula to a slightly modified
one that yields more evenly distributed scores. To continue using the old
formula, use the -o orig-score command line option.
ADDED:
* -C <number> command line option. This tells SpamProbe to assign a default,
somewhat neutral, probability to any term that does not have a weighted
(good count doubled) count of at least the specified number in the database.
This prevents terms which have been seen only a few times from having
an unreasonable influence on the score of an email containing them.
The default count has changed: it is now 5 (was 3).
* Added -o <option_name> command line option to specify an alternate way of
scoring. Consult README.txt for more info.
* Added -l <number> command line option. Changes the spam probability
threshold for emails from the default (0.7) to <number>.
* Added tokenize command. Prints the tokens found in the file one word
per line in human readable format with spam probability, good count,
spam count, message count, and word in columns separated by whitespace.
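The -C rule in the list above can be sketched as follows. The weighted count is read here as twice the good count plus the spam count. The neutral value 0.5, the TermCounts struct, and the function name are hypothetical, and the fallback ratio is a simplification rather than SpamProbe's actual probability computation.

```cpp
// Hypothetical per-term record; SpamProbe's own WordData class differs.
struct TermCounts {
  int good_count;
  int spam_count;
};

// Returns a neutral probability for terms seen too rarely to be trusted.
// min_word_count plays the role of the -C argument (default 5).
double term_probability(const TermCounts &c,
                        int min_word_count = 5,
                        double neutral = 0.5)  // assumed neutral value
{
  // "Good count doubled": good occurrences weigh twice as much.
  const int weighted = 2 * c.good_count + c.spam_count;
  if (weighted < min_word_count || weighted == 0) {
    return neutral;
  }
  // Simplified ratio of spam occurrences to the weighted total.
  return static_cast<double>(c.spam_count) / static_cast<double>(weighted);
}
```

A term seen once in good mail and once in spam has a weighted count of 3, so it gets the neutral value under the default of 5 but is scored normally when the minimum is lowered to 3.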
IMPROVED:
* -H command line option to add more headers to scan.
* Improved performance by removing some redundant calculations and
reducing the amount of I/O in train-* mode.
SpamProbe is a fast, intelligent, automatic spam detector using Bayesian
analysis of terms contained in emails. It works with procmail, maildrop or a
similar tool to produce a complete server or client side spam filtering
system.
Provided by Daniel Farrugia in PR#20286, buildlinkified by me.