.\"
.\" $Id$
.\"
.\" Note: The date here should be updated whenever a non-trivial
.\" change is made to the manual page.
.Dd September 5, 2002
.Dt SPAMPROBE 1
.Os
.Sh NAME
.Nm spamprobe
.Nd spam detector using Bayesian analysis of word counts
.Sh SYNOPSIS
.Nm
.Op Fl a Ar char
.Op Fl c
.Op Fl d Ar directory
.Op Fl h
.Op Fl H Ar option
.Op Fl m
.Op Fl n Ar number
.Op Fl r Ar number
.Op Fl s Ar number
.Op Fl v
.Op Fl V
.Op Fl Y
.Op Fl 7
.Op Fl 8
.Ar command Op ...
.Nm
.Ar receive Op filename ...
.Nm
.Ar score Op filename ...
.Nm
.Ar find-spam Op filename ...
.Nm
.Ar find-good Op filename ...
.Nm
.Ar good Op filename ...
.Nm
.Ar spam Op filename ...
.Nm
.Ar remove Op filename ...
.Nm
.Ar dump
.Nm
.Ar export
.Nm
.Ar import Op filename ...
.Sh DESCRIPTION
Welcome to
.Nm SpamProbe !
Are you tired of the constant bombardment of your inbox by unwanted
email pushing everything from porn to get-rich-quick schemes?
Have you tried other spam filters but become disenchanted with them when
you realized that their manually generated rule sets weren't updated fast
enough to keep up with spammers' wording changes, or that they generated
unwanted false positives?
.Pp
.Nm SpamProbe
operates on a different basis entirely.
Instead of using pattern matching and a set of human-generated rules,
.Nm SpamProbe
relies on a Bayesian analysis of the frequency of words used in spam and
non-spam emails received by an individual person.
The process is completely automatic and tailors itself to the kinds of
emails that each person receives.
.Ss FEATURES
.Bl -bullet -offset indent -compact
.It
Spam detection using Bayesian analysis of terms contained in each email.
Words used often in spam but not in good email tend to indicate that a
message is spam.
.It
Written in C++ for good performance.
Database access using GDBM for quick startup and fast term count retrieval.
.It
Recognition and decoding of MIME attachments in quoted-printable and
base64 encoding.
Automatically skips non-text attachments.
.It
Counts two-word phrases as well as single words for higher precision.
.It
Ignores HTML tags in emails for scoring purposes unless the
.Fl h
command line option is used.
Many spam messages use HTML and few humans do, so HTML tends to become a
powerful recognizer of spam.
However, in the author's opinion this also substantially increases the
likelihood of false positives if someone does send a non-spam email
containing HTML tags.
.Nm SpamProbe
does, however, pull URLs from inside HTML tags, since those tend to be
spammer specific.
.It
Locks mboxes and databases using fcntl file locking to avoid problems when
multiple emails arrive simultaneously.
.It
Scores only the Received, Subject, To, From, and Cc headers.
All other headers are ignored to make it hard for spammers to hide
non-spammy words in X- headers to fool the filter.
The
.Fl H
command line option can be used to override this.
.El
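.Pp
As a rough illustration of the Bayesian scoring described above, the
following Python sketch combines per-term spam probabilities into a single
message score.
It is not taken from the
.Nm
sources: it assumes the combining rule from Paul Graham's
.Dq A Plan for Spam
(see
.Sx SEE ALSO ) ,
and the function name
.Ql combine
and the sample probabilities are hypothetical.
.Bd -literal -offset indent
```python
from math import prod


def combine(probabilities):
    """Combine per-term spam probabilities into one message score.

    Assumed rule: P = S / (S + G), where S is the product of the term
    probabilities and G is the product of their complements.
    """
    spam = prod(probabilities)
    good = prod(1.0 - p for p in probabilities)
    return spam / (spam + good)


# Hypothetical per-term probabilities for two short messages.
spammy = combine([0.99, 0.95, 0.80])
hammy = combine([0.05, 0.10, 0.40])
print(round(spammy, 3), round(hammy, 3))
```
.Ed
.Pp
Terms with probabilities near 0 or 1 dominate the result, which is why a
few strongly spam-specific words can be enough to classify a message.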
.Ss OPTIONS
.Bl -tag -width ".Fl d Ar directory"
.It Fl a Ar char
By default
.Nm
converts non-ASCII characters (characters with the most significant bit
set to 1) into the letter 'z'.
This is useful for lumping all Asian characters into a single word for
easy recognition.
The
.Fl a
option allows you to change the character to something else if you don't
like the letter 'z' for some reason.
.It Fl c
Create the database directory if it does not already exist.
Normally
.Nm
exits with a usage error if the database directory does not already exist.
.It Fl d Ar directory
By default
.Nm
stores its database in a directory named
.Pa .spamprobe
under your home directory.
The
.Fl d
option allows you to specify a different directory to use.
This is necessary if your home directory is NFS mounted, for example.
.It Fl h
By default
.Nm
removes HTML markup from the text in emails to help avoid false positives.
The
.Fl h
option allows you to override this behavior and force
.Nm
to include words from within HTML tags in its word counts.
Note that
.Nm
always counts any URLs in hrefs within tags whether
.Fl h
is used or not.
Use of this option is discouraged.
It can increase the rate of spam detection slightly, but unless the user
receives a significant amount of HTML email it also tends to increase the
number of false positives.
.It Fl H Ar option
By default
.Nm
scans only a subset of headers (Received, Subject, To, From, and Cc) from
the email message when searching for words to score.
The
.Fl H
option allows the user to specify additional headers to scan.
Legal values are "all", "nox", or "normal".
"all" scans all headers, "nox" scans all headers except those starting
with X-, and "normal" scans the normal set of headers.
.It Fl m
Use mbox format for reading emails in receive mode.
Normally
.Nm
assumes that the input to receive mode contains a single message, so it
doesn't look for message breaks.
.It Fl n Ar number
Changes the number of most significant words/phrases used by
.Nm
to calculate the score for each message.
Generally this is changed only for optimization purposes.
.It Fl r Ar number
Changes the number of times that a single word/phrase can occur in the
top words array used to calculate the score for each message.
Allowing repeats reduces the number of distinct words overall (since a
single word occupies more than one slot) but allows words which occur
frequently in the message to have a higher weight.
Generally this is changed only for optimization purposes.
.It Fl s Ar number
.Nm
maintains an in-memory cache of the words it has seen in previous messages
to reduce disk I/O and improve performance.
By default the cache is flushed and cleared every 250 messages.
This number can be changed using the
.Fl s
option.
A value of zero causes
.Nm
to use 100,000 as the limit, which effectively means that the cache will
only be flushed at program exit (unless you have really enormous mailbox
files).
The cache doesn't affect receive, dump, or export but has a significant
impact on the other commands.
.It Fl v
Write debugging information to stderr.
This can be useful for debugging or for seeing which terms
.Nm
used to score each email.
.It Fl V
Prints version and copyright information and then exits.
.It Fl Y
Assume traditional Berkeley mailbox format, ignoring any Content-Length:
fields.
.It Fl 7
Ignore any characters with the most significant bit set to 1 instead of
mapping them to the letter 'z'.
.It Fl 8
Store all characters even if their most significant bit is set to 1.
.El
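.Pp
The three character-handling behaviors selected by
.Fl a ,
.Fl 7 ,
and
.Fl 8
can be pictured with the Python sketch below.
It only mirrors the behavior documented above and is not the
.Nm
implementation.
The function name
.Ql map_high_bit ,
its
.Ql mode
values, and the byte-by-byte processing are assumptions made for
illustration.
.Bd -literal -offset indent
```python
def map_high_bit(data, mode="default", replacement="z"):
    """Apply the documented handling of bytes with the high bit set.

    mode "default": replace each such byte (the -a option changes the
                    replacement character)
    mode "7":       drop the byte (-7)
    mode "8":       keep the byte unchanged (-8)
    """
    out = []
    for byte in data:
        if byte < 0x80 or mode == "8":
            out.append(byte)
        elif mode == "default":
            out.append(ord(replacement))
        # mode "7": bytes with the high bit set are skipped
    return bytes(out)


# Hypothetical input: three ASCII bytes followed by one high-bit byte.
sample = b"caf" + bytes([0xE9])
print(map_high_bit(sample))
print(map_high_bit(sample, mode="7"))
print(map_high_bit(sample, replacement="q"))
```
.Ed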
.Ss COMMANDS
.Bl -tag -width ".Ar find-spam Op filename ..."
.It Ar receive Op filename ...
Tells
.Nm
to read its standard input (or a file specified after the receive command)
and score it using the current databases.
Once the message has been scored, it is classified as either spam or
non-spam and its word counts are written to the appropriate database.
The message's score is written to stdout along with a single word giving
the classification.
For example:
.Pp
.Dl "SPAM 0.99"
.Pp
or
.Pp
.Dl "GOOD 0.02"
.It Ar score Op filename ...
Similar to receive except that the databases are not modified in any way
and only the score is printed to stdout.
.It Ar find-spam Op filename ...
Similar to score except that it prints a short summary and score for each
message that is determined to be spam.
This can be useful when testing.
.It Ar find-good Op filename ...
Similar to score except that it prints a short summary and score for each
message that is determined to be good.
This can be useful when testing.
.It Ar good Op filename ...
Scans each file (or stdin if no file is specified) and reclassifies every
email in the file as non-spam.
The databases are updated appropriately.
Previously processed messages (recognized using their message IDs) are
ignored.
.It Ar spam Op filename ...
Scans each file (or stdin if no file is specified) and reclassifies every
email in the file as spam.
The databases are updated appropriately.
Previously processed messages (recognized using their message IDs) are
ignored.
.It Ar remove Op filename ...
Scans each file (or stdin if no file is specified) and removes the term
counts of each message from the database.
Messages which are not in the database (recognized using their message IDs)
are ignored.
.It Ar dump
Prints the contents of the word counts database one word per line in
human-readable format, with good count, spam count, and word in columns
separated by whitespace.
Note that when using GDBM for the database the words are printed in the
order they are hashed, so the results will need to be sorted to be most
useful.
The standard Unix sort command can do this.
For example, to list all words from "most good" to "least good" use this
command:
.Pp
.Dl "spamprobe dump | sort -k 1 -n -r"
.Pp
To list all words from "most spammy" to "least spammy" use this command:
.Pp
.Dl "spamprobe dump | sort -k 2 -n -r"
.It Ar export
Similar to the dump command but prints the counts and words in a
comma-separated format with the words surrounded by double quotes.
This can be more useful for importing into some databases.
.It Ar import Op filename ...
Reads the specified files, which must contain data written by the export
command.
The terms and counts from these files are added to the database.
This can be used to convert a database from a prior version.
.El
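.Pp
For processing beyond what sort offers, the dump output can be read by a
small script.
The Python sketch below assumes the column layout described for the dump
command (good count, spam count, then the term) and that a two-word phrase
appears as a term containing a space.
The function name
.Ql parse_dump
and the sample lines are hypothetical and are not real
.Nm
output.
.Bd -literal -offset indent
```python
def parse_dump(lines):
    """Parse dump lines of the form: good_count spam_count term."""
    rows = []
    for line in lines:
        # Split off the two counts; the remainder is the term, which is
        # assumed to contain a space when it is a two-word phrase.
        fields = line.split(None, 2)
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        good, spam, term = fields
        rows.append((int(good), int(spam), term.strip()))
    return rows


# Hypothetical sample lines.
sample = [
    "12 0 meeting",
    "1 40 viagra",
    "3 7 click here",
]
rows = parse_dump(sample)
most_spammy = sorted(rows, key=lambda row: row[1], reverse=True)
print(most_spammy[0][2])
```
.Ed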
.Sh ENVIRONMENT
The
.Nm
command looks for the database directory in the user's home directory,
as specified by the
.Ev HOME
environment variable.
Use the
.Fl d
flag to specify a different database directory.
.Sh FILES
.Bl -tag -width ".Pa $HOME/. Ns Nm" -compact
.It Pa $HOME/. Ns Nm
The default database directory.
.El
.Sh EXAMPLES
Typically one would use
.Nm
with
.Nm procmail
and
.Nm formail
to flag and filter incoming email.
.Pp
.Dl "# SpamProbe rule."
.Dl ":0"
.Dl "{"
.Dl " # Generate a score for the message."
.Dl " SCORE=`spamprobe receive`"
.Dl " # Add an X-SpamProbe header to the message."
.Dl " :0 fhW"
.Dl " | formail -I ""X-SpamProbe: $SCORE"""
.Dl "}"
.Pp
.Dl "# Filter matching messages to their own mailbox."
.Dl ":0:"
.Dl "*^X-SpamProbe: SPAM"
.Dl "spamprobe"
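.Pp
Scripts that call
.Nm
outside of procmail can split the result line in the same spirit as the
header match above.
The Python sketch below assumes the result has the two-field form shown for
the receive command, such as
.Dq SPAM 0.99 .
The function name
.Ql parse_result
is hypothetical.
.Bd -literal -offset indent
```python
def parse_result(line):
    """Split a result line such as "SPAM 0.99" into (is_spam, score)."""
    label, score = line.split()
    return label == "SPAM", float(score)


# Sample lines copied from the receive command description.
print(parse_result("SPAM 0.99"))
print(parse_result("GOOD 0.02"))
```
.Ed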
.Sh DIAGNOSTICS
Exit status is 0 on success, and 1 if
.Nm
encounters an invalid command.
.Sh COMPATIBILITY
Versions of
.Nm
prior to 0.7 use a different database format.
To convert your existing database to the new format, use the following
command:
.Pp
.Dl "spamprobe-export_0.6 | spamprobe import"
.Sh SEE ALSO
.Xr formail 1 ,
.Xr procmail 1 ,
.Rs
.%A "Paul Graham"
.%T "A Plan for Spam"
.%O http://www.paulgraham.com/spam.html
.%D "August 2002"
.Re
.Sh AUTHORS
This manual page was written by
.An Matthew N. Dodd Aq mdodd@FreeBSD.org .
.Nm
was written by
.An Brian Burton Aq bburton@users.sourceforge.net .