technique described in Cavnar & Trenkle, "N-Gram-Based Text Categorization".
It was primarily developed for language guessing, a task on which it is known to
perform with near-perfect accuracy.
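
The rank-order comparison at the heart of the Cavnar & Trenkle technique can be sketched as follows. This is a simplified illustration, not libtextcat's code: the function names, the toy training sentences, the n-gram length limit and the profile size are all assumptions made for this sketch.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top=300):
    """Return the most frequent character n-grams of text, most frequent first."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile cost the maximum."""
    ranks = {gram: rank for rank, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(abs(rank - ranks[gram]) if gram in ranks else max_penalty
               for rank, gram in enumerate(doc_profile))


def guess_language(text, profiles):
    """Pick the language whose profile is closest to the document's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Toy training samples (hypothetical; real profiles are built from large corpora).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun",
    "nl": "de snelle bruine vos springt over de luie hond en dan slaapt de hond in de zon",
}
PROFILES = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}
```

Calling guess_language(some_text, PROFILES) returns the key of the nearest profile. With profiles this small the result on unseen text is unreliable; the sketch only shows the mechanics.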
WWW: http://software.wise-guys.nl/libtextcat/
.strings files must be distributed in ASCII encoding, which generally
isn't a convenient encoding to do translation in. As an example, it's rather
difficult to enter Chinese characters into an ASCII-encoded text file.
Localize will, with any luck, help out with this. Currently it's just a
shell of an application, but sometime in the future I hope to complete it.
WWW: http://www.eskimo.com/~pburns/Localize/
It provides a shared library to parse, generate, manipulate and
validate XML documents from within your own application.
(Linux version)
WWW: http://xml.apache.org/xerces-c/
PR: ports/105275
Submitted by: Alexander Logvinov <ports at logvinov.com>
2. Commercial license is also available for embedded use.
Generally, it's a standalone search engine, meant to provide fast,
size-efficient and relevant fulltext search functions to other
applications. Sphinx was specially designed to integrate well with SQL
databases and scripting languages. Currently built-in data sources
support fetching data either via direct connection to MySQL, or from
an XML pipe.
As for the name, Sphinx is an acronym which is officially decoded as
SQL Phrase Index.
WWW: http://www.sphinxsearch.com/
PR: ports/105649
Submitted by: Matthew Seaman <m.seaman at infracaninophile.co.uk>
Unicode::Unihan - The Unihan Data Base 3.2.0
use Unicode::Unihan;
my $db = new Unicode::Unihan;
print join("," => $db->Mandarin("\x{5c0f}\x{98fc}\x{5f3e}")), "\n";
This module provides a user-friendly interface to the Unicode Unihan
Database 3.2. With this module, using the Unihan database is as easy as
shown above.
WWW: http://search.cpan.org/dist/Unicode-Unihan/
2006-11-05 deskutils/offix-trash: development ceased in 1996
2006-11-04 devel/mingw: use mingw32-* ports instead
2006-11-04 devel/mingw-binutils: use mingw32-* ports instead
2006-11-04 devel/mingw-bin-msvcrt: use mingw32-* ports instead
2006-11-04 devel/mingw-gcc: use mingw32-* ports instead
2006-11-04 devel/mingw-opengl-headers: use mingw32-* ports instead
2006-11-05 editors/offix-editor: development ceased in 1996
2006-11-05 print/offix-printer: development ceased in 1996
2006-11-05 sysutils/wmmon: no longer available from mastersite
2006-11-04 sysutils/xsysinfo: no longer available from mastersite
2006-11-04 textproc/xmlada: no longer available from mastersite; 2.0 is available
2006-11-05 www/p5-CGI-Application-ValidateRM: no longer available from mastersites
2006-11-05 x11/offix-clipboard: development ceased in 1996
2006-11-05 x11/offix-execute: development ceased in 1996
2006-11-05 x11-fm/offix-files: development ceased in 1996
2006-11-05 x11-wm/icepref: is for IceWM version 1.04 (6 years old)
Cocoa libraries. The GNUstep port that can be found here was done by me. It
was very easy to do, primarily requiring only new interface files and build
files.
PR: 104964
Submitted by: Gürkan Sengün
is simple: Using "Text::ExtractWords" and "Lingua::StopWords" from CPAN,
it determines how many of the known stopwords the document contains for
each language supported by "Lingua::StopWords".
Each word in the document recognized as a stopword of a particular
language scores one point for this language.
The "language_guess()" function takes a document as a parameter and
returns the abbreviation of the language that it is most likely written
in.
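
The scoring scheme described above can be sketched in a few lines. This is an illustration only, not the module's Perl implementation: the stopword lists are tiny hypothetical stand-ins for the ones in Lingua::StopWords, and counting every occurrence of a stopword (rather than each distinct word once) is an assumption.

```python
import re

# Tiny illustrative stopword lists (hypothetical; Lingua::StopWords ships far larger ones).
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "zu", "in"},
    "fr": {"le", "la", "les", "et", "est", "de", "un", "que"},
}


def language_scores(document):
    """Score one point per word occurrence that is a stopword of a language."""
    words = re.findall(r"[^\W\d_]+", document.lower())
    return {lang: sum(1 for word in words if word in stops)
            for lang, stops in STOPWORDS.items()}


def language_guess(document):
    """Return the abbreviation of the highest-scoring language."""
    scores = language_scores(document)
    return max(scores, key=scores.get)
```

Words shared between languages, such as "in" here, score for each of them, so very short documents can be ambiguous.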
Author: Mike Schilli <cpan@perlmeister.com>
WWW: http://search.cpan.org/~mschilli/Text-Language-Guess-0.02/
PR: ports/103571
Submitted by: Masahiro Teramoto <markun@onohara.to>
ffe is a program for extracting fields from flat file records and
displaying them in different formats. ffe relies on the configuration file
to control input file structure and the output format.
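
The general idea of a layout-driven field extractor can be sketched as below. This does not reproduce ffe's configuration syntax or behaviour; the layout table, field names and helper functions are hypothetical and exist only to illustrate separating the input structure from the output format.

```python
# Hypothetical record layout: (field name, start offset, length).
LAYOUT = [("id", 0, 4), ("name", 4, 10), ("city", 14, 8)]


def extract_fields(record, layout=LAYOUT):
    """Slice one fixed-length record into a dict of stripped field values."""
    return {name: record[start:start + length].strip()
            for name, start, length in layout}


def format_record(fields, names, separator=","):
    """Render the selected fields in the requested order."""
    return separator.join(fields[name] for name in names)
```

Changing LAYOUT alters how records are read, while the arguments to format_record decide what is printed, which mirrors the split between input structure and output format.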
WWW: http://sourceforge.net/projects/ff-extractor/
Author: Timo Savinen <tjsa@iki.fi>
arbitrary text and also allows you to mark up a text as HTML
with the keywords.
A Hatena keyword is an element in a suite of web sites
*.hatena.ne.jp having blogs and social bookmarks among others.
Please refer to http://d.hatena.ne.jp/keyword/ (in Japanese) for details.
In Hatena Diary, a blog hosting service, a Hatena keyword found in
a posting is linked to the keywords page automatically.
You can implement the same kind of feature outside Hatena using this module.
It queries the Hatena Keyword Link API internally to retrieve terms.
Author: Naoya Ito <naoya@bloghackers.net>
WWW: http://search.cpan.org/~naoya/Hatena-Keyword-0.04/
PR: ports/102794
Submitted by: Masahiro Teramoto <markun(at)onohara.to>
This is a smaller, cheaper, faster SED implementation. Minix uses it. GNU
used to use it, until they built their own sed around an extended (some
would say over-extended) regexp package.
For embedded use we searched for a tiny sed implementation especially for
use with the dietlibc and found Eric S. Raymond's sed implementation quite
handy, though it suffered from several bugs and was no longer under active
maintenance. After sending a bunch of fixes we agreed to continue maintaining
this lovely, historic sed implementation.
Along with a lot of fixes and cleanups, further speedups, some missing
features, and POSIX conformance, we also added a test suite to the package,
so regressions are quickly and easily uncovered.
WWW: http://www.exactcode.de/oss/minised/
Author: ExactCode <info@exactcode.de>
Basically, this package contains:
- Functions to automatically adjust and cycle the section underline
decorations;
- A mode that displays the table of contents and allows you to jump anywhere
from it;
- Functions to insert and automatically update a TOC in your source
document;
- A mode which supports font-lock highlighting of reStructuredText
structures;
- Some other convenience functions.
This package is the result of merging:
- restructuredtext.el
- rst-mode.el
- rst-html.el
Those files are now OBSOLETE and have been replaced by this single
package file (2005-10-30).
WWW: http://docutils.sourceforge.net/docs/user/emacs.html
PR: ports/102384
Submitted by: Denis Shaposhnikov <dsh at vlink.ru>
Perl. Everything is implemented as a small plugin and you can mash
them up together using Plagger core API and plugin hooks. You can
think of Plagger as a blosxom or qpsmtpd for RSS aggregation.
WWW: http://plagger.org/
WARNING: This port depends on thousands of ports, especially with
full options.
xxdiff is a computer program that allows a user (usually a software
developer of some sort) to easily visualize the differences between
files. The manner and goal for which this process is applied over
multiple files is highly dependent on the application, and most of
the time is driven by custom user scripts.
For example, a configuration management engineer in a company might
provide some kind of merge policing environment, that allows software
developers to review changes in files for the purpose of accepting or
rejecting a submitted changeset to a codebase. Another example is
that of a developer wishing to review the changes he made to a
checkout of files from a source-code management system such as CVS,
Subversion, ClearCase, Perforce, etc.
WWW: http://furius.ca/xxdiff/doc/xxdiff-scripts.html
Flex is a tool for generating scanners. A scanner, sometimes called a
tokenizer, is a program which recognizes lexical patterns in text. The
flex program reads user-specified input files, or its standard input
if no file names are given, for a description of a scanner to generate.
The description is in the form of pairs of regular expressions and C
code, called rules. Flex generates a C source file named "lex.yy.c",
which defines the function yylex(). The file "lex.yy.c" can be compiled
and linked to produce an executable. When the executable is run, it
analyzes its input for occurrences of text matching the regular
expressions for each rule. Whenever it finds a match, it executes the
corresponding C code.
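
The rule-driven matching loop described above can be imitated in miniature. The sketch below is not how flex works internally (flex compiles its rules into C tables ahead of time); it only interprets a hypothetical rule list at run time, preferring the longest match and, on a tie, the earliest rule.

```python
import re

# Each rule pairs a regular expression with an action, mirroring the
# pattern/C-code pairs described above. These rules are illustrative only.
RULES = [
    (r"[0-9]+", lambda text: ("NUMBER", int(text))),
    (r"[A-Za-z_][A-Za-z0-9_]*", lambda text: ("IDENT", text)),
    (r"[-+*/=]", lambda text: ("OP", text)),
    (r"\s+", lambda text: None),  # whitespace produces no token
]
COMPILED = [(re.compile(pattern), action) for pattern, action in RULES]


def scan(source):
    """Yield tokens by applying the longest-matching rule at each position."""
    pos = 0
    while pos < len(source):
        best_match, best_action = None, None
        for regex, action in COMPILED:
            match = regex.match(source, pos)
            # Prefer the longest match; on ties the earliest rule wins.
            if match and match.end() > pos and (
                    best_match is None or match.end() > best_match.end()):
                best_match, best_action = match, action
        if best_match is None:
            raise ValueError(f"no rule matches at position {pos}")
        token = best_action(best_match.group())
        if token is not None:
            yield token
        pos = best_match.end()
```

Here scan() plays the role that yylex() plays in a generated scanner: each call to the underlying loop finds a match and runs the associated action.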
WWW: http://flex.sourceforge.net/
Note that there's flex 2.5.4 in the base system. This port provides
a newer version for programs that require it, textproc/xxdiff for one.
This module provides functions that deal with formatting data with
Content-Type 'text/plain; format=flowed' as described in RFC2646
(http://www.rfc-editor.org/rfc/rfc2646.txt). In a nutshell,
format=flowed text solves the problem in plain text files where it
is not known which lines can be considered a logical paragraph,
enabling lines to be automatically flowed (wrapped and/or joined)
as appropriate when displaying.
In format=flowed, a soft newline is expressed as " \n", while hard
newlines are expressed as "\n". Soft newlines can be automatically
deleted or inserted as appropriate when the text is reformatted.
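
The soft/hard newline convention can be illustrated with a short sketch. It is not the Text::Flowed API: the function names are hypothetical, and quoting, space-stuffing and the "-- " signature separator defined by the RFC are deliberately left out.

```python
import textwrap


def unflow(text):
    """Join lines ending in a soft newline (" \\n") into logical paragraphs."""
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after a final newline
    paragraphs, current = [], ""
    for line in lines:
        if line.endswith(" "):
            # Soft newline: keep the trailing space and continue the paragraph.
            current += line
        else:
            # Hard newline: the paragraph ends with this line.
            paragraphs.append(current + line)
            current = ""
    if current:
        paragraphs.append(current.rstrip(" "))
    return paragraphs


def flow(paragraph, width=72):
    """Wrap one paragraph, ending every line but the last with a soft newline."""
    lines = textwrap.wrap(paragraph, width=width - 1,
                          break_long_words=False, break_on_hyphens=False)
    return " \n".join(lines) + "\n"
```

Because flow() only inserts soft newlines, unflow(flow(p)) gives back the original paragraph when p is a single line of words separated by single spaces, whatever width was used for display.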
WWW: http://search.cpan.org/dist/Text-Flowed/
Justification: socialtext dependency
This provides a simple interface to Plucene. Plucene is large and
multi-featured, and it is expected that users will subclass it, and tie all the
pieces together to suit their own needs. Plucene::Simple is, therefore,
just one way to use Plucene. It's not expected that it will do exactly
what *you* want, but you can always use it as an example of how to
build your own interface.
WWW: http://search.cpan.org/dist/PluceneSimple/
Justification: socialtext dependency
Quirks: 1/6 test fails