support CJK texts natively. This module encodes terms in MIME::Base64
format to get around this problem. Texts are assumed to be in UTF-8
encoding.
WWW: http://search.cpan.org/dist/Plucene-Analysis-CJKAnalyzer/
PR: ports/114376
Submitted by: Gea-Suan Lin <gslin at gslin.org>
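The Base64 workaround described above can be illustrated in Python (not the
module's own Perl). This is only a minimal sketch, assuming each term is a
UTF-8 string; the helper names encode_term and decode_term are hypothetical
and are not part of the module's API.

```python
import base64


def encode_term(term: str) -> str:
    """Encode a UTF-8 term as ASCII-only Base64 text (hypothetical helper)."""
    return base64.b64encode(term.encode("utf-8")).decode("ascii")


def decode_term(token: str) -> str:
    """Reverse encode_term: Base64 text back to the original UTF-8 term."""
    return base64.b64decode(token).decode("utf-8")


encoded = encode_term("中文")
print(encoded)               # ASCII-only token that an ASCII-oriented index can store
print(decode_term(encoded))  # the original CJK term
```

Because the encoded form is plain ASCII, an indexer without native CJK
support can treat it like any other term.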
OpenBSD. It lacks some features of GNU sort. It is a proposed project idea
to replace the GNU sort with this one, but it needs to be completed first.
Patches are highly appreciated.
WWW: http://www.freebsd.org/projects/ideas/#p-bsdtexttools
Obtained from: OpenBSD
OpenBSD. It lacks some features of GNU grep. It is a proposed project idea
to replace the GNU grep with this one, but it needs to be completed first.
Patches are highly appreciated.
WWW: http://www.freebsd.org/projects/ideas/#p-bsdtexttools
Obtained from: OpenBSD
OpenBSD. It lacks some features of GNU diff. It is a proposed project idea
to replace the GNU diff with this one, but it needs to be completed first.
Patches are highly appreciated.
WWW: http://www.freebsd.org/projects/ideas/#p-bsdtexttools
Obtained from: OpenBSD
The Open Text Summarizer is an open source tool for summarizing texts.
The program reads a text and decides which sentences are important and
which are not.
WWW: http://libots.sourceforge.net/
Inspired by: Debian Package of the Day
Based on: OpenBSD port
2007-01-01 textproc/ruby-html-parser: distfile and homepage disappeared
2007-03-10 textproc/ruby-libxslt: Broken on all supported versions of FreeBSD
2007-05-26 www/py-htmltestcase: Upstream site disappeared and dependency is set to expire
html2text is a Python script that converts a page of HTML into clean,
easy-to-read plain ASCII text. Better yet, that ASCII also happens to
be valid Markdown (a text-to-HTML format).
WWW: http://www.aaronsw.com/2002/html2text/
Author: Aaron Swartz <me@aaronsw.com>
Inspired by: pkgsrc package
2007-04-10 textproc/ocaml-yaxi: Does not build
2007-04-10 ukrainian/pine.language: Leaves behind config file on deinstall
2007-04-10 www/mod_zap: Incomplete pkg-plist
2007-04-10 www/sahana2: Conflicting dependencies: php4 vs php5
2007-04-10 www/urchin5: Does not install
2007-04-07 databases/cyrus-smlacapd: this software is obsolete
Simple Blog Code is a simple markup language. You can use it for guest
books, blogs, wikis, boards and various other web applications. It
produces valid and semantic (X)HTML from input and is patterned on
tiny Usenet markups like *bold* and _underline_.
pdfoutline adds outlines (aka bookmarks) to PDF files. It reads input
file given as first argument, adds outlines from text file given as
second argument, and saves result to file with name given as third
argument.
WWW: http://sourceforge.net/projects/fntsample/
Author: Eugeniy Meshcheryakov <eugeniy@users.sourceforge.net>
It is a generic syntax highlighter for general use in all kinds of software
such as forum systems, wikis or other applications that need to prettify
source code. Highlights are:
* a wide range of common languages and markup formats is supported
* special attention is paid to details, increasing quality by a fair amount
* support for new languages and formats is added easily
* a number of output formats, presently HTML, LaTeX, RTF and ANSI sequences
* it is usable as a command-line tool and as a library
WWW: http://pygments.org/
Data::SpreadPagination can be used to create an easy to use spread pagination
navigator. It inherits from Data::Page, and in addition provides methods to
create a pagination spread, keeping pagenumbers displayed within a sensible
limit.
WWW: http://search.cpan.org/dist/Data-SpreadPagination/
PR: ports/110677
Submitted by: Sergei Vyshenski <svysh@pn.sinp.msu.ru>
Russian and German Languages. Version 2.
Finds the lemmas (all forms) of a word.
Written in C++.
WWW: http://www.aot.ru/
- Andrei V. Shetuhin
slonik-v-domene@mail.rureki@reki.ru
PR: ports/110137
Submitted by: Andrei V. Shetuhin
is a bit different on these points:
(1) The project is end-user oriented, that is, it tries to hide as much
as possible the latex compiling stuff by providing a single clean
script to produce directly DVI, PostScript and PDF output.
(2) The actual output rendering is done not only by the XSL stylesheets
transformation, but also by a dedicated LaTeX package. The purpose is
to allow a deep LaTeX customisation without changing the XSL
stylesheets.
(3) Post-processing is done by Python, to make publication faster,
convert the images if needed, and do the whole compilation.
WWW: http://dblatex.sourceforge.net/
PR: ports/109520
Submitted by: Peter Johnson <johnson.peter at gmail.com>
and at the same time be as close as possible to the original Java API.
This has the combined advantage of providing perl programmers with a
well-documented API and giving them access to a C++ search engine
library that is supposedly faster than the original.
WWW: http://search.cpan.org/dist/Lucene/
WWW: http://sourceforge.net/projects/clucene/
2006-12-30 textproc/ruby-htmlcompact: distfile and homepage disappeared
2006-12-30 textproc/ruby-rwv2: distfile disappeared and has no homepage
Approved by: erwin (mentor, implicit)
It can be used to programmatically access external HTML pages.
I hope to extend it to become a web-publishing framework in the future.
Author: Johannes Brodwall <johannes@brodwall.com>
WWW: http://rubyforge.org/projects/ruby-htmltools/
to another. It can read markdown and (subsets of) reStructuredText,
HTML, and LaTeX, and it can write markdown, reStructuredText, HTML,
LaTeX, DocBook, RTF, and S5 HTML slide shows.
Pandoc extends standard markdown syntax with footnotes, embedded LaTeX,
and other features. A compatibility mode is provided for those who
need a drop-in replacement for Markdown.pl. Included wrapper scripts
make it easy to convert markdown documents to PDFs and to convert web
pages to markdown documents.
In contrast to existing tools for converting markdown to HTML, which
use regex substitutions, pandoc has a modular design: it consists of a
set of readers, which parse text in a given format and produce a native
representation of the document, and a set of writers, which convert
this native representation into a target format. Thus, adding an input
or output format requires only adding a reader or writer.
WWW: http://sophos.berkeley.edu/macfarlane/pandoc/
PR: ports/109028
Submitted by: John MacFarlane <jgm at berkeley.edu>
Approved by: miwi (mentor)
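The reader/writer design described above can be sketched in a few lines of
Python. This is not pandoc's code (pandoc is written in Haskell); the
function names, the tiny markdown subset, and the (kind, content) block
representation are all assumptions made for illustration.

```python
# Hypothetical mini-converter: readers parse text into a shared native
# representation; writers render that representation to a target format.

def read_markdown(text):
    """Reader: parse a tiny markdown subset into (kind, content) blocks."""
    blocks = []
    for chunk in text.strip().split("\n\n"):
        if chunk.startswith("# "):
            blocks.append(("heading", chunk[2:].strip()))
        else:
            blocks.append(("para", " ".join(chunk.split())))
    return blocks


def write_html(blocks):
    """Writer: render blocks as HTML (no escaping, for brevity)."""
    tags = {"heading": "h1", "para": "p"}
    return "\n".join(f"<{tags[k]}>{c}</{tags[k]}>" for k, c in blocks)


def write_latex(blocks):
    """Writer: render blocks as LaTeX (no escaping, for brevity)."""
    out = []
    for kind, content in blocks:
        out.append(f"\\section{{{content}}}" if kind == "heading" else content)
    return "\n\n".join(out)


READERS = {"markdown": read_markdown}
WRITERS = {"html": write_html, "latex": write_latex}


def convert(text, source, target):
    """Any reader can be combined with any writer via the native form."""
    return WRITERS[target](READERS[source](text))


doc = "# Title\n\nSome text\nhere."
print(convert(doc, "markdown", "html"))
print(convert(doc, "markdown", "latex"))
```

Supporting a new output format in this sketch means adding one entry to
WRITERS; the readers are untouched.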
for parsing, generating, and processing HTML, XML or other textual content
for output generation on the web. The major feature is a template language,
which is heavily inspired by Kid.
WWW: http://genshi.wedgewall.org/
Approved by: alexbl (mentor, implicit)
algorithms can either be applied directly to a dataset or called from your own
Java code. Weka contains tools for data pre-processing, classification,
regression, clustering, association rules, and visualization. It is also
well-suited for developing new machine learning schemes.
WWW: http://www.cs.waikato.ac.nz/ml/weka/
PR: ports/108143
Submitted by: Simon Olofsson <simon at olofsson.de>
Just select the text, click on the service item menu, choose
"Return the LaTeX rendering" and voila! Your text is replaced by
its LaTeX rendering.
WWW: http://www.roard.com/latexservice/
streams. It supports the whole XML 1.0 specifications, and can parse
any file that follows this standard (including the contents of the
DTD).
It also provides support for a number of other standards associated
with XML, like SAX and DOM.
In addition, it includes a module to manipulate Unicode streams, since
this is required by the XML standard.
This version of XML/Ada is designed to be used with lang/gnat-gcc41.
WWW: https://libre2.adacore.com/xmlada/
WWW: http://gnuada.sourceforge.net/
Author: Petr Holub <hopet@ics.muni.cz>
PR: ports/107180
Submitted by: hopet at ics.muni.cz
LuceneKit is a class-to-class port of Lucene in GNUstep. It is a technology
suitable for nearly any application that requires full-text search.
WWW: http://www.etoile-project.org/
It uses OniGuruma as regular expression engine.
This is a GNUstep fork of OgreKit 2.1.2
<http://www8.ocn.ne.jp/~sonoisa/OgreKit/>.
Since it is a fork, the API may differ in the future.
The original licence of OgreKit is the BSD License.
This fork also uses the BSD license (see the COPYING document).
WWW: http://www.etoile-project.org/
a classic GNU-style ChangeLog from a subversion repository log. It is made
from several changelog-like scripts using common xslt constructs found in
different places.
WWW: http://ch.tudelft.nl/~arthur/svn2cl/
PR: ports/107007
Submitted by: Alexander Logvinov <ports at logvinov.com>
a stack of flashcards, but handles one-to-many and many-to-one word
relationships better, and includes an integrated scheduler for efficient use
of your 'cards'. Popup was written by Bjorn Ghola and Rob Burns.
Features:
* An editor for cardstack files with support for copying and pasting groups
of words, as well as drag and drop.
* Three quiz styles: multiple choice, spelling, and flashcard.
* Supports quizzes and practice
* Graduated time interval scheduler.
* Localized for Thai and German.
WWW: http://popup.sourceforge.net/
software tool that converts the plain text formatting to (X)HTML. The
formatting syntax is designed to be easy and intuitive for web authors
and resembles typical email formatting conventions. The resultant
(X)HTML is structurally valid.
WWW: http://www.freewisdom.org/projects/python-markdown
PR: ports/105992
Submitted by: Graham Todd <gtodd at bellanet.org>
technique described in Cavnar & Trenkle, "N-Gram-Based Text Categorization".
It was primarily developed for language guessing, a task on which it is known to
perform with near-perfect accuracy.
WWW: http://software.wise-guys.nl/libtextcat/
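The N-gram technique named above can be sketched as follows. This is a
simplified Python illustration of ranked n-gram profiles compared with an
"out-of-place" distance, not libtextcat's C implementation; the function
names, the toy training sentences, and the word padding with "_" are
assumptions.

```python
from collections import Counter


def ngram_profile(text, max_n=5, top=300):
    """Return the text's character n-grams ranked by descending frequency."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by count, then alphabetically so the ranking is deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:top]


def out_of_place(doc_profile, cat_profile):
    """Sum of rank differences; n-grams missing from the category cost the maximum."""
    rank = {g: i for i, g in enumerate(cat_profile)}
    max_penalty = len(cat_profile)
    return sum(abs(i - rank[g]) if g in rank else max_penalty
               for i, g in enumerate(doc_profile))


def classify(text, category_profiles):
    """Pick the category whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(category_profiles,
               key=lambda c: out_of_place(doc, category_profiles[c]))


# Toy training data (assumption); real profiles are built from large corpora.
training = {
    "english": "the quick brown fox jumps over the lazy dog and the cat",
    "german": "der schnelle braune fuchs springt über den faulen hund und die katze",
}
profiles = {lang: ngram_profile(t) for lang, t in training.items()}
print(classify("the dog and the fox", profiles))
```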
.strings files must be distributed in ASCII encoding, which generally
isn't a convenient encoding to do translation in. As an example, it's rather
difficult to enter Chinese characters into an ASCII encoded text file.
Localize will, with any luck, help out with this. Currently it's just a
shell of an application, but sometime in the future I hope to complete it.
WWW: http://www.eskimo.com/~pburns/Localize/
It provides a shared library to parse, generate, manipulate and
validate XML documents from within your own application.
(Linux version)
WWW: http://xml.apache.org/xerces-c/
PR: ports/105275
Submitted by: Alexander Logvinov <ports at logvinov.com>
2. Commercial license is also available for embedded use.
Generally, it's a standalone search engine, meant to provide fast,
size-efficient and relevant fulltext search functions to other
applications. Sphinx was specially designed to integrate well with SQL
databases and scripting languages. Currently built-in data sources
support fetching data either via direct connection to MySQL, or from
an XML pipe.
As for the name, Sphinx is an acronym which is officially decoded as
SQL Phrase Index.
WWW: http://www.sphinxsearch.com/
PR: ports/105649
Submitted by: Matthew Seaman <m.seaman at infracaninophile.co.uk>
Unicode::Unihan - The Unihan Data Base 3.2.0
use Unicode::Unihan;
my $db = new Unicode::Unihan;
print join("," => $db->Mandarin("\x{5c0f}\x{98fc}\x{5f3e}")), "\n";
This module provides a user-friendly interface to the Unicode Unihan
Database 3.2. With this module, using the Unihan database is as easy as
shown above.
WWW: http://search.cpan.org/dist/Unicode-Unihan/
2006-11-05 deskutils/offix-trash: development ceased in 1996
2006-11-04 devel/mingw: use mingw32-* ports instead
2006-11-04 devel/mingw-binutils: use mingw32-* ports instead
2006-11-04 devel/mingw-bin-msvcrt: use mingw32-* ports instead
2006-11-04 devel/mingw-gcc: use mingw32-* ports instead
2006-11-04 devel/mingw-opengl-headers: use mingw32-* ports instead
2006-11-05 editors/offix-editor: development ceased in 1996
2006-11-05 print/offix-printer: development ceased in 1996
2006-11-05 sysutils/wmmon: no longer available from mastersite
2006-11-04 sysutils/xsysinfo: no longer available from mastersite
2006-11-04 textproc/xmlada: no longer available from mastersite; 2.0 is available
2006-11-05 www/p5-CGI-Application-ValidateRM: no longer available from mastersites
2006-11-05 x11/offix-clipboard: development ceased in 1996
2006-11-05 x11/offix-execute: development ceased in 1996
2006-11-05 x11-fm/offix-files: development ceased in 1996
2006-11-05 x11-wm/icepref: is for IceWM version 1.04 (6 years old)
Cocoa libraries. The GNUstep port that can be found here, was done by me. It
was very easy to do; primarily requiring only new interface files, and build
files.
PR: 104964
Submitted by: Gürkan Sengün
is simple: Using "Text::ExtractWords" and "Lingua::StopWords" from CPAN,
it determines how many of the known stopwords the document contains for
each language supported by "Lingua::StopWords".
Each word in the document recognized as stopword of a particular
language scores one point for this language.
The "language_guess()" function takes a document as a parameter and
returns the abbreviation of the language that it is most likely written
in.
Author: Mike Schilli <cpan@perlmeister.com>
WWW: http://search.cpan.org/~mschilli/Text-Language-Guess-0.02/
PR: ports/103571
Submitted by: Masahiro Teramoto <markun@onohara.to>
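The scoring scheme described above can be sketched in Python. This is not
the module's Perl code; the three tiny stopword sets stand in for the much
larger lists from Lingua::StopWords and are assumptions, as is the helper
name language_scores.

```python
import re

# Tiny illustrative stopword lists (assumption); real lists are far larger.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "a"},
    "de": {"der", "die", "das", "und", "ist", "zu", "in"},
    "fr": {"le", "la", "les", "et", "est", "de", "un"},
}


def language_scores(text):
    """One point per word that is a known stopword of a language."""
    words = re.findall(r"\w+", text.lower())
    return {lang: sum(1 for w in words if w in stops)
            for lang, stops in STOPWORDS.items()}


def language_guess(text):
    """Return the abbreviation of the highest-scoring language."""
    scores = language_scores(text)
    return max(scores, key=scores.get)


print(language_guess("The cat is in the garden and the dog is asleep"))  # en
```

A word such as "in" scores for both English and German here, which is why
short documents are harder to guess than long ones.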
ffe is a program for extracting fields from flat file records and
displaying them in different formats. ffe relies on the configuration file
to control input file structure and the output format.
WWW: http://sourceforge.net/projects/ff-extractor/
Author: Timo Savinen <tjsa@iki.fi>
arbitrary text and also allows you to mark up a text as HTML
with the keywords.
A Hatena keyword is an element in a suite of web sites
*.hatena.ne.jp having blogs and social bookmarks among others.
Please refer to http://d.hatena.ne.jp/keyword/ (in Japanese) for details.
In Hatena Diary, a blog hosting service, a Hatena keyword found in
a posting is linked to the keywords page automatically.
You can implement the same kind of feature outside Hatena using this module.
It queries the Hatena Keyword Link API internally for retrieving terms.
Author: Naoya Ito <naoya@bloghackers.net>
WWW: http://search.cpan.org/~naoya/Hatena-Keyword-0.04/
PR: ports/102794
Submitted by: Masahiro Teramoto <markun(at)onohara.to>
This is a smaller, cheaper, faster SED implementation. Minix uses it. GNU
used to use it, until they built their own sed around an extended (some
would say over-extended) regexp package.
For embedded use we searched for a tiny sed implementation especially for
use with the dietlibc and found Eric S. Raymond's sed implementation quite
handy. However, it suffered from several bugs and was no longer actively
maintained. After sending a bunch of fixes we agreed to continue maintaining
this lovely, historic sed implementation.
Along with a lot of fixes and cleanups, further speedups, and some missing features
and POSIX conformance, we also added a test-suite to the package, so
regressions are quickly and easily uncovered.
WWW: http://www.exactcode.de/oss/minised/
Author: ExactCode <info@exactcode.de>
Basically, this package contains:
- Functions to automatically adjust and cycle the section underline
decorations;
- A mode that displays the table of contents and allows you to jump anywhere
from it;
- Functions to insert and automatically update a TOC in your source
document;
- A mode which supports font-lock highlighting of reStructuredText
structures;
- Some other convenience functions.
This package is the result of merging:
- restructuredtext.el
- rst-mode.el
- rst-html.el
Those files are now OBSOLETE and have been replaced by this single
package file (2005-10-30).
WWW: http://docutils.sourceforge.net/docs/user/emacs.html
PR: ports/102384
Submitted by: Denis Shaposhnikov <dsh at vlink.ru>
Perl. Everything is implemented as a small plugin and you can mash
them up together using Plagger core API and plugin hooks. You can
think of Plagger as a blosxom or qpsmtpd for RSS aggregator.
WWW: http://plagger.org/
WARNING: This port depends on thousands of ports, especially with
full options.
xxdiff is a computer program that allows a user (usually a software
developer of some sort) to easily visualize the differences between
files. The manner and goal for which this process is applied over
multiple files is highly dependent on the application, and most of
the time is driven by custom user scripts.
For example, a configuration management engineer in a company might
provide some kind of merge policing environment, that allows software
developers to review changes in files for the purpose of accepting or
rejecting a submitted changeset to a codebase. Another example is
that of a developer wishing to review the changes he made to a
checkout of files from a source-code management system such as CVS,
Subversion, ClearCase, Perforce, etc.
WWW: http://furius.ca/xxdiff/doc/xxdiff-scripts.html
Flex is a tool for generating scanners. A scanner, sometimes called a
tokenizer, is a program which recognizes lexical patterns in text. The
flex program reads user-specified input files, or its standard input
if no file names are given, for a description of a scanner to generate.
The description is in the form of pairs of regular expressions and C
code, called rules. Flex generates a C source file named, "lex.yy.c",
which defines the function yylex(). The file "lex.yy.c" can be compiled
and linked to produce an executable. When the executable is run, it
analyzes its input for occurrences of text matching the regular
expressions for each rule. Whenever it finds a match, it executes the
corresponding C code.
WWW: http://flex.sourceforge.net/
Note that there's flex 2.5.4 in the base system. This port provides
a newer version for programs that require it, textproc/xxdiff for one.
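The rule model described above (pairs of regular expressions and code) can
be imitated in Python. This is a hand-written sketch of the idea, not flex
input and not generated code; the rule set and the scan function are
assumptions. Like flex, it prefers the longest match and breaks ties in
favour of the earliest rule.

```python
import re

# Rules: (regular expression, action) pairs, analogous to flex rules.
RULES = [
    (re.compile(r"[0-9]+"), lambda text: ("NUMBER", int(text))),
    (re.compile(r"[A-Za-z_][A-Za-z0-9_]*"), lambda text: ("IDENT", text)),
    (re.compile(r"[-+*/=]"), lambda text: ("OP", text)),
    (re.compile(r"\s+"), lambda text: None),  # skip whitespace
]


def scan(source):
    """Tokenize source by repeatedly applying the best-matching rule."""
    pos = 0
    tokens = []
    while pos < len(source):
        best = None
        for pattern, action in RULES:
            m = pattern.match(source, pos)
            # Keep the longest non-empty match; earlier rules win ties.
            if m and m.end() > pos and (best is None or m.end() > best[0].end()):
                best = (m, action)
        if best is None:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
        m, action = best
        token = action(m.group())  # run the rule's action on the matched text
        if token is not None:
            tokens.append(token)
        pos = m.end()
    return tokens


print(scan("x = 42 + y1"))
```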
This module provides functions that deal with formatting data with
Content-Type 'text/plain; format=flowed' as described in RFC2646
(http://www.rfc-editor.org/rfc/rfc2646.txt). In a nutshell,
format=flowed text solves the problem in plain text files where it
is not known which lines can be considered a logical paragraph,
enabling lines to be automatically flowed (wrapped and/or joined)
as appropriate when displaying.
In format=flowed, a soft newline is expressed as " \n", while hard
newlines are expressed as "\n". Soft newlines can be automatically
deleted or inserted as appropriate when the text is reformatted.
WWW: http://search.cpan.org/dist/Text-Flowed/
Justification: socialtext dependency
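The soft/hard newline convention described above can be sketched in Python.
This is a simplified illustration rather than the Text::Flowed API: the
function names are hypothetical, and quoting, space-stuffing and the
signature separator from RFC 2646 are deliberately ignored.

```python
import textwrap


def flow(paragraph, width=72):
    """Wrap one paragraph, ending every wrapped line with a soft newline (" \\n")."""
    lines = textwrap.wrap(paragraph, width,
                          break_long_words=False, break_on_hyphens=False)
    return " \n".join(lines)


def unflow(text):
    """Join soft-broken lines back into logical paragraphs."""
    paragraphs = []
    current = ""
    for line in text.splitlines():
        if line.endswith(" "):
            current += line            # soft newline: the paragraph continues
        else:
            paragraphs.append(current + line)  # hard newline: paragraph ends
            current = ""
    if current:
        paragraphs.append(current.rstrip(" "))
    return paragraphs


text = "The quick brown fox jumps over the lazy dog"
flowed = flow(text, width=20)
print(repr(flowed))
print(unflow(flowed))
```

Because only the trailing space distinguishes the two kinds of newline, a
reader that does not understand format=flowed still sees ordinary wrapped
text.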
This provides a simple interface to Plucene. Plucene is large and
multi-featured, and it is expected that users will subclass it, and tie all the
pieces together to suit their own needs. Plucene::Simple is, therefore,
just one way to use Plucene. It's not expected that it will do exactly
what *you* want, but you can always use it as an example of how to
build your own interface.
WWW: http://search.cpan.org/dist/PluceneSimple/
Justification: socialtext dependency
Quirks: 1/6 test fails
Bastardize provides a magical object into which text can be charged
and then returned in various, slightly modified ways.
Among others, bastardize has the following methods:
rdct converts english to hyperreductionist english
(ex. "english" becomes "")
pig pig latin
(ex. "hi there" becomes "ihay erethay")
k3wlt0k a k3wlt0kizer developed originally by Fmh
rot13 implements rot13 "encryption" in perl
(ex. "foo bar" becomes "sbb one")
rev reverses the arrangement of characters
censor attempts to censor text which might be inappropriate
n20e performs numerical abbreviations
(ex. "numerical_abbreviation" becomes "n20e")
WWW: http://search.cpan.org/dist/Text-Bastardize/
This is an XS wrapper around some Unicode Consortium code to check if
a string is valid UTF-8, revised to conform to what expat/Mozilla
think is valid UTF-8, especially with regard to low-ASCII characters.
Note that this module has NOTHING to do with Perl's internal UTF8 flag
on scalars.
This module is for use when you're getting input from users and want
to make sure it's valid UTF-8 before continuing.
WWW: http://search.cpan.org/dist/Unicode-CheckUTF8/
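The kind of check described above, validating user input before using it,
can be sketched in Python with the built-in strict decoder. This is not the
module's XS code, and it does not reproduce the expat/Mozilla rules about
low-ASCII characters; the function name is_utf8 is hypothetical.

```python
def is_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_utf8("héllo".encode("utf-8")))  # True
print(is_utf8(b"\xff\xfe"))              # False: 0xFF never appears in UTF-8
print(is_utf8(b"\xc0\xaf"))              # False: overlong encoding of "/"
```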
The goals of this project are simple:
Create a highly configurable, easily modifiable source code beautifier.
What it does:
* Indent code, aligning on parens, assignments, etc
* Align on '=' and variable definitions
* Align structure initializers
* Align #define stuff
* Align backslash-newline stuff
* Reformat comments (a little bit)
* Fix inter-character spacing
* Add or remove parens on return statements
* Add or remove braces on single-statement if/do/while/for statements
* Highly configurable - 118 configurable options as of version 0.0.15
WWW: http://uncrustify.sourceforge.net
PR: ports/100604
Submitted by: Dmitry Marakasov <amdmi3 at mail.ru>
- by default, textproc/aspell installs the English dictionaries (no
change);
- thereafter you can install any foreign dictionary;
- when you install a foreign dictionary, i.e. french/aspell or
textproc/da-aspell, it installs only the dictionaries, and depends
upon textproc/aspell for the programs;
- if you don't need the English dictionaries, you can define
WITHOUT_DICTEN or install textproc/aspell-without-dicten;
- add a new port for textproc/en-aspell: if aspell had been installed
without the English dictionaries, they can be added thereafter;
- add a missing port for german/alt-aspell;
- foreign dictionaries are almost independent from textproc/aspell,
and their maintainership is available.
Credits: special thanks to Serge Gagnon <ser_gagnon (at) sympatico.ca>
specification generously provided by Adobe at
http://partners.adobe.com/public/developer/pdf/index_reference.html
The file format is well-supported, with the exception of the
"linearized" or "optimized" output format, which this module can read
but not write. Many specific aspects of the document model are not
manipulable with this package (like fonts), but if the input document
is correctly written, then this module will preserve the model
integrity.
This library grants you some power over the PDF security model. Note
that applications editing PDF documents via this library MUST respect
the security preferences of the document. Any violation of this
respect is contrary to Adobe's intellectual property position, as
stated in the reference manual at the above URL.
WWW: http://search.cpan.org/dist/CAM-PDF/
PR: ports/100182
Submitted by: Gea-Suan Lin <gslin at gslin.org>
ecore data structures and making things generally easy to get around in.
The functions detailed in EXML.h are fairly self explanatory, and the io
interfaces are also generalized and independent (open from a socket, write
to in memory xml image).
WWW: http://www.enlightenment.org/
PR: ports/100002
Submitted by: Stanislav Sedov <ssedov at mbsd.msk.ru>
Since JSON is a pure-perl module and JSON::Syck is based on libsyck,
JSON::Syck is supposed to be very fast and memory efficient. See
chansen's benchmark table at
http://idisk.mac.com/christian.hansen/Public/perl/serialize.pl
JSON.pm comes with dozens of ways to do the same thing and lots of
options, while JSON::Syck doesn't. There's only Load and Dump.
Oh, and JSON::Syck doesn't use camelCase method names :-)
Author: Audrey Tang <autrijus@autrijus.org>
Tatsuhiko Miyagawa <miyagawa@gmail.com>
WWW: http://search.cpan.org/dist/JSON-Syck/
PR: ports/100071
Submitted by: Gea-Suan Lin <gslin at gslin.org>