This release contains mostly feature requests.
Features:
The stats1 verb now lets you use regular expressions to specify
which field names to compute statistics on, and/or which to
group by. Full details are here.
The min and max DSL functions, and the min/max/percentile
aggregators for the stats1 and merge-fields verbs, now support
numeric as well as string field values. (For mixed string/numeric
fields, numbers compare before strings.) This means in particular
that order statistics -- min, max, and non-interpolated percentiles
-- as well as mode, antimode, and count are now possible on
string-only (or mixed) fields. (Of course, any operations
requiring arithmetic on values, such as computing sums, averages,
or interpolated percentiles, yield an error on string-valued
input.)
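The ordering rule above (numbers compare before strings) can be pictured with a small Python model. This is only a sketch of the described behavior, not Miller's implementation; the helper names `sort_key`, `mixed_min`, and `mixed_max` are hypothetical.

```python
def sort_key(value):
    """Rank numeric-looking values ahead of strings (assumed model of the rule)."""
    try:
        return (0, float(value), "")   # numbers: compared numerically
    except ValueError:
        return (1, 0.0, value)         # strings: compared lexically, after all numbers


def mixed_min(values):
    return min(values, key=sort_key)


def mixed_max(values):
    return max(values, key=sort_key)


# Field values arrive as text, so numbers are written as strings here.
values = ["abc", "3", "17", "xyz", "-2.5"]
print(mixed_min(values))  # -2.5
print(mixed_max(values))  # xyz
```

With a mixed column, the minimum is therefore always a number (if any is present) and the maximum is always a string (if any is present).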
There is a new DSL function mapexcept which returns a copy of
the argument with specified key(s), if any, unset. The motivating
use-case is to split records to multiple filenames depending
on particular field value, which is omitted from the output:
mlr --from f.dat put 'tee > "/tmp/data-".$a, mapexcept($*, "a")'
Likewise, mapselect returns a copy of the argument with only
specified key(s), if any, set. This resolves #137.
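For readers who think in Python, the two functions behave roughly like the dictionary filters below. This is a model of the semantics described above, not Miller code; `split_by_field` is a hypothetical helper that mimics the tee use-case by collecting records per output filename instead of writing files.

```python
def mapexcept(m, *keys):
    """Copy of m with the given keys, if present, removed."""
    return {k: v for k, v in m.items() if k not in keys}


def mapselect(m, *keys):
    """Copy of m with only the given keys, if present, kept."""
    return {k: v for k, v in m.items() if k in keys}


def split_by_field(records, field, prefix="/tmp/data-"):
    """Group records by filename derived from `field`, omitting that field."""
    outputs = {}
    for record in records:
        filename = prefix + str(record[field])
        outputs.setdefault(filename, []).append(mapexcept(record, field))
    return outputs


records = [{"a": "pan", "x": 1}, {"a": "eks", "x": 2}, {"a": "pan", "x": 3}]
for filename, rows in split_by_field(records, "a").items():
    print(filename, rows)
```

Neither function modifies its argument; both return a new map, which is why `mapexcept($*, "a")` can be written out without altering the record itself.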
A new -u option for count-distinct allows unlashed counts for
multiple field names. For example, with -f a,b and without -u,
count-distinct computes counts for distinct pairs of a and b
field values. With -f a,b and with -u, it computes counts for
distinct a field values and counts for distinct b field values
separately.
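The difference between lashed and unlashed counts can be sketched in Python as follows. The function name and the `unlashed` parameter are illustrative only; the output layout is not meant to match Miller's.

```python
from collections import Counter


def count_distinct(records, fields, unlashed=False):
    """Count distinct values: per field if unlashed, else per tuple of fields."""
    if unlashed:
        # One independent counter per field name.
        return {f: Counter(r[f] for r in records if f in r) for f in fields}
    # One counter keyed by the combination of all field values.
    return Counter(
        tuple(r[f] for f in fields)
        for r in records
        if all(f in r for f in fields)
    )


records = [
    {"a": "pan", "b": "wye"},
    {"a": "pan", "b": "zee"},
    {"a": "eks", "b": "wye"},
    {"a": "pan", "b": "wye"},
]
print(count_distinct(records, ["a", "b"]))                 # counts per (a, b) pair
print(count_distinct(records, ["a", "b"], unlashed=True))  # counts per a, and per b
```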
If you build from source, you can now do ./configure without
first doing autoreconf -fiv. This resolves #131.
The UTF-8 BOM sequence 0xef 0xbb 0xbf is now automatically
ignored from the start of CSV files. (The same is already done
for JSON files.) This resolves #138.
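As an illustration of what ignoring the BOM amounts to, here is a minimal Python sketch. It is not Miller's reader code, and the function name is hypothetical.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 byte-order mark, if there is one."""
    if data.startswith(UTF8_BOM):
        return data[len(UTF8_BOM):]
    return data


raw = UTF8_BOM + b"a,b,c\n1,2,3\n"
print(strip_utf8_bom(raw).decode("utf-8").splitlines()[0])  # a,b,c
```

Without this step, the first header field would carry three invisible bytes and would not match the field name `a` in later processing.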
For put and filter with -S, program literals such as the 6 in
$x = 6 were being parsed as strings. This is not sensible, since
the -S option for put and filter is intended to suppress numeric
conversion of record data, not program literals. To get string
6 one may use $x = "6".
Documentation:
A new cookbook example shows how to compute differences between
successive queries, e.g. to find out what changed in time-varying
data when you run and rerun a SQL query.
Another new cookbook example shows how to compute interquartile
ranges.
A third new cookbook example shows how to compute weighted
means.
Bugfixes:
CRLF line-endings were not being correctly autodetected when
I/O formats were specified using --c2j et al.
Integer division by zero was causing a fatal runtime exception,
rather than computing inf or nan as in the floating-point case.
This is a relatively minor release of Miller, containing feature
requests and bugfixes while I've been working on the Windows port
(which is nearly complete).
Features:
JSON arrays: as described here, Miller, being a tabular data
processor, isn't well-positioned to handle arbitrary JSON. (See
jq for that.) But as of 5.1.0, arrays are converted to maps
with integer keys, which are then at least processable using
Miller. Details are here. The short of it is that you now have
three options for the main mlr executable:
--json-map-arrays-on-input: Convert JSON array indices to Miller
map keys. (This is the default.)
--json-skip-arrays-on-input: Disregard JSON arrays.
--json-fatal-arrays-on-input: Raise a fatal error when JSON arrays
are encountered in the input.
This resolves #133.
The new mlr fraction verb makes possible in a few keystrokes
what was only possible before using two-pass DSL logic: here
you can turn numerical values down a column into their
fractional/percentage contribution to column totals, optionally
grouped by other key columns.
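The two-pass logic that the verb replaces looks roughly like this in Python. It is a sketch under assumptions: the `_fraction` output suffix is used here for illustration, values are taken to be numeric already, and group totals are assumed to be nonzero.

```python
from collections import defaultdict


def fraction(records, field, group_by=()):
    """Add each record's share of the (optionally grouped) column total."""
    # Pass 1: accumulate the total for each group.
    totals = defaultdict(float)
    for r in records:
        totals[tuple(r[g] for g in group_by)] += r[field]
    # Pass 2: divide each value by its group's total.
    out = []
    for r in records:
        total = totals[tuple(r[g] for g in group_by)]
        out.append({**r, field + "_fraction": r[field] / total})
    return out


records = [{"a": "pan", "x": 1}, {"a": "pan", "x": 3}, {"a": "eks", "x": 5}]
for row in fraction(records, "x", group_by=("a",)):
    print(row)
```

Two passes are unavoidable because no fraction can be emitted until the total is known, so all records must be retained until end of stream.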
The DSL functions strptime and strftime now handle fractional
seconds. For parsing, use %S format as always; for formatting,
there are now %1S through %9S which allow you to configure a
specified number of decimal places. The return value from
strptime is now floating-point, not integer, which is a minor
backward incompatibility not worth labeling this release as
6.0.0. (You can work around this using int(strptime(...)).) The
DSL functions gmt2sec and sec2gmt, which are keystroke-savers
for strptime and strftime, are similarly modified, as is the
sec2gmt verb. This resolves #125.
A few nearly-standalone programs -- which do not have anything
to do with record streams -- are packaged within Miller.
(For example, hex-dump, unhex, and show-line-endings commands.)
These are described here.
The stats1 and merge-fields verbs now support an antimode
aggregator, in addition to the existing mode aggregator.
The join verb now by default does not require sorted input,
which is the more common use case. (Memory-parsimonious joins
which require sorted input, while no longer the default, are
available using -s.) This is another minor backward incompatibility
not worth making a 6.0.0 over. This resolves #134.
mlr nest has a keystroke-saving --evar option for a common use
case, namely, exploding a field by value across records.
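"Exploding a field by value across records" means one output record per piece of the split field, with all other fields repeated. A Python sketch of that idea follows; the function name is hypothetical and this is not Miller's implementation.

```python
def explode_values_across_records(records, field, separator):
    """Emit one record per separator-delimited piece of `field`."""
    out = []
    for r in records:
        if field not in r:
            out.append(dict(r))  # records lacking the field pass through unchanged
            continue
        for piece in r[field].split(separator):
            out.append({**r, field: piece})
    return out


records = [{"x": "a;b;c", "y": "1"}, {"x": "d", "y": "2"}]
for row in explode_values_across_records(records, "x", ";"):
    print(row)
```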
Documentation:
The DSL reference now has per-function descriptions.
There is a new feature-counting example in the cookbook.
Bugfixes:
mlr join -j -l was not functioning correctly. This resolves
#136.
JSON escapes on output (\t and so on) were incorrect. This
resolves #135.
Two minor bugfixes
As described in #132, mlr nest was incorrectly splitting fields
with multi-character separators.
The XTAB-format reader, when using multi-character IPS, was
incorrectly splitting key-value pairs, but only when reading
from standard input (e.g. on a pipe or less-than redirect).
Autodetected line-endings, in-place mode, user-defined functions, and more
This major release significantly expands the expressiveness of the DSL for mlr put and mlr filter. (The upcoming 5.1.0 release will add the ability to aggregate across all columns for non-DSL verbs such as mlr stats1 and mlr stats2. As well, a Windows port is underway.)
Please also see the Miller main docs.
Simple but impactful features:
Line endings (CRLF vs. LF, Windows-style vs. Unix-style) are now autodetected. For example, files (including CSV) with LF input will lead to LF output unless you specify otherwise.
There is now an in-place mode using mlr -I.
Major DSL features:
You can now define your own functions and subroutines: e.g. func f(x, y) { return x**2 + y**2 }.
New local variables are completely analogous to out-of-stream variables: sum retains its value for the duration of the expression it's defined in; @sum retains its value across all records in the record stream.
Local variables, function parameters, and function return types may be defined untyped or typed as in x = 1 or int x = 1, respectively. There are also expression-inline type-assertions available. Type-checking is up to you: omit it if you want flexibility with heterogeneous data; use it if you want to help catch misspellings in your DSL code or unexpected irregularities in your input data.
There are now four kinds of maps. Out-of-stream variables have always been scalars, maps, or multi-level maps: @a=1, @b[1]=2, @c[1][2]=3. The same is now true for local variables, which are new to 5.0.0. Stream records have always been single-level maps; $* is a map. And as of 5.0.0 there are now map literals, e.g. {"a":1, "b":2}, which can be defined using JSON-like syntax (with either string or integer keys) and which can be nested arbitrarily deeply.
You can loop over maps -- $*, out-of-stream variables, local variables, map-literals, and map-valued function return values -- using for (k, v in ...) or the new for (k in ...) (discussed next). All flavors of map may also be used in emit and dump statements.
User-defined functions and subroutines may take map-valued arguments, and may return map values.
Some built-in functions now accept map-valued input: typeof, length, depth, leafcount, haskey. There are built-in functions producing map-valued output: mapsum and mapdiff. There are now string-to-map and map-to-string functions: splitnv, splitkv, splitnvx, splitkvx, joink, joinv, and joinkv.
Minor DSL features:
For iterating over maps (namely, local variables, out-of-stream variables, stream records, map literals, or return values from map-valued functions) there is now a key-only for-loop syntax: e.g. for (k in $*) { ... }. This is in addition to the already-existing for (k, v in ...) syntax.
There are now triple-statement for-loops (familiar from many other languages), e.g. for (int i = 0; i < 10; i += 1) { ... }.
mlr put and mlr filter now accept multiple -f for script files, freely intermixable with -e for expressions. The suggested use case is putting user-defined functions in script files and one-liners calling them using -e. Example: myfuncs.mlr defines the function f(...), then mlr put -f myfuncs.mlr -e '$o = f($i)' myfile.dat. More information is here.
mlr filter is now almost identical to mlr put: it can have multiple statements, it can use begin and/or end blocks, it can define and invoke functions. Its final expression must evaluate to boolean which is used as the filter criterion. More details are here.
The min and max functions are now variadic: $o = max($a, $b, $c).
There is now a substr function.
While ENV has long provided read-access to environment variables on the right-hand side of assignments (as a getenv), it now can be at the left-hand side of assignments (as a putenv). This is useful for subsidiary processes created by tee, emit, dump, or print when writing to a pipe.
The # comment character is now handled in the lexer, so you can now (correctly) include # in strings.
Separators are now available as read-only variables in the DSL: IPS, IFS, IRS, OPS, OFS, ORS. These are particularly useful with the split and join functions: e.g. with mlr --ifs tab ..., the IFS variable within a DSL expression will evaluate to a string containing a tab character.
Syntax errors in DSL expressions now have a little more context.
DSL parsing and execution are a bit more transparent. There have long been -v and -t options to mlr put and mlr filter, which print the expression's abstract syntax tree and do a low-level parser trace, respectively. There are now additionally -a which traces stack-variable allocation and -T which traces statements line by line as they execute. While -v, -t, and -a are most useful for development of Miller, the -T option gives you more visibility into what your Miller scripts are doing. See also here.
Verbs:
most-frequent and least-frequent as requested in #110.
seqgen makes it easy to generate data from within Miller: please also see here for a usage example.
unsparsify makes it easy to rectangularize data where not all records have the same fields.
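Rectangularizing can be sketched in Python as below. The filler value and the choice to emit fields in first-seen order are assumptions of this sketch rather than statements about Miller's output.

```python
def unsparsify(records, fill=""):
    """Give every record the union of all field names, filling the gaps."""
    # Pass 1: collect the union of field names in first-seen order.
    keys = []
    seen = set()
    for r in records:
        for k in r:
            if k not in seen:
                seen.add(k)
                keys.append(k)
    # Pass 2: emit each record with every key present.
    return [{k: r.get(k, fill) for k in keys} for r in records]


records = [{"a": 1}, {"b": 2}, {"a": 3, "c": 4}]
for row in unsparsify(records):
    print(row)
```

As with any operation that needs the full set of field names, the whole input has to be read before the first record can be written.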
cat -n now takes a group-by (-g) option, making it easy to number records within categories.
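Numbering within categories amounts to keeping one counter per group-by value. A Python sketch follows; the counter field name `n` and its position at the front of the record are assumptions made for illustration.

```python
def number_within_groups(records, group_field, counter_name="n"):
    """Prepend a 1-up counter that restarts for each distinct group value."""
    counts = {}
    out = []
    for r in records:
        key = r[group_field]
        counts[key] = counts.get(key, 0) + 1
        out.append({counter_name: counts[key], **r})
    return out


records = [{"a": "pan"}, {"a": "eks"}, {"a": "pan"}, {"a": "pan"}]
for row in number_within_groups(records, "a"):
    print(row)
```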
count-distinct, uniq, most-frequent, least-frequent, top, and histogram now take a -o option for specifying their output field names, as requested in #122.
Median is now a synonym for p50 in stats1.
You can now start a then chain with an initial then, which is nice in backslashy/multiline-continuation contexts.
This was requested in #130.
I/O options:
The print statement may now be used with no arguments, which prints a newline, and a no-argument printn prints nothing but creates a zero-length file in redirected-output context.
Pretty-print format now has a --pprint --barred option (for output only, not input). For an example, please see here.
There are now keystroke-savers of the form --c2p which abbreviate --icsvlite --opprint, and so on.
Miller's map literals are JSON-looking but allow integer keys which JSON doesn't. The --jknquoteint and --jvquoteall flags for mlr (when using JSON output) and mlr put (for dump) provide control over double-quoting behavior.
Documents new since the previous release:
Miller in 10 minutes is a long-overdue addition: while Miller's detailed documentation is evident, there has been a lack of more succinct examples.
The cookbook has likewise been expanded, and has been split out
into three parts: part 1, part 2, part 3.
A bit more background on C performance compared to other languages I experimented with, early on in the development of Miller, is here.
On-line help:
Help for DSL built-in functions, DSL keywords, and verbs is accessible using mlr -f, mlr -k, and mlr -l respectively; name-only lists are available with mlr -F, mlr -K, and mlr -L.
Bugfixes:
A corner-case bug causing a segmentation violation on two sub/gsub statements within a single put, the first one matching its pattern and the second one not matching its pattern, has been fixed.
Backward incompatibilities: This is Miller 5.0.0, not 4.6.0, due to the following (all relatively minor):
The v variables bound in for-loops such as for (k, v in some_multi_level_map) { ... } can now be map-valued if the v specifies a non-terminal in the map.
There are new keywords such as var, int, float, num, str, bool, map, IPS, IFS, IRS, OPS, OFS, ORS which can no longer be used as variable names. See mlr -k for the complete list.
Unsetting the last key in a map-valued variable's map level no longer removes the level: e.g. with @v[1][2]=3 and unset @v[1][2], the @v variable previously ended up empty. As of 5.0.0, @v has key 1 with an empty-map value.
There is no longer type-inference on literals: "3"+4 no longer gives 7. (That was never a good idea.)
The typeof function used to say things like MT_STRING; now it says things like string.
4.5.0
Customizable output format for redirected output
In a natural follow-on to the 4.4.0 redirected-output feature, the
4.5.0 release allows your tap-files to be in a different output
format from the main program output.
For example, using
mlr --icsv --opprint ... then put --ojson 'tee > "mytap-".$a.".dat", $*' then ...
the input is CSV, the output is pretty-print tabular, but the
tee-files output is written in JSON format. Likewise --ofs, --ors,
--ops, --jvstack, and all other output-formatting options from the
main help at mlr -h and/or man mlr default to the main command-line
options, and may be overridden with flags supplied to mlr put and
mlr tee.
4.4.0
Redirected output, row-value shift, and other features
The principal feature of Miller 4.4.0 is redirected output. Inspired
by awk, Miller lets you tap/tee your data as it's processed, run
output through subordinate processes such as gzip and jq, split a
single file into multiple files per an account-ID column, and so
on.
Details:
http://johnkerl.org/miller/doc/reference.html#Redirected-output_statements_for_put
Other features:
mlr step -a shift allows you to place the previous record's
values alongside the current record's values:
http://johnkerl.org/miller/doc/reference.html#step
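The shift idea can be modeled in a few lines of Python. The `_shift` output suffix and the placeholder used for the first record (which has no predecessor) are assumptions of this sketch.

```python
def step_shift(records, field, initial="-"):
    """Attach the previous record's value of `field` to each record."""
    prev = initial
    out = []
    for r in records:
        out.append({**r, field + "_shift": prev})
        prev = r[field]
    return out


records = [{"x": 1}, {"x": 3}, {"x": 6}]
for row in step_shift(records, "x"):
    print(row)
```

Having both values on one record makes differences and ratios between successive rows a single-record computation.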
mlr head, when used without the group-by flag (-g), stops after
the specified number of records has been output. For example,
even with a multi-gigabyte data file, mlr head -n 10 hugefile.dat
will complete quickly after producing the first ten records
from the file.
The sec2gmtdate verb, and sec2gmtdate function for filter/put,
is new: please see
http://johnkerl.org/miller/doc/reference.html#sec2gmtdate and
http://johnkerl.org/miller/doc/reference.html#Functions_for_filter_and_put.
sec2gmt and sec2gmtdate both leave non-numbers as-is, rather
than formatting them as (error). This is particularly relevant
for formatting nullable epoch-seconds columns in SQL-table
output: if a column value is NULL then after sec2gmt or
sec2gmtdate it will still be NULL.
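The pass-through behavior for non-numbers can be modeled in Python as follows. This is an illustrative sketch using the standard library, not Miller's code; fractional seconds are simply truncated here.

```python
from datetime import datetime, timezone


def _format_utc(value, fmt):
    """Format epoch seconds as UTC text, or return non-numbers unchanged."""
    try:
        seconds = int(float(value))
    except (TypeError, ValueError, OverflowError):
        return value  # e.g. "NULL" or "" stays exactly as it was
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(fmt)


def sec2gmt(value):
    return _format_utc(value, "%Y-%m-%dT%H:%M:%SZ")


def sec2gmtdate(value):
    return _format_utc(value, "%Y-%m-%d")


for v in ["0", "86400", "NULL", ""]:
    print(repr(v), "->", repr(sec2gmt(v)), repr(sec2gmtdate(v)))
```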
The dot operator has been universalized to work with any data
type and produce a string. For example, if the field n has
integers, then instead of typing mlr put '$name = "value:".string($n)'
you can now simply do mlr put '$name = "value:".$n'. This is
particularly timely for creating filenames for redirected
print/dump/tee/emit output.
The online documents now have a copy of the Miller manpage:
http://johnkerl.org/miller/doc/manpage.html
Bugfix: inside filter/put, $x=="" was distinct from isempty($x).
This was nonsensical; now both are the same.
Use release tarball and drop autotools dependencies.
Changes in 3.4.0:
JSON, reshape, regex captures, and more
Primary features:
JSON is now a supported format for input and output. Miller handles tabular data, and JSON supports arbitrarily deeply nested data structures, so if you want general JSON processing you should use jq. But if you have tabular data represented in JSON then Miller can now handle that for you. Please see the reference page and the FAQ.
Reshape is a standard data-processing idiom, now available in Miller: http://johnkerl.org/miller/doc/reference.html#reshape
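For readers unfamiliar with the idiom, here is a Python sketch of the two directions of reshaping. The function and parameter names are hypothetical, and records missing a key or value field are not handled.

```python
def wide_to_long(records, fields, key_name, value_name):
    """Turn each listed column into its own (key, value) record."""
    out = []
    for r in records:
        rest = {k: v for k, v in r.items() if k not in fields}
        for f in fields:
            if f in r:
                out.append({**rest, key_name: f, value_name: r[f]})
    return out


def long_to_wide(records, key_name, value_name):
    """Regroup (key, value) records into one record per set of other fields."""
    groups = {}
    for r in records:
        rest = tuple((k, v) for k, v in r.items() if k not in (key_name, value_name))
        groups.setdefault(rest, {})[r[key_name]] = r[value_name]
    return [{**dict(rest), **pairs} for rest, pairs in groups.items()]


wide = [{"id": "1", "x": "3", "y": "4"}, {"id": "2", "x": "5", "y": "6"}]
long_form = wide_to_long(wide, ["x", "y"], "key", "value")
print(long_form)
print(long_to_wide(long_form, "key", "value"))
```

Wide-to-long can stream record by record, whereas long-to-wide must hold records until it has seen every key for a given group.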
Incidentally (not part of this release, but new since the last release) Miller is now available in FreeBSD's package manager: https://www.freshports.org/textproc/miller/. A full list of distributions containing Miller may be found here.
Miller is not yet available from within Fedora/CentOS, but as a step toward this goal, an SRPM is included in this release (see file-list below).
DSL enhancements for mlr put and mlr filter:
Regex captures \0 through \9: http://johnkerl.org/miller/doc/reference.html#Regex_captures
Ternary operator in expression right-hand sides: e.g. mlr put '$y = $x < 0.5 ? 0 : 1'
Boolean literals true and false
Final semicolon is now allowed: e.g. mlr put '$x=1;$y=2;'
Environment variables are now accessible, where environment-variable names may be string literals or arbitrary expressions: mlr put '$home = ENV["HOME"]' or mlr put '$value = ENV[$name]'.
While records are still string-to-string maps for input and output, and between then statements, types are preserved between multiple statements within a put. Example: mlr put '$y = string($x); $z = $y . $y' works as expected, without requiring mlr put '$y = string($x); $z = string($y) . string($y)' as before.
Bug fixes:
Mixed-format join, e.g. CSV file joined with DKVP file, was incorrectly computing default separators (IRS, IFS, IPS). This resulted in records not being joined together.
Segmentation violation on non-standard-input read of files with size an exact multiple of page size and not ending in IRS, e.g. newline. (This is less of a corner case than it sounds: for example, leave a long-running program running with output redirected to a file, then in a sleep-and-process loop, have Miller process that file. The former program's stdio library will likely be doing block-sized buffered I/O, where block sizes will often be multiples of system page size and the block will almost surely not end in a newline.)
Acknowledgements: Big thank-yous to @gregfr and @aaronwolen for feature requests including reshape and regex captures, and to @jungle-boogie for his work getting Miller into FreeBSD. Also, ongoing thanks to @0-wiz-0 for his past work on configure support, making it possible for Miller to be put to use in multiple operating systems.
3.3.2
Bootstrap sampling, EWMA, merge-fields, isnull/isnotnull functions
@johnkerl released this on Jan 11.
Bootstrap sampling in mlr bootstrap: http://johnkerl.org/miller/doc/reference.html#bootstrap. Compare to reservoir sampling in mlr sample: http://johnkerl.org/miller/doc/reference.html#sample.
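To make the comparison concrete, here is a Python sketch of the two textbook techniques: bootstrap sampling draws as many records as the input has, with replacement, while reservoir sampling keeps a fixed-size sample without replacement in a single pass. These are the standard algorithms, not a transcription of Miller's code, and the function names are hypothetical.

```python
import random


def bootstrap(records, rng):
    """Draw len(records) records uniformly at random, with replacement."""
    n = len(records)
    return [records[rng.randrange(n)] for _ in range(n)]


def reservoir_sample(records, k, rng):
    """Keep a uniform sample of up to k records without replacement (Algorithm R)."""
    sample = []
    for i, r in enumerate(records):
        if i < k:
            sample.append(r)          # fill the reservoir first
        else:
            j = rng.randrange(i + 1)  # record i survives with probability k/(i+1)
            if j < k:
                sample[j] = r
    return sample


rng = random.Random(1)
data = list(range(10))
print(bootstrap(data, rng))            # same length as input, repeats likely
print(reservoir_sample(data, 3, rng))  # three distinct items
```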
Exponentially weighted moving averages in mlr step -a ewma: principally useful for smoothing of noisy time series, e.g. finely sampled system-resource utilization to give one of many possible examples. Please see http://johnkerl.org/miller/doc/reference.html#step.
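The smoothing can be sketched with the standard EWMA recurrence, in which each output is a weighted blend of the current sample and the previous output. The weight name `alpha` and the choice to seed the average with the first sample are assumptions of this sketch.

```python
def ewma(values, alpha):
    """Exponentially weighted moving average; alpha is the current sample's weight."""
    out = []
    prev = None
    for x in values:
        # The first sample seeds the average; later ones are blended in.
        prev = x if prev is None else alpha * x + (1 - alpha) * prev
        out.append(prev)
    return out


noisy = [0.0, 4.0, 4.0, 0.0]
print(ewma(noisy, 0.5))  # [0.0, 2.0, 3.0, 1.5]
print(ewma(noisy, 1.0))  # no smoothing: same as the input
```

A weight near 1 tracks the input closely, while a weight near 0 smooths heavily and responds slowly to changes.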
"Horizontal" univariate statistics in mlr merge-fields, compared to mlr stats which is "vertical". Also allows collapsing multiple fields into one, such as in_bytes and out_bytes data fields summing to bytes_sum. This can also be done easily using mlr put. However, mlr merge-fields allows aggregation of more than just a pair of field names, and supports pattern-matching on field names. Please see http://johnkerl.org/miller/doc/reference.html#merge-fields for more information.
isnull and isnotnull functions for mlr filter and mlr put.
stats1, stats2, merge-fields, step, and top correctly handle not only missing fields (in the row-heterogeneous-data case) but also null-valued fields.
Minor memory-management improvements.
Multi-character RS,FS,PS
You can process CRLF-terminated DKVP files with mlr --dkvp --rs
crlf.
You can process LF-terminated CSV files with mlr --csv --rs lf.
You can process TSV using mlr --fs tab; you can convert TSV to CSV
using mlr --ifs tab --ofs comma.
Along with many more possibilities.
Please see mlr -h for more information.
There is one minor, backward-incompatible change which I felt not
worth calling this 3.0.0: default field separator for NIDX format
is now space, not comma.
Changes:
v2.1.1
Incremental read-performance increase for CSV format
While #51 is still underway, already there is nearly a 2x
read-performance increase in v2.1.1 over v2.1.0.
v2.1.0
Minor enhancements and bug fixes
Highlights: travis-CI integration (thanks @SikhNerd!); hour-minute-second
functions; fixed pretty-print alignment of UTF-8 data.
Miller is like sed, awk, cut, join, and sort for name-indexed data
such as CSV.
With Miller, you get to use named fields without needing to count
positional indices.
This is something the Unix toolkit always could have done, and
arguably always should have done. It operates on key-value-pair
data while the familiar Unix tools operate on integer-indexed
fields: if the natural data structure for the latter is the array,
then Miller's natural data structure is the insertion-ordered hash
map. This encompasses a variety of data formats, including but not
limited to the familiar CSV. (Miller can handle positionally-indexed
data as a special case.)