Pig is an engine for executing data flows in parallel on Hadoop.
Apache Pig is a platform for analyzing large data sets that consists of a
high-level language for expressing data analysis programs, coupled with
infrastructure for evaluating these programs. The salient property of Pig
programs is that their structure is amenable to substantial parallelization,
which in turn enables them to handle very large data sets.

At the present time, Pig's infrastructure layer consists of a compiler that
produces sequences of Map-Reduce programs, for which large-scale parallel
implementations already exist (e.g., the Hadoop subproject). Pig's language
layer currently consists of a textual language called Pig Latin, which has
the following key properties:

-- Ease of programming. It is trivial to achieve parallel execution of simple,
"embarrassingly parallel" data analysis tasks. Complex tasks composed of
multiple interrelated data transformations are explicitly encoded as data flow
sequences, making them easy to write, understand, and maintain.
-- Optimization opportunities. The way in which tasks are encoded permits the
system to optimize their execution automatically, allowing the user to focus
on semantics rather than efficiency.
-- Extensibility. Users can create their own functions to do special-purpose
processing.

WWW: http://pig.apache.org/