# Data Sorting

updated: 2022-04-03 16:53:16Z
created: 2022-04-03 16:50:40Z

  • ORDER BY [ASC|DESC]: performs a global sort of the whole result set using only one reducer, so it takes longer to return the result on large inputs. Using LIMIT with ORDER BY is strongly recommended.
-- Order by expression
SELECT name
FROM employee
ORDER BY CASE WHEN name = 'Will' THEN 0 ELSE 1 END DESC;

SELECT * FROM emp_simple
ORDER BY work_place NULLS LAST;
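  • SORT BY [ASC|DESC]: specifies which columns are used to sort the records entering each reducer. The sort is completed before the data reaches the reducer, so each reducer's output is sorted, but the overall result is globally ordered only when there is a single reducer.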
SELECT name FROM employee SORT BY name DESC;
  • DISTRIBUTE BY: rows with the same values in the listed columns are sent to the same reducer. It resembles GROUP BY in how the mapper decides which reducer receives its output, but unlike GROUP BY it performs no aggregation such as count(*); it only directs where the data goes. When combined with SORT BY, DISTRIBUTE BY must come first.
SELECT name, employee_id FROM employee_hr DISTRIBUTE BY employee_id;

SELECT name, start_date
FROM employee_hr
DISTRIBUTE BY start_date SORT BY name;
  • CLUSTER BY: a shortcut for DISTRIBUTE BY and SORT BY on the same set of columns. CLUSTER BY does not yet accept ASC or DESC. Whereas ORDER BY sorts globally, CLUSTER BY sorts the data only within each distributed group:
SELECT name, employee_id FROM employee_hr CLUSTER BY name;
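
The four clauses above differ mainly in how rows are routed to reducers and where the sorting happens. The sketch below is a rough Python simulation of that routing, not Hive's actual implementation: the sample rows, the reducer count, and the CRC32-based `partition` helper are all assumptions made for illustration (Hive uses its own hash partitioner).

```python
import zlib
from collections import defaultdict

# Hypothetical rows standing in for employee_hr: (name, employee_id).
ROWS = [("Will", 3), ("Michael", 1), ("Lucy", 4), ("Steven", 2), ("Shelley", 5)]
NUM_REDUCERS = 2  # assumed reducer count


def partition(value, num_reducers):
    """Stable stand-in for a hash partitioner (an assumption, not Hive's hash)."""
    return zlib.crc32(repr(value).encode("utf-8")) % num_reducers


def order_by(rows, key, reverse=False):
    """ORDER BY: all rows pass through a single reducer, giving a total order."""
    return sorted(rows, key=key, reverse=reverse)


def distribute_by(rows, key, num_reducers=NUM_REDUCERS):
    """DISTRIBUTE BY: rows with equal key values land on the same reducer."""
    reducers = defaultdict(list)
    for row in rows:
        reducers[partition(key(row), num_reducers)].append(row)
    return [reducers[i] for i in range(num_reducers)]


def sort_by(partitions, key, reverse=False):
    """SORT BY: each reducer sorts only its own rows; no global order."""
    return [sorted(p, key=key, reverse=reverse) for p in partitions]


def cluster_by(rows, key, num_reducers=NUM_REDUCERS):
    """CLUSTER BY: DISTRIBUTE BY plus SORT BY on the same key, ascending only."""
    return sort_by(distribute_by(rows, key, num_reducers), key)


if __name__ == "__main__":
    by_name = lambda row: row[0]
    print("ORDER BY name DESC:", order_by(ROWS, by_name, reverse=True))
    for i, part in enumerate(cluster_by(ROWS, by_name)):
        print(f"CLUSTER BY name, reducer {i}:", part)
```

Concatenating the per-reducer lists from `cluster_by` generally does not yield a fully sorted result, which is the practical difference from `order_by`.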
