[Hive] Hive SQL Range-Based Statistics

Requirement:

A table contains a text field. We need to count the number of records whose text length falls into ranges [0,20], [20,40], [40,60], etc. The ranges should automatically extend to cover the maximum value.

Implementation:

Method 1:

1
2
3
4
5
6
7
select count(case when length(text) between 0 and 20 then 1 end) as text1,
count(case when length(text) between 21 and 40 then 1 end) as text2,
count(case when length(text) between 41 and 60 then 1 end) as text3,
count(case when length(text) between 61 and 80 then 1 end) as text4,
count(case when length(text) between 81 and 100 then 1 end) as text5,
count(case when length(text) > 100 then 1 end) as text6
from table;

The problem with this approach is that you need to know the max value first. As the max increases, the SQL statement gets longer and longer, and it requires manual intervention.

Method 2:

1
select floor(length(text)/20) * 20, count(*) from table group by floor(length(text)/20);

This method divides each text’s length by 20, floors the result, then multiplies by 20 to get a grouping key. Using this key for GROUP BY gives us the final result. Without any loops, this simple statement elegantly solves the automatic range-based statistics problem.