A new measure for Percentile precalculation

Apr 1, 2017 • Dong Li

Introduction

Since Apache Kylin 2.0, there’s a new measure for percentile precalculation, which aims at (sub-)second latency for approximate percentile analytics SQL queries. The implementation is based on t-digest library under Apachee 2.0 license, which provides a high-effecient data structure to save aggregation counters and algorithm to calculate approximate result of percentile.

Percentile

From wikipedia: A percentile (or a centile) is a measure used in statistics indicating the value below which a given percentage of observations in a group of observations fall. For example, the 20th percentile is the value (or score) below which 20% of the observations may be found.

In Apache Kylin, we support the similar SQL sytanx like Apache Hive, with a aggregation function called percentile(<Number Column>, <Double>):

SELECT seller_id, percentile(price, 0.5)
FROM test_kylin_fact
GROUP BY seller_id

How to use

If you know little about Cubes, please go to QuickStart first to learn basic knowledge.

Firstly, you need to add this column as measure in data model.

Secondly, create a cube and add a PERCENTILE measure.

Finally, build the cube and try some query.