Segment Merge

In the incremental build mode, as the number of segments grows, the system may need to aggregate multiple segments to serve a query, which degrades query performance. At the same time, a large number of small files puts pressure on the HDFS NameNode and affects HDFS performance. Apache Kylin provides a mechanism to control the number of segments: Segment Merge.

Manual Merge

You can merge multiple segments in the web GUI or by using the Segment Manage API.

In the web GUI

  1. In the Data Assets -> Model -> Segment list, select the segments to be merged.
  2. Click "Merge" in the drop-down list, check that three conditions are met (consistent indexes, consistent sub-partition values, and continuous time ranges), and submit the merge task. The system submits a task of type "Merge Data". The original segments remain available until the task completes; after that, they are replaced by the new merged segment. To save system resources, the original segments are then recycled and cleaned up.
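
The three conditions checked before a merge can be sketched as follows. This is only an illustration: the `Segment` class, its fields, and `merge_problems` are hypothetical names, not part of Kylin, and the "at least two segments" check is an added assumption.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Segment:
    """Hypothetical stand-in for a segment; not a Kylin class."""
    name: str
    start: date                # inclusive
    end: date                  # exclusive
    index_ids: frozenset       # indexes built in this segment
    sub_partitions: frozenset  # sub-partition values loaded in this segment


def merge_problems(segments):
    """Return the violated conditions; an empty list means the merge can be submitted."""
    ordered = sorted(segments, key=lambda s: s.start)
    if len(ordered) < 2:
        # Assumption: merging a single segment is treated as invalid.
        return ["select at least two segments"]
    problems = []
    first = ordered[0]
    if any(s.index_ids != first.index_ids for s in ordered):
        problems.append("inconsistent indexes")
    if any(s.sub_partitions != first.sub_partitions for s in ordered):
        problems.append("inconsistent sub-partition values")
    # Continuous means each segment starts exactly where the previous one ends.
    if any(a.end != b.start for a, b in zip(ordered, ordered[1:])):
        problems.append("time ranges are not continuous")
    return problems
```

Returning a list of problems instead of a single boolean makes it easy to tell the user which condition failed.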

Auto Merge

Merging segments manually is simple, but it has to be triggered by hand from time to time. With many projects and models in a production environment, triggering merges one by one becomes cumbersome. Apache Kylin therefore provides an automatic segment merging solution.

Auto-Merge settings

Depending on business needs, Auto-Merge can be configured at the project level or at the model level. If the two merge strategies differ, the system uses the model-level settings.

  • Project-level: applies the same merge strategy to all models in a project.
  • Model-level: lets individual models in a project use different auto-merge strategies.

Please refer to Segment Settings and Model/Index Group Rewrite Settings of Project Settings for the specific requirements.

Auto-merge strategy

  • Merge Timing: The system triggers an automatic merge attempt every time a new segment in the project becomes complete. To ensure query performance, the segments are not all merged at once.

  • Time Threshold: The user can choose from up to 6 threshold levels. The higher the level, the longer the time threshold. Multiple levels can be selected (e.g. week and month). Note: day, week and month represent natural day, natural week and natural month respectively.

    Level   Time Threshold
    1       hour
    2       day
    3       week
    4       month
    5       quarter
    6       year
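
To make "natural" periods concrete, the sketch below computes the calendar period that contains a given timestamp for each of the six levels. The function name `natural_period` and the `LEVELS` list are illustrative, and the Monday week start is an assumption taken from the examples on this page, not a documented Kylin setting.

```python
from datetime import datetime, timedelta

LEVELS = ["hour", "day", "week", "month", "quarter", "year"]  # levels 1 to 6


def natural_period(ts, level):
    """Return (start, end) of the natural period containing ts; end is exclusive."""
    day = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if level == "hour":
        start = ts.replace(minute=0, second=0, microsecond=0)
        return start, start + timedelta(hours=1)
    if level == "day":
        return day, day + timedelta(days=1)
    if level == "week":
        start = day - timedelta(days=day.weekday())  # assumption: weeks start on Monday
        return start, start + timedelta(days=7)
    if level == "month":
        months, first_month = 1, ts.month
    elif level == "quarter":
        months, first_month = 3, 3 * ((ts.month - 1) // 3) + 1
    elif level == "year":
        months, first_month = 12, 1
    else:
        raise ValueError(f"unknown level: {level}")
    start = day.replace(month=first_month, day=1)
    # Advance by `months` months, rolling over into the next year if needed.
    total = start.month - 1 + months
    end = start.replace(year=start.year + total // 12, month=total % 12 + 1)
    return start, end


start, end = natural_period(datetime(2022, 2, 27, 15, 30), "week")
print(start.date(), end.date())  # 2022-02-21 2022-02-28
```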

Choose Segment

When an Auto-Merge is triggered, the system starts from the largest selected time threshold, skips segments whose time length is greater than or equal to the threshold, and selects the remaining eligible segments (consistent indexes, consistent sub-partition values, and continuous time ranges).

Try Merge

When the total time length of the selected segments reaches the time threshold, they are merged. After the merge task completes, the system triggers another Auto-Merge attempt. Otherwise, the system repeats the search using the next smaller selected time threshold. The attempts stop once no selected level has segments that meet the condition.
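
The selection and merge steps can be sketched as a search over natural periods. This is a simplified model under stated assumptions, not Kylin's implementation: `Segment`, `bucket`, and `find_merge` are hypothetical names, only the week and month levels are covered, weeks are assumed to run Monday to Sunday and to be cut at month boundaries, and the index and sub-partition checks are left out. The sample data mirrors Example 1 below.

```python
from dataclasses import dataclass
from datetime import date, timedelta

ORDER = ["week", "month"]  # smallest to largest; only these two levels are sketched


@dataclass(frozen=True)
class Segment:
    name: str
    start: date  # inclusive
    end: date    # exclusive


def bucket(day, level):
    """Natural period containing `day` as (start, end), end exclusive."""
    month_start = day.replace(day=1)
    next_month = (month_start + timedelta(days=32)).replace(day=1)
    if level == "month":
        return month_start, next_month
    # Natural week (assumed Monday to Sunday), cut at month boundaries.
    monday = day - timedelta(days=day.weekday())
    return max(monday, month_start), min(monday + timedelta(days=7), next_month)


def find_merge(segments, levels):
    """Return (level, segments_to_merge) for the first match, or None."""
    ordered = sorted(segments, key=lambda s: s.start)
    # Try the largest selected threshold first, then the smaller ones.
    for level in sorted(levels, key=ORDER.index, reverse=True):
        seen = set()
        for seg in ordered:
            lo, hi = bucket(seg.start, level)
            if (lo, hi) in seen:
                continue
            seen.add((lo, hi))
            # A segment that already fills the period ends up alone in its
            # group, and a longer one does not fit at all; both are skipped.
            group = [s for s in ordered if lo <= s.start and s.end <= hi]
            continuous = all(a.end == b.start for a, b in zip(group, group[1:]))
            if (len(group) > 1 and continuous
                    and group[0].start == lo and group[-1].end == hi):
                return level, group
    return None


d = date
segs = [
    Segment("A", d(2022, 1, 1), d(2022, 2, 1)),
    Segment("B", d(2022, 2, 1), d(2022, 2, 7)),
    Segment("C", d(2022, 2, 7), d(2022, 2, 14)),
    Segment("D", d(2022, 2, 14), d(2022, 2, 21)),
    Segment("E", d(2022, 2, 21), d(2022, 2, 26)),
    Segment("F", d(2022, 2, 26), d(2022, 2, 27)),
    Segment("G", d(2022, 2, 27), d(2022, 2, 28)),
]
level, group = find_merge(segs, ["week", "month"])
print(level, [s.name for s in group])  # week ['E', 'F', 'G']
```

In this model, a caller would replace the returned group with one merged segment and call `find_merge` again until it returns `None`, which corresponds to the repeated attempts described above.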

Notice

  • The Auto-Merge of week is constrained by month: if a natural week spans two months, quarters, or years, the parts on each side of the boundary are merged separately (see Example 2).
  • During the process of merging segments, the HDFS storage space may exceed the threshold limit, causing the merging to fail.
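
The month constraint on weekly merges can be illustrated by splitting a natural week at month boundaries. The helper `week_pieces` is a hypothetical name, and the Monday week start is an assumption based on the examples on this page.

```python
from datetime import date, timedelta


def week_pieces(day):
    """Split the natural week containing `day` at month boundaries."""
    monday = day - timedelta(days=day.weekday())  # assumption: weeks start on Monday
    week_end = monday + timedelta(days=7)         # exclusive
    pieces, start = [], monday
    while start < week_end:
        # First day of the month after the one `start` falls in.
        next_month = (start.replace(day=1) + timedelta(days=32)).replace(day=1)
        end = min(week_end, next_month)
        pieces.append((start, end))
        start = end
    return pieces


for start, end in week_pieces(date(2021, 12, 29)):
    print(start, "to", end - timedelta(days=1))
# 2021-12-27 to 2021-12-31
# 2022-01-01 to 2022-01-02
```

Because quarter and year boundaries always coincide with month boundaries, cutting at month boundaries covers those cases as well.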

Example of Auto Merge

Example 1

The switch for Auto-Merge is turned on, and the specified time thresholds are week and month. There are six consecutive Segments A~F.

Segment (Initial)   Time Range                 Time Length
A                   2022-01-01 ~ 2022-01-31    1 month
B                   2022-02-01 ~ 2022-02-06    1 week
C                   2022-02-07 ~ 2022-02-13    1 week
D                   2022-02-14 ~ 2022-02-20    1 week
E                   2022-02-21 ~ 2022-02-25    5 days
F                   2022-02-26 (Saturday)      1 day

Segment G was added later (Sunday 2022-02-27).

  • There are now 7 segments, A~G. The system first tries to merge by month. Segment A's time length is greater than or equal to the threshold (1 month), so it is excluded. The remaining segments B-G add up to less than 1 month, so they do not meet the time threshold and cannot be merged by month.

  • The system then tries the next smaller time threshold (merge by week). It rescans all segments and finds that A, B, C, and D are each greater than or equal to the threshold (1 week), so they are skipped. The remaining segments E-G add up to the threshold (1 week) and are merged into Segment X.

  • The addition of Segment X triggers another merge attempt, but the attempt ends because no segments meet the conditions for automatic merge.

Segment (Add G, Trigger Auto-Merge)   Time Range                 Time Length
A                                     2022-01-01 ~ 2022-01-31    1 month
B                                     2022-02-01 ~ 2022-02-06    1 week
C                                     2022-02-07 ~ 2022-02-13    1 week
D                                     2022-02-14 ~ 2022-02-20    1 week
X (Original E-G)                      2022-02-21 ~ 2022-02-27    1 week

Segment H is then added (2022-02-28).

  • This triggers a new merge attempt, starting with the month threshold. All segments except A add up to the threshold (1 month), so B-H are merged into Segment Y.

  • The addition of Segment Y triggers another merge attempt, but no segments meet the conditions for Auto-Merge, so the attempt ends.

Segment (Add H, Trigger Auto-Merge)   Time Range                 Time Length
A                                     2022-01-01 ~ 2022-01-31    1 month
Y (Original B-H)                      2022-02-01 ~ 2022-02-28    1 month

Example 2

There are six consecutive segments A~F, each with a time length of 1 day. At this point, the "auto merge" switch is turned on and the time threshold is set to week.

Segment (Initial)   Time Range
A                   Monday 2021-12-27
B                   Tuesday 2021-12-28
C                   Wednesday 2021-12-29
D                   Thursday 2021-12-30
E                   Friday 2021-12-31
F                   Saturday 2022-01-01

Then Segment G was added (Sunday 2022-01-02) with a duration of 1 day.

  • At this point there are 7 consecutive segments, forming a natural week that spans 2 years. The system tries to merge by week. Because the week is split at the year boundary, only A-E are merged into a new Segment X.

Segment (Add G, Trigger 1st Auto-Merge)   Time Range
X (Original A-E)                          Monday to Friday (2021-12-27 ~ 2021-12-31)
F                                         Saturday 2022-01-01
G                                         Sunday 2022-01-02

  • The addition of Segment X triggers another attempt to merge by week, so F-G are merged into a new Segment Y.

Segment (Add X, Trigger 2nd Auto-Merge)   Time Range
X (Original A-E)                          Monday to Friday (2021-12-27 ~ 2021-12-31)
Y (Original F-G)                          Saturday to Sunday (2022-01-01 ~ 2022-01-02)

  • The addition of Segment Y triggers yet another attempt to merge by week. No segments remain that can be merged into a week (within each year), so the attempt stops.