
Jan 17, 2025 - 09:30
Clickhouse Source Code Analysis: sample by

https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree#sample-by

How is SAMPLE BY implemented in ClickHouse?

To use SAMPLE in ClickHouse, we need to add a SAMPLE BY expression when we create the MergeTree table.

If SAMPLE BY is specified, it must be contained in the primary key. The sampling expression must result in an unsigned integer.

Example: ORDER BY (CounterID, EventDate, intHash32(UserID)) SAMPLE BY intHash32(UserID).

SAMPLE BY doesn't change the on-disk data structure of MergeTree; the sampling expression is only used for validity checks at table creation and during SELECT processing.

In the SELECT process, the sampling phase happens before mark ranges are filtered.

Inside the sampling phase, two important things happen:
1. Get the actual sample range.
2. Add the sample range to the key conditions, so sampling can be used to skip data.

Transform sample rows into a fraction
First, the code transforms an absolute sample row count (SAMPLE n) into a relative sample fraction of the table.
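As a rough sketch of this step (in Python rather than ClickHouse's C++, with invented names), the conversion might look like this, assuming the total row count comes from the data parts selected for reading:

```python
# Hypothetical sketch of converting `SAMPLE <rows>` into a fraction.
# `approx_total_rows` stands in for the approximate row count ClickHouse
# derives from the selected parts; the names are illustrative only.

def rows_to_fraction(sample_rows: int, approx_total_rows: int) -> float:
    """Turn an absolute row-count sample into a relative fraction in (0, 1]."""
    if approx_total_rows == 0:
        return 1.0  # nothing to scale against; just read everything
    return min(1.0, sample_rows / approx_total_rows)

print(rows_to_fraction(10_000, 1_000_000))  # 0.01
```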

Get max value
Next, the code calculates the upper bound of the sample by expression, which is the maximum value of its unsigned integer type (255 for UInt8, 65535 for UInt16, and so on).
The lower bound is derived from the sample offset.
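To make the arithmetic concrete, here is a hedged Python sketch (function and variable names are invented for illustration) mapping a sample fraction plus offset onto the domain of the sampling expression's unsigned type:

```python
# Illustrative sketch only: map `SAMPLE <fraction> OFFSET <offset>` onto
# the value domain [0, type_max] of the sampling expression's type.

def sample_bounds(fraction: float, offset: float, type_max: int):
    """Return (lower, upper) bounds on [0, type_max] for the sample range."""
    lower = int(offset * type_max)
    upper = min(int((offset + fraction) * type_max), type_max)
    return lower, upper

# SAMPLE 0.1 OFFSET 0.5 over a UInt32-typed sampling expression reads
# roughly the slice [0.5 * max, 0.6 * max) of the hash domain.
UINT32_MAX = 2**32 - 1
lower, upper = sample_bounds(0.1, 0.5, UINT32_MAX)
print(lower, upper)
```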

Add sample filter conditions
Conditions are added based on the lower and upper bounds calculated above.
These conditions are then used to filter data in the filterPartsByPrimaryKeyAndSkipIndexes function.
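Conceptually, the added conditions behave like an extra predicate `lower <= sampling_expr < upper` pushed into the primary-key range filter. A toy Python illustration (not the real granule structures) of pruning granules by the min/max of their sampling values:

```python
# Illustrative only: prune "granules" whose [min, max] range of the
# sampling expression does not overlap the sample range [lower, upper).

def prune_granules(granules, lower, upper):
    """Keep only granules whose [min, max] overlaps [lower, upper)."""
    return [g for g in granules if g["max"] >= lower and g["min"] < upper]

granules = [
    {"min": 0,   "max": 90},
    {"min": 91,  "max": 180},
    {"min": 181, "max": 255},
]
# Bounds for SAMPLE 0.3 on a UInt8-like domain with no offset: [0, 76)
kept = prune_granules(granules, 0, 76)
print(len(kept))  # only the first granule overlaps [0, 76)
```

This also hints at why SAMPLE BY must be contained in the primary key: only when the data is sorted by the sampling expression are the per-granule ranges narrow enough to skip anything.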

The entire implementation process is surprisingly simple.
We can see that sampling by a low-cardinality field is not useful.
When sampling, we need a hash function such as intHash32 even if the field is already an unsigned integer. For example, if the SAMPLE BY field's type is UInt8 (0-255) but the actual values only span 0-25, we may still read all the data for SELECT ... SAMPLE 0.1. In my opinion it would be better to calculate the upper and lower bounds from the mark files, rather than directly using the maximum value of the field type. Of course, if you apply a hash function, this problem disappears.
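The UInt8 example can be checked numerically. This small snippet (illustrative only) shows how a bound computed from the type maximum covers nearly all of the skewed data:

```python
# The upper bound comes from the TYPE maximum (255), not from the data.
UINT8_MAX = 255
upper = int(0.1 * UINT8_MAX)  # 25

actual_values = list(range(26))  # the data only ever uses 0..25
selected = [v for v in actual_values if v < upper]
print(len(selected), "of", len(actual_values), "values fall in the sample")
```

So SAMPLE 0.1 ends up reading roughly 96% of the rows instead of 10%; hashing with intHash32 spreads the values across the full UInt32 domain and restores the expected ratio.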

When SAMPLE doesn't deliver the performance we expect, it may be because the SAMPLE BY expression ranks low in the ORDER BY sequence. With ORDER BY (a, b, c), if we don't constrain a and b in the WHERE clause, sampling by c may not skip much data.
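A toy simulation of this effect (illustrative Python, not ClickHouse internals): with data ordered by (a, b, c), the value of c restarts inside every granule, so a range condition on c alone overlaps every granule:

```python
# Rows ordered by (a, b, c): c is sorted only within each (a, b) group,
# so every granule ends up spanning almost the whole range of c.
rows = [(a, b, c) for a in range(4) for b in range(4) for c in range(16)]

def granule_c_ranges(rows, granule_size=32):
    """Yield (min_c, max_c) per granule of `granule_size` rows."""
    for i in range(0, len(rows), granule_size):
        chunk = [r[2] for r in rows[i:i + granule_size]]
        yield min(chunk), max(chunk)

# A sample condition on c alone, covering 1/8 of c's domain: 0 <= c < 2.
kept = [rng for rng in granule_c_ranges(rows) if rng[0] < 2 and rng[1] >= 0]
total = len(rows) // 32
print(f"{len(kept)} of {total} granules must still be read")
```

If the WHERE clause pinned a and b to single values, the remaining mark ranges would be sorted by c, and the same condition could prune most of them.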