Set Digest functions
Set Digest functions
Trino offers several functions that deal with the MinHash technique.
MinHash is used to quickly estimate the Jaccard similarity coefficient between two sets.
It is commonly used in data mining to detect near-duplicate web pages at scale. By using this information, the search engines efficiently avoid showing within the search results two pages that are nearly identical.
The following example showcases how the Set Digest functions can be used
to naively estimate the similarity between texts. The input texts are
split by using the function ngrams
to
4-shingles which are used
as input for creating a set digest of each initial text. The set digests
are compared to each other to get an approximation of the similarity of
their corresponding initial texts:
Results:
id1 | id2 | intersection_cardinality | jaccard_index |
---|---|---|---|
1 | 2 | 0 | 0.0 |
1 | 3 | 4 | 0.6 |
2 | 3 | 0 | 0.0 |
The above result listing points out, as expected, that the texts with the id 1 and 3 are quite similar.
One may argue that the text with the id 2 is somewhat similar to the texts with the id 1 and 3. Due to the fact in the example above 4-shingles are taken into account for measuring the similarity of the texts, there are no intersections found for the text pairs 1 and 2, respectively 3 and 2 and therefore there the similarity index for these text pairs is 0.
Data structures
Trino implements Set Digest data sketches by encapsulating the following components:
The HyperLogLog structure is used for the approximation of the distinct elements in the original set.
The MinHash structure is used to store a low memory footprint signature of the original set. The similarity of any two sets is estimated by comparing their signatures.
The Trino type for this data structure is called setdigest
. Trino
offers the ability to merge multiple Set Digest data sketches.
Serialization
Data sketches can be serialized to and deserialized from varbinary
.
This allows them to be stored for later use.
Functions
make_set_digest()
make_set_digest(x)
→ setdigest
Composes all input values of x
into a setdigest
.
Create a setdigest
corresponding to a bigint
array:
Create a setdigest
corresponding to a varchar
array:
SELECT make_set_digest(value) FROM (VALUES ‘Trino’, ‘SQL’, ‘on’, ‘everything’) T(value);
merge_set_digest()
merge_set_digest(setdigest, setdigest)
→ setdigest
Returns the setdigest
of the aggregate union of the individual
setdigest
Set Digest structures.
Returns the cardinality of the set digest from its internal
HyperLogLog
component.
Examples:
intersection_cardinality()
intersection_cardinality(x,y)
→ long
Returns the estimation for the cardinality of the intersection of the two set digests.
x
and y
must be of type setdigest
Examples:
jaccard_index()
jaccard_index(x, y)
→ double
Returns the estimation of Jaccard index for the two set digests.
x
and y
must be of type setdigest
.
Examples:
hash_counts()
hash_counts(x)
→ map(bigint, bigint)
Returns a map containing the
Murmur3Hash128
hashed values and the count of their occurences within the internal
MinHash
structure belonging to x
.
x
must be of type setdigest
.
Examples:
Was this page helpful?