
Summary

Idea

Use gzip to compress each text and treat the compressed length as a representation of that text.

The intuition is that similar texts share information, so concatenating two similar texts adds less to the compressed length than concatenating two dissimilar ones.

Let C(x) be the compressed length of x. Consider three texts:

  • x1 = Japan’s Seiko Epson Corp. has developed a 12-gram flying microrobot.
  • x2 = The latest tiny flying robot has been unveiled in Japan.
  • x3 = Michael Phelps won the gold medal in the 400 individual medley.

Since x2 is similar to x1, and very different from x3, we expect C(x1+x2) - C(x1) < C(x1+x3) - C(x1). The + operator here denotes text concatenation.
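As a quick sanity check, here is a minimal sketch of that inequality on the three sentences above (only the standard library is used; on texts this short the gap can be small, so the ordering is expected rather than guaranteed):

import gzip

x1 = "Japan's Seiko Epson Corp. has developed a 12-gram flying microrobot."
x2 = "The latest tiny flying robot has been unveiled in Japan."
x3 = "Michael Phelps won the gold medal in the 400 individual medley."

def C(x):
    # Compressed length of a text, in bytes
    return len(gzip.compress(x.encode()))

# The similar pair should add less compressed length than the dissimilar pair
print(C(x1 + x2) - C(x1))  # expected to be smaller ...
print(C(x1 + x3) - C(x1))  # ... than this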

We can then use a distance-based classifier (like kNN) to classify texts, using the Normalized Compression Distance (NCD) as the metric: NCD(x1, x2) = (C(x1+x2) - min(C(x1), C(x2))) / max(C(x1), C(x2)).

import gzip
import numpy as np

# k, training_set, and test_set are assumed to be defined elsewhere;
# training_set and test_set are sequences of (text, label) pairs.
for (x1, _) in test_set:
    # Compressed length of the test example
    Cx1 = len(gzip.compress(x1.encode()))

    distance_from_x1 = []

    for (x2, _) in training_set:
        # Compressed length of the training example
        Cx2 = len(gzip.compress(x2.encode()))

        # NCD between the test example and this training example
        x1x2 = "".join([x1, x2])
        Cx1x2 = len(gzip.compress(x1x2.encode()))
        ncd = (Cx1x2 - min(Cx1, Cx2)) / max(Cx1, Cx2)

        distance_from_x1.append(ncd)

    # Get the k nearest training examples and predict by majority vote
    # (this block must sit inside the outer loop, once per test example)
    sorted_idx = np.argsort(np.array(distance_from_x1))
    top_k_class = [training_set[i][1] for i in sorted_idx[:k]]
    predict_class = max(set(top_k_class), key=top_k_class.count)
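A hedged usage sketch wrapping the loop above into a function (the function name, toy dataset, and k value below are illustrative, not part of the original snippet):

import gzip
import numpy as np

def gzip_knn_predict(x1, training_set, k=3):
    # Predict the label of x1 via NCD-based k-nearest neighbors
    Cx1 = len(gzip.compress(x1.encode()))
    distances = []
    for (x2, _) in training_set:
        Cx2 = len(gzip.compress(x2.encode()))
        Cx1x2 = len(gzip.compress((x1 + x2).encode()))
        distances.append((Cx1x2 - min(Cx1, Cx2)) / max(Cx1, Cx2))
    sorted_idx = np.argsort(np.array(distances))
    top_k_class = [training_set[i][1] for i in sorted_idx[:k]]
    return max(set(top_k_class), key=top_k_class.count)

training_set = [
    ("The robot can fly for several minutes", "tech"),
    ("Engineers unveiled a new flying microchip", "tech"),
    ("The swimmer broke the world record in the medley", "sports"),
    ("The striker scored twice in the final", "sports"),
]
print(gzip_knn_predict("A tiny flying robot was developed", training_set, k=3))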

Performance

  • This is much faster and requires fewer resources than DNN-based approaches
  • Performs well when there is little labeled data: it outperformed DNN approaches in few-shot learning. As the number of shots increases, gzip and DNN performance become similar.
  • gzip turned out to be the most well-rounded option in terms of speed and accuracy

Links