Pco is designed specifically for numerical data, whereas alternatives rely on
general-purpose (LZ) compressors that target string or binary data.
Pco uses a holistic, 3-step approach:
1. modes.
Pco identifies an approximate structure of the numbers called a
mode and then uses it to split numbers into "latents".
As an example, if all numbers are approximately multiples of 777, int mult mode
splits each number x into latent variables l_0 and
l_1 such that x = 777 * l_0 + l_1.
Most natural data uses classic mode, which simply matches x = l_0.
2. delta encoding.
Pco identifies whether certain latent variables would be better compressed as
deltas between consecutive elements (or deltas of deltas, or deltas with
lookback).
If so, it takes differences.
3. binning.
This is the heart and most novel part of Pco.
Pco represents each (delta-encoded) latent variable as an approximate,
entropy-coded bin paired with an exact offset into that bin.
This nears the Shannon entropy of any smooth distribution very efficiently.
These 3 steps cohesively capture most entropy of numerical data without waste.
In contrast, LZ compressors are only effective for patterns like repeating
exact sequences of numbers.
Such patterns constitute just a small fraction of most numerical data's
entropy.
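To make the three steps above concrete, here is a minimal Python sketch. It is not Pco's implementation: the function names are hypothetical, the multiplier 777 is supplied by hand rather than detected, and the bins are fixed-width, which is a simplifying assumption and not Pco's actual bin selection. The bit estimate is just the empirical entropy of the bin indices plus the raw offset bits.

```python
from collections import Counter
from math import log2


def int_mult_split(xs, base):
    """Step 1 (int mult mode): split x into (l_0, l_1) with x = base * l_0 + l_1."""
    pairs = [divmod(x, base) for x in xs]
    return [q for q, _ in pairs], [r for _, r in pairs]


def delta_encode(latents):
    """Step 2: keep the first value, then store consecutive differences."""
    return latents[:1] + [b - a for a, b in zip(latents, latents[1:])]


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def bin_and_offset(values, bin_width):
    """Step 3 (simplified): (bin index, exact offset) using fixed-width bins."""
    return [divmod(v, bin_width) for v in values]


def estimated_bits(values, bin_width):
    """Rough size estimate: entropy of the bin indices plus raw offset bits."""
    n = len(values)
    counts = Counter(v // bin_width for v in values)
    entropy = -sum(c / n * log2(c / n) for c in counts.values())
    return n * (entropy + log2(bin_width))


# Numbers that are approximately multiples of 777.
xs = [777 * k + k % 3 for k in range(1000, 1010)]
l0, l1 = int_mult_split(xs, 777)   # l0 counts up by 1, l1 stays tiny
l0_deltas = delta_encode(l0)       # first value, then a run of 1s
```

After the split and the delta step, both latent streams take only a handful of distinct small values, which is the kind of distribution that binning with entropy coding handles cheaply.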
Usage Details
Wrapped or Standalone
Pco is designed to embed into wrapping formats.
It provides a powerful wrapped API with the building blocks to interleave it
with the wrapping format.
This is useful if the wrapping format needs to support things like nullability,
multiple columns, random access, or seeking.
The standalone format is a minimal implementation of a wrapped format.
It supports only batched decompression, with no other niceties.
It is mainly recommended for quick proofs of concept and benchmarking.
Granularity
Pco has a hierarchy of multiple batches per page; multiple pages per chunk; and
multiple chunks per file.
By default Pco uses up to 2^18 (~262k) numbers per chunk if available.
| | unit of ___ | size for good compression |
|---|---|---|
| chunk | compression | >10k numbers |
| page | interleaving w/ wrapping format | >1k numbers |
| batch | decompression | 256 numbers (fixed) |
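As a rough illustration of the hierarchy, the sketch below counts chunks and batches for a given number of values. It assumes a greedy split into chunks of at most 2^18 numbers, one page per chunk, and batches of 256 that never span a chunk boundary. These are assumptions made for the arithmetic, not a description of how Pco actually lays out a file, and `granularity` is a hypothetical helper.

```python
def granularity(n, chunk_size=1 << 18, batch_size=256):
    """Return (chunk lengths, total batch count) for n numbers.

    Assumes greedy chunking and that batches do not cross chunk boundaries.
    """
    chunk_lens = [min(chunk_size, n - start) for start in range(0, n, chunk_size)]
    # Ceiling division: a partially filled final batch still counts as a batch.
    n_batches = sum(-(-length // batch_size) for length in chunk_lens)
    return chunk_lens, n_batches


chunk_lens, n_batches = granularity(1_000_000)
# chunk_lens == [262144, 262144, 262144, 213568], n_batches == 3907
```

Under these assumptions, every chunk in the example is well above the ~10k numbers suggested for good compression.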
Mistakes to Avoid
You may get disappointing results from Pco if the data in a single chunk
combines semantically different sequences,
contains too few numbers (see the Granularity section above), or
is inherently 2D or higher.
Example: the NYC taxi dataset has f64 columns for fare and trip_miles.
Suppose we index these as fare[0], ..., fare[n-1] and
trip_miles[0], ..., trip_miles[n-1] respectively, where n=50,000.
separate chunk for each column => good compression
single chunk fare[0], ..., fare[n-1], trip_miles[0], ..., trip_miles[n-1] => bad compression
single chunk fare[0], trip_miles[0], ..., fare[n-1], trip_miles[n-1] => bad compression
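If the data arrives in one of the two single-sequence layouts above, it can be separated back into one sequence per column before compression, so that each column gets its own chunk. A minimal sketch follows; the helper names and sample values are hypothetical, and the compression call itself is left out.

```python
def split_interleaved(values, n_columns):
    """fare[0], trip_miles[0], fare[1], ... -> one list per column."""
    return [values[i::n_columns] for i in range(n_columns)]


def split_concatenated(values, n_columns):
    """fare[0..n-1] followed by trip_miles[0..n-1] -> one list per column."""
    n = len(values) // n_columns
    return [values[i * n:(i + 1) * n] for i in range(n_columns)]


# Made-up sample values, for illustration only.
fare = [12.5, 8.0, 23.75]
trip_miles = [3.1, 1.4, 7.9]

interleaved = [v for pair in zip(fare, trip_miles) for v in pair]
concatenated = fare + trip_miles

# Either layout recovers the original columns; each would then be
# compressed as its own chunk.
columns_a = split_interleaved(interleaved, 2)
columns_b = split_concatenated(concatenated, 2)
```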