This repository has been archived by the owner on Nov 6, 2023. It is now read-only.
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request #10 from javacatknight/main
Add high-level dev notes, scaffold comments
Showing 10 changed files with 855 additions and 27 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
# Original Repo | ||
https://github.com/cwida/fsst/tree/master | ||
* "...12..." files seem to be older files. | ||
|
||
# Codebase | ||
* Sanity folder - minimal code to convert from. More at the original repo | ||
|
||
# Technical Java Notes | ||
1. A C/C++ char is 1 byte; a Java char is 2 bytes (a UTF-16 code unit). |
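
Since the C++ code works on raw 1-byte chars, a port would probably hold data in Java `byte` rather than `char`. One pitfall: Java's `byte` is signed, so a value such as 255 reads back as -1 unless it is masked. A minimal sketch of the difference (the class `ByteVsChar` and helper `toUnsigned` are hypothetical names, not part of this repository):

```java
import java.nio.charset.StandardCharsets;

final class ByteVsChar {
    // Hypothetical helper: widen a signed Java byte to its unsigned value (0..255).
    static int toUnsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        System.out.println(Character.BYTES);     // 2: a Java char is a UTF-16 code unit
        System.out.println(Byte.BYTES);          // 1: same size as a C/C++ char

        byte raw = (byte) 255;                   // stored as -1 because byte is signed
        System.out.println(raw);                 // -1
        System.out.println(toUnsigned(raw));     // 255

        // Converting text to bytes gives the 1-byte units the C++ code operates on.
        byte[] bytes = "http".getBytes(StandardCharsets.UTF_8);
        System.out.println(bytes.length);        // 4
    }
}
```

The standard library's `Byte.toUnsignedInt(b)` does the same masking.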
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -7,4 +7,4 @@ public class Encoder { | |
Counters counters; | ||
int simdbuf[] = new int[FSST_BUFSZ]; | ||
|
||
} | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,95 @@ | ||
# FOREWORD: | ||
If you're only interested in the developer-facing code, skip the Background section. These notes are a work in progress. | ||
|
||
# TABLE OF CONTENTS | ||
1. [Background](#background) | ||
2. [Summary](#summary) | ||
3. [Overview](#overview) | ||
|
||
# BACKGROUND <a name="background"></a> | ||
Dictionary compression: uniquely maps each distinct string to a fixed-size integer code. | ||
- Effective only when strings repeat exactly; strings that are merely similar gain nothing | ||
- Ineffective when applied to only a fraction of a whole relation | ||
- Most stored strings are shorter than 200 bytes, and often shorter than 30 bytes | ||
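
To make the "exact repeats only" limitation concrete, here is a minimal sketch of dictionary encoding. The class `DictionaryEncoder` and its methods are hypothetical names for illustration; real systems organise their dictionaries differently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DictionaryEncoder {
    private final Map<String, Integer> codeOf = new HashMap<>();
    private final List<String> valueOf = new ArrayList<>();

    // Returns the existing code for a string, or assigns the next free code.
    int encode(String value) {
        Integer code = codeOf.get(value);
        if (code == null) {
            code = valueOf.size();
            codeOf.put(value, code);
            valueOf.add(value);
        }
        return code;
    }

    String decode(int code) {
        return valueOf.get(code);
    }

    int dictionarySize() {
        return valueOf.size();
    }

    public static void main(String[] args) {
        DictionaryEncoder dict = new DictionaryEncoder();
        int a = dict.encode("http://example.com");
        int b = dict.encode("http://example.com"); // exact repeat: reuses the code
        int c = dict.encode("http://example.org"); // similar string: new entry
        System.out.println(a + " " + b + " " + c); // 0 0 1
        System.out.println(dict.dictionarySize()); // 2
    }
}
```

The two similar URLs share almost every byte, yet each takes a full dictionary entry. That shared-substring redundancy is what FSST targets.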
|
||
LZ4 (general-purpose, LZ77-style block compression) | ||
- Not efficient for compressing individual strings - needs kilobytes of input to compress well | ||
- So it's used to compress columnar blocks (many string values together) | ||
- This prevents random access to individual values | ||
- Example: reading a few values requires decompressing a whole block, most of which goes unused. | ||
|
||
Potential: | ||
- Use in conjunction with dictionary compression - i.e. after the data is dictionary-encoded, FSST can compress the strings in the dictionary | ||
- Can be applied to existing database systems | ||
- Compressed Query Processing - Equality comparisons can be done on the compressed data, without decompression | ||
|
||
|
||
# SUMMARY <a name="summary"></a> | ||
FSST - Fast Static Symbol Table | ||
## Compression | ||
* Replace frequently occurring substrings of 1-8 bytes with 1-byte codes. | ||
* Bytes not covered by any symbol (i.e. substrings that don't occur frequently) are escaped, to indicate they should be copied as is. This follows from the symbol table being limited to 256 entries (one per 1-byte code). The last code (255) is reserved as the escape code. | ||
|
||
### Algorithm | ||
* Ties are resolved randomly | ||
|
||
## Decompression: | ||
* Translate each 1-byte code into its symbolic substring, using an immutable array table (256 entries) | ||
|
||
|
||
# OVERVIEW <a name="overview"></a> | ||
## Decompression Algorithm: | ||
- Decode each code into its symbol and store it in the output as an 8-byte word. | ||
/** */ | ||
void decodeBasic (int[] in, int[] out, sym /* symbolTable */, len /* actualLengthOfSymbols */){ | ||
int code = *in++; //Dereference to get (*in) before the in pointer is moved forwards. | ||
*out = sym[code]; //Translate the code to its symbol, cast to an 8-byte word, and put it into the output buffer. | ||
out += len[code]; //Move the out pointer forwards by the symbol's actual length, to the next place to write. | ||
} | ||
|
||
void decodeWithEscape (...) { | ||
if (code == 255) | ||
*out++ = *in++; //Copy the escaped byte (the byte after the escape code) as is. | ||
} | ||
/***/ | ||
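
The pointer-style pseudocode above does not translate directly to Java. The following is one possible sketch of the same decode loop, under some assumptions: symbols are stored as `byte[]` entries instead of packed 8-byte words, code 255 is the escape code, and the names `DecodeSketch` and `decode` are hypothetical. It copies exactly `sym.length` bytes rather than writing a whole 8-byte word and advancing by the length, which is simpler but likely slower than the original trick.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class DecodeSketch {
    static final int ESCAPE = 255; // assumed escape code (last table entry)

    // symbols[code] holds the bytes of the symbol for that code (1 to 8 bytes).
    static byte[] decode(byte[] in, byte[][] symbols) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int pos = 0;
        while (pos < in.length) {
            int code = in[pos++] & 0xFF;          // mask: Java bytes are signed
            if (code == ESCAPE) {
                if (pos >= in.length) {
                    throw new IllegalArgumentException("escape code at end of input");
                }
                out.write(in[pos++]);             // copy the escaped byte as is
            } else {
                byte[] sym = symbols[code];
                if (sym == null) {
                    throw new IllegalArgumentException("unknown code " + code);
                }
                out.write(sym, 0, sym.length);    // append the symbol's bytes
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[][] symbols = new byte[255][];
        symbols[0] = "http://".getBytes(StandardCharsets.UTF_8);
        symbols[1] = "www.".getBytes(StandardCharsets.UTF_8);
        byte[] compressed = { 0, 1, (byte) ESCAPE, (byte) 'x' };
        String text = new String(decode(compressed, symbols), StandardCharsets.UTF_8);
        System.out.println(text); // http://www.x
    }
}
```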
|
||
## Compression: | ||
* findLongestSymbol() finds the longest matching symbol at the current input position. If no matching symbol is found, the input byte is escaped. | ||
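
A rough sketch of that compression loop follows. Only the name `findLongestSymbol` comes from these notes. Its signature here, the `EncodeSketch` class, the `matchesAt` helper, and the linear scan over the table are assumptions made for readability; the original code presumably uses a faster lookup structure.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class EncodeSketch {
    static final int ESCAPE = 255; // assumed escape code

    // Linear scan for the longest symbol matching at input[pos]; -1 if none match.
    static int findLongestSymbol(byte[] input, int pos, byte[][] symbols) {
        int best = -1;
        int bestLen = 0;
        for (int code = 0; code < symbols.length; code++) {
            byte[] sym = symbols[code];
            if (sym.length > bestLen && matchesAt(input, pos, sym)) {
                best = code;
                bestLen = sym.length;
            }
        }
        return best;
    }

    static boolean matchesAt(byte[] input, int pos, byte[] sym) {
        if (pos + sym.length > input.length) return false;
        for (int i = 0; i < sym.length; i++) {
            if (input[pos + i] != sym[i]) return false;
        }
        return true;
    }

    // symbols must have at most 255 entries so that code 255 stays free for ESCAPE.
    static byte[] encode(byte[] input, byte[][] symbols) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int pos = 0;
        while (pos < input.length) {
            int code = findLongestSymbol(input, pos, symbols);
            if (code < 0) {
                out.write(ESCAPE);        // no symbol matches: emit the escape code
                out.write(input[pos++]);  // followed by the literal byte
            } else {
                out.write(code);          // one byte replaces the whole symbol
                pos += symbols[code].length;
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[][] symbols = {
            "http".getBytes(StandardCharsets.UTF_8),
            "http://".getBytes(StandardCharsets.UTF_8),
            "www.".getBytes(StandardCharsets.UTF_8),
        };
        byte[] out = encode("http://www.x".getBytes(StandardCharsets.UTF_8), symbols);
        System.out.println(out.length); // 4: codes 1 and 2, then escape + 'x'
    }
}
```

Note that an escaped byte costs two output bytes, which is why an empty symbol table doubles the input size.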
|
||
## Symbol Table Construction | ||
- Choosing the 256 symbols | ||
- Naive greedy single pass: count substrings and pick the most frequently occurring. Cons: it does not consider overlapping symbols (e.g. "http://w" and "ttp://www"), and because the input is read sequentially, shorter symbols consume the input before the better/longer symbols can match (e.g. "h" before "ttp://w") | ||
- Actual iterative algorithm - Linear time, multiple (e.g. 5) iterations, on-the-fly compression, bottom-up | ||
- Concatenate short symbols to longer symbols | ||
- Multiple iterations update the table, add new symbols, remove bad symbols | ||
- Base case: empty symbol table | ||
- Each iteration: | ||
1. Iterate over the uncompressed input and compress with existing symbol table, count frequency | ||
2. Select the highest-gain symbols to construct a new symbol table. Choose from: | ||
* Old table | ||
* New symbols generated by concatenating pairs of (2) symbols | ||
* Reconsider all symbols that consist of a single byte | ||
* Each existing symbol concatenated with the next occurring byte (even if that single byte is not currently a symbol) | ||
- Ties in gain between symbols are resolved randomly | ||
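
The selection step (step 2) could be sketched roughly as below. These notes do not define "gain"; the sketch assumes gain = frequency × symbol length, which is how I understand the FSST paper, so treat that as an assumption. Symbols are represented as ASCII `String`s for readability, and ties are broken lexicographically so the output is reproducible, unlike the random tie-breaking described above. All names (`GainSelectionSketch`, `gain`, `selectTopByGain`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class GainSelectionSketch {
    // Assumed gain: occurrences * symbol length (length in chars; fine for ASCII).
    static long gain(String symbol, long count) {
        return count * symbol.length();
    }

    // Keep the maxSymbols candidates with the highest gain.
    static List<String> selectTopByGain(Map<String, Long> counts, int maxSymbols) {
        List<String> candidates = new ArrayList<>(counts.keySet());
        candidates.sort(
            Comparator.comparingLong((String s) -> gain(s, counts.get(s)))
                .reversed()                                   // highest gain first
                .thenComparing(Comparator.naturalOrder()));   // deterministic tie-break
        int n = Math.min(maxSymbols, candidates.size());
        return new ArrayList<>(candidates.subList(0, n));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new LinkedHashMap<>();
        counts.put("h", 100L);       // gain 100
        counts.put("http://", 40L);  // gain 280
        counts.put("www.", 30L);     // gain 120
        counts.put("/", 90L);        // gain 90
        System.out.println(selectTopByGain(counts, 2)); // [http://, www.]
    }
}
```

In the example, "h" is the most frequent candidate but loses to the longer symbols, which is the behaviour the naive frequency-only approach misses.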
|
||
<!-- | ||
Variables: | ||
- SymbolTable st == current table | ||
- count1[], count2[][] == frequencies of the codes | ||
buildSymbolTable(SymbolTable st) | ||
- 5 iterations | ||
- Initialize st.nSymbols = 0 | ||
- Initialize new symboltable(). Field st.symbols[] starts with 256 pseudo symbols == escaped bytes. | ||
- In the array, the next st.nSymbols (number of symbols), up to 255, contain the real symbols. | ||
- ??? | ||
compressCount(SymbolTable st, count1, count2, text) | ||
- Initial symboltable is empty, uses all escaped bytes, input size doubled. | ||
- Does not produce compressed text, just records the frequency of the codes or bytes it encounters | ||
* count1[] | ||
--> | ||
|
||
|