Explore Strategies for Handling Cardinality Explosion in Analyzer Modules #20

soul-codes · 2024-12-02T04:19:19Z

Background

Some of our analyzer modules face significant challenges due to high cardinality expansion during intermediate processing stages:

- Ngram Test: Expands the row count by a large constant factor (e.g., a 300x increase).
- Time Coordination Test: Expands supralinearly, on the order of O(N²), because of pairwise joins.
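
To make the two growth patterns above concrete, here is a rough back-of-the-envelope sketch. It assumes that every n-gram of a message becomes its own row and that the pairwise join is an unordered self-join. The n-gram range (1 to 5) and the message length (60 tokens) are illustrative assumptions, not the analyzers' actual parameters.

```python
def ngram_row_count(tokens_per_message, min_n=1, max_n=5):
    """Rows produced if every n-gram (min_n..max_n) of each message becomes a row.

    min_n and max_n are hypothetical defaults chosen for illustration.
    """
    total = 0
    for length in tokens_per_message:
        for n in range(min_n, max_n + 1):
            # A message with `length` tokens has (length - n + 1) n-grams.
            total += max(length - n + 1, 0)
    return total


def pairwise_row_count(num_rows):
    """Rows produced by an unordered self-join of num_rows rows: N * (N - 1) / 2."""
    return num_rows * (num_rows - 1) // 2


# Hypothetical input: 1,000 messages of 60 tokens each.
messages = [60] * 1000
print(ngram_row_count(messages) / len(messages))  # 290.0 rows per input row
print(pairwise_row_count(10_000))                 # 49995000
print(pairwise_row_count(100_000))                # 4999950000
```

The first pattern grows linearly with the input but with a large constant. The second grows 100x when the input grows 10x, which is why it dominates on larger datasets.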

These challenges are particularly problematic for larger datasets, as they can lead to memory exhaustion or performance degradation.

Polars:
- All of our analyzers are currently using Polars.
- Polars performs very well on smaller datasets, but in high-cardinality scenarios the expanded intermediate data eventually exhausts memory.
- This limitation reduces the practical usefulness of the analyzers for larger datasets.
Dask:
- Preliminary experiments with Dask show potential for out-of-core processing.
- However, a 1-to-1 refactor of the Polars workflow to Dask results in suboptimal performance, especially for the Ngram Test, indicating the need for additional workflow optimization.
Custom Out-of-Core Polars:
- Experimental homemade solutions have shown promise for handling large datasets effectively.
- However, they introduce maintenance overhead and a steeper learning curve compared to established libraries like Dask.
Pandas Prototypes:
- Existing prototypes implemented in Pandas are functional but single-threaded, lacking Polars' multicore advantages or Dask's distributed computing capabilities.
- Processing with Pandas often takes hours, which makes these prototypes impractical for day-to-day use.
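
For readers unfamiliar with the out-of-core idea mentioned above, the following is a minimal, library-agnostic sketch. It is not the homemade Polars solution referred to in this issue, and it does not use the Polars or Dask APIs. It assumes whitespace tokenization and a hypothetical `count_ngrams_in_batches` helper. The point is that only one batch of expanded n-grams is held in memory at a time, and each batch is reduced to partial counts that are spilled to disk before the final merge.

```python
import json
import tempfile
from collections import Counter
from itertools import islice
from pathlib import Path


def in_batches(iterable, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(iterable)
    while batch := list(islice(iterator, size)):
        yield batch


def ngrams(tokens, n):
    """Yield the n-grams of a token list as space-joined strings."""
    return (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def count_ngrams_in_batches(messages, n=2, batch_size=1000):
    """Count n-grams without materializing the full expanded table.

    Hypothetical helper: each batch is expanded, aggregated, and spilled
    to a temporary JSON file; the partial counts are merged at the end.
    """
    total = Counter()
    with tempfile.TemporaryDirectory() as spill_dir:
        partial_files = []
        for index, batch in enumerate(in_batches(messages, batch_size)):
            counts = Counter()
            for text in batch:
                counts.update(ngrams(text.split(), n))
            path = Path(spill_dir) / f"partial_{index}.json"
            path.write_text(json.dumps(counts))
            partial_files.append(path)
        # Merge step: only the aggregated counts are loaded, one file at a time.
        for path in partial_files:
            total.update(json.loads(path.read_text()))
    return total


bigram_counts = count_ngrams_in_batches(["a b a b", "a b c"], n=2, batch_size=1)
print(bigram_counts.most_common(1))
```

The merged result still has to fit in memory here. If it does not, the merge step itself would need to be done in a partitioned fashion (for example, by hashing n-grams into separate spill files), which is where the maintenance overhead noted above starts to show.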

This issue is meant to encourage experimentation and discussion around practical solutions to these challenges:

Immediate Focus:
- The current CLI runs on a single machine, so solutions should prioritize practicality for this setup.
- Slower but functional solutions might be acceptable in the short term, provided they enable the analyzers to handle larger datasets.
Long-Term Vision:
- The product may evolve to leverage distributed computing nodes in the future, but the timeline for this transition is uncertain.
Algorithmic Alternatives:
- Explore replacing exact tests (e.g., exact pairwise co-occurrence counts) with statistical approximations or heuristic algorithms (e.g., k-means clustering, time-series analysis).
- These approaches could mitigate the cardinality explosion while maintaining analytical utility.
Tool Considerations:
- Investigate optimized workflows with Dask or alternative frameworks for distributed computing.
- Assess the feasibility of refining out-of-core solutions for Polars or other libraries to reduce maintenance overhead.
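
As one possible example of a statistical approximation, the sketch below uses a Count-Min Sketch to estimate how often a key (for instance an n-gram or a user pair) occurs. This technique is not named in the issue; it is offered only as an illustration of trading exactness for bounded memory. The class name, the `width` and `depth` defaults, and the `"u1|u2"` key format are all assumptions.

```python
import hashlib


class CountMinSketch:
    """Approximate counter with fixed memory (width * depth integers).

    Estimates never undercount; they may overcount when keys collide.
    """

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        # One independent hash per row, derived by salting BLAKE2b with the row index.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                key.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        for row, col in self._cells(key):
            self.table[row][col] += count

    def estimate(self, key):
        # The minimum across rows is the least-contaminated counter.
        return min(self.table[row][col] for row, col in self._cells(key))


sketch = CountMinSketch()
# Hypothetical stream of co-occurring user pairs.
for pair in ["u1|u2", "u1|u2", "u3|u4", "u1|u2"]:
    sketch.add(pair)
print(sketch.estimate("u1|u2"))
```

Memory use is fixed by `width * depth` regardless of how many distinct keys are seen. The trade-off is that counts for rare keys can be inflated by collisions, so whether this preserves enough analytical utility would need to be checked against the exact tests on smaller datasets.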

We invite contributions and ideas for:

- Evaluating existing libraries and tools for better handling of cardinality explosion.
- Proposing alternative algorithms or workflows to reduce cardinality expansion.
- Experimenting with and benchmarking potential solutions for scalability and performance.
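
For anyone who wants to compare candidate solutions, a small harness along these lines may be a reasonable starting point. It is a sketch using only the standard library, and the `benchmark` helper name is hypothetical. Note that `tracemalloc` only tracks allocations made through Python's allocator, so memory allocated inside native extensions (such as Polars' Rust core) will not show up in the reported peak; an OS-level measurement would be needed for those.

```python
import time
import tracemalloc


def benchmark(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds, peak_python_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, elapsed, peak


def all_pairs(n):
    """Toy workload: materialize every unordered pair of range(n)."""
    return [(i, j) for i in range(n) for j in range(i)]


for size in (100, 200, 400):
    pairs, seconds, peak_bytes = benchmark(all_pairs, size)
    print(f"n={size} rows={len(pairs)} time={seconds:.4f}s peak={peak_bytes} bytes")
```

Running the same workload at several input sizes, as in the loop above, makes it easier to see whether time and memory grow linearly or supralinearly.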

Your insights and suggestions are valuable as we work to make the analyzer modules more robust and capable of handling larger datasets.

soul-codes added the "domain: datasci" label (Affects the analyzer's data science logic) on Dec 8, 2024