Explore Strategies for Handling Cardinality Explosion in Analyzer Modules #20

Open
soul-codes opened this issue Dec 2, 2024 · 0 comments
Labels
domain: datasci Affects the analyzer's data science logic

Comments

@soul-codes
Collaborator

Background

Some of our analyzer modules face significant challenges due to high cardinality expansion during intermediate processing stages:

  1. Ngram Test: Multiplies cardinality by a large factor (e.g., a 300x increase).
  2. Time Coordination Test: Incurs superlinear expansion (O(N²), due to pairwise joins).

These challenges are particularly problematic for larger datasets, as they can lead to memory exhaustion or performance degradation.
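
To make the two growth patterns concrete, here is a rough back-of-the-envelope sketch. It assumes (this is not stated in the issue) that the Ngram Test explodes each message into every n-gram of length 1 to 3, and that the Time Coordination Test pairs every row with every other row in the same group. The function names and the sample token counts are hypothetical.

```python
def ngram_row_count(token_counts, min_n=1, max_n=3):
    """Rows produced when each message is exploded into all n-grams
    of length min_n..max_n (assumed n-gram range, for illustration)."""
    total = 0
    for length in token_counts:
        for n in range(min_n, max_n + 1):
            # A message with `length` tokens has length - n + 1 n-grams.
            total += max(length - n + 1, 0)
    return total


def pairwise_row_count(group_sizes):
    """Rows produced by a self-join within each group,
    counting each unordered pair once."""
    return sum(size * (size - 1) // 2 for size in group_sizes)


# Three hypothetical messages with 120, 80 and 100 tokens.
token_counts = [120, 80, 100]
rows = ngram_row_count(token_counts)
print(rows, rows / len(token_counts))  # 891 rows from 3 inputs, about 297x

# Pairwise growth: 10x more rows in a group gives roughly 100x more pairs.
print(pairwise_row_count([1_000]))   # 499500
print(pairwise_row_count([10_000]))  # 49995000
```

The first case grows linearly with input size but with a large multiplier; the second grows quadratically, so it becomes the limiting factor much sooner as a group gets larger.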

Observations and Current State

  1. Polars:

    • All of our analyzers are currently using Polars.
    • While Polars offers excellent performance for smaller datasets, it struggles in high-cardinality scenarios and eventually runs out of memory.
    • This limitation reduces the practical usefulness of the analyzers for larger datasets.
  2. Dask:

    • Preliminary experiments with Dask show potential for out-of-core processing.
    • However, a 1-to-1 refactor of the Polars workflow to Dask results in suboptimal performance, especially for the Ngram Test, indicating the need for additional workflow optimization.
  3. Custom Out-of-Core Polars:

    • Experimental homemade solutions have shown promise for handling large datasets effectively.
    • However, they introduce maintenance overhead and a steeper learning curve compared to established libraries like Dask.
  4. Pandas Prototypes:

    • Existing prototypes implemented in Pandas are functional but single-threaded, lacking both Polars' multicore parallelism and Dask's distributed computing capabilities.
    • Processing with Pandas often takes hours, making it impractical for immediate utility.
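
The out-of-core idea behind options 2 and 3 can be sketched without committing to a library: process the input in batches, reduce each batch to counts right away, and merge the partial results, so the fully exploded intermediate table never exists in memory. The sketch below uses only the standard library. Whitespace tokenization, the batch size, and all function names are assumptions for illustration, not the analyzer's actual code. In practice the per-batch step would be a Polars or Dask aggregation rather than a `Counter`.

```python
from collections import Counter
from itertools import islice


def batched(iterable, batch_size):
    """Yield successive lists of at most batch_size items."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def ngrams(tokens, n):
    """Yield consecutive n-token tuples; yields nothing if len(tokens) < n."""
    return zip(*(tokens[i:] for i in range(n)))


def count_ngrams_in_batches(messages, n=2, batch_size=1000):
    """Count n-grams batch by batch, keeping only aggregated counts in memory."""
    totals = Counter()
    for batch in batched(messages, batch_size):
        partial = Counter()  # stands in for a per-batch dataframe aggregation
        for text in batch:
            tokens = text.lower().split()  # naive tokenizer (assumption)
            partial.update(ngrams(tokens, n))
        totals.update(partial)  # merge the partial result, then drop the batch
    return totals


counts = count_ngrams_in_batches(["a b c", "a b"], n=2, batch_size=1)
print(counts[("a", "b")], counts[("b", "c")])  # 2 1
```

Memory use is then bounded by the batch size plus the number of distinct n-grams, rather than by the total number of n-gram occurrences. If the distinct n-grams themselves do not fit in memory, the partial results would need to be spilled to disk and merged there.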

Goals and Discussion Points

The purpose of this issue is to encourage experimentation and discussion around potential solutions to address these challenges effectively:

  1. Immediate Focus:

    • The current CLI runs on a single machine, so solutions should prioritize practicality for this setup.
    • Slower but functional solutions might be acceptable in the short term, provided they enable the analyzers to handle larger datasets.
  2. Long-Term Vision:

    • The product may evolve to leverage distributed computing nodes in the future, but the timeline for this transition is uncertain.
  3. Algorithmic Alternatives:

    • Explore replacing exact tests (e.g., exact pairwise co-occurrence counts) with statistical approximations or heuristic algorithms (e.g., k-means clustering, time-series analysis).
    • These approaches could mitigate the cardinality explosion while maintaining analytical utility.
  4. Tool Considerations:

    • Investigate optimized workflows with Dask or alternative frameworks for distributed computing.
    • Assess the feasibility of refining out-of-core solutions for Polars or other libraries to reduce maintenance overhead.
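
As one possible example of an algorithmic alternative, the full pairwise join could be replaced by a bucketing heuristic: assign each event to a fixed time window and only form pairs among users that share a window. This is a sketch under assumptions, not the Time Coordination Test's actual definition, which this issue does not spell out. The `(user_id, timestamp_seconds)` event shape, the 60-second window, and the function name are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def co_occurring_pairs(events, window_seconds=60):
    """Count, for each pair of users, how many time windows they share.

    events: iterable of (user_id, timestamp_seconds) tuples.
    """
    buckets = defaultdict(set)
    for user, ts in events:
        # Fixed, non-overlapping windows: 0-59s -> 0, 60-119s -> 1, ...
        buckets[int(ts // window_seconds)].add(user)

    counts = Counter()
    for users in buckets.values():
        # Pairs are only formed inside a window, never across the whole dataset.
        for pair in combinations(sorted(users), 2):
            counts[pair] += 1
    return counts


events = [("a", 0), ("b", 10), ("c", 70), ("a", 75), ("b", 200)]
print(co_occurring_pairs(events))
# "a" and "b" share window 0; "a" and "c" share window 1.
```

The work is still quadratic inside each window, so the saving depends on windows staying small. The heuristic also misses pairs whose events fall on either side of a window boundary; overlapping or sliding windows would reduce that at the cost of more candidate pairs.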

Call to Action

We invite contributions and ideas for:

  • Evaluating existing libraries and tools for better handling of cardinality explosion.
  • Proposing alternative algorithms or workflows to reduce cardinality expansion.
  • Experimenting with and benchmarking potential solutions for scalability and performance.
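
For the benchmarking item, a small harness along the following lines could give comparable numbers across candidate solutions. It is a minimal sketch using only the standard library, and the `benchmark` helper is a hypothetical name. One caveat: `tracemalloc` only sees allocations made through Python's allocators, so memory allocated natively by libraries such as Polars will not show up, and process-level metrics (for example peak RSS) would be needed for those.

```python
import time
import tracemalloc


def benchmark(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds, peak_python_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


def squares(n):
    """Placeholder workload; a real run would call an analyzer step instead."""
    return [i * i for i in range(n)]


result, elapsed, peak = benchmark(squares, 100_000)
print(f"{elapsed:.4f}s, peak {peak / 1e6:.1f} MB of Python-tracked memory")
```

Running each candidate on the same inputs at several dataset sizes would show whether time and memory grow linearly or superlinearly, which matters more here than any single measurement.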

Your insights and suggestions are valuable as we work to make the analyzer modules more robust and capable of handling larger datasets.

@soul-codes soul-codes added the domain: datasci Affects the analyzer's data science logic label Dec 8, 2024