
[core] Enhance Analyzers with Standardized, Parameterized Declarations and Support for Multiple Runs #34

Open · soul-codes opened this issue Dec 6, 2024 · 1 comment
Labels: domain: core (Affects the app's core architecture) · domain: workspace (Affects workspace object management) · enhancement (New feature or request)

@soul-codes (Collaborator)

Background

Currently, our analyzer modules have hard-coded parameters, limiting flexibility and reusability. For example:

  • The ngram analyzer fixes the gram lengths to [3, 5].
  • The time frequency analyzer and hashtags analyzer fix the analysis window to predefined values.

This approach makes it difficult to adapt the analyzers to different use cases without modifying code. Furthermore, running multiple analyses with different parameter sets overwrites previous results due to workspace limitations.

Desired Outcome

  1. Parameterization of Analyzers

    • Enhance analyzer declarations to accept standardized parameters.
    • Parameters should support various data types, formalized in the declaration (see the sketch after this list).
    • Ensure the CLI interprets these parameters generically and renders appropriate prompts for user input.
  2. Workspace Enhancement

    • Modify the workspace to persist results from multiple runs of the same analyzer with different parameter sets.
    • This allows end users to compare or reuse results without overwriting.
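
To make point 1 concrete, here is a minimal sketch of what a standardized, typed declaration could look like. All names here (ParamSpec, AnalyzerDeclaration, the gram_lengths parameter) are hypothetical, not existing code:

# Hypothetical sketch only; none of these names exist in the codebase yet.
from dataclasses import dataclass, field
from typing import Any

@dataclass
class ParamSpec:
    """One formally declared analyzer parameter: name, type, default."""
    name: str
    type: type
    default: Any
    description: str = ""

@dataclass
class AnalyzerDeclaration:
    """Standardized declaration that the CLI (or a script) can introspect."""
    id: str
    params: list[ParamSpec] = field(default_factory=list)

# The ngram analyzer's hard-coded [3, 5] becomes an overridable default:
ngram_analyzer = AnalyzerDeclaration(
    id="ngram",
    params=[ParamSpec("gram_lengths", list, [3, 5], "gram lengths to extract")],
)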

Impact

This enhancement will:

  • Improve analyzer flexibility and reusability.
  • Allow users to experiment with and compare different parameter configurations without workflow disruption.
  • Set a foundation for more complex analysis workflows in the future.

Acceptance Criteria

  • All analyzers (wherever appropriate) support parameterized declarations with formalized data types.
  • CLI interprets parameter declarations and generates prompts dynamically (one possible shape is sketched after this list).
  • Workspace design supports storing and organizing multiple analysis runs for the same analyzer.
  • Documentation is updated to reflect the new analyzer parameterization system and workspace functionality.
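
As an illustration of the second criterion, a generic prompt loop could be driven entirely by the declaration. This builds on the hypothetical ParamSpec/AnalyzerDeclaration sketch above and deliberately ignores which prompt library the CLI actually uses:

# Hypothetical sketch: render prompts generically from the declared parameters.
def prompt_for_params(declaration: AnalyzerDeclaration) -> dict:
    """Ask the user for each declared parameter, falling back to its default."""
    values = {}
    for spec in declaration.params:
        raw = input(f"{spec.name} ({spec.type.__name__}, default {spec.default!r}): ")
        if not raw.strip():
            values[spec.name] = spec.default
        elif spec.type is list:
            # Naive parsing for the sketch; a real CLI would validate properly.
            values[spec.name] = [item.strip() for item in raw.split(",")]
        else:
            values[spec.name] = spec.type(raw)
    return values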

Open Questions

  • How should parameter sets be defined and passed to the CLI for consistency?
  • What metadata should be attached to persisted runs for effective retrieval and comparison?
  • Should there be a naming convention or unique identifiers for persisted analysis runs? (One possibility is sketched below.)
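
As one possible answer to the last two questions, purely a sketch: derive a deterministic run identifier from the analyzer id plus a hash of its parameter set, and persist the parameters themselves as run metadata so runs can be retrieved and compared later:

# Hypothetical sketch: stable run IDs from analyzer id + parameter hash.
import hashlib
import json

def run_id(analyzer_id: str, params: dict) -> str:
    """Identical analyzer + params always map to the same workspace slot."""
    digest = hashlib.sha256(
        json.dumps(params, sort_keys=True).encode()
    ).hexdigest()[:8]
    return f"{analyzer_id}-{digest}"

# e.g. run_id("ngram", {"gram_lengths": [3, 5]}) -> "ngram-" + 8 hex chars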
@KristijanArmeni (Collaborator)

This is great @soul-codes ! I'm thinking that a somewhat related (maybe more general) point is to make analyzers standalone Python modules/functions that would interface with the CLI, rather than being an integral part of it. I guess that goes hand in hand with the ability to parameterize an analyzer (?).

Ideally that would look something like this in a Python script (i.e. outside the CLI):

from mangotango.analyzers import HashtagAnalyzer, NgramAnalyzer  # could also be a function etc.

config = {"timewindow": "2hr"}
analyzer = HashtagAnalyzer(config)
output = analyzer(dataset="some/twitter/dataset.csv")  # or something similar

config = {"n": 3}
analyzer = NgramAnalyzer(config)
output = analyzer(dataset="some/twitter/dataset.csv")  # or something similar

Not sure how much overhaul that would be within the current CLI design. Currently, to write unit tests, I had to do workarounds like so:

from analyzers.hashtags.main import gini, main

# test_df is a small test DataFrame constructed elsewhere in the test module.

class DummyOutput:
    """Dummy output object that contains the parquet path attribute."""

    def __init__(self):
        self.parquet_path = "test_hashtags.parquet"


class AnalyzerContextDummy:
    """Dummy object that allows us to access test data and output a dummy output object."""

    def get_test_df(self):
        return test_df

    def output(self, output_id: str):
        return DummyOutput()


context = AnalyzerContextDummy()
main(context)
