
[core] Enhance Analyzers with Standardized, Parameterized Declarations and Support for Multiple Runs #34

Open · soul-codes opened this issue Dec 6, 2024 · 1 comment
Labels: domain: core (Affects the app's core architecture) · domain: workspace (Affects workspace object management) · enhancement (New feature or request)

@soul-codes (Collaborator)

Background

Currently, our analyzer modules have hard-coded parameters, limiting flexibility and reusability. For example:

  • The ngram analyzer fixes the gram lengths to [3, 5].
  • The time frequency analyzer and hashtags analyzer fix the analysis window to predefined values.

This approach makes it difficult to adapt the analyzers to different use cases without modifying code. Furthermore, running multiple analyses with different parameter sets overwrites previous results due to workspace limitations.

Desired Outcome

  1. Parameterization of Analyzers

    • Enhance analyzer declarations to accept standardized parameters.
    • Parameters should support various data types, formalized in the declaration (see the sketch after this list).
    • Ensure the CLI interprets these parameters generically and renders appropriate prompts for user input.
  2. Workspace Enhancement

    • Modify the workspace to persist results from multiple runs of the same analyzer with different parameter sets.
    • This allows end users to compare or reuse results without overwriting.
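
To make point 1 concrete, here is a minimal sketch of what a standardized, typed declaration could look like. All names here (ParamSpec, AnalyzerDeclaration, the gram_lengths parameter) are hypothetical, not existing code:

# Hypothetical sketch only; none of these names exist in the codebase yet.
from dataclasses import dataclass, field
from typing import Any

@dataclass
class ParamSpec:
    """One formally declared analyzer parameter: name, type, default."""
    name: str
    type: type
    default: Any
    description: str = ""

@dataclass
class AnalyzerDeclaration:
    """Standardized declaration that the CLI (or a script) can introspect."""
    id: str
    params: list[ParamSpec] = field(default_factory=list)

# The ngram analyzer's hard-coded [3, 5] becomes an overridable default:
ngram_analyzer = AnalyzerDeclaration(
    id="ngram",
    params=[ParamSpec("gram_lengths", list, [3, 5], "gram lengths to extract")],
)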

Impact

This enhancement will:

  • Improve analyzer flexibility and reusability.
  • Allow users to experiment with and compare different parameter configurations without workflow disruption.
  • Set a foundation for more complex analysis workflows in the future.

Acceptance Criteria

  • All analyzers (wherever appropriate) support parameterized declarations with formalized data types.
  • CLI interprets parameter declarations and generates prompts dynamically (one possible shape is sketched after this list).
  • Workspace design supports storing and organizing multiple analysis runs for the same analyzer.
  • Documentation is updated to reflect the new analyzer parameterization system and workspace functionality.
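
As an illustration of the second criterion, a generic prompt loop could be driven entirely by the declaration. This builds on the hypothetical ParamSpec/AnalyzerDeclaration sketch above and deliberately ignores which prompt library the CLI actually uses:

# Hypothetical sketch: render prompts generically from the declared parameters.
def prompt_for_params(declaration: AnalyzerDeclaration) -> dict:
    """Ask the user for each declared parameter, falling back to its default."""
    values = {}
    for spec in declaration.params:
        raw = input(f"{spec.name} ({spec.type.__name__}, default {spec.default!r}): ")
        if not raw.strip():
            values[spec.name] = spec.default
        elif spec.type is list:
            # Naive parsing for the sketch; a real CLI would validate properly.
            values[spec.name] = [item.strip() for item in raw.split(",")]
        else:
            values[spec.name] = spec.type(raw)
    return values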

Open Questions

  • How should parameter sets be defined and passed to the CLI for consistency?
  • What metadata should be attached to persisted runs for effective retrieval and comparison?
  • Should there be a naming convention or unique identifiers for persisted analysis runs? (One possibility is sketched below.)
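
As one possible answer to the last two questions, purely a sketch: derive a deterministic run identifier from the analyzer id plus a hash of its parameter set, and persist the parameters themselves as run metadata so runs can be retrieved and compared later:

# Hypothetical sketch: stable run IDs from analyzer id + parameter hash.
import hashlib
import json

def run_id(analyzer_id: str, params: dict) -> str:
    """Identical analyzer + params always map to the same workspace slot."""
    digest = hashlib.sha256(
        json.dumps(params, sort_keys=True).encode()
    ).hexdigest()[:8]
    return f"{analyzer_id}-{digest}"

# e.g. run_id("ngram", {"gram_lengths": [3, 5]}) -> "ngram-" + 8 hex chars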
@KristijanArmeni (Collaborator)

This is great @soul-codes ! I'm thinking that a somewhat related (maybe more general) point is to make analyzers standalone Python modules/functions that would interface with the CLI, rather than being an integral part of it. I guess that goes hand in hand with the ability to parameterize an analyzer (?).

Ideally that would look something like this in a Python script (i.e. outside the CLI):

from mangotango.analyzers import HashtagAnalyzer, NgramAnalyzer  # could also be a function etc.

config = {"timewindow": "2hr"}
analyzer = HashtagAnalyzer(config)
output = analyzer(dataset="some/twitter/dataset.csv")  # or something similar

config = {"n": 3}
analyzer = NgramAnalyzer(config)
output = analyzer(dataset="some/twitter/dataset.csv")  # or something similar

Not sure how much overhaul that would be within the current CLI design. Currently, to write unit tests, I had to do workarounds like so:

from analyzers.hashtags.main import gini, main

# test_df is a small test DataFrame constructed elsewhere in the test module.

class DummyOutput:
    """Dummy output object that contains the parquet path attribute."""

    def __init__(self):
        self.parquet_path = "test_hashtags.parquet"


class AnalyzerContextDummy:
    """Dummy object that allows us to access test data and output a dummy output object."""

    def get_test_df(self):
        return test_df

    def output(self, output_id: str):
        return DummyOutput()


context = AnalyzerContextDummy()
main(context)
