
[ngram] improve tokenization #33

Open
soul-codes opened this issue Dec 6, 2024 · 1 comment
Assignees
andi-halim

Labels
domain: datasci (Affects the analyzer's data science logic)
enhancement (New feature or request)
priority: 1 (Highest priority assignment)

Comments

@soul-codes
Collaborator

Background

The current ngram analysis uses a naive tokenizer that splits text into tokens on non-word characters. While simple, this approach introduces several problems:
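
For illustration, a splitter of roughly this shape (the analyzer's actual regex may differ) shreds a URL into fragments:

```python
import re

def naive_tokenize(text: str) -> list[str]:
    # Split on runs of non-word characters, as described above.
    return [t for t in re.split(r"\W+", text) if t]

print(naive_tokenize("see https://example.com/docs?q=ngram"))
# ['see', 'https', 'example', 'com', 'docs', 'q', 'ngram']
```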

Problems

  • URLs: The tokenizer breaks URLs into unusable fragments, reducing the quality of analysis.

  • Non-Space-Separated Languages: Languages like Chinese, Japanese, and Thai, which do not rely on spaces to delimit words, are poorly supported.

Desired Outcome

Explore and implement improved tokenization options to address these issues. Potential solutions could include:

  • Using advanced NLP libraries (e.g., spaCy, NLTK, Hugging Face) with language-specific models (see the sketch after this list).

  • Employing character-based tokenization or pre-trained language models for non-space-separated languages.

  • Implementing a URL-specific tokenizer or rule-based exceptions for handling URLs.
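
As a quick check of the library route, spaCy's rule-based tokenizer keeps URLs as single tokens by default, and a blank pipeline is enough for this, so no trained model needs to be downloaded. A minimal sketch, assuming spaCy is an acceptable dependency:

```python
import spacy

# A blank pipeline uses only spaCy's rule-based tokenizer,
# so no trained model has to be downloaded for this check.
nlp = spacy.blank("en")

doc = nlp("see https://example.com/docs?q=ngram for details")
print([token.text for token in doc])
# ['see', 'https://example.com/docs?q=ngram', 'for', 'details']
```

spaCy also ships segmenters for non-space-separated languages, though those pull in extra dependencies (e.g., SudachiPy for Japanese), which would need to be weighed during evaluation.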

Tasks

  1. Research and evaluate different tokenization approaches.

  2. Identify libraries or frameworks that can handle multilingual data and URL parsing effectively.

  3. Prototype and benchmark tokenization improvements against the current implementation (a minimal harness is sketched after this list).


  4. Implement the selected solution and update the ngram analysis pipeline.
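
For task 3, something like the following hypothetical harness (the tokenizer names and samples are placeholders, not existing code) would make speed and output comparisons concrete:

```python
import time
from collections.abc import Callable

def benchmark(
    tokenizers: dict[str, Callable[[str], list[str]]],
    samples: list[str],
    rounds: int = 1000,
) -> None:
    """Print rough throughput and sample output for each candidate tokenizer."""
    for name, tokenize in tokenizers.items():
        start = time.perf_counter()
        for _ in range(rounds):
            for text in samples:
                tokenize(text)
        elapsed = time.perf_counter() - start
        print(f"{name}: {elapsed:.3f}s for {rounds * len(samples)} calls")
        # Eyeball correctness on the trickiest sample as well as speed.
        print(f"  {name}({samples[0]!r}) -> {tokenize(samples[0])}")
```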

@soul-codes added the enhancement (New feature or request) and domain: datasci (Affects the analyzer's data science logic) labels Dec 6, 2024
@sandytribal
Collaborator

Apostrophes are also treated as spaces during tokenization, so contractions such as "don't" are split into "don" and "t".

[Screenshot attached: 2025-01-09, 2:27 PM]
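
Illustrating the screenshot's behavior with the same `\W+`-style split:

```python
import re

# Apostrophes are non-word characters, so contractions break apart.
print([t for t in re.split(r"\W+", "don't stop") if t])
# ['don', 't', 'stop']
```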

@andi-halim self-assigned this Jan 9, 2025
@andi-halim added the priority: 1 (Highest priority assignment) label Jan 9, 2025