[ngram] improve tokenization #33
Labels
domain: datasci (affects the analyzer's data science logic)
enhancement (new feature or request)
priority: 1 (highest priority assignment)
Background
The current ngram analysis uses a naive tokenizer that splits text into tokens on non-word characters. While simple, this approach introduces several problems, illustrated by the sketch after the list below:
Problems
URLs: The tokenizer breaks URLs into unusable fragments, reducing the quality of analysis.
Non-Space-Separated Languages: Languages like Chinese, Japanese, and Thai, which do not rely on spaces to delimit words, are poorly supported.
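For illustration, here is a minimal sketch of both failure modes, assuming the current tokenizer behaves roughly like a split on runs of non-word characters (the analyzer's exact pattern may differ):

```python
import re

# Assumed stand-in for the current naive tokenizer: split on runs of
# non-word characters. This is an approximation for illustration only.
def naive_tokenize(text: str) -> list[str]:
    return [t for t in re.split(r"\W+", text) if t]

print(naive_tokenize("See https://example.com/docs?lang=ja for details"))
# ['See', 'https', 'example', 'com', 'docs', 'lang', 'ja', 'for', 'details']
# The URL is shredded into fragments that add noise to the ngram counts.

print(naive_tokenize("東京は日本の首都です"))
# ['東京は日本の首都です']
# Japanese text contains no spaces or punctuation here, so the whole
# sentence comes back as a single oversized "token".
```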
Desired Outcome
Explore and implement improved tokenization options to address these issues. Potential solutions could include:
Using advanced NLP libraries (e.g., spaCy, NLTK, Hugging Face) with language-specific models.
Employing character-based tokenization or pre-trained language models for non-space-separated languages.
Implementing a URL-specific tokenizer or rule-based exceptions for handling URLs (a rough sketch of this approach follows the list).
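The sketch below shows one possible shape for the URL rule and the character-based fallback. The function names and regexes are assumptions for illustration, not the analyzer's actual API:

```python
import re

# Hypothetical rule-based exception: extract URLs as whole tokens first,
# then apply the existing non-word split to the remaining text.
URL_RE = re.compile(r"https?://\S+")
SPLIT_RE = re.compile(r"\W+")

def tokenize_with_url_rule(text: str) -> list[str]:
    tokens: list[str] = []
    last = 0
    for match in URL_RE.finditer(text):
        # Text before the URL still gets the existing non-word split.
        tokens.extend(t for t in SPLIT_RE.split(text[last:match.start()]) if t)
        # The URL itself is preserved as a single token.
        tokens.append(match.group())
        last = match.end()
    tokens.extend(t for t in SPLIT_RE.split(text[last:]) if t)
    return tokens

def char_ngrams(text: str, n: int = 2) -> list[str]:
    # Character bigrams as a simple fallback for non-space-separated languages.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(tokenize_with_url_rule("See https://example.com/docs?lang=ja for details"))
# ['See', 'https://example.com/docs?lang=ja', 'for', 'details']
print(char_ngrams("東京は日本の首都です"))
# ['東京', '京は', 'は日', '日本', '本の', 'の首', '首都', '都で', 'です']
```

A library-backed approach (e.g., a spaCy or Hugging Face tokenizer with language-specific models) would replace these hand-rolled rules; the sketch only shows the minimal rule-based variant.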
Tasks
Research and evaluate different tokenization approaches.
Identify libraries or frameworks that can handle multilingual data and URL parsing effectively.
Prototype and benchmark tokenization improvements against the current implementation (a simple benchmarking sketch follows this list).
Implement the selected solution and update the ngram analysis pipeline.
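For the benchmarking task, a rough harness could look like the following, reusing the naive_tokenize and tokenize_with_url_rule sketches above. The corpus and metrics here are placeholders, not project data:

```python
import time

# Placeholder corpus; a real benchmark would use representative analyzer input.
corpus = [
    "See https://example.com/docs?lang=ja for details",
    "東京は日本の首都です",
] * 5000

def benchmark(tokenizer):
    # Measure wall-clock time plus simple token statistics for comparison.
    start = time.perf_counter()
    tokens = [tok for text in corpus for tok in tokenizer(text)]
    elapsed = time.perf_counter() - start
    return elapsed, len(tokens), len(set(tokens))

for name, fn in [("naive", naive_tokenize), ("url-aware", tokenize_with_url_rule)]:
    seconds, total, unique = benchmark(fn)
    print(f"{name}: {seconds:.3f}s, {total} tokens, {unique} unique")
```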