[ngram] improve tokenization #33
Labels
domain: datasci (affects the analyzer's data science logic)
enhancement (new feature or request)
priority: 1 (highest priority assignment)
Background
The current ngram analysis uses a naive tokenizer that splits text into tokens on non-word characters. While simple, this approach introduces several problems, illustrated by the sketch after the list below:
Problems
URLs: The tokenizer breaks URLs into unusable fragments, reducing the quality of analysis.
Non-Space-Separated Languages: Languages like Chinese, Japanese, and Thai, which do not rely on spaces to delimit words, are poorly supported.
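For illustration, here is a minimal sketch of both failure modes, assuming the current tokenizer behaves roughly like a split on runs of non-word characters (the analyzer's exact pattern may differ):

```python
import re

# Assumed stand-in for the current naive tokenizer: split on runs of
# non-word characters. This is an approximation for illustration only.
def naive_tokenize(text: str) -> list[str]:
    return [t for t in re.split(r"\W+", text) if t]

print(naive_tokenize("See https://example.com/docs?lang=ja for details"))
# ['See', 'https', 'example', 'com', 'docs', 'lang', 'ja', 'for', 'details']
# The URL is shredded into fragments that add noise to the ngram counts.

print(naive_tokenize("東京は日本の首都です"))
# ['東京は日本の首都です']
# Japanese text contains no spaces or punctuation here, so the whole
# sentence comes back as a single oversized "token".
```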
Desired Outcome
Explore and implement improved tokenization options to address these issues. Potential solutions could include:
Using advanced NLP libraries (e.g., spaCy, NLTK, Hugging Face) with language-specific models.
Employing character-based tokenization or pre-trained language models for non-space-separated languages.
Implementing a URL-specific tokenizer or rule-based exceptions for handling URLs (a rough sketch of this approach follows the list).
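The sketch below shows one possible shape for the URL rule and the character-based fallback. The function names and regexes are assumptions for illustration, not the analyzer's actual API:

```python
import re

# Hypothetical rule-based exception: extract URLs as whole tokens first,
# then apply the existing non-word split to the remaining text.
URL_RE = re.compile(r"https?://\S+")
SPLIT_RE = re.compile(r"\W+")

def tokenize_with_url_rule(text: str) -> list[str]:
    tokens: list[str] = []
    last = 0
    for match in URL_RE.finditer(text):
        # Text before the URL still gets the existing non-word split.
        tokens.extend(t for t in SPLIT_RE.split(text[last:match.start()]) if t)
        # The URL itself is preserved as a single token.
        tokens.append(match.group())
        last = match.end()
    tokens.extend(t for t in SPLIT_RE.split(text[last:]) if t)
    return tokens

def char_ngrams(text: str, n: int = 2) -> list[str]:
    # Character bigrams as a simple fallback for non-space-separated languages.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(tokenize_with_url_rule("See https://example.com/docs?lang=ja for details"))
# ['See', 'https://example.com/docs?lang=ja', 'for', 'details']
print(char_ngrams("東京は日本の首都です"))
# ['東京', '京は', 'は日', '日本', '本の', 'の首', '首都', '都で', 'です']
```

A library-backed approach (e.g., a spaCy or Hugging Face tokenizer with language-specific models) would replace these hand-rolled rules; the sketch only shows the minimal rule-based variant.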
Tasks
Research and evaluate different tokenization approaches.
Identify libraries or frameworks that can handle multilingual data and URL parsing effectively.
Prototype and benchmark tokenization improvements against the current implementation (a simple benchmarking sketch follows this list).
Implement the selected solution and update the ngram analysis pipeline.
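For the benchmarking task, a rough harness could look like the following, reusing the naive_tokenize and tokenize_with_url_rule sketches above. The corpus and metrics here are placeholders, not project data:

```python
import time

# Placeholder corpus; a real benchmark would use representative analyzer input.
corpus = [
    "See https://example.com/docs?lang=ja for details",
    "東京は日本の首都です",
] * 5000

def benchmark(tokenizer):
    # Measure wall-clock time plus simple token statistics for comparison.
    start = time.perf_counter()
    tokens = [tok for text in corpus for tok in tokenizer(text)]
    elapsed = time.perf_counter() - start
    return elapsed, len(tokens), len(set(tokens))

for name, fn in [("naive", naive_tokenize), ("url-aware", tokenize_with_url_rule)]:
    seconds, total, unique = benchmark(fn)
    print(f"{name}: {seconds:.3f}s, {total} tokens, {unique} unique")
```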