Skip to content
This repository was archived by the owner on Mar 9, 2023. It is now read-only.
This repository was archived by the owner on Mar 9, 2023. It is now read-only.

Tokenizing Ellipsis creates empty tokens #120

@polm

Description

@polm

While working on the spaCy Japanese model support and integrating Sudachi, ran into the issue that the one-character ellipsis () was causing errors. If you tokenize this ellipsis you get three tokens from SudachiPy, with surfaces like ['', '', '…'].

I assume this is a bug but wasn't able to track down where it's happening. I also checked ㍻, and while that is also normalized internally it seems to be output as a single character without issue.

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions