Support for script-continuous languages #112

@stevenvachon

Description

For at least the following languages:

  • Chinese (Simplified & Traditional)
  • Japanese
  • Khmer
  • Lao
  • Myanmar
  • Thai
  • Vietnamese

From MDN's documentation for Intl.Segmenter:

const text = '吾輩は猫である。名前はたぬき。';
const japaneseSegmenter = new Intl.Segmenter('ja-JP', { granularity: 'word' });
console.log([...japaneseSegmenter.segment(text)].map((s) => s.segment));
//-> ['吾輩', 'は', '猫', 'で', 'ある', '。', '名前', 'は', 'たぬき', '。']

Compare that with the current splitting, which uses String.prototype.split:

const text = '吾輩は猫である。名前はたぬき。';
console.log(text.split(/(\s+)/).filter(t => !!t));
//-> ['吾輩は猫である。名前はたぬき。']
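
One way the two approaches could be combined is sketched below. This is only an illustration: `tokenize` and `splitOnWhitespace` are hypothetical names, not functions from this project, and the sketch assumes a runtime where `Intl.Segmenter` may or may not be available. It prefers `Intl.Segmenter` with word granularity and falls back to whitespace splitting otherwise.

```javascript
// Hypothetical helpers for illustration; not part of this project's API.

// Whitespace-based splitting, as in the snippet above.
// The capturing group keeps the whitespace runs as tokens.
function splitOnWhitespace(text) {
  return text.split(/(\s+)/).filter((t) => t !== '');
}

// Prefer Intl.Segmenter when the runtime provides it,
// otherwise fall back to whitespace splitting.
function tokenize(text, locale) {
  if (typeof Intl === 'undefined' || typeof Intl.Segmenter !== 'function') {
    return splitOnWhitespace(text);
  }
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  // Each segment object carries the matched substring in `segment`.
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

const sample = '吾輩は猫である。名前はたぬき。';
console.log(tokenize(sample, 'ja-JP'));
console.log(splitOnWhitespace(sample)); // a single token: the text has no whitespace
```

Both paths are lossless, so joining the tokens reproduces the input. The exact word boundaries returned by `Intl.Segmenter` depend on the runtime's locale data.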
