fix(annotations): use Unicode script properties for CJK detection

Address review feedback on #471 from @coderabbitai. The BMP-only code point ranges missed two classes of characters:

- Non-BMP Han extensions (CJK Unified Ideographs Extensions B through F) such as 𠀀. A long string of Extension B characters would still be tokenized as a single unbreakable unit and overflow the box.
- Halfwidth Katakana (U+FF65-U+FF9F) such as ｶ. Same failure mode.

Switch to Unicode script property escapes (\p{Script=Han}, \p{Script=Hiragana}, \p{Script=Katakana}, \p{Script=Hangul}), which cover these cases without enumerating ranges.

tsconfig target is ES2020; property escapes require ES2018+, so this is safe.

Verified coverage: 漢 あ ア 가 𠀀 ｶ all match; A and digits do not.
@@ -10,12 +10,12 @@ import {
 let blurScratchCanvas: HTMLCanvasElement | null = null;
 let blurScratchCtx: CanvasRenderingContext2D | null = null;
 
-// Matches a single code point in Hiragana, Katakana, CJK Unified Ideographs
-// Extension A, CJK Unified Ideographs, Hangul Syllables, or CJK Compatibility
-// Ideographs. Used to split CJK text at character boundaries during wrap,
-// since CJK scripts have no word-separating whitespace.
-const CJK_CHAR =
-  /[\u3040-\u309f\u30a0-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af\uf900-\ufaff]/u;
+// Matches a single code point whose script is Han (including non-BMP
+// Extension A-F), Hiragana, Katakana (including halfwidth forms), or
+// Hangul. Used to split CJK text at character boundaries during wrap,
+// since CJK scripts have no word-separating whitespace. Unicode script
+// property escapes require ES2018+; tsconfig target is ES2020.
+const CJK_CHAR = /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Hangul}]/u;
 
 function tokenizeForWrap(line: string): string[] {
   // Split Latin text on whitespace (preserving the whitespace as its own token,
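
The coverage claim in the commit message can be checked with a small standalone sketch. The two patterns below are copied from the removed and added lines of the diff; the names `CJK_CHAR_BMP`, `splitAtCjkBoundaries`, and `coverage` are hypothetical and exist only for this illustration. The splitter is not the body of `tokenizeForWrap` (which this diff does not show); it is an assumed, simplified stand-in that emits each CJK code point as its own token and leaves other text in runs. It needs a compile target of ES2018 or later.

```typescript
// Old BMP-only character class, copied from the removed lines of the diff.
const CJK_CHAR_BMP =
  /[\u3040-\u309f\u30a0-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af\uf900-\ufaff]/u;

// New script-property class, copied from the added line of the diff.
const CJK_CHAR =
  /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Hangul}]/u;

// Hypothetical helper (an assumption, not the commit's tokenizeForWrap):
// each CJK code point becomes its own token, everything else stays in runs.
function splitAtCjkBoundaries(text: string): string[] {
  const tokens: string[] = [];
  let run = "";
  // for...of walks a string by code point, so a non-BMP character such as
  // U+20000 arrives as one two-unit string rather than two lone surrogates.
  for (const ch of text) {
    if (CJK_CHAR.test(ch)) {
      if (run !== "") {
        tokens.push(run);
        run = "";
      }
      tokens.push(ch);
    } else {
      run += ch;
    }
  }
  if (run !== "") {
    tokens.push(run);
  }
  return tokens;
}

// Sample characters from the commit message, written as escapes where the
// glyph is easy to confuse: U+20000 (Extension B) and U+FF76 (halfwidth KA).
const samples = ["漢", "あ", "ア", "가", "\u{20000}", "\uFF76", "A", "7"];

const coverage = samples.map((ch) => ({
  codePoint: "U+" + ch.codePointAt(0)!.toString(16).toUpperCase(),
  bmpOnly: CJK_CHAR_BMP.test(ch),
  scriptProperty: CJK_CHAR.test(ch),
}));

console.table(coverage);
console.log(splitAtCjkBoundaries("ab 漢字cd"));
```

In this sketch the only rows where the two columns differ are U+20000 and U+FF76, which are the two classes the commit message names. Neither pattern uses the `g` flag, so repeated `test` calls carry no `lastIndex` state between characters.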