Great analysis, which I am still digesting, but "a visual property abstracted away by the tokenizer" isn't quite right -- at least not if the tokens are composed of UTF-8 encoded characters. Or do you mean base characters, which take combining characters? Also keep in mind that all Unicode text is "written" from left to right in logical order, even if displayed right to left, with the line breaks placed at the right margin. See: https://www.unicode.org/versions/Unicode17.0.0/core-spec/
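To make the base-versus-combining distinction concrete, here is a small Python example (standard library only; the strings are illustrative) showing that even "how many characters" already has several answers before display width enters the picture:

```python
import unicodedata

# "é" can be encoded two ways; both render identically.
nfc = "\u00e9"   # precomposed: LATIN SMALL LETTER E WITH ACUTE
nfd = "e\u0301"  # base "e" + COMBINING ACUTE ACCENT

print(len(nfc))                  # 1 code point
print(len(nfd))                  # 2 code points
print(len(nfc.encode("utf-8")))  # 2 bytes
print(len(nfd.encode("utf-8")))  # 3 bytes

# Same abstract text, different counts at every layer below graphemes.
print(unicodedata.normalize("NFC", nfd) == nfc)  # True
```

So a tokenizer operating on UTF-8 bytes sees two or three bytes for what a reader perceives as a single character.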
If a transformer knew the character rules for Unicode, could it be instructed to use those rules, including the language-specific ones?
Good question. The "character length" is indeed an oversimplification for general Unicode -- the paper mostly studies ASCII-dominated contexts (source code, fixed-width text) where the distinction doesn't matter. For full Unicode with combining characters and variable-width glyphs, the counting task becomes significantly more complex, and it's an interesting open question whether the model develops different manifold structures for those cases.
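To illustrate why the counting task gets harder outside ASCII, here is a rough sketch of a display-width counter. It approximates what a wcwidth()-style function does -- wide East Asian glyphs get two terminal cells, combining marks get zero -- and it deliberately ignores bidi, emoji sequences, and font metrics (the function name and the two-cell rule are my simplifications, not anything from the paper):

```python
import unicodedata

def display_columns(s: str) -> int:
    """Approximate terminal column count: wide/fullwidth glyphs take
    2 cells, combining marks take 0, everything else takes 1."""
    cols = 0
    for ch in s:
        if unicodedata.combining(ch):
            continue  # combining marks add no visual width
        cols += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return cols

print(len("日本語"), display_columns("日本語"))     # 3 code points, 6 columns
print(len("e\u0301"), display_columns("e\u0301"))  # 2 code points, 1 column
```

And even this is wrong for proportional fonts, which is part of why "line width" is not a pure token-level property.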
Line breaks are complicated but extensively covered by both literature and code (TeX is only one example). So if the Anthropic model is "learning," why didn't it choose some known method for performing line breaks?
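For reference, the simplest of those known methods is greedy first-fit line breaking; TeX's Knuth-Plass algorithm improves on it by minimizing total "badness" over the whole paragraph via dynamic programming, and UAX #14 specifies where breaks are even permitted. A minimal sketch of the greedy version, purely as a point of comparison -- not a claim about what the model internally does:

```python
def greedy_break(words: list[str], max_cols: int) -> list[str]:
    """First-fit line breaking: put each word on the current line
    if it fits, otherwise start a new line."""
    lines: list[str] = []
    current: list[str] = []
    used = 0
    for w in words:
        needed = len(w) + (1 if current else 0)  # +1 for the joining space
        if current and used + needed > max_cols:
            lines.append(" ".join(current))
            current, used = [], 0
            needed = len(w)
        current.append(w)
        used += needed
    if current:
        lines.append(" ".join(current))
    return lines

print(greedy_break("the quick brown fox jumps over the lazy dog".split(), 15))
# ['the quick brown', 'fox jumps over', 'the lazy dog']
```

The interesting contrast with the manifold structures mentioned above is that the model seems to track remaining width geometrically rather than by explicit bookkeeping like this.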