Discussion about this post

Patrick Durusau:

Great analysis, which I am still digesting, but "a visual property abstracted away by the tokenizer" isn't quite right, at least if the tokens are composed of UTF-8 encoded characters. Do you mean base characters, which take combining characters? Bear in mind that all Unicode text is "written" from left to right in logical order, even if displayed right to left, with line breaks placed at the right margin. See: https://www.unicode.org/versions/Unicode17.0.0/core-spec/

If a transformer knew the character rules for Unicode characters, could it be instructed to use those rules, including the language-specific ones?
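(Editor's note: a small illustration of the base-character/combining-character distinction raised above, using only Python's standard library. The specific strings are the editor's examples, not from the original comment.)

```python
import unicodedata

# The same visible glyph can be one code point or two:
precomposed = "\u00e9"   # "é" as a single precomposed code point (U+00E9)
decomposed = "e\u0301"   # base "e" followed by U+0301 COMBINING ACUTE ACCENT

print(len(precomposed))                  # 1 code point
print(len(decomposed))                   # 2 code points
print(len(precomposed.encode("utf-8")))  # 2 bytes
print(len(decomposed.encode("utf-8")))   # 3 bytes

# NFC normalization folds the base + combining pair into the precomposed form,
# so the two spellings compare equal after normalization.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

A byte- or code-point-level tokenizer may split these two spellings of the same glyph differently, which is one reason "character rules" and token boundaries can disagree.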

Patrick Durusau:

Line breaks are complicated but extensively covered by both literature and code (TeX is only one example). So if the Anthropic model is "learning," why didn't it choose some known method for performing line breaks?
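(Editor's note: one of the well-known methods the comment alludes to is greedy first-fit wrapping, available in Python's stdlib `textwrap`; TeX's Knuth-Plass optimal-fit breaking is the more sophisticated alternative. The sample text below is the editor's, for illustration only.)

```python
import textwrap

text = ("Line breaking is a classic, well-studied problem: greedy first-fit "
        "wrapping is the simplest known method, while TeX uses Knuth-Plass "
        "optimal-fit breaking to minimize raggedness across a paragraph.")

# Greedy first-fit: put as many whole words as fit on each line,
# never exceeding the target width.
for line in textwrap.wrap(text, width=40):
    print(line)
```

Greedy wrapping decides each break locally; Knuth-Plass instead minimizes a global badness score over the whole paragraph, which is why TeX's output looks more even.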

