The Claude 100K-token context mentioned above does not mean 100,000 English words; it works out to roughly 75,000 words. A token is the basic unit of text in natural language processing; it can be a word, a punctuation mark, or a phrase. A character is a single symbol in the text, such as a Chinese character, an English letter, a digit, a punctuation mark, or a space. (You may already know characters from the word-count feature in some note-taking apps.)
For example, "I want a pizza" is 4 tokens and 14 characters, while "我想要一个披萨" is 15 tokens and only 7 characters.
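To make the token-versus-character distinction concrete, here is a minimal sketch using OpenAI's tiktoken library. The choice of encoding is an assumption on my part: the counts quoted above match a GPT-3-style tokenizer, so r50k_base is used here; newer encodings such as cl100k_base usually need fewer tokens for Chinese.

```python
# pip install tiktoken
import tiktoken

# r50k_base approximates the GPT-3 tokenizer; exact counts vary by encoding.
enc = tiktoken.get_encoding("r50k_base")

for text in ["I want a pizza", "我想要一个披萨"]:
    tokens = enc.encode(text)
    print(f"{text!r}: {len(tokens)} tokens, {len(text)} characters")
```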
Why is GPT-3 15.77x more expensive for certain languages?
The relationship between tokens and characters depends on the specific tokenization scheme and tokenizer, i.e., the method used to group characters and words into common patterns. This means token consumption differs from language to language, by as much as 15x, which is why earlier articles have described GPT as effectively discriminating against non-native English speakers.
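The dependence on the tokenizer is easy to verify: the same sentence yields different token counts under different encodings. A small sketch, again assuming tiktoken and the encodings it ships with (r50k_base and p50k_base for GPT-3-era models, cl100k_base for GPT-3.5/4):

```python
import tiktoken

text = "我想要一个披萨"  # 7 characters

# The same text tokenizes differently under different encodings,
# which is why per-language cost multipliers vary by model.
for name in ["r50k_base", "p50k_base", "cl100k_base"]:
    enc = tiktoken.get_encoding(name)
    print(f"{name}: {len(enc.encode(text))} tokens")
```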
English is arguably the most powerful "programming language" right now. When using generative tools such as GPT, try to write your input in English so you can fit more content and get better results. There is even a website that optimizes your wording so the same output can be reached with fewer tokens; I will add it here once I find it again.
Reference
English is the new programming language.