Metadata-Version: 2.4
Name: tiktoken
Version: 0.12.0
Summary: tiktoken is a fast BPE tokeniser for use with OpenAI's models
Author: Shantanu Jain
Author-email: shantanu@openai.com
License: MIT License

Copyright (c) 2022 OpenAI, Shantanu Jain

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Project-URL: homepage, https://github.com/openai/tiktoken
Project-URL: repository, https://github.com/openai/tiktoken
Project-URL: changelog, https://github.com/openai/tiktoken/blob/main/CHANGELOG.md
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: regex>=2022.1.18
Requires-Dist: requests>=2.26.0
Provides-Extra: blobfile
Requires-Dist: blobfile>=2; extra == "blobfile"
Dynamic: license-file
# ⏳ tiktoken

tiktoken is a fast [BPE](https://en.wikipedia.org/wiki/Byte_pair_encoding) tokeniser for use with
OpenAI's models.

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")
assert enc.decode(enc.encode("hello world")) == "hello world"

# To get the tokeniser corresponding to a specific model in the OpenAI API:
enc = tiktoken.encoding_for_model("gpt-4o")
```
The open source version of `tiktoken` can be installed from [PyPI](https://pypi.org/project/tiktoken):

```
pip install tiktoken
```

The tokeniser API is documented in `tiktoken/core.py`.

Example code using `tiktoken` can be found in the
[OpenAI Cookbook](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb).
## Performance

`tiktoken` is 3-6x faster than a comparable open source tokeniser:

![image](https://raw.githubusercontent.com/openai/tiktoken/main/perf.svg)

Performance was measured on 1GB of text using the GPT-2 tokeniser, using `GPT2TokenizerFast` from
`tokenizers==0.13.2`, `transformers==4.24.0` and `tiktoken==0.2.0`.
## Getting help

Please post questions in the [issue tracker](https://github.com/openai/tiktoken/issues).

If you work at OpenAI, make sure to check the internal documentation or feel free to contact
@shantanu.
## What is BPE anyway?

Language models don't see text like you and me; instead, they see a sequence of numbers (known as tokens).
Byte pair encoding (BPE) is a way of converting text into tokens. It has a couple of desirable
properties:

1) It's reversible and lossless, so you can convert tokens back into the original text.
2) It works on arbitrary text, even text that is not in the tokeniser's training data.
3) It compresses the text: the token sequence is shorter than the bytes corresponding to the
   original text. On average, in practice, each token corresponds to about 4 bytes.
4) It attempts to let the model see common subwords. For instance, "ing" is a common subword in
   English, so BPE encodings will often split "encoding" into tokens like "encod" and "ing"
   (instead of e.g. "enc" and "oding"). Because the model then sees the "ing" token again and
   again in different contexts, this helps models generalise and better understand grammar.
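The merge procedure behind these properties can be sketched in a few lines of plain Python. This is a toy character-level illustration of BPE training, not tiktoken's implementation (the real tokeniser operates on bytes and uses a trained, ranked merge table):

```python
# Toy BPE trainer: repeatedly merge the most frequent adjacent pair of tokens.
# Illustration only — not tiktoken's actual algorithm or data structures.
from collections import Counter


def train_bpe(text: str, num_merges: int) -> list[tuple[str, str]]:
    tokens = list(text)  # start from individual characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        pair = pairs.most_common(1)[0][0]  # most frequent adjacent pair
        merges.append(pair)
        # Apply the merge everywhere in the token sequence, left to right
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges
```

On `"aaabdaaabac"`, for example, the first learned merge is the most frequent pair `("a", "a")`; encoding then works by replaying the learned merges in order on new text.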
`tiktoken` contains an educational submodule that is friendlier if you want to learn more about
the details of BPE, including code that helps visualise the BPE procedure:

```python
from tiktoken._educational import *

# Train a BPE tokeniser on a small amount of text
enc = train_simple_encoding()

# Visualise how the GPT-4 encoder encodes text
enc = SimpleBytePairEncoding.from_tiktoken("cl100k_base")
enc.encode("hello world aaaaaaaaaaaa")
```
## Extending tiktoken

You may wish to extend `tiktoken` to support new encodings. There are two ways to do this.

**Create your `Encoding` object exactly the way you want and simply pass it around.**

```python
cl100k_base = tiktoken.get_encoding("cl100k_base")

# In production, load the arguments directly instead of accessing private attributes
# See openai_public.py for examples of arguments for specific encodings
enc = tiktoken.Encoding(
    # If you're changing the set of special tokens, make sure to use a different name
    # It should be clear from the name what behaviour to expect.
    name="cl100k_im",
    pat_str=cl100k_base._pat_str,
    mergeable_ranks=cl100k_base._mergeable_ranks,
    special_tokens={
        **cl100k_base._special_tokens,
        "<|im_start|>": 100264,
        "<|im_end|>": 100265,
    },
)
```
**Use the `tiktoken_ext` plugin mechanism to register your `Encoding` objects with `tiktoken`.**

This is only useful if you need `tiktoken.get_encoding` to find your encoding; otherwise, prefer
option 1.

To do this, you'll need to create a namespace package under `tiktoken_ext`.

Lay out your project like this, making sure to omit the `tiktoken_ext/__init__.py` file:

```
my_tiktoken_extension
├── tiktoken_ext
│   └── my_encodings.py
└── setup.py
```
`my_encodings.py` should be a module that contains a variable named `ENCODING_CONSTRUCTORS`.
This is a dictionary from an encoding name to a function that takes no arguments and returns
arguments that can be passed to `tiktoken.Encoding` to construct that encoding. For an example, see
`tiktoken_ext/openai_public.py`. For precise details, see `tiktoken/registry.py`.
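A minimal `my_encodings.py` might look like the following sketch. The encoding name `my_custom_base`, the regex, and the trivial byte-level ranks are all placeholders; a real encoding would load trained BPE merge ranks from a file:

```python
# my_encodings.py — discovered by tiktoken's tiktoken_ext plugin mechanism.
# All names and parameters here are illustrative placeholders.


def my_custom_base() -> dict:
    # Trivial ranks mapping each single byte to itself; a real encoding
    # would load trained merge ranks instead.
    mergeable_ranks = {bytes([i]): i for i in range(256)}
    return {
        "name": "my_custom_base",
        "pat_str": r"\S+|\s+",  # placeholder: split on runs of (non-)whitespace
        "mergeable_ranks": mergeable_ranks,
        "special_tokens": {"<|endoftext|>": 256},
    }


ENCODING_CONSTRUCTORS = {"my_custom_base": my_custom_base}
```

With this module installed, `tiktoken.get_encoding("my_custom_base")` would call the constructor and build the `Encoding` from the returned arguments.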
Your `setup.py` should look something like this:

```python
from setuptools import setup, find_namespace_packages

setup(
    name="my_tiktoken_extension",
    packages=find_namespace_packages(include=["tiktoken_ext*"]),
    install_requires=["tiktoken"],
    ...
)
```
Then simply `pip install ./my_tiktoken_extension` and you should be able to use your
custom encodings! Make sure **not** to use an editable install.