The command line flags for this are:

* `--inference-server-api-path` sets which path serves the API requests. The
  default path is `/rllama/v1/inference`.
* `--inference-server-prompt-cache-size` sets how many previous prompt
  calculations should be cached. The default is 50. This speeds up token
  generation for prompts that were already requested before, but it also
  increases memory use as the cache fills up.
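
For example, a server configured with these flags might be started as in the
sketch below. Only the two `--inference-server-*` flags are documented in this
section; the `--inference-server` switch and the model loading arguments are
assumptions about the rest of the CLI.

```bash
# Hypothetical invocation: everything except the two --inference-server-*
# flags documented above is an assumption.
rllama \
  --inference-server \
  --inference-server-api-path /rllama/v1/inference \
  --inference-server-prompt-cache-size 50 \
  <model and tokenizer flags ...>
```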
Prompts and flags related to token sampling are all ignored in inference server
mode. Instead, they are obtained from each HTTP JSON API request.

The API expects a JSON body and `Accept: application/json` or `Accept: text/jsonl`.

The expected JSON is as follows:
```json
{
    "temperature": <number, optional>
    "top_k": <integer, optional, default 20>
    "stop_at_end_token": <bool, optional, default true>
    "max_new_tokens": <integer, optional>
    "no_token_sampling": <bool, optional, default false>
    "prompt": <string, required>
}
```

If `no_token_sampling` is set to true, no token is sampled by the server and
the probabilities for every token are returned instead.

When no\_token\_sampling = false:
```json
{<token string>: {"p": <number>, "is_end_token": bool, might not be present}}
```
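
The response is an object keyed by the sampled token, so the token and its
probability can be pulled out with, for example, `jq` (assumed to be
installed; host and port are assumptions as above):

```bash
curl -s -H 'Content-Type: application/json' -H 'Accept: application/json' \
  -d '{"prompt": "Hello, world", "max_new_tokens": 1}' \
  http://127.0.0.1:8080/rllama/v1/inference \
  | jq -r 'to_entries[] | "\(.key)  p=\(.value.p)"'
```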
When no\_token\_sampling = true:
```json
{<token string>: {"p": <number>, "is_end_token": bool, might not be present} \
,<token string>: {"p": <number>, "is_end_token": bool, might not be present} \
,...}
```
Tokens where `p = 0` will not be present in the JSON output.
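
To inspect the returned distribution, the entries can be sorted by probability
on the client side. This sketch (again assuming `jq` and a local server)
prints the five most likely next tokens:

```bash
curl -s -H 'Content-Type: application/json' -H 'Accept: application/json' \
  -d '{"prompt": "Hello, world", "max_new_tokens": 1, "stop_at_end_token": false, "no_token_sampling": true}' \
  http://127.0.0.1:8080/rllama/v1/inference \
  | jq -r 'to_entries | sort_by(-.value.p) | .[:5][] | "\(.key)  p=\(.value.p)"'
```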
If you want to implement your own token sampling, you may want to set
`max_new_tokens=1` and `stop_at_end_token=false` (together with
`no_token_sampling=true`) to suppress rllama's own sampling behavior entirely.
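
As an illustration, here is a minimal greedy-decoding loop built on the API.
It is a sketch under assumptions: the host and port, the availability of `jq`,
and the choice of always taking the most probable token are not part of the
API description above.

```bash
#!/usr/bin/env bash
set -euo pipefail

URL="http://127.0.0.1:8080/rllama/v1/inference"  # assumed host and port
PROMPT="Hello, world"

for _ in $(seq 1 32); do
  # Ask for the next-token distribution instead of a sampled token.
  BODY=$(jq -n --arg p "$PROMPT" \
    '{prompt: $p, max_new_tokens: 1, stop_at_end_token: false, no_token_sampling: true}')
  RESP=$(curl -s -H 'Content-Type: application/json' -H 'Accept: application/json' \
    -d "$BODY" "$URL")

  # Greedy choice: take the most probable token, stop if it is the end token.
  TOKEN=$(printf '%s' "$RESP" | jq -r 'to_entries | max_by(.value.p) | .key')
  IS_END=$(printf '%s' "$RESP" | jq -r 'to_entries | max_by(.value.p) | .value.is_end_token // false')
  if [ "$IS_END" = "true" ]; then
    break
  fi

  PROMPT="$PROMPT$TOKEN"
done

printf '%s\n' "$PROMPT"
```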