Fix some things in README.md after proofreading it and removing lies.

master
Mikko Juola 3 years ago
parent b9be485610
commit a8320613a1

@@ -105,8 +105,9 @@ The command line flags for this are:
* `--inference-server-api-path` sets which path serves the API requests. The
default path is `/rllama/v1/inference`
* `--inference-server-prompt-cache-size` sets how many previous prompt
-calculations should be cached. Default is 1000. This speeds up token
-generation for prompts that were already requested before.
+calculations should be cached. Default is 50. This speeds up token
+generation for prompts that were already requested before; however, it also
+increases memory use as the cache fills up.
Prompts and flags related to token sampling are all ignored in inference server
mode. Instead, they are obtained from each HTTP JSON API request.
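For orientation, here is a minimal request sketch in Python against the default API path. The host and port (`localhost:8080`), the `prompt` field name, and the use of the `requests` library are assumptions for illustration; the documented sampling fields follow the schema excerpt below.

```python
# Minimal sketch of an inference request. Assumptions: the server listens on
# localhost:8080, the prompt is passed in a "prompt" field, and the `requests`
# package is installed. The sampling fields follow the schema excerpt below.
import requests

resp = requests.post(
    "http://localhost:8080/rllama/v1/inference",  # default --inference-server-api-path
    json={
        "prompt": "Hello, my name is",  # assumed field name
        "temperature": 0.8,             # optional
        "top_k": 20,                    # optional, default 20
        "max_new_tokens": 16,
        "stop_at_end_token": True,
    },
    headers={"Accept": "application/json"},
)
resp.raise_for_status()
print(resp.json())
```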
@@ -123,7 +124,7 @@ Expects a JSON body and `Accept: application/json` or `Accept: text/jsonl`.
The expected JSON is as follows:
-```json
+```
{
"temperature": <number, optional>
"top_k": <integer, optional, default 20>
@@ -146,7 +147,7 @@ the probabilities for every token are returned instead.
When no\_token\_sampling = false:
-```json
+```
{<token string>: {"p": <number>, "is_end_token": bool, might not be present}}
```
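As a small illustration, here is a sketch of reading an object with that shape in Python. It assumes the body is a single JSON object as shown; with `Accept: text/jsonl` you would presumably parse one such object per line.

```python
# Sketch: extract the sampled token from a response object shaped like the
# example above. "is_end_token" may be absent, so treat a missing key as False.
def read_sampled_token(obj: dict) -> tuple[str, float, bool]:
    token, info = next(iter(obj.items()))
    return token, info["p"], info.get("is_end_token", False)

# Example: read_sampled_token({"Hello": {"p": 0.93, "is_end_token": False}})
# returns ("Hello", 0.93, False).
```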
@@ -160,14 +161,12 @@ When no\_token\_sampling = false:
When no\_token\_sampling = true:
-```json
+```
{<token string>: {"p": <number>, "is_end_token": bool, might not be present} \
,<token string>: {"p": <number>, "is_end_token": bool, might not be present} \
,...}
```
Tokens where `p = 0` will not be present in the JSON output.
If you want to implement your own token sampling, you may want to set
`max_new_tokens=1` and `stop_at_end_token=false` to suppress rllama's own
sampling behavior entirely.
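As a hedged sketch of that approach: request one step at a time with `no_token_sampling=true`, sample from the returned distribution on the client, and feed the chosen token back into the prompt. The endpoint URL and the `prompt` field are the same illustrative assumptions as in the earlier sketch.

```python
# Sketch of client-side token sampling. Assumes no_token_sampling=true makes
# the server return a JSON object mapping each candidate token string to
# {"p": ..., "is_end_token": ...}, with p = 0 tokens omitted as noted above.
import random
import requests

URL = "http://localhost:8080/rllama/v1/inference"  # assumed host/port, default path

def next_token(prompt: str) -> tuple[str, bool]:
    resp = requests.post(
        URL,
        json={
            "prompt": prompt,            # assumed field name
            "max_new_tokens": 1,         # one step per request
            "stop_at_end_token": False,  # let the client decide when to stop
            "no_token_sampling": True,   # ask for the full distribution
        },
        headers={"Accept": "application/json"},
    )
    resp.raise_for_status()
    dist = resp.json()
    tokens = list(dist)
    weights = [dist[t]["p"] for t in tokens]
    choice = random.choices(tokens, weights=weights)[0]
    return choice, dist[choice].get("is_end_token", False)

prompt = "Hello, my name is"
for _ in range(16):
    token, is_end = next_token(prompt)
    if is_end:
        break
    prompt += token  # naive join; a real client may need to handle tokenizer space markers
print(prompt)
```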
