Streaming Inference
Streamed Inference 1 ("stream"=true; results are returned in SSE format):
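The exact chunk contents are not defined in this section, so the following is only a sketch of the SSE stream assembled from the fields in Table 2; the request ID, timestamp, model name, token counts, and text are placeholders:

```
data: {"id":"2","object":"chat.completion.chunk","created":1715000000,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello"}}]}

data: {"id":"2","object":"chat.completion.chunk","created":1715000000,"model":"llama_65b","choices":[{"index":0,"delta":{"content":" world"}}]}

data: {"id":"2","object":"chat.completion.chunk","created":1715000000,"model":"llama_65b","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":5,"completion_tokens":2,"total_tokens":7}}
```

Each SSE event is a `data:` line whose payload is a JSON object. Per Table 2, finish_reason appears only in the last event; whether usage is attached to every event or only the last one is not specified here, so the sketch shows it on the last event only.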
Streamed Inference 2 ("stream"=true with the configuration item "fullTextEnabled"=true; results are returned in SSE format):
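With "fullTextEnabled"=true, the chunks additionally carry the full_text field described in Table 2. A sketch of a final event under that assumption (whether intermediate events also carry full_text is not specified here; all values are placeholders):

```
data: {"id":"2","object":"chat.completion.chunk","created":1715000000,"model":"llama_65b","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":5,"completion_tokens":2,"total_tokens":7},"full_text":"Hello world"}
```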
Output Description
Table 1
Text Inference Result Description
Parameter Name | Type | Description
---|---|---
id | string | Request ID.
object | string | Return result type; currently always "chat.completion".
created | integer | Timestamp of the inference request, accurate to the second.
model | string | Inference model used.
choices | list | List of inference results.
choices[].index | integer | Index of the choice message; currently only 0 is allowed.
choices[].message | object | Inference message.
choices[].message.role | string | Role; currently always "assistant".
choices[].message.content | string | Inference text result.
choices[].message.tool_calls | list | Tool call output of the model.
choices[].message.tool_calls[].function | dict | Description of the function call.
choices[].message.tool_calls[].function.arguments | string | Arguments for calling the function, as a JSON string.
choices[].message.tool_calls[].function.name | string | Name of the called function.
choices[].message.tool_calls[].id | string | ID of the model's tool call.
choices[].message.tool_calls[].type | string | Tool type; currently only "function" is supported.
choices[].finish_reason | string | Reason the inference finished.
usage | object | Statistics for the inference result.
usage.prompt_tokens | int | Token length of the user's input prompt text.
usage.completion_tokens | int | Number of tokens in the inference result. In the PD scenario, this counts the total number of tokens in the P and D inference results; when the maximum inference length of a request is set to maxIterTimes, the D node's response reports completion_tokens equal to maxIterTimes+1 because it includes the first token of the P inference result.
usage.total_tokens | int | Total number of tokens for the request and the inference result.
usage.prefill_time | float | Latency of the first inference token.
usage.decode_time_arr | list | Array of per-token decode latencies.
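As a reading aid, a hypothetical response body assembled from the fields in Table 1. All values are placeholders, the finish_reason value is illustrative, and tool_calls is omitted because it appears only when the model invokes a tool:

```json
{
  "id": "2",
  "object": "chat.completion",
  "created": 1715000000,
  "model": "llama_65b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 9,
    "total_tokens": 21,
    "prefill_time": 0.085,
    "decode_time_arr": [0.021, 0.020, 0.019]
  }
}
```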
Table 2
Streamed Inference Result Description
Parameter Name | Type | Description
---|---|---
data | object | Result returned by a single inference step.
data.id | string | Request ID.
data.object | string | Currently always "chat.completion.chunk".
data.created | integer | Timestamp of the inference request, accurate to the second.
data.model | string | Inference model used.
data.full_text | string | Full text result; present only when the configuration item "fullTextEnabled" is set to true.
data.usage | object | Statistics for the inference result.
data.usage.prompt_tokens | int | Token length of the user's input prompt text.
data.usage.completion_tokens | int | Number of tokens in the inference result. In the PD scenario, this counts the total number of tokens in the P and D inference results; when the maximum inference length of a request is set to maxIterTimes, the D node's response reports completion_tokens equal to maxIterTimes+1 because it includes the first token of the P inference result.
data.usage.total_tokens | int | Total number of tokens for the request and the inference result.
data.choices | list | Streaming inference results.
data.choices[].index | integer | Index of the choice message; currently only 0 is supported.
data.choices[].delta | object | Incremental inference result; empty in the last response.
data.choices[].delta.role | string | Role; currently always "assistant".
data.choices[].delta.content | string | Inference text result.
data.choices[].finish_reason | string | Reason the inference finished; returned only in the last inference result.
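A minimal client sketch for consuming the streamed result. The endpoint URL, request body layout (model/messages), and model name are assumptions made for illustration; the parsing follows the fields in Table 2 and uses the third-party requests library:

```python
import json

import requests

# Placeholder endpoint and model name; substitute the actual service address and model.
URL = "http://127.0.0.1:1025/v1/chat/completions"
payload = {
    "model": "llama_65b",                                   # placeholder
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": True,                                         # request SSE-formatted output
}

pieces = []
with requests.post(URL, json=payload, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        # Each SSE event arrives as a "data: {...}" line; blank lines separate events.
        if not line or not line.startswith("data:"):
            continue
        body = line[len("data:"):].strip()
        if body == "[DONE]":   # some SSE servers emit a terminator sentinel; stop if present
            break
        chunk = json.loads(body)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if "content" in delta:
                pieces.append(delta["content"])              # accumulate incremental text
            if choice.get("finish_reason"):
                print("finish_reason:", choice["finish_reason"])
        if "usage" in chunk:
            print("usage:", chunk["usage"])                  # token statistics (see Table 2)

print("".join(pieces))
```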