Usage Examples
Single-turn conversation:
Single-modal model:
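The following is a minimal sketch of a single-turn request body, assuming the OpenAI-compatible chat completions schema; the model name, prompt, and parameter values are illustrative and should be adjusted to your deployment:

```json
{
    "model": "llama_65b",
    "messages": [
        {
            "role": "user",
            "content": "What is deep learning?"
        }
    ],
    "max_tokens": 512,
    "stream": false
}
```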
Multimodal model:
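A sketch of a multimodal request, assuming the OpenAI-style content list with "text" and "image_url" entries; the model name and image URL are placeholders:

```json
{
    "model": "qwen_vl_example",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the content of this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}}
            ]
        }
    ],
    "max_tokens": 512,
    "stream": false
}
```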
Note: Set the value of the "image_url" parameter to an image that is reachable in your actual environment.
Multi-turn conversation:
Request Example 1:
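A sketch of a multi-turn request: the conversation history is passed in the "messages" array as alternating "user" and "assistant" turns (all values illustrative):

```json
{
    "model": "llama_65b",
    "messages": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
        {"role": "user", "content": "What is its population?"}
    ],
    "max_tokens": 512,
    "stream": false
}
```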
Request Example 2:
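A variant that additionally carries a "system" message ahead of the history (all values illustrative):

```json
{
    "model": "llama_65b",
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Recommend a book on machine learning."},
        {"role": "assistant", "content": "I recommend \"Pattern Recognition and Machine Learning\" by Christopher Bishop."},
        {"role": "user", "content": "Is it suitable for beginners?"}
    ],
    "max_tokens": 512,
    "stream": false
}
```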
Response Examples:
Text Inference ("stream"=false):
Response Example 1 (single-turn conversation):
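A sketch of a non-streaming response; the field layout follows Table 1, and all values are illustrative:

```json
{
    "id": "endpoint_common_1",
    "object": "chat.completion",
    "created": 1696043550,
    "model": "llama_65b",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The capital of France is Paris."
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 33,
        "completion_tokens": 8,
        "total_tokens": 41
    },
    "prefill_time": 96,
    "decode_time_arr": [42, 41, 41, 40, 40, 40, 39]
}
```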
Response Example 2 (multi-turn conversation):
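A sketch of the corresponding multi-turn response (all values illustrative):

```json
{
    "id": "endpoint_common_2",
    "object": "chat.completion",
    "created": 1696043580,
    "model": "llama_65b",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Paris has a population of roughly two million within the city proper."
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 58,
        "completion_tokens": 14,
        "total_tokens": 72
    },
    "prefill_time": 103,
    "decode_time_arr": [43, 42, 42, 41, 41]
}
```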
Streaming Inference:
Streaming Inference 1 ("stream"=true, returned in SSE format):
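A sketch of the SSE event stream (values illustrative). Per Table 2, the final chunk carries an empty "delta" and the "finish_reason"; the terminating "data: [DONE]" sentinel follows the common SSE convention and is an assumption here:

```
data: {"id":"endpoint_common_3","object":"chat.completion.chunk","created":1696043600,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello"}}]}

data: {"id":"endpoint_common_3","object":"chat.completion.chunk","created":1696043600,"model":"llama_65b","choices":[{"index":0,"delta":{"content":", world."}}]}

data: {"id":"endpoint_common_3","object":"chat.completion.chunk","created":1696043600,"model":"llama_65b","usage":{"prompt_tokens":33,"completion_tokens":3,"total_tokens":36},"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
```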
Streaming Inference 2 ("stream"=true, with configuration "fullTextEnabled"=true, returned in SSE format):
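The same stream with "fullTextEnabled"=true: each chunk additionally carries the "full_text" field, assumed here to accumulate the text generated so far (values illustrative):

```
data: {"id":"endpoint_common_4","object":"chat.completion.chunk","created":1696043620,"model":"llama_65b","full_text":"Hello","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello"}}]}

data: {"id":"endpoint_common_4","object":"chat.completion.chunk","created":1696043620,"model":"llama_65b","full_text":"Hello, world.","choices":[{"index":0,"delta":{"content":", world."}}]}

data: {"id":"endpoint_common_4","object":"chat.completion.chunk","created":1696043620,"model":"llama_65b","full_text":"Hello, world.","usage":{"prompt_tokens":33,"completion_tokens":3,"total_tokens":36},"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
```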
Output Explanation
Table 1: Explanation of text inference results
| Parameter | Type | Description |
| --- | --- | --- |
| id | string | Request ID. |
| object | string | Type of the returned result; currently always "chat.completion". |
| created | integer | Timestamp of the inference request, in seconds. |
| model | string | The model used for inference. |
| choices | list | List of inference results. |
| choices[].index | integer | Index of the choice; currently always 0. |
| choices[].message | object | Inference message. |
| choices[].message.role | string | Role; currently always "assistant". |
| choices[].message.content | string | Inference text result. |
| choices[].message.tool_calls | list | Tool invocations produced by the model. |
| choices[].message.tool_calls[].function | dict | Description of the function call. |
| choices[].message.tool_calls[].function.arguments | string | Arguments of the function call, as a JSON-formatted string. |
| choices[].message.tool_calls[].function.name | string | Name of the called function. |
| choices[].message.tool_calls[].id | string | ID of the model's tool invocation. |
| choices[].message.tool_calls[].type | string | Type of the tool; currently only "function" is supported. |
| choices[].finish_reason | string | Reason why generation finished. |
| usage | object | Statistics for the inference result. |
| usage.prompt_tokens | int | Number of tokens in the user-input prompt text. |
| usage.completion_tokens | int | Number of tokens in the inference result. In the PD (prefill-decode disaggregation) scenario, this counts the tokens of both the P and D inference results: when the inference length limit of a request is set to maxIterTimes, completion_tokens in the D node response is maxIterTimes + 1, because it includes the first token produced by the P inference. |
| usage.total_tokens | int | Total number of tokens for the request and the inference result. |
| prefill_time | float | Latency of the first inference token. |
| decode_time_arr | list | Array of per-token latencies during inference decoding. |
Table 2: Explanation of streaming inference results
| Parameter | Type | Description |
| --- | --- | --- |
| data | object | Result of a single inference step. |
| data.id | string | Request ID. |
| data.object | string | Currently always "chat.completion.chunk". |
| data.created | integer | Timestamp of the inference request, in seconds. |
| data.model | string | The model used for inference. |
| data.full_text | string | Full text result; returned only when the configuration item "fullTextEnabled" is set to true. |
| data.usage | object | Statistics for the inference result. |
| data.usage.prompt_tokens | int | Number of tokens in the user-input prompt text. |
| data.usage.completion_tokens | int | Number of tokens in the inference result. In the PD (prefill-decode disaggregation) scenario, this counts the tokens of both the P and D inference results: when the inference length limit of a request is set to maxIterTimes, completion_tokens in the D node response is maxIterTimes + 1, because it includes the first token produced by the P inference. |
| data.usage.total_tokens | int | Total number of tokens for the request and the inference result. |
| data.choices | list | Streaming inference results. |
| data.choices[].index | integer | Index of the choice; currently always 0. |
| data.choices[].delta | object | Incremental inference result; empty in the final response. |
| data.choices[].delta.role | string | Role; currently always "assistant". |
| data.choices[].delta.content | string | Inference text result. |
| data.choices[].finish_reason | string | Reason why generation finished; returned only in the final inference result. |