fix(server): surface chat-template errors #208
base: main
Conversation
import zmq
from mlx_lm.server import convert_chat, process_message_content

try:  # pragma: no cover - jinja2 is an optional dependency during tests
Why not add jinja2 as a dependency?
jinja2 is only needed when a tokenizer uses a chat template that hits the Jinja renderer. The executor can run entirely without it (e.g., plain prompts, different tokenizers, or GPU builds in minimal environments), so we treat it as optional to keep install footprints small and avoid importing Jinja just to start the server.
The TemplateError handling is defensive: if jinja2 is present we surface a more precise error type to the HTTP handler; if it isn’t, we fall back to the generic exception. That lets tests and deployments without Jinja still run cleanly while giving better diagnostics where it’s installed.
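For context, a minimal sketch of the optional-import pattern being discussed; the fallback class name is an assumption about how the module aliases the error when jinja2 is absent:

```python
try:  # pragma: no cover - jinja2 is an optional dependency during tests
    from jinja2 import TemplateError  # precise error type when Jinja is installed
except ImportError:
    class TemplateError(Exception):  # fallback so `except TemplateError` still works
        """Stand-in used when jinja2 is not installed."""
```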
hi
We don't want users to manually install some dependencies, so adding it as a dependency is the straightforward way.
break
if isinstance(token, dict) and token.get("type") == "error":
    yield self._generate_error_stream_chunk(rid, token.get("payload", {}))
    continue
break?
The `continue` after emitting the error chunk is intentional: it keeps the loop alive so the handler can pull whatever comes next (typically the `None` sentinel) and finish the stream cleanly.
- `None` → `break`: we exit the loop and send the final chunk + `[DONE]`.
- Error dict → emit the SSE error chunk, then `continue` so we don't fall through to `yield self._generate_stream_chunk(...)` with a dict, and we keep waiting for the sentinel.

If we changed that `continue` to a `break`, an error would terminate the loop immediately: the client would miss the final `[DONE]`, the request would never be marked as finished in the streaming path, and resources would leak if the sentinel were never consumed.
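To make the control flow concrete, here is a standalone sketch of that loop; the queue read and the final `[DONE]` line are assumptions, while the chunk helpers are the ones shown in the diff:

```python
from typing import Iterator

def stream_response(handler, rid: str, queue) -> Iterator[str]:
    """Hypothetical rendering of the sentinel/error/token handling above."""
    while True:
        token = queue.get()                   # assumed IPC/queue read
        if token is None:                     # sentinel: generation finished
            break                             # exit, then emit final chunk + [DONE]
        if isinstance(token, dict) and token.get("type") == "error":
            yield handler._generate_error_stream_chunk(rid, token.get("payload", {}))
            continue                          # keep draining until the sentinel arrives
        yield handler._generate_stream_chunk(rid, token)  # normal token path
    yield "data: [DONE]\n\n"                  # SSE terminator after the loop exits
```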
detokenizer: StreamingDetokenizer = None
error_message: Optional[str] = None
error_type: Optional[str] = None
error_status: HTTPStatus = HTTPStatus.BAD_REQUEST
set default to None
Not necessary: `error_status` is typed as `HTTPStatus` and defaults to `HTTPStatus.BAD_REQUEST`, so `handle_executor_error()` can always assign a valid status (or leave the default) and `create_error_response()` can rely on a real `HTTPStatus` without adding `None` checks. Switching to `None` would force us to make the field `Optional[HTTPStatus]` and add guard code in every consumer, without any functional gain.
Maybe setting the default to `INTERNAL_SERVER_ERROR` would be better; `BAD_REQUEST` will confuse the client when the failure is not actually a bad request.
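For illustration, a minimal sketch of how the default interacts with the consumers mentioned above; the dataclass name and the `create_error_response` body are assumptions, only the field names and defaults come from the diff:

```python
from dataclasses import dataclass
from http import HTTPStatus
from typing import Optional

@dataclass
class RequestState:  # hypothetical container for the per-request error state
    error_message: Optional[str] = None
    error_type: Optional[str] = None
    # A concrete HTTPStatus default lets consumers skip None checks; the open
    # question above is whether BAD_REQUEST or INTERNAL_SERVER_ERROR is the
    # safer default when the executor never classifies the failure.
    error_status: HTTPStatus = HTTPStatus.BAD_REQUEST

def create_error_response(state: RequestState) -> dict:
    # Relies on error_status always being a real HTTPStatus.
    return {
        "error": {
            "message": state.error_message or "unknown error",
            "type": state.error_type or "internal_error",
            "code": state.error_status.value,
        }
    }
```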
📋 PR Title
fix(server): surface chat-template errors
📝 Change Type
💡 Description
Malformed chat requests that embed `<|channel|>` tags in `messages[*].content` caused the executor's tokenizer step to raise a template error, leaving the scheduler hanging and never responding to the client. This PR forwards those failures back through the IPC channel so the HTTP handler can immediately return a structured 400 error while keeping the node healthy. It also adds unit coverage for the new HTTP error-handling path.

Key Changes
- `_notify_http_request_error` in `Executor` to catch tokenizer/chat-template exceptions and send error envelopes to the HTTP server.
- `HTTPHandler` to track per-request error state, stream error chunks, and emit non-streaming 400 responses.

🔗 Related Issues
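As a rough sketch of the error envelope this PR forwards over the IPC channel; the function signature, field names, and the use of `send_json` are assumptions, not the actual implementation:

```python
import zmq

def notify_http_request_error(socket: zmq.Socket, rid: str, exc: Exception) -> None:
    """Hypothetical sketch: report a chat-template failure back to the HTTP
    handler instead of letting the scheduler hang on the request."""
    envelope = {
        "type": "error",  # matches the token.get("type") == "error" check above
        "payload": {
            "request_id": rid,
            "message": str(exc),
            "error_type": exc.__class__.__name__,
            "status": 400,  # surfaced to the client as a structured 400
        },
    }
    socket.send_json(envelope)
```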