
Both max_new_tokens (=512) and max_length(=518) seem to have been set. max_new_tokens will take precedence. Please refer to the documentation for more information.  #1872

@fpy10

Description


System Info / 系統信息

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:36:15_Pacific_Daylight_Time_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

torch 2.3.1+cu121
torchaudio 2.3.1+cu121
torchvision 0.18.1
vector-quantize-pytorch 1.15.3

Running Xinference with Docker? / 是否使用 Docker 运行 Xinference?

  • docker / docker
  • pip install / 通过 pip install 安装
  • installation from source / 从源码安装

Version info / 版本信息

pip install "xinference[transformers]"

The command used to start Xinference / 用以启动 xinference 的命令

xinference-local

Reproduction / 复现过程

After the first exchange with GLM4, the model errors out and the conversation cannot continue.
--- Logging error ---
Traceback (most recent call last):
File "C:\Users\87952\miniconda3\envs\xinference\lib\logging\handlers.py", line 73, in emit
if self.shouldRollover(record):
File "C:\Users\87952\miniconda3\envs\xinference\lib\logging\handlers.py", line 196, in shouldRollover
msg = "%s\n" % self.format(record)
File "C:\Users\87952\miniconda3\envs\xinference\lib\logging\__init__.py", line 943, in format
return fmt.format(record)
File "C:\Users\87952\miniconda3\envs\xinference\lib\logging\__init__.py", line 678, in format
record.message = record.getMessage()
File "C:\Users\87952\miniconda3\envs\xinference\lib\logging\__init__.py", line 368, in getMessage
msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
File "C:\Users\87952\miniconda3\envs\xinference\lib\threading.py", line 973, in _bootstrap
self._bootstrap_inner()
File "C:\Users\87952\miniconda3\envs\xinference\lib\threading.py", line 1016, in _bootstrap_inner
self.run()
File "C:\Users\87952\miniconda3\envs\xinference\lib\threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\87952\miniconda3\envs\xinference\lib\concurrent\futures\thread.py", line 83, in _worker
work_item.run()
File "C:\Users\87952\miniconda3\envs\xinference\lib\concurrent\futures\thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "C:\Users\87952\miniconda3\envs\xinference\lib\site-packages\xoscar\api.py", line 402, in _wrapper
return next(_gen)
File "C:\Users\87952\miniconda3\envs\xinference\lib\site-packages\xinference\core\model.py", line 318, in _to_json_generator
for v in gen:
File "C:\Users\87952\miniconda3\envs\xinference\lib\site-packages\xinference\model\llm\utils.py", line 558, in _to_chat_completion_chunks
for i, chunk in enumerate(chunks):
File "C:\Users\87952\miniconda3\envs\xinference\lib\site-packages\xinference\model\llm\pytorch\chatglm.py", line 259, in _stream_generator
for chunk_text, _ in self._model.stream_chat(
File "C:\Users\87952\miniconda3\envs\xinference\lib\site-packages\torch\utils\_contextlib.py", line 35, in generator_context
response = gen.send(None)
File "C:\Users\Administrator\.cache\huggingface\modules\transformers_modules\glm4-chat-pytorch-9b\modeling_chatglm.py", line 1139, in stream_chat
for outputs in self.stream_generate(**inputs, past_key_values=past_key_values,
File "C:\Users\87952\miniconda3\envs\xinference\lib\site-packages\torch\utils\_contextlib.py", line 35, in generator_context
response = gen.send(None)
File "C:\Users\Administrator\.cache\huggingface\modules\transformers_modules\glm4-chat-pytorch-9b\modeling_chatglm.py", line 1188, in stream_generate
logger.warn(
Message: 'Both max_new_tokens (=512) and max_length(=518) seem to have been set. max_new_tokens will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)'
Arguments: (<class 'UserWarning'>,)
C:\Users\Administrator\.cache\huggingface\modules\transformers_modules\glm4-chat-pytorch-9b\modeling_chatglm.py:271: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
context_layer = torch.nn.functional.scaled_dot_product_attention(query_layer, key_layer, value_layer,
Traceback (most recent call last):
File "C:\Users\87952\miniconda3\envs\xinference\lib\asyncio\windows_events.py", line 444, in select
self._poll(timeout)
RuntimeError: <_overlapped.Overlapped object at 0x000002357017F900> still has pending operation at deallocation, the process may crash
(The traceback above repeats several more times, differing only in the `_overlapped.Overlapped` object address.)
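Note on the "--- Logging error ---" block: it is a secondary failure, not the crash itself. `modeling_chatglm.py` calls `logger.warn(message, UserWarning)`, so the `UserWarning` class is passed to the logging module as a %-format argument for a message that contains no placeholders, and `LogRecord.getMessage()` raises `TypeError: not all arguments converted during string formatting`. A minimal sketch of that mechanism (standalone, not Xinference code):

```python
import logging

# Build a LogRecord the way logger.warn("msg", UserWarning) would: the message
# has no %-placeholders, but args is non-empty.
record = logging.LogRecord(
    name="transformers_modules",
    level=logging.WARNING,
    pathname="modeling_chatglm.py",
    lineno=1188,
    msg="Both max_new_tokens (=512) and max_length(=518) seem to have been set.",
    args=(UserWarning,),  # the stray second positional argument
    exc_info=None,
)

try:
    # getMessage() does `msg % args`, which fails when there is nothing to format.
    record.getMessage()
except TypeError as exc:
    print(exc)  # not all arguments converted during string formatting
```

This is why the log shows `Arguments: (<class 'UserWarning'>,)` under the formatted message: the warning text itself is harmless, but the extra argument breaks the logging handler.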

Expected behavior / 期待表现

The model should run normally and the conversation should continue without crashing.
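For reference, the two limits in the warning message are mutually consistent: generation wrappers commonly derive `max_length` as the prompt length plus the requested `max_new_tokens` (6 + 512 = 518 here), so the warning itself is benign and `max_new_tokens` wins, as Transformers documents. A hypothetical sketch of that arithmetic (illustration only, not Xinference's or GLM4's actual code):

```python
def effective_limits(prompt_len: int, max_new_tokens: int) -> tuple[int, int]:
    """Illustrative helper: how a chat wrapper might derive max_length.

    max_new_tokens caps the generated continuation; max_length caps the
    total sequence (prompt + continuation). Setting both triggers the
    Transformers warning, and max_new_tokens takes precedence.
    """
    max_length = prompt_len + max_new_tokens
    return max_new_tokens, max_length


# With a 6-token prompt and the default max_new_tokens=512,
# this reproduces the (=512) and (=518) pair from the warning.
print(effective_limits(6, 512))  # (512, 518)
```

The actual crash appears to be the asyncio `Overlapped ... still has pending operation at deallocation` errors on Windows, so silencing the warning alone is unlikely to fix the hang.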
