This repository was archived by the owner on Jul 4, 2025. It is now read-only.
bug: Cannot start the embedding model via CLI #1719
Open
Description
Cortex version
1.0.3
Describe the issue and expected behaviour
I tried to run https://huggingface.co/yixuan-chia/snowflake-arctic-embed-m-GGUF against the OpenAI-compatible embeddings endpoint, but the model couldn't be started; see the reproduction steps and logs below.
The same thing happens with https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1.
On a side note, I also tried to run nomic-embed-text-v1 from the built-in models list, but both cortex pull cortexso/nomic-embed-text-v1 and cortex run nomic-embed-text-v1 fail with No variant available. That seems like a separate issue, though.
Steps to Reproduce
# the model gets downloaded just fine
cortex pull mixedbread-ai/mxbai-embed-large-v1
# I can access the info about it without any problems
cortex models get yixuan-chia:snowflake-arctic-embed-m-GGUF:snowflake-arctic-embed-m-F16.gguf
# But it crashes here with `HTTP error: Failed to read connection`
cortex models start yixuan-chia:snowflake-arctic-embed-m-GGUF:snowflake-arctic-embed-m-F16.gguf
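For reference, here is a sketch of the OpenAI-compatible embeddings request the model was meant to serve once started. The /v1/embeddings path and the port are assumptions based on the server address reported in the log below (127.0.0.1:39281); the input text is made up.

```python
import json

# Hypothetical request body for cortex's OpenAI-compatible embeddings
# endpoint. URL path and port are assumed from the server log, not
# confirmed against cortex's API docs.
url = "http://127.0.0.1:39281/v1/embeddings"
body = {
    "model": "yixuan-chia:snowflake-arctic-embed-m-GGUF:snowflake-arctic-embed-m-F16.gguf",
    "input": ["The quick brown fox"],
}
payload = json.dumps(body)
print(payload)
```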
Screenshots / Logs
The logs for the last, crashing command, taken from ~/cortexcpp/logs/cortex.log (not ~/cortex/logs/ as stated in the ticket template; the template may need updating):
20241124 11:21:38.970271 UTC 69532246 INFO Host: 127.0.0.1 Port: 39281
- main.cc:80
20241124 11:21:38.971643 UTC 69532246 INFO cortex.cpp version: v1.0.3 - main.cc:89
20241124 11:21:38.981916 UTC 69532246 INFO nvidia-smi is not available! - system_info_utils.h:130
20241124 11:21:38.983383 UTC 69532246 INFO Activated GPUs before: - hardware_service.cc:244
20241124 11:21:38.983435 UTC 69532246 INFO Activated GPUs after: - hardware_service.cc:268
20241124 11:21:38.986450 UTC 69532246 INFO Starting worker thread: 0 - download_service.cc:302
20241124 11:21:38.986504 UTC 69532246 INFO Starting worker thread: 1 - download_service.cc:302
20241124 11:21:38.986532 UTC 69532246 INFO Starting worker thread: 2 - download_service.cc:302
20241124 11:21:38.986555 UTC 69532246 INFO Starting worker thread: 3 - download_service.cc:302
20241124 11:21:38.991551 UTC 69532246 INFO nvidia-smi is not available! - system_info_utils.h:130
20241124 11:21:38.994464 UTC 69532246 INFO Server started, listening at: 127.0.0.1:39281 - main.cc:140
20241124 11:21:38.994479 UTC 69532246 INFO Please load your model - main.cc:142
20241124 11:21:38.994494 UTC 69532246 INFO Number of thread is:10 - main.cc:149
20241124 11:21:39.946448 UTC 69532269 INFO Origin: - main.cc:162
20241124 11:21:39.959655 UTC 69532270 INFO {
"ai_prompt" : "[/INST]",
"ai_template" : "[/INST]",
"created" : 0,
"ctx_len" : 512,
"dynatemp_exponent" : 1.0,
"dynatemp_range" : 0.0,
"engine" : "llama-cpp",
"files" :
[
"models/huggingface.co/yixuan-chia/snowflake-arctic-embed-m-GGUF/snowflake-arctic-embed-m-F16.gguf"
],
"frequency_penalty" : 0.0,
"gpu_arch" : "",
"ignore_eos" : false,
"max_tokens" : 512,
"min_keep" : 0,
"min_p" : 0.05000000074505806,
"mirostat" : false,
"mirostat_eta" : 0.10000000149011612,
"mirostat_tau" : 5.0,
"model" : "yixuan-chia:snowflake-arctic-embed-m-GGUF:snowflake-arctic-embed-m-F16.gguf",
"model_path" : "/Users/grzegorzbielski/cortexcpp/models/huggingface.co/yixuan-chia/snowflake-arctic-embed-m-GGUF/snowflake-arctic-embed-m-F16.gguf",
"n_parallel" : 1,
"n_probs" : 0,
"name" : "Snowflake-Arctic-Embed-M",
"ngl" : 13,
"object" : "",
"os" : "",
"owned_by" : "",
"penalize_nl" : false,
"precision" : "",
"presence_penalty" : 0.0,
"prompt_template" : "[INST] <<SYS>>\n{system_message}\n<</SYS>>\n{prompt}[/INST]",
"quantization_method" : "",
"repeat_last_n" : 64,
"repeat_penalty" : 1.0,
"seed" : -1,
"size" : 0,
"stop" :
[
"[PAD]"
],
"stream" : true,
"system_prompt" : "[INST] <<SYS>>\n",
"system_template" : "[INST] <<SYS>>\n",
"temperature" : 0.69999998807907104,
"text_model" : false,
"tfs_z" : 1.0,
"top_k" : 40,
"top_p" : 0.94999998807907104,
"typ_p" : 1.0,
"user_prompt" : "\n<</SYS>>\n",
"user_template" : "\n<</SYS>>\n",
"version" : "2"
}
- model_service.cc:667
20241124 11:21:39.974798 UTC 69532270 INFO nvidia-smi is not available! - system_info_utils.h:130
20241124 11:21:39.976185 UTC 69532270 INFO is_cuda: 0 - model_service.cc:680
20241124 11:21:39.976265 UTC 69532270 INFO Loading engine: cortex.llamacpp - engine_service.cc:771
20241124 11:21:39.977503 UTC 69532270 INFO Selected engine variant: {"engine":"cortex.llamacpp","variant":"mac-arm64","version":"v0.1.39"} - engine_service.cc:780
20241124 11:21:39.978785 UTC 69532270 INFO Engine path: /Users/grzegorzbielski/cortexcpp/engines/cortex.llamacpp/mac-arm64/v0.1.39 - engine_service.cc:805
20241124 11:21:39.983962 UTC 69532270 INFO cortex.llamacpp version: 0.1.39 - llama_engine.cc:308
20241124 11:21:39.984439 UTC 69532270 INFO Number of parallel is set to 1 - llama_engine.cc:544
20241124 11:21:39.984490 UTC 69532270 INFO system info: {'n_thread': 10, 'total_threads': 10. 'system_info': 'AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | '} - llama_engine.cc:620
20241124 11:21:40.052148 UTC 69532270 INFO llama_load_model_from_file: using device Metal (Apple M1 Max) - 21845 MiB free
- llama_engine.cc:475
20241124 11:21:40.055512 UTC 69532270 INFO llama_model_loader: loaded meta data with 26 key-value pairs and 197 tensors from /Users/grzegorzbielski/cortexcpp/models/huggingface.co/yixuan-chia/snowflake-arctic-embed-m-GGUF/snowflake-arctic-embed-m-F16.gguf (version GGUF V3 (latest))
- llama_engine.cc:475
20241124 11:21:40.055544 UTC 69532270 INFO llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
- llama_engine.cc:475
20241124 11:21:40.055562 UTC 69532270 INFO llama_model_loader: - kv 0: general.architecture str = bert
- llama_engine.cc:475
20241124 11:21:40.055577 UTC 69532270 INFO llama_model_loader: - kv 1: general.type str = model
- llama_engine.cc:475
20241124 11:21:40.055592 UTC 69532270 INFO llama_model_loader: - kv 2: general.name str = Snowflake Arctic Embed M
- llama_engine.cc:475
20241124 11:21:40.055607 UTC 69532270 INFO llama_model_loader: - kv 3: general.size_label str = 109M
- llama_engine.cc:475
20241124 11:21:40.055621 UTC 69532270 INFO llama_model_loader: - kv 4: general.license str = apache-2.0
- llama_engine.cc:475
20241124 11:21:40.055640 UTC 69532270 INFO llama_model_loader: - kv 5: general.tags arr[str,8] = ["sentence-transformers", "feature-ex...
- llama_engine.cc:475
20241124 11:21:40.055655 UTC 69532270 INFO llama_model_loader: - kv 6: bert.block_count u32 = 12
- llama_engine.cc:475
20241124 11:21:40.055669 UTC 69532270 INFO llama_model_loader: - kv 7: bert.context_length u32 = 512
- llama_engine.cc:475
20241124 11:21:40.055683 UTC 69532270 INFO llama_model_loader: - kv 8: bert.embedding_length u32 = 768
- llama_engine.cc:475
20241124 11:21:40.055697 UTC 69532270 INFO llama_model_loader: - kv 9: bert.feed_forward_length u32 = 3072
- llama_engine.cc:475
20241124 11:21:40.055712 UTC 69532270 INFO llama_model_loader: - kv 10: bert.attention.head_count u32 = 12
- llama_engine.cc:475
20241124 11:21:40.055728 UTC 69532270 INFO llama_model_loader: - kv 11: bert.attention.layer_norm_epsilon f32 = 0.000000
- llama_engine.cc:475
20241124 11:21:40.055746 UTC 69532270 INFO llama_model_loader: - kv 12: general.file_type u32 = 1
- llama_engine.cc:475
20241124 11:21:40.055761 UTC 69532270 INFO llama_model_loader: - kv 13: bert.attention.causal bool = false
- llama_engine.cc:475
20241124 11:21:40.055776 UTC 69532270 INFO llama_model_loader: - kv 14: bert.pooling_type u32 = 2
- llama_engine.cc:475
20241124 11:21:40.055792 UTC 69532270 INFO llama_model_loader: - kv 15: tokenizer.ggml.token_type_count u32 = 2
- llama_engine.cc:475
20241124 11:21:40.055808 UTC 69532270 INFO llama_model_loader: - kv 16: tokenizer.ggml.model str = bert
- llama_engine.cc:475
20241124 11:21:40.055823 UTC 69532270 INFO llama_model_loader: - kv 17: tokenizer.ggml.pre str = jina-v2-en
- llama_engine.cc:475
20241124 11:21:40.060101 UTC 69532270 INFO llama_model_loader: - kv 18: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "...
- llama_engine.cc:475
20241124 11:21:40.061260 UTC 69532270 INFO llama_model_loader: - kv 19: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
- llama_engine.cc:475
20241124 11:21:40.061277 UTC 69532270 INFO llama_model_loader: - kv 20: tokenizer.ggml.unknown_token_id u32 = 100
- llama_engine.cc:475
20241124 11:21:40.061291 UTC 69532270 INFO llama_model_loader: - kv 21: tokenizer.ggml.seperator_token_id u32 = 102
- llama_engine.cc:475
20241124 11:21:40.061304 UTC 69532270 INFO llama_model_loader: - kv 22: tokenizer.ggml.padding_token_id u32 = 0
- llama_engine.cc:475
20241124 11:21:40.061318 UTC 69532270 INFO llama_model_loader: - kv 23: tokenizer.ggml.cls_token_id u32 = 101
- llama_engine.cc:475
20241124 11:21:40.061332 UTC 69532270 INFO llama_model_loader: - kv 24: tokenizer.ggml.mask_token_id u32 = 103
- llama_engine.cc:475
20241124 11:21:40.061346 UTC 69532270 INFO llama_model_loader: - kv 25: general.quantization_version u32 = 2
- llama_engine.cc:475
20241124 11:21:40.061360 UTC 69532270 INFO llama_model_loader: - type f32: 124 tensors
- llama_engine.cc:475
20241124 11:21:40.061373 UTC 69532270 INFO llama_model_loader: - type f16: 73 tensors
- llama_engine.cc:475
20241124 11:21:40.065239 UTC 69532270 INFO llm_load_vocab: special tokens cache size = 5
- llama_engine.cc:475
20241124 11:21:40.067499 UTC 69532270 INFO llm_load_vocab: token to piece cache size = 0.2032 MB
- llama_engine.cc:475
20241124 11:21:40.067551 UTC 69532270 INFO llm_load_print_meta: format = GGUF V3 (latest)
- llama_engine.cc:475
20241124 11:21:40.067567 UTC 69532270 INFO llm_load_print_meta: arch = bert
- llama_engine.cc:475
20241124 11:21:40.067585 UTC 69532270 INFO llm_load_print_meta: vocab type = WPM
- llama_engine.cc:475
20241124 11:21:40.067598 UTC 69532270 INFO llm_load_print_meta: n_vocab = 30522
- llama_engine.cc:475
20241124 11:21:40.067612 UTC 69532270 INFO llm_load_print_meta: n_merges = 0
- llama_engine.cc:475
20241124 11:21:40.067625 UTC 69532270 INFO llm_load_print_meta: vocab_only = 0
- llama_engine.cc:475
20241124 11:21:40.067638 UTC 69532270 INFO llm_load_print_meta: n_ctx_train = 512
- llama_engine.cc:475
20241124 11:21:40.067653 UTC 69532270 INFO llm_load_print_meta: n_embd = 768
- llama_engine.cc:475
20241124 11:21:40.067667 UTC 69532270 INFO llm_load_print_meta: n_layer = 12
- llama_engine.cc:475
20241124 11:21:40.067685 UTC 69532270 INFO llm_load_print_meta: n_head = 12
- llama_engine.cc:475
20241124 11:21:40.067700 UTC 69532270 INFO llm_load_print_meta: n_head_kv = 12
- llama_engine.cc:475
20241124 11:21:40.067714 UTC 69532270 INFO llm_load_print_meta: n_rot = 64
- llama_engine.cc:475
20241124 11:21:40.067728 UTC 69532270 INFO llm_load_print_meta: n_swa = 0
- llama_engine.cc:475
20241124 11:21:40.067741 UTC 69532270 INFO llm_load_print_meta: n_embd_head_k = 64
- llama_engine.cc:475
20241124 11:21:40.067755 UTC 69532270 INFO llm_load_print_meta: n_embd_head_v = 64
- llama_engine.cc:475
20241124 11:21:40.067769 UTC 69532270 INFO llm_load_print_meta: n_gqa = 1
- llama_engine.cc:475
20241124 11:21:40.067784 UTC 69532270 INFO llm_load_print_meta: n_embd_k_gqa = 768
- llama_engine.cc:475
20241124 11:21:40.067797 UTC 69532270 INFO llm_load_print_meta: n_embd_v_gqa = 768
- llama_engine.cc:475
20241124 11:21:40.067810 UTC 69532270 INFO llm_load_print_meta: f_norm_eps = 1.0e-12
- llama_engine.cc:475
20241124 11:21:40.067822 UTC 69532270 INFO llm_load_print_meta: f_norm_rms_eps = 0.0e+00
- llama_engine.cc:475
20241124 11:21:40.067834 UTC 69532270 INFO llm_load_print_meta: f_clamp_kqv = 0.0e+00
- llama_engine.cc:475
20241124 11:21:40.067846 UTC 69532270 INFO llm_load_print_meta: f_max_alibi_bias = 0.0e+00
- llama_engine.cc:475
20241124 11:21:40.067863 UTC 69532270 INFO llm_load_print_meta: f_logit_scale = 0.0e+00
- llama_engine.cc:475
20241124 11:21:40.067876 UTC 69532270 INFO llm_load_print_meta: n_ff = 3072
- llama_engine.cc:475
20241124 11:21:40.067889 UTC 69532270 INFO llm_load_print_meta: n_expert = 0
- llama_engine.cc:475
20241124 11:21:40.067903 UTC 69532270 INFO llm_load_print_meta: n_expert_used = 0
- llama_engine.cc:475
20241124 11:21:40.067917 UTC 69532270 INFO llm_load_print_meta: causal attn = 0
- llama_engine.cc:475
20241124 11:21:40.067931 UTC 69532270 INFO llm_load_print_meta: pooling type = 2
- llama_engine.cc:475
20241124 11:21:40.067945 UTC 69532270 INFO llm_load_print_meta: rope type = 2
- llama_engine.cc:475
20241124 11:21:40.067958 UTC 69532270 INFO llm_load_print_meta: rope scaling = linear
- llama_engine.cc:475
20241124 11:21:40.067972 UTC 69532270 INFO llm_load_print_meta: freq_base_train = 10000.0
- llama_engine.cc:475
20241124 11:21:40.067986 UTC 69532270 INFO llm_load_print_meta: freq_scale_train = 1
- llama_engine.cc:475
20241124 11:21:40.068000 UTC 69532270 INFO llm_load_print_meta: n_ctx_orig_yarn = 512
- llama_engine.cc:475
20241124 11:21:40.068014 UTC 69532270 INFO llm_load_print_meta: rope_finetuned = unknown
- llama_engine.cc:475
20241124 11:21:40.068027 UTC 69532270 INFO llm_load_print_meta: ssm_d_conv = 0
- llama_engine.cc:475
20241124 11:21:40.068039 UTC 69532270 INFO llm_load_print_meta: ssm_d_inner = 0
- llama_engine.cc:475
20241124 11:21:40.068051 UTC 69532270 INFO llm_load_print_meta: ssm_d_state = 0
- llama_engine.cc:475
20241124 11:21:40.068063 UTC 69532270 INFO llm_load_print_meta: ssm_dt_rank = 0
- llama_engine.cc:475
20241124 11:21:40.068075 UTC 69532270 INFO llm_load_print_meta: ssm_dt_b_c_rms = 0
- llama_engine.cc:475
20241124 11:21:40.068090 UTC 69532270 INFO llm_load_print_meta: model type = 109M
- llama_engine.cc:475
20241124 11:21:40.068108 UTC 69532270 INFO llm_load_print_meta: model ftype = F16
- llama_engine.cc:475
20241124 11:21:40.068120 UTC 69532270 INFO llm_load_print_meta: model params = 108.89 M
- llama_engine.cc:475
20241124 11:21:40.068136 UTC 69532270 INFO llm_load_print_meta: model size = 208.68 MiB (16.08 BPW)
- llama_engine.cc:475
20241124 11:21:40.068149 UTC 69532270 INFO llm_load_print_meta: general.name = Snowflake Arctic Embed M
- llama_engine.cc:475
20241124 11:21:40.068162 UTC 69532270 INFO llm_load_print_meta: UNK token = 100 '[UNK]'
- llama_engine.cc:475
20241124 11:21:40.068175 UTC 69532270 INFO llm_load_print_meta: SEP token = 102 '[SEP]'
- llama_engine.cc:475
20241124 11:21:40.068188 UTC 69532270 INFO llm_load_print_meta: PAD token = 0 '[PAD]'
- llama_engine.cc:475
20241124 11:21:40.068202 UTC 69532270 INFO llm_load_print_meta: CLS token = 101 '[CLS]'
- llama_engine.cc:475
20241124 11:21:40.068215 UTC 69532270 INFO llm_load_print_meta: MASK token = 103 '[MASK]'
- llama_engine.cc:475
20241124 11:21:40.068228 UTC 69532270 INFO llm_load_print_meta: LF token = 0 '[PAD]'
- llama_engine.cc:475
20241124 11:21:40.068242 UTC 69532270 INFO llm_load_print_meta: max token length = 21
- llama_engine.cc:475
20241124 11:21:40.070025 UTC 69532270 INFO llm_load_tensors: offloading 12 repeating layers to GPU
- llama_engine.cc:475
20241124 11:21:40.070056 UTC 69532270 INFO llm_load_tensors: offloading output layer to GPU
- llama_engine.cc:475
20241124 11:21:40.070071 UTC 69532270 INFO llm_load_tensors: offloaded 13/13 layers to GPU
- llama_engine.cc:475
20241124 11:21:40.070087 UTC 69532270 INFO llm_load_tensors: Metal_Mapped model buffer size = 162.46 MiB
- llama_engine.cc:475
20241124 11:21:40.070101 UTC 69532270 INFO llm_load_tensors: CPU_Mapped model buffer size = 46.22 MiB
- llama_engine.cc:475
20241124 11:21:40.070124 UTC 69532270 INFO . - llama_engine.cc:475
[... 54 more identical "INFO ." progress lines trimmed ...]
20241124 11:21:40.071012 UTC 69532270 INFO llama_new_context_with_model: n_seq_max = 1
- llama_engine.cc:475
20241124 11:21:40.071023 UTC 69532270 INFO llama_new_context_with_model: n_ctx = 512
- llama_engine.cc:475
20241124 11:21:40.071036 UTC 69532270 INFO llama_new_context_with_model: n_ctx_per_seq = 512
- llama_engine.cc:475
20241124 11:21:40.071046 UTC 69532270 INFO llama_new_context_with_model: n_batch = 2048
- llama_engine.cc:475
20241124 11:21:40.071056 UTC 69532270 INFO llama_new_context_with_model: n_ubatch = 2048
- llama_engine.cc:475
20241124 11:21:40.071066 UTC 69532270 INFO llama_new_context_with_model: flash_attn = 1
- llama_engine.cc:475
20241124 11:21:40.071076 UTC 69532270 INFO llama_new_context_with_model: freq_base = 10000.0
- llama_engine.cc:475
20241124 11:21:40.071087 UTC 69532270 INFO llama_new_context_with_model: freq_scale = 1
- llama_engine.cc:475
20241124 11:21:40.071097 UTC 69532270 INFO ggml_metal_init: allocating
- llama_engine.cc:475
20241124 11:21:40.071112 UTC 69532270 INFO ggml_metal_init: found device: Apple M1 Max
- llama_engine.cc:475
20241124 11:21:40.071128 UTC 69532270 INFO ggml_metal_init: picking default device: Apple M1 Max
- llama_engine.cc:475
20241124 11:21:40.071997 UTC 69532270 INFO ggml_metal_init: using embedded metal library
- llama_engine.cc:475
20241124 11:21:40.076377 UTC 69532270 INFO ggml_metal_init: GPU name: Apple M1 Max
- llama_engine.cc:475
20241124 11:21:40.076396 UTC 69532270 INFO ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
- llama_engine.cc:475
20241124 11:21:40.076409 UTC 69532270 INFO ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
- llama_engine.cc:475
20241124 11:21:40.076421 UTC 69532270 INFO ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
- llama_engine.cc:475
20241124 11:21:40.076432 UTC 69532270 INFO ggml_metal_init: simdgroup reduction = true
- llama_engine.cc:475
20241124 11:21:40.076443 UTC 69532270 INFO ggml_metal_init: simdgroup matrix mul. = true
- llama_engine.cc:475
20241124 11:21:40.076455 UTC 69532270 INFO ggml_metal_init: has bfloat = true
- llama_engine.cc:475
20241124 11:21:40.076466 UTC 69532270 INFO ggml_metal_init: use bfloat = false
- llama_engine.cc:475
20241124 11:21:40.076477 UTC 69532270 INFO ggml_metal_init: hasUnifiedMemory = true
- llama_engine.cc:475
20241124 11:21:40.076489 UTC 69532270 INFO ggml_metal_init: recommendedMaxWorkingSetSize = 22906.50 MB
- llama_engine.cc:475
20241124 11:21:40.078759 UTC 69532270 WARN ggml_metal_init: skipping kernel_get_rows_bf16 (not supported)
- llama_engine.cc:473
20241124 11:21:40.079510 UTC 69532270 WARN ggml_metal_init: skipping kernel_mul_mv_bf16_f32 (not supported)
- llama_engine.cc:473
20241124 11:21:40.079530 UTC 69532270 WARN ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row (not supported)
- llama_engine.cc:473
20241124 11:21:40.079540 UTC 69532270 WARN ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4 (not supported)
- llama_engine.cc:473
20241124 11:21:40.079550 UTC 69532270 WARN ggml_metal_init: skipping kernel_mul_mv_bf16_bf16 (not supported)
- llama_engine.cc:473
20241124 11:21:40.080422 UTC 69532270 WARN ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32 (not supported)
- llama_engine.cc:473
20241124 11:21:40.081204 UTC 69532270 WARN ggml_metal_init: skipping kernel_mul_mm_bf16_f32 (not supported)
- llama_engine.cc:473
20241124 11:21:40.081922 UTC 69532270 WARN ggml_metal_init: skipping kernel_mul_mm_id_bf16_f32 (not supported)
- llama_engine.cc:473
20241124 11:21:40.083301 UTC 69532270 WARN ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64 (not supported)
- llama_engine.cc:473
20241124 11:21:40.083315 UTC 69532270 WARN ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80 (not supported)
- llama_engine.cc:473
20241124 11:21:40.083324 UTC 69532270 WARN ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96 (not supported)
- llama_engine.cc:473
20241124 11:21:40.083334 UTC 69532270 WARN ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112 (not supported)
- llama_engine.cc:473
20241124 11:21:40.083344 UTC 69532270 WARN ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128 (not supported)
- llama_engine.cc:473
20241124 11:21:40.083354 UTC 69532270 WARN ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256 (not supported)
- llama_engine.cc:473
20241124 11:21:40.084852 UTC 69532270 WARN ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128 (not supported)
- llama_engine.cc:473
20241124 11:21:40.085091 UTC 69532270 WARN ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256 (not supported)
- llama_engine.cc:473
20241124 11:21:40.085347 UTC 69532270 WARN ggml_metal_init: skipping kernel_cpy_f32_bf16 (not supported)
- llama_engine.cc:473
20241124 11:21:40.085418 UTC 69532270 WARN ggml_metal_init: skipping kernel_cpy_bf16_f32 (not supported)
- llama_engine.cc:473
20241124 11:21:40.085429 UTC 69532270 WARN ggml_metal_init: skipping kernel_cpy_bf16_bf16 (not supported)
- llama_engine.cc:473
20241124 11:21:40.087354 UTC 69532270 INFO llama_kv_cache_init: Metal KV buffer size = 18.00 MiB
- llama_engine.cc:475
20241124 11:21:40.087379 UTC 69532270 INFO llama_new_context_with_model: KV self size = 18.00 MiB, K (f16): 9.00 MiB, V (f16): 9.00 MiB
- llama_engine.cc:475
20241124 11:21:40.087400 UTC 69532270 INFO llama_new_context_with_model: CPU output buffer size = 0.12 MiB
- llama_engine.cc:475
20241124 11:21:40.088241 UTC 69532270 INFO llama_new_context_with_model: Metal compute buffer size = 19.50 MiB
- llama_engine.cc:475
20241124 11:21:40.088261 UTC 69532270 INFO llama_new_context_with_model: CPU compute buffer size = 4.00 MiB
- llama_engine.cc:475
20241124 11:21:40.088270 UTC 69532270 INFO llama_new_context_with_model: graph nodes = 429
- llama_engine.cc:475
20241124 11:21:40.088279 UTC 69532270 INFO llama_new_context_with_model: graph splits = 2
- llama_engine.cc:475
/Users/runner/work/cortex.llamacpp/cortex.llamacpp/llama.cpp/src/llama.cpp:17453: GGML_ASSERT(strcmp(res->name, "result_output") == 0 && "missing result_output tensor") failed
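My reading of the assertion (an interpretation, not confirmed by the maintainers): the engine ran the model through the text-generation path, which expects a final result_output (logits) tensor, while BERT-style embedding graphs have no output head and end in a pooling step instead; the metadata above shows bert.pooling_type u32 = 2, i.e. mean pooling. A minimal, pure-Python sketch of that pooling step, assuming per-token hidden states have already been computed:

```python
# Illustration only: mean pooling (bert.pooling_type = 2) averages the
# per-token hidden states, skipping padding positions, to produce one
# fixed-size embedding per input. Values here are synthetic.
def mean_pool(hidden, mask):
    """hidden: list of per-token vectors; mask: 1 for real tokens, 0 for [PAD]."""
    n_embd = len(hidden[0])
    total = [0.0] * n_embd
    count = 0
    for vec, keep in zip(hidden, mask):
        if keep:
            count += 1
            for i, v in enumerate(vec):
                total[i] += v
    return [t / count for t in total]

# 4 tokens, n_embd = 768 as reported in the log; the last token is padding.
hidden = [[1.0] * 768 for _ in range(4)]
mask = [1, 1, 1, 0]
embedding = mean_pool(hidden, mask)
```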
What is your OS?
- Mac Silicon
What engine are you running?
- cortex.llamacpp (default)
Hardware Specs (e.g. OS version, GPU)
Apple M1 Max, Sonoma 14.7