Dynamic Quantization in OpenVINO #25075
Nikitha-Shreyaa asked this question in Q&A (unanswered)
I am trying to perform inference for a few LLM models using OpenVINO. My machine supports bfloat16 by default, and when I checked the logs during inference, the models were indeed running in bfloat16. To force inference at fp32 I set "INFERENCE_PRECISION_HINT":"f32".
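For reference, this is a minimal sketch of how that hint can be passed when compiling a model with the OpenVINO runtime API (the model path here is just a placeholder):

```python
import openvino as ov

core = ov.Core()
model = core.read_model("model.xml")  # placeholder path to the IR

# Default compilation: on a bf16-capable CPU, inference runs in bf16
compiled_default = core.compile_model(model, "CPU")

# Force f32 inference precision via the hint
compiled_f32 = core.compile_model(model, "CPU", {"INFERENCE_PRECISION_HINT": "f32"})
```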
Now I am trying to perform dynamic quantization on the "distilbert-base-uncased-finetuned-sst-2-english" model. After obtaining the int8 weight-compressed model, should I load it as in case 1 or as in case 2? When I checked the logs for both cases, in case 1 I could not see anything related to dynamic quantization, but in case 2 I did. Below is the part of the log where my doubt arises:
With "INFERENCE_PRECISION_HINT":"f32" set:
```
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_vnni,forward_inference,src_f32::blocked:ab::f0 wei_u8:a:blocked:AB4b32a4b::f0 bia_f32::blocked:a::f0 dst_f32::blocked:ab::f0,attr-scratchpad:user attr-scales:wei:1 attr-zero-points:wei:1 src_dyn_quant_group_size:32;,,mb6ic768oc768,0.0251465
```
Without setting "INFERENCE_PRECISION_HINT":"f32":
```
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_bf16,forward_inference,src_bf16::blocked:ab::f0 wei_u8:a:blocked:AB16b64a::f0 bia_bf16::blocked:a::f0 dst_bf16::blocked:ab::f0,attr-scratchpad:user attr-scales:wei:1 attr-zero-points:wei:1 ,,mb6ic768oc768,0.0180664
```
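In case it helps, the two loads differ roughly as sketched below. This sketch assumes the model is loaded through optimum-intel's OVModelForSequenceClassification and its ov_config argument; the class and path are placeholders for illustration, not my exact code.

```python
from optimum.intel import OVModelForSequenceClassification

model_dir = "distilbert-sst2-int8-ov"  # placeholder: directory with the int8 weight-compressed model

# Without the hint: on a bf16-capable CPU the network runs in bf16
# (matches the second log line, which shows no dynamic-quantization attributes)
model_default = OVModelForSequenceClassification.from_pretrained(model_dir)

# With the hint: execution is forced to f32
# (matches the first log line, which shows src_dyn_quant_group_size:32)
model_f32 = OVModelForSequenceClassification.from_pretrained(
    model_dir,
    ov_config={"INFERENCE_PRECISION_HINT": "f32"},
)
```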
Does dynamic quantization have to be done from fp32 only, and not from bfloat16?
code.txt