-   <strong>Browse the Whole Space:</strong> At the top of each task page, see the full range of benchmarked configurations for every model, including different batch sizes, GPU counts, and hardware options. You can <strong>hover oer points</strong> to see exact metrics and configurations. Also, clicking the legend item will open up a <strong>detailed model view</strong>.
+   <strong>Browse the Whole Space:</strong> At the top of each task page, see the full range of benchmarked configurations for every model, including different batch sizes, GPU counts, and hardware options. You can <strong>hover over points</strong> to see exact metrics and configurations. Also, clicking the legend item will open up a <strong>detailed model view</strong>.
    </p>
    <p>
    <strong>Set Constraints:</strong> Use the sliders to specify your latency requirement and energy budget. The leaderboard automatically filters and ranks models by their most energy-efficient configuration that meets your constraints. <strong>Each row on the table is clickable</strong>; click to open the detailed model view.
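The filtering described in that paragraph boils down to a two-step query over the benchmark results: drop configurations that violate the latency or energy constraint, then keep each model's lowest-energy survivor and sort by it. The sketch below is illustrative only; the record schema (`model`, `latency_s`, `energy_j`, and so on) is a hypothetical stand-in, not the leaderboard's actual data format.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkResult:
    """One benchmarked configuration (hypothetical schema for illustration)."""
    model: str
    batch_size: int
    num_gpus: int
    latency_s: float   # time to finish one response
    energy_j: float    # energy consumed per response

def rank_models(results: list[BenchmarkResult],
                max_latency_s: float,
                max_energy_j: float) -> list[BenchmarkResult]:
    """Keep each model's most energy-efficient configuration that meets both
    constraints, then rank models by that configuration's energy."""
    best: dict[str, BenchmarkResult] = {}
    for r in results:
        if r.latency_s > max_latency_s or r.energy_j > max_energy_j:
            continue  # configuration violates a constraint
        if r.model not in best or r.energy_j < best[r.model].energy_j:
            best[r.model] = r
    return sorted(best.values(), key=lambda r: r.energy_j)
```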
    Problem-solving workloads have a distinctive characteristic: <strong>short inputs but very long outputs</strong>. The question itself would typically be brief, but the model generates extensive reasoning chains, sometimes tens of thousands of tokens, to work through the problem.
    </p>
    <p class="text-gray-700 dark:text-gray-300 mb-3">
-   This has noticable implications for energy consumption. Long outputs mean the model's context grows throughout generation, which increases memory consumption and reduces how many responses the server can generate in parallel, or in other words, <strong>smaller batch size</strong>. When fewer requests can be processed together, the GPU's computational capacity is less efficiently utilized, making each token more expensive to generate. Combined with the sheer number of output tokens, <strong>energy per response becomes very large</strong> compared to shorter conversational tasks.
+   This has noticeable implications for energy consumption. Long outputs mean the model's context grows throughout generation, which increases memory consumption and reduces how many responses the server can generate in parallel, or in other words, <strong>smaller batch size</strong>. When fewer requests can be processed together, the GPU's computational capacity is less efficiently utilized, making each token more expensive to generate. Combined with the sheer number of output tokens, <strong>energy per response becomes very large</strong> compared to shorter conversational tasks.
    </p>
    <p class="text-gray-700 dark:text-gray-300">
    This is why reasoning models like DeepSeek-R1 or Qwen3 with thinking mode show dramatically different energy profiles than the same base models in chat mode.
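To make the batching effect above concrete, here is a rough back-of-envelope sketch of how a growing context shrinks the feasible batch size. The memory budget and per-token KV-cache size are assumed round numbers for illustration, not measured values from the leaderboard.

```python
def max_batch_size(kv_cache_budget_bytes: int,
                   kv_bytes_per_token: int,
                   context_tokens: int) -> int:
    """How many requests fit when each one holds `context_tokens` of KV cache."""
    return kv_cache_budget_bytes // (context_tokens * kv_bytes_per_token)

# Illustrative assumptions: 40 GB of GPU memory left for the KV cache after
# model weights, and roughly 160 KB of KV cache per token.
BUDGET = 40 * 10**9
KV_PER_TOKEN = 160_000

print(max_batch_size(BUDGET, KV_PER_TOKEN, 1_000))   # chat-length context: 250
print(max_batch_size(BUDGET, KV_PER_TOKEN, 32_000))  # long reasoning trace: 7
```

With only a handful of requests sharing each forward pass, the fixed cost of streaming the model weights through the GPU is amortized over far fewer tokens, which is what pushes energy per token, and ultimately energy per response, up.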
    Text conversation represents a large portion of daily LLM usage today. Both inputs and outputs are relatively short: a typical user message is a sentence or two, and responses are a few paragraphs at most.
    </p>
    <p class="text-gray-700 dark:text-gray-300 mb-3">
-   One distinct aspect of conversational AI -- be it text-based or audio-based -- is that a human user is sittin in front of the service, reading or listening to the generated content. However, users do not read or listen at infinitely fast speeds; this creates a natural <strong>loose latency deadline</strong> for generating output tokens, and the server can increase batch size a lot to improve energy-efficiency without negatively impacting user experience. [This paper](https://arxiv.org/abs/2404.16283) explores user experience in more detail.
+   One distinct aspect of conversational AI, be it text-based or audio-based, is that a human user is sitting in front of the service, reading or listening to the generated content. However, users do not read or listen at infinitely fast speeds; this creates a natural <strong>loose latency deadline</strong> for generating output tokens, and the server can increase batch size a lot to improve energy-efficiency without negatively impacting user experience. <a href="https://arxiv.org/abs/2404.16283" target="_blank" rel="noopener noreferrer" class="text-blue-600 dark:text-blue-400 hover:underline">This paper</a> explores user experience in more detail.
    </p>
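The "loose latency deadline" can be made concrete with a quick estimate. The reading speed and tokens-per-word figures below are rough assumptions for illustration, not numbers taken from the linked paper.

```python
# Rough reading-speed estimate: ~240 words per minute, ~1.3 tokens per word.
WORDS_PER_MIN = 240
TOKENS_PER_WORD = 1.3

reading_rate_tps = WORDS_PER_MIN / 60 * TOKENS_PER_WORD   # ~5.2 tokens/s
deadline_ms_per_token = 1000 / reading_rate_tps            # ~190 ms/token

print(f"Users consume roughly {reading_rate_tps:.1f} tokens/s, so any "
      f"time per output token below ~{deadline_ms_per_token:.0f} ms goes unnoticed.")
```

At small batch sizes, serving engines typically generate tokens far faster than people read, so that slack can be traded for larger batches and lower energy per token.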
    <p class="text-gray-700 dark:text-gray-300">
    Another difference is that the conversation history accumulates over <strong>multiple turns</strong>, so the model often sees a longer context of accumulated past conversations, rather than just the latest message. However, efficient <strong>prefix caching</strong> allows the server to avoid repeatedly processing the full conversation history for each request, allowing faster and more efficient response generation.
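One way to picture prefix caching is a lookup table keyed by the token prefix that produced a given KV-cache state, so each new turn only needs to prefill the tokens appended since the previous turn. This is a toy sketch of the idea, not how any particular serving engine implements it.

```python
class PrefixCache:
    """Toy prefix cache: maps a token prefix to its (stand-in) KV-cache state."""

    def __init__(self):
        self._cache: dict[tuple[int, ...], object] = {}

    def longest_cached_prefix(self, tokens: list[int]) -> int:
        """Return how many leading tokens already have cached KV state."""
        for end in range(len(tokens), 0, -1):
            if tuple(tokens[:end]) in self._cache:
                return end
        return 0

    def store(self, tokens: list[int], kv_state: object) -> None:
        self._cache[tuple(tokens)] = kv_state


cache = PrefixCache()
turn_1 = [1, 2, 3, 4]                # first user turn + response (token IDs)
cache.store(turn_1, kv_state="kv-for-turn-1")

turn_2 = turn_1 + [5, 6]             # accumulated history + the new user message
cached = cache.longest_cached_prefix(turn_2)
print(f"Prefill only {len(turn_2) - cached} new tokens instead of {len(turn_2)}.")
```

Real implementations cache at block granularity and can share cached blocks across requests, but the effect is the same: the accumulated history is processed once rather than once per turn.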