diff --git a/examples/Context_summarization_with_realtime_api.ipynb b/examples/Context_summarization_with_realtime_api.ipynb index fd2b344cb8..b0c418d8ee 100644 --- a/examples/Context_summarization_with_realtime_api.ipynb +++ b/examples/Context_summarization_with_realtime_api.ipynb @@ -333,6 +333,14 @@ " return resp.choices[0].message.content.strip()" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Important implementation detail:\n", + "- The summary is appended as a SYSTEM message rather than an ASSISTANT message. Testing revealed that, during extended conversations, using ASSISTANT messages for summaries can cause the model to mistakenly switch from audio responses to text responses. By using SYSTEM messages for summaries (which can also include additional custom instructions), we clearly signal to the model that these are context-setting instructions, preventing it from incorrectly adopting the modality of the ongoing user-assistant interaction." + ] + }, { "cell_type": "code", "execution_count": 11, @@ -367,8 +375,8 @@ " \"item\": {\n", " \"id\": summary_id,\n", " \"type\": \"message\",\n", - " \"role\": \"assistant\",\n", - " \"content\": [{\"type\": \"text\", \"text\": summary_text}],\n", + " \"role\": \"system\",\n", + " \"content\": [{\"type\": \"input_text\", \"text\": summary_text}],\n", " },\n", " }))\n", "\n", diff --git a/examples/Prompt_migration_guide.ipynb b/examples/Prompt_migration_guide.ipynb new file mode 100644 index 0000000000..33800c6c29 --- /dev/null +++ b/examples/Prompt_migration_guide.ipynb @@ -0,0 +1,891 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "a492f6a4", + "metadata": {}, + "source": [ + "# Prompt Migration Guide\n", + "Newer models, such as GPT-4.1, are best in class in performance and instruction following. As model gets smarter, there is a consistent need to adapt prompts that were originally tailored to earlier models' limitations, ensuring they remain effective and clear for newer generations.\n", + "\n", + "Models such as GPT‑4.1 excel at closely following instructions, but this precision means it can interpret unclear or poorly phrased instructions **literally**, leading to unexpected or incorrect results. To leverage GPT‑4.1's full potential, it's essential to refine prompts, ensuring each instruction is explicit, unambiguous, and aligned with your intended outcomes.\n", + "\n", + "---\n", + "\n", + "Example of Unclear Instructions:\n", + "\n", + "- Ambiguous:\n", + "\n", + "> \"\"Do not include irrelevant information.\"\"\n", + "\n", + "Issue: GPT-4.1 might struggle to determine what is \"irrelevant\" if not explicitly defined. This could cause it to omit essential details due to overly cautious interpretation or include too much detail inadvertently..\n", + "\n", + "- Improved:\n", + "\n", + "> \"Only include facts directly related to the main topic (X). Exclude personal anecdotes, unrelated historical context, or side discussions.\"\n", + "\n", + "---\n", + "\n", + "**Objective**: This interactive notebook helps you improve an existing prompt (written for another model) into one that is clear, unambiguous and optimised for GPT‑4.1 following best practices.\n", + "\n", + "**Workflow Overview** \n", + "This notebook uses the following approach:\n", + "\n", + "- [Step 1. Input your original prompt](#step-1-input-your-original-prompt) \n", + "- [Step 2. Identify all instructions in your prompt](#step-2-identify-all-instructions-in-your-prompt) \n", + "- [Step 3. 
Ask GPT-4.1 to *critique* the prompt](#step-3-ask-gpt-4-1-to-critique-the-prompt) \n", + "- [Step 4. Auto-generate a revised system prompt](#step-4-auto-generate-a-revised-system-prompt) \n", + "- [Step 5. Evaluate and iterate](#step-5-evaluate-and-iterate) \n", + "- [Step 6. (Optional) Automatically apply GPT-4.1 best practices](#step-6-optional-automatically-apply-gpt-4-1-best-practices)\n", + "\n", + "**Prerequisites**\n", + "- The `openai` Python package and `OPENAI_API_KEY`" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "7e0c4360", + "metadata": {}, + "outputs": [], + "source": [ + "# !pip install openai pydantic tiktoken" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "819c46c4", + "metadata": {}, + "outputs": [], + "source": [ + "# Imports & API connection\n", + "from openai import OpenAI\n", + "from pydantic import BaseModel, Field\n", + "from typing import Any, Dict, Iterable, List, Optional\n", + "import tiktoken\n", + "import html\n", + "from html import escape \n", + "import difflib\n", + "import sys\n", + "\n", + "from IPython.display import display, HTML\n", + "\n", + "try:\n", + " from IPython.display import HTML, display\n", + " _IN_IPYTHON = True\n", + "except ImportError:\n", + " _IN_IPYTHON = False\n", + "\n", + "\n", + "client = OpenAI()\n", + "\n", + "MODEL = \"gpt-4.1\"" + ] + }, + { + "cell_type": "markdown", + "id": "546cbcd8", + "metadata": {}, + "source": [ + "Below are a few helper functions to enable us to easily review the analysis and modifications on our prompt." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "f924a813", + "metadata": {}, + "outputs": [], + "source": [ + "_COLORS = {\n", + " '+': (\"#d2f5d6\", \"#22863a\"), # additions (green)\n", + " '-': (\"#f8d7da\", \"#b31d28\"), # deletions (red)\n", + " '@': (None, \"#6f42c1\"), # hunk header (purple)\n", + "}\n", + "\n", + "def _css(**rules: str) -> str:\n", + " \"\"\"Convert kwargs to a CSS string (snake_case → kebab-case).\"\"\"\n", + " return \";\".join(f\"{k.replace('_', '-')}: {v}\" for k, v in rules.items())\n", + "\n", + "def _render(html_str: str) -> None:\n", + " \"\"\"Render inside Jupyter if available, else print to stdout.\"\"\"\n", + " try:\n", + " display # type: ignore[name-defined]\n", + " from IPython.display import HTML # noqa: WPS433\n", + " display(HTML(html_str))\n", + " except NameError:\n", + " print(html_str, flush=True)\n", + "\n", + "# ---------- diff helpers ------------------------------------------------------\n", + "\n", + "def _style(line: str) -> str:\n", + " \"\"\"Wrap a diff line in a with optional colors.\"\"\"\n", + " bg, fg = _COLORS.get(line[:1], (None, None))\n", + " css = \";\".join(s for s in (f\"background:{bg}\" if bg else \"\",\n", + " f\"color:{fg}\" if fg else \"\") if s)\n", + " return f'{html.escape(line)}'\n", + "\n", + "def _wrap(lines: Iterable[str]) -> str:\n", + " body = \"
\".join(lines)\n", + " return (\n", + " \"
\"\n", + " \"🕵️‍♂️ Critique & Diff (click to expand)\"\n", + " f'
{body}
'\n", + " \"
\"\n", + " )\n", + "\n", + "def show_critique_and_diff(old: str, new: str) -> str:\n", + " \"\"\"Display & return a GitHub-style HTML diff between *old* and *new*.\"\"\"\n", + " diff = difflib.unified_diff(old.splitlines(), new.splitlines(),\n", + " fromfile=\"old\", tofile=\"new\", lineterm=\"\")\n", + " html_block = _wrap(map(_style, diff))\n", + " _render(html_block)\n", + " return html_block\n", + "\n", + "# ---------- “card” helpers ----------------------------------------------------\n", + "\n", + "CARD = _css(background=\"#f8f9fa\", border_radius=\"8px\", padding=\"18px 22px\",\n", + " margin_bottom=\"18px\", border=\"1px solid #e0e0e0\",\n", + " box_shadow=\"0 1px 4px #0001\")\n", + "TITLE = _css(font_weight=\"600\", font_size=\"1.1em\", color=\"#2d3748\",\n", + " margin_bottom=\"6px\")\n", + "LABEL = _css(color=\"#718096\", font_size=\"0.95em\", font_weight=\"500\",\n", + " margin_right=\"6px\")\n", + "EXTRACT = _css(font_family=\"monospace\", background=\"#f1f5f9\", padding=\"7px 10px\",\n", + " border_radius=\"5px\", display=\"block\", margin_top=\"3px\",\n", + " white_space=\"pre-wrap\", color=\"#1a202c\")\n", + "\n", + "def display_cards(\n", + " items: Iterable[Any],\n", + " *,\n", + " title_attr: str,\n", + " field_labels: Optional[Dict[str, str]] = None,\n", + " card_title_prefix: str = \"Item\",\n", + ") -> None:\n", + " \"\"\"Render objects as HTML “cards” (or plaintext when not in IPython).\"\"\"\n", + " items = list(items)\n", + " if not items:\n", + " _render(\"No data to display.\")\n", + " return\n", + "\n", + " # auto-derive field labels if none supplied\n", + " if field_labels is None:\n", + " sample = items[0]\n", + " field_labels = {\n", + " a: a.replace(\"_\", \" \").title()\n", + " for a in dir(sample)\n", + " if not a.startswith(\"_\")\n", + " and not callable(getattr(sample, a))\n", + " and a != title_attr\n", + " }\n", + "\n", + " cards = []\n", + " for idx, obj in enumerate(items, 1):\n", + " title_html = html.escape(str(getattr(obj, title_attr, \"\")))\n", + " rows = [f'
<div style=\"{TITLE}\">{card_title_prefix} {idx}: {title_html}</div>']\n", + "\n", + " for attr, label in field_labels.items():\n", + " value = getattr(obj, attr, None)\n", + " if value is None:\n", + " continue\n", + " rows.append(\n", + " f'<div><span style=\"{LABEL}\">{html.escape(label)}:</span>'\n", + " f'<span style=\"{EXTRACT}\">{html.escape(str(value))}</span></div>'\n", + " )\n", + "\n", + " cards.append(f'<div style=\"{CARD}\">{\"\".join(rows)}</div>
')\n", + "\n", + " _render(\"\\n\".join(cards))" + ] + }, + { + "cell_type": "markdown", + "id": "f3163f30", + "metadata": {}, + "source": [ + "## Step 1. Input Your Original Prompt\n", + "Begin by providing your existing prompt clearly between triple quotes (\"\"\"). This prompt will serve as the baseline for improvement.\n", + "\n", + "For this example, we will be using the system prompt for LLM-as-a-Judge provided in the following [paper](https://arxiv.org/pdf/2306.05685)." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "28a47cc1", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Original prompt length: 243 tokens\n" + ] + } + ], + "source": [ + "original_prompt = \"\"\"\n", + "[System]\n", + "Please act as an impartial judge and evaluate the quality of the responses provided by two\n", + "AI assistants to the user question displayed below. You should choose the assistant that\n", + "follows the user’s instructions and answers the user’s question better. Your evaluation\n", + "should consider factors such as the helpfulness, relevance, accuracy, depth, creativity,\n", + "and level of detail of their responses. Begin your evaluation by comparing the two\n", + "responses and provide a short explanation. Avoid any position biases and ensure that the\n", + "order in which the responses were presented does not influence your decision. Do not allow\n", + "the length of the responses to influence your evaluation. Do not favor certain names of\n", + "the assistants. Be as objective as possible. After providing your explanation, output your\n", + "final verdict by strictly following this format: \"[[A]]\" if assistant A is better, \"[[B]]\"\n", + "if assistant B is better, and \"[[C]]\" for a tie.\n", + "\n", + "[User Question]\n", + "{question}\n", + "\n", + "[The Start of Assistant A’s Answer]\n", + "{answer_a}\n", + "[The End of Assistant A’s Answer]\n", + "\n", + "[The Start of Assistant B’s Answer]\n", + "{answer_b}\n", + "[The End of Assistant B’s Answer]\n", + "\"\"\"\n", + "\n", + "encoding = tiktoken.encoding_for_model(\"gpt-4\")\n", + "num_tokens = len(encoding.encode(original_prompt))\n", + "print(\"Original prompt length:\", num_tokens, \"tokens\")" + ] + }, + { + "cell_type": "markdown", + "id": "b7cea51e", + "metadata": {}, + "source": [ + "## Step 2. Identify All Instructions in your Prompt\n", + "In this section, we will extract every INSTRUCTION that the LLM identifies within the system prompt. This allows you to review the list, spot any statements that should not be instructions, and clarify any that are ambiguous.\n", + "\n", + "Carefully review and confirm that each listed instruction is both accurate and essential to retain." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "ae7cbbce", + "metadata": {}, + "outputs": [], + "source": [ + "class Instruction(BaseModel):\n", + " instruction_title: str = Field(description=\"A 2-8 word title of the instruction that the LLM has to follow.\")\n", + " extracted_instruction: str = Field(description=\"The exact text that was extracted from the system prompt that the instruction is derived from.\")\n", + "\n", + "class InstructionList(BaseModel):\n", + " instructions: list[Instruction] = Field(description=\"A list of instructions and their corresponding extracted text that the LLM has to follow.\")\n", + "\n", + "\n", + "EXTRACT_INSTRUCTIONS_SYSTEM_PROMPT = \"\"\"\n", + "## Role & Objective\n", + "You are an **Instruction-Extraction Assistant**. 
\n", + "Your job is to read a System Prompt provided by the user and distill the **mandatory instructions** the target LLM must obey.\n", + "\n", + "## Instructions\n", + "1. **Identify Mandatory Instructions** \n", + " • Locate every instruction in the System Prompt that the LLM is explicitly required to follow. \n", + " • Ignore suggestions, best-practice tips, or optional guidance.\n", + "\n", + "2. **Generate Rules** \n", + " • Re-express each mandatory instruction as a clear, concise rule.\n", + " • Provide the extracted text that the instruction is derived from.\n", + " • Each rule must be standalone and imperative.\n", + "\n", + "## Output Format\n", + "Return a json object with a list of instructions which contains an instruction_title and their corresponding extracted text that the LLM has to follow. Do not include any other text or comments.\n", + "\n", + "## Constraints\n", + "- Include **only** rules that the System Prompt explicitly enforces. \n", + "- Omit any guidance that is merely encouraged, implied, or optional. \n", + "\"\"\"\n", + "\n", + "response = client.responses.parse(\n", + " model=MODEL,\n", + " input=\"SYSTEM_PROMPT TO ANALYZE: \" + original_prompt,\n", + " instructions=EXTRACT_INSTRUCTIONS_SYSTEM_PROMPT,\n", + " temperature=0.0,\n", + " text_format=InstructionList,\n", + ")\n", + "\n", + "instructions_list = response.output_parsed" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "0985a544", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
Instruction 1: Act as an impartial judge
Extracted Text: Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below.
\n", + "
Instruction 2: Choose the better assistant
Extracted Text: You should choose the assistant that follows the user’s instructions and answers the user’s question better.
\n", + "
Instruction 3: Consider specific evaluation factors
Extracted Text: Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses.
\n", + "
Instruction 4: Begin with a comparison and explanation
Extracted Text: Begin your evaluation by comparing the two responses and provide a short explanation.
\n", + "
Instruction 5: Avoid position biases
Extracted Text: Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision.
\n", + "
Instruction 6: Do not let response length influence evaluation
Extracted Text: Do not allow the length of the responses to influence your evaluation.
\n", + "
Instruction 7: Do not favor assistant names
Extracted Text: Do not favor certain names of the assistants.
\n", + "
Instruction 8: Be objective
Extracted Text: Be as objective as possible.
\n", + "
Instruction 9: Output final verdict in strict format
Extracted Text: After providing your explanation, output your final verdict by strictly following this format: \"[[A]]\" if assistant A is better, \"[[B]]\" if assistant B is better, and \"[[C]]\" for a tie.
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "display_cards(\n", + " instructions_list.instructions,\n", + " title_attr=\"instruction_title\",\n", + " field_labels={\"extracted_instruction\": \"Extracted Text\"},\n", + " card_title_prefix=\"Instruction\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "93e4347d", + "metadata": {}, + "source": [ + "It's helpful to examine which parts of your prompt the model recognizes as instructions. Instructions are how we \"program\" models using natural language, so it's crucial to ensure they're clear, precise, and correct." + ] + }, + { + "cell_type": "markdown", + "id": "7db52dd3", + "metadata": {}, + "source": [ + "## Step 3. Ask GPT-4.1 to *critique* the prompt\n", + "Next, GPT‑4.1 itself will critique the original prompt, specifically identifying areas that may cause confusion or errors:\n", + "\n", + "- Ambiguity: Phrases open to multiple interpretations.\n", + "\n", + "- Lacking Definitions: Labels or terms that are not clearly defined, which may cause the model to infer or guess their intended meaning.\n", + "\n", + "- Conflicting Instructions: Rules or conditions that contradict or overlap.\n", + "\n", + "- Missing Context or Assumptions: Necessary information or context not explicitly provided.\n", + "\n", + "The critique output will be clearly organized, highlighting specific issues along with actionable suggestions for improvement.\n", + "\n", + "Models are really good at **identifying parts of a prompt that they find ambiguous or confusing**. By addressing these issues, we can engineer the instructions to make them clearer and more effective for the model." + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "96af823d", + "metadata": {}, + "outputs": [], + "source": [ + "class CritiqueIssue(BaseModel):\n", + " issue: str\n", + " snippet: str\n", + " explanation: str\n", + " suggestion: str\n", + "\n", + "class CritiqueIssues(BaseModel):\n", + " issues: List[CritiqueIssue] = Field(..., min_length=1, max_length=6)\n", + " \n", + "CRITIQUE_SYSTEM_PROMPT = \"\"\"\n", + "## Role & Objective \n", + "You are a **Prompt-Critique Assistant**.\n", + "Examine a user-supplied LLM prompt (targeting GPT-4.1 or compatible) and surface any weaknesses.\n", + "\n", + "## Instructions\n", + "Check for the following issues:\n", + "- Ambiguity: Could any wording be interpreted in more than one way?\n", + "- Lacking Definitions: Are there any class labels, terms, or concepts that are not defined that might be misinterpreted by an LLM?\n", + "- Conflicting, missing, or vague instructions: Are directions incomplete or contradictory?\n", + "- Unstated assumptions: Does the prompt assume the model has to be able to do something that is not explicitly stated?\n", + "\n", + "## Do **NOT** list issues of the following types:\n", + "- Invent new instructions, tool calls, or external information. You do not know what tools need to be added that are missing.\n", + "- Issues that you are not sure about.\n", + "\n", + "## Output Format \n", + "Return a JSON **array** (not an object) with 1-6 items, each following this schema:\n", + "\n", + "```json\n", + "{\n", + " \"issue\": \"<1-6 word label>\",\n", + " \"snippet\": \"<≤50-word excerpt>\",\n", + " \"explanation\":\"\",\n", + " \"suggestion\": \"\"\n", + "}\n", + "Return a JSON array of these objects. 
If the prompt is already clear, complete, and effective, return an empty list: `[]`.\n", + "\"\"\"\n", + "\n", + "CRITIQUE_USER_PROMPT = f\"\"\"\n", + "Evaluate the following prompt for clarity, completeness, and effectiveness:\n", + "###\n", + "{original_prompt}\n", + "###\n", + "Return your critique using the specified JSON format only.\n", + "\"\"\"\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "d0697e49", + "metadata": {}, + "outputs": [], + "source": [ + "response = client.responses.parse(\n", + " model=MODEL,\n", + " input=[\n", + " {\"role\": \"system\", \"content\": CRITIQUE_SYSTEM_PROMPT},\n", + " {\"role\": \"user\", \"content\": CRITIQUE_USER_PROMPT},\n", + " ],\n", + " temperature=0.0,\n", + " text_format=CritiqueIssues,\n", + ")\n", + "\n", + "critique = response.output_parsed" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "2cfb2877", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
Issue 1: Ambiguous evaluation criteria
Snippet: consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail
Explanation: The prompt lists several evaluation factors but does not define them or explain how to weigh them. This could lead to inconsistent or subjective judgments.
Suggestion: Provide clear definitions for each criterion and specify if any should be prioritized over others.
\n", + "
Issue 2: Unclear handling of ties
Snippet: \"[[C]]\" for a tie
Explanation: The prompt allows for a tie verdict but does not specify under what circumstances a tie is appropriate, which may lead to inconsistent use.
Suggestion: Clarify when a tie should be chosen, e.g., if both responses are equally strong across all criteria.
\n", + "
Issue 3: Potential ambiguity in 'objectivity'
Snippet: Be as objective as possible.
Explanation: The prompt asks for objectivity but does not specify what constitutes objectivity in this context, especially given the subjective nature of some criteria.
Suggestion: Define what is meant by objectivity in this evaluation context, possibly by referencing adherence to the listed criteria.
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "display_cards(\n", + " critique.issues,\n", + " title_attr=\"issue\",\n", + " field_labels={\n", + " \"snippet\": \"Snippet\",\n", + " \"explanation\": \"Explanation\",\n", + " \"suggestion\": \"Suggestion\"\n", + " },\n", + " card_title_prefix=\"Issue\"\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "572d4591", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Issue: Ambiguous evaluation criteria\n", + "Snippet: consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail\n", + "Explanation: The prompt lists several evaluation factors but does not define them or explain how to weigh them. This could lead to inconsistent or subjective judgments.\n", + "Suggestion: Provide clear definitions for each criterion and specify if any should be prioritized over others.\n", + "\n", + "Issue: Unclear handling of ties\n", + "Snippet: \"[[C]]\" for a tie\n", + "Explanation: The prompt allows for a tie verdict but does not specify under what circumstances a tie is appropriate, which may lead to inconsistent use.\n", + "Suggestion: Clarify when a tie should be chosen, e.g., if both responses are equally strong across all criteria.\n", + "\n", + "Issue: Potential ambiguity in 'objectivity'\n", + "Snippet: Be as objective as possible.\n", + "Explanation: The prompt asks for objectivity but does not specify what constitutes objectivity in this context, especially given the subjective nature of some criteria.\n", + "Suggestion: Define what is meant by objectivity in this evaluation context, possibly by referencing adherence to the listed criteria.\n", + "\n" + ] + } + ], + "source": [ + "# Create a string of the issues\n", + "issues_str = \"\\n\".join(\n", + " f\"Issue: {issue.issue}\\nSnippet: {issue.snippet}\\nExplanation: {issue.explanation}\\nSuggestion: {issue.suggestion}\\n\"\n", + " for issue in critique.issues\n", + ")\n", + "\n", + "print(issues_str)" + ] + }, + { + "cell_type": "markdown", + "id": "b6ef8ed9", + "metadata": {}, + "source": [ + "Review the list of issues:\n", + "- If you are satisfied with them, proceed to next step #4. \n", + "- If you believe some issues are not relevant, copy the above text into the next cell and remove those issues. In this case, all three issues make reasonable sense, so we skip this step." + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "54ce81bd", + "metadata": {}, + "outputs": [], + "source": [ + "# issues_str = \"\"\"\n", + "# PLACEHOLDER FOR ISSUES YOU WANT TO CORRECT, DO NOT RUN THIS CELL UNLESS YOU HAVE COPY-PASTED THE ISSUES FROM ABOVE\n", + "# \"\"\"" + ] + }, + { + "cell_type": "markdown", + "id": "acab10b4", + "metadata": {}, + "source": [ + "## Step 4. Auto‑generate a revised *system* prompt\n", + "We now feed the critique back to GPT‑4.1 and ask it to produce an improved version of the original prompt, ready to drop into a `system` role message." + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "9a2980a1", + "metadata": {}, + "outputs": [], + "source": [ + "REVISE_SYSTEM_PROMPT = \"\"\"\n", + "## Role & Objective \n", + "Revise the user’s original prompt to resolve most of the listed issues, while preserving the original wording and structure as much as possible.\n", + "\n", + "## Instructions\n", + "1. Carefully review the original prompt and the list of issues.\n", + "2. 
Apply targeted edits directly addressing the listed issues. The edits should be as minimal as possible while still addressing the issue.\n", + "3. Do not introduce new content or make assumptions beyond the provided information.\n", + "4. Maintain the original structure and format of the prompt.\n", + "\n", + "## Output Format\n", + "Return only the fully revised prompt. Do not include commentary, summaries, or code fences.\n", + "\"\"\"\n", + "\n", + "REVISE_USER_PROMPT = f\"\"\"\n", + "Here is the original prompt:\n", + "---\n", + "{original_prompt}\n", + "---\n", + "\n", + "Here are the issues to fix:\n", + "{issues_str}\n", + "\n", + "Please return **only** the fully revised prompt. Do not include commentary, summaries, or explanations.\n", + "\"\"\"\n" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "f90e43df", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "🔄 Revised prompt:\n", + "------------------\n", + "[System]\n", + "Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user’s instructions and answers the user’s question better. Your evaluation should be based on the following criteria:\n", + "\n", + "- Helpfulness: The extent to which the response addresses the user’s needs and provides useful information.\n", + "- Relevance: How closely the response pertains to the user’s question and instructions.\n", + "- Accuracy: The correctness and factual reliability of the information provided.\n", + "- Depth: The level of insight, explanation, or reasoning demonstrated in the response.\n", + "- Creativity: The originality or resourcefulness shown in addressing the question, where appropriate.\n", + "- Level of Detail: The thoroughness and completeness of the response.\n", + "\n", + "All criteria should be considered equally unless the user’s instructions indicate otherwise. \n", + "\n", + "Begin your evaluation by comparing the two responses according to these criteria and provide a short explanation. Remain impartial by avoiding any position biases and ensuring that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses or the names of the assistants to influence your evaluation.\n", + "\n", + "Be as objective as possible by strictly adhering to the defined criteria above and basing your judgment solely on how well each response meets them.\n", + "\n", + "After providing your explanation, output your final verdict by strictly following this format: \"[[A]]\" if assistant A is better, \"[[B]]\" if assistant B is better, and \"[[C]]\" for a tie. 
Choose \"[[C]]\" only if both responses are equally strong across all criteria.\n", + "\n", + "[User Question]\n", + "{question}\n", + "\n", + "[The Start of Assistant A’s Answer]\n", + "{answer_a}\n", + "[The End of Assistant A’s Answer]\n", + "\n", + "[The Start of Assistant B’s Answer]\n", + "{answer_b}\n", + "[The End of Assistant B’s Answer]\n" + ] + } + ], + "source": [ + "revised_response = client.responses.create(\n", + " model=MODEL,\n", + " input=REVISE_USER_PROMPT,\n", + " instructions=REVISE_SYSTEM_PROMPT,\n", + " temperature=0.0\n", + ")\n", + "\n", + "revised_prompt = revised_response.output_text\n", + "print(\"\\n🔄 Revised prompt:\\n------------------\")\n", + "print(revised_response.output_text)" + ] + }, + { + "cell_type": "markdown", + "id": "cc98a4ff", + "metadata": {}, + "source": [ + "Let's review the changes side-by-side comparison highlighting changes between the improved and refined prompts:" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "id": "a6dc093c", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
🕵️‍♂️ Critique & Diff (click to expand)
--- old
+++ new
@@ -1,15 +1,20 @@
[System]
-Please act as an impartial judge and evaluate the quality of the responses provided by two
-AI assistants to the user question displayed below. You should choose the assistant that
-follows the user’s instructions and answers the user’s question better. Your evaluation
-should consider factors such as the helpfulness, relevance, accuracy, depth, creativity,
-and level of detail of their responses. Begin your evaluation by comparing the two
-responses and provide a short explanation. Avoid any position biases and ensure that the
-order in which the responses were presented does not influence your decision. Do not allow
-the length of the responses to influence your evaluation. Do not favor certain names of
-the assistants. Be as objective as possible. After providing your explanation, output your
-final verdict by strictly following this format: \"[[A]]\" if assistant A is better, \"[[B]]\"
-if assistant B is better, and \"[[C]]\" for a tie.
+Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user’s instructions and answers the user’s question better. Your evaluation should be based on the following criteria:
+
+- Helpfulness: The extent to which the response addresses the user’s needs and provides useful information.
+- Relevance: How closely the response pertains to the user’s question and instructions.
+- Accuracy: The correctness and factual reliability of the information provided.
+- Depth: The level of insight, explanation, or reasoning demonstrated in the response.
+- Creativity: The originality or resourcefulness shown in addressing the question, where appropriate.
+- Level of Detail: The thoroughness and completeness of the response.
+
+All criteria should be considered equally unless the user’s instructions indicate otherwise.
+
+Begin your evaluation by comparing the two responses according to these criteria and provide a short explanation. Remain impartial by avoiding any position biases and ensuring that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses or the names of the assistants to influence your evaluation.
+
+Be as objective as possible by strictly adhering to the defined criteria above and basing your judgment solely on how well each response meets them.
+
+After providing your explanation, output your final verdict by strictly following this format: \"[[A]]\" if assistant A is better, \"[[B]]\" if assistant B is better, and \"[[C]]\" for a tie. Choose \"[[C]]\" only if both responses are equally strong across all criteria.

[User Question]
{question}
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "show_critique_and_diff(original_prompt, revised_prompt)" + ] + }, + { + "cell_type": "markdown", + "id": "edf2edf5", + "metadata": {}, + "source": [ + "## Step 5. Evaluate and iterate\n", + "Finally, evaluate your refined prompt by:\n", + "\n", + "- Testing it with representative evaluation examples or data.\n", + "\n", + "- Analyzing the responses to ensure desired outcomes.\n", + "\n", + "- Iterating through previous steps if further improvements are required.\n", + "\n", + "Consistent testing and refinement ensure your prompts consistently achieve their intended results." + ] + }, + { + "cell_type": "markdown", + "id": "c3ed1776", + "metadata": {}, + "source": [ + "## Step 6. (OPTIONAL) Automatically Apply GPT‑4.1 Best Practices\n", + "\n", + "In this step, GPT-4.1 best practices will be applied automatically to enhance your original prompt. We strongly suggest to manually review the edits made and decide if you want to keep or not.\n", + "\n", + "See the [4.1 Prompting Guide](https://cookbook.openai.com/examples/gpt4-1_prompting_guide) for reference." + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "id": "02951cd9", + "metadata": {}, + "outputs": [], + "source": [ + "BEST_PRACTICES_SYSTEM_PROMPT = \"\"\"\n", + "## Task\n", + "Your task is to take a **Baseline Prompt** (provided by the user) and output a **Revised Prompt** that keeps the original wording and order as intact as possible **while surgically inserting improvements that follow the “GPT‑4.1 Best Practices” reference**.\n", + "\n", + "## How to Edit\n", + "1. **Keep original text** — Only remove something if it directly goes against a best practice. Otherwise, keep the wording, order, and examples as they are.\n", + "2. **Add best practices only when clearly helpful.** If a guideline doesn’t fit the prompt or its use case (e.g., diff‑format guidance on a non‑coding prompt), just leave that part of the prompt unchanged.\n", + "3. **Where to add improvements** (use Markdown `#` headings):\n", + " - At the very top, add *Agentic Reminders* (like Persistence, Tool-calling, or Planning) — only if relevant. Don’t add these if the prompt doesn’t require agentic behavior (agentic means prompts that involve planning or running tools for a while).\n", + " - When adding sections, follow this order if possible. If some sections do not make sense, don't add them:\n", + " 1. `# Role & Objective` \n", + " - State who the model is supposed to be (the role) and what its main goal is.\n", + " 2. `# Instructions` \n", + " - List the steps, rules, or actions the model should follow to complete the task.\n", + " 3. *(Any sub-sections)* \n", + " - Include any extra sections such as sub-instructions, notes or guidelines already in the prompt that don’t fit into the main categories.\n", + " 4. `# Reasoning Steps` \n", + " - Explain the step-by-step thinking or logic the model should use when working through the task.\n", + " 5. `# Output Format` \n", + " - Describe exactly how the answer should be structured or formatted (e.g., what sections to include, how to label things, or what style to use).\n", + " 6. `# Examples` \n", + " - Provide sample questions and answers or sample outputs to show the model what a good response looks like.\n", + " 7. 
`# Context` \n", + " - Supply any background information, retrieved context, or extra details that help the model understand the task better.\n", + " - Don’t introduce new sections that don’t exist in the Baseline Prompt. For example, if there’s no `# Examples` or no `# Context` section, don’t add one.\n", + "4. If the prompt is for long context analysis or long tool use, repeat key Agentic Reminders, Important Reminders and Output Format points at the end.\n", + "5. If there are class labels, evaluation criterias or key concepts, add a definition to each to define them concretely.\n", + "5. Add a chain-of-thought trigger at the end of main instructions (like “Think step by step...”), unless one is already there or it would be repetitive.\n", + "6. For prompts involving tools or sample phrases, add Failure-mode bullets:\n", + " - “If you don’t have enough info to use a tool, ask the user first.”\n", + " - “Vary sample phrases to avoid repetition.”\n", + "7. Match the original tone (formal or casual) in anything you add.\n", + "8. **Only output the full Revised Prompt** — no explanations, comments, or diffs. Do not output \"keep the original...\", you need to fully output the prompt, no shortcuts.\n", + "9. Do not delete any sections or parts that are useful and add value to the prompt and doesn't go against the best practices.\n", + "10. **Self-check before sending:** Make sure there are no typos, duplicated lines, missing headings, or missed steps.\n", + "\n", + "\n", + "## GPT‑4.1 Best Practices Reference \n", + "1. **Persistence reminder**: Explicitly instructs the model to continue working until the user's request is fully resolved, ensuring the model does not stop early.\n", + "2. **Tool‑calling reminder**: Clearly tells the model to use available tools or functions instead of making assumptions or guesses, which reduces hallucinations.\n", + "3. **Planning reminder**: Directs the model to create a step‑by‑step plan and reflect before and after tool calls, leading to more accurate and thoughtful output.\n", + "4. **Scaffold structure**: Requires a consistent and predictable heading order (e.g., Role, Instructions, Output Format) to make prompts easier to maintain.\n", + "5. **Instruction placement (long context)**: Ensures that key instructions are duplicated or placed strategically so they remain visible and effective in very long prompts.\n", + "6. **Chain‑of‑thought trigger**: Adds a phrase that encourages the model to reason step by step, which improves logical and thorough responses.\n", + "7. **Instruction‑conflict hygiene**: Checks for and removes any contradictory instructions, ensuring that the most recent or relevant rule takes precedence.\n", + "8. **Failure‑mode mitigations**: Adds safeguards against common errors, such as making empty tool calls or repeating phrases, to improve reliability.\n", + "9. **Diff / code‑edit format**: Specifies a robust, line‑number‑free diff or code‑edit style for output, making changes clear and easy to apply.\n", + "10. **Label Definitions**: Defines all the key labels or terms that are used in the prompt so that the model knows what they mean.\n", + "\"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "id": "c23ac8b3", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Improved prompt:\n", + "\n", + "# Role & Objective\n", + "You are an impartial judge. 
Your goal is to determine which of two AI assistant answers better fulfills the user’s request.\n", + "\n", + "# Instructions \n", + "Follow the steps below exactly and remain strictly neutral:\n", + "\n", + "1. Read the User Question and both assistant answers in full. \n", + "2. Evaluate each answer against **all** six criteria, treating them with equal weight unless the user explicitly states otherwise:\n", + " • Helpfulness – Does the response address the user’s needs and provide useful information? \n", + " • Relevance – How closely does the response pertain to the user’s question and instructions? \n", + " • Accuracy – Is the information correct and factually reliable? \n", + " • Depth – Does the answer show insight, explanation, or reasoning? \n", + " • Creativity – Is the approach original or resourceful when appropriate? \n", + " • Level of Detail – Is the response thorough and complete? \n", + "3. Stay impartial: \n", + " • Ignore the order in which the answers appear. \n", + " • Ignore the length of each answer. \n", + " • Ignore the assistants’ names. \n", + "4. Make your decision solely on how well each response meets the criteria above. \n", + "5. After your analysis, produce a final verdict using the exact format in the Output Format section.\n", + "\n", + "# Reasoning Steps\n", + "Think step by step:\n", + "1. For each criterion, briefly note strengths and weaknesses for Assistant A. \n", + "2. Repeat for Assistant B. \n", + "3. Compare the two sets of notes criterion by criterion. \n", + "4. Decide which answer is overall superior, or declare a tie if both are equally strong across all criteria.\n", + "\n", + "# Output Format\n", + "First provide a short, objective explanation (1–3 concise paragraphs). \n", + "Then on a new line output only one of the following tokens (without quotes or extra text):\n", + "• [[A]] – if Assistant A is better \n", + "• [[B]] – if Assistant B is better \n", + "• [[C]] – if it is a tie \n", + "\n", + "# Context (inserted at runtime)\n", + "[User Question] \n", + "{question}\n", + "\n", + "[The Start of Assistant A’s Answer] \n", + "{answer_a} \n", + "[The End of Assistant A’s Answer]\n", + "\n", + "[The Start of Assistant B’s Answer] \n", + "{answer_b} \n", + "[The End of Assistant B’s Answer]\n" + ] + } + ], + "source": [ + "best_practices_response = client.responses.create(\n", + " model=\"o3\",\n", + " input=\"BASELINE_PROMPT: \" + revised_prompt,\n", + " instructions=BEST_PRACTICES_SYSTEM_PROMPT,\n", + " reasoning={\"effort\": \"high\"}\n", + ")\n", + "\n", + "improved_prompt = best_practices_response.output_text\n", + "print(\"\\nImproved prompt:\\n\")\n", + "print(improved_prompt)" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "id": "c79c019a", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
🕵️‍♂️ Critique & Diff (click to expand)
--- old
+++ new
@@ -1,28 +1,46 @@
-[System]
-Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user’s instructions and answers the user’s question better. Your evaluation should be based on the following criteria:
+# Role & Objective
+You are an impartial judge. Your goal is to determine which of two AI assistant answers better fulfills the user’s request.

-- Helpfulness: The extent to which the response addresses the user’s needs and provides useful information.
-- Relevance: How closely the response pertains to the user’s question and instructions.
-- Accuracy: The correctness and factual reliability of the information provided.
-- Depth: The level of insight, explanation, or reasoning demonstrated in the response.
-- Creativity: The originality or resourcefulness shown in addressing the question, where appropriate.
-- Level of Detail: The thoroughness and completeness of the response.
+# Instructions
+Follow the steps below exactly and remain strictly neutral:

-All criteria should be considered equally unless the user’s instructions indicate otherwise.
+1. Read the User Question and both assistant answers in full.
+2. Evaluate each answer against **all** six criteria, treating them with equal weight unless the user explicitly states otherwise:
+ • Helpfulness – Does the response address the user’s needs and provide useful information?
+ • Relevance – How closely does the response pertain to the user’s question and instructions?
+ • Accuracy – Is the information correct and factually reliable?
+ • Depth – Does the answer show insight, explanation, or reasoning?
+ • Creativity – Is the approach original or resourceful when appropriate?
+ • Level of Detail – Is the response thorough and complete?
+3. Stay impartial:
+ • Ignore the order in which the answers appear.
+ • Ignore the length of each answer.
+ • Ignore the assistants’ names.
+4. Make your decision solely on how well each response meets the criteria above.
+5. After your analysis, produce a final verdict using the exact format in the Output Format section.

-Begin your evaluation by comparing the two responses according to these criteria and provide a short explanation. Remain impartial by avoiding any position biases and ensuring that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses or the names of the assistants to influence your evaluation.
+# Reasoning Steps
+Think step by step:
+1. For each criterion, briefly note strengths and weaknesses for Assistant A.
+2. Repeat for Assistant B.
+3. Compare the two sets of notes criterion by criterion.
+4. Decide which answer is overall superior, or declare a tie if both are equally strong across all criteria.

-Be as objective as possible by strictly adhering to the defined criteria above and basing your judgment solely on how well each response meets them.
+# Output Format
+First provide a short, objective explanation (1–3 concise paragraphs).
+Then on a new line output only one of the following tokens (without quotes or extra text):
+• [[A]] – if Assistant A is better
+• [[B]] – if Assistant B is better
+• [[C]] – if it is a tie

-After providing your explanation, output your final verdict by strictly following this format: \"[[A]]\" if assistant A is better, \"[[B]]\" if assistant B is better, and \"[[C]]\" for a tie. Choose \"[[C]]\" only if both responses are equally strong across all criteria.
-
-[User Question]
+# Context (inserted at runtime)
+[User Question]
{question}

-[The Start of Assistant A’s Answer]
-{answer_a}
+[The Start of Assistant A’s Answer]
+{answer_a}
[The End of Assistant A’s Answer]

-[The Start of Assistant B’s Answer]
-{answer_b}
+[The Start of Assistant B’s Answer]
+{answer_b}
[The End of Assistant B’s Answer]
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "show_critique_and_diff(revised_prompt, improved_prompt)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "openai", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.8" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/Speech_transcription_methods.ipynb b/examples/Speech_transcription_methods.ipynb index 5c52be698a..a2256300a9 100644 --- a/examples/Speech_transcription_methods.ipynb +++ b/examples/Speech_transcription_methods.ipynb @@ -13,7 +13,9 @@ "\n", "By the end you will be able to select and use the appropriate transcription method for your use use cases.\n", "\n", - "*Note: For simplicity and ease of use, this notebook uses WAV audio files. Real-time microphone streaming (e.g., from web apps or microphones) is not utilized.*" + "*Note:*\n", + "- *This notebook uses WAV audio files for simplicity. It does **not** demonstrate real-time microphone streaming (such as from a web app or direct mic input).*\n", + "- *This notebook uses WebSockets to connect to the Realtime API. Alternatively, you can use WebRTC, see the [OpenAI docs](https://platform.openai.com/docs/guides/realtime#connect-with-webrtc) for details.*" ] }, { diff --git a/registry.yaml b/registry.yaml index fdc78fa2c9..5c1702d6e1 100644 --- a/registry.yaml +++ b/registry.yaml @@ -4,6 +4,16 @@ # should build pages for, and indicates metadata such as tags, creation date and # authors for each page. +- title: Prompt Migration Guide + path: examples/Prompt_migration_guide.ipynb + date: 2025-06-26 + authors: + - minh-hoque + tags: + - prompt + - completions + - responses + - title: Fine-Tuning Techniques - Choosing Between SFT, DPO, and RFT (With a Guide to DPO) path: examples/Fine_tuning_direct_preference_optimization_guide.ipynb date: 2025-06-18 @@ -86,7 +96,7 @@ - evals - reinforcement -- title: Guide to Using the Responses API's MCP Tool +- title: Guide to Using the Responses API's MCP Tool path: examples/mcp/mcp_tool_guide.ipynb date: 2025-05-21 authors: