-
Hey @afg1, great question. Substring can be a bit expensive, but I think we have a path forward for optimizing it. Out of curiosity, how large were the documents (rough # of words is fine, tokens even better) when you started to run into issues? As a very short-term recommendation, if you're OK with the supporting evidence falling on specific boundaries (e.g. a sentence or a set of sentences), you can get much better performance by just splitting the text up and doing a (recursive)
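The splitting step above can be sketched roughly as follows. This is a hypothetical helper (the name `sentence_chunks` and the naive regex split are my own, not part of guidance); the idea is that each chunk becomes one candidate you could hand to a select-style choice instead of running `substring` over the whole section:

```python
import re

def sentence_chunks(text, max_sentences=3):
    # Naive sentence split on whitespace that follows ., !, or ? ;
    # a real pipeline may want a proper sentence tokenizer.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    # Group consecutive sentences into chunks of up to max_sentences;
    # each chunk is one candidate span of evidence for the model to pick.
    return [' '.join(sentences[i:i + max_sentences])
            for i in range(0, len(sentences), max_sentences)]
```

The trade-off is that the quoted evidence can now only start and end on the chunk boundaries you chose, rather than at an arbitrary character offset.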
-
This reliably falls over for me with a 14B model. Attached is an extract of the code I'm using, just with pieces stuck together from a few modules; hopefully that won't impact debugging. I hardcoded the hub cache path for this example, but I have slightly more sophisticated model-loading logic in the real thing; hopefully that doesn't matter either. text.txt is just the results section extracted from this paper, without tables or figure captions. It's 3200 tokens; some places in my workflow will load ~5k tokens for this step, sometimes with 5-7k tokens already in context.
Versions:
Also fails with
Hardware:
Sorry this turned into a wall of text, but hopefully it has all the information you need!
Interestingly, it works fine if I use a small model like Qwen 0.5B or Llama 3.2 1B. Maybe it's a memory thing? This example has both the paragraph selection and the substring extraction. I think it is initially falling over while selecting a paragraph, because I don't see any log output after loading the text into context. This time I got a traceback, which didn't happen when I first saw this:
-
The new substring Rust implementation by @hudson-ai can handle up to around 10k elements: that's 5k words when splitting on words (you'll get 10k elements because the spaces become separate elements), or 10k characters when splitting on characters. Tests were just added in Rust, but this is not yet exposed in Python in any way. As for large selects, we seem to be able to handle up to a few megabytes (this should already work in guidance).
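To make the element arithmetic above concrete, here is a small sketch (the function name `count_substring_elements` is illustrative, not part of guidance or the Rust implementation) of why word-mode splitting roughly doubles the element count relative to the word count:

```python
import re

def count_substring_elements(text, mode="words"):
    # In word mode, runs of whitespace survive as their own elements,
    # so N words separated by spaces yield about 2N - 1 elements.
    if mode == "words":
        return len([t for t in re.split(r"(\s+)", text) if t])
    # In character mode, every character is one element.
    return len(text)

sample = "one two three four five"
count_substring_elements(sample, "words")  # 5 words + 4 spaces = 9 elements
```

So a 5k-word document lands right around the ~10k-element ceiling mentioned above.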
-
I'm using guidance to provide a text snippet that supports an assertion made by the LLM about something said in a paper. To do that I'm using `substring` to extract the supporting evidence, since I want it to be a real quote from the paper. I was quite surprised that this 'just worked' when I stuffed a whole section of a paper into the function (e.g. the materials and methods of a paper like this).
However, sometimes the application hangs, and I think it hangs on the substring selection, based on the stacktrace I got from interrupting it. Additionally, with 0.2.0 I got an error about too many expressions being constructed (I forget the exact error; I downgraded back to 0.1.16, where it worked), which I think was related to `substring`.
So, is there a sensible upper limit on the amount of text I can expect `substring` to work with? And how would you suggest I extract a text snippet from a big chunk of text without using `substring`? Thanks!
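One alternative to constraining generation with `substring` is to let the model produce a candidate quote unconstrained and then verify it post hoc. This is a minimal sketch of that check (the helper `find_verbatim_quote` and its whitespace normalization are my own assumptions, not a guidance API); it only guarantees the quote appears verbatim up to whitespace, so it's weaker than the constrained approach:

```python
import re
from typing import Optional

def find_verbatim_quote(candidate: str, source: str) -> Optional[str]:
    # Normalize runs of whitespace so line-wrapping differences between the
    # model output and the source text don't cause false negatives.
    def norm(s):
        return re.sub(r"\s+", " ", s).strip()
    # Accept the candidate only if it appears verbatim in the source.
    return candidate if norm(candidate) in norm(source) else None
```

On a failed check you could re-prompt the model, or fall back to a cheaper boundary-based selection over pre-split sentences.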