Note: quick afternoon hack; I might not pursue this anytime soon.
The overall idea of this repository is to build a large, high-quality dataset of vim edits. For the uninitiated, vim is a keyboard-driven text editor: it uses hot-keys to provide speedy navigation around text documents without needing a mouse. The motivation came after reading *Repeat After Me: Transformers are Better than State Space Models at Copying*, where the authors show that SSMs can't copy, which is a shame because copying is quite important. I've also recently started learning vim (notice me Pappa Primeagen), and I think the abstractions it implements present a unique and powerful solution to the problem.
I am not sure how it would be achieved, but the model could have multiple "streams of consciousness", so to speak, where each stream is a vim buffer; buffers could be added and removed dynamically, with some kind of controller issuing vim commands to iteratively build a response to the user's query by default. Lofty ideas... for now the plan is just to build a high-quality dataset to teach some network the basics.
I hope to be able to build this fairly autonomously using the following steps:
- Get a dataset of git diffs from projects with good git hygiene.
- Use treesitter to build diffs for files at different context depths, i.e. function level, if/for/while statement level, class level, file level, etc.
- Use some LLMs to generate several predictions for how to perform the transforms
- Validate the transforms (see the sketch after this list)
- Roll good transforms in as examples and iterate from step 3 some number of times
- Fine-tune some models on the dataset
- Go to step 3 using the fine-tuned models
- Online RL? 🦄
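To make the validation step concrete, here is a rough sketch of replaying candidate transforms in the sandbox and keeping only those that reproduce the target. The seeding trick (typing the source in insert mode, then `gg0` to reset the cursor), the exact-match check, and the assumption that the sandbox echoes the final buffer on stdout are all mine; `run_subprocesses_with_batch_input` and its arguments come from the usage example later in this README.

```python
import asyncio

from vim_data.lib.subprocess_io import run_subprocesses_with_batch_input

CMD = ["docker", "run", "-i", "vim"]  # sandboxed vim, IO over stdin/stdout

async def validate(candidates: list[dict]) -> list[dict]:
    # Seed the empty buffer with the source text, leave insert mode, jump back
    # to the start of the buffer, then apply the candidate transform.
    inputs = [f"i{c['source']}\033gg0{c['transform']}" for c in candidates]
    outputs = await run_subprocesses_with_batch_input(CMD, inputs, 3)
    # Keep only candidates whose sandbox output matches the target exactly.
    return [c for c, out in zip(candidates, outputs) if out.strip() == c["target"]]

good = asyncio.run(validate([
    {"source": "hello world", "target": "hello geezer", "transform": "fwc$geezer\033:wq\n"},
]))
print(good)
```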
I think this is a viable approach to building the dataset automatically. It would be good to collect live, real recordings of people developing in vim, but the intricacies and security issues basically make that a non-starter.
Here is an example schema for the initial dataset I would like to build:
| source | target | transform |
|---|---|---|
|  | hello world | `ihello world{i}\033:wq\n` |
| hello world | hello geezer | `fwc$geezer\033:wq\n` |
| hello | hello, world! | `A, world!{esc}` |
| good morning | good evening | `ct evening{esc}` |
| remove me | remove | `$F d{esc}` |
| start end | start and end | `eand {esc}` |
| unix | UNIX | `gUiw` |
Note: all but the first two of these were generated by GPT-4.
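For concreteness, a row of this dataset could be represented as a small record like the one below; the field names mirror the table above, and the dataclass/JSONL layout is just an illustrative choice, not something the project has settled on:

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class VimEdit:
    source: str     # text before the edit
    target: str     # text after the edit
    transform: str  # vim key sequence that turns source into target

row = VimEdit(source="hello world", target="hello geezer", transform="fwc$geezer\033:wq\n")

# One JSON object per line (JSONL) keeps the dataset easy to stream and append to.
print(json.dumps(asdict(row)))
```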
First, with Poetry installed, run `poetry install`.
Then build the sandbox image with `make docker build`.
Then you should be good to go.
To run in dev mode, run `poetry install --group dev`.
Then run `make tests` to check everything is working.
We use the Dockerfile to create a light-weight image with vim to act as a sandbox. curl and wget aren't included, which limits the model's ability to mess with the network, and we do all file IO in and out of the sandbox via stdin and stdout, so the vim instance doesn't have access to the filesystem. This should stop all but the most sophisticated models.
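To make the stdin/stdout boundary concrete, here is a minimal sketch that pipes a key sequence straight into the container with `subprocess.run`; the `vim` image tag matches the example further down, and the assumption that the container echoes the final buffer on stdout is mine. In practice the helpers described next drive the process through a pseudo-terminal rather than a bare pipe.

```python
import subprocess

# Send a vim key sequence to the sandboxed container over stdin and read the
# result back over stdout; nothing is mounted, so the host filesystem stays
# out of reach of the vim instance.
keys = "ihello world\033:wq\n"  # insert text, leave insert mode, save and quit
result = subprocess.run(
    ["docker", "run", "-i", "vim"],
    input=keys.encode(),
    capture_output=True,
)
print(result.stdout.decode())
```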
This is a function to apply a vim command sequence to a source text in Python. We use subprocesses to launch vim and supply the vim commands via a pseudo-terminal using Python's pty library: the function launches vim, supplies the commands, saves the document, and then reads the final output from the file.
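For intuition, here is a rough, self-contained sketch of the pty idea, assuming a local vim install and a temp file; this is not the repository's helper, which keeps all IO on stdin/stdout of the sandbox container instead of touching the filesystem. The pseudo-terminal matters because vim does not behave well when its input is a plain pipe.

```python
import os
import pty
import subprocess
import tempfile

def apply_vim_commands(source: str, keys: str) -> str:
    """Sketch only: run vim on a temp file under a pseudo-terminal, feed it `keys`
    (ending in ":wq\\n"), and return whatever vim saved."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write(source)
        path = f.name

    master, slave = pty.openpty()  # vim expects a terminal, not a plain pipe
    proc = subprocess.Popen(
        ["vim", "-u", "NONE", path],  # -u NONE: skip user config for reproducibility
        stdin=slave, stdout=slave, stderr=slave,
    )
    os.close(slave)
    os.write(master, keys.encode())

    # Drain vim's screen output so it never blocks on a full pty buffer.
    while True:
        try:
            if not os.read(master, 1024):
                break
        except OSError:  # EIO once vim exits and closes its side of the pty
            break
    proc.wait()
    os.close(master)

    with open(path) as fh:
        return fh.read()
```

For example, `apply_vim_commands('hello world', 'fwc$geezer\033:wq\n')` should come back as `hello geezer` (plus vim's trailing newline).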
Example usage:

```python
import asyncio

from vim_data.lib.subprocess_io import run_subprocess_with_input, run_subprocesses_with_batch_input

async def main():
    # Each input is a vim key sequence; {i} is filled in per item.
    cmd = ["docker", "run", "-i", "vim"]
    input_str = "ihello world{i}\033:wq\n"
    inputs = [input_str.format(i=i) for i in range(100)]
    # Run the batch through the sandbox, three subprocesses at a time.
    output = await run_subprocesses_with_batch_input(cmd, inputs, 3)
    for out in output:
        print(out)

asyncio.run(main())
```
It works but could do with some improvements. It would be good to reuse the vim process between items in the batch, which could save some overhead, but I couldn't find a way to make this work. Suggestions welcome!
This is a function to record the terminal output of a series of vim commands.
Here is some example code to generate a recording:

```python
import asyncio
from pathlib import Path
from vim_data.lib.term_rec import record_vim_terminal

# Record the terminal session produced by typing "Hello, Vim!" and save the asciicast.
Path('hello_world_rec.asciicast').write_text(asyncio.run(record_vim_terminal('iHello, Vim!\033')))
```
We can then play back the recording (e.g. with asciinema).
VimGPT - an existing agent setup; they note GPT-4 is the only model really capable of this task at the moment.