Nintorac/vim-data

vim dataset

Note: this is a quick afternoon hack so far, and I might not pursue it further anytime soon.

The overall idea of this repository is to build a large, high-quality dataset of vim edits. For the uninitiated, vim is a keyboard-driven text editor that uses hotkeys to provide speedy navigation around text documents without needing a mouse. The motivation came after reading Repeat After Me: Transformers are Better than State Space Models at Copying, where the authors show that SSMs can't copy, which is a shame because copying is quite important. I've also recently started learning vim (notice me Pappa Primeagen), and I think the abstractions it implements present a unique and powerful solution to the problem.

I am not sure how it would be achieved, but the model could have multiple "streams of consciousness", so to speak, where each stream is a vim buffer. Buffers could be added and removed dynamically, with some kind of controller issuing vim commands to build a response to the user query iteratively by default. Lofty ideas... For now, the plan is just to build a high-quality dataset to teach some network the basics.

I hope to be able to build this fairly autonomously using the following steps:

  1. Get a dataset of git diffs from projects with good git hygiene.
  2. Use treesitter to build diffs for files at different context depths, i.e. function level, if/for/while statement level, class, file, etc.
  3. Use some LLMs to generate several predictions for how to perform the transforms
  4. Validate the transforms
  5. Roll good transforms in as examples and iterate from step 3 some number of times
  6. Fine tune some models on dataset
  7. Go to step 3 using the fine tuned models
  8. Online RL? 🦄
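As a rough illustration of step 2, here is a minimal sketch of building function-level (source, target) pairs from two versions of a file. It is an assumption-heavy stand-in: it uses Python's built-in `ast` module rather than treesitter, it only handles Python sources, and the helper names `function_sources` and `function_level_pairs` are hypothetical and not part of this repository.

```python
import ast


def function_sources(source: str) -> dict[str, str]:
    """Map each function name to the exact source text of its definition.

    Assumption: names are unique within the file. Methods that share a name
    across classes would overwrite each other in this simple sketch.
    """
    tree = ast.parse(source)
    found = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            found[node.name] = ast.get_source_segment(source, node)
    return found


def function_level_pairs(old: str, new: str) -> list[tuple[str, str, str]]:
    """Return (name, old_text, new_text) for functions whose text changed."""
    before = function_sources(old)
    after = function_sources(new)
    return [
        (name, before[name], after[name])
        for name in before
        if name in after and before[name] != after[name]
    ]
```

Each returned pair could then serve as one (source, target) example at function depth; the same idea would apply to other node types (loops, classes, whole files) with a real treesitter grammar.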

I think this is a viable approach to building a dataset automatically. It would be good to collect live, real recordings of people developing in vim, but the intricacies and security issues basically make it a non-starter.

Here is an example schema for the initial dataset that I would like to build:

| source | target | transform |
| --- | --- | --- |
| (empty) | hello world | `ihello world{i}\033:wq\n` |
| hello world | hello geezer | `fwc$geezer\033:wq\n` |
| hello | hello, world! | `A, world!{esc}` |
| good morning | good evening | `ct evening{esc}` |
| remove me | remove | `$F d{esc}` |
| start end | start and end | `eand {esc}` |
| unix | UNIX | `gUiw` |

Note: all but the first two of these were generated by GPT-4.
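To make the schema and the validation step concrete, here is a minimal sketch of how a row could be represented and checked. Everything in it is an assumption for illustration: the `VimEdit` record, the `expand_keys` helper (which only rewrites the `{esc}` placeholder used in the examples), and the `apply` callable, which stands in for whatever function actually runs the keystrokes in vim.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class VimEdit:
    """One dataset row: source text, desired target text, vim keystrokes."""
    source: str
    target: str
    transform: str


def expand_keys(transform: str) -> str:
    """Replace the readable {esc} placeholder with the real escape byte."""
    return transform.replace("{esc}", "\033")


def validate(
    edits: Iterable[VimEdit],
    apply: Callable[[str, str], str],
) -> list[VimEdit]:
    """Keep only the edits whose transform really turns source into target.

    `apply(source, keys)` is a hypothetical callable returning the edited text.
    Trailing newlines are ignored because vim usually appends one on save.
    """
    good = []
    for edit in edits:
        result = apply(edit.source, expand_keys(edit.transform))
        if result.rstrip("\n") == edit.target.rstrip("\n"):
            good.append(edit)
    return good
```

Rows that survive `validate` would be the "good transforms" that get rolled back in as examples for the next round of generation.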

Installation

First, with poetry installed, run `poetry install`.

Then build the sandbox image with `make docker build`.

Then you should be good to go.

Dev

To run in dev mode, run `poetry install --group dev`.

Then run `make tests` to check everything is working.

Core functions

Vim sandbox

We use the Dockerfile to build a lightweight image with vim that acts as a sandbox. curl and wget aren't included, which limits the model's ability to mess with the network. All file IO in and out of the sandbox goes via stdin and stdout, so the vim instance doesn't have access to the host filesystem. This should stop all but the most sophisticated models.
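If further isolation is wanted, the `docker run` invocation itself can be tightened. The sketch below builds such a command; it is not what the repository currently does. The `sandbox_command` helper and its default limits are hypothetical, and the image name `vim` is taken from the example usage in this README. The flags themselves (`--rm`, `-i`, `--network none`, `--cap-drop ALL`, `--memory`, `--pids-limit`) are standard Docker CLI options.

```python
def sandbox_command(
    image: str = "vim",      # image name as used elsewhere in this README
    memory: str = "256m",    # assumed limit, tune as needed
    pids: int = 64,          # assumed limit, tune as needed
) -> list[str]:
    """Build a docker run command with networking and privileges removed."""
    return [
        "docker", "run",
        "--rm",                  # delete the container when it exits
        "-i",                    # keep stdin open so keystrokes can be piped in
        "--network", "none",     # no network interfaces besides loopback
        "--cap-drop", "ALL",     # drop all Linux capabilities
        "--memory", memory,      # cap memory usage
        "--pids-limit", str(pids),  # cap the number of processes
        image,
    ]
```

The resulting list can be passed wherever the plain `["docker", "run", "-i", "vim"]` command is used.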

Vim command apply

This is a Python function that applies a vim command sequence to a source text. We use subprocesses to launch vim and supply the vim commands via a pseudo-terminal using Python's pty library. The function launches vim, supplies the commands, saves the document, and then reads the final output from the file.
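For readers unfamiliar with the pty approach, here is a minimal local sketch of the idea. It is not the repository's implementation: the `apply_keys` name, the temporary-file handling, the timeout, and the default `vim -u NONE -n` arguments are all assumptions, and it runs vim directly rather than inside the Docker sandbox.

```python
import os
import pty
import select
import subprocess
import tempfile
import time


def apply_keys(source, keys, program=("vim", "-u", "NONE", "-n"), timeout=10.0):
    """Write `source` to a temp file, drive `program` via a pty, return the file.

    `keys` must include the save-and-quit sequence, e.g. "...\\033:wq\\n".
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "buffer.txt")
        with open(path, "w") as handle:
            handle.write(source)

        master, slave = pty.openpty()
        try:
            # The child sees the slave end as its terminal.
            proc = subprocess.Popen(
                [*program, path], stdin=slave, stdout=slave, stderr=slave
            )
            os.close(slave)
            os.write(master, keys.encode())  # "type" the keystrokes

            deadline = time.monotonic() + timeout
            while proc.poll() is None:
                if time.monotonic() > deadline:
                    proc.kill()
                    raise TimeoutError("editor did not exit in time")
                # Drain screen output so the child never blocks on a full pty.
                ready, _, _ = select.select([master], [], [], 0.05)
                if ready:
                    try:
                        os.read(master, 4096)
                    except OSError:
                        break  # pty closed because the child exited
            proc.wait()
        finally:
            os.close(master)

        with open(path) as handle:
            return handle.read()
```

With vim installed, a call such as `apply_keys("hello", "A, world!\033:wq\n")` would be the local equivalent of one row of the dataset.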

Example usage

    import asyncio

    from vim_data.lib.subprocess_io import run_subprocesses_with_batch_input

    async def main():
        cmd = ["docker", "run", "-i", "vim"]
        # Insert "hello world<i>", press <Esc>, then save and quit.
        input_str = "ihello world{i}\033:wq\n"
        inputs = [input_str.format(i=i) for i in range(100)]
        outputs = await run_subprocesses_with_batch_input(cmd, inputs, 3)
        for out in outputs:
            print(out)

    # Outside a notebook, await needs a running event loop.
    asyncio.run(main())

It works but could do with some improvements. It would be good to reuse the vim process between items in the batch, which could save some overhead, but I couldn't find a way to make this work. Suggestions welcome!

vim recorder

This is a function to record the terminal output of a series of vim commands.

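Here is some example code to generate a recording:

    import asyncio
    from pathlib import Path

    from vim_data.lib.term_rec import record_vim_terminal

    # Record typing "Hello, Vim!" in insert mode, then pressing <Esc>.
    recording = asyncio.run(record_vim_terminal('iHello, Vim!\033'))
    Path('hello_world_rec.asciicast').write_text(recording)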

And we can then play back the recording like so:

vim recording example

Resources

VimGPT - an existing agent setup, they note GPT-4 is the only model really capable of this task at the moment

asciinema - term rec to gif

Obsidian Zola
