Note: quick afternoon hack; I might not pursue this anytime soon.
The overall idea of this repository is to build a large, high-quality dataset of vim edits. For the uninitiated, vim is a keyboard-driven text editor: it uses hot-keys to provide speedy navigation around text documents without needing a mouse. The motivation came after reading *Repeat After Me: Transformers are Better than State Space Models at Copying*, where the authors show that SSMs can't copy, which is a shame because copying is quite important. I've also recently started learning vim (notice me Pappa Primeagen), and I think the abstractions it implements present a unique and powerful solution to the problem.
I am not sure how it would be achieved, but the model could have multiple "streams of consciousness", so to speak, where each stream is a vim buffer; buffers could be added and removed dynamically, with some kind of controller issuing vim commands to iteratively build a response to the user's query by default. Lofty ideas... for now the plan is just to build a high-quality dataset to teach some network the basics.
I hope to be able to build this fairly autonomously using the following steps:
- Get a dataset of git diffs from projects with good git hygiene.
- Use treesitter to build diffs for files at different context depths, i.e. function level, if/for/while statement level, class level, file level, etc.
- Use some LLMs to generate several predictions for how to perform the transforms
- Validate the transforms (see the sketch after this list)
- Roll good transforms in as examples and iterate from step 3 some number of times
- Fine-tune some models on the dataset
- Go to step 3 using the fine-tuned models
- Online RL? 🦄
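To make the validation step concrete, here is a rough sketch of replaying candidate transforms in the sandbox and keeping only those that reproduce the target. The seeding trick (typing the source in insert mode, then `gg0` to reset the cursor), the exact-match check, and the assumption that the sandbox echoes the final buffer on stdout are all mine; `run_subprocesses_with_batch_input` and its arguments come from the usage example later in this README.

```python
import asyncio

from vim_data.lib.subprocess_io import run_subprocesses_with_batch_input

CMD = ["docker", "run", "-i", "vim"]  # sandboxed vim, IO over stdin/stdout

async def validate(candidates: list[dict]) -> list[dict]:
    # Seed the empty buffer with the source text, leave insert mode, jump back
    # to the start of the buffer, then apply the candidate transform.
    inputs = [f"i{c['source']}\033gg0{c['transform']}" for c in candidates]
    outputs = await run_subprocesses_with_batch_input(CMD, inputs, 3)
    # Keep only candidates whose sandbox output matches the target exactly.
    return [c for c, out in zip(candidates, outputs) if out.strip() == c["target"]]

good = asyncio.run(validate([
    {"source": "hello world", "target": "hello geezer", "transform": "fwc$geezer\033:wq\n"},
]))
print(good)
```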
I think this is a viable approach to building the dataset automatically. It would be good to collect live, real recordings of people developing in vim, but the intricacies and security issues basically make that a non-starter.
Here is an example schema for the initial dataset I would like to build:
| source | target | transform |
|---|---|---|
|  | hello world | `ihello world{i}\033:wq\n` |
| hello world | hello geezer | `fwc$geezer\033:wq\n` |
| hello | hello, world! | `A, world!{esc}` |
| good morning | good evening | `ct evening{esc}` |
| remove me | remove | `$F d{esc}` |
| start end | start and end | `eand {esc}` |
| unix | UNIX | `gUiw` |
Note: all but the first two of these were generated by GPT-4.
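For concreteness, a row of this dataset could be represented as a small record like the one below; the field names mirror the table above, and the dataclass/JSONL layout is just an illustrative choice, not something the project has settled on:

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class VimEdit:
    source: str     # text before the edit
    target: str     # text after the edit
    transform: str  # vim key sequence that turns source into target

row = VimEdit(source="hello world", target="hello geezer", transform="fwc$geezer\033:wq\n")

# One JSON object per line (JSONL) keeps the dataset easy to stream and append to.
print(json.dumps(asdict(row)))
```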
First, with Poetry installed, run `poetry install`.
Then build the sandbox image with `make docker build`.
Then you should be good to go.
To run in dev mode, run `poetry install --group dev`.
Then run `make tests` to check everything is working.
We use the Dockerfile to create a light-weight image with vim to act as a sandbox. curl and wget aren't included, which limits the model's ability to mess with the network, and we do all file IO in and out of the sandbox via stdin and stdout, so the vim instance doesn't have access to the filesystem. This should stop all but the most sophisticated models.
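To make the stdin/stdout boundary concrete, here is a minimal sketch that pipes a key sequence straight into the container with `subprocess.run`; the `vim` image tag matches the example further down, and the assumption that the container echoes the final buffer on stdout is mine. In practice the helpers described next drive the process through a pseudo-terminal rather than a bare pipe.

```python
import subprocess

# Send a vim key sequence to the sandboxed container over stdin and read the
# result back over stdout; nothing is mounted, so the host filesystem stays
# out of reach of the vim instance.
keys = "ihello world\033:wq\n"  # insert text, leave insert mode, save and quit
result = subprocess.run(
    ["docker", "run", "-i", "vim"],
    input=keys.encode(),
    capture_output=True,
)
print(result.stdout.decode())
```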
This is a function to apply a vim command sequence to a source text in Python. We use subprocesses to launch vim and supply the vim commands via a pseudo-terminal using Python's pty library: the function launches vim, supplies the commands, saves the document, and then reads the final output from the file.
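For intuition, here is a rough, self-contained sketch of the pty idea, assuming a local vim install and a temp file; this is not the repository's helper, which keeps all IO on stdin/stdout of the sandbox container instead of touching the filesystem. The pseudo-terminal matters because vim does not behave well when its input is a plain pipe.

```python
import os
import pty
import subprocess
import tempfile

def apply_vim_commands(source: str, keys: str) -> str:
    """Sketch only: run vim on a temp file under a pseudo-terminal, feed it `keys`
    (ending in ":wq\\n"), and return whatever vim saved."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write(source)
        path = f.name

    master, slave = pty.openpty()  # vim expects a terminal, not a plain pipe
    proc = subprocess.Popen(
        ["vim", "-u", "NONE", path],  # -u NONE: skip user config for reproducibility
        stdin=slave, stdout=slave, stderr=slave,
    )
    os.close(slave)
    os.write(master, keys.encode())

    # Drain vim's screen output so it never blocks on a full pty buffer.
    while True:
        try:
            if not os.read(master, 1024):
                break
        except OSError:  # EIO once vim exits and closes its side of the pty
            break
    proc.wait()
    os.close(master)

    with open(path) as fh:
        return fh.read()
```

For example, `apply_vim_commands('hello world', 'fwc$geezer\033:wq\n')` should come back as `hello geezer` (plus vim's trailing newline).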
Example usage:

```python
import asyncio

from vim_data.lib.subprocess_io import run_subprocess_with_input, run_subprocesses_with_batch_input

async def main():
    # Each input is a vim key sequence; {i} is filled in per item.
    cmd = ["docker", "run", "-i", "vim"]
    input_str = "ihello world{i}\033:wq\n"
    inputs = [input_str.format(i=i) for i in range(100)]
    # Run the batch through the sandbox, three subprocesses at a time.
    output = await run_subprocesses_with_batch_input(cmd, inputs, 3)
    for out in output:
        print(out)

asyncio.run(main())
```
It works but could do with some improvements. It would be good to reuse the vim process between items in the batch, which could save some overhead, but I couldn't find a way to make this work. Suggestions welcome!
This is a function to record the terminal output of a series of vim commands.
Here is some example code to generate a recording:

```python
import asyncio
from pathlib import Path
from vim_data.lib.term_rec import record_vim_terminal

# Record the terminal session produced by typing "Hello, Vim!" and save the asciicast.
Path('hello_world_rec.asciicast').write_text(asyncio.run(record_vim_terminal('iHello, Vim!\033')))
```
We can then play back the recording (e.g. with asciinema).
VimGPT - an existing agent setup; they note GPT-4 is the only model really capable of this task at the moment.