Steering LLM Output with Activation Engineering

Prompt engineering and fine-tuning are the most common ways for AI practitioners to guide the outputs of large language models. Activation engineering is an alternative from the research literature: it modifies a model's internal activations at inference time.

ActAdd is one example of this approach: https://arxiv.org/pdf/2308.10248.pdf. The paper comes with the following GitHub repository: https://github.com/montemac/activation_additions
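To make the idea concrete, here is a minimal sketch of the arithmetic behind activation addition as described in the ActAdd paper: take the activations of a contrasting prompt pair at one layer, scale their difference by a coefficient, and add the result to the front positions of the base prompt's residual stream. The function names, the toy numbers, and the use of plain Python lists in place of model tensors are all assumptions made for illustration; this is not the code in this repo or in the activation_additions library.

```python
from typing import List

Vector = List[float]
Activations = List[Vector]  # one vector per token position


def steering_vectors(pos_acts: Activations, neg_acts: Activations, coeff: float) -> Activations:
    """Scaled per-position difference between the two prompts' activations."""
    if len(pos_acts) != len(neg_acts):
        raise ValueError("pad the prompt pair to the same token length first")
    return [
        [coeff * (p - n) for p, n in zip(pos_vec, neg_vec)]
        for pos_vec, neg_vec in zip(pos_acts, neg_acts)
    ]


def add_steering(resid: Activations, steer: Activations) -> Activations:
    """Add the steering vectors to the first len(steer) positions of resid."""
    if len(steer) > len(resid):
        raise ValueError("steering prompt is longer than the base prompt")
    out = [list(vec) for vec in resid]  # copy so the input is left untouched
    for pos, steer_vec in enumerate(steer):
        out[pos] = [r + s for r, s in zip(out[pos], steer_vec)]
    return out


# Toy numbers standing in for one layer's activations (hypothetical, d_model = 2).
pos_acts = [[1.0, 0.0], [2.0, 1.0]]  # e.g. a "Shakespearean" steering prompt
neg_acts = [[0.0, 0.0], [1.0, 3.0]]  # e.g. a "modern English" steering prompt
resid = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]  # base prompt, three positions

steer = steering_vectors(pos_acts, neg_acts, coeff=2.0)
steered = add_steering(resid, steer)
```

In a real model the addition would happen inside a forward hook at the chosen layer, and the layer index and coefficient are hyperparameters to tune.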

This repo uses the mistralai/Mistral-7B-Instruct-v0.2 model to produce summaries of news articles in Shakespearean style, and experiments with activation engineering as one way to do so.

Stack

Mistral-7B-Instruct-v0.2
transformer_lens-1.15.0

File Descriptions

standard_summary.py: Generate a summary in modern English
shakespearean_summary_baseline.py: Generate a summary in Shakespearean style using one-shot prompting, without activation engineering
shakespearean_summary_activation.py: Generate a summary in Shakespearean style using activation engineering

How to Run

Install transformer_lens from the local copy (the current release supports only mistralai/Mistral-7B-Instruct-v0.1, not v0.2)

cd transformer_lens-1.15.0
poetry install
pip install -e .

Install torch and transformers

cd ..
pip install -r requirements.txt

Results

Reference results are in results.txt

Thoughts on Eval

If human-crafted reference summaries are available, ROUGE can be a reasonably reliable metric, but writing reference summaries is very hard to scale.
It may be feasible to use an LLM for evaluation, for example with a prompt similar to the one in eval_prompt.jinja.
I experimented with DeepEval's summarization evaluation. The idea is appealing, but my low-volume experiments suggest the results are quite unreliable at this stage.
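For the ROUGE option mentioned above, the following is a minimal sketch of ROUGE-1 (unigram overlap) between a generated summary and a reference. It assumes simple lowercased whitespace tokenization with no stemming, so its numbers will not match an established ROUGE package exactly; the function name and the example sentences are made up for illustration.

```python
from collections import Counter


def rouge_1(candidate: str, reference: str) -> dict:
    """Unigram-overlap precision, recall, and F1 (simplified ROUGE-1)."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(cand_tokens) if cand_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical candidate and reference, not taken from results.txt.
scores = rouge_1("the king doth speak of war", "the king speaks of war")
print(scores)
```

Here four of the six candidate tokens appear in the five-token reference, so precision is 4/6 and recall is 4/5.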

Thoughts on Activation Engineering Experiments

Compared with the one-shot prompting baseline for creating Shakespearean summaries, I had great difficulty steering the model with activation engineering.

The same steering prompt pair is immediately effective for simple base prompts like 'I went up to my friend and said', but quite ineffective for the summarization prompt. Details are in the comments of shakespearean_summary_activation.py for later review.
