This repository contains the complete codebase and datasets used in our research paper "Bullying the Machine: How Personas Increase LLM Vulnerability".
⚠️ Important Notice: All materials in this repository are intended solely for academic research purposes. The authors bear no responsibility for any misuse of the code, data, or findings presented here.
- Persona phrases: affirmative and disaffirmative expressions are stored in `data/traits.json`.
- Adversarial benchmark: a curated 50-sample subset of AdvBench is available in `data/advbench50.csv`, sourced from patrickrchao/JailbreakingLLMs.
The workflow consists of three sequential phases: dialogue simulation, content moderation, and result analysis. Each phase is implemented as a separate script in this repository.
This script simulates the dialogue between the attacker LLM and the victim (target) LLM.
Usage:

```shell
python main_simulation.py --seed $SEED --target_model_id $TARGET_MODEL_ID --output_dir $OUTPUT_DIR --exp_def_path $EXP_DEF_PATH
```
Parameters:

- `TARGET_MODEL_ID`: a complete model ID supported by unsloth, for example `unsloth/Meta-Llama-3.1-8B-Instruct`.
- `OUTPUT_DIR`: directory holding the simulation results. To help the moderation script locate the simulated dialogues, `OUTPUT_DIR` must look like `output/${EXP_NAME}/dialogue/`, where `EXP_NAME` is a unique identifier.
- `EXP_DEF_PATH`: path to the experiment definition file, for example `exp_defs/exp_defs_mini5.py`. Settings used in the paper can be found in `exp_defs`.
Output: the script generates multiple pickle files in `OUTPUT_DIR` with the naming convention `X_Y_Z_0000S.pk`, where:

- `X` = attacker goal index
- `Y` = victim persona index
- `Z` = attacker tactic index
- `S` = experiment repetition index (seed = 42 + `S`)
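For downstream analysis it can be convenient to map an output filename back to its indices. The helper below is a hypothetical illustration of the naming convention above, not a function shipped with the repository:

```python
import re

def parse_dialogue_filename(name):
    """Split a simulation filename like '3_1_2_00005.pk' into its
    goal (X), persona (Y), tactic (Z), and repetition (S) indices.
    The seed used for that run is 42 + S, per the convention above."""
    match = re.match(r"^(\d+)_(\d+)_(\d+)_(\d+)\.pk$", name)
    if match is None:
        raise ValueError(f"unexpected filename: {name}")
    x, y, z, s = (int(g) for g in match.groups())
    return {"goal": x, "persona": y, "tactic": z,
            "repetition": s, "seed": 42 + s}
```

For example, `parse_dialogue_filename("3_1_2_00005.pk")` recovers goal 3, persona 1, tactic 2, repetition 5, and seed 47.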
This script evaluates the safety and appropriateness of generated conversations.
Usage:

```shell
python main_moderation.py --exp_name $EXP_NAME --exp_defs $EXP_DEFS
```
Parameters:

- `EXP_NAME`: experiment identifier matching the simulation phase.
- `EXP_DEFS`: the same experiment definition file used during simulation.
Output: moderation results are saved to `output/${EXP_NAME}/moderation/` as individual pickle files, one per dialogue.
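The per-dialogue pickles can be inspected with the standard library. The sketch below only shows loading; the structure of each record, and the assumption that the files share the `.pk` extension of the simulation outputs, depend on the repository's implementation:

```python
import pickle
from pathlib import Path

def load_moderation_results(exp_name, output_root="output"):
    """Load every per-dialogue moderation pickle for an experiment.
    Returns a dict mapping filename stem -> unpickled object.
    Assumes the '.pk' extension; adjust the glob if the repo differs."""
    moderation_dir = Path(output_root) / exp_name / "moderation"
    results = {}
    for path in sorted(moderation_dir.glob("*.pk")):
        with path.open("rb") as f:
            results[path.stem] = pickle.load(f)
    return results
```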
This script aggregates moderation results into statistics used for insights presented in the paper. Users may wish to develop custom analysis scripts for different research questions.
Usage:

```shell
python result_aggregation.py --exp_name $EXP_NAME --exp_defs $EXP_DEFS
```
Output: aggregated statistics are saved as `aggregated_results/statistics.pkl` in `output/${EXP_NAME}/`, containing unsafe counts across tactics, personas, and goals.
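Custom analysis scripts can start from the aggregated pickle. A minimal loader, assuming only the path layout described above (the record structure is defined by the repository):

```python
import pickle
from pathlib import Path

def load_statistics(exp_name, output_root="output"):
    """Load the aggregated statistics for an experiment from
    output/<exp_name>/aggregated_results/statistics.pkl."""
    stats_path = (Path(output_root) / exp_name
                  / "aggregated_results" / "statistics.pkl")
    with stats_path.open("rb") as f:
        return pickle.load(f)
```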
If this research contributes to your work, please cite our paper:
```bibtex
@article{Xu2025bullying,
  author  = {Ziwei Xu and
             Udit Sanghi and
             Mohan S. Kankanhalli},
  title   = {Bullying the Machine: How Personas Increase {LLM} Vulnerability},
  journal = {CoRR},
  volume  = {abs/2505.12692},
  year    = {2025}
}
```