This repository contains the complete codebase and datasets used in our research paper "Bullying the Machine: How Personas Increase LLM Vulnerability".
⚠️ Important Notice: All materials in this repository are intended solely for academic research purposes. The authors bear no responsibility for any misuse of the code, data, or findings presented here.
- Persona phrases: affirmative and disaffirmative expressions are stored in `data/traits.json`.
- Adversarial benchmark: a curated 50-sample subset of AdvBench is available in `data/advbench50.csv`, sourced from patrickrchao/JailbreakingLLMs.
The workflow consists of three sequential phases: dialogue simulation, content moderation, and result analysis. Each phase is implemented as a separate script in this repository.
This script simulates the dialogue between the attacker LLM and the victim (target) LLM.
Usage:

```shell
python main_simulation.py --seed $SEED --target_model_id $TARGET_MODEL_ID --output_dir $OUTPUT_DIR --exp_def_path $EXP_DEF_PATH
```
Parameters:

- `TARGET_MODEL_ID`: a complete model ID supported by unsloth, for example `unsloth/Meta-Llama-3.1-8B-Instruct`.
- `OUTPUT_DIR`: directory holding the simulation results. To help the moderation script locate the simulated dialogues, `OUTPUT_DIR` must look like `output/${EXP_NAME}/dialogue/`, where `EXP_NAME` is a unique identifier.
- `EXP_DEF_PATH`: path to the experiment definition file, for example `exp_defs/exp_defs_mini5.py`. Settings used in the paper can be found in `exp_defs`.
Output: the script generates multiple pickle files in `OUTPUT_DIR` with the naming convention `X_Y_Z_0000S.pk`, where:

- `X` = attacker goal index
- `Y` = victim persona index
- `Z` = attacker tactic index
- `S` = experiment repetition index (seed = 42 + `S`)
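For downstream analysis it can be convenient to map an output filename back to its indices. The helper below is a hypothetical illustration of the naming convention above, not a function shipped with the repository:

```python
import re

def parse_dialogue_filename(name):
    """Split a simulation filename like '3_1_2_00005.pk' into its
    goal (X), persona (Y), tactic (Z), and repetition (S) indices.
    The seed used for that run is 42 + S, per the convention above."""
    match = re.match(r"^(\d+)_(\d+)_(\d+)_(\d+)\.pk$", name)
    if match is None:
        raise ValueError(f"unexpected filename: {name}")
    x, y, z, s = (int(g) for g in match.groups())
    return {"goal": x, "persona": y, "tactic": z,
            "repetition": s, "seed": 42 + s}
```

For example, `parse_dialogue_filename("3_1_2_00005.pk")` recovers goal 3, persona 1, tactic 2, repetition 5, and seed 47.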
This script evaluates the safety and appropriateness of generated conversations.
Usage:

```shell
python main_moderation.py --exp_name $EXP_NAME --exp_defs $EXP_DEFS
```
Parameters:

- `EXP_NAME`: experiment identifier matching the simulation phase.
- `EXP_DEFS`: the same experiment definition file used during simulation.
Output: moderation results are saved to `output/${EXP_NAME}/moderation/` as individual pickle files, one per dialogue.
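The per-dialogue pickles can be inspected with the standard library. The sketch below only shows loading; the structure of each record, and the assumption that the files share the `.pk` extension of the simulation outputs, depend on the repository's implementation:

```python
import pickle
from pathlib import Path

def load_moderation_results(exp_name, output_root="output"):
    """Load every per-dialogue moderation pickle for an experiment.
    Returns a dict mapping filename stem -> unpickled object.
    Assumes the '.pk' extension; adjust the glob if the repo differs."""
    moderation_dir = Path(output_root) / exp_name / "moderation"
    results = {}
    for path in sorted(moderation_dir.glob("*.pk")):
        with path.open("rb") as f:
            results[path.stem] = pickle.load(f)
    return results
```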
This script aggregates moderation results into statistics used for insights presented in the paper. Users may wish to develop custom analysis scripts for different research questions.
Usage:

```shell
python result_aggregation.py --exp_name $EXP_NAME --exp_defs $EXP_DEFS
```
Output: aggregated statistics are saved as `aggregated_results/statistics.pkl` in `output/${EXP_NAME}/`, containing unsafe counts across tactics, personas, and goals.
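Custom analysis scripts can start from the aggregated pickle. A minimal loader, assuming only the path layout described above (the record structure is defined by the repository):

```python
import pickle
from pathlib import Path

def load_statistics(exp_name, output_root="output"):
    """Load the aggregated statistics for an experiment from
    output/<exp_name>/aggregated_results/statistics.pkl."""
    stats_path = (Path(output_root) / exp_name
                  / "aggregated_results" / "statistics.pkl")
    with stats_path.open("rb") as f:
        return pickle.load(f)
```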
If this research contributes to your work, please cite our paper:
```bibtex
@article{Xu2025bullying,
  author  = {Ziwei Xu and
             Udit Sanghi and
             Mohan S. Kankanhalli},
  title   = {Bullying the Machine: How Personas Increase {LLM} Vulnerability},
  journal = {CoRR},
  volume  = {abs/2505.12692},
  year    = {2025}
}
```