
Bullying the Machine: How Personas Increase LLM Vulnerability

This repository contains the complete codebase and datasets used in our research paper "Bullying the Machine: How Personas Increase LLM Vulnerability".

⚠️ Important Notice: All materials in this repository are intended solely for academic research purposes. The authors bear no responsibility for any misuse of the code, data, or findings presented here.

Available Data

  • Persona phrases: Affirmative and disaffirmative expressions are stored in data/traits.json
  • Adversarial benchmark: A curated 50-sample subset from AdvBench is available in data/advbench50.csv, sourced from patrickrchao/JailbreakingLLMs

Workflow

The workflow consists of three sequential phases: dialogue simulation, content moderation, and result analysis. Each phase is implemented as a separate script in this repository.

Phase 1: Dialogue Simulation (main_simulation.py)

This script simulates the dialogue between an attacker LLM and the victim (target) LLM.

Usage:

python main_simulation.py --seed $SEED --target_model_id $TARGET_MODEL_ID --output_dir $OUTPUT_DIR --exp_def_path $EXP_DEF_PATH

Parameters:

  • TARGET_MODEL_ID: a complete model ID supported by unsloth, for example, unsloth/Meta-Llama-3.1-8B-Instruct
  • OUTPUT_DIR: directory holding the simulation results. To help the moderation script locate the simulated dialogues, OUTPUT_DIR must look like output/${EXP_NAME}/dialogue/, where EXP_NAME is a unique identifier.
  • EXP_DEF_PATH: path to the experiment definition files, for example exp_defs/exp_defs_mini5.py. Settings used in the paper can be found in exp_defs.
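Because the moderation script locates dialogues by path, OUTPUT_DIR must follow the output/${EXP_NAME}/dialogue/ layout exactly. A minimal sketch of building that path (the helper name is hypothetical, not part of the repository):

```python
from pathlib import Path

def make_output_dir(exp_name: str, root: str = "output") -> Path:
    """Build the output/${EXP_NAME}/dialogue/ directory the moderation
    script expects, creating it if it does not exist."""
    out = Path(root) / exp_name / "dialogue"
    out.mkdir(parents=True, exist_ok=True)
    return out
```

The returned path can then be passed as --output_dir to main_simulation.py.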

Output: The script writes multiple pickle files to OUTPUT_DIR, named following the convention X_Y_Z_0000S.pk, where

  • X = attacker goal index
  • Y = victim persona index
  • Z = attacker tactic index
  • S = experiment repetition index (seed = 42 + S)
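The indices above can be recovered from a result filename with a small parser; this helper is a hypothetical illustration of the convention, not a function shipped with the repository:

```python
import re

# Matches X_Y_Z_0000S.pk, where each field is a (possibly zero-padded) integer.
FILENAME_RE = re.compile(r"^(\d+)_(\d+)_(\d+)_(\d+)\.pk$")

def parse_result_name(name: str) -> dict:
    """Split 'X_Y_Z_0000S.pk' into its index components."""
    m = FILENAME_RE.match(name)
    if m is None:
        raise ValueError(f"unexpected filename: {name!r}")
    goal, persona, tactic, rep = map(int, m.groups())
    return {
        "goal": goal,          # X: attacker goal index
        "persona": persona,    # Y: victim persona index
        "tactic": tactic,      # Z: attacker tactic index
        "repetition": rep,     # S: repetition index
        "seed": 42 + rep,      # seed used for this repetition
    }
```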

Phase 2: Content Moderation (main_moderation.py)

This script evaluates the safety and appropriateness of generated conversations.

Usage:

python main_moderation.py --exp_name $EXP_NAME --exp_defs $EXP_DEFS

Parameters:

  • EXP_NAME: Experiment identifier matching the simulation phase
  • EXP_DEFS: Same experiment definition file used during simulation

Output: Moderation results are saved to output/${EXP_NAME}/moderation/ as individual pickle files for each dialogue.

Phase 3: Result Analysis (result_aggregation.py)

This script aggregates the moderation results into the statistics underlying the insights presented in the paper. Users may wish to develop custom analysis scripts for different research questions.

Usage:

python result_aggregation.py --exp_name $EXP_NAME --exp_defs $EXP_DEFS

Output: Aggregated statistics are saved as aggregated_results/statistics.pkl in output/${EXP_NAME}/, containing unsafe counts across tactics, personas, and goals.
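A custom analysis script can start by loading this file with the standard pickle module. A minimal sketch (the helper name is hypothetical, and the internal key layout of statistics.pkl is not documented here, so inspect the loaded object before relying on its structure):

```python
import pickle
from pathlib import Path

def load_statistics(exp_name: str, root: str = "output"):
    """Load output/${EXP_NAME}/aggregated_results/statistics.pkl."""
    stats_path = Path(root) / exp_name / "aggregated_results" / "statistics.pkl"
    with stats_path.open("rb") as f:
        return pickle.load(f)
```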

Citation

If this research contributes to your work, please cite our paper:

@article{Xu2025bullying,
  author       = {Ziwei Xu and
                  Udit Sanghi and
                  Mohan S. Kankanhalli},
  title        = {Bullying the Machine: How Personas Increase {LLM} Vulnerability},
  journal      = {CoRR},
  volume       = {abs/2505.12692},
  year         = {2025}
}
