- 2023.06.23: Released the Korean dialogue evaluation results
- 2023.06.08: Released the KULLM-Polyglot-5.8B-v2 fp16 model, based on 🤗 Polyglot-ko 5.8B
- 2023.06.01: Released the KULLM dataset v2 on HuggingFace Datasets
- 2023.05.31: Released the KULLM-Polyglot-12.8B-v2 fp16 model, based on 🤗 Polyglot-ko 12.8B
- 2023.05.30: Released the KULLM-Polyglot-12.8B fp16 model, based on 🤗 Polyglot-ko 12.8B
KULLM (구름, "cloud") is a Korean Large Language Model (LLM) developed by the NLP & AI Lab at Korea University together with the HIAI Research Institute.
The KULLM project releases not only the Korean models but also the datasets, aiming to contribute to the Korean LLM ecosystem.
KULLM was trained using Polyglot-ko as its backbone model.
- Polyglot-ko 5.8B based v2 -> 🤗 nlpai-lab/kullm-polyglot-5.8b-v2
- Polyglot-ko 12.8B based v2 -> 🤗 nlpai-lab/kullm-polyglot-12.8b-v2
- Polyglot-ko 12.8B based v1 -> 🤗 metterian/kullm-polyglot-12.8b-v1
- Dataset v1: GPT4ALL
Models built on Meta's LLaMA backbone showed poor Korean performance in our tests, so we decided not to release them. We plan to train and release LLMs with better Korean performance in the future.
- Install the latest torch / HuggingFace libraries:

```bash
pip install -U torch transformers tokenizers accelerate
```

You can try the model with the example code below.
```python
import torch
from transformers import AutoModelForCausalLM, pipeline

from utils.prompter import Prompter

MODEL = "nlpai-lab/kullm-polyglot-5.8b-v2"

# Load the model in float16 and move it to the GPU.
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(device="cuda", non_blocking=True)
model.eval()

pipe = pipeline("text-generation", model=model, tokenizer=MODEL, device=0)
prompter = Prompter("kullm")


def infer(instruction="", input_text=""):
    prompt = prompter.generate_prompt(instruction, input_text)
    output = pipe(prompt, max_length=512, temperature=0.2, num_beams=5, eos_token_id=2)
    s = output[0]["generated_text"]
    result = prompter.get_response(s)
    return result


result = infer(input_text="고려대학교에 대해서 알려줘")  # "Tell me about Korea University"
print(result)
# 'If you have any questions about Korea University, feel free to ask anytime. Korea University
# is one of the oldest and most prestigious universities in Korea, and its history goes hand in
# hand with the history of the country. Korea University pursues academic excellence while
# striving to fulfill its social responsibilities. It is well known for offering a wide range of
# programs and support for students, faculty, and staff, and it plays an important role in
# Korean politics, economy, and society. Would you like to know more about Korea University?'
```

KULLM dataset v2 is a merge of the GPT-4-LLM, Vicuna, and Databricks Dolly datasets. All of these datasets were translated into Korean using DeepL.
GPT4ALL is an instruction-tuned, assistant-style language model, and the Vicuna and Dolly datasets are used to address a variety of natural language processing problems. In particular, Dolly is a language model trained on instruction/response fine-tuning records.
```python
from datasets import load_dataset

ds = load_dataset("nlpai-lab/kullm-v2", split="train")
ds
```

```
Dataset({
    features: ['id', 'instruction', 'input', 'output'],
    num_rows: 152630
})
```

KULLM dataset v1 is based on GPT4ALL.
The GPT4ALL dataset consists of an Instruction part, an Input, and an Output, as shown below.
```json
{
    "id": "user_oriented_task_235",
    "motivation_app": "Yelp",
    "instruction": "Classify the business as one of restaurant, home service, auto service, or other, depending on its specialty.",
    "instances": [
        {
            "input": "Call 650-636-4884 or visit our website to get a quote. This shop specializes in new tires and general auto repair. They carry all tires in-house and have a wide variety of tires to fit any budget or vehicle type. If you are unsure which tires you need, experts are on hand to help you choose the tires that best fit your needs. They also carry commercial truck tires and can provide tires for a wide range of vehicles.",
            "output": "Auto Services"
        }
    ]
}
```

The dataset translated into Korean is stored in kullm-v2.jsonl.
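Each line of kullm-v2.jsonl holds one JSON record with the fields shown earlier (`id`, `instruction`, `input`, `output`). A minimal standard-library sketch of reading this format (the sample record and file name below are made up for illustration):

```python
import json
from pathlib import Path

# Write a tiny sample file in the same one-record-per-line layout (made-up content).
sample = {"id": "demo_0", "instruction": "Say hello.", "input": "", "output": "Hello!"}
Path("kullm-v2-sample.jsonl").write_text(
    json.dumps(sample, ensure_ascii=False) + "\n", encoding="utf-8"
)

# Read it back: one JSON object per line.
records = []
with open("kullm-v2-sample.jsonl", encoding="utf-8") as f:
    for line in f:
        records.append(json.loads(line))

print(records[0]["output"])  # -> Hello!
```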
KULLM's Korean model was trained by applying Low-Rank Adaptation (LoRA) to the Polyglot 12.8B model.
Training was carried out on 4x A100 80GB GPUs. The training code is based on tloen/alpaca-lora.
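LoRA freezes the pretrained weight matrix W and learns only a low-rank update ΔW = (α/r)·B·A, which is what makes fine-tuning a 12.8B model on 4 GPUs feasible. A NumPy sketch of the idea (r=8 and alpha=16 match the hyperparameters used here; the layer size is a made-up illustration, not Polyglot's actual dimensions):

```python
import numpy as np

hidden, r, alpha = 512, 8, 16  # hidden size is illustrative only

rng = np.random.default_rng(0)
W = rng.standard_normal((hidden, hidden))    # frozen pretrained weight
A = rng.standard_normal((r, hidden)) * 0.01  # trainable rank-r factor
B = np.zeros((hidden, r))                    # trainable, initialized to zero

# Effective weight used in the forward pass: W + (alpha / r) * B @ A.
# Because B starts at zero, the model is unchanged at the start of training.
W_eff = W + (alpha / r) * (B @ A)

full_params = W.size
lora_params = A.size + B.size
print(f"trainable params: {lora_params} vs full: {full_params} "
      f"({lora_params / full_params:.1%})")
```

Only A and B receive gradients, so the trainable parameter count per adapted matrix drops from hidden² to 2·r·hidden.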
🤗 Huggingface Repo: https://huggingface.co/nlpai-lab/kullm-polyglot-12.8b-v2
This model was trained on KULLM dataset v2 (GPT4ALL, Dolly, Vicuna) for a total of 8 epochs on 4x A100 80GB GPUs.
🤗 Huggingface Repo: https://huggingface.co/metterian/kullm-polyglot-12.8b-v1
This model was trained on KULLM dataset v1 (GPT4ALL) for a total of 5 epochs on 4x A100 80GB GPUs.
- Install the required packages with:

```bash
pip install -r requirements.txt
```

- If bitsandbytes doesn't work, install it from source. Windows users should refer to the following instructions.
This file applies Parameter-Efficient Fine-Tuning (PEFT) to the Polyglot model and contains the code for prompt construction and tokenization.

Usage example:

```bash
python finetune_polyglot.py \
    --base_model='EleutherAI/polyglot-ko-12.8b' \
    --data_path='./data/kullm-v2.jsonl'
```
You can adjust the hyperparameters as follows:

```bash
python -m torch.distributed.launch --master_port=34322 --nproc_per_node 4 finetune_polyglot.py \
    --fp16 \
    --base_model 'EleutherAI/polyglot-ko-12.8b' \
    --data_path data/kullm-v2.jsonl \
    --output_dir ckpt/$SAVE_DIR \
    --prompt_template_name kullm \
    --batch_size 128 \
    --micro_batch_size 4 \
    --num_epochs $EPOCH \
    --learning_rate $LR \
    --cutoff_len 512 \
    --val_set_size 2000 \
    --lora_r 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --lora_target_modules "[query_key_value, xxx]" \
    --train_on_inputs \
    --logging_steps 1 \
    --eval_steps 40 \
    --weight_decay 0. \
    --warmup_steps 0 \
    --warmup_ratio 0.1 \
    --lr_scheduler_type "cosine" \
    --group_by_length
```

- We evaluated Korean dialogue across models using a Dialogue Evaluation Metric. The evaluation prompt was constructed drawing on G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment (Yang Liu et al., 2023) and USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation (Shikib Mehri et al., 2020).
- We used GPT-4 as the evaluation model, and for the evaluation dataset we used user_oriented_instructions.jsonl, the human evaluation dataset from yizhongw/self-instruct, translated into Korean with DeepL.
- This dataset is stored in user_oriented_instructions_eval.jsonl.
- The values in the graph are scaled to 0-100.
| Type | Base model | Model | Understandable (0 - 1) | Natural (1 - 3) | Maintains Context (1 - 3) | Interesting (1 - 3) | Uses Instruction (0 - 1) | Overall Quality (1 - 5) |
|---|---|---|---|---|---|---|---|---|
| Closed | GPT3.5-turbo | GPT-3.5 | 0.980 | 2.806 | 2.849 | 2.056 | 0.917 | 3.905 |
| Closed | GPT-4 | GPT-4 | 0.984 | 2.897 | 2.944 | 2.143 | 0.968 | 4.083 |
| Open | Polyglot-ko-12.8b | KoAlpaca v1.1 | 0.651 | 1.909 | 1.901 | 1.583 | 0.385 | 2.575 |
| Open | LLaMA-7b | koVicuna | 0.460 | 1.583 | 1.726 | 1.528 | 0.409 | 2.440 |
| Open | Polyglot-ko-12.8b | KULLM v2 | 0.742 | 2.083 | 2.107 | 1.794 | 0.548 | 3.036 |
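The 0-100 scaling mentioned above is presumably a linear rescale of each criterion onto a common range; a sketch under that assumption (the exact formula is not spelled out in this document):

```python
def rescale(score: float, lo: float, hi: float) -> float:
    """Linearly map a score from the range [lo, hi] to [0, 100]."""
    return (score - lo) / (hi - lo) * 100

# Example: KULLM v2's Overall Quality of 3.036 on the 1-5 scale.
print(round(rescale(3.036, 1, 5), 1))
```

With this mapping, each criterion's minimum lands at 0 and its maximum at 100, making the 0-1, 1-3, and 1-5 columns directly comparable.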
A conversation between two people is given. You will receive an instruction (Instruction) and an input (Input), followed by a response (Response) to that instruction and input.
Your task is to evaluate the response by following the evaluation steps.
It is important to read and understand these evaluation criteria carefully. Please keep this document open while evaluating, and refer to it as needed.
Evaluation criteria:
- Understandable (0 - 1): Can the Response be understood given the Input?
- Natural (1 - 3): Is the Instruction something a person would plausibly say?
- Maintains Context (1 - 3): Does the Response maintain the context of the Input?
- Interesting (1 - 3): Is the Response dull, or is it interesting?
- Uses Instruction (0 - 1): Was the Response generated based on the Instruction?
- Overall Quality (1 - 5): Based on your answers above, what is your impression of the overall quality of this utterance?
Evaluation steps:
1. Read the Instruction, Input, and Response carefully.
2. Evaluate the Response according to the evaluation criteria above.
Instruction:
{{instruction}}
Input:
{{input}}
Response:
{{response}}
Result
- Understandable (0 - 1):
- Natural (1 - 3):
- Maintains Context (1 - 3):
- Interesting (1 - 3):
- Uses Instruction (0 - 1):
- Overall Quality (1 - 5):
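The {{...}} placeholders in the template above can be filled with plain string substitution before the prompt is sent to the evaluation model; a minimal sketch (template abbreviated, example values made up):

```python
TEMPLATE = """Instruction:
{{instruction}}

Input:
{{input}}

Response:
{{response}}"""


def fill_template(template: str, **fields: str) -> str:
    """Replace each {{name}} placeholder with the matching field value."""
    for name, value in fields.items():
        template = template.replace("{{" + name + "}}", value)
    return template


prompt = fill_template(
    TEMPLATE,
    instruction="Tell me about Korea University",
    input="",
    response="Korea University is one of the oldest universities in Korea.",
)
print(prompt)
```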
Please cite the repo if you use the data or code in this repo.
```bibtex
@inproceedings{lee2023kullm,
  title={KULLM: Learning to Construct Korean Instruction-following Large Language Models},
  author={Lee, SeungJun and Lee, Taemin and Lee, Jeongwoo and Jang, Yoona and Lim, Heuiseok},
  booktitle={Annual Conference on Human and Language Technology},
  pages={196--202},
  year={2023},
  organization={Human and Language Technology}
}

@misc{kullm,
  author = {NLP & AI Lab and Human-Inspired AI research},
  title = {KULLM: Korea University Large Language Model Project},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/nlpai-lab/kullm}},
}
```


