☁️ KULLM (구름): Korea University Large Language Model

KULLM (구름, Korean for "cloud") is a Korean Large Language Model (LLM) developed by Korea University's NLP & AI Lab and the HIAI Research Institute.

The KULLM project fully releases not only the Korean models but also the datasets, aiming to contribute to the Korean LLM ecosystem.


Example


Backbone Model: Polyglot-ko

KULLM was trained using Polyglot-ko as its backbone model.

  1. Polyglot-ko 5.8B based, v2 -> 🤗 nlpai-lab/kullm-polyglot-5.8b-v2
  2. Polyglot-ko 12.8B based, v2 -> 🤗 nlpai-lab/kullm-polyglot-12.8b-v2
  3. Polyglot-ko 12.8B based, v1 -> 🤗 metterian/kullm-polyglot-12.8b-v1
    • Dataset v1: GPT4ALL

Models built on Meta's LLaMA backbone performed poorly on Korean in our tests, so we decided not to release them. We plan to train and release LLMs with strong Korean performance in the future.


KULLM λͺ¨λΈ μ‹€ν–‰ μ˜ˆμ‹œ μ½”λ“œ

Huggingface Pipeline으둜 μ‹€ν–‰

  • μ΅œμ‹ λ²„μ „ torch / HF 라이브러리 μ„€μΉ˜
pip install -U torch transformers tokenizers accelerate

μ•„λž˜ 예제 μ½”λ“œλ‘œ μ‹€ν–‰ν•΄λ³Ό 수 μžˆμŠ΅λ‹ˆλ‹€.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

from utils.prompter import Prompter  # helper shipped in this repository

MODEL = "nlpai-lab/kullm-polyglot-5.8b-v2"

model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(device="cuda", non_blocking=True)
model.eval()

pipe = pipeline("text-generation", model=model, tokenizer=MODEL, device=0)

prompter = Prompter("kullm")


def infer(instruction="", input_text=""):
    prompt = prompter.generate_prompt(instruction, input_text)
    output = pipe(prompt, max_length=512, temperature=0.2, num_beams=5, eos_token_id=2)
    s = output[0]["generated_text"]
    result = prompter.get_response(s)

    return result


result = infer(input_text="고려대학교에 대해서 알려줘")
print(result)
# '고려대학교에 대해 궁금한 점이 있으시면 언제든지 문의해 주세요. 고려대학교는 한국에서 가장 오래되고 권위 있는 대학교 중 하나로, 고려대학교의 역사는 한국의 역사와 함께해 왔습니다. 고려대학교는 학문적 우수성을 추구하는 동시에 사회적 책임을 다하기 위해 최선을 다하고 있습니다. 고려대학교는 학생, 교수진, 교직원을 위한 다양한 프로그램과 지원을 제공하는 것으로 유명합니다. 고려대학교는 한국의 정치, 경제, 사회 분야에서 중요한 역할을 담당하고 있습니다. 고려대학교에 대해 더 자세히 알고 싶으신가요?'

Dataset

KULLM dataset v2

HuggingFace Datasets

KULLM dataset v2 merges the GPT-4-LLM, Vicuna, and Databricks Dolly datasets; all of them were translated into Korean with DeepL.

GPT4ALL is an instruction-tuned, assistant-style language model, while the Vicuna and Dolly datasets are used for a wide range of natural-language-processing tasks. In particular, Dolly is a language model trained on instruction/response fine-tuning records.

from datasets import load_dataset

ds = load_dataset("nlpai-lab/kullm-v2")  # returns a DatasetDict with a single "train" split
ds
DatasetDict({
    train: Dataset({
        features: ['id', 'instruction', 'input', 'output'],
        num_rows: 152630
    })
})

KULLM dataset v1

KULLM dataset v1 is based on GPT4ALL.

Dataset example

As shown below, each GPT4ALL record consists of an Instruction part, an Input, and an Output.

{
    "id": "user_oriented_task_235",
    "motivation_app": "Yelp",
    "instruction": "전문 분야에 따라 레스토랑, 홈 서비스, 자동차 서비스, 기타 중 하나로 비즈니스를 분류합니다.",
    "instances": [
        {
            "input": "견적을 받으려면 650-636-4884로 전화하거나 웹사이트를 방문하세요. 이 매장은 신품 타이어 및 일반 자동차 수리를 전문으로 합니다. 모든 타이어를 자체적으로 보유하고 있으며 예산이나 차량 특성에 맞는 다양한 타이어를 보유하고 있습니다. 어떤 타이어가 필요한지 잘 모르시겠다면 전문가가 상주하여 고객의 요구에 가장 적합한 타이어를 선택할 수 있도록 도와드립니다. 또한 상용차 타이어도 취급하고 있어 다양한 차량에 맞는 타이어를 제공할 수 있습니다.",
            "output": "Auto Services"
        }
    ]
}

The Korean-translated dataset is stored in kullm-v2.jsonl.
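Since the file is plain JSON Lines (one record per line), it can also be read without the `datasets` library. A minimal sketch, using an illustrative record rather than the actual file:

```python
import json
import tempfile

def load_jsonl(path):
    """Read one JSON object per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Demo with a tiny temporary file shaped like the record above.
sample = {"id": "user_oriented_task_235", "instruction": "...", "instances": []}
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False, encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
    tmp_path = f.name

records = load_jsonl(tmp_path)
print(len(records), records[0]["id"])
```

In practice you would point `load_jsonl` at kullm-v2.jsonl instead of the temporary file.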


Training with LoRA

KULLM is a Korean model trained by fine-tuning the Polyglot 12.8B model with Low-Rank Adaptation (LoRA).

Training ran on four A100 80GB GPUs. The training code is based on tloen/alpaca-lora.

KULLM v2

🤗 Huggingface Repo: https://huggingface.co/nlpai-lab/kullm-polyglot-12.8b-v2

The model was trained on KULLM dataset v2 (GPT4ALL, Dolly, Vicuna) for a total of 8 epochs on four A100 80GB GPUs.

KULLM v1

🤗 Huggingface Repo: https://huggingface.co/metterian/kullm-polyglot-12.8b-v1

The model was trained on KULLM dataset v1 (GPT4ALL) for a total of 5 epochs on four A100 80GB GPUs.

Dependency

  1. λ‹€μŒ λͺ…λ Ήμ–΄λ₯Ό 톡해 ν•„μš”ν•œ νŒ¨ν‚€μ§€λ₯Ό μ„€μΉ˜:
pip install -r requirements.txt
  1. λ§Œμ•½ bitsandbytesκ°€ μž‘λ™ν•˜μ§€ μ•ŠλŠ”λ‹€λ©΄, μ†ŒμŠ€μ—μ„œ 직접 μ„€μΉ˜ν•˜μ„Έμš”. μœˆλ„μš° μ‚¬μš©μžλŠ” λ‹€μŒμ˜ μ„€λͺ…μ„œλ₯Ό μ°Έμ‘°ν•˜μ„Έμš”.

Training (finetune_polyglot.py)

This file applies Parameter-Efficient Fine-Tuning (PEFT) to the Polyglot model and contains the code for prompt construction and tokenization.

Usage example:

python finetune_polyglot.py \
--base_model='EleutherAI/polyglot-ko-12.8b' \
--data_path='./data/kullm-v2.jsonl'

λ‹€μŒκ³Ό 같이 ν•˜μ΄νΌνŒŒλΌλ―Έν„°λ₯Ό μ‘°μ • κ°€λŠ₯ν•©λ‹ˆλ‹€:

python -m torch.distributed.launch  --master_port=34322  --nproc_per_node 4 finetune_polyglot.py \
    --fp16 \
    --base_model 'EleutherAI/polyglot-ko-12.8b' \
    --data_path data/kullm-v2.jsonl \
    --output_dir ckpt/$SAVE_DIR \
    --prompt_template_name kullm \
    --batch_size 128 \
    --micro_batch_size 4 \
    --num_epochs $EPOCH \
    --learning_rate $LR \
    --cutoff_len 512 \
    --val_set_size 2000 \
    --lora_r 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --lora_target_modules "[query_key_value, xxx]" \
    --train_on_inputs \
    --logging_steps 1 \
    --eval_steps 40 \
    --weight_decay 0. \
    --warmup_steps 0 \
    --warmup_ratio 0.1 \
    --lr_scheduler_type "cosine" \
    --group_by_length
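The batch flags above imply gradient accumulation: each optimizer step accumulates per-GPU micro-batches until the effective batch size of 128 is reached. The implied accumulation step count:

```python
batch_size = 128       # effective (global) batch size
micro_batch_size = 4   # examples per GPU per forward/backward pass
world_size = 4         # number of A100 GPUs (nproc_per_node)

grad_accum_steps = batch_size // (micro_batch_size * world_size)
print(grad_accum_steps)  # 8
```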

Evaluation

  • λŒ€ν™” 평가 λ©”νŠΈλ¦­ (Dialogue Evaluation Metric)을 μ‚¬μš©ν•˜μ—¬ λͺ¨λΈ κ°„ ν•œκ΅­μ–΄ λŒ€ν™”λ₯Ό 평가 ν–ˆμŠ΅λ‹ˆλ‹€. λŒ€ν™” 평가 λ©”νŠΈλ¦­μ€ G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment (Yang Liu. et. al. 2023)κ³Ό USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation (Shikib Mehri. et. al. 2020)을 ν™œμš©ν•˜μ—¬ 평가 Promptλ₯Ό κ΅¬μ„±ν–ˆμŠ΅λ‹ˆλ‹€.
  • 평가 λͺ¨λΈμ€ GPT-4λ₯Ό μ‚¬μš©ν•˜μ˜€κ³ , 평가 데이터셋은 yizhongw/self-instruct의 휴먼 평가 데이터셋인 user_oriented_instructions.jsonl을 deepl둜 λ²ˆμ—­ν•œ 데이터셋을 μ‚¬μš©ν•˜μ˜€μŠ΅λ‹ˆλ‹€.
  • ν•΄λ‹Ή 데이터셋은 user_oriented_instructions_eval.jsonl에 μ €μž₯λ˜μ–΄ μžˆμŠ΅λ‹ˆλ‹€.

eval_result

  • κ·Έλž˜ν”„μ˜ 값은 0-100점으둜 μŠ€μΌ€μΌλ§ λ˜μ—ˆμŠ΅λ‹ˆλ‹€.

LLM Inference Results for Korean Evaluation Set

| Type   | Base model        | Model         | Understandability (0-1) | Naturalness (1-3) | Maintains context (1-3) | Interestingness (1-3) | Uses instruction (0-1) | Overall quality (1-5) |
|--------|-------------------|---------------|-------------------------|-------------------|-------------------------|-----------------------|------------------------|-----------------------|
| Closed | GPT3.5-turbo      | GPT-3.5       | 0.980                   | 2.806             | 2.849                   | 2.056                 | 0.917                  | 3.905                 |
| Closed | GPT-4             | GPT-4         | 0.984                   | 2.897             | 2.944                   | 2.143                 | 0.968                  | 4.083                 |
| Open   | Polyglot-ko-12.8b | KoAlpaca v1.1 | 0.651                   | 1.909             | 1.901                   | 1.583                 | 0.385                  | 2.575                 |
| Open   | LLaMA-7b          | koVicuna      | 0.460                   | 1.583             | 1.726                   | 1.528                 | 0.409                  | 2.440                 |
| Open   | Polyglot-ko-12.8b | KULLM v2      | 0.742                   | 2.083             | 2.107                   | 1.794                 | 0.548                  | 3.036                 |

Prompt

You are given a conversation between two people. You will receive an Instruction and an Input, followed by a Response to that instruction and input.
Your task is to rate the Response according to the evaluation steps.
It is important that you read and understand these evaluation criteria carefully. Keep this document open while rating and refer to it as needed.

Evaluation criteria:
- Understandability (0 - 1): Given the Input, can the Response be understood?
- Naturalness (1 - 3): Is it an Instruction a person would naturally say?
- Maintains context (1 - 3): Considering the Input, does the Response keep the context?
- Interestingness (1 - 3): Is the Response dull or interesting?
- Uses instruction (0 - 1): Was the Response generated based on the Instruction?
- Overall quality (1 - 5): Based on the answers above, what is your impression of the overall quality of this utterance?

Evaluation steps:
1. Read the Instruction, Input, and Response carefully.
2. Rate the Response on the evaluation criteria above.

Instruction:
{{instruction}}

Input:
{{input}}

Response:
{{response}}


Result
- Understandability (0 - 1):
- Naturalness (1 - 3):
- Maintains context (1 - 3):
- Interestingness (1 - 3):
- Uses instruction (0 - 1):
- Overall quality (1 - 5):


Citation

Please cite this repository if you use its data or code.

@inproceedings{lee2023kullm,
  title={KULLM: Learning to Construct Korean Instruction-following Large Language Models},
  author={Lee, SeungJun and Lee, Taemin and Lee, Jeongwoo and Jang, Yoona and Lim, Heuiseok},
  booktitle={Annual Conference on Human and Language Technology},
  pages={196--202},
  year={2023},
  organization={Human and Language Technology}
}
@misc{kullm,
  author = {NLP & AI Lab and Human-Inspired AI research},
  title = {KULLM: Korea University Large Language Model Project},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/nlpai-lab/kullm}},
}
