# Role
You are a Senior Machine Learning Engineer specializing in Large Language Model fine-tuning. You have extensive experience with PEFT methods (LoRA, QLoRA), distributed training, and optimizing models for specific domains while managing compute costs and deployment constraints.
## Task
Design a complete fine-tuning pipeline for [BASE_MODEL] to excel at [TARGET_TASK]. Optimize for [CONSTRAINTS] while achieving state-of-the-art performance on the target domain.
## Fine-Tuning Strategy Selection
### Method Comparison
```
Fine-Tuning Approaches:
Full Fine-Tuning:
├── Pros: Maximum flexibility, best performance potential
├── Cons: Requires massive compute, risk of catastrophic forgetting
├── Cost: $$$ (8x A100s for 7B model)
└── When: Maximum performance critical, abundant compute
LoRA (Low-Rank Adaptation):
├── Pros: ~99% fewer trainable parameters, faster training, smaller checkpoints
├── Cons: Slight performance trade-off, rank selection critical
├── Cost: $ (single GPU viable)
└── When: Most production use cases, resource constraints
QLoRA (Quantized LoRA):
├── Pros: Train 65B models on single 48GB GPU
├── Cons: Slower training, quantization artifacts possible
├── Cost: $ (consumer GPU possible)
└── When: Ultra-large models, limited hardware
Prompt Tuning:
├── Pros: Near-zero parameters, task switching
├── Cons: Limited capacity, prompt engineering needed
├── Cost: Minimal
└── When: Many tasks, rapid iteration, edge deployment
```
### Architecture Decisions
```python
# LoRA Configuration Template
from peft import LoraConfig, TaskType, get_peft_model

lora_config = LoraConfig(
    r=16,                    # Rank: 8-64 typical range
    lora_alpha=32,           # Scaling: usually 2*r
    target_modules=[         # Layers to adapt
        "q_proj", "v_proj",  # Minimum: attention layers
        # "k_proj", "o_proj",                  # Add for better performance
        # "gate_proj", "up_proj", "down_proj"  # MLP layers
    ],
    lora_dropout=0.05,       # Regularization: 0.01-0.1
    bias="none",             # "none", "all", "lora_only"
    task_type=TaskType.CAUSAL_LM,
    use_rslora=False,        # Rank-stabilized LoRA for large ranks
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# Expected: ~0.1-1% of original parameters
```
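The ~0.1-1% figure can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes hypothetical Llama-2-7B-style dimensions (hidden size 4096, 32 layers) and the minimal `q_proj`/`v_proj` target set:

```python
# Back-of-envelope LoRA trainable-parameter count.
# Dimensions below are assumed Llama-2-7B-style values, for illustration only.
HIDDEN = 4096
NUM_LAYERS = 32
TOTAL_PARAMS = 7_000_000_000  # nominal 7B base model

def lora_params(r, modules_per_layer, d_in=HIDDEN, d_out=HIDDEN):
    """Each adapted linear layer adds A (r x d_in) plus B (d_out x r) weights."""
    per_module = r * (d_in + d_out)
    return per_module * modules_per_layer * NUM_LAYERS

trainable = lora_params(r=16, modules_per_layer=2)  # q_proj + v_proj
print(trainable)                          # 8388608 (~8.4M)
print(f"{trainable / TOTAL_PARAMS:.2%}")  # ~0.12% of base parameters
```

Adding the remaining attention and MLP projections multiplies `modules_per_layer`, which is why the fraction can climb toward 1% with a full target set.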
## Data Pipeline Design
### Dataset Preparation
```
Data Pipeline Steps:
1. DATA COLLECTION
- Domain-specific corpora
- Instruction-following examples
- Conversation formats
- Target: 1,000-100,000 examples
2. QUALITY FILTERING
- Deduplication (MinHash, exact match)
- Toxicity filtering
- Language identification
- Length distribution analysis
3. FORMAT STANDARDIZATION
- Alpaca format
- ShareGPT format
- Custom template design
- Consistent special tokens
4. AUGMENTATION (Optional)
- Paraphrasing
- Back-translation
- Template variations
- Synthetic data generation
```
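Step 2 above (exact-match deduplication plus length filtering) can be sketched in plain Python; the field names (`instruction`, `output`) and thresholds are illustrative assumptions, and MinHash-based near-duplicate detection would be a further refinement:

```python
import hashlib

def dedup_and_filter(examples, min_chars=20, max_chars=8000):
    """Exact-match dedup (hash of normalized text) plus length filtering."""
    seen, kept = set(), []
    for ex in examples:
        text = (ex["instruction"] + "\n" + ex["output"]).strip().lower()
        if not (min_chars <= len(text) <= max_chars):
            continue  # drop degenerate or oversized examples
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in seen:
            continue  # exact duplicate after normalization
        seen.add(key)
        kept.append(ex)
    return kept

data = [
    {"instruction": "Summarize photosynthesis.",
     "output": "Plants convert light to chemical energy."},
    {"instruction": "Summarize photosynthesis.",
     "output": "Plants convert light to chemical energy."},  # duplicate
    {"instruction": "Hi", "output": "ok"},                   # too short
]
print(len(dedup_and_filter(data)))  # 1
```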
### Format Templates
```python
# Instruction Format (Alpaca-style)
ALPACA_TEMPLATE = """Below is an instruction that describes a task.
Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Input:
{input}
### Response:
{output}"""
# Chat Format (ShareGPT-style)
CHAT_TEMPLATE = """<|system|>
You are a helpful assistant.
<|user|>
{user_message}
<|assistant|>
{assistant_response}"""
```
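A small helper can render examples into the Alpaca-style template above (repeated here for self-containment). Dropping the `### Input:` block when no input is present follows the original Alpaca convention; the field names are assumptions:

```python
ALPACA_TEMPLATE = """Below is an instruction that describes a task.
Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Input:
{input}
### Response:
{output}"""

def format_example(ex):
    """Render one example; omit the Input block if the field is empty."""
    if ex.get("input"):
        return ALPACA_TEMPLATE.format(**ex)
    return ALPACA_TEMPLATE.replace("\n### Input:\n{input}", "").format(
        instruction=ex["instruction"], output=ex["output"]
    )

print(format_example({"instruction": "Name a prime.", "input": "", "output": "7"}))
```

Whichever template is chosen, the same renderer must be used at training and inference time, or the model will see a distribution it was never trained on.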
## Training Configuration
### Hyperparameter Selection
```yaml
# Training Configuration
training:
  # Optimization
  learning_rate: 2.0e-4        # LoRA: 1e-4 to 1e-3
  lr_scheduler: cosine         # cosine, linear, constant
  warmup_ratio: 0.03           # 3% of total steps
  # Training Length
  num_epochs: 3                # 1-5 typical
  max_steps: null              # Alternative to epochs
  # Batch Configuration
  per_device_batch_size: 4     # Based on GPU memory
  gradient_accumulation: 4     # Effective batch = 16
  # Memory Optimization
  gradient_checkpointing: true
  max_grad_norm: 0.3           # Gradient clipping
  # Efficiency
  bf16: true                   # Preferred where hardware supports it
  fp16: false                  # Enable instead if bf16 is unavailable
  optim: paged_adamw_8bit      # QLoRA optimization
  # Regularization
  weight_decay: 0.001
  dropout: 0.05
```
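Two quantities derived from this config are worth checking before launch: the effective batch size and the warmup step count. The sketch below assumes a single GPU and an illustrative dataset size of 10,000 examples:

```python
import math

def effective_batch(per_device, grad_accum, num_gpus=1):
    """Effective batch size seen by each optimizer step."""
    return per_device * grad_accum * num_gpus

def warmup_steps(num_examples, per_device, grad_accum, num_epochs,
                 warmup_ratio, num_gpus=1):
    """Warmup step count implied by warmup_ratio and total training steps."""
    batch = effective_batch(per_device, grad_accum, num_gpus)
    total_steps = math.ceil(num_examples / batch) * num_epochs
    return int(total_steps * warmup_ratio)

print(effective_batch(4, 4))                # 16, matching the config comment
print(warmup_steps(10_000, 4, 4, 3, 0.03))  # 56
```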
### Learning Rate Schedules
```
LR Schedule Selection:
Constant with Warmup:
- Simple, stable
- Good for short fine-tuning (1-2 epochs)
Cosine Decay:
- Smooth convergence
- Good for longer training (3+ epochs)
- Recommended default
Linear Decay:
- Aggressive reduction
- Good when overfitting concerns exist
Polynomial:
- Tunable decay rate
- Fine-grained control
```
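The recommended cosine-with-warmup schedule is simple enough to write out explicitly. This mirrors the shape of common scheduler implementations but is an independent sketch, not a library API:

```python
import math

def cosine_lr(step, total_steps, base_lr=2e-4, warmup_steps=0):
    """Linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 1000
print(cosine_lr(0, total, warmup_steps=30))     # 0.0 (start of warmup)
print(cosine_lr(30, total, warmup_steps=30))    # 2e-4 (warmup complete)
print(cosine_lr(1000, total, warmup_steps=30))  # ~0.0 (fully decayed)
```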
## Evaluation Framework
### Metrics by Task Type
```
Task-Specific Metrics:
Classification:
- Accuracy, F1, Precision, Recall
- Confusion matrix analysis
- Per-class performance
Generation:
- BLEU, ROUGE, METEOR
- Perplexity on held-out set
- Human evaluation for quality
Question Answering:
- Exact Match (EM)
- F1 score (token overlap)
- Retrieval accuracy
Coding:
- Pass@k (execution-based)
- Syntax correctness
- Test case success rate
```
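For question answering, Exact Match and token-overlap F1 need no external libraries. The normalization here (lowercasing, whitespace split) is a simplified version of the SQuAD convention, which additionally strips punctuation and articles:

```python
from collections import Counter

def normalize(text):
    """Simplified normalization: lowercase and whitespace-tokenize."""
    return text.lower().split()

def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    """Harmonic mean of token-level precision and recall."""
    p, g = normalize(pred), normalize(gold)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))              # 1.0
print(token_f1("in the city of Paris", "Paris"))  # ~0.333
```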
### Evaluation Protocol
```python
# Evaluation Pipeline
# The helper functions called below are placeholders to be implemented per task.
def evaluate_model(model, eval_dataset):
    results = {}
    # Automatic metrics
    results['perplexity'] = calculate_perplexity(model, eval_dataset)
    results['generation'] = evaluate_generation_quality(model, eval_dataset)
    # Task-specific
    results['task_accuracy'] = run_task_eval(model, eval_dataset)
    # Comparison to baseline
    results['improvement'] = compare_to_baseline(results)
    # Catastrophic forgetting check
    results['general_knowledge'] = eval_general_capabilities(model)
    return results
```
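Of the metrics above, perplexity has the simplest definition: the exponential of the mean per-token negative log-likelihood. A library-free sketch, with hypothetical NLL values:

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Hypothetical per-token NLLs (natural log) from a held-out set
print(perplexity([2.0, 2.0, 2.0]))  # e^2, roughly 7.389
print(perplexity([0.0, 0.0]))       # 1.0 (perfect prediction)
```

In practice the NLLs come from the model's cross-entropy loss on held-out text; comparisons are only meaningful between models sharing a tokenizer.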
## Deployment Optimization
### Model Merging
```python
# Merge LoRA with base model for deployment
import torch
from peft import AutoPeftModelForCausalLM

# Load and merge
model = AutoPeftModelForCausalLM.from_pretrained(
    "path/to/lora/weights",
    torch_dtype=torch.float16,
)
merged_model = model.merge_and_unload()

# Save merged model
merged_model.save_pretrained("path/to/merged/model")
```
### Quantization for Inference
```python
# GGUF/GGML for llama.cpp; AWQ/GPTQ for GPU inference
from awq import AutoAWQForCausalLM  # pip install autoawq
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_pretrained("model_name")
tokenizer = AutoTokenizer.from_pretrained("model_name")

# Quantize, then save for inference
model.quantize(
    tokenizer,
    quant_config={"zero_point": True, "q_group_size": 128,
                  "w_bit": 4, "version": "GEMM"},
)
model.save_quantized("model_name-awq")
```
## Variables
- **BASE_MODEL**: Foundation model (e.g., "Llama-2-7b", "Mistral-7B", "CodeLlama-13b")
- **TARGET_TASK**: Domain or task (e.g., "medical question answering", "code generation", "legal document analysis")
- **CONSTRAINTS**: Hardware/cost limits (e.g., "single A100 GPU", "edge deployment", "minimal training time")
- **DATASET_SIZE**: Approximate training examples available