Model
The choice of model determines representational capacity, adapter size, memory footprint, and convergence behavior, and therefore directly influences achievable accuracy, training stability, and hardware requirements. Selecting a model for fine-tuning means balancing performance, computational cost, and hardware constraints: efficiency is highest when the architecture matches the requirements of the target task.
Objective Definition
Clarity about the purpose and intended output determines the necessary architecture. The table below summarizes the most common use cases and their corresponding architectures.
| Purpose | Architecture | Use Case | Example | Notes |
|---|---|---|---|---|
| Chat | Decoder | Conversational Agent, Tool Calling, Thinking | Llama-3.1-8B-Instruct | Use instruction-tuned (Instruct) models that are optimized to follow prompts, maintain dialogue context, and produce well-structured, aligned responses suitable for interactive and agentic workflows. |
| Text Generation | Decoder | Code Completion, Sentence Completion (Used in email clients or writing tools) | Llama-3.1-8B | Use base models for raw next-token completion, where no chat formatting or instruction-following behavior is required. |
| Similarity | Bi-Encoder | Similarity Search, RAG embeddings | bge-m3 | Produces fixed-size vector embeddings optimized for cosine or dot-product similarity. Not suitable for text generation or conversational tasks. Designed for high-throughput, low-latency embedding. |
| Reranking | Cross-Encoder | Search result reranking, retrieval refinement, relevance scoring, RAG ranking | bge-reranker-v2-m3 | Scores query & document pairs jointly, producing highly accurate relevance ranking. More compute-intensive than embedding models, typically applied to a small candidate set after initial retrieval. |
| Multi-label Classification | Encoder | Tagging, topic assignment, content moderation, intent detection | bert-base-uncased, roberta-base | Predicts multiple labels per input simultaneously (Multi-Hot Vector). Uses sigmoid activation and thresholding instead of softmax. Suitable when labels are non-exclusive and may overlap. |
| Multi-class Classification | Encoder | Intent classification, document categorization, sentiment analysis | bert-base-uncased, roberta-base | Outputs a one-hot vector with a single active class. Uses softmax activation and cross-entropy loss. Suitable when classes are mutually exclusive. |
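The last two rows differ only in the output head. A minimal pure-Python sketch of that difference, using illustrative logits (the numbers and the 0.5 threshold are assumptions, not values from any specific model):

```python
import math

def sigmoid(x):
    """Logistic function, applied independently per label."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    """Normalized distribution over mutually exclusive classes."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative raw logits from a classification head over four labels.
logits = [2.0, -1.0, 0.5, -3.0]

# Multi-label: independent sigmoid per label, thresholded at 0.5 -> multi-hot vector.
multi_label = [1 if sigmoid(v) >= 0.5 else 0 for v in logits]

# Multi-class: softmax over all labels, single argmax -> one active class.
probs = softmax(logits)
multi_class = probs.index(max(probs))
```

Because sigmoids are independent, any number of labels can fire at once (here two), whereas softmax always commits to exactly one class.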
Base vs. Instruct Model Selection
- Base Models: Trained on raw text for next-token prediction. These serve as the foundation when introducing a model to entirely new languages or highly specialized technical vocabularies.
- Instruct Models: Pre-aligned to follow directions. These are preferable for refining specific behaviors, adjusting response tones, or enforcing strict output formats (e.g., JSON).
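Instruct models expect prompts wrapped in the model's chat template; in practice this is handled by `tokenizer.apply_chat_template` from the `transformers` library. As a hedged illustration only, a hand-rolled formatter in the published Llama 3 style (special-token names follow Meta's documented format; always verify against the actual tokenizer):

```python
# Hand-rolled illustration of the Llama 3 chat format. In practice use
# transformers' tokenizer.apply_chat_template, which applies the model's
# own template rather than a hard-coded one.
def format_llama3_chat(messages):
    parts = ["<|begin_of_text|>"]
    for msg in messages:
        parts.append(
            f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
            f"{msg['content']}<|eot_id|>"
        )
    # Cue the model to generate the assistant turn next.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

prompt = format_llama3_chat([
    {"role": "system", "content": "Reply in JSON."},
    {"role": "user", "content": "What is 2 + 2?"},
])
```

Feeding a base model this structure yields unpredictable output; an instruct model trained on the template will respect the role boundaries and the system directive.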
Parameter Size and Hardware Requirements
Model size largely determines which use cases a model can handle effectively.
| Model Size | Optimal Use Case |
|---|---|
| 1B - 3B | Edge devices, mobile applications, simple classification. |
| 7B - 8B | General reasoning, logic tasks, and standard coding assistance. |
| 14B - 30B | Complex domain-specific logic (medical, legal, or scientific). |
Model size, measured in billions of parameters (B), dictates the VRAM required for fine-tuning.
| Model Parameters | QLoRA (4-bit) VRAM | LoRA (16-bit) VRAM |
|---|---|---|
| 3B | ~3.5 GB | ~8 GB |
| 7B | ~5 GB | ~19 GB |
| 8B | ~6 GB | ~22 GB |
| 9B | ~6.5 GB | ~24 GB |
| 11B | ~7.5 GB | ~29 GB |
| 14B | ~8.5 GB | ~33 GB |
| 27B | ~22 GB | ~64 GB |
| 32B | ~26 GB | ~76 GB |
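The dominant term in these figures is the base weights: parameters × bits per parameter. A minimal sketch of that calculation (an approximation only; the table's totals are higher because LoRA adds adapter weights, optimizer state, gradients, and activations on top):

```python
def weight_memory_gb(params_billion, bits):
    """Approximate memory for model weights alone, in GB.

    params_billion * 1e9 parameters, each taking bits/8 bytes.
    This deliberately ignores adapters, optimizer state, and
    activations, so it is a lower bound on fine-tuning VRAM.
    """
    return params_billion * bits / 8

# 7B model: base weights at 4-bit vs 16-bit precision.
q4 = weight_memory_gb(7, 4)     # 3.5 GB weights; table shows ~5 GB QLoRA total
fp16 = weight_memory_gb(7, 16)  # 14 GB weights; table shows ~19 GB LoRA total
```

Comparing the sketch to the table shows the fine-tuning overhead beyond raw weights is on the order of a few GB for 7B-class models.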
Using LoRA (Low-Rank Adaptation) or QLoRA (quantized LoRA) makes it possible to fine-tune 7B+ models on consumer-grade hardware, such as a single 24 GB GPU.
Selection Criteria
Key metrics on model repositories provide essential guidance:
- Context Window: Determines the maximum data volume processed in a single pass. Long-form document analysis requires models with 32k to 128k context windows. Larger context windows significantly increase VRAM usage, as memory consumption scales with sequence length due to attention key/value caches, directly impacting batch size, concurrency, and hardware requirements.
Protean AI provides deployment recommendations, including estimated VRAM usage based on context window size, and allows the context window to be configured at deployment time.
- Licensing: Compliance with commercial requirements (e.g., Apache 2.0, MIT, or Llama 3 Community License) is mandatory for enterprise deployment.
- Benchmarks: MMLU (General knowledge) and HumanEval (Coding) scores offer standardized performance comparisons.
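The KV-cache scaling noted under Context Window can be estimated directly: keys and values each store `n_kv_heads × head_dim` elements per token per layer. A sketch with illustrative Llama-3-8B-style defaults (32 layers, 8 KV heads under grouped-query attention, head dim 128, fp16 cache; these are assumed here for illustration, not quoted from a model card):

```python
def kv_cache_gib(seq_len, batch=1, n_layers=32, n_kv_heads=8,
                 head_dim=128, bytes_per_elem=2):
    """Estimate attention KV-cache size in GiB.

    Factor of 2 covers keys and values; each token stores
    n_kv_heads * head_dim elements per layer for each.
    Defaults are illustrative 8B-class values, not exact.
    """
    elems = 2 * batch * seq_len * n_layers * n_kv_heads * head_dim
    return elems * bytes_per_elem / (1024 ** 3)

# Under these assumptions: 8k context -> 1 GiB of cache per sequence;
# a 128k context -> 16 GiB, before counting the weights themselves.
```

This is why a long context window can dominate the hardware budget even when the weights fit comfortably, and why concurrency multiplies the cost: the cache scales linearly with both sequence length and batch size.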
Implementation Checklist
- Identify Task Category: Match the objective (chat, generation, similarity, reranking, or classification) to the corresponding architecture.
- Audit Hardware: Confirm available VRAM against the LoRA/QLoRA requirements for the candidate model size.
- Select Scale: Determine if a 7B/8B model provides the necessary balance of speed and intelligence.
- Verify Licensing: Ensure the model permits the intended commercial or research use.
- Baseline Testing: Evaluate the model's performance without fine-tuning to establish a performance floor.
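The baseline-testing step can be sketched as a small scoring loop. Here `generate_fn` is a hypothetical placeholder for whatever inference call is in use (a local pipeline, an API client), and exact match is the simplest possible metric; swap in task-appropriate scoring such as F1, pass@k, or cosine similarity:

```python
def exact_match_baseline(generate_fn, eval_set):
    """Score a model before fine-tuning to establish a performance floor.

    generate_fn: callable prompt -> completion (placeholder for your
    actual inference call). eval_set: list of (prompt, expected) pairs.
    Returns the fraction of exact matches after whitespace stripping.
    """
    hits = sum(
        1 for prompt, expected in eval_set
        if generate_fn(prompt).strip() == expected.strip()
    )
    return hits / len(eval_set)
```

Recording this floor before training makes it unambiguous whether fine-tuning actually improved the model or merely matched what the base checkpoint could already do.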