
Model

The choice of model determines representational capacity, adapter size, memory footprint, and convergence behavior, and it directly influences achievable accuracy, training stability, and hardware requirements. Selecting an appropriate model for fine-tuning therefore means balancing performance, computational cost, and hardware constraints: efficiency is highest when the model architecture aligns with the specific requirements of the target task.

Objective Definition

Clarity on the purpose and intended output determines the necessary architecture. The table below summarizes the most common use cases and their corresponding architectures.

| Purpose | Architecture | Use Case | Example | Notes |
| --- | --- | --- | --- | --- |
| Chat | Decoder | Conversational agents, tool calling, thinking | Llama-3.1-8B-Instruct | Use instruction-tuned (Instruct) models that are optimized to follow prompts, maintain dialogue context, and produce well-structured, aligned responses suitable for interactive and agentic workflows. |
| Text Generation | Decoder | Code completion, sentence completion (used in email clients or writing tools) | Llama-3.1-8B | General reasoning, logic tasks, and standard coding assistance. |
| Similarity | Bi-Encoder | Similarity search, RAG embeddings | bge-m3 | Produces fixed-size vector embeddings optimized for cosine or dot-product similarity. Not suitable for text generation or conversational tasks. Designed for high-throughput, low-latency embedding. |
| Reranking | Cross-Encoder | Search result reranking, retrieval refinement, relevance scoring, RAG ranking | bge-reranker-v2-m3 | Scores query-document pairs jointly, producing highly accurate relevance ranking. More compute-intensive than embedding models; typically applied to a small candidate set after initial retrieval. |
| Multi-label Classification | Encoder | Tagging, topic assignment, content moderation, intent detection | bert-base-uncased, roberta-base | Predicts multiple labels per input simultaneously (multi-hot vector). Uses sigmoid activation and thresholding instead of softmax. Suitable when labels are non-exclusive and may overlap. |
| Multi-class Classification | Encoder | Intent classification, document categorization, sentiment analysis | bert-base-uncased, roberta-base | Outputs a one-hot vector with a single active class. Uses softmax activation and cross-entropy loss. Suitable when classes are mutually exclusive. |
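The sigmoid-vs-softmax distinction in the last two rows can be sketched numerically. The labels, logits, and 0.5 threshold below are illustrative, not taken from any real model:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Mutually exclusive classes: probabilities sum to 1."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

def sigmoid(logits: np.ndarray) -> np.ndarray:
    """Independent labels: each score is judged on its own."""
    return 1.0 / (1.0 + np.exp(-logits))

labels = ["billing", "technical", "complaint"]
logits = np.array([2.1, 0.3, -1.2])  # raw classifier-head outputs

# Multi-class: pick exactly one class via argmax over softmax.
multi_class = labels[int(np.argmax(softmax(logits)))]

# Multi-label: keep every label whose sigmoid score clears a threshold.
multi_label = [l for l, p in zip(labels, sigmoid(logits)) if p >= 0.5]

print(multi_class)  # billing
print(multi_label)  # ['billing', 'technical']
```

Note that the same logits yield one class under softmax but two labels under sigmoid thresholding, which is exactly why overlapping labels require the sigmoid head.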

Base vs. Instruct Model Selection

  • Base Models: Trained on raw text for next-token prediction. These serve as the foundation when introducing a model to entirely new languages or highly specialized technical vocabularies.
  • Instruct Models: Pre-aligned to follow directions. These are preferable for refining specific behaviors, adjusting response tones, or enforcing strict output formats (e.g., JSON).

Parameter Size and Hardware Requirements

Model size largely determines which use cases a model can serve well.

| Model Size | Optimal Use Case |
| --- | --- |
| 1B - 3B | Edge devices, mobile applications, simple classification. |
| 7B - 8B | General reasoning, logic tasks, and standard coding assistance. |
| 14B - 30B | Complex domain-specific logic (medical, legal, or scientific). |

Model size, measured in billions of parameters (B), dictates the VRAM required to fine-tune the model.

| Model Parameters | QLoRA (4-bit) VRAM | LoRA (16-bit) VRAM |
| --- | --- | --- |
| 3B | ~3.5 GB | ~8 GB |
| 7B | ~5 GB | ~19 GB |
| 8B | ~6 GB | ~22 GB |
| 9B | ~6.5 GB | ~24 GB |
| 11B | ~7.5 GB | ~29 GB |
| 14B | ~8.5 GB | ~33 GB |
| 27B | ~22 GB | ~64 GB |
| 32B | ~26 GB | ~76 GB |
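The table's figures can be approximated with a rough rule of thumb. The constants below are assumptions for illustration, not an exact memory model: base weights at ~0.5 bytes/parameter (4-bit) or 2 bytes/parameter (16-bit), plus ~35% overhead for adapters, gradients, optimizer state, and activations at a modest sequence length:

```python
def estimate_finetune_vram_gb(params_billion: float, method: str = "qlora") -> float:
    """Back-of-the-envelope VRAM estimate for adapter-based fine-tuning.

    Assumed bytes per parameter for the frozen base weights:
      qlora -> 0.5 (4-bit quantized), lora -> 2.0 (16-bit).
    A flat 35% overhead factor stands in for adapters, gradients,
    optimizer state, and activations.
    """
    bytes_per_param = {"qlora": 0.5, "lora": 2.0}[method]
    weights_gb = params_billion * bytes_per_param  # 1e9 params * bytes ~= GB
    return round(weights_gb * 1.35, 1)

print(estimate_finetune_vram_gb(8, "qlora"))  # 5.4  (table: ~6 GB)
print(estimate_finetune_vram_gb(8, "lora"))   # 21.6 (table: ~22 GB)
```

Real usage also depends on sequence length, batch size, and gradient checkpointing, so treat the table (measured values) as authoritative and the formula as a sanity check.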
> **Tip:** Implementing LoRA (Low-Rank Adaptation) or QLoRA enables fine-tuning of 7B+ models on consumer-grade hardware, such as a single 24 GB GPU.
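The reason LoRA is so memory-efficient can be shown in a minimal sketch. The dimensions (hidden size 4096, rank 8, scaling 16) are illustrative choices, and the pure-NumPy layer below is a toy, not a real training implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 4096, 4096, 8   # rank r is the main LoRA knob
alpha = 16                        # LoRA scaling hyperparameter

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, init 0

def lora_forward(x: np.ndarray) -> np.ndarray:
    # Frozen base path plus a trainable low-rank delta (B @ A).
    # With B initialized to zero, the layer starts identical to the base model.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

full = W.size
adapter = A.size + B.size
print(f"trainable fraction: {adapter / full:.4%}")  # trainable fraction: 0.3906%
```

Only `A` and `B` receive gradients and optimizer state, so the optimizer footprint shrinks by the same ~99.6% as the trainable-parameter count; QLoRA additionally stores the frozen `W` in 4-bit precision.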

Selection Criteria

Key metrics on model repositories provide essential guidance:

  • Context Window: Determines the maximum data volume processed in a single pass. Long-form document analysis requires models with 32k to 128k context windows. Larger context windows significantly increase VRAM usage, as memory consumption scales with sequence length due to attention key/value caches, directly impacting batch size, concurrency, and hardware requirements.
> **Tip:** Protean AI provides deployment recommendations, including suggested VRAM usage based on context window size, and allows the context window size to be configured during deployment.

  • Licensing: Compliance with commercial requirements (e.g., Apache 2.0, MIT, or Llama 3 Community License) is mandatory for enterprise deployment.
  • Benchmarks: MMLU (General knowledge) and HumanEval (Coding) scores offer standardized performance comparisons.
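The KV-cache growth noted under Context Window can be made concrete. The sketch below assumes a Llama-3.1-8B-style attention configuration (32 layers, 8 grouped-query KV heads, head dimension 128) and fp16 (2-byte) caches; verify these values against the model's own config before relying on them:

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 seq_len: int, batch: int = 1, bytes_per_elem: int = 2) -> float:
    """Memory held by attention key/value caches.

    The leading factor of 2 covers the separate K and V tensors
    stored per layer; bytes_per_elem=2 assumes fp16/bf16 caches.
    """
    total = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem
    return total / 2**30  # bytes -> GiB

# Assumed Llama-3.1-8B-style config: 32 layers, 8 KV heads, head_dim 128.
print(kv_cache_gib(32, 8, 128, 8_192))    # 1.0  GiB at an 8k window
print(kv_cache_gib(32, 8, 128, 131_072))  # 16.0 GiB at a 128k window
```

Because the cache scales linearly with both sequence length and batch size, a 16x larger window costs 16x the cache memory per request, which is why long-context deployments trade away batch size and concurrency.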

Implementation Checklist

  1. Identify Task Category: Match the objective (chat, text generation, embedding, reranking, or classification) to an architecture from the table above.
  2. Audit Hardware: Confirm available GPU VRAM against the requirements of the chosen model size and fine-tuning method (LoRA vs. QLoRA).
  3. Select Scale: Determine if a 7B/8B model provides the necessary balance of speed and intelligence.
  4. Verify Licensing: Ensure the model permits the intended commercial or research use.
  5. Baseline Testing: Evaluate the model's performance without fine-tuning to establish a performance floor.