
Fine-Tuning

Fine-tuning is the process of taking a pre-trained Large Language Model (LLM), which has already learned general language patterns from massive datasets, and further training it on a smaller, task-specific dataset. This adapts the model to excel at a particular task, domain, or speaking style.

Think of a pre-trained model as a university graduate with broad general knowledge. Fine-tuning is like giving that graduate specialized job training to become an expert in a specific field, such as law, medicine, or coding.

How it Works

  1. Start with a Base Model: You take a model that has already learned the patterns of language (like Llama or Qwen).
  2. Specific Training: You feed it a curated dataset of examples. For example, if you want a medical bot, you feed it thousands of pairs of medical questions and professional answers.
  3. Weight Adjustment: During this process, the model's internal parameters (weights) are adjusted slightly. It doesn't "forget" how to speak English, but it learns that in this specific context it should use certain terminology or follow a specific tone.
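For illustration, a fine-tuning dataset of question/answer pairs might look like the following sketch. The medical content and the "input"/"output" key names are hypothetical; check the exact schema your platform expects.

```python
import json

# Hypothetical medical Q&A pairs; the schema (keys "input"/"output") is illustrative.
dataset = [
    {"input": "What are common symptoms of iron deficiency?",
     "output": "Common symptoms include fatigue, pale skin, and shortness of breath."},
    {"input": "When is a fasting glucose test recommended?",
     "output": "It is typically recommended when screening for diabetes or prediabetes."},
]

# Datasets like this are commonly stored one JSON object per line (JSONL).
jsonl = "\n".join(json.dumps(sample) for sample in dataset)
print(len(jsonl.splitlines()), "samples")
```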

Why Fine-tune?

While prompt engineering allows you to guide a model, fine-tuning permanently alters the model's weights to instill deep behavioral changes. It is particularly useful for:

  • Changing Output Style: Teaching the model to speak in a specific voice, persona, or format (e.g., JSON, SQL, or a specific poetic style).
  • Domain Specialization: Injecting specialized knowledge that the base model lacks, such as internal company terminology, medical data, or legal precedents.
  • Behavioral Alignment: Teaching the model to refuse certain types of requests or to be more helpful and harmless.
  • Cost & Efficiency: A small, fine-tuned model (e.g., Llama-3 8B) can often outperform a much larger, general-purpose model (like GPT-4) on specific tasks while being cheaper and faster to run.

Fine-tuning vs. RAG

  • RAG gives the model a "textbook" to look up information during the exam. It is best for dynamic data that changes frequently (like today's stock prices).
  • Fine-tuning is like studying for the exam so the knowledge is internalized. It is best for teaching the model how to reason, structure answers, or handle stable domain knowledge.

What Is the Fine-tuning Process?

The fine-tuning process involves taking a pre-trained model (which already knows "a lot") and giving it additional training on a specific, smaller dataset to make it an expert in a particular task. The dataset typically takes the form of a large number of input-output pairs; the fine-tuned model is then used to generate new output for previously unseen input. The training dataset consists of multiple samples/examples, usually numbering in the thousands. Note that this is far fewer than the number of samples used to train the base model.

Instead of computing the gradient (the direction and magnitude of the weight update) using the entire training dataset at once, which is computationally expensive, memory-intensive, and often practically impossible, the dataset is split into smaller, manageable batches. The training process is fundamentally iterative. The batch_size parameter defines how many samples from the training dataset are processed together. For each batch (group of samples) of training data, the following three core operations occur:
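Splitting the dataset into batches can be sketched in a couple of lines (toy integer "samples" stand in for real training examples):

```python
dataset = list(range(10))   # 10 toy samples
batch_size = 4

# Slice the dataset into consecutive chunks of batch_size samples;
# the last batch may be smaller if the dataset size is not a multiple.
batches = [dataset[i:i + batch_size] for i in range(0, len(dataset), batch_size)]
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```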

Naive Implementation

  1. Forward Pass: The input samples in the batch are passed through the neural network to generate an output prediction. The loss (the error between the prediction and the true label) is then calculated.

  2. Backward Pass: The loss is used to compute the gradients (the derivative of the loss with respect to each of the model's weights). The gradients indicate the direction and magnitude needed to adjust the weights to reduce the loss.

    Perform all calculations using standard FP32 (32-bit floating point).

  3. Optimizer Step: The model's weights and biases are updated using an optimizer (e.g., Adam, SGD) based on the calculated gradients and the learning rate. This is the moment the model actually learns.

The standard process, often collectively referred to as a Mini Step, is a one-to-one relationship: 1 Batch → 1 Gradient Computation (forward & backward pass) → 1 Gradient Update (optimizer step).
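The three operations above can be sketched with a toy one-weight model in plain Python (no ML framework; the analytic gradient of a squared-error loss stands in for backpropagation, and plain SGD stands in for a real optimizer):

```python
# Toy 1-D model y = w * x, trained with one forward pass, one backward
# pass, and one optimizer step per batch (the "Mini Step" pattern).
# All values are illustrative; real fine-tuning adjusts billions of weights.

data = [(x, 3.0 * x) for x in range(1, 9)]  # true weight is 3.0
batch_size = 4
w = 0.0                                     # model weight (starts untrained)
lr = 0.01                                   # learning rate

for epoch in range(50):
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # 1. Forward pass: predictions and mean squared error loss
        loss = sum((w * x - y) ** 2 for x, y in batch) / len(batch)
        # 2. Backward pass: analytic gradient of the loss w.r.t. w
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        # 3. Optimizer step: SGD update in the direction that reduces the loss
        w -= lr * grad

print(round(w, 3))  # converges toward the true weight 3.0
```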

Problems

While the standard process explained above is conceptually sound, it faces the following issues in practice.

1. The Memory Bottleneck (VRAM) Fine-tuning requires loading the model weights, the optimizer states, and—most importantly—the intermediate calculations (activations and gradients) for every layer onto the GPU's memory (VRAM). High-density data or large models quickly saturate the GPU's memory. While a large batch size is usually better for gradient quality, it often results in a "CUDA Out of Memory" (OOM) error.

2. The "Exploding Gradient" Problem During the backward pass, gradients are calculated by multiplying derivatives layer by layer (the Chain Rule). In deep networks, these values can grow exponentially. If a gradient becomes massive, the optimizer will make a huge update to the weights, effectively "breaking" the model. This leads to NaN (Not a Number) values or a model that suddenly stops learning.

3. Precision and Underflow (Small Gradients) Standard training traditionally uses FP32 (32-bit floating point), which is precise but "heavy" and slow. Switching to FP16 (16-bit) to save memory introduces a risk: Gradient Underflow. Because FP16 has a much smaller range, tiny gradient values can round down to zero. If the gradient becomes zero, the model stops learning because there is no "direction" to follow.

4. Updating All Parameters Full fine-tuning requires updating all parameters (usually in billions), which is expensive and often requires massive VRAM. Updating all weights of a multi‑billion parameter model requires GPU memory for parameters, gradients, and optimizer states, often 3–4× the model size, which becomes impractical.

5. Recovery from transient errors If the training process is interrupted, the model's weights and optimizer states are lost. To resume training, the entire process must be restarted from scratch. This can be time-consuming and wasteful of resources.

Protean AI Approach

There are a number of standard ways to address these issues; however, they make the training process significantly more complicated. Protean AI implements all industry-standard techniques to address these issues but abstracts them away from the user. The user only needs to specify a few configuration parameters, and the rest is handled automatically. The following sections explain how Protean AI addresses these issues.

Protean AI uses a technique called gradient accumulation, which breaks the one-to-one relationship (1 Batch → 1 Gradient Computation → 1 Gradient Update), allowing you to effectively use a batch size larger than what your hardware's memory (VRAM) can hold. It introduces an intermediate stage where gradients are collected before the optimizer step is executed. Instead of updating model weights after every batch, this technique accumulates gradients across multiple mini-steps and applies them once in an optimizer step (often referred to as a Step). The process is as follows:

  1. Forward Pass: Perform the forward pass on multiple batches sequentially to calculate the loss. These batches are often referred to as mini-batches. Instead of performing the forward-pass calculations in standard FP32, Protean AI uses mixed precision (FP16 or BF16) to reduce memory usage: the input is passed through the model using 16-bit precision. This roughly doubles the speed and significantly reduces VRAM usage because 16-bit numbers take up half the space of 32-bit numbers. The loss is then calculated. Because 16-bit numbers have a limited range, very small gradients can underflow (become zero).

  2. Gradient Scaling: To prevent the "underflow" mentioned above, Protean AI uses a Gradient Scaler. Before the backward pass, it multiplies the loss by a large scale factor (e.g., 1024). This pushes the small gradient values up into a range that 16-bit floats can represent accurately.

  3. Backward Pass: Compute the gradients from the scaled loss. The backward pass automatically accumulates gradients in memory by adding to the gradients computed in the previous call, so the scaled gradients are summed across multiple mini-batches.

    Steps 1-3 combined are often referred to as an accumulation step, since gradients are only accumulated, not applied.

  4. Unscaling & Gradient Clipping: Once gradients have been accumulated for a full step, they are prepared for the optimizer. The gradients are divided by the scale factor to bring them back to their original, correct values. Before the weights are updated, the norm (size) of the accumulated gradients is checked. If the gradients are too large (a problem known as Exploding Gradients), they are scaled down to a maximum threshold called Max Grad Norm using a technique called gradient clipping. Clipping prevents Exploding Gradients, ensuring the optimizer doesn't take a massive "jump" that ruins the model's weights.

  5. Update Model: The accumulated gradients are then used to update the model's weights. This effectively simulates training with a larger batch size, even if your GPU can only fit a smaller batch at once. Think of gradient accumulation like saving up several "mini" gradient updates before spending them all at once. Each mini-step contributes a small part of the total gradient, and after a few such steps, the optimizer takes a single, well-informed step, just as if you had trained with a much larger batch.

    The model's weights and biases are updated using the AdamW optimizer based on the clipped gradients (from step 4) and the learning rate. This is called a step (or optimizer step).

    Instead of updating/retraining all parameters, Protean AI uses QLoRA, which freezes the original weights. On top of the original model, it adds lightweight low-rank adapter matrices, and only these additional parameters are trained. This approach dramatically reduces memory usage and makes fine-tuning significantly faster and more efficient.

  6. Snapshot Training Process: Protean AI automatically saves the model weights and optimizer states at the end of each step, allowing you to resume training from the last saved snapshot. A snapshot is a comprehensive record of the state of the training process at a specific moment in time (usually at the end of an epoch or a set number of steps). Unlike a final model export, which might only include the weights, a snapshot captures the entire "living" environment of the training run, so you can resume from a specific point in the process.
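The mechanics of steps 1-5 can be sketched in miniature with the same kind of toy one-weight model (plain Python; the FP16 forward pass, AdamW, and QLoRA adapters are simulated away, leaving only the accumulate → unscale → clip → step pattern):

```python
# Toy sketch of gradient accumulation with loss scaling and clipping
# on a 1-D model y = w * x. Values are illustrative, not production settings.

data = [(x, 3.0 * x) for x in range(1, 9)]
micro_batch = 2           # what "fits in VRAM"
accum_steps = 4           # effective batch size = 2 * 4 = 8
scale = 1024.0            # loss/gradient scale factor
max_grad_norm = 1.0       # clipping threshold
w, lr = 0.0, 0.01

for epoch in range(400):
    accum_grad = 0.0
    for i in range(0, len(data), micro_batch):
        batch = data[i:i + micro_batch]
        # 1-3. Forward pass (FP16 in the real pipeline), scale the loss,
        #      backward pass, and accumulate the scaled gradient.
        scaled_grad = scale * sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        accum_grad += scaled_grad / accum_steps
    # 4. Unscale, then clip to max_grad_norm.
    grad = accum_grad / scale
    if abs(grad) > max_grad_norm:        # with one weight, the norm is |grad|
        grad *= max_grad_norm / abs(grad)
    # 5. Optimizer step (plain SGD stands in for AdamW).
    w -= lr * grad

print(round(w, 2))  # approaches the true weight 3.0
```

In a real PyTorch setup the same pattern appears as `autocast` for the forward pass, `GradScaler` for scaling, repeated `backward()` calls for accumulation, and `clip_grad_norm_` before `optimizer.step()`.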

The "Bucket" Analogy

To visualize why we use the Effective Batch Size, imagine filling a bucket:

  • Physical Batch Size: A small cup you use to scoop water.
  • Mixed Precision: Using a lighter, plastic cup (16-bit) instead of a heavy glass one (32-bit) so you can move faster.
  • Gradient Scaling: Adding "dye" to the water so you can see even the tiniest drops clearly and don't miss any.
  • Accumulation Steps: The number of cups it takes to fill the bucket.
  • Gradient Clipping: A lid on the bucket that prevents water from splashing out if you move too violently.
  • One Step: Dumping the bucket and updating the "Master Record" of how much water you've moved.

In a Nutshell

Fine-tuning is complex to set up. Protean AI uses the following techniques to reduce memory usage and improve training speed:

Gradient Accumulation

  • The Problem: Your GPU doesn't have enough VRAM to hold a large, stable batch of data at once.
  • The Solution: It splits a large Effective Batch into smaller, manageable Physical Batches.
  • How it works: The model performs several forward and backward passes, "saving up" (accumulating) the gradients in memory without updating the weights immediately. Once the desired number of steps is reached, the optimizer uses the total accumulated gradient to make one single, high-quality weight update.
  • Key Benefit: Simulates a large batch size on limited hardware. For example, if the maximum batch size that fits on your GPU is 4, keeping the physical batch size at 4 and accumulating gradients over 16 steps yields an effective batch size of 64 while maintaining good GPU utilization.
| Batch Size | Gradient Accumulation Steps | Effective Batch Size | Recommended |
|------------|-----------------------------|----------------------|-------------|
| 16         | 4                           | 64                   | No          |
| 4          | 16                          | 64                   | Yes         |

Mixed Precision Training

  • The Problem: Using standard 32-bit (FP32) floats for every calculation is slow and consumes massive amounts of memory.
  • The Solution: Uses 16-bit (FP16 or BF16) for the "heavy lifting" math.
  • How it works: It performs the forward and backward passes in 16-bit precision (which is faster and uses half the memory) while maintaining a "Master Copy" of the weights in 32-bit to ensure the model doesn't lose precision over time.
  • Key Benefit: Dramatically increases training speed and reduces VRAM usage.

Gradient Scaling

  • The Problem: In 16-bit Mixed Precision, very small gradient values can "underflow" (turn into zeros), causing the model to stop learning.
  • The Solution: Artificially inflates the loss values during the backward pass to preserve small signals.
  • How it works: It multiplies the loss by a large scale factor before backpropagation. This pushes tiny gradients into a range that 16-bit floats can represent. The gradients are then "unscaled" back to their original size before the weights are updated.
  • Key Benefit: Prevents gradient underflow caused specifically by lower-precision math.
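The underflow can be demonstrated directly with Python's struct module, which can round-trip values through IEEE 754 half precision (the tiny gradient value is arbitrary):

```python
import struct

def to_fp16(value: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision (format 'e')."""
    return struct.unpack("e", struct.pack("e", value))[0]

tiny_grad = 2e-8          # below half of the smallest positive fp16 value (~6e-8)
scale = 1024.0

underflowed = to_fp16(tiny_grad)          # rounds to zero in fp16: signal lost
survived = to_fp16(tiny_grad * scale)     # scaled value is representable

print(underflowed)        # 0.0
print(survived / scale)   # unscaling recovers approximately 2e-8
```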

Gradient Clipping

  • The Problem: Exploding Gradients—when gradients become so large that they cause the optimizer to make a massive "jump," leading to NaN errors or ruining the model weights.
  • The Solution: A safety threshold that caps the maximum magnitude of a gradient.
  • How it works: It calculates the "norm" (total size) of the gradients. If that value exceeds a pre-defined limit (e.g., 1.0), it scales all gradients down proportionally to stay within a safe boundary.
  • Key Benefit: Ensures training stability and prevents the model from "breaking" during volatile updates.
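Clipping by global norm can be sketched in a few lines (a plain-Python stand-in for what frameworks like PyTorch do in `clip_grad_norm_`; the gradient values are made up):

```python
import math

def clip_grad_norm(grads, max_norm=1.0):
    """Scale a list of gradients so their global L2 norm stays within max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        factor = max_norm / total_norm
        return [g * factor for g in grads]
    return grads

exploding = [30.0, 40.0]             # global norm = 50.0, far above the limit
clipped = clip_grad_norm(exploding)  # rescaled proportionally to norm 1.0
print(clipped)
```

Note that both gradients shrink by the same factor, so the update's direction is preserved; only its magnitude is capped.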

QLoRA (Quantized LoRA)

  • The Problem: Full fine-tuning requires updating billions of parameters and storing massive Optimizer States, which often exceeds the VRAM of even high-end GPUs.
  • The Solution: Combines LoRA (training tiny adapter matrices) with 4-bit Quantization.
  • How it works: It compresses the base model weights down to 4-bit (Quantization) to save massive amounts of VRAM. It then adds a small set of trainable 16-bit "Adapter" layers. During training, the 4-bit weights are frozen, and the GPU only needs to track gradients and optimizer states for the tiny adapters.
  • Key Benefit: Allows you to fine-tune massive models (like Llama 3 70B) on consumer-grade hardware.
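The parameter savings are easy to see with a back-of-the-envelope sketch (the layer size and rank below are illustrative, not the values Protean AI uses):

```python
# LoRA idea in numbers: the frozen base weight W (d x d) stays untouched,
# while only two small low-rank matrices A (d x r) and B (r x d) are trained.

d, r = 1024, 8                    # hidden size and LoRA rank (r << d)

frozen_params = d * d             # base weight matrix, frozen (4-bit in QLoRA)
adapter_params = d * r + r * d    # trainable A and B matrices

ratio = adapter_params / frozen_params
print(f"trainable fraction: {ratio:.2%}")  # trainable fraction: 1.56%
```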

Snapshot

  • The Problem: Training can be interrupted at any time, resulting in lost time and wasted resources.
  • The Solution: Automatically saves the model weights and optimizer states at the end of each step.
  • How it works: It allows you to resume training from the last saved snapshot.
  • Key Benefit: Reduces the risk of losing training progress if the training process is interrupted.

Access Control

Access Control in Protean AI governs who can view, create, modify, and operate resources across the platform. It is designed for enterprise environments where security, isolation, and governance are mandatory.

Protean AI follows a principle of least privilege, ensuring users and systems are granted only the permissions required to perform their tasks.

| Action ↓ / Role → | Admin | Model Admin | User | Owner | Viewer | Description |
|-------------------|-------|-------------|------|-------|--------|-------------|
| Create            | Yes   | Yes         | Yes  | NA    | NA     | Create a fine-tune configuration |
| Read              | Yes   | Yes         | No   | Yes   | Yes    | Read a fine-tune configuration |
| Update            | Yes   | Yes         | No   | Yes   | No     | Modify a fine-tune configuration, manage training runtime |
| Delete            | Yes   | Yes         | No   | Yes   | No     | Delete a fine-tune configuration |
| Manage Access     | Yes   | Yes         | No   | Yes   | No     | Grant or revoke permissions for users and groups |
note

At the moment, managing a fine-tune configuration and its runtime does not guarantee adapter publishing and deployment permissions. We are working on broader (less granular) permissions for fine-tune configurations that will be able to control adapter publishing and deployment.

Workflow

  1. Select a Base Model: Choose a high-quality open-source model (e.g., Llama-3, Qwen-3).
  2. Prepare Data: Create a dataset of input-output pairs (e.g., conversational data, prompt-completion data).
  3. Set Hyperparameters: Create a set of hyperparameters for training and evaluation.
  4. Train: Use Protean AI to update the model's weights based on your data.
  5. Evaluate: Use Protean AI to evaluate and compare various training runs.
  6. Publish Adapter: Publish an adapter from the step that has the best evaluation results.
  7. Deploy Adapter: Deploy the adapter in a model runtime and start serving inference requests.