Introduction
Backpropagation has been the driving force behind the deep learning revolution. We have come a long way with advances such as:
- New architectures such as Convolutional Neural Networks, Recurrent Neural Networks, and Transformers.
- New training paradigms such as fine-tuning, transfer learning, self-supervised learning, contrastive learning, and reinforcement learning.
- New optimizers, regularizers, augmentations, loss functions, frameworks and many more…
However, the Abstraction and Reasoning Corpus (ARC) dataset, created more than five years ago, has stood the test of numerous architectures and remains unsolved. It is still one of the most difficult datasets, where even the best models cannot match human-level accuracy, an indication that true AGI is still far out of reach.
Last week, a new paper, "The Surprising Effectiveness of Test-Time Training for Abstract Reasoning," used a relatively novel technique to reach a new level of accuracy on the ARC dataset, exciting the deep learning community much as AlexNet did 12 years ago.
Test-Time Training (TTT) was introduced about five years ago. The idea is to train on very few samples (usually one or two) that are similar to the test data point, letting the model update its parameters and hyper-fit to just those points.
TTT is analogous to turning a general practitioner into a surgeon who is super-specialized in heart valve replacements alone.
In this post, we will learn what TTT is, how we can apply it to various tasks, and discuss the advantages, disadvantages, and implications of using TTT in real-world scenarios.
What is test-time training?
Humans are very adaptable. For any task, we go through two learning phases: a general learning phase that begins at birth and a task-specific learning phase, often known as task orientation. Similarly, TTT complements pretraining and fine-tuning as a second phase of learning that occurs during inference.
Simply put, Test-Time Training involves cloning a trained model during the testing phase and fitting it on data points similar to the one you want to run inference on. Breaking the process into steps, during inference, given a new test data point, we perform the following actions:
- clone the model (general purpose),
- collect data points from the training set that are closest to the test point, either through some prior knowledge or through embedding similarity,
- create a smaller training data set with inputs and targets using the data from the previous step,
- decide on a loss function and train the cloned model on this small data set,
- use the updated cloned model to predict on that test data point.
As a simple example, you can take a trained linear regression model, re-fit its slope on a set of points in the vicinity of the test point, and use it to make a more accurate prediction, as in the sketch below.
K-Nearest Neighbors is an extreme example of a TTT-like process, where the only training that happens occurs at test time.
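Here is a minimal sketch of that loop in Python with scikit-learn, combining the nearest-neighbor selection step with a small linear model. The synthetic data, the SGDRegressor model, and the choice of 20 neighbors are illustrative assumptions, not a prescription from any particular paper.

```python
# A minimal sketch of the TTT loop: clone, select neighbors, fine-tune, predict.
import copy
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 3, size=(500, 1))            # toy training data
y_train = np.sin(X_train[:, 0]) + rng.normal(0, 0.1, size=500)

# 1. General-purpose model, trained once ahead of time.
base_model = SGDRegressor(max_iter=1000).fit(X_train, y_train)

# Index used to find training points near a test point.
nn = NearestNeighbors(n_neighbors=20).fit(X_train)

def ttt_predict(x_test):
    """Predict a single test point with test-time training."""
    # 2./3. Collect the training points closest to the test point.
    _, idx = nn.kneighbors(x_test.reshape(1, -1))
    X_local, y_local = X_train[idx[0]], y_train[idx[0]]

    # 4. Clone the general model and fit it on the small local set.
    local_model = copy.deepcopy(base_model)
    local_model.partial_fit(X_local, y_local)

    # 5. Use the updated clone to predict just this point.
    return local_model.predict(x_test.reshape(1, -1))[0]

print(ttt_predict(np.array([1.5])))
```

The cloned model is discarded after the prediction; only the base model persists across test points.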
In the LLM arena, TTT is especially useful when the tasks are complex and outside of what an LLM has seen before.
In-context learning, few-shot prompting, chain-of-thought reasoning, and retrieval-augmented generation have become standard techniques for improving LLMs during inference. These techniques enrich the context before producing a final answer, but they fall short in one respect: the model does not adapt to the new environment at test time. With TTT, we can make the model learn new concepts that would otherwise require a large amount of data to capture.
The ARC dataset is ideal for this paradigm, as each data sample is a collection of short examples followed by a question that can only be solved using those examples, similar to how SAT exams ask you to find the next figure in a sequence.
As shown in the image above, the first three examples can be used for training at test time in order to predict the output for the fourth image.
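Each ARC task is distributed as a small JSON file whose "train" entries are the demonstration pairs and whose "test" entries hold the question. A rough sketch of turning one task into a test-time training set might look like this; the file path is a placeholder assumption.

```python
# Build a tiny test-time training set from a single ARC task file.
import json

with open("arc_task_example.json") as f:   # hypothetical path to one ARC task
    task = json.load(f)

# Demonstration pairs: these few examples become the test-time training data.
ttt_pairs = [(ex["input"], ex["output"]) for ex in task["train"]]

# The held-out question the specialized model must answer.
test_input = task["test"][0]["input"]

print(f"{len(ttt_pairs)} demonstration pairs; "
      f"question grid is {len(test_input)}x{len(test_input[0])}")
```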
How to perform TTT
The brilliance of TTT lies in its simplicity: it extends learning into the testing phase. All standard training techniques can therefore be applied here, but there are practical aspects to consider.
Since training is computationally expensive, TTT adds overhead because, in theory, training is required for every inference. To mitigate this cost, consider:
- Parameter-Efficient Fine-Tuning (PEFT): When fine-tuning an LLM, training with LoRA is considerably cheaper and faster. It is advisable to train only a small subset of parameters, as in PEFT, rather than performing a full fine-tune (see the sketch after this list).
- Transfer learning: As in conventional transfer learning, a new task head can be added or swapped in and only that head trained.
- Embedding reuse: Keep track of which inferences were made, that is, which LoRAs were used. During inference, if the embedding of a new data point is close enough to existing ones, an existing LoRA/task head can be reused.
- Test-Time Augmentation (TTA): TTA clones the inference sample and applies augmentations; averaging all the predictions yields a more robust result. In TTT, the same idea can improve performance by enriching the small training set.
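As a concrete illustration of the PEFT route, below is a hedged sketch of test-time training with a LoRA adapter via Hugging Face transformers and peft. The gpt2 checkpoint, the made-up demonstration strings, and the hyperparameters are placeholder assumptions; only the small adapter is trained, and the resulting adapter could be cached and reused for similar inputs (embedding reuse).

```python
# Sketch: test-time training of a LoRA adapter on a few demonstrations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "gpt2"  # placeholder; any causal LM follows the same pattern
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForCausalLM.from_pretrained(model_name)

# Wrap the frozen base model with a small LoRA adapter; only the adapter
# weights are updated at test time, keeping the overhead low.
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["c_attn"],  # GPT-2 attention projection
                      task_type="CAUSAL_LM")
model = get_peft_model(base_model, lora_cfg)

# Test-time training set: the few demonstrations similar to the test query
# (made-up examples for illustration).
demos = [
    "Q: reverse [1, 2, 3]. A: [3, 2, 1]",
    "Q: reverse [4, 5]. A: [5, 4]",
]

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
model.train()
for _ in range(10):                       # a few quick passes over the demos
    for text in demos:
        batch = tokenizer(text, return_tensors="pt")
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

model.eval()  # the specialized adapter can now answer the related test query
```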
Real-world uses
- Medical diagnosis: Fine-tuning general diagnostic models for specific patient conditions or rare diseases with limited data.
- Personalized education: Adapting an educational AI to a student's learning style using a few specific examples.
- Customer service chatbots: Improving chatbots for specific queries by training on session-specific topics.
- Autonomous vehicles: Adapting vehicle control models to local traffic patterns.
- Fraud detection: Specializing models for a particular business or for unusual transaction patterns.
- Legal document analysis: Adapting models to interpret the precedents relevant to specific cases.
- Creative content generation: Customizing LLMs to generate contextually relevant content, such as ads or stories.
- Document data extraction: Tuning models to specific templates to extract data more accurately.
Advantages
- Hyperspecialization: Useful for rare data points or one-off tasks.
- Data efficiency: Fine tuning with minimal data for specific scenarios.
- Flexibility: Improves generalization across multiple specializations.
- Domain adaptation: Addresses distribution drift during long deployments.
Disadvantages
- Computational cost: The additional training at inference time can be expensive.
- Latency: Not suitable for real-time LLM applications with current technology.
- Risk of maladaptation: Fine-tuning on irrelevant examples can degrade performance.
- Limited benefit for simple models: TTT shines when the model has a large number of parameters and the test-time data varies widely. Applying TTT to simple models such as linear regression merely overfits to local data, which amounts to nothing more than fitting many small models on KNN-selected samples.
- Complex integration: Requires careful design to integrate training into inference and monitor multiple models.
TTT is a promising tool, but it comes with significant overhead and risks. When used wisely, it can boost model performance in challenging scenarios beyond what conventional methods can achieve.