
Building Time Series Forecasting Foundation Models

This comprehensive tutorial demonstrates how to build a state-of-the-art time series forecasting foundation model through cross-modal continual pretraining. We’ll start from a pretrained language model (Gemma-3-4B) and adapt it to time series forecasting using patch-based tokenization and a custom multi-objective loss that combines MAE with quantile (pinball) loss over 20 quantiles.

What You’ll Learn

In this tutorial, we’ll cover:
  • Cross-modal continual pretraining from language models to time series
  • Patch-based time series tokenization with masking for robust learning
  • Custom multi-objective loss functions combining MAE and pinball loss
  • Advanced forecasting techniques using foundation model approaches
  • Production-ready model deployment for time series prediction

Step 1: Data Preparation

Your time series data should be in JSONL format following the AutoGluonTS convention, with each line containing a JSON object that describes one series. We’ll use patch-based tokenization to convert these series into sequences that transformer architectures can process, enabling transfer from language models.
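As a rough illustration, a record could be written like the sketch below. The field names (start, target, and an optional item_id) follow the common GluonTS/AutoGluon layout and should be checked against the exact format the platform expects:
import json

# Illustrative record only - field names follow the usual GluonTS/AutoGluon
# convention and may need adjusting to the format the platform expects.
example_record = {
    "start": "2024-01-01 00:00:00",          # timestamp of the first observation
    "target": [112.0, 118.5, 132.0, 129.4],  # observed series values
    "item_id": "store_42",                   # optional series identifier
}

with open("time_series_data.jsonl", "a") as f:
    f.write(json.dumps(example_record) + "\n")  # one JSON object per line
The data preparation config below wires up the patch-based tokenizer: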
data_config.py
from pynolano import DataPreparationConfig, TimeSeriesTokenizerConfig

def build() -> DataPreparationConfig:
    return DataPreparationConfig(
        input_path="./time_series_data.jsonl",
        output_path="./prepared_ts_foundation",
        tokenization=TimeSeriesTokenizerConfig(
            type="patch_based",
            input_patch_size=32,     # Input patches of 32 time steps
            output_patch_size=128,   # Predict patches of 128 time steps 
            patch_masking=True,      # Enable patch masking for robust learning
            normalization_method="z-norm"  # Standardize values
        ),
        max_sequence_length=2048  # Maximum sequence length for efficient training
    )
Run Data Preparation
nolano prepare_data data_config.py
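Conceptually, the patch-based tokenizer z-normalizes each series and slices it into fixed-size input patches. The snippet below is a simplified sketch of that idea, not the platform’s internal tokenizer (which also handles output patches, masking, and padding):
import torch

def patchify(series: torch.Tensor, patch_size: int = 32) -> torch.Tensor:
    """Z-normalize a 1-D series and split it into non-overlapping patches.
    Simplified illustration of patch-based tokenization."""
    series = (series - series.mean()) / (series.std() + 1e-8)  # z-norm
    usable = (series.shape[0] // patch_size) * patch_size      # drop trailing remainder
    return series[:usable].view(-1, patch_size)                # (num_patches, patch_size)

patches = patchify(torch.randn(1000))  # -> shape (31, 32)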

Step 2: Custom Loss Function Implementation

Our custom loss function combines Mean Absolute Error (MAE) with Pinball Loss for quantile forecasting. While MAE is well known, let’s focus on the less familiar component:

Pinball Loss (Quantile Loss)

Pinball loss enables probabilistic forecasting by predicting multiple quantiles with asymmetric penalties:

$$\text{Pinball}(\tau) = \frac{1}{n} \sum_{i=1}^{n} \max\left(\tau\,(y_i - \hat{y}_i^\tau),\ (\tau - 1)\,(y_i - \hat{y}_i^\tau)\right)$$

Where:
  • $\tau$ is the quantile level (e.g., 0.1 for the 10th percentile)
  • $\hat{y}_i^\tau$ is the predicted value for quantile $\tau$
  • The loss penalizes under-prediction more heavily for high quantiles and over-prediction more heavily for low quantiles, as the quick example below shows
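For example, at $\tau = 0.9$, under-predicting by 2 units costs $0.9 \times 2 = 1.8$, while over-predicting by 2 units costs only $(1 - 0.9) \times 2 = 0.2$, so training pushes the 90th-percentile estimate upward until only about 10% of observations fall above it.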

Custom Multi-Objective Loss Function

Create a custom multi-objective loss function that combines MAE and pinball loss for comprehensive forecasting:
custom_loss.py
import torch
import torch.nn as nn

def multi_objective_forecasting_loss(logits, targets):
    """
    Custom multi-objective loss combining MAE (70%) and Pinball loss (30%)
    
    Args:
        logits: Model predictions of shape (..., sequence_length, output_patch_size)
        targets: Ground truth values of shape (..., sequence_length, output_patch_size)
    
    Returns:
        loss: Single scalar loss value
    """
    # Flatten the last two dimensions for easier computation
    # Shape: (..., sequence_length * output_patch_size)
    logits_flat = logits.view(*logits.shape[:-2], -1)
    targets_flat = targets.view(*targets.shape[:-2], -1)
    
    # 1. Mean Absolute Error (MAE) - 70% weight
    mae_loss = torch.mean(torch.abs(logits_flat - targets_flat))
    
    # 2. Pinball Loss for 20 quantiles - 30% weight
    quantiles = torch.linspace(0.05, 0.95, 20, device=logits.device)  # 20 quantiles from 5% to 95%
    # We interpret the point predictions as the estimate for every quantile.
    # This is a simplified approach - in practice, you might have a separate
    # output head per quantile.
    errors = targets_flat - logits_flat  # identical for every quantile, so compute once
    pinball_losses = []
    
    for tau in quantiles:
        pinball_loss = torch.where(
            errors >= 0,
            tau * errors,        # under-prediction: penalized by tau
            (tau - 1) * errors   # over-prediction: penalized by (1 - tau)
        )
        pinball_losses.append(torch.mean(pinball_loss))
    
    # Average pinball loss across all quantiles
    avg_pinball_loss = torch.stack(pinball_losses).mean()
    
    # Combine losses with specified weights
    total_loss = 0.7 * mae_loss + 0.3 * avg_pinball_loss
    
    return total_loss
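A quick sanity check, appended to the bottom of custom_loss.py, confirms the function returns a single scalar (the shapes here are illustrative):
if __name__ == "__main__":
    batch, seq_len, patch = 4, 16, 128
    logits = torch.randn(batch, seq_len, patch)
    targets = torch.randn(batch, seq_len, patch)
    loss = multi_objective_forecasting_loss(logits, targets)
    print(loss.shape, float(loss))  # torch.Size([]) and a finite value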

Step 3: Training Configuration

Configure the model for cross-modal continual pretraining from Gemma-3-4B to time series forecasting:
train_config.py
from pynolano import (
    ExperimentConfig, 
    DataConfig, 
    ModelConfig, 
    OptimizationConfig,
    MetaConfig
)
from custom_loss import multi_objective_forecasting_loss

def build() -> ExperimentConfig:
    return ExperimentConfig(
        data_configs=[
            DataConfig(
                data_paths="./prepared_ts_foundation",
                training_objective=multi_objective_forecasting_loss,  # Custom loss function
                validation_split=0.15
            )
        ],
        model_config=ModelConfig(
            architecture="google/gemma-3-4b-pt",  # Pretrained Gemma model
            init_method="none",  # Don't reinitialize weights - use pretrained
            # Cross-modal adaptation will be handled automatically
        ),
        optimization_config=OptimizationConfig(
            total_training_steps=25000,
            max_learning_rate=5e-5,  # Lower learning rate for continual pretraining
            global_batch_size=64,
            learning_rate_schedule="cosine",
            warmup_steps=2500,
            weight_decay=0.01,
            gradient_clipping=1.0  # Important for stability in cross-modal training
        ),
        meta_config=MetaConfig(
            name="ts-foundation-gemma-4b",
            model_save_frequency=2500,
            max_checkpoints=5,
            seed=42
        )
    )
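For intuition, the schedule configured above ramps the learning rate up linearly over the warmup steps and then decays it along a cosine curve. The sketch below is a generic illustration, not the platform’s exact implementation:
import math

def lr_at_step(step: int, total_steps: int = 25000, warmup_steps: int = 2500,
               max_lr: float = 5e-5) -> float:
    """Generic cosine-with-warmup schedule, for illustration only."""
    if step < warmup_steps:
        return max_lr * step / max(1, warmup_steps)                 # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))      # cosine decay to 0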
Launch the training process:
nolano train train_config.py
The platform will automatically:
  • Adapt the Gemma-3-4B architecture for time series processing
  • Apply patch-based tokenization during training
  • Optimize using the custom multi-objective loss function
  • Scale across multiple GPUs for efficient training

Advanced Foundation Model Features

The tutorial showcases several cutting-edge capabilities:
Language-to-Time Series Transfer Learning
Our approach leverages pretrained language models for time series:
  • Preserves rich representational knowledge from language pretraining
  • Adapts transformer architectures to temporal patterns
  • Enables few-shot learning on new time series domains
  • Significantly reduces training time compared to training from scratch
Advanced Time Series Representation
The patch-based approach treats patches of time steps much like a language model treats tokens:
  • Input patches of 32 time steps for context understanding
  • Output patches of 128 time steps for multi-step forecasting
  • Patch masking during training improves robustness
  • Enables efficient processing of long time series
Sophisticated Loss Function Design
Our custom loss combines multiple objectives:
  • 70% MAE weight for robust point forecasting
  • 30% pinball loss weight for uncertainty quantification
  • 20 quantiles (vs. standard 10) for detailed probabilistic forecasting
  • Asymmetric penalty structure for realistic cost modeling
Scalable and Transferable Architecture
The foundation model approach provides:
  • Zero-shot forecasting on new time series
  • Few-shot adaptation to domain-specific patterns
  • Robust performance across diverse time series types
  • Efficient fine-tuning for specialized applications