# Time Series Forecasting Foundation Models

> Build powerful time series forecasting foundation models using cross-modal continual pretraining from language models

# Building Time Series Forecasting Foundation Models

This tutorial shows how to build a time series forecasting foundation model through cross-modal continual pretraining. We start with a pretrained language model (Gemma-3-4B) and adapt it for time series forecasting using patch-based tokenization and a custom multi-objective loss that combines MAE with quantile (pinball) loss over 20 quantiles.

## What You'll Learn

In this tutorial, we'll cover:

* **Cross-modal continual pretraining** from language models to time series
* **Patch-based time series tokenization** with masking for robust learning
* **Custom multi-objective loss functions** combining MAE and pinball loss
* **Advanced forecasting techniques** using foundation model approaches
* **Production-ready model deployment** for time series prediction

## Step 1: Data Preparation

Your time series data should be in JSONL format following the AutoGluonTS format, with one JSON dictionary per line.
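As a hedged illustration of the one-dictionary-per-line layout, the sketch below serializes two toy series as JSONL. The keys `item_id`, `start`, and `target` are assumptions borrowed from the common GluonTS-style convention, not a schema confirmed by this tutorial, and the helper names are hypothetical.

```python
import json

# Hypothetical records: the keys "item_id", "start", and "target" are assumed
# from the GluonTS-style convention and may differ from what the platform expects.
records = [
    {"item_id": "sensor_a", "start": "2024-01-01 00:00:00", "target": [1.0, 2.5, 3.0, 2.0]},
    {"item_id": "sensor_b", "start": "2024-01-01 00:00:00", "target": [0.5, 0.25, 0.75, 1.5]},
]

def to_jsonl(rows):
    """Serialize rows as JSONL: one JSON dictionary per line."""
    return "".join(json.dumps(row) + "\n" for row in rows)

def from_jsonl(text):
    """Parse JSONL text back into a list of dictionaries."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

if __name__ == "__main__":
    # Write the file that the data preparation config points at.
    with open("time_series_data.jsonl", "w", encoding="utf-8") as f:
        f.write(to_jsonl(records))
```

Reading the file back with `from_jsonl` is a quick way to confirm that every line parses on its own before running data preparation.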

We'll use patch-based tokenization to convert time series into sequences that can be processed by transformer architectures, enabling transfer from language models.

```python data_config.py theme={null}
from pynolano import DataPreparationConfig, TimeSeriesTokenizerConfig

def build() -> DataPreparationConfig:
    return DataPreparationConfig(
        input_path="./time_series_data.jsonl",
        output_path="./prepared_ts_foundation",
        tokenization=TimeSeriesTokenizerConfig(
            type="patch_based",
            input_patch_size=32,     # Input patches of 32 time steps
            output_patch_size=128,   # Predict patches of 128 time steps 
            patch_masking=True,      # Enable patch masking for robust learning
            normalization_method="z-norm"  # Standardize values
        ),
        max_sequence_length=2048  # Maximum sequence length for efficient training
    )
```
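To make the configuration above more concrete, here is a minimal pure-Python sketch of z-normalization followed by patching. How the platform actually pairs input and output patches is not documented here, so the pairing rule, the per-series normalization statistics, and the function names below are all assumptions for illustration.

```python
import math

def z_normalize(values, eps=1e-8):
    """Standardize a series to zero mean and (approximately) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var)
    return [(v - mean) / (std + eps) for v in values]

def make_patch_pairs(values, input_patch_size=32, output_patch_size=128):
    """Pair each input patch with the output_patch_size steps that follow it.

    Assumed rule: input patches do not overlap, and trailing steps that cannot
    form a full input patch plus a full output window are dropped.
    """
    pairs = []
    start = 0
    while start + input_patch_size + output_patch_size <= len(values):
        split = start + input_patch_size
        pairs.append((values[start:split], values[split:split + output_patch_size]))
        start += input_patch_size
    return pairs

# Toy series with a daily-style cycle, 320 steps long.
series = [float(t % 24) for t in range(320)]
pairs = make_patch_pairs(z_normalize(series))
```

With these sizes, each pair holds 32 context steps and the 128 steps that follow them, which mirrors the `input_patch_size` and `output_patch_size` settings in the config.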

**Run Data Preparation**

```bash theme={null}
nolano prepare_data data_config.py
```

## Step 2: Custom Loss Function Implementation

Our custom loss function combines Mean Absolute Error (MAE) with pinball loss for quantile forecasting. MAE is well known, so this section focuses on the pinball component:

### Pinball Loss (Quantile Loss)

Pinball loss enables probabilistic forecasting by predicting multiple quantiles with asymmetric penalties:

$$
\text{Pinball}(\tau) = \frac{1}{n} \sum_{i=1}^{n} \max\left(\tau(y_i - \hat{y}_i^\tau), (\tau-1)(y_i - \hat{y}_i^\tau)\right)
$$

Where:

* $y_i$ is the observed value and $n$ is the number of observations
* $\tau$ is the quantile level (e.g., 0.1 for the 10th percentile)
* $\hat{y}_i^\tau$ is the predicted value for quantile $\tau$
* The loss penalizes under-prediction more heavily for high quantiles and over-prediction more heavily for low quantiles
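A small worked example may help show the asymmetry. The sketch below implements the formula above in plain Python and evaluates it at $\tau = 0.9$ for a forecast that misses the observation by 2 in each direction; the function name is illustrative.

```python
def pinball_loss(y_true, y_pred, tau):
    """Mean pinball loss for quantile level tau, following the formula above."""
    total = 0.0
    for y, y_hat in zip(y_true, y_pred):
        error = y - y_hat
        total += max(tau * error, (tau - 1.0) * error)
    return total / len(y_true)

y_true = [10.0]
under = pinball_loss(y_true, [8.0], tau=0.9)   # forecast 2 below the observation, about 1.8
over = pinball_loss(y_true, [12.0], tau=0.9)   # forecast 2 above the observation, about 0.2
```

At the 90th percentile, under-predicting costs roughly nine times as much as over-predicting by the same amount, which pushes the forecast upward until about 90% of observations fall below it.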

### Custom Multi-Objective Loss Function

Create a custom multi-objective loss function that combines MAE and pinball loss for comprehensive forecasting:

```python custom_loss.py theme={null}
import torch
import torch.nn as nn

def multi_objective_forecasting_loss(logits, targets):
    """
    Custom multi-objective loss combining MAE (70%) and Pinball loss (30%)
    
    Args:
        logits: Model predictions of shape (..., sequence_length, output_patch_size)
        targets: Ground truth values of shape (..., sequence_length, output_patch_size)
    
    Returns:
        loss: Single scalar loss value
    """
    # Flatten the last two dimensions for easier computation
    # Shape: (..., sequence_length * output_patch_size)
    # flatten() also works on non-contiguous tensors, where view() would raise
    logits_flat = logits.flatten(start_dim=-2)
    targets_flat = targets.flatten(start_dim=-2)
    
    # 1. Mean Absolute Error (MAE) - 70% weight
    mae_loss = torch.mean(torch.abs(logits_flat - targets_flat))
    
    # 2. Pinball Loss for 20 quantiles - 30% weight
    quantiles = torch.linspace(0.05, 0.95, 20, device=logits.device)  # 20 quantiles from 5% to 95%
    pinball_losses = []
    
    for tau in quantiles:
        # Simplification: the same point prediction is scored against every quantile level.
        # Because the 20 levels are symmetric around 0.5, this averaged term equals 0.5 * MAE,
        # so distinct quantile forecasts need a separate output head per quantile.
        errors = targets_flat - logits_flat
        pinball_loss = torch.where(
            errors >= 0,
            tau * errors,
            (tau - 1) * errors
        )
        pinball_losses.append(torch.mean(pinball_loss))
    
    # Average pinball loss across all quantiles
    avg_pinball_loss = torch.stack(pinball_losses).mean()
    
    # Combine losses with specified weights
    total_loss = 0.7 * mae_loss + 0.3 * avg_pinball_loss
    
    return total_loss
```
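The simplified loss above scores one point forecast against every quantile level. As a sketch of the separate-heads alternative mentioned in its comments, the function below assumes the model emits one value per quantile in a trailing dimension. That output layout and the function name are assumptions for illustration, not something the platform is documented to produce.

```python
import torch

def quantile_head_loss(preds, targets, quantiles):
    """Pinball loss averaged over elements and quantile levels.

    preds:     (..., num_quantiles), one forecast per quantile level (assumed layout)
    targets:   (...), observed values
    quantiles: (num_quantiles,), levels in (0, 1)
    """
    # Broadcast each observation against all of its quantile forecasts.
    errors = targets.unsqueeze(-1) - preds
    loss = torch.maximum(quantiles * errors, (quantiles - 1.0) * errors)
    return loss.mean()

quantiles = torch.linspace(0.05, 0.95, 20)
targets = torch.zeros(4, 128)

# Toy forecasts: every quantile head over-predicts by exactly 1.
preds = torch.ones(4, 128, 20)
loss = quantile_head_loss(preds, targets, quantiles)
```

Because each head is penalized with its own $\tau$, the heads are pushed toward different values, which is what makes the output usable as a predictive distribution.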

## Step 3: Training Configuration

Configure the model for cross-modal continual pretraining from Gemma-3-4B to time series forecasting:

```python train_config.py theme={null}
from pynolano import (
    ExperimentConfig, 
    DataConfig, 
    ModelConfig, 
    OptimizationConfig,
    MetaConfig
)
from custom_loss import multi_objective_forecasting_loss

def build() -> ExperimentConfig:
    return ExperimentConfig(
        data_configs=[
            DataConfig(
                data_paths="./prepared_ts_foundation",
                training_objective=multi_objective_forecasting_loss,  # Custom loss function
                validation_split=0.15
            )
        ],
        model_config=ModelConfig(
            architecture="google/gemma-3-4b-pt",  # Pretrained Gemma model
            init_method="none",  # Don't reinitialize weights - use pretrained
            # Cross-modal adaptation will be handled automatically
        ),
        optimization_config=OptimizationConfig(
            total_training_steps=25000,
            max_learning_rate=5e-5,  # Lower learning rate for continual pretraining
            global_batch_size=64,
            learning_rate_schedule="cosine",
            warmup_steps=2500,
            weight_decay=0.01,
            gradient_clipping=1.0  # Important for stability in cross-modal training
        ),
        meta_config=MetaConfig(
            name="ts-foundation-gemma-4b",
            model_save_frequency=2500,
            max_checkpoints=5,
            seed=42
        )
    )
```
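The optimization settings above imply a learning-rate curve that ramps up over the first 10% of training and then decays. The sketch below shows one common reading of a `"cosine"` schedule with warmup; the linear warmup shape and the decay to zero are assumptions, since the platform's exact schedule is not specified here.

```python
import math

def learning_rate(step, max_lr=5e-5, warmup_steps=2500, total_steps=25000, min_lr=0.0):
    """Linear warmup to max_lr, then cosine decay to min_lr (assumed shape)."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Sample the curve at each checkpoint interval (model_save_frequency=2500).
checkpoint_lrs = [learning_rate(step) for step in range(0, 25001, 2500)]
```

Under these assumptions the rate peaks at `max_learning_rate` when warmup ends and is about half of it midway through the decay phase.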

Launch the training process:

```bash theme={null}
nolano train train_config.py
```

The platform will automatically:

* Adapt the Gemma-3-4B architecture for time series processing
* Apply patch-based tokenization during training
* Optimize using the custom multi-objective loss function
* Scale across multiple GPUs for efficient training

## Advanced Foundation Model Features

The tutorial relies on the following capabilities:

<AccordionGroup>
  <Accordion title="Cross-Modal Continual Pretraining">
    **Language-to-Time Series Transfer Learning**

    Our approach leverages pretrained language models for time series:

    * Preserves rich representational knowledge from language pretraining
    * Adapts transformer architectures to temporal patterns
    * Enables few-shot learning on new time series domains
    * Can reduce training time compared to training from scratch
  </Accordion>

  <Accordion title="Patch-Based Tokenization">
    **Advanced Time Series Representation**

    The patch-based approach treats a time series as a sequence of patch tokens:

    * Input patches of 32 time steps for context understanding
    * Output patches of 128 time steps for multi-step forecasting
    * Patch masking during training improves robustness
    * Enables efficient processing of long time series
  </Accordion>

  <Accordion title="Multi-Objective Loss Optimization">
    **Sophisticated Loss Function Design**

    Our custom loss combines multiple objectives:

    * 70% MAE weight for robust point forecasting
    * 30% pinball loss weight for uncertainty quantification
    * 20 quantiles (vs. standard 10) for detailed probabilistic forecasting
    * Asymmetric penalty structure for realistic cost modeling
  </Accordion>

  <Accordion title="Foundation Model Benefits">
    **Scalable and Transferable Architecture**

    The foundation model approach provides:

    * Zero-shot forecasting on new time series
    * Few-shot adaptation to domain-specific patterns
    * Robust performance across diverse time series types
    * Efficient fine-tuning for specialized applications
  </Accordion>
</AccordionGroup>