This advanced tutorial demonstrates building unified foundation models that can understand both natural language and temporal patterns simultaneously. We’ll train a single model from scratch using the Mistral-7B-Instruct architecture, jointly optimizing on text generation and time series forecasting tasks.

What You’ll Learn

In this tutorial, we’ll cover:
  • Multi-modal architecture design using Mistral-7B for unified text and time series processing
  • Custom time series tokenization with log normalization and spike handling
  • Joint training strategies with equal task weighting and shared representations
  • Advanced lag feature engineering for improved temporal modeling
  • Cross-domain knowledge transfer between language and time series domains
You’ll need one text dataset and one time series dataset ready in the required format.

Step 1: Data Preparation Configurations

Understanding Time Series Tokenization Challenges

Time series data presents unique challenges that require specialized handling:
  • Extreme Values: Time series often contain outliers or spikes that can destabilize model training
  • Multi-Scale Patterns: Underlying patterns exist across different orders of magnitude
  • Gradient Stability: Raw values can cause gradient explosion during training
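To make the spike problem concrete, here is a small illustration (with made-up values) of how a single outlier dominates the raw scale while a log transform compresses it:

import numpy as np

# Made-up hourly demand values containing one extreme spike
raw = np.array([120.0, 131.5, 118.2, 450000.0, 125.7])

print(raw.std())          # ~1.8e5 -- the spike dominates the scale of the raw values
print(np.log(raw).std())  # ~3.3   -- the log transform compresses the spike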

Custom Tokenizer for Spike Handling

We’ll start with a custom normalization function that addresses these challenges through a log transformation and spike clipping; it will be plugged into the time series tokenizer configuration below:
customize_tokenizer.py
import numpy as np

def log_spike_normalization(time_series):
    """Normalize time series with spike handling via log transformation."""
    ts_array = np.array(time_series, dtype=np.float32)
    
    # Shift so the minimum value becomes 1, keeping every value strictly
    # positive (and the smallest log value at 0) before the log transform
    min_val = np.min(ts_array)
    if min_val < 1:
        ts_array = ts_array + (1 - min_val)
    
    # Apply log transformation and clip extreme values
    ts_log = np.log(ts_array)
    return np.clip(ts_log, a_min=None, a_max=20.0).tolist()
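A quick sanity check of the function (with made-up values) shows the effect of the offset and the clipping:

from customize_tokenizer import log_spike_normalization

# Series containing zero, a negative value, and an extreme spike (made-up numbers)
series = [0.0, -2.5, 10.0, 1e12]

print(log_spike_normalization(series))
# ≈ [1.25, 0.0, 2.60, 20.0]: the minimum (-2.5) is shifted to 1 (log 0.0),
# and the spike's log value is clipped at 20.0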
Now we’ll create separate data preparation configurations for text and time series data:

Text Data Preparation

text_data_config.py
from pynolano import DataPreparationConfig

def build() -> DataPreparationConfig:
    return DataPreparationConfig(
        input_path="./text_training_data.jsonl",
        output_path="./prepared_text_data",
        tokenization="mistralai/Mistral-7B-Instruct-v0.1",  # Use same tokenizer as model
        max_sequence_length=4096
    )
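The exact input schema for text data is defined by the platform; purely as a hypothetical illustration, a JSONL file could carry one document per line (the "text" field name below is an assumption, not the documented format):

import json

# Hypothetical record layout for text_training_data.jsonl -- verify against the
# platform's data format documentation; the "text" field name is an assumption
record = {"text": "Q3 demand rose steadily across all regions, driven by seasonal promotions."}

with open("text_training_data.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")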

Time Series Data Preparation

ts_data_config.py
from pynolano import DataPreparationConfig, TimeSeriesTokenizerConfig
from customize_tokenizer import log_spike_normalization

def build() -> DataPreparationConfig:
    return DataPreparationConfig(
        input_path="./time_series_data.jsonl",
        output_path="./prepared_ts_data",
        tokenization=TimeSeriesTokenizerConfig(
            type="chronos",  # Use Chronos tokenizer as specified
            normalization_method=log_spike_normalization,  # Custom normalization function
        ),
        max_sequence_length=2048
    )
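Similarly, a hypothetical time series record might pair raw values with a start timestamp and frequency (the field names below are assumptions; check the platform's documentation for the required schema):

import json

# Hypothetical record layout for time_series_data.jsonl -- the field names are
# assumptions, not the platform's documented schema
record = {
    "series": [120.0, 131.5, 118.2, 450000.0, 125.7],  # raw values, including a spike
    "start": "2024-01-01T00:00:00",
    "frequency": "1h",
}

with open("time_series_data.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")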

Run Data Preparation

Prepare both datasets:
# Prepare text data
nolano prepare_data text_data_config.py

# Prepare time series data  
nolano prepare_data ts_data_config.py

Step 2: Joint Training Configuration

Configure the model for joint training on both text and time series with equal weighting:
joint_train_config.py
from pynolano import (
    ExperimentConfig, 
    DataConfig, 
    ModelConfig, 
    OptimizationConfig,
    MetaConfig
)

def build() -> ExperimentConfig:
    return ExperimentConfig(
        # Dual-modal data configuration with equal sampling weights
        data_configs=[
            DataConfig(
                data_paths="./prepared_text_data",
                training_objective="cross_entropy",  # Standard language modeling objective
                sampling_weight=0.5,  # Equal weight - 50% of training data
            ),
            DataConfig(
                data_paths="./prepared_ts_data", 
                training_objective="cross_entropy",  # Cross entropy for time series tokens
                sampling_weight=0.5,  # Equal weight - 50% of training data
                features=["lag_features"]
            )
        ],
        model_config=ModelConfig(
            architecture="mistralai/Mistral-7B-Instruct-v0.1",
            init_method="xavier_uniform",  # Xavier random initialization from scratch
            # The platform will automatically adapt the architecture for multi-modal inputs
        ),
        optimization_config=OptimizationConfig(
            total_training_steps=30000,
            max_learning_rate=2e-4,  # Balanced learning rate for joint training
            global_batch_size=64,
            learning_rate_schedule="cosine",
            warmup_steps=3000,
            weight_decay=0.01,
            gradient_clipping=1.0  # Important for stability in multi-modal training
        ),
        meta_config=MetaConfig(
            name="joint-text-timeseries-mistral-7b",
            model_save_frequency=3000,
            max_checkpoints=5,
            seed=42
        )
    )
Start the multi-modal training process:
nolano train joint_train_config.py
The platform will automatically:
  • Initialize a Mistral-7B model from scratch with Xavier initialization
  • Adapt the architecture to handle both text tokens and time series tokens
  • Apply the custom time series tokenizer with log normalization during training
  • Generate and utilize lag features for improved temporal modeling
  • Balance training between text and time series tasks with equal weighting

Advanced Multi-Modal Features

This tutorial demonstrates several cutting-edge capabilities:

Unified Model for Dual Modalities

The Mistral-7B architecture is automatically adapted for joint training (a conceptual sketch follows this list):
  • Shared transformer layers process both text and time series tokens
  • Modality-specific input/output heads handle domain differences
  • Cross-attention mechanisms enable knowledge transfer between modalities
  • Joint embedding space captures relationships across domains
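The adaptation happens automatically on the platform; purely to build intuition, here is a minimal PyTorch-style sketch (not the platform's implementation, and ignoring causal masking, cross-attention details, and Mistral specifics) of a shared trunk with modality-specific embeddings and heads:

import torch
import torch.nn as nn

class DualModalityLM(nn.Module):
    """Conceptual sketch only: shared transformer trunk, per-modality embeddings and heads."""

    def __init__(self, d_model=512, n_layers=2, text_vocab=32000, ts_vocab=4096):
        super().__init__()
        # Modality-specific input embeddings project both token spaces into one joint space
        self.text_embed = nn.Embedding(text_vocab, d_model)
        self.ts_embed = nn.Embedding(ts_vocab, d_model)
        # Shared transformer layers process text tokens and time series tokens alike
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Modality-specific output heads predict the next token in each vocabulary
        self.text_head = nn.Linear(d_model, text_vocab)
        self.ts_head = nn.Linear(d_model, ts_vocab)

    def forward(self, token_ids, modality):
        embed = self.text_embed if modality == "text" else self.ts_embed
        head = self.text_head if modality == "text" else self.ts_head
        hidden = self.backbone(embed(token_ids))
        return head(hidden)  # next-token logits for the given modality

logits = DualModalityLM()(torch.randint(0, 4096, (1, 16)), modality="ts")  # (1, 16, ts_vocab)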

Advanced Spike-Aware Normalization

Our log-based normalization strategy provides:
  • Robust handling of extreme values and outliers (spikes)
  • Logarithmic scaling preserves relative relationships
  • Clipping at 20 prevents gradient instability from extreme values
  • Offset ensures all values are positive before log transformation

Intelligent Temporal Feature Generation

The platform automatically creates relevant lag features (illustrated with a small sketch after this list):
  • Adaptive lag selection based on detected patterns
  • Seasonal lag features for periodic data
  • Rolling statistics for trend analysis
  • Cross-correlation features between different lag periods
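These features are generated by the platform; as a rough illustration of what lag and rolling features look like in general, here is a small pandas sketch (the column names are made up):

import pandas as pd

# Made-up hourly series
df = pd.DataFrame({"value": [120.0, 131.5, 118.2, 125.7, 129.3, 127.1]})

# Lag feature: the value one step in the past (a seasonal lag such as shift(24)
# would capture daily periodicity in hourly data)
df["lag_1"] = df["value"].shift(1)

# Rolling statistic: a short moving average captures local trend
df["rolling_mean_3"] = df["value"].rolling(window=3).mean()

print(df)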

Balanced Multi-Modal Learning

The 50/50 weighting approach (see the sampling sketch after this list) ensures:
  • Neither modality dominates the learning process
  • Shared representations benefit both tasks equally
  • Gradient balance prevents mode collapse
  • Consistent performance improvements across both domains
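As a sketch of the general idea (not necessarily how the platform schedules batches), equal sampling weights can be realized by drawing each training batch from a dataset chosen with probability proportional to its weight:

import numpy as np

rng = np.random.default_rng(42)
datasets = ["text", "time_series"]
weights = [0.5, 0.5]  # the sampling_weight values from the two DataConfig entries

# Over 30,000 steps, roughly half of the batches come from each dataset
choices = rng.choice(datasets, size=30000, p=weights)
print({name: int((choices == name).sum()) for name in datasets})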

Synergistic Learning Benefits

Joint training provides unique advantages:
  • Language understanding improves pattern recognition in time series
  • Temporal reasoning enhances sequential text processing
  • Shared attention mechanisms capture long-range dependencies
  • Common representation space enables zero-shot cross-domain transfer