What You’ll Learn
In this tutorial, we’ll cover:
- Multi-modal architecture design using Mistral-7B for unified text and time series processing
- Custom time series tokenization with log normalization and spike handling
- Joint training strategies with equal task weighting and shared representations
- Advanced lag feature engineering for improved temporal modeling
- Cross-domain knowledge transfer between language and time series domains
Step 1: Data Preparation Configurations
Understanding Time Series Tokenization Challenges
Time series data presents unique challenges that require specialized handling:
- Extreme Values: Time series often contain outliers or spikes that can destabilize model training
- Multi-Scale Patterns: Underlying patterns exist across different orders of magnitude
- Gradient Stability: Raw values can cause gradient explosion during training
Custom Tokenizer for Spike Handling
We’ll create a custom tokenizer that addresses these challenges through log normalization and spike clipping:
customize_tokenizer.py
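The page references customize_tokenizer.py without reproducing it here, so the snippet below is only a minimal sketch under stated assumptions: the class name SpikeAwareTokenizer, the bin count, and the default offset of 1 are illustrative, while the positivity offset, the log transform, and the clip at 20 follow the description later in this tutorial.

```python
# customize_tokenizer.py -- illustrative sketch, not the original file.
import numpy as np

class SpikeAwareTokenizer:
    """Maps raw series values to discrete token ids via offset -> log -> clip -> binning."""

    def __init__(self, num_bins: int = 1024, clip_value: float = 20.0, offset: float = 1.0):
        self.num_bins = num_bins      # size of the time series "vocabulary" (assumed)
        self.clip_value = clip_value  # cap applied after log scaling to tame spikes
        self.offset = offset          # keeps every value strictly positive before the log

    def normalize(self, values: np.ndarray) -> np.ndarray:
        shifted = values - values.min() + self.offset              # strictly positive
        logged = np.log(shifted)                                    # compress multi-scale values
        return np.clip(logged, a_min=None, a_max=self.clip_value)  # clip extreme spikes

    def encode(self, values) -> np.ndarray:
        normed = self.normalize(np.asarray(values, dtype=np.float64))
        lo, hi = normed.min(), normed.max()
        scaled = (normed - lo) / (hi - lo + 1e-8)                   # rescale to [0, 1)
        return (scaled * (self.num_bins - 1)).astype(np.int64)
```

encode returns integer ids that can be given their own rows in the model’s embedding table alongside the regular text vocabulary.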
Text Data Preparation
text_data_config.py
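text_data_config.py itself is not shown on this page; the following is a hedged sketch of what a text preparation config could look like. Every key, path, and value is an assumption rather than the platform’s actual schema.

```python
# text_data_config.py -- hypothetical sketch; keys, paths, and values are assumptions.
text_data_config = {
    "dataset_path": "data/text_corpus.jsonl",  # location of the raw text corpus
    "text_field": "text",                      # field that holds each document
    "tokenizer": "mistralai/Mistral-7B-v0.1",  # tokenizer matching the base architecture
    "max_seq_length": 4096,                    # truncate or pack to this length
    "packing": True,                           # pack short documents into full sequences
}
```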
Time Series Data Preparation
ts_data_config.py
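Similarly, ts_data_config.py is sketched here with assumed keys and values; the only details taken from the tutorial are the custom tokenizer and the automatic lag features.

```python
# ts_data_config.py -- hypothetical sketch; keys and values are assumptions.
ts_data_config = {
    "dataset_path": "data/series.parquet",  # raw time series source
    "timestamp_column": "timestamp",
    "target_column": "value",
    "context_length": 512,                  # history window fed to the model
    "prediction_length": 64,                # forecast horizon
    "ts_tokenizer": "SpikeAwareTokenizer",  # custom log/clip tokenizer from above
    "lag_features": "auto",                 # let the platform derive lag features
}
```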
Run Data Preparation
Prepare both datasets using the two configurations above.
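The exact run command is not included on this page. As a stand-in, the hypothetical driver below only loads the two configs; prepare() is a placeholder for whatever preparation entry point the platform actually exposes.

```python
# prepare_data.py -- hypothetical driver; prepare() stands in for the platform's real entry point.
from text_data_config import text_data_config
from ts_data_config import ts_data_config

def prepare(config: dict) -> None:
    # Placeholder: the real pipeline would tokenize, pack, and serialize
    # the dataset described by `config`.
    print(f"Preparing dataset from {config['dataset_path']}")

if __name__ == "__main__":
    prepare(text_data_config)  # text corpus
    prepare(ts_data_config)    # time series corpus
```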
Step 2: Joint Training Configuration
Configure the model for joint training on both text and time series with equal weighting:
joint_train_config.py
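joint_train_config.py is likewise sketched rather than reproduced. The Mistral-7B base, Xavier initialization, custom tokenizer, automatic lag features, and 50/50 task weighting come from this tutorial; all other keys and hyperparameter values are assumptions.

```python
# joint_train_config.py -- illustrative sketch; hyperparameter values are assumptions.
joint_train_config = {
    "base_architecture": "mistralai/Mistral-7B-v0.1",   # architecture only; weights are re-initialized
    "initialization": "xavier",                         # train from scratch with Xavier init
    "ts_tokenizer": "SpikeAwareTokenizer",              # custom log/clip tokenizer from Step 1
    "lag_features": "auto",                             # automatic lag feature engineering
    "task_weights": {"text": 0.5, "time_series": 0.5},  # equal weighting between tasks
    "learning_rate": 3e-4,                              # assumed
    "batch_size": 64,                                   # assumed
    "max_steps": 100_000,                               # assumed
}
```

With this configuration, the training run will: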
- Initialize a Mistral-7B model from scratch with Xavier initialization
- Adapt the architecture to handle both text tokens and time series tokens
- Apply the custom time series tokenizer with log normalization during training
- Generate and utilize lag features for improved temporal modeling
- Balance training between text and time series tasks with equal weighting
Advanced Multi-Modal Features
This tutorial demonstrates several cutting-edge capabilities:
Joint Architecture Adaptation
Unified Model for Dual Modalities
The Mistral-7B architecture is automatically adapted for joint training (a simplified sketch follows this list):
- Shared transformer layers process both text and time series tokens
- Modality-specific input/output heads handle domain differences
- Cross-attention mechanisms enable knowledge transfer between modalities
- Joint embedding space captures relationships across domains
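As a mental model only, and not the platform’s implementation, here is a heavily simplified PyTorch sketch of the shared-backbone idea: one transformer body with modality-specific embedding tables and output heads. Cross-attention between modalities is omitted for brevity, and all names are illustrative.

```python
# joint_model_sketch.py -- simplified illustration of the shared-backbone idea.
import torch
import torch.nn as nn

class JointModalityModel(nn.Module):
    """Shared transformer layers with modality-specific input/output heads."""

    def __init__(self, backbone: nn.Module, d_model: int, text_vocab: int, ts_vocab: int):
        super().__init__()
        self.backbone = backbone                          # shared Mistral-style transformer body
        self.text_embed = nn.Embedding(text_vocab, d_model)
        self.ts_embed = nn.Embedding(ts_vocab, d_model)
        self.text_head = nn.Linear(d_model, text_vocab)
        self.ts_head = nn.Linear(d_model, ts_vocab)

    def forward(self, token_ids: torch.Tensor, modality: str) -> torch.Tensor:
        # Pick the embedding table and output head for this modality,
        # but run the same transformer layers in between.
        embed = self.text_embed if modality == "text" else self.ts_embed
        head = self.text_head if modality == "text" else self.ts_head
        hidden = self.backbone(embed(token_ids))
        return head(hidden)
```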
Custom Time Series Tokenization
Advanced Spike-Aware Normalization
Our log-based normalization strategy (see the worked example after this list) provides:
- Robust handling of extreme values and outliers (spikes)
- Logarithmic scaling preserves relative relationships
- Clipping at 20 prevents gradient instability from extreme values
- Offset ensures all values are positive before log transformation
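A tiny worked example, assuming an offset of 1 and the clip value of 20 named above:

```python
import numpy as np

values = np.array([0.0, 3.0, 1e9])     # the last entry is an extreme spike
logged = np.log(values + 1.0)          # offset of 1 keeps the log defined at zero
clipped = np.clip(logged, None, 20.0)  # the spike is capped at 20; small values are untouched
print(clipped)                         # [ 0.          1.38629436 20.        ]
```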
Automatic Lag Feature Engineering
Intelligent Temporal Feature Generation
The platform automatically creates relevant lag features (see the sketch after this list):
- Adaptive lag selection based on detected patterns
- Seasonal lag features for periodic data
- Rolling statistics for trend analysis
- Cross-correlation features between different lag periods
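A minimal pandas sketch of the kinds of features meant here. The specific lags, window sizes, and column names are assumptions; the platform’s own selection is described above as adaptive rather than fixed, and cross-correlation features are omitted for brevity.

```python
# lag_features_sketch.py -- illustrative only; lags and windows are assumptions.
import pandas as pd

def add_lag_features(df: pd.DataFrame, value_col: str = "value",
                     lags=(1, 7, 24), windows=(7, 24)) -> pd.DataFrame:
    """Add simple lag and rolling-statistic columns to a regularly sampled series."""
    out = df.copy()
    for lag in lags:
        # Value `lag` steps in the past; e.g. lag 24 captures daily seasonality on hourly data.
        out[f"{value_col}_lag_{lag}"] = out[value_col].shift(lag)
    for window in windows:
        # Rolling mean and standard deviation capture local trend and volatility.
        out[f"{value_col}_roll_mean_{window}"] = out[value_col].rolling(window).mean()
        out[f"{value_col}_roll_std_{window}"] = out[value_col].rolling(window).std()
    return out
```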
Equal Task Weighting Strategy
Balanced Multi-Modal Learning
The 50/50 weighting approach (sketched in code after this list) ensures:
- Neither modality dominates the learning process
- Shared representations benefit both tasks equally
- Gradient balance prevents mode collapse
- Consistent performance improvements across both domains
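In code, the 50/50 strategy reduces to an equal-weight combination of the two task losses; the function below is an illustrative sketch rather than the platform’s actual training loop.

```python
# loss_weighting_sketch.py -- illustrative 50/50 combination of the two task losses.
import torch

def joint_loss(text_loss: torch.Tensor, ts_loss: torch.Tensor,
               text_weight: float = 0.5, ts_weight: float = 0.5) -> torch.Tensor:
    # Equal weights keep either modality from dominating the shared gradients.
    return text_weight * text_loss + ts_weight * ts_loss
```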
Cross-Domain Knowledge Transfer
Synergistic Learning Benefits
Joint training provides unique advantages:
- Language understanding improves pattern recognition in time series
- Temporal reasoning enhances sequential text processing
- Shared attention mechanisms capture long-range dependencies
- Common representation space enables zero-shot cross-domain transfer

