
Training Large Language Models for Text Generation

This comprehensive tutorial demonstrates the platform’s capabilities for training text generation models from scratch. We’ll walk through building a large language model with advanced features including ternary precision, Mixture of Experts (MoE) architecture, and multi-task training across different data sources.

What You’ll Learn

In this tutorial, we’ll cover:
  • Multi-source data preparation with weighted sampling
  • Large-scale model training using the Qwen3-30B-A3B architecture
  • Advanced optimization techniques including ternary precision and MoE load balancing
  • Custom regularization with z-loss and additional loss functions
  • Platform automation features like GPU scaling and checkpointing
This tutorial showcases advanced features. If you’re new to the platform, consider starting with our quickstart guide first.

Prerequisites

1

Access to Nolano.AI

Ensure you have access to the Nolano.AI platform. Contact [email protected] if you need access.
2

Prepare Your Data

Your text data should be in JSONL format with each line containing a dictionary with a text key:
{"text": "Natural language text for general training."}
{"text": "Domain-specific content for specialized training."}
{"text": "Various text sources to demonstrate multi-task learning."}
For this tutorial, we’ll use two separate data sources to demonstrate multi-task training.
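If your raw text lives somewhere else (a list of strings, a CSV column, a database export), a few lines of standard-library Python are enough to produce this layout. The script below is a minimal sketch; the file name and example documents are placeholders:
make_jsonl.py
import json

# Placeholder documents; substitute your own corpus here.
documents = [
    "Natural language text for general training.",
    "Domain-specific content for specialized training.",
]

with open("general_text_data.jsonl", "w", encoding="utf-8") as f:
    for doc in documents:
        # One JSON object per line, each with a "text" key.
        f.write(json.dumps({"text": doc}, ensure_ascii=False) + "\n")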

Step 1: Data Preparation

We’ll prepare two separate datasets that will be combined during training with different sampling weights. This demonstrates the platform’s multi-task learning capabilities.

Prepare General Text Data

First, let’s prepare our general text dataset (70% of training data):
general_data_config.py
from pynolano import DataPreparationConfig

def build() -> DataPreparationConfig:
    return DataPreparationConfig(
        input_path="./general_text_data.jsonl",
        output_path="./prepared_general_text",
        tokenization="Qwen/Qwen2.5-32B-Instruct",  # Using Qwen tokenizer to match our model
        max_sequence_length=4096
    )

Prepare Domain-Specific Data

Next, prepare the domain-specific dataset (30% of training data):
domain_data_config.py
from pynolano import DataPreparationConfig

def build() -> DataPreparationConfig:
    return DataPreparationConfig(
        input_path="./domain_specific_data.jsonl",
        output_path="./prepared_domain_data",
        tokenization="Qwen/Qwen2.5-32B-Instruct",  # Same tokenizer for consistency
        max_sequence_length=4096
    )

Run Data Preparation

Prepare both datasets:
# Prepare general text data
nolano prepare_data general_data_config.py

# Prepare domain-specific data  
nolano prepare_data domain_data_config.py
The platform automatically handles data validation, tokenization, and optimization for efficient training. Each prepared dataset will include metadata about sequence lengths, token distributions, and validation splits.
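Before preparing a large corpus, it can be worth sanity-checking token counts against the max_sequence_length of 4096 used above. The snippet below is an optional sketch, not part of the Nolano CLI; it assumes the Hugging Face transformers package and uses the same Qwen tokenizer referenced in the configs:
check_token_lengths.py
import json
from transformers import AutoTokenizer

# Same tokenizer as in the data preparation configs.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-32B-Instruct")

lengths = []
with open("general_text_data.jsonl", encoding="utf-8") as f:
    for line in f:
        text = json.loads(line)["text"]
        lengths.append(len(tokenizer(text)["input_ids"]))

print(f"documents: {len(lengths)}, max tokens: {max(lengths)}")
print(f"documents over 4096 tokens: {sum(n > 4096 for n in lengths)}")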

Step 2: Training Configuration

We’ll create two training configurations to demonstrate the platform’s flexibility: a minimal setup that covers the fundamentals, followed by an advanced configuration that showcases enterprise-grade features.

Simple Training Configuration

Let’s start with the most basic setup: a single data source and only the essential configuration options:
simple_train_config.py
from pynolano import (
    ExperimentConfig, 
    DataConfig, 
    ModelConfig, 
    OptimizationConfig,
    MetaConfig
)

def build() -> ExperimentConfig:
    return ExperimentConfig(
        # Single data source - the simplest setup
        data_configs=[
            DataConfig(
                data_paths="./prepared_general_text",
                training_objective="cross_entropy",
                validation_split=0.1
            )
        ],
        model_config=ModelConfig(
            architecture="Qwen/Qwen2.5-1.5B",  # Very small model for quick training
            init_method="normal"  # Train from scratch
        ),
        optimization_config=OptimizationConfig(
            total_training_steps=5000,  # Shorter training for quick results
            max_learning_rate=3e-4,
            global_batch_size=16,  # Smaller batch size
            learning_rate_schedule="cosine",
            warmup_steps=500
        ),
        meta_config=MetaConfig(
            name="simple-text-generation",
            model_save_frequency=1000,
            max_checkpoints=2
        )
    )

Transitioning to Advanced Features

Now that you understand the basics, let’s explore what makes our platform truly powerful. The advanced configuration below demonstrates enterprise-grade features that enable training state-of-the-art models:
  • Multi-task learning with weighted data sampling (70%/30% split)
  • Large-scale architectures (30B parameters with Mixture of Experts)
  • Cutting-edge precision (ternary/1.58-bit training)
  • Advanced regularization (z-loss and load balancing)
  • Production optimizations (distributed training, checkpointing)

Advanced Training Configuration

Here’s the full configuration showcasing the platform’s advanced capabilities:
advanced_train_config.py
from pynolano import (
    ExperimentConfig, 
    DataConfig, 
    ModelConfig, 
    OptimizationConfig,
    MetaConfig
)

def build() -> ExperimentConfig:
    return ExperimentConfig(
        # Multi-task data configuration with precise sampling weights
        data_configs=[
            DataConfig(
                data_paths="./prepared_general_text",
                training_objective="cross_entropy",
                sampling_weight=0.7,  # 70% of training samples come from this source
                validation_split=0.1
            ),
            DataConfig(
                data_paths="./prepared_domain_data",
                training_objective="cross_entropy", 
                sampling_weight=0.3,  # 30% of training samples come from this source
                validation_split=0.1
            )
        ],
        model_config=ModelConfig(
            architecture="Qwen/Qwen3-30B-A3B",  # Large MoE architecture as specified
            init_method="normal",  # Training from scratch as specified
            precision="ternary"  # Ternary precision (1.58-bit) as specified
        ),
        optimization_config=OptimizationConfig(
            total_training_steps=50000,
            max_learning_rate=1e-4,
            global_batch_size=128,
            learning_rate_schedule="cosine",
            warmup_steps=5000,
            weight_decay=0.01,
            z_loss=0.1,  # Z-loss regularization coefficient
            load_balancing=0.1  # MoE load-balancing coefficient
        ),
        meta_config=MetaConfig(
            name="advanced-text-generation-qwen3-30b",
            model_save_frequency=5000,
            max_checkpoints=5,
            seed=42
        )
    )
Large Model Training: The Qwen3-30B-A3B model requires significant computational resources. The platform will automatically scale GPU resources and manage efficient distributed training across multiple nodes.

Step 3: Start Training

Simple Model Training

Start with the simple configuration to get familiar with the platform:
nolano train simple_train_config.py
This will train a small 1.5B parameter model for 5,000 steps using a single data source. Perfect for learning the basics and seeing quick results!

Advanced Model Training

Once you’re comfortable with the simple setup, try the advanced configuration with all enterprise features:
nolano train advanced_train_config.py
This trains a production-scale 30B parameter model with multi-task learning, ternary precision, and MoE architecture - showcasing the platform’s full capabilities.
Auto-scaling: The platform automatically detects your model size and scales GPU resources accordingly. For the 30B parameter model, we recommend using at least 8x H100 GPUs.

Step 4: Batch Inference for Production

For large-scale text generation:
batch_inference.py
from pynolano import BatchInference

# Set up batch inference
batch_generator = BatchInference(
    model_path="./advanced-text-generation-qwen3-30b/global_step_50000",
    input_path="./inference_prompts.jsonl",  # JSONL with {"prompt": "..."} format
    output_path="./generated_outputs",
    batch_size=32,
    device="cuda"
)

# Configure generation parameters
generation_config = {
    "max_new_tokens": 256,
    "temperature": 0.7,
    "top_p": 0.9,
    "do_sample": True
}

# Process all prompts
batch_generator.run(config=generation_config)
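The input file uses the same one-object-per-line JSONL layout as the training data, but with a prompt key, as noted in the comment above. A minimal sketch for building it (the prompts here are placeholders):
make_prompts.py
import json

prompts = [
    "Summarize the key ideas behind Mixture of Experts models.",
    "Write a short product description for a solar-powered lamp.",
]

with open("inference_prompts.jsonl", "w", encoding="utf-8") as f:
    for prompt in prompts:
        # One {"prompt": "..."} object per line, as expected by BatchInference.
        f.write(json.dumps({"prompt": prompt}, ensure_ascii=False) + "\n")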

Advanced Training Capabilities

The tutorial you just completed showcases several cutting-edge features:
1.58-bit Precision for Efficient Large Models

Our ternary precision training reduces memory usage by ~10x while maintaining performance:
  • Automatic gradient scaling and stability
  • Mixed-precision optimizations
  • Hardware-accelerated ternary operations
  • Seamless integration with any model architecture
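To make the idea concrete, here is a rough sketch of BitNet-style absmean ternary quantization, which maps each weight to {-1, 0, +1} with a per-tensor scale. It illustrates the general technique in PyTorch and is not the platform’s internal implementation:
ternary_sketch.py
import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """Quantize a weight tensor to {-1, 0, +1} with an absmean scale (illustrative only)."""
    scale = w.abs().mean().clamp(min=eps)   # per-tensor scaling factor
    w_q = (w / scale).round().clamp(-1, 1)  # ternary values
    return w_q, scale

w = torch.randn(4, 8)
w_q, scale = ternary_quantize(w)
print(w_q.unique())                    # typically tensor([-1., 0., 1.])
print((w_q * scale - w).abs().mean())  # mean quantization error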
Intelligent Expert Load Balancing

The platform automatically optimizes MoE training:
  • Load balancing coefficient tuning (we used 0.1)
  • Expert utilization monitoring
  • Routing efficiency optimization
  • Scalable expert parallelism
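The load_balancing=0.1 coefficient weights an auxiliary loss that nudges the router toward even expert usage. Below is a minimal sketch of the widely used Switch-Transformer-style formulation; it illustrates the idea rather than the platform’s exact loss:
load_balancing_sketch.py
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, coeff: float = 0.1) -> torch.Tensor:
    """router_logits: (num_tokens, num_experts). Illustrative auxiliary loss."""
    num_experts = router_logits.shape[-1]
    probs = router_logits.softmax(dim=-1)                # routing probabilities
    assignments = F.one_hot(probs.argmax(dim=-1), num_experts).float()
    tokens_per_expert = assignments.mean(dim=0)          # fraction of tokens routed to each expert
    prob_per_expert = probs.mean(dim=0)                  # mean routing probability per expert
    return coeff * num_experts * (tokens_per_expert * prob_per_expert).sum()

print(load_balancing_loss(torch.randn(1024, 8)))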
Sophisticated Data Pipeline Orchestration

Your 70%/30% data split demonstrates:
  • Precise sampling weight control
  • Cross-entropy loss optimization per data source
  • Automatic data validation and consistency checks
  • Dynamic batch composition for optimal learning
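Conceptually, the 70%/30% split means each training example is drawn from the general or the domain source with those probabilities. The platform handles this internally; the toy sketch below only illustrates how a batch’s composition follows the sampling_weight values from the config:
sampling_sketch.py
import random

random.seed(42)
sources = ["prepared_general_text", "prepared_domain_data"]
weights = [0.7, 0.3]  # the sampling_weight values from the advanced config

# Draw a source for each of 128 examples (the global_batch_size above).
batch = random.choices(sources, weights=weights, k=128)
print(batch.count("prepared_general_text"), "general /",
      batch.count("prepared_domain_data"), "domain")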
Z-loss and Custom Regularization

The z-loss=0.1 configuration provides:
  • Improved training stability for large models
  • Better gradient flow in deep architectures
  • Reduced activation magnitude variance
  • Enhanced convergence properties
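Z-loss penalizes the log of the softmax normalizer so output logits do not drift to large magnitudes. Here is a minimal sketch of the standard formulation, scaled by the z_loss coefficient from the config (illustrative, not the platform’s exact implementation):
z_loss_sketch.py
import torch

def z_loss(logits: torch.Tensor, coeff: float = 0.1) -> torch.Tensor:
    """logits: (num_tokens, vocab_size). Penalize the squared log-partition term."""
    log_z = torch.logsumexp(logits, dim=-1)  # log of the softmax normalizer per token
    return coeff * (log_z ** 2).mean()

print(z_loss(torch.randn(32, 1000)))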