Evaluation & Inference

The evaluation system in Nolano.AI provides comprehensive tools for assessing model performance across different modalities and tasks. The platform supports both built-in evaluation metrics and custom evaluation functions.

EvaluationConfig

Configuration class for model evaluation and inference settings.
model_path (str, required)
  Path to the trained model checkpoint directory (e.g., /path/to/checkpoint/global_step_XXXXX).
data_config (DataConfig, required)
  Configuration for evaluation data. Similar to the training data config, but typically with validation_split=1.0.
eval_metrics (str, List[str], or callable, default: auto-selected based on training objective)
  Evaluation metrics to compute:
    • Text/code models: "perplexity", "accuracy", "bleu", "rouge"
    • Time series: "mse", "mae", "mape", "smape", "quantile_loss"
    • Custom callable functions with signature (predictions, targets) → metric_value; see the sketch after this parameter list.
batch_size (int, default: 32)
  Batch size for evaluation.
output_predictions (bool, default: False)
  Whether to save predictions to file.
output_path (str or None, default: model_path + '/evaluation')
  Directory to save evaluation results and predictions.
eval_steps (int or None, default: None)
  Maximum number of evaluation steps. Set to None for full-dataset evaluation.
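
For the custom-callable option above, the sketch below shows what such a metric might look like. It is a minimal sketch: the (predictions, targets) signature comes from the description above, while the assumption that both arguments are array-like (handled here with NumPy) and the exact_match name are illustrative, not part of the documented API.

# custom_metric.py -- hypothetical custom metric passed to eval_metrics
import numpy as np

from pynolano import DataConfig, EvaluationConfig

def exact_match(predictions, targets) -> float:
    """Fraction of predictions that exactly match their targets."""
    predictions = np.asarray(predictions)
    targets = np.asarray(targets)
    return float((predictions == targets).mean())

def build() -> EvaluationConfig:
    return EvaluationConfig(
        model_path="./checkpoints/global_step_1000",
        data_config=DataConfig(data_paths="./test_data", validation_split=1.0),
        eval_metrics=exact_match,  # a single callable; mixing names and callables is not shown here
        output_predictions=True
    )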

Built-in Evaluation Metrics

  • Perplexity: Measures how well the model predicts the next token (see the computation sketch after this list)
  • Accuracy: Token-level or sequence-level accuracy
  • BLEU: Bilingual Evaluation Understudy score for text generation quality
  • ROUGE: Recall-Oriented Understudy for Gisting Evaluation
  • CodeBLEU: Specialized BLEU variant for code generation
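
For reference, perplexity is conventionally the exponential of the mean per-token cross-entropy (negative log-likelihood). The sketch below shows that relationship; how the platform computes it internally is not specified here.

import math

def perplexity(per_token_nll) -> float:
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    return math.exp(sum(per_token_nll) / len(per_token_nll))

# perplexity([1.2, 0.9, 1.4]) ≈ exp(1.167) ≈ 3.21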

Evaluation Examples

# eval_config.py
from pynolano import EvaluationConfig, DataConfig

def build() -> EvaluationConfig:
    return EvaluationConfig(
        model_path="./checkpoints/global_step_1000",
        data_config=DataConfig(
            data_paths="./test_data",
            validation_split=1.0  # Use all data for evaluation
        ),
        eval_metrics=["perplexity", "accuracy", "bleu"],
        output_predictions=True,
        batch_size=16
    )
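
The same pattern applies to time-series models. The variant below is a sketch with placeholder checkpoint and dataset paths; the metric names come from the eval_metrics options listed above.

# eval_config_ts.py -- hypothetical time-series evaluation config
from pynolano import EvaluationConfig, DataConfig

def build() -> EvaluationConfig:
    return EvaluationConfig(
        model_path="./checkpoints/global_step_2000",  # placeholder checkpoint path
        data_config=DataConfig(
            data_paths="./test_series",               # placeholder dataset path
            validation_split=1.0
        ),
        eval_metrics=["mse", "mae", "smape"],
        output_predictions=True,
        eval_steps=500  # cap evaluation at 500 steps
    )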

Running Evaluation

# Run evaluation with configuration file
nolano evaluate eval_config.py

Inference

InferenceConfig

Configuration class for model inference settings.
batch_size (int, default: 1)
  Batch size for inference.
max_new_tokens (int, default: 512)
  Maximum number of new tokens to generate (for generative models).
temperature (float, default: 1.0)
  Sampling temperature for text generation. Higher values increase randomness.
top_p (float, default: 1.0)
  Nucleus sampling parameter. Only consider tokens with cumulative probability up to this value.
top_k (int or None, default: None)
  Only consider the k most likely tokens at each step.
do_sample (bool, default: True)
  Whether to use sampling for generation. If False, uses greedy decoding. See the sampling sketch after this parameter list.
repetition_penalty (float, default: 1.0)
  Penalty for token repetition. Values > 1.0 discourage repetition.
length_penalty (float, default: 1.0)
  Penalty for sequence length. Values > 1.0 encourage longer sequences.
device (str, default: "auto")
  Device for inference ("cuda", "cpu", or "auto").
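
To make temperature, top_k, and top_p concrete, the sketch below applies the standard versions of these transforms to a single logit vector. It illustrates the conventional technique only and is not Nolano.AI's internal sampling code.

import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=1.0, rng=None):
    """Standard temperature / top-k / nucleus (top-p) sampling over one logit vector."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)
    probs = np.exp(logits - logits.max())   # softmax, numerically stabilised
    probs /= probs.sum()

    if top_k is not None:
        # Zero out everything below the k-th largest probability.
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)

    if top_p < 1.0:
        # Keep the smallest set of tokens whose cumulative probability reaches top_p.
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        keep = order[: int(np.searchsorted(cumulative, top_p)) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = probs[keep]
        probs = mask

    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

With do_sample=False, the distribution is not sampled at all and the most likely token is taken instead (greedy decoding).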

Inference Examples

from pynolano import load_model, InferenceConfig

# Load trained model
model = load_model("./checkpoints/global_step_1000")

# Configure inference
inference_config = InferenceConfig(
    batch_size=1,
    max_new_tokens=512,  # For generative models
    temperature=0.7,     # For text generation
    top_p=0.9,
    do_sample=True
)

# Run inference
results = model.generate(
    inputs=["Your input text here"],
    config=inference_config
)
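
For reproducible outputs you can disable sampling. The snippet below reuses the API shown above; whether the sampling parameters are simply ignored when do_sample=False is an assumption.

# Deterministic decoding with the same model
greedy_config = InferenceConfig(
    max_new_tokens=256,
    do_sample=False,           # greedy decoding
    repetition_penalty=1.1     # mildly discourage repeated tokens
)

results = model.generate(
    inputs=["Your input text here"],
    config=greedy_config
)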

Evaluation Output

Evaluation results are saved in JSON format with the following structure:
{
    "model_info": {
        "model_path": "./checkpoints/global_step_1000",
        "model_config": {...},
        "evaluation_timestamp": "2024-01-15T10:30:00Z"
    },
    "dataset_info": {
        "num_samples": 10000,
        "data_paths": ["./test_data"],
        "preprocessing_config": {...}
    },
    "metrics": {
        "perplexity": 3.24,
        "accuracy": 0.876,
        "bleu": 0.445,
        "loss": 1.175
    },
    "detailed_results": {
        "per_sample_metrics": [...],
        "confidence_intervals": {...}
    }
}
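
Since the output is plain JSON, results can be inspected with the standard library. The file name below is a placeholder; only the default output_path (model_path + '/evaluation') is documented above.

import json
from pathlib import Path

# Hypothetical location: default output_path with a placeholder file name.
results_file = Path("./checkpoints/global_step_1000/evaluation/results.json")

with results_file.open() as f:
    results = json.load(f)

for name, value in results["metrics"].items():
    print(f"{name}: {value}")   # e.g. perplexity: 3.24
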
This comprehensive evaluation system enables thorough assessment of model performance and supports iterative improvement of your foundation models.