Evaluation & Inference

The evaluation system in Nolano.AI provides comprehensive tools for assessing model performance across different modalities and tasks. The platform supports both built-in evaluation metrics and custom evaluation functions.

EvaluationConfig

Configuration class for model evaluation and inference settings.
model_path (str, required)
  Path to the trained model checkpoint directory (e.g., /path/to/checkpoint/global_step_XXXXX).
data_config (DataConfig, required)
  Configuration for evaluation data. Similar to the training data config, but typically with validation_split=1.0.
eval_metrics (str, List[str], or callable, default: auto-selected based on training objective)
  Evaluation metrics to compute:
    • Text/code models: "perplexity", "accuracy", "bleu", "rouge"
    • Time series: "mse", "mae", "mape", "smape", "quantile_loss"
    • Custom callable functions with signature (predictions, targets) → metric_value; see the sketch after this parameter list.
batch_size (int, default: 32)
  Batch size for evaluation.
output_predictions (bool, default: False)
  Whether to save predictions to file.
output_path (str or None, default: model_path + '/evaluation')
  Directory to save evaluation results and predictions.
eval_steps (int or None, default: None)
  Maximum number of evaluation steps. Set to None for full-dataset evaluation.
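
For the custom-callable option above, the sketch below shows what such a metric might look like. It is a minimal sketch: the (predictions, targets) signature comes from the description above, while the assumption that both arguments are array-like (handled here with NumPy) and the exact_match name are illustrative, not part of the documented API.

# custom_metric.py -- hypothetical custom metric passed to eval_metrics
import numpy as np

from pynolano import DataConfig, EvaluationConfig

def exact_match(predictions, targets) -> float:
    """Fraction of predictions that exactly match their targets."""
    predictions = np.asarray(predictions)
    targets = np.asarray(targets)
    return float((predictions == targets).mean())

def build() -> EvaluationConfig:
    return EvaluationConfig(
        model_path="./checkpoints/global_step_1000",
        data_config=DataConfig(data_paths="./test_data", validation_split=1.0),
        eval_metrics=exact_match,  # a single callable; mixing names and callables is not shown here
        output_predictions=True
    )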

Built-in Evaluation Metrics

  • Perplexity: Measures how well the model predicts the next token (see the computation sketch after this list)
  • Accuracy: Token-level or sequence-level accuracy
  • BLEU: Bilingual Evaluation Understudy score for text generation quality
  • ROUGE: Recall-Oriented Understudy for Gisting Evaluation
  • CodeBLEU: Specialized BLEU variant for code generation
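
For reference, perplexity is conventionally the exponential of the mean per-token cross-entropy (negative log-likelihood). The sketch below shows that relationship; how the platform computes it internally is not specified here.

import math

def perplexity(per_token_nll) -> float:
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    return math.exp(sum(per_token_nll) / len(per_token_nll))

# perplexity([1.2, 0.9, 1.4]) ≈ exp(1.167) ≈ 3.21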

Evaluation Examples

# eval_config.py
from pynolano import EvaluationConfig, DataConfig

def build() -> EvaluationConfig:
    return EvaluationConfig(
        model_path="./checkpoints/global_step_1000",
        data_config=DataConfig(
            data_paths="./test_data",
            validation_split=1.0  # Use all data for evaluation
        ),
        eval_metrics=["perplexity", "accuracy", "bleu"],
        output_predictions=True,
        batch_size=16
    )
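
The same pattern applies to time-series models. The variant below is a sketch with placeholder checkpoint and dataset paths; the metric names come from the eval_metrics options listed above.

# eval_config_ts.py -- hypothetical time-series evaluation config
from pynolano import EvaluationConfig, DataConfig

def build() -> EvaluationConfig:
    return EvaluationConfig(
        model_path="./checkpoints/global_step_2000",  # placeholder checkpoint path
        data_config=DataConfig(
            data_paths="./test_series",               # placeholder dataset path
            validation_split=1.0
        ),
        eval_metrics=["mse", "mae", "smape"],
        output_predictions=True,
        eval_steps=500  # cap evaluation at 500 steps
    )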

Running Evaluation

# Run evaluation with configuration file
nolano evaluate eval_config.py

Inference

InferenceConfig

Configuration class for model inference settings.
batch_size (int, default: 1)
  Batch size for inference.
max_new_tokens (int, default: 512)
  Maximum number of new tokens to generate (for generative models).
temperature (float, default: 1.0)
  Sampling temperature for text generation. Higher values increase randomness.
top_p (float, default: 1.0)
  Nucleus sampling parameter. Only consider tokens with cumulative probability up to this value.
top_k (int or None, default: None)
  Only consider the k most likely tokens at each step.
do_sample (bool, default: True)
  Whether to use sampling for generation. If False, uses greedy decoding. See the sampling sketch after this parameter list.
repetition_penalty (float, default: 1.0)
  Penalty for token repetition. Values > 1.0 discourage repetition.
length_penalty (float, default: 1.0)
  Penalty for sequence length. Values > 1.0 encourage longer sequences.
device (str, default: "auto")
  Device for inference ("cuda", "cpu", or "auto").
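
To make temperature, top_k, and top_p concrete, the sketch below applies the standard versions of these transforms to a single logit vector. It illustrates the conventional technique only and is not Nolano.AI's internal sampling code.

import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=1.0, rng=None):
    """Standard temperature / top-k / nucleus (top-p) sampling over one logit vector."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)
    probs = np.exp(logits - logits.max())   # softmax, numerically stabilised
    probs /= probs.sum()

    if top_k is not None:
        # Zero out everything below the k-th largest probability.
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)

    if top_p < 1.0:
        # Keep the smallest set of tokens whose cumulative probability reaches top_p.
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        keep = order[: int(np.searchsorted(cumulative, top_p)) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = probs[keep]
        probs = mask

    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

With do_sample=False, the distribution is not sampled at all and the most likely token is taken instead (greedy decoding).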

Inference Examples

from pynolano import load_model, InferenceConfig

# Load trained model
model = load_model("./checkpoints/global_step_1000")

# Configure inference
inference_config = InferenceConfig(
    batch_size=1,
    max_new_tokens=512,  # For generative models
    temperature=0.7,     # For text generation
    top_p=0.9,
    do_sample=True
)

# Run inference
results = model.generate(
    inputs=["Your input text here"],
    config=inference_config
)
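
For reproducible outputs you can disable sampling. The snippet below reuses the API shown above; whether the sampling parameters are simply ignored when do_sample=False is an assumption.

# Deterministic decoding with the same model
greedy_config = InferenceConfig(
    max_new_tokens=256,
    do_sample=False,           # greedy decoding
    repetition_penalty=1.1     # mildly discourage repeated tokens
)

results = model.generate(
    inputs=["Your input text here"],
    config=greedy_config
)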

Evaluation Output

Evaluation results are saved in JSON format with the following structure:
{
    "model_info": {
        "model_path": "./checkpoints/global_step_1000",
        "model_config": {...},
        "evaluation_timestamp": "2024-01-15T10:30:00Z"
    },
    "dataset_info": {
        "num_samples": 10000,
        "data_paths": ["./test_data"],
        "preprocessing_config": {...}
    },
    "metrics": {
        "perplexity": 3.24,
        "accuracy": 0.876,
        "bleu": 0.445,
        "loss": 1.175
    },
    "detailed_results": {
        "per_sample_metrics": [...],
        "confidence_intervals": {...}
    }
}
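
Since the output is plain JSON, results can be inspected with the standard library. The file name below is a placeholder; only the default output_path (model_path + '/evaluation') is documented above.

import json
from pathlib import Path

# Hypothetical location: default output_path with a placeholder file name.
results_file = Path("./checkpoints/global_step_1000/evaluation/results.json")

with results_file.open() as f:
    results = json.load(f)

for name, value in results["metrics"].items():
    print(f"{name}: {value}")   # e.g. perplexity: 3.24
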
This comprehensive evaluation system enables thorough assessment of model performance and supports iterative improvement of your foundation models.