# Evaluation & Inference

> Comprehensive model evaluation and inference tools

## Evaluation & Inference

The evaluation system in Nolano.AI provides tools for assessing model performance across modalities and tasks. It supports both built-in evaluation metrics and custom evaluation functions.

### EvaluationConfig

Configuration class for model evaluation and inference settings.

* `model_path`: Path to the trained model checkpoint directory (e.g., `/path/to/checkpoint/global_step_XXXXX`).
* `data_config`: Configuration for evaluation data. Similar to the training data config, but typically with `validation_split=1.0`.
* `eval_metrics`: Evaluation metrics to compute:
  * **Text/code models**: `"perplexity"`, `"accuracy"`, `"bleu"`, `"rouge"`
  * **Time series**: `"mse"`, `"mae"`, `"mape"`, `"smape"`, `"quantile_loss"`
  * **Custom callable** functions with signature: `(predictions, targets) → metric_value`
* `batch_size`: Batch size for evaluation.
* `output_predictions`: Whether to save predictions to file.
* `output_path`: Directory to save evaluation results and predictions.
* Maximum number of evaluation steps. Set to `None` for full dataset evaluation.

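The custom-callable signature `(predictions, targets) → metric_value` can be illustrated with two time-series metrics written from scratch. This is a minimal sketch, not the platform's built-in implementation: the function names `smape` and `pinball_loss` are hypothetical, the inputs are assumed to be plain Python sequences of floats (the platform may pass tensors instead), and the sMAPE variant used here is the 0 to 200 percent form, which may differ from what the built-in `"smape"` metric computes.

```python
from typing import Sequence


def _check_lengths(predictions: Sequence[float], targets: Sequence[float]) -> None:
    """Reject empty or mismatched inputs before computing a mean."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if len(targets) == 0:
        raise ValueError("cannot compute a metric over an empty sequence")


def smape(predictions: Sequence[float], targets: Sequence[float]) -> float:
    """Symmetric MAPE in percent: mean of |p - t| / ((|p| + |t|) / 2) * 100.

    Pairs where both values are zero contribute 0 to the mean.
    """
    _check_lengths(predictions, targets)
    total = 0.0
    for p, t in zip(predictions, targets):
        denom = (abs(p) + abs(t)) / 2.0
        total += 0.0 if denom == 0.0 else abs(p - t) / denom
    return 100.0 * total / len(targets)


def pinball_loss(
    predictions: Sequence[float],
    targets: Sequence[float],
    quantile: float = 0.9,
) -> float:
    """Mean quantile (pinball) loss for forecasts of a single quantile level."""
    if not 0.0 < quantile < 1.0:
        raise ValueError("quantile must be strictly between 0 and 1")
    _check_lengths(predictions, targets)
    total = 0.0
    for p, t in zip(predictions, targets):
        diff = t - p
        # Under-prediction (diff > 0) is weighted by quantile,
        # over-prediction (diff < 0) by (1 - quantile).
        total += max(quantile * diff, (quantile - 1.0) * diff)
    return total / len(targets)
```

Both functions can be called with just two arguments, so they fit the documented signature. To evaluate a different quantile level, `functools.partial(pinball_loss, quantile=0.5)` yields a two-argument callable.
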
### Built-in Evaluation Metrics

* `Perplexity`: Measures how well the model predicts the next token
* `Accuracy`: Token-level or sequence-level accuracy
* `BLEU`: Bilingual Evaluation Understudy score for text generation quality
* `ROUGE`: Recall-Oriented Understudy for Gisting Evaluation
* `CodeBLEU`: Specialized BLEU variant for code generation
* `MSE`: Mean Squared Error
* `MAE`: Mean Absolute Error
* `MAPE`: Mean Absolute Percentage Error
* `sMAPE`: Symmetric Mean Absolute Percentage Error
* `Quantile Loss`: For probabilistic forecasting models
* `CRPS`: Continuous Ranked Probability Score

### Evaluation Examples

```python Text Model Evaluation
# eval_config.py
from pynolano import EvaluationConfig, DataConfig

def build() -> EvaluationConfig:
    return EvaluationConfig(
        model_path="./checkpoints/global_step_1000",
        data_config=DataConfig(
            data_paths="./test_data",
            validation_split=1.0  # Use all data for evaluation
        ),
        eval_metrics=["perplexity", "accuracy", "bleu"],
        output_predictions=True,
        batch_size=16
    )
```

```python Time Series Forecasting Evaluation
# forecast_eval_config.py
from pynolano import EvaluationConfig, DataConfig

def build() -> EvaluationConfig:
    return EvaluationConfig(
        model_path="./ts_model/global_step_5000",
        data_config=DataConfig(
            data_paths="./test_series",
            validation_split=1.0
        ),
        eval_metrics=["mse", "mae", "mape", "quantile_loss"],
        output_predictions=True,
        output_path="./evaluation_results"
    )
```

```python Custom Evaluation Metrics
import torch
from pynolano import EvaluationConfig, DataConfig

def custom_accuracy(predictions, targets):
    """Custom accuracy metric with specific threshold"""
    # A prediction counts as correct when it is within 0.1 of the target
    correct = torch.abs(predictions - targets) < 0.1
    return correct.float().mean().item()

def build() -> EvaluationConfig:
    return EvaluationConfig(
        model_path="./model_checkpoint",
        data_config=DataConfig(data_paths="./eval_data", validation_split=1.0),
        eval_metrics=[custom_accuracy, "mse"],
        batch_size=32
    )
```

### Running Evaluation

```bash Configuration File
# Run evaluation with configuration file
nolano evaluate eval_config.py
```

```bash Command Line
# Quick evaluation with command line arguments
nolano evaluate --model_path ./checkpoints/global_step_1000 --data_path ./test_data --metrics perplexity accuracy
```

```bash Custom Output
# Evaluation with specific output directory
nolano evaluate eval_config.py --output_path ./custom_eval_results
```

## Inference

### InferenceConfig

Configuration class for model inference settings.

* `batch_size`: Batch size for inference.
* `max_new_tokens`: Maximum number of new tokens to generate (for generative models).
* `temperature`: Sampling temperature for text generation. Higher values increase randomness.
* `top_p`: Nucleus sampling parameter. Only consider tokens with cumulative probability up to this value.
* Only consider the k most likely tokens at each step.
* `do_sample`: Whether to use sampling for generation. If `False`, uses greedy decoding.
* Penalty for token repetition. Values > 1.0 discourage repetition.
* Penalty for sequence length. Values > 1.0 encourage longer sequences.
* Device for inference (`'cuda'`, `'cpu'`, `'auto'`).

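How temperature, top-k, nucleus (top-p) filtering and greedy decoding interact is easier to see on a toy example. The following is a minimal, library-independent sketch of one common convention (temperature scaling, then top-k, then top-p on the renormalized distribution, keeping the smallest set of tokens whose cumulative probability reaches `top_p`). It does not describe how Nolano.AI implements these settings internally, and the function names `filtered_probs` and `sample_token` are hypothetical.

```python
import math
import random
from typing import List, Optional, Sequence


def filtered_probs(
    logits: Sequence[float],
    temperature: float = 1.0,
    top_k: Optional[int] = None,
    top_p: float = 1.0,
) -> List[float]:
    """Return per-token probabilities after temperature, top-k and top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if top_k is not None and top_k < 1:
        raise ValueError("top_k must be at least 1")

    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    norm = sum(exps)
    probs = [e / norm for e in exps]

    # Rank token ids from most to least likely, then apply the top-k cutoff.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    mass_k = sum(probs[i] for i in order)

    # Nucleus step: keep the smallest prefix whose cumulative probability
    # (renormalized over the top-k survivors) reaches top_p.
    kept = []
    cumulative = 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / mass_k
        if cumulative >= top_p:
            break

    # Renormalize over the kept tokens; every other token gets probability 0.
    mass = sum(probs[i] for i in kept)
    out = [0.0] * len(probs)
    for i in kept:
        out[i] = probs[i] / mass
    return out


def sample_token(
    logits: Sequence[float],
    temperature: float = 1.0,
    top_k: Optional[int] = None,
    top_p: float = 1.0,
    do_sample: bool = True,
    rng: Optional[random.Random] = None,
) -> int:
    """Pick the next token id: greedy argmax, or a draw from the filtered distribution."""
    if not do_sample:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = filtered_probs(logits, temperature, top_k, top_p)
    rng = rng or random.Random()
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

With logits `[2.0, 1.0, 0.0]` and `top_p=0.9`, the least likely token is excluded because the first two already cover more than 90% of the probability mass. Lowering the temperature concentrates the distribution further on the top token.
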
### Inference Examples

```python Basic Inference
from pynolano import load_model, InferenceConfig

# Load trained model
model = load_model("./checkpoints/global_step_1000")

# Configure inference
inference_config = InferenceConfig(
    batch_size=1,
    max_new_tokens=512,  # For generative models
    temperature=0.7,     # For text generation
    top_p=0.9,
    do_sample=True
)

# Run inference
results = model.generate(
    inputs=["Your input text here"],
    config=inference_config
)
```

```python Batch Inference
# For large-scale inference
from pynolano import BatchInference

batch_inference = BatchInference(
    model_path="./checkpoints/global_step_1000",
    input_path="./inference_data",
    output_path="./inference_results",
    batch_size=32,
    device="cuda"
)

# Process all data
batch_inference.run()
```

```python Time Series Forecasting
# Specialized inference for time series forecasting
from pynolano import TimeSeriesForecaster

forecaster = TimeSeriesForecaster(
    model_path="./ts_model/global_step_5000",
    forecast_horizon=24,  # Number of steps to predict
    confidence_intervals=True
)

# Generate forecasts
forecasts = forecaster.predict(
    historical_data=your_time_series_data,
    prediction_length=24
)
```

### Evaluation Output

Evaluation results are saved in JSON format with the following structure:

```json
{
  "model_info": {
    "model_path": "./checkpoints/global_step_1000",
    "model_config": {...},
    "evaluation_timestamp": "2024-01-15T10:30:00Z"
  },
  "dataset_info": {
    "num_samples": 10000,
    "data_paths": ["./test_data"],
    "preprocessing_config": {...}
  },
  "metrics": {
    "perplexity": 3.24,
    "accuracy": 0.876,
    "bleu": 0.445,
    "loss": 1.175
  },
  "detailed_results": {
    "per_sample_metrics": [...],
    "confidence_intervals": {...}
  }
}
```

This comprehensive evaluation system enables thorough assessment of model performance and supports iterative improvement of your foundation models.
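
The `metrics` block of the results structure can be sanity-checked after loading. The sketch below assumes that `loss` is the mean per-token cross-entropy in nats, in which case perplexity should equal `exp(loss)`. The documentation does not state this relationship or the name of the results file, so both are assumptions, and `check_perplexity` is a hypothetical helper. Only the `metrics` block is reproduced here, as an inline string standing in for the file contents.

```python
import json
import math

# Stand-in for the contents of a results file: only the "metrics" block from
# the documented structure, with the same example values.
results_json = """
{
  "metrics": {
    "perplexity": 3.24,
    "accuracy": 0.876,
    "bleu": 0.445,
    "loss": 1.175
  }
}
"""


def check_perplexity(metrics: dict, tolerance: float = 0.01) -> bool:
    """Return True if the reported perplexity is within tolerance of exp(loss).

    Assumes loss is a mean per-token cross-entropy measured in nats.
    """
    return abs(math.exp(metrics["loss"]) - metrics["perplexity"]) <= tolerance


metrics = json.loads(results_json)["metrics"]
consistent = check_perplexity(metrics)
print(f"perplexity consistent with exp(loss): {consistent}")
```

For the example values, `exp(1.175)` is about 3.238, which agrees with the reported perplexity of 3.24 after rounding to two decimals.
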