EvaluationConfig

model_path
str
required
Path to the trained model checkpoint directory (e.g., /path/to/checkpoint/global_step_XXXXX).
data_config
DataConfig
required
Configuration for evaluation data. Similar to the training data config, but typically with validation_split=1.0.
eval_metrics
str | List[str] | callable
default:"Auto-selected based on training objective"
Evaluation metrics to compute:
  • For text/code models: "perplexity", "accuracy", "bleu", "rouge"
  • For time series: "mse", "mae", "mape", "smape", "quantile_loss"
  • Custom callable functions with signature: (predictions, targets) → metric_value (see the sketch after this parameter list)
batch_size
int
default:"32"
Batch size for evaluation.
output_predictions
bool
default:"False"
Whether to save predictions to file.
output_path
str | None
default:"model_path + '/evaluation'"
Directory to save evaluation results and predictions.
eval_steps
int | None
default:"None"
Maximum number of evaluation steps. Set to None for full dataset evaluation.
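
The sketch below mirrors the fields above in a plain dataclass and shows a custom metric with the documented (predictions, targets) → metric_value signature. This is an illustration only: the real EvaluationConfig (and DataConfig) come from the library itself, so the class definition and the mean_absolute_error helper here are assumptions, not the library's API.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Union

# Illustrative mirror of the documented fields -- not the library's class.
# Import the real EvaluationConfig and DataConfig from your package instead.
@dataclass
class EvaluationConfig:
    model_path: str                                      # e.g. a global_step_* checkpoint directory
    data_config: Any                                      # your DataConfig, typically validation_split=1.0
    eval_metrics: Union[str, List[str], Callable] = "auto"
    batch_size: int = 32
    output_predictions: bool = False
    output_path: Optional[str] = None                     # defaults to model_path + '/evaluation'
    eval_steps: Optional[int] = None                       # None = evaluate the full dataset

# Custom metric following the documented (predictions, targets) -> metric_value signature.
def mean_absolute_error(predictions: List[float], targets: List[float]) -> float:
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)

# Quick sanity check of the metric itself.
assert mean_absolute_error([1.0, 2.0, 4.0], [1.0, 3.0, 2.0]) == 1.0
```

An instance would then be built with eval_metrics=mean_absolute_error for the custom metric, or with a list of string names such as ["mse", "mae"], matching the str | List[str] | callable type above.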

InferenceConfig

batch_size
int
default:"1"
Batch size for inference.
max_new_tokens
int
default:"512"
Maximum number of new tokens to generate (for generative models).
temperature
float
default:"1.0"
Sampling temperature for text generation. Higher values increase randomness.
top_p
float
default:"1.0"
Nucleus sampling parameter. Only consider tokens with cumulative probability up to this value.
top_k
int | None
default:"None"
Only consider the k most likely tokens at each step.
do_sample
bool
default:"True"
Whether to use sampling for generation. If False, uses greedy decoding.
repetition_penalty
float
default:"1.0"
Penalty for token repetition. Values > 1.0 discourage repetition.
length_penalty
float
default:"1.0"
Penalty for sequence length. Values > 1.0 encourage longer sequences.
device
str
default:"auto"
Device for inference ("cuda", "cpu", "auto").
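
As a rough illustration of how these fields fit together, the sketch below mirrors them in a dataclass and collects the generation-related ones into a keyword dictionary, applying the sampling knobs only when do_sample is enabled (greedy decoding ignores them). Both the mirrored class and the to_generation_kwargs helper are assumptions for illustration; the real InferenceConfig is provided by the library and consumed by its own inference entry point.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative mirror of the documented fields -- not the library's class.
@dataclass
class InferenceConfig:
    batch_size: int = 1
    max_new_tokens: int = 512
    temperature: float = 1.0
    top_p: float = 1.0
    top_k: Optional[int] = None
    do_sample: bool = True
    repetition_penalty: float = 1.0
    length_penalty: float = 1.0
    device: str = "auto"                  # "cuda", "cpu", or "auto"

def to_generation_kwargs(cfg: InferenceConfig) -> dict:
    """Hypothetical helper: gather the generation settings into one kwargs dict."""
    kwargs = {
        "max_new_tokens": cfg.max_new_tokens,
        "do_sample": cfg.do_sample,
        "repetition_penalty": cfg.repetition_penalty,
        "length_penalty": cfg.length_penalty,
    }
    if cfg.do_sample:
        # Sampling parameters only take effect when sampling is enabled.
        kwargs.update(temperature=cfg.temperature, top_p=cfg.top_p)
        if cfg.top_k is not None:
            kwargs["top_k"] = cfg.top_k
    return kwargs

print(to_generation_kwargs(InferenceConfig(temperature=0.7, top_k=50)))
```

With do_sample=False, only max_new_tokens and the penalty terms remain, which matches the greedy-decoding behavior described above.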