Data Preparation Configuration
DataPreparationConfig
The DataPreparationConfig class defines how raw data is processed and tokenized for training.
Source location for raw data files or a readable object from which data can be accessed.
Destination location for processed data files or a writable object where processed data will be stored.
Tokenization strategy based on data modality:
- Text/string/code: String denoting a Hugging Face tokenizer name or path to a local tokenizer
- Time series: An instance of TimeSeriesTokenizerConfig
- Custom tokenization: A callable that maps a string to a tensor of floats or integers (2D for patch-based tokenization)
Maximum sequence length for tokenized data.
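The custom tokenization option above expects a callable from a string to a tensor of numbers. The following is only a sketch under stated assumptions: nested Python lists stand in for a 2D tensor, the input is assumed to be comma-separated numbers, and the name csv_patch_tokenizer and its patch_size argument are hypothetical, not part of the library.

```python
from typing import List


def csv_patch_tokenizer(raw: str, patch_size: int = 4) -> List[List[float]]:
    """Hypothetical custom tokenizer: comma-separated numbers -> 2D patches.

    Nested lists stand in for a tensor of shape (num_patches, patch_size).
    Trailing values that do not fill a whole patch are dropped in this
    sketch; the real system may pad them instead.
    """
    # Skip empty fields so that "" or a trailing comma does not raise.
    values = [float(tok) for tok in raw.split(",") if tok.strip()]
    num_patches = len(values) // patch_size
    return [
        values[i * patch_size:(i + 1) * patch_size]
        for i in range(num_patches)
    ]


patches = csv_patch_tokenizer("1, 2, 3, 4, 5, 6, 7, 8, 9", patch_size=4)
print(patches)
```

A callable of this shape could then be supplied as the tokenization strategy in place of a tokenizer name or a TimeSeriesTokenizerConfig instance.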
TimeSeriesTokenizerConfig
Configuration for time series-specific tokenization parameters.
Tokenization approach: "patch_based" (Chronos Bolt, TSFM, TiRex style) or "bin_quant_based" (Chronos style).
Size of input patches. Required for patch-based tokenization.
Size of output patches. Required for patch-based tokenization. Can differ from input patch size (e.g., TimesFM uses longer output patches than input).
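To illustrate how input and output patch sizes can differ, here is a rough sketch that pairs each context ending on an input-patch boundary with the output patch that follows it. The function name make_patch_pairs and the pairing scheme are assumptions made for illustration; the library's actual batching logic is not shown in this document.

```python
from typing import List, Sequence, Tuple


def make_patch_pairs(
    series: Sequence[float], input_patch: int, output_patch: int
) -> List[Tuple[List[float], List[float]]]:
    """Pair every input-patch-aligned context with the next output patch.

    Hypothetical helper: contexts grow by one input patch at a time, and
    each target is the output_patch values immediately after the context.
    """
    pairs = []
    end = input_patch
    # Stop once there are not enough values left for a full output patch.
    while end + output_patch <= len(series):
        context = list(series[:end])
        target = list(series[end:end + output_patch])
        pairs.append((context, target))
        end += input_patch
    return pairs


# Input patches of 2 values, output patches of 4 values.
for context, target in make_patch_pairs(list(range(10)), 2, 4):
    print(len(context), target)
```

With a longer output patch, each prediction step covers more of the horizon than one input patch does, which is the asymmetry described above.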
Enables patch masking strategy. Only applicable for patch-based tokenization. Helps models learn to predict well for all context lengths, not only those that are multiples of the input patch length (see TimesFM paper).
Number of quantization bins. Required for bin quantization-based tokenization.
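As a rough illustration of what the number of bins controls, the sketch below maps already-normalized values to integer bin indices using uniform bins over a fixed range. The function name quantize, the default range of [-1, 1], and the clipping behaviour are assumptions for this example; the actual bin edges used by the library may differ.

```python
from typing import List, Sequence


def quantize(
    values: Sequence[float], num_bins: int, low: float = -1.0, high: float = 1.0
) -> List[int]:
    """Map each value to a uniform bin index in [0, num_bins - 1].

    Hypothetical helper: values outside [low, high] are clipped to the
    nearest edge before binning.
    """
    width = (high - low) / num_bins
    tokens = []
    for v in values:
        clipped = min(max(v, low), high)
        idx = int((clipped - low) / width)
        # The upper edge itself falls into the last bin.
        tokens.append(min(idx, num_bins - 1))
    return tokens


print(quantize([-1.0, -0.5, 0.0, 0.5, 1.0], num_bins=4))
```

More bins give a finer resolution at the cost of a larger token vocabulary.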
Normalization strategy. Accepts a custom function that maps a list of numbers to a list of numbers.
Mean value for normalization. Should remain None for custom normalization methods. For z-norm, the mean is computed over the first patch (patch-based tokenization, following TimesFM) or over the entire series (bin quantization, following Chronos).
Standard deviation for normalization. Computed per series from the first patch (patch-based) or from the entire series (bin quantization).
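The difference between the two z-norm variants comes down to which window the statistics are taken from. A minimal sketch, assuming a population standard deviation and a small epsilon for numerical safety; the helper names znorm_stats and znorm are hypothetical and only mirror the behaviour described above.

```python
from statistics import fmean, pstdev
from typing import List, Optional, Sequence, Tuple


def znorm_stats(
    series: Sequence[float], approach: str, input_patch: Optional[int] = None
) -> Tuple[float, float]:
    """Return (mean, std) from the first patch or from the whole series."""
    if approach == "patch_based":
        if input_patch is None:
            raise ValueError("input_patch is required for patch_based")
        window = series[:input_patch]  # first patch only
    elif approach == "bin_quant_based":
        window = series  # entire series
    else:
        raise ValueError(f"unknown approach: {approach!r}")
    # Assumption: population standard deviation (divide by N).
    return fmean(window), pstdev(window)


def znorm(series: Sequence[float], mean: float, std: float,
          eps: float = 1e-8) -> List[float]:
    """Apply z-normalization; eps guards against a zero standard deviation."""
    return [(x - mean) / (std + eps) for x in series]


series = [1.0, 3.0, 5.0, 7.0]
mean, std = znorm_stats(series, "patch_based", input_patch=2)
print(mean, std, znorm(series, mean, std))
```

Because the patch-based variant only looks at the first patch, later values can fall far outside the normalized range when the series trends strongly.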
The system automatically handles padding, missing values, and end-of-sequence tokens.
Supported Data Formats and Modalities
- Text/Code: JSONL files containing a list of dictionaries (one per line), each with a text key
- Time Series: AutoGluonTS compatible formats (format specification coming soon)
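For the text/code case, a JSONL file of this shape can be produced and read back with the standard library alone. This is only a sketch: the helper names write_jsonl and read_texts and the file name are illustrative, and the only detail taken from the description above is that each record carries a text key.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterable, List


def write_jsonl(path: Path, records: Iterable[Dict[str, str]]) -> None:
    """Write one JSON object per line."""
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_texts(path: Path) -> List[str]:
    """Read the text field from every non-empty line of a JSONL file."""
    texts = []
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                texts.append(json.loads(line)["text"])
    return texts


with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "train.jsonl"  # illustrative file name
    write_jsonl(sample_path, [{"text": "def add(a, b): return a + b"},
                              {"text": "Hello, world."}])
    print(read_texts(sample_path))
```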

