Data Preparation Configuration
DataPreparationConfig
The DataPreparationConfig class defines how raw data is processed and tokenized for training.
Source location for raw data files or a readable object from which data can be accessed.
Destination location for processed data files or a writable object where processed data will be stored.
Tokenization strategy based on data modality:
- Text/string/code: String denoting a Hugging Face tokenizer name or path to a local tokenizer
- Time series: An instance of TimeSeriesTokenizerConfig
- Custom tokenization: A callable function that can map string → tensor of floats/integers (2D for patch-based)
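To make the custom tokenization option concrete, here is a minimal sketch of two callables with the string-to-numbers shape described above. The function names, the comma-separated input format, and the use of nested Python lists as a stand-in for tensors are all assumptions made for illustration; they are not part of the documented API.

```python
from typing import List


def byte_tokenizer(text: str) -> List[int]:
    """Map a string to a 1D sequence of integer token IDs (its UTF-8 byte values)."""
    return list(text.encode("utf-8"))


def patch_tokenizer(text: str, patch_size: int = 4) -> List[List[float]]:
    """Parse comma-separated numbers and group them into fixed-size patches (2D output)."""
    values = [float(v) for v in text.split(",") if v.strip()]
    # Assumption of this sketch: trailing values that do not fill a whole patch are dropped.
    usable = len(values) - len(values) % patch_size
    return [values[i:i + patch_size] for i in range(0, usable, patch_size)]
```

In a real pipeline the nested lists would typically be converted to the tensor type the training code expects.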
Maximum sequence length for tokenized data.
TimeSeriesTokenizerConfig
Configuration for time series-specific tokenization parameters.
Tokenization approach - "patch_based" (Chronos Bolt, TSFM, TiRex style) or "bin_quant_based" (Chronos style).
Size of input patches. Required for patch-based tokenization.
Size of output patches. Required for patch-based tokenization. Can differ from input patch size (e.g., TimesFM uses longer output patches than input).
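As a rough illustration of how input and output patch sizes can differ, the sketch below pairs each input patch with the longer window of values that follows it. The function name make_patches and the pairing rule are assumptions for illustration only, not the library's implementation.

```python
from typing import List, Sequence, Tuple


def make_patches(
    series: Sequence[float], input_patch_size: int, output_patch_size: int
) -> Tuple[List[List[float]], List[List[float]]]:
    """Pair each input patch with the output_patch_size values that immediately follow it."""
    inputs: List[List[float]] = []
    targets: List[List[float]] = []
    start = 0
    # Stop once there is not enough data left for a full input patch plus a full target.
    while start + input_patch_size + output_patch_size <= len(series):
        end = start + input_patch_size
        inputs.append(list(series[start:end]))
        targets.append(list(series[end:end + output_patch_size]))
        start = end  # input patches do not overlap; targets may overlap
    return inputs, targets
```

With a 16-point series, an input patch size of 4, and an output patch size of 6, this yields two input patches, each paired with a 6-point target.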
Enables patch masking strategy. Only applicable for patch-based tokenization. Helps models learn to predict well for all context lengths, not only those that are multiples of the input patch length (see TimesFM paper).
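One way to picture patch masking is the sketch below, which hides a random number of leading points (fewer than one patch) so that the effective context length varies across training examples. This follows the general idea described in the TimesFM paper; the function name, the boolean mask convention, and the sampling details are assumptions, not the library's behavior.

```python
import random
from typing import List, Optional, Sequence, Tuple


def mask_first_patch(
    series: Sequence[float], patch_size: int, rng: Optional[random.Random] = None
) -> Tuple[List[bool], int]:
    """Return (mask, r) where the first r points are masked and 0 <= r < patch_size."""
    rng = rng or random.Random()
    r = rng.randrange(patch_size)  # number of leading points to hide
    mask = [i < r for i in range(len(series))]  # True means "masked"
    return mask, r
```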
Number of quantization bins. Required for bin quantization-based tokenization.
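For intuition about bin quantization, the following sketch maps already-normalized values to integer bin indices and back to bin centers. Uniform bins over a fixed range of [-5, 5] and the function names are assumptions chosen for illustration; the actual binning scheme may differ.

```python
from typing import List, Sequence


def quantize(values: Sequence[float], num_bins: int, low: float = -5.0, high: float = 5.0) -> List[int]:
    """Map each value to the index of the uniform bin that contains it."""
    width = (high - low) / num_bins
    tokens = []
    for v in values:
        clipped = min(max(v, low), high)  # out-of-range values fall into the edge bins
        idx = int((clipped - low) / width)
        tokens.append(min(idx, num_bins - 1))  # keep the upper edge inside the last bin
    return tokens


def dequantize(tokens: Sequence[int], num_bins: int, low: float = -5.0, high: float = 5.0) -> List[float]:
    """Map each bin index back to the center of its bin."""
    width = (high - low) / num_bins
    return [low + (t + 0.5) * width for t in tokens]
```

More bins reduce the reconstruction error per value at the cost of a larger token vocabulary.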
Normalization strategy. Accepts a custom function that maps a list of numbers to a list of numbers.
Mean value for normalization. Should remain None for custom normalization methods. For z-norm, computes mean over first patch (patch-based tokenization, following TimesFM) or entire series (bin quantization, following Chronos).
Standard deviation for normalization. Computes series-wise standard deviation based on first patch (patch-based) or entire series (bin quantization).
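The difference between the two z-norm variants can be sketched as follows: the statistics come either from the first patch only or from the whole series, and are then applied to every point. The function name, the use of population standard deviation, and the fallback for near-constant data are assumptions of this sketch.

```python
from statistics import fmean, pstdev
from typing import List, Optional, Sequence, Tuple


def z_normalize(
    series: Sequence[float], first_patch_size: Optional[int] = None, eps: float = 1e-8
) -> Tuple[List[float], float, float]:
    """Z-normalize a series using statistics from the first patch or the entire series."""
    # Patch-based: statistics from the first patch. Bin quantization: from the whole series.
    reference = series[:first_patch_size] if first_patch_size else series
    mean = fmean(reference)
    std = pstdev(reference)
    scale = std if std > eps else 1.0  # avoid dividing by zero on constant data
    return [(x - mean) / scale for x in series], mean, std
```

Because the output is a list of numbers, a function of this shape also matches the "list of numbers to list of numbers" signature expected of a custom normalization function once the statistics are dropped from the return value.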
The system automatically handles padding, missing values, and end-of-sequence tokens.
Supported Data Formats and Modalities
- Data Formats
- Supported Modalities
- Text/Code: JSONL files containing a list of dictionaries, each with a "text" key
- Time Series: AutoGluonTS compatible formats (format specification coming soon)
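As an illustration of the text/code format, the sketch below writes and reads a JSONL file in which every line is a JSON dictionary with a "text" key. The helper names and sample records are hypothetical; only the one-dictionary-per-line layout with a "text" key is taken from the description above.

```python
import json
import os
import tempfile
from typing import Dict, Iterable, List


def write_jsonl(path: str, rows: Iterable[Dict[str, str]]) -> None:
    """Write one JSON dictionary per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def read_texts(path: str) -> List[str]:
    """Read the "text" field from every non-empty line of a JSONL file."""
    texts = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                texts.append(json.loads(line)["text"])
    return texts


# Hypothetical sample records for illustration.
records = [{"text": "def add(a, b): return a + b"}, {"text": "Hello, world."}]
```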

