Documentation Index

Fetch the complete documentation index at: https://internal.nolano.ai/llms.txt

Use this file to discover all available pages before exploring further.

Data Preparation Configuration

DataPreparationConfig

The DataPreparationConfig class defines how raw data should be processed and tokenized for training.
input_path
str or readable object
required
Source location for raw data files or a readable object from which data can be accessed.
output_path
str or writable object
required
Destination location for processed data files or a writable object where processed data will be stored.
tokenization
str, TimeSeriesTokenizerConfig, or callable
required
Tokenization strategy based on data modality:
  • Text/string/code: String denoting a Hugging Face tokenizer name or path to a local tokenizer
  • Time series: An instance of TimeSeriesTokenizerConfig
  • Custom tokenization: A callable that maps a string → tensor of floats/integers (2D for patch-based tokenization)
max_sequence_length
int
default:"4096"
Maximum sequence length for tokenized data.
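To make the custom-tokenization option concrete, here is a minimal sketch of such a callable. The byte-level mapping and the function name are illustrative only (not part of the library); a production callable would return a tensor of integers, 2D for patch-based tokenization.

```python
def byte_tokenize(text: str) -> list[int]:
    # Toy custom tokenizer: map each UTF-8 byte of the input string
    # to an integer token id. A real callable passed as `tokenization`
    # would typically return a tensor (2D for patch-based tokenization);
    # a flat list of ints is used here purely for illustration.
    return list(text.encode("utf-8"))
```

Any callable with this string-to-ids shape can be passed as the tokenization argument in place of a tokenizer name or config.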

TimeSeriesTokenizerConfig

Configuration for time series-specific tokenization parameters.
type
str
required
Tokenization approach: "patch_based" (Chronos Bolt, TimesFM, TiRex style) or "bin_quant_based" (Chronos style).
input_patch_size
int or None
default:"None"
Size of input patches. Required for patch-based tokenization.
output_patch_size
int or None
default:"None"
Size of output patches. Required for patch-based tokenization. Can differ from input patch size (e.g., TimesFM uses longer output patches than input).
patch_masking
bool or None
default:"False"
Enables a patch masking strategy. Only applicable for patch-based tokenization. Helps models learn to predict well for context lengths that are multiples of the input patch size (see the TimesFM paper).
number_of_bins
int or None
default:"4096"
Number of quantization bins. Required for bin quantization-based tokenization.
normalization_method
str or callable
default:"z-norm"
Normalization strategy. Also accepts a custom function that maps a list of numbers to a list of numbers.
normalization_mean
float or None
default:"None"
Mean value for normalization. Leave as None for custom normalization methods. For z-norm, the mean is computed over the first patch (patch-based tokenization, following TimesFM) or over the entire series (bin quantization, following Chronos).
normalization_std
float or None
default:"None"
Standard deviation for normalization. Computed series-wise from the first patch (patch-based) or from the entire series (bin quantization).
The system automatically handles padding, missing values, and end-of-sequence tokens.
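Since normalization_method accepts any callable mapping a list of numbers to a list of numbers, a custom method can be plugged in directly. The min-max scaling below is a hypothetical example, not a documented built-in:

```python
def min_max_normalize(values: list[float]) -> list[float]:
    # Hypothetical custom normalization: rescale the series to [0, 1].
    # When a custom method is used, normalization_mean and
    # normalization_std should remain None, as noted above.
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant series: map to zeros
    return [(v - lo) / (hi - lo) for v in values]
```

Passing such a function as normalization_method replaces the default z-norm behavior entirely.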

Supported Data Formats and Modalities

  • Text/Code: JSONL files with one JSON object per line, each containing a text key
  • Time Series: AutoGluonTS-compatible formats (format specification coming soon)
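For the Text/Code modality, each line of the JSONL file is a standalone JSON object with a text key. A small sketch of building such a file's contents (the sample records are illustrative):

```python
import json

# Each JSONL line is one JSON object with a "text" key.
records = [
    {"text": "The quick brown fox jumps over the lazy dog."},
    {"text": "def add(a, b):\n    return a + b"},
]
jsonl = "\n".join(json.dumps(r) for r in records) + "\n"
```

Note that json.dumps escapes embedded newlines, so each record stays on a single line of the output file.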

Example Usage

from pynolano import DataPreparationConfig

def build() -> DataPreparationConfig:
    return DataPreparationConfig(
        input_path="./text_data.jsonl",
        output_path="./prepared_text",
        tokenization="Qwen/Qwen3-4B",
        max_sequence_length=2048
    )
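A time series variant of the same pattern, assuming TimeSeriesTokenizerConfig is importable from pynolano alongside DataPreparationConfig and that the paths shown are placeholders (a sketch under those assumptions, not a verified end-to-end example):

```python
from pynolano import DataPreparationConfig, TimeSeriesTokenizerConfig

def build_ts() -> DataPreparationConfig:
    return DataPreparationConfig(
        input_path="./ts_data",            # AutoGluonTS-compatible input
        output_path="./prepared_ts",
        tokenization=TimeSeriesTokenizerConfig(
            type="patch_based",            # patch-based tokenization
            input_patch_size=32,           # required for patch_based
            output_patch_size=128,         # may differ from input size
            patch_masking=True,            # patch-based only
        ),
        max_sequence_length=2048,
    )
```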