Data Preparation Configuration

DataPreparationConfig

The DataPreparationConfig class defines how raw data should be processed and tokenized for training.
input_path
str or readable object
required
Source location for raw data files or a readable object from which data can be accessed.
output_path
str or writable object
required
Destination location for processed data files or a writable object where processed data will be stored.
tokenization
str, TimeSeriesTokenizerConfig, or callable
required
Tokenization strategy based on data modality:
  • Text/string/code: String denoting a Hugging Face tokenizer name or path to a local tokenizer
  • Time series: An instance of TimeSeriesTokenizerConfig
  • Custom tokenization: A callable that maps a string to a tensor of floats/integers (2D for patch-based tokenization)
max_sequence_length
int
default:"4096"
Maximum sequence length for tokenized data.
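To make the custom-tokenization option concrete, here is a minimal sketch of a callable that could be passed as tokenization. It is a hypothetical byte-level tokenizer (not part of the library); a real implementation would typically return a tensor rather than a plain list:

```python
def byte_tokenize(text: str) -> list[int]:
    # Hypothetical custom tokenizer: map each UTF-8 byte to an integer ID.
    # A real implementation would return a tensor of ints
    # (2D for patch-based tokenization).
    return list(text.encode("utf-8"))

token_ids = byte_tokenize("hi")  # one integer per UTF-8 byte
```

Any callable with this string-to-IDs shape can be supplied in place of a Hugging Face tokenizer name.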

TimeSeriesTokenizerConfig

Configuration for time series-specific tokenization parameters.
type
str
required
Tokenization approach: "patch_based" (Chronos Bolt, TimesFM, TiRex style) or "bin_quant_based" (Chronos style).
input_patch_size
int or None
default:"None"
Size of input patches. Required for patch-based tokenization.
output_patch_size
int or None
default:"None"
Size of output patches. Required for patch-based tokenization. Can differ from input patch size (e.g., TimesFM uses longer output patches than input).
patch_masking
bool or None
default:"False"
Enables the patch masking strategy. Applicable only to patch-based tokenization. Helps models learn to predict well at context lengths that are multiples of the input patch size (see the TimesFM paper).
number_of_bins
int or None
default:"4096"
Number of quantization bins. Required for bin quantization-based tokenization.
normalization_method
str or callable
default:"z-norm"
Normalization strategy. Also accepts a custom callable that maps a list of numbers to a list of numbers.
normalization_mean
float or None
default:"None"
Mean value for normalization. Leave as None for custom normalization methods. For z-norm, the mean is computed over the first patch (patch-based tokenization, following TimesFM) or over the entire series (bin quantization, following Chronos).
normalization_std
float or None
default:"None"
Standard deviation for normalization. Computed series-wise from the first patch (patch-based) or from the entire series (bin quantization).
The system automatically handles padding, missing values, and end-of-sequence tokens.
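For illustration, the first-patch z-norm statistics described above could be sketched as follows. This is a hypothetical stand-in for the library's internal normalization, shown only to clarify which values the mean and standard deviation are computed over:

```python
def znorm_first_patch(series: list[float], input_patch_size: int) -> list[float]:
    # Compute mean and std over the first patch only (patch-based convention,
    # following TimesFM), then normalize the entire series with those statistics.
    patch = series[:input_patch_size]
    mean = sum(patch) / len(patch)
    var = sum((x - mean) ** 2 for x in patch) / len(patch)
    std = var ** 0.5 or 1.0  # guard against a zero standard deviation
    return [(x - mean) / std for x in series]
```

For bin quantization, the same statistics would instead be computed over the entire series.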

Supported Data Formats and Modalities

  • Text/Code: JSONL files with one JSON object per line, each containing a text key
  • Time Series: AutoGluonTS compatible formats (format specification coming soon)
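A minimal input file in the expected text/code shape (one JSON object per line, each with a text key) can be produced like this; the file name and contents are illustrative:

```python
import json

records = [
    {"text": "def add(a, b):\n    return a + b"},
    {"text": "Hello world."},
]

# Write one JSON object per line, each with a "text" key (JSONL).
with open("text_data.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```

The resulting file can be passed directly as input_path.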

Example Usage

from pynolano import DataPreparationConfig

def build() -> DataPreparationConfig:
    return DataPreparationConfig(
        input_path="./text_data.jsonl",   # JSONL file with one {"text": ...} object per line
        output_path="./prepared_text",    # destination for processed data
        tokenization="Qwen/Qwen3-4B",     # Hugging Face tokenizer name
        max_sequence_length=2048,
    )