# Data Preparation

> Configure and process your data for training

## Data Preparation Configuration

### DataPreparationConfig

The `DataPreparationConfig` class defines how raw data should be processed and tokenized for training.

* **Input path** (`input_path`): Source location of the raw data files, or a readable object from which data can be accessed.
* **Output path** (`output_path`): Destination location for the processed data files, or a writable object where processed data will be stored.
* **Tokenization** (`tokenization`): Tokenization strategy, chosen by data modality:
  * **Text/string/code**: A string naming a Hugging Face tokenizer, or the path to a local tokenizer
  * **Time series**: An instance of `TimeSeriesTokenizerConfig`
  * **Custom tokenization**: A callable that maps a string to a tensor of floats or integers (2D for patch-based tokenization)
* **Maximum sequence length** (`max_sequence_length`): Maximum sequence length for tokenized data.

### TimeSeriesTokenizerConfig

Configuration for time series-specific tokenization parameters.

* **Type** (`type`): Tokenization approach, either `"patch_based"` (Chronos Bolt, TSFM, TiRex style) or `"bin_quant_based"` (Chronos style).
* **Input patch size** (`input_patch_size`): Size of input patches. Required for patch-based tokenization.
* **Output patch size** (`output_patch_size`): Size of output patches. Required for patch-based tokenization. Can differ from the input patch size (e.g., TimesFM uses longer output patches than input patches).
* **Patch masking** (`patch_masking`): Enables the patch masking strategy. Only applicable to patch-based tokenization. Helps models learn to predict well for all context lengths, not only those that are multiples of the input patch length (see TimesFM paper).
* **Number of bins** (`number_of_bins`): Number of quantization bins. Required for bin quantization-based tokenization.
* **Normalization method** (`normalization_method`): Normalization strategy (the examples on this page use `"z-norm"`). Accepts a custom function that maps a list of numbers to a list of numbers.
* **Mean**: Mean value for normalization. Should remain `None` for custom normalization methods. For z-norm, the mean is computed over the first patch (patch-based tokenization, following TimesFM) or over the entire series (bin quantization, following Chronos).
* **Standard deviation**: Standard deviation for normalization. Computed per series from the first patch (patch-based) or from the entire series (bin quantization).

The system automatically handles padding, missing values, and end-of-sequence tokens.

## Supported Data Formats and Modalities

Supported data formats:

* **Text/Code**: JSONL files containing one dictionary (JSON object) per line, each with a `text` key
* **Time Series**: AutoGluonTS compatible formats (format specification coming soon)

Supported modalities:

* Text
* Code
* Time Series (Univariate + Multivariate)

Coming soon: Multimodal (Text + Time Series), Irregularly Sampled Time Series

## Example Usage

```python Text Data
from pynolano import DataPreparationConfig

def build() -> DataPreparationConfig:
    # Tokenize JSONL text data with a Hugging Face tokenizer
    return DataPreparationConfig(
        input_path="./text_data.jsonl",
        output_path="./prepared_text",
        tokenization="Qwen/Qwen3-4B",
        max_sequence_length=2048
    )
```

```python Time Series (Patch-based)
from pynolano import DataPreparationConfig, TimeSeriesTokenizerConfig

def build() -> DataPreparationConfig:
    # Patch-based tokenization with equal input and output patch sizes
    return DataPreparationConfig(
        input_path="./time_series_data",
        output_path="./prepared_ts",
        tokenization=TimeSeriesTokenizerConfig(
            type="patch_based",
            input_patch_size=32,
            output_patch_size=32,
            patch_masking=True,
            normalization_method="z-norm"
        )
    )
```

```python Time Series (Bin Quantization)
from pynolano import DataPreparationConfig, TimeSeriesTokenizerConfig

def build() -> DataPreparationConfig:
    # Bin quantization-based tokenization with 4096 bins
    return DataPreparationConfig(
        input_path="./time_series_data",
        output_path="./prepared_ts",
        tokenization=TimeSeriesTokenizerConfig(
            type="bin_quant_based",
            number_of_bins=4096,
            normalization_method="z-norm"
        )
    )
```

```python Custom Tokenization
import torch  # Assumption: tensors are PyTorch tensors; this page does not name the tensor library
from pynolano import DataPreparationConfig

def custom_tokenizer(text: str) -> torch.Tensor:
    # Placeholder logic: encode the string as UTF-8 byte values.
    # Replace with your own tokenization; the callable must map a string
    # to a tensor of floats/integers (2D for patch-based tokenization).
    tensor_output = torch.tensor(list(text.encode("utf-8")), dtype=torch.long)
    return tensor_output

def build() -> DataPreparationConfig:
    return DataPreparationConfig(
        input_path="./custom_data",
        output_path="./prepared_custom",
        tokenization=custom_tokenizer
    )
```
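The z-norm behaviour described under `TimeSeriesTokenizerConfig` (statistics taken from the first patch for patch-based tokenization, or from the entire series for bin quantization) can be illustrated with a small, library-independent sketch. This is not pynolano's implementation: the helper `z_norm`, its `patch_size` and `eps` parameters, the use of the population standard deviation, and the handling of constant series are all assumptions made for illustration.

```python
from statistics import fmean, pstdev

def z_norm(series, patch_size=None, eps=1e-8):
    """Z-normalize a series (hypothetical helper, not part of pynolano).

    If patch_size is given, the mean and standard deviation come from the
    first patch only (the patch-based case); otherwise they come from the
    entire series (the bin-quantization case).
    """
    reference = series[:patch_size] if patch_size else series
    mean = fmean(reference)
    std = pstdev(reference)  # assumption: population standard deviation
    # Assumption: fall back to a scale of 1.0 for (near-)constant references
    scale = std if std > eps else 1.0
    return [(x - mean) / scale for x in series]

series = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0]

# Patch-based style: statistics from the first patch of 4 values
patch_norm = z_norm(series, patch_size=4)

# Bin-quantization style: statistics from the whole series
full_norm = z_norm(series)
```

With first-patch statistics, only the first patch is guaranteed to have zero mean and unit variance after normalization; later values are scaled relative to that patch and may be large in magnitude if the series drifts.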