Initialization
This page covers everything you need to know about the Initialization Box and the nodes you can use inside it.
Think of the Initialization Box (see example below) as your starting point: it holds all the key components needed to set up your machine learning pipeline. Inside it, you can use four essential nodes:
Dataset: Define your pipeline's data to get started.
Clean: Tidy up your data for better results.
Split: Divide your data into training and testing sets.
Model: Select and configure your machine learning model.
Dataset Node: Your Experiment's Starting Point
The Dataset node marks the beginning of your experiment; here, you define the data that your pipeline will use. In the Machine Learning workflow, this represents the learning set as depicted below:

Available Data Types
You have two flexible options to load your data:
MEDomics Standard
Automatically pulls files from a designated learning folder (typically pre-processed .csv files from the MEDprofiles workflow). The node detects compatible files and lists them in a dropdown menu.
Select your file(s), then specify the target column (the variable you want to predict).
Pro Tip: If selecting multiple files, ensure they all share the same target column.
Custom File
Upload any .csv file from your workspace using the dropdown selector. Just like with MEDomics Standard, choose your target column to define the prediction goal.
Node Options
The options in this node mirror the non-cleaning-related parameters of PyCaret's setup() function.
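For reference, here is a minimal sketch of how this maps onto PyCaret's setup() call; the file name and target column below are placeholders, not values prescribed by the node.

```python
import pandas as pd
from pycaret.classification import setup

# Placeholder learning set; in the app this comes from the Dataset node's selection.
df = pd.read_csv("learning_set.csv")

# Non-cleaning setup() parameters the node mirrors: the target column to
# predict and a fixed session seed.
s = setup(
    data=df,
    target="outcome",  # placeholder target column name
    session_id=42,     # fixed seed for reproducibility
)
```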

Clean node: Tidy and transform your data
This node helps you tidy and transform your dataset before model training. Use it to handle common data issues, such as missing values, scaling, and more, so your model receives the best possible input. In the machine learning workflow, the Clean node is part of the learning set definition step, as depicted below:

The available options for this node correspond to the data-cleaning options of PyCaret's setup() function.
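For illustration, a sketch of the kind of cleaning-related setup() parameters this node exposes; the exact options shown in the interface may differ from this selection, and the file and column names are placeholders.

```python
import pandas as pd
from pycaret.classification import setup

df = pd.read_csv("learning_set.csv")  # placeholder file name

# Cleaning-related setup() parameters of the kind the Clean node mirrors:
s = setup(
    data=df,
    target="outcome",               # placeholder target column
    imputation_type="simple",       # impute missing values
    numeric_imputation="mean",      # mean imputation for numeric columns
    categorical_imputation="mode",  # mode imputation for categorical columns
    normalize=True,                 # scale numeric features
    normalize_method="zscore",      # z-score standardization
    remove_outliers=True,           # drop detected outliers before training
)
```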

Split node: Define your train and test partitions
This node is essential for designing how your learning set will be divided for training and testing. Without it, models default to a single random subsampling iteration. In the new architecture, the Split node is used in the learning set partitioning step, as shown below:

Proper data partitioning prevents information leakage and gives reliable performance estimates—critical for trustworthy ML results. You can choose from these partitioning methods:
Cross-Validation (K-Fold)
Divides data into K equal folds, using K-1 for training and 1 for testing in each iteration
Ideal for: Small-to-medium datasets, maximizing data usage
Common practice: 5-fold or 10-fold configurations
Options to set:
num_folds: number of folds to use (K).
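For illustration, K-fold partitioning works like scikit-learn's KFold; the sketch below runs on toy data, with num_folds corresponding to n_splits:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # toy feature matrix (10 samples)

# num_folds = 5 -> each sample is used for testing exactly once
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    print(f"Fold {fold}: train={train_idx}, test={test_idx}")
```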
Random Sub-Sampling
Splits data randomly into fixed train/test percentages (e.g., 80%/20%)
Ideal for: Large datasets, quick prototyping
Tip: Stratified sampling maintains class proportions
Options to set:
test_size: Proportion of the data to allocate for the testing set (must be between 0 and 1).
n_iterations: Number of repetitions, i.e. number of splits to create. Increasing the repetitions can reduce the uncertainty in the performance estimates.
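For illustration, random sub-sampling behaves like scikit-learn's ShuffleSplit on toy data; test_size and n_iterations correspond to the parameters below (StratifiedShuffleSplit is the class-preserving variant mentioned in the tip):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(20).reshape(10, 2)  # toy feature matrix

# test_size = 0.2, n_iterations = 5 -> five independent 80%/20% splits
ss = ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
for i, (train_idx, test_idx) in enumerate(ss.split(X)):
    print(f"Split {i}: train={train_idx}, test={test_idx}")
```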
Bootstrapping
Creates multiple samples with replacement, then aggregates results
Ideal for: Very small datasets, estimating model stability
Advantage: Simulates having more data than available
Options to set:
bootstrap_train_sample_size: The proportion of the dataset to resample with replacement.
n_iterations: Number of bootstraps/splits to create. A higher number of iterations can reduce the uncertainty in the performance estimates.
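A minimal sketch of bootstrap resampling in plain numpy, under the assumption that the out-of-bag samples (those not drawn for training) serve as the test set:

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 10                     # toy dataset size
bootstrap_train_sample_size = 1.0  # resample 100% of the data, with replacement
n_iterations = 5

for i in range(n_iterations):
    size = int(bootstrap_train_sample_size * n_samples)
    train_idx = rng.choice(n_samples, size=size, replace=True)  # sample with replacement
    test_idx = np.setdiff1d(np.arange(n_samples), train_idx)    # out-of-bag indices
    print(f"Bootstrap {i}: train={train_idx}, test(out-of-bag)={test_idx}")
```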
User-Defined
Manually specify training/validation indices or custom splitting logic
Ideal for: Time-series data, special evaluation schemes
Flexibility: Import predefined splits or implement unique rules
Options to set:
train_indices: List of training indices for the training set
test_indices: List of testing indices for the testing set
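A minimal sketch of a user-defined split on toy data, e.g. a chronological split for time series; the index lists play the role of train_indices and test_indices:

```python
import numpy as np

X = np.arange(20).reshape(10, 2)  # toy feature matrix (10 samples in time order)
y = np.array([0, 1] * 5)          # toy labels

# Indices you would supply to the node: train on the first 8 samples,
# test on the last 2 (a chronological split).
train_indices = [0, 1, 2, 3, 4, 5, 6, 7]
test_indices = [8, 9]

X_train, y_train = X[train_indices], y[train_indices]
X_test, y_test = X[test_indices], y[test_indices]
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```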
General Settings for the Split Node
Before running your experiment, configure these essential settings to control how your data is partitioned:
1. Random State (random_state)
Purpose: Ensures reproducible splits by initializing the random number generator with a fixed seed.
Usage:
Enter an integer value (e.g., 42) to make split results consistent across runs.
Leave blank for truly random splits (not recommended for reproducible experiments).
2. Stratify Columns (stratify_columns)
Purpose: Maintains the original distribution of key variables (e.g., class labels) in both training and test sets, preventing skewed splits that could bias model evaluation.
Requirements:
At least one column must be selected.
Common choices: Target variables or demographic columns (e.g., age groups, gender).
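For illustration, random_state and stratify_columns together behave like scikit-learn's train_test_split with a fixed seed and a stratify argument:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)  # toy feature matrix
y = np.array([0] * 15 + [1] * 5)  # imbalanced toy labels (75% / 25%)

# random_state=42 fixes the split; stratify=y preserves the 75/25 class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(np.bincount(y_train), np.bincount(y_test))  # class counts stay proportional
```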
3. Use Tags for Stratification
Purpose: Leverages predefined Column Tags or Row Tags as stratification groups.
How It Works:
Toggle this option to activate tag-based stratification.
Available tags from your dataset will auto-populate in a dropdown menu.
Select one or more tags to use as stratification criteria.
Key Notes:
Compatible with stratify_columns (can be used simultaneously).
If no tags exist, the system will display a warning, but it will not impact execution.
Tags are especially useful for complex stratification schemes (e.g., multi-label scenarios).
Unlike the other nodes, the Split node has no extra options.

Model node: Select and configure your machine learning algorithm
This node enables you to select and customize your machine learning model. The available models and their parameters directly correspond to:
The estimator parameter in PyCaret's create_model() function
Scikit-learn's comprehensive model implementations
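For reference, a minimal sketch of the underlying create_model() call; 'rf' (random forest) is just an example estimator ID, and the extra keyword arguments are illustrative values forwarded to the scikit-learn implementation:

```python
import pandas as pd
from pycaret.classification import setup, create_model

df = pd.read_csv("learning_set.csv")                  # placeholder file name
s = setup(data=df, target="outcome", session_id=42)   # setup() must run first

# The Model node's selection maps to the estimator ID; extra parameters are
# passed through to the underlying scikit-learn model.
model = create_model("rf", n_estimators=200, max_depth=5)
```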
The Model node is used in the algorithm selection step, as shown below:


On the next page, you will learn about the Training Box, which will help you define the training process of your experiment.