# Initialization

Think of the Initialization Box (see the example below) as your starting point: it holds all the key components needed to set up your machine learning pipeline. Inside, you can use four essential nodes:

* **Dataset**: Define your pipeline's data to get started.
* **Clean**: Tidy up your data for better results.
* **Split**: Divide your data into training and testing.
* **Model**: Select and configure your machine learning model.

<figure><img src="/files/6Xf0dYXukIjmWLYUp0Hj" alt=""><figcaption><p>Example of an Initialization Box</p></figcaption></figure>

## Dataset node: Your experiment's starting point

The **Dataset** node marks the beginning of your experiment; here, you define the data that your pipeline will use. In the Machine Learning workflow, this represents the learning set as depicted below:

<figure><img src="/files/E0ZJBHhMWuXxfT05rEGK" alt=""><figcaption></figcaption></figure>

**Available Data Types**

You have two flexible options to load your data:

1. **MEDomics Standard**
   * Automatically pulls files from a designated learning folder (typically pre-processed `.csv` files from the [MEDprofiles](/medomics-docs/tutorials/design/input-module/medprofiles.md) workflow).
   * The node detects compatible files and lists them in a dropdown menu.
   * Select your file(s), then specify the **target column** (the variable you want to predict).
   * *Pro Tip:* If selecting multiple files, ensure they all share the same target column.
2. **Custom File**
   * Upload any `.csv` file from your workspace using the dropdown selector.
   * Just like with MEDomics Standard, choose your **target column** to define the prediction goal.
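
To make the role of the target column concrete, here is a minimal sketch of what selecting it amounts to: the chosen column is separated from the remaining feature columns. The CSV content, the column names, and the `load_features_and_target` helper are hypothetical and only illustrate the idea; they are not part of the application.

```python
import csv
import io

# Hypothetical CSV content standing in for a file from your workspace.
CSV_TEXT = """patient_id,age,biomarker,outcome
1,54,0.82,1
2,61,0.40,0
3,47,0.65,1
"""


def load_features_and_target(csv_file, target_column):
    """Read CSV rows and separate the target column from the features."""
    reader = csv.DictReader(csv_file)
    if target_column not in (reader.fieldnames or []):
        raise ValueError(f"Target column {target_column!r} not found")
    features, target = [], []
    for row in reader:
        # Remove the target from the row; what remains are the features.
        target.append(row.pop(target_column))
        features.append(row)
    return features, target


features, target = load_features_and_target(io.StringIO(CSV_TEXT), "outcome")
```

The check on `fieldnames` mirrors the tip above: when several files are selected, each of them needs to contain the same target column.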

**Node's options**

The options in this node mirror the non-cleaning-related parameters of [PyCaret’s `setup()` function](https://pycaret.readthedocs.io/en/latest/api/classification.html#pycaret.classification.setup).

<figure><img src="/files/vxjGmPUThNKGKvSYmD9O" alt=""><figcaption><p>Breakdown of the Dataset node</p></figcaption></figure>

## Clean node: Tidy and transform your data

This node helps you tidy and transform your dataset before model training. Use it to handle common data issues, such as missing values and unscaled features, so your model receives the best possible input. In the machine learning workflow, the Clean node is used in the step that defines the learning set, as depicted below:

<figure><img src="/files/swojvgYX5TaKW5lUK0Py" alt=""><figcaption></figcaption></figure>

The available options for this node correspond to the options of [PyCaret's `setup()` function](https://pycaret.readthedocs.io/en/latest/api/classification.html#pycaret.classification.setup) that are specifically designed for data cleaning.

<figure><img src="/files/BNyGpgQ2MFpSpTKBHcgy" alt=""><figcaption><p>Breakdown of the Clean node</p></figcaption></figure>
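
As an illustration of the kind of transformation these options control, the sketch below implements two common cleaning steps in plain Python: mean imputation of missing values and min-max scaling. It is not how PyCaret implements them; the helper names and the sample values are hypothetical. In PyCaret these behaviors are driven by `setup()` options such as `numeric_imputation` and `normalize`.

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("Cannot impute a column with no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly to the [0, 1] range."""
    low, high = min(values), max(values)
    if high == low:
        # A constant column carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


ages = [50, None, 70, 60]          # hypothetical column with a missing value
imputed = impute_mean(ages)        # the gap is filled with the mean of 50, 70, 60
scaled = min_max_scale(imputed)
```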

## Split node: Define your train and test partitions

This node is essential for designing how your learning set will be divided for training and testing. Without it, models default to a single **random subsampling iteration**. In the new architecture, the Split node is used in the learning set partitioning step, as shown below:

<figure><img src="/files/gXRP1Bmxe6tzc9qdWYUa" alt=""><figcaption></figcaption></figure>

Proper data partitioning prevents information leakage and gives reliable performance estimates, both of which are critical for trustworthy ML results. You can choose from these partitioning methods:

1. **Cross-Validation (K-Fold)**
   * Divides data into *K* equal folds, using *K-1* for training and 1 for testing in each iteration
   * Ideal for: Small-to-medium datasets, maximizing data usage
   * Common practice: 5-fold or 10-fold configurations
   * Options to set:
     * **num\_folds**: number of folds to use (*K*).
2. **Random Sub-Sampling**
   * Splits data randomly into fixed train/test percentages (e.g., 80%/20%)
   * Ideal for: Large datasets, quick prototyping
   * Tip: Stratified sampling maintains class proportions
   * Options to set:
     * **test\_size**: Proportion of the data to allocate for the testing set (must be between 0 and 1).
     * **n\_iterations**: Number of repetitions, i.e. number of splits to create. Increasing the repetitions can reduce the uncertainty in the performance estimates.
3. **Bootstrapping**
   * Creates multiple samples with replacement, then aggregates results
   * Ideal for: Very small datasets, estimating model stability
   * Advantage: Simulates having more data than available
   * Options to set:
     * **bootstrap\_train\_sample\_size**: The proportion of the dataset to resample with replacement.
     * **n\_iterations**: Number of bootstraps/splits to create. A higher number of iterations can reduce the uncertainty in the performance estimates.
4. **User-Defined**
   * Manually specify training/validation indices or custom splitting logic
   * Ideal for: Time-series data, special evaluation schemes
   * Flexibility: Import predefined splits or implement unique rules
   * Options to set:
     * **train\_indices**: List of training indices for the training set
     * **test\_indices**: List of testing indices for the testing set
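
The four partitioning methods above can be sketched with index lists alone. This is a minimal illustration using only Python's standard library, not the application's implementation. The function names are hypothetical, and using the out-of-bag samples as the bootstrap test set is an assumption made for the example.

```python
import random


def k_fold_splits(n_samples, num_folds, seed=None):
    """Yield (train, test) index lists; each fold is the test set once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::num_folds] for i in range(num_folds)]
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


def random_subsampling_splits(n_samples, test_size, n_iterations, seed=None):
    """Yield n_iterations independent random train/test partitions."""
    rng = random.Random(seed)
    n_test = max(1, round(n_samples * test_size))
    for _ in range(n_iterations):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield indices[n_test:], indices[:n_test]


def bootstrap_splits(n_samples, train_sample_size, n_iterations, seed=None):
    """Yield train samples drawn with replacement.

    Assumption for this sketch: the test set is the out-of-bag samples,
    i.e. the indices that were never drawn into the training sample.
    """
    rng = random.Random(seed)
    n_train = max(1, round(n_samples * train_sample_size))
    for _ in range(n_iterations):
        train = [rng.randrange(n_samples) for _ in range(n_train)]
        drawn = set(train)
        test = [i for i in range(n_samples) if i not in drawn]
        yield train, test


def check_user_defined_split(train_indices, test_indices):
    """Reject a user-defined split whose train and test sets overlap."""
    overlap = set(train_indices) & set(test_indices)
    if overlap:
        raise ValueError(f"Indices in both sets: {sorted(overlap)}")
    return list(train_indices), list(test_indices)


folds = list(k_fold_splits(10, num_folds=5, seed=42))
subsamples = list(random_subsampling_splits(10, test_size=0.2, n_iterations=3, seed=42))
bootstraps = list(bootstrap_splits(10, train_sample_size=1.0, n_iterations=4, seed=42))
manual = check_user_defined_split([0, 1, 2, 3], [4, 5])
```

In K-fold every sample is tested exactly once, whereas random sub-sampling and bootstrapping may test some samples several times and others never. This is why increasing `n_iterations` helps stabilize their estimates.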

#### **General Settings for the Split Node**

Before running your experiment, configure these essential settings to control how your data is partitioned:

**1. Random State (`random_state`)**

* **Purpose:** Ensures reproducible splits by initializing the random number generator with a fixed seed.
* **Usage:**
  * Enter an integer value (e.g., `42`) to make split results consistent across runs.
  * Leave blank to get a different split on every run (not recommended for reproducible experiments).
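
The effect of a fixed seed can be seen in a small sketch. It uses Python's `random` module rather than the application's own generator, and the `shuffled_indices` helper is hypothetical.

```python
import random


def shuffled_indices(n_samples, random_state=None):
    """Shuffle sample indices; a fixed random_state gives a fixed order."""
    rng = random.Random(random_state)  # None seeds from system entropy
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return indices


first = shuffled_indices(8, random_state=42)
second = shuffled_indices(8, random_state=42)  # same seed, same order
unseeded = shuffled_indices(8)                 # order may differ on each run
```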

**2. Stratify Columns (`stratify_columns`)**

* **Purpose:** Maintains the original distribution of key variables (e.g., class labels) in both training and test sets and prevents skewed splits that could bias model evaluation.
* **Requirements:**
  * At least one column must be selected.
  * Common choices: Target variables or demographic columns (e.g., age groups, gender).
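
The sketch below shows the idea behind stratification: samples are grouped by label, and each group is split separately so that both sets keep the original proportions. It is an illustration in plain Python under simplified assumptions, not the Split node's implementation, and the `stratified_split` helper and the labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, random_state=None):
    """Return (train, test) index lists that preserve label proportions."""
    rng = random.Random(random_state)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    train, test = [], []
    # Split each label group separately, in a deterministic group order.
    for label in sorted(by_label, key=str):
        group = by_label[label]
        rng.shuffle(group)
        n_test = round(len(group) * test_size)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return sorted(train), sorted(test)


labels = ["pos"] * 4 + ["neg"] * 16  # hypothetical imbalanced target (20% positive)
train_idx, test_idx = stratified_split(labels, test_size=0.25, random_state=0)
```

To stratify on several columns at once, one option is to use a tuple of their values (for example, `(outcome, gender)`) as the label of each sample.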

**3. Use Tags for Stratification**

* **Purpose:** Leverages predefined [*Column Tags*](/medomics-docs/tutorials/design/input-module.md#feature-or-column-tagging-tools) or [*Row Tags*](/medomics-docs/tutorials/design/input-module.md#sample-or-row-grouping-tools-subset-creation-tool) as stratification groups.
* **How It Works:**
  1. Toggle this option to activate tag-based stratification.
  2. Available tags from your dataset will auto-populate in a dropdown menu.
  3. Select one or more tags to use as stratification criteria.
* **Key Notes:**
  * Compatible with `stratify_columns` (can be used simultaneously).
  * If no tags exist, the system will display a warning, but it will not impact execution.
  * Tags are especially useful for complex stratification schemes (e.g., multi-label scenarios).

{% hint style="warning" %}
Unlike other nodes, the Split node has no extra options.
{% endhint %}

<figure><img src="/files/HuaC5IdR73F40Tm71x6m" alt=""><figcaption><p>Breakdown of the Split node</p></figcaption></figure>

## Model node: Select and configure your machine learning algorithm

This node enables you to select and customize your machine learning model. The available models and their parameters directly correspond to:

* The `estimator` parameter in [PyCaret's `create_model()` function](https://pycaret.readthedocs.io/en/latest/api/classification.html#pycaret.classification.create_model)
* [Scikit-learn](https://scikit-learn.org/stable/)'s comprehensive model implementations

The Model node is used in the algorithm selection step, as shown below:

<figure><img src="/files/67vmjoS2ivcYm8GLyXDZ" alt=""><figcaption></figcaption></figure>

<figure><img src="/files/1o3a61vfIRLXCLuYGLlR" alt=""><figcaption><p>Breakdown of the Model node</p></figcaption></figure>

**On the next page, you will learn about the Training Box, which will help you define the training process of your experiment.**


