Learning Module

This is where the magic happens 🧙🪄

The Learning Module is constructed using the open-source Python library PyCaret. You can find valuable information about PyCaret at the following links:

The Learning Module's architecture

The Learning Module has been redesigned with an updated architecture that adheres to machine learning best practices, introducing external training/testing data separation to support multi-iteration model training. This enhanced approach provides more reliable performance estimation while overcoming PyCaret's limitation regarding external data splitting. The following diagram presents the new architecture:

The new Learning Module workflow

Enhanced external training and validation

The updated workflow introduces flexible dataset partitioning through multiple validation methods (cross-validation, bootstrapping, etc.), resolving PyCaret's limitation in handling external data splitting. For clarity:

  • External splits divide the learning set into training/testing data

  • Internal splits further partition training data for hyperparameter tuning

The figure below illustrates this enhanced validation framework.

External and Internal splits illustration
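To make the distinction concrete, here is a minimal scikit-learn sketch (illustrative only, not the module's code) of an external hold-out split followed by internal folds drawn from the training portion:

```python
# External vs. internal splits, illustrated with scikit-learn (not the module's code)
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)

# External split: the learning set is divided into training and testing data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Internal split: the training data is further partitioned into folds for tuning
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold_idx, (tune_idx, val_idx) in enumerate(inner_cv.split(X_train, y_train)):
    print(f"Fold {fold_idx}: {len(tune_idx)} tuning samples, {len(val_idx)} validation samples")
```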

What relies on PyCaret, and what does not?

Refer to this section for more details on PyCaret's role in the Learning Module.

The Learning Module is built on PyCaret's open-source machine learning framework, enhanced with custom-coded components to enable new features, such as external data splitting, a capability not supported in PyCaret. The figure below highlights which elements leverage PyCaret's standard functionality versus our custom extensions, giving you a better understanding of the architecture.

Custom-coded and PyCaret based sections of the ML workflow

A new design is here

The Learning Module features a redesigned interface aligned with the new architecture, offering users a streamlined way to create their scenes and experiments. The updated interface is organized into three boxes:

  • Initialization: Users begin by selecting their machine learning resources and configuring key experiment parameters, including dataset selection, data preprocessing steps, model choices, and more.

  • Training: This section enables users to define and manage the model training process, encompassing aspects such as hyperparameter tuning, optimization strategies, and additional features.

  • Analysis: In the final stage, users can visualize and interpret their model’s performance through a variety of result plots and metrics.

The Learning Module: before and after

A new color-code system

The intuitive box-based design simplifies pipeline creation by visually guiding users through each step. A color-coding system works alongside the boxes to prevent errors. For example, as illustrated below, if you accidentally attempt to drag a Train Model node into the Initialization box, both the node and box will turn red, immediately alerting you to the mismatch. Each box only accepts compatible node types, ensuring logical connections and reducing setup mistakes.

Misplaced Train Model node
Misplaced Clean Node
Error message upon running the experiment

In Results or Analysis modes, a different color-coding is used; read more about it here.

A new scene for experimenting

The Learning Module includes a new Experimental Scene, a minimalistic scene designed for testing machine learning configurations (models, parameters, etc.) before finalizing them in the main production scene.

As shown in the figure below, the Experimental Scene’s minimalistic design focuses attention on core machine learning elements, with all required node types available. The scene serves as a testing ground where users can refine their pipelines before switching to the main scene.

Example of an Experimental scene

A Redefined Pipeline Structure

In the previous design, a pipeline was defined as any sequence of connected nodes. The updated architecture now defines a pipeline as a complete sequence of nodes that starts from an initial node and terminates at the Analysis Box. This crucial change means that any disconnected node chain or incomplete workflow will not be recognized as a valid pipeline for execution or analysis. By enforcing this complete connection, the platform ensures that users adhere to machine learning best practices and that every pipeline will be analyzed. The figure below illustrates an example of valid and invalid pipelines under this new definition.

Example of valid and invalid pipelines

Overview Videos

Learning Module - Scene Creation
Learning Module - Understanding Pipelines and Results

How to create a scene

1. Click on the Learning Module icon. On the left sidebar, click on the following icon:

2. Click on the "Create scene" button. The button is located at the top of the side panel.

3. Enter a name for the new scene. Type your scene's name in the "Enter Page Name" field:

4. Select whether it is an Experimental scene or not. If you would like to create an Experimental scene, ensure the following switch is on:

5. Click Create.

These steps are summarized in the figure below. Once your scene is created, a folder will be generated that includes the following:

  • Your scene (.medml file):

  • A folder for your scene models:

  • A folder for your scene notebooks:

How to create a new scene
Your scene's folder in the application's workspace

Module Overview

Double-click on the .medml file to open the scene.

The following sections provide a comprehensive overview of the scene and its fundamental components. Each numbered element in the main scene figure corresponds to a detailed explanation in the subsequent subsections.

Main scene

Empty Main Scene

1. Scene folder breakdown

Every scene folder is organized as follows:

breast_cancer_analysis             -> Scene folder
    ├───models                     -> Folder where models are saved
    ├───notebooks                  -> Folder where notebooks are saved
    └───breast_cancer_scene.medml  -> Scene file

2. Available Nodes

General structure of a node

The Add options button opens a panel where you can select additional options. These options are retrieved from the online ReadTheDocs documentation of PyCaret.

Available nodes summary table:

| Node | Description | Designated box | Input | Output |
| --- | --- | --- | --- | --- |
| Dataset | This acts as the initial point for all experiments and determines the data your pipeline will use. The available options for this node correspond to the PyCaret setup() function options that are not directly related to data cleaning. | Initialization | - | Dataset |
| Clean | This node enables you to clean and improve the quality of your dataset. The available options for this node correspond to the PyCaret setup() function options that are directly related to data cleaning. | Initialization | Dataset | Dataset |
| Split | This custom-coded node (distinct from PyCaret's standard functions) gives you precise control over how your dataset is divided for training and evaluation. It serves as the foundation for reliable model validation by ensuring appropriate data separation. | Initialization | Dataset | Dataset |
| Model | This node allows you to select a machine learning algorithm from PyCaret's model library and set its associated parameters. It corresponds to the estimator parameter of the PyCaret create_model() function. | Initialization | - | Model_config |
| Train Model | This node allows you to train a model using the selected ML algorithm. The available options for this node correspond to the PyCaret create_model() function options (except the estimator parameter, which is defined through the Model node). | Training | Model_config + Dataset | Model |
| Combine Models | This node allows you to combine multiple models using different techniques. It is based on PyCaret's blend_models() and stack_models() functions. | Training | Model | Model |
| Compare Models | This node allows you to train and evaluate the performance of all estimators available in the PyCaret model library using cross-validation. The available options for this node correspond to the PyCaret compare_models() function options. | Training | Dataset | Model(s) |
| Load Model | This node allows you to load a model from a file. It takes as input a model from the ones you saved in your scene, displayed in a dropdown selector. The available options for this node are the ones available in the PyCaret load_model() function, except the model name, which is replaced by the selected file. | Training | Dataset | Model |
| Analyze | This node allows you to analyze a model. It gathers the analysis and model explainability functions of PyCaret. For now, only the plot_model() function is used in the Learning Module. | - | Model | - |

3. Analysis Mode

The Analysis Mode button, called See Results in Experimental scenes, is used to view the results of the experiment. It is disabled until you run an experiment. After a successful run, a .medmlres file is created in your scene folder, containing the generated results from the experiment. If you quit the app, your generated results will still be available the next time you open the app.

Refer to the Analysis page for more details.

4. Utils Menu

This menu contains different functionalities that can be used to help you build your scene.

| Element | Description |
| --- | --- |
| Machine Learning type dropdown | This dropdown allows you to select the type of machine learning you want for your experiment. When changing the type, all settings are reset. |
| Play | This button allows you to run the experiment. You can find additional information about running the experiment here. |
| Garbage bin | This button allows you to delete all nodes in the scene. |
| Save | This button allows you to save the scene. |
| Load | This button allows you to load a scene from a file. |

5. Minimap

This minimap allows you to navigate the scene and visualize the nodes present in it.

6. Flow Utils

This menu contains various functionalities that interact with the flow section.

| Element | Description |
| --- | --- |
| Plus Button | This button allows you to zoom in on the flow section. |
| Minus Button | This button allows you to zoom out of the flow section. |
| Square Button | This button allows you to fit the flow section in the view. |
| Lock Button | This button allows you to lock the flow section. When locked, you can't move the flow section. |
| Map Button | This button allows you to show/hide the minimap. |

7. Scene Boxes

In the main scene, these boxes are part of the new design, helping guide the user through creating their scene and reducing errors when dragging and placing nodes. Read more about these boxes in the next sections of the documentation.


Understand PyCaret's role within the Learning Module

PyCaret is a low-code library that primarily wraps functions from the scikit-learn library.

1. Initialization

At the beginning of a Machine Learning pipeline, you initialize your data using PyCaret's setup function, corresponding to the Dataset and Clean nodes in our Learning Module. The setup function requires a dataset and the name of the target column. PyCaret then initializes elements for the pipeline.
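For illustration, here is a minimal sketch of this initialization step, assuming the classification module and a hypothetical CSV file and target column (this is not the module's actual code):

```python
# Minimal sketch of PyCaret initialization (hypothetical file and column names)
import pandas as pd
from pycaret.classification import setup

df = pd.read_csv("breast_cancer.csv")   # hypothetical learning dataset

# setup() requires the dataset and the name of the target column; it then
# builds the train/test split, the folds, and the preprocessing pipeline.
experiment = setup(data=df, target="diagnosis")
```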

1.1. Test Data

PyCaret divides your dataset into two parts: the training set and the test set (controlled by the test_data parameter in the PyCaret setup function). The training data is employed to train and optimize your machine learning model, while the test data is reserved for evaluating the created model. The split is conducted using the scikit-learn train_test_split function (useful explanations about this function can be found here).

*In the figure, "Full Dataset" refers to our Learning Dataset.

The random sampling step is executed with the aid of a random seed, and each split is linked to a specific seed. By default, PyCaret randomly assigns a seed at the start of each pipeline execution. To ensure the replication of the same experiment with a consistent split, you can set this parameter in PyCaret (using the session_id parameter in the Dataset node), as demonstrated in our experiments in the instructional video. This ensures that your test and train data will remain consistent across all executions.

Here, you also have the option to define the test data yourself and provide it to PyCaret. However, this capability is not currently available in our application when using the MEDomicsLab Standard format.

1.2. Folds

PyCaret then defines folds on the training data for the cross-validation step (which is executed through the Train or Compare Models box). The folds are also defined using a random seed, which you can set through PyCaret's session_id parameter. By default, PyCaret uses the StratifiedKFold method from scikit-learn to define the folds. The stratified method ensures that each class of the target is represented equally across the folds.
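As a rough sketch, these seed and fold settings map to the following setup() parameters (hypothetical data and illustrative values):

```python
# Fixing the seed and fold configuration at initialization (illustrative values)
import pandas as pd
from pycaret.classification import setup

df = pd.read_csv("breast_cancer.csv")    # hypothetical learning dataset

experiment = setup(
    data=df,
    target="diagnosis",                  # hypothetical target column
    session_id=123,                      # fixed seed: reproducible split and folds
    fold_strategy="stratifiedkfold",     # PyCaret's default fold strategy
    fold=5,                              # number of cross-validation folds
)
```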

2. Training

There are two functions related to training in PyCaret: compare_models (corresponding to our Compare Models box) and create_model (corresponding to our Train node).

2.1. Compare Models

The compare_models function is used to train all the available models from PyCaret on the data initialized by PyCaret's setup function (our Dataset and Clean nodes). The resulting table shows, for each model, the mean of the cross-validation results across all folds. For example, with five folds, each model is trained five times, each time using a different fold as validation data. The trained model is then applied to its validation fold, and the resulting metrics are averaged with the validation results of the four other iterations (purple data from the split in the image shown below).

The output of the compare_models function is the best model found according to a specified metric (Accuracy by default; AUC as specified in our instructional video). If the n_select parameter is set (as shown in our instructional video), the specified number of models from the top of the list is returned.
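As a hedged sketch of how this maps to PyCaret's functional API (assuming setup() has already been called as above; the metric and n_select values are illustrative):

```python
# Comparing all available models by cross-validation (run after setup())
from pycaret.classification import compare_models, pull

# Best models according to AUC; with n_select=3, the three top models are returned
top_models = compare_models(sort="AUC", n_select=3)

# The cross-validation scoreboard displayed above can also be retrieved as a DataFrame
results_table = pull()
print(results_table.head())
```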

2.2. Create Model

The create_model function takes the initialized data and a model (which you can define through our Model node) as input. It works exactly the same way as the compare_models function, except that only one model is tested, and the results table shows the cross-validation results for each fold.
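For example, a minimal sketch using PyCaret's functional API (assuming setup() has been called; "rf" is PyCaret's identifier for a random forest classifier):

```python
# Training a single model with cross-validation (run after setup())
from pycaret.classification import create_model

rf_model = create_model("rf", fold=5)   # one row of CV metrics is displayed per fold
```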

3. Analyzing

The analyses made using our Analyze node show the metrics obtained by applying our trained models to the test data defined at the initialization of the experiment.
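The Analyze node relies on PyCaret's plot_model() function; the sketch below also uses predict_model() to score the trained model on the hold-out test set, purely for illustration (assuming the rf_model from the previous sketch):

```python
# Evaluating a trained model on the hold-out test data (illustrative)
from pycaret.classification import plot_model, predict_model

# Without a `data` argument, predict_model() scores the model on the test set
# that was set aside by setup() at initialization.
holdout_results = predict_model(rf_model)

# plot_model() is the function currently used by the Analyze node
plot_model(rf_model, plot="auc")
plot_model(rf_model, plot="confusion_matrix")
```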

4. Finalize

The finalize_model function in PyCaret (represented by the Finalize node in our app) trains the model one last time on the entire dataset, including both the training data and the test data, without changing its hyperparameters.
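As a minimal sketch, assuming the rf_model from the previous examples and a hypothetical output name:

```python
# Refit on the full dataset (training + test data) with unchanged hyperparameters
from pycaret.classification import finalize_model, save_model

final_model = finalize_model(rf_model)
save_model(final_model, "final_rf_model")   # hypothetical file name
```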
