
Learning Module

This page documents the machine learning stage of our demo, where we use our dataset, and specifically our column tags, to predict one-year mortality.

The Machine Learning section of this proof of concept is structured into two execution options.

The first option applies the predictive pipeline to the reduced dataset Learning_homr_any_visit_10pct.csv.

It ensures the continuity of the proof of concept by allowing a lighter, fully integrated end-to-end execution within MEDomics. The trained model is then saved and reused in the Evaluation and Application modules.

This option is designed to:

  • Maintain workflow continuity within the platform

  • Reduce computational load

  • Enable smooth transition to evaluation and deployment

The second, optional configuration runs the same pipeline on the full dataset (homr_any_visit.csv), while adapting specific settings to accommodate memory and scalability constraints within MEDomics.

Since the complete pipeline has already been detailed in Option 1, no structural modifications are required. Instead, this option involves:

  • Updating the Dataset nodes to use the full dataset rather than the reduced version.

  • Adjusting the Train Model configuration to better manage scalability constraints within the platform.

  • Generating code notebooks to allow the model to be trained outside of MEDomics using parameters that more closely match the original study configuration.

This option enables deeper methodological alignment with the original study while ensuring reproducibility beyond the MEDomics environment.

Option 1: Platform-Integrated Pipeline

This subsection provides a hands-on tutorial for creating the machine learning scene used to train a Random Forest model on the Learning_homr_any_visit_10pct.csv dataset.

Note: If this is your first time working with the Learning Module, we recommend reviewing the dedicated documentation, which provides detailed explanations of scene creation and the module’s architecture.

We begin by creating a new scene.

Click on the Learning Module icon. The scene creation interface will appear, where you can name the scene homr_scene.


The next step is to access the new scene you created. In your workspace, you will find a folder named homr_scene. Click on it and open the homr_scene.medml file.

This is an overview of the pipeline that we will have by the end of this section. In the following sections, each node will be described individually.

Scene overview

Follow the steps illustrated in the figure to create a scene:

  1. Double-click on the Learning Module icon.

  2. Click on Create scene.

  3. Enter the page name: homr_scene.

  4. Make sure the Experimental Scene toggle is disabled.

  5. Click Create to generate the scene.

Create the "homr_scene" scene

After creating the scene, it appears inside the EXPERIMENTS folder under the name homr_scene.

homr_scene folder in Experiments

Inside this scene folder, there are three items:

  • models/ : This folder contains all trained models generated during the experiment. Each time a model is trained and saved, it will be stored here.

  • notebooks/ : This folder contains the generated notebooks associated with the scene. These notebooks allow you to reproduce or extend the experiment outside of MEDomics.

  • homr_scene.medml : This is the main scene file where the pipeline is built and configured. All nodes, connections, and experiment settings are defined inside this file.

Click on the homr_scene.medml file. Now we can start configuring the nodes. The nodes are available on the left side of the screen, under three sections: Initialization, Training, and Analysis.

Note: If you do not see the list of available nodes, click the blue menu button located in the top-left corner of the scene (the icon with three horizontal lines). This button toggles the node panel, allowing you to show or hide the list of nodes.

Nodes Configuration

We will present the nodes section by section (Initialization, Training, and Analysis).

Initialization Nodes

Note: You can learn more about Initialization Nodes here.

  • Dataset Node: This node is used twice in this experiment to represent the two predictor sets defined in the POYM study: AdmDemo and AdmDemoDx. Create two Dataset nodes, set both to the MEDomics Standard format, and name them accordingly, as shown in the figure below. Each node corresponds to a distinct group of predictors and relies on the previously created column tags.

Dataset Nodes setup

Going one step further, select the Learning_homr_any_visit_10pct.csv file for each Dataset node, apply the corresponding tags for every ID (seen above), and define the target variable as "oym". This configuration step is illustrated in the second figure below.

Dataset Nodes configuration
  • Split Node: Configure the Outer Split to use cross-validation as the splitting method with 5 folds. This outer split defines the external loop of a 5-fold nested cross-validation setup; the inner splits will be specified later in the Train Model node. Under General Parameters, set the random_state to 101, as used in the original POYM study, to ensure reproducibility.

Split Node configuration
Note: If you are unfamiliar with the nested cross-validation method in machine learning, you can check this link for more information.
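To make the split concrete, here is a minimal sketch of what the outer loop corresponds to in scikit-learn terms. The data here is synthetic and merely stands in for the tagged dataset; MEDomics performs this split internally.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the dataset; in the real pipeline X holds the
# tagged predictor columns and y the "oym" one-year mortality target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)

# Outer loop of the 5-fold nested cross-validation, seeded as in the study.
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=101)

for fold, (train_idx, test_idx) in enumerate(outer_cv.split(X, y), start=1):
    # Each outer fold yields a train/test partition; hyperparameter tuning
    # (the inner loop) runs on the training portion only.
    print(f"fold {fold}: {len(train_idx)} train / {len(test_idx)} test rows")
```

The inner folds of the nested setup are configured later, in the Train Model node.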

  • Model Node: Select Random Forest as the machine learning algorithm.

The original study relies on SKRanger, a Python wrapper around the C++ ranger implementation of the Random Forest algorithm, whereas our Learning Module is based on PyCaret, which builds on scikit-learn. To ensure methodological consistency, we therefore use the closest equivalent hyperparameters available in PyCaret to mirror those used in SKRanger. While minor implementation differences remain, this approach keeps us as close as possible to the original experimental setup.

Initial Model Configuration

Specifying initial values for the hyperparameters in the Model node is not mandatory.

Regardless of the initial values set (including default values), the model will ultimately be trained and optimized using the custom hyperparameter grid defined in the Train Model node.

Therefore, you may skip detailed initialization if desired. However, it is essential to ensure that the following hyperparameters are selected in the Model node, as only selected hyperparameters will be available for optimization in the Train Model node.

Hyperparameters to Select

The following hyperparameters must be selected to ensure they are optimized:

  • n_estimators β€” Number of decision trees in the Random Forest.

  • min_samples_leaf β€” Minimum number of training samples required in each terminal node (equivalent to min_node_size in SKRanger).

  • max_features β€” Number of features randomly selected at each split (equivalent to MTRY in SKRanger).

  • class_weight β€” Class imbalance handling strategy (equivalent to weight in SKRanger).

  • random_state β€” Reproducibility seed (equivalent to seed in SKRanger).
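For reference, the selection above corresponds roughly to the following scikit-learn estimator. This is a sketch only: the initial values shown are placeholders, since the custom grid in the Train Model node ultimately overrides them.

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative scikit-learn equivalent of the Model node configuration.
# Initial values are placeholders; only the *selection* of these
# hyperparameters matters, as tuning replaces the values later.
model = RandomForestClassifier(
    n_estimators=512,         # number of trees in the forest
    min_samples_leaf=10,      # min_node_size in SKRanger
    max_features=15,          # MTRY in SKRanger
    class_weight="balanced",  # weight in SKRanger
    random_state=101,         # seed in SKRanger
)
```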

Model Hyperparameters configuration

Training Nodes

  • Train Model: In this section, we only use the Train Model node. This is the most configuration-heavy part of the tutorial, so make sure to follow each step carefully.

Make sure to activate the Tune Model toggle button as shown in the figure below.

Train Model Node setup

Hyperparameter Tuning Configuration

When activating the Tune Model toggle in the Train Model node, you will notice that the option "Use PyCaret's default hyperparameter search space" is automatically enabled.

Since we are defining a custom tuning grid, the default PyCaret search space is not required. Make sure to deactivate this toggle.

Once disabled, the Custom Tuning Grid section for the Random Forest model becomes available, allowing you to configure the hyperparameters manually.

The figure below illustrates these steps.

Custom Tuning Grid for our model

Click on the plus button next to Random Forest. Each hyperparameter selected in the Model node will appear in the grid, where you can specify either:

  • A Range (start, end, step), or

  • Discrete values.

Custom Grid Configuration

Set the hyperparameters as follows:

1. n_estimators

Number of trees in the forest.

  • Range values: {128, 256, 384, 512, 640, 768, 896, 1024}

  • Start: 128

  • End: 1024

  • Step: 128

2. min_samples_leaf

Minimum number of samples required at a leaf node.

  • Range values: {10, 20, 30, 40, 50, 60, 70, 80}

  • Start: 10

  • End: 80

  • Step: 10

3. max_features

Number of features considered at each split.

  • Discrete values: 10, 15, 20

4. class_weight

Class imbalance handling strategy.

  • Discrete values: None, balanced, balanced_subsample
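Collected together, the grid above can be written as a scikit-learn-style parameter grid. This is a sketch; MEDomics builds the equivalent grid from the UI fields.

```python
# Custom tuning grid for the Random Forest, mirroring the values above.
param_grid = {
    "n_estimators": list(range(128, 1024 + 1, 128)),  # 128 … 1024, step 128
    "min_samples_leaf": list(range(10, 80 + 1, 10)),  # 10 … 80, step 10
    "max_features": [10, 15, 20],
    "class_weight": [None, "balanced", "balanced_subsample"],
}

print(param_grid["n_estimators"])  # → [128, 256, 384, 512, 640, 768, 896, 1024]
```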


Tune Model Options

After defining the custom hyperparameter grid, configure the tuning options as follows:

1. fold

Set the fold parameter to 5.

This corresponds to the number of internal folds used in our 5-fold nested cross-validation setup.

2. search_library

Set the search_library parameter to "scikit-learn".

In the original study, hyperparameter optimization was performed using Optuna with 100 trials. However, within MEDomics, Optuna is currently supported only through a random search strategy.

For this proof of concept, we instead use Scikit-Learn, as it enables a structured and controlled grid search over predefined hyperparameter ranges.

3. search_algorithm

Set the search_algorithm parameter to "grid".

This ensures that all combinations within the defined hyperparameter grid are systematically evaluated.
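Under these settings, the tuning performed by the Train Model node behaves like an exhaustive scikit-learn grid search with 5 inner folds. The sketch below illustrates this on synthetic data with a deliberately trimmed grid (the full grid has 8 × 8 × 3 × 3 combinations); the data and trimmed values are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy data standing in for one outer-fold training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 20))
y = rng.integers(0, 2, size=120)

# Trimmed version of the custom grid, to keep this sketch fast.
param_grid = {
    "n_estimators": [128, 256],
    "min_samples_leaf": [10, 20],
    "max_features": [10],
    "class_weight": [None, "balanced"],
}

# fold=5 and search_algorithm="grid" from the Tune Model options map to
# cv=5 and an exhaustive GridSearchCV in scikit-learn terms.
search = GridSearchCV(
    RandomForestClassifier(random_state=101),
    param_grid=param_grid,
    cv=5,               # inner folds of the nested cross-validation
    scoring="roc_auc",  # AUC, the metric reported in the POYM study
)
search.fit(X, y)
print(search.best_params_)
```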

Tune Model options

Analysis Nodes

  • No changes required for the Analyze Node.

Pipeline Creation

Note: If you are unfamiliar with the input/output (I/O) ports of each node, please refer to the documentation for more information.

The final step before running the scene is to connect the nodes to form the pipeline.

  1. Connect the Dataset nodes to the Split node.

  2. Connect the Split node to the first input of the Train Model node.

  3. Connect the Model node to the second input of the Train Model node.

  4. Finally, connect the Train Model node to the Analyze node to display the results.

Make sure all connections are correctly established before launching the scene.

Run the scene and analyze the results

Note: An overview of each button in the Learning Module, along with its corresponding functionality, is available in the documentation; you can access it here.

Once the scene is fully configured, click the Run button located in the top-right corner of the interface, as highlighted in the figure below.

Run the scene

You can monitor the progress using the progress bar displayed at the bottom of the interface.

When the execution is complete, the Analysis Mode button will become active. Click on it to open the analysis panel at the bottom of the screen, where the results for Pipeline 1 and Pipeline 2 will be displayed.

As shown in the figure below:

  • Pipeline 1 corresponds to the AdmDemo dataset node.

  • Pipeline 2 corresponds to the AdmDemoDx dataset node.

Select the models’ performance results to review and compare their evaluation metrics.

Pipeline results in Analysis Mode

These are some of the results obtained from the AdmDemo pipeline:

Metrics' statistics for the AdmDemo model

And here are the results from the AdmDemoDx pipeline:

Metrics' statistics for the AdmDemoDx model

Results

The models reproduced in MEDomics show a slight decrease in performance compared to the original POYM study.

For the AdmDemo model, the AUC decreased from 0.876 in the original study to 0.8565 when using the full dataset with limited hyperparameter tuning. When applying the full tuning strategy on the learning set of the reduced 10% dataset, the AUC further decreased to 0.8489.

Several factors explain these differences:

  • Dataset reduction: Training on the learning set of the reduced 10% dataset limits the model’s ability to generalize, particularly for complex feature interactions.

  • Class weight configuration: Differences in class weighting strategies influence the decision boundaries and the sensitivity-specificity trade-off.

  • Scalability constraints in MEDomics: Platform memory and computational limitations required methodological adaptations that may slightly impact performance.

Overall, despite these constraints, the reproduced models achieve performance levels close to the original study, confirming the validity of the pipeline implementation.

Finalize and Save the model

In order to evaluate and deploy our model, click the Finalize & Save Model button, shown in the pipeline presentation. Make sure to save the AdmDemo model (Pipeline 1). You can read this documentation for more information on how to proceed.


Option 2: Full Dataset Reproduction (Extended Configuration)

This optional configuration runs the same pipeline described in Option 1, but on the full dataset (homr_any_visit.csv). The objective is to move closer to the original POYM study results while accounting for memory and scalability constraints within MEDomics.

It is strongly recommended to create a separate scene for this configuration to avoid overwriting or modifying the setup used in Option 1.

You may name this new scene: exp-with-all-data.medml.


Workflow Overview

For this option, you will need to:

  • Reuse the same pipeline architecture as in Option 1.

  • Replace the reduced dataset (Learning_homr_any_visit_10pct.csv) with the full dataset (homr_any_visit.csv) in the Dataset nodes.

  • Adjust the Train Model configuration to better manage scalability constraints within the platform.

No structural modifications to the pipeline are required.

Dataset Node Configuration

The Dataset node is used twice to represent the two predictor sets defined in the POYM study:

  • AdmDemo

  • AdmDemoDx

For both Dataset nodes:

  1. Select homr_any_visit.csv.

  2. Keep the format as MEDomics Standard.

  3. Apply the appropriate column tags (adm, demo, dx) as previously defined.

  4. Set the target variable to oym.


Train Model Adaptation

Due to scalability constraints within MEDomics, full hyperparameter tuning may not be feasible when using the complete dataset.

In this configuration:

  • The tuning strategy may be simplified (e.g., tuning a reduced subset of hyperparameters, such as max_features only). To achieve this, select only that hyperparameter in the Model node configuration, rather than the full set used in Option 1.

  • The rest of the pipeline remains unchanged.

These adjustments allow the experiment to run within the platform’s computational limits while maintaining methodological consistency.

Note: Accordingly, the results presented below were obtained by tuning only a single hyperparameter, max_features, using the values from the grid above.

Run the scene and analyze the results

Once the scene is set up, hit the Run button, and track the execution process through the bottom progress bar. Select the models' performance results:

AdmDemo and AdmDemoDx results

The AdmDemo model achieved an AUC of 0.8565, compared to 0.876 reported in the original study. Similarly, the AdmDemoDx model reached an AUC of 0.8908, compared to 0.905 in the reference study.

This performance gap can be primarily explained by two factors:

  • Differences in the class_weight configuration

  • Tuning only one hyperparameter (max_features) instead of the four hyperparameters optimized in the original study

Since this option uses the full dataset, results are higher than those obtained in Option 1.

The table below compares the AUCs obtained from the different configurations evaluated in the Learning Module section.

| Model     | Original Study | Full Dataset – 1 HP Tuning | 10% Dataset – Full Tuning |
| --------- | -------------- | -------------------------- | ------------------------- |
| AdmDemo   | 0.876          | 0.8565                     | 0.8489                    |
| AdmDemoDx | 0.905          | 0.8908                     | 0.875                     |

Notebook Generation

To go beyond MEDomics scalability limitations, the final step of Option 2 is to generate the notebook associated with the trained pipeline. This allows the same experiment to be executed externally with a configuration that more closely matches the original study.

To generate the notebook, simply click the Generate button.


In the notebook, we first reproduce the MEDomics configuration from the previous step, then progressively incorporate the missing elements from the original study setup:

  1. Grid Search (scikit-learn): Tune all study-defined hyperparameters using Grid Search, which exhaustively evaluates all predefined combinations.

  2. Optuna (100 trials): Perform hyperparameter tuning using Optuna with 100 trials. Optuna is an adaptive optimization framework (e.g., TPE) that explores the hyperparameter space more efficiently than Grid Search by prioritizing promising configurations based on previous trials.

The notebook is available on OneDrive; you can download it here.

This concludes our Learning Module section. Let's move to the Evaluation Module to test our saved model!
