# Step 6: Create Model

<figure><img src="https://content.gitbook.com/content/7cVTUTkb3KodRR4EGOZH/blobs/wTqAcZn3UzBb8XtWJB9A/MicrosoftTeams-image%20(5).png" alt=""><figcaption><p>Step 6 - Create Model</p></figcaption></figure>

{% hint style="info" %}
If you completed [*Step 4 - Explore Data*](https://medomicslab.gitbook.io/medomics-docs/medomicslab-docs-v0/test-with-mimic/step-4), you already have the data needed for *Step 6 - Create Model*.

However, before proceeding to *Step 6 - Create Model*, we recommend that you replace your own output data from [*Step 4 - Explore Data*](https://medomicslab.gitbook.io/medomics-docs/medomicslab-docs-v0/test-with-mimic/step-4) (the *MEDprofiles*/*timePoints* folder) with the data that we prepared for you (*MEDomicsLab\_TestingPhase\_Step6.zip*). This ensures that results are consistent across all participants of the Testing Phase.

An invitation to access the *MEDomicsLab\_TestingPhase\_Step6.zip* data was sent by email.
{% endhint %}

In *Step 6 - Create Model*, we use the [*Learning Module*](https://medomicslab.gitbook.io/medomics-docs/medomicslab-docs-v0/tutorials/development/learning-module) to build machine learning models from the learning set obtained in [*Step 4 - Explore Data*](https://medomicslab.gitbook.io/medomics-docs/medomicslab-docs-v0/test-with-mimic/step-4). We will create two Learning scenes:

**Scene 1: Time-Dependent Model Comparison**

This scene aims to assess the impact of patient timelines on model performance. We hypothesize that performance increases as later time points are added, particularly those close to the last hospital stay. We will compare the best models from the following datasets:

1. Data from the first time point only (*T1\_learning\_modified.csv*).
2. Data from the first and second time points combined (*T1\_learning\_modified.csv* and *T2\_learning\_modified.csv*).
3. Data from the first, second, and third time points combined (*T1\_learning\_modified.csv*, *T2\_learning\_modified.csv*, and *T3\_learning\_modified.csv*).
4. Data from all time points combined (*T1\_learning\_modified.csv*, *T2\_learning\_modified.csv*, *T3\_learning\_modified.csv*, and *T4\_learning\_modified.csv*).
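
To keep track of which files belong to which dataset, the four cumulative groups listed above can be written out programmatically. The following is only an illustrative sketch: the variable names are hypothetical, and in the application itself the files are selected through the Learning Module interface, not through code.

```python
# File names as listed above; the variable names are hypothetical.
time_point_files = [
    "T1_learning_modified.csv",
    "T2_learning_modified.csv",
    "T3_learning_modified.csv",
    "T4_learning_modified.csv",
]

# Build the cumulative groups: [T1], [T1, T2], [T1, T2, T3], [T1, T2, T3, T4].
scene_1_datasets = [
    time_point_files[:count] for count in range(1, len(time_point_files) + 1)
]

for number, files in enumerate(scene_1_datasets, start=1):
    print(f"Dataset {number}: {', '.join(files)}")
```

Each dataset contains everything in the previous one plus one additional time point, which is what allows the comparison to isolate the effect of adding later data.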

**Scene 2: Variable-Dependent Model Comparison**

This scene aims to assess the impact of the selected variables on model performance. We will use data from the first two time points only (*T1\_learning\_modified.csv* and *T2\_learning\_modified.csv*), on the assumption that models relying on data from the last time points might make predictions too late in a patient's timeline. We will compare the best models from the following datasets:

1. All demographic and time-series data (*tslab*, *tsprocedure*, and *tschart* classes) from *T1\_learning\_modified.csv* and *T2\_learning\_modified.csv*.
2. All demographic and notes data (*ndischarge* and *nradiology*) from *T1\_learning\_modified.csv* and *T2\_learning\_modified.csv*.
3. All demographic and image data from *T1\_learning\_modified.csv* and *T2\_learning\_modified.csv*.
4. Selected variables from various data types based on observations made using the first three pipelines, aiming to obtain the best possible model.

These scenes are designed to provide a comprehensive comparison of models under different temporal and variable considerations.

{% hint style="info" %}
You are welcome to use this step to conduct your own experiments and explore the functionalities of the [*Learning Module*](https://medomicslab.gitbook.io/medomics-docs/medomicslab-docs-v0/tutorials/development/learning-module). However, please note that some options and tooltips are not implemented yet; we intend to add them before [*Step 8 - Challenge*](https://medomicslab.gitbook.io/medomics-docs/medomicslab-docs-v0/test-with-mimic/step-8). :wink:
{% endhint %}

## Recommendations

Before proceeding with *Step 6 - Create Model* of the MEDomicsLab Testing Phase, we recommend consulting the documentation of the [*Learning Module*](https://medomicslab.gitbook.io/medomics-docs/medomicslab-docs-v0/tutorials/development/learning-module).

{% content-ref url="../tutorials/development/learning-module" %}
[learning-module](https://medomicslab.gitbook.io/medomics-docs/medomicslab-docs-v0/tutorials/development/learning-module)
{% endcontent-ref %}

{% hint style="info" %}
Please note that the [*Learning Module*](https://medomicslab.gitbook.io/medomics-docs/medomicslab-docs-v0/tutorials/development/learning-module) is a graphical implementation of the [*PyCaret* Python library](https://pycaret.gitbook.io/docs/). If you need information about an element of the Learning Module, you may find it in the [*PyCaret* documentation](https://pycaret.gitbook.io/docs/).

The [*PyCaret* documentation](https://pycaret.gitbook.io/docs/) often refers to other Python packages, because PyCaret's functions are built on top of them. To learn more about the options of a given functionality, you may need to look in the documentation of these other packages.

For example, if you are looking for information on the `fold_strategy` parameter in the Dataset box:

1. Visit the [*PyCaret* documentation](https://pycaret.gitbook.io/docs/), specifically the [Data Preprocessing section](https://pycaret.gitbook.io/docs/get-started/preprocessing).
2. Look for the category related to the `fold_strategy` parameter, which is under Other Setup Parameters -> Model Selection.
3. The [Model Selection part](https://pycaret.gitbook.io/docs/get-started/preprocessing/other-setup-parameters#model-selection) explains the related parameters, including `fold_strategy`. It states that this parameter accepts either a predefined string or a cross-validation object compatible with [*scikit-learn*](https://scikit-learn.org/stable/). For more detail on the possible values, you will need to consult the [*scikit-learn* documentation](https://scikit-learn.org/stable/) yourself. For example, to learn more about the default value of `fold_strategy` (which is `stratifiedkfold`), search for 'stratifiedkfold' in the [*scikit-learn* documentation](https://scikit-learn.org/stable/). The page related to this information is available [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html).
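
To give an intuition of what a stratified k-fold strategy does, here is a minimal pure-Python sketch. It is a simplified illustration of the idea only, not the *scikit-learn* or *PyCaret* implementation: the function `stratified_folds` and the toy labels are hypothetical, and the real `StratifiedKFold` class also handles shuffling and uneven class sizes more carefully.

```python
from collections import Counter, defaultdict


def stratified_folds(labels, n_folds):
    """Assign sample indices to folds so each fold keeps roughly the overall class proportions."""
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    folds = [[] for _ in range(n_folds)]
    for indices in indices_by_class.values():
        # Deal the indices of each class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


# Hypothetical binary outcome: 8 negative and 4 positive samples.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
folds = stratified_folds(labels, n_folds=4)

for fold_number, test_indices in enumerate(folds, start=1):
    class_counts = Counter(labels[i] for i in test_indices)
    print(f"Fold {fold_number}: indices {test_indices}, class counts {dict(class_counts)}")
```

In this toy example, every fold receives two negative samples and one positive sample, so each fold mirrors the 2:1 class ratio of the full dataset. This is the property that makes stratification a sensible choice for classification problems with imbalanced outcomes.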

Finally, if you want to fully understand how [*PyCaret*](https://pycaret.gitbook.io/docs/) works in the background, note that it is an open-source library whose code is available on [GitHub](https://github.com/pycaret/pycaret). Because our application uses version 3.1.0, we recommend consulting the [3.1.0 code](https://github.com/pycaret/pycaret/tree/3.1.0) if your research relates to our application.
{% endhint %}

{% hint style="info" %}
Please pay particular attention to the last sections of the [*Learning Module*](https://medomicslab.gitbook.io/medomics-docs/medomicslab-docs-v0/tutorials/development/learning-module) documentation:

* What PyCaret does?
* PyCaret ROC (Receiver Operating Characteristic)/AUC (Area Under the Curve) plots
  {% endhint %}

## Instructions for Step 6 - Create Model

{% embed url="<https://youtu.be/bQnE7KSCHOA?si=sNMjQspkgtfqVAsP>" %}

**Content**

Intro [0:00](https://www.youtube.com/watch?v=bQnE7KSCHOA\&t=0s)

First Pipeline [1:09](https://www.youtube.com/watch?v=bQnE7KSCHOA\&t=69s)

Explanations about PyCaret [5:37](https://www.youtube.com/watch?v=bQnE7KSCHOA\&t=337s)

Scene 1: Time-Dependent Model Comparison [7:35](https://www.youtube.com/watch?v=bQnE7KSCHOA\&t=455s)

Scene 2: Variable-Dependent Model Comparison [17:12](https://www.youtube.com/watch?v=bQnE7KSCHOA\&t=1032s)
