# Step 3: Prepare ML tables

<figure><img src="https://content.gitbook.com/content/7cVTUTkb3KodRR4EGOZH/blobs/WxoTYcrhU8IYJAiuvADH/MEDomicsLab-TestingPhase-11.png" alt=""><figcaption><p>Step 3 - Prepare ML tables</p></figcaption></figure>

{% hint style="info" %}
If you completed [*Step 2 - Extract Data*](https://medomicslab.gitbook.io/medomics-docs/medomicslab-docs-v0/test-with-mimic/step-2), you have data ready for *Step 3 - Prepare ML tables*.

However, before proceeding to *Step 3 - Prepare ML tables*, we recommend that you replace your own output data from [*Step 2 - Extract Data*](https://medomicslab.gitbook.io/medomics-docs/medomicslab-docs-v0/test-with-mimic/step-2) (the *extracted\_features* folder) with the data that we prepared for you (*MEDomicsLab\_TestingPhase\_Step3.zip*). This ensures consistent results across all participants of the Testing Phase.

An invitation to access the *MEDomicsLab\_TestingPhase\_Step3.zip* data was sent by email.
{% endhint %}

*Step 3 - Prepare ML tables* is divided into five parts. It uses the features extracted in [*Step 2 - Extract Data*](https://medomicslab.gitbook.io/medomics-docs/medomicslab-docs-v0/test-with-mimic/step-2) of the Testing Phase to prepare Machine Learning tables, as follows:

1. **Reduce Extracted Features:** Use the [*Input Module*](https://medomicslab.gitbook.io/medomics-docs/medomicslab-docs-v0/tutorials/design/input-module) to reduce the large CSV files obtained in the previous step, using Principal Component Analysis (PCA) and Spearman correlation.
2. **Merge All Data:** Combine the reduced extracted features with demographic embeddings into a master CSV table using the [*MEDprofiles package*](https://medomicslab.gitbook.io/medomics-docs/medomicslab-docs-v0/tutorials/design/input-module/medprofiles). Then create *MEDprofiles* from the master table.
3. **Visualize Data:** Use the *MEDprofiles* figure to visualize the data.
4. **Define Static Time Points:** Use the *MEDprofiles* figure to set static time points and export the data as static CSV tables.
5. **Create *Learning* and *Holdout* Sets:** Use the [*Input Module*](https://medomicslab.gitbook.io/medomics-docs/medomicslab-docs-v0/tutorials/design/input-module) to generate *Learning* and *Holdout* sets.
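
To make part 5 more concrete, here is a minimal sketch of what a Learning/Holdout split does conceptually. In the Testing Phase the split is done through the Input Module interface; the function name `split_learning_holdout`, the patient IDs, the labels, the 50% holdout fraction, and the choice to stratify by outcome are all assumptions made for this illustration, not the application's actual behavior.

```python
import random
from collections import defaultdict


def split_learning_holdout(labels, holdout_fraction=0.2, seed=0):
    """Split patient IDs into learning and holdout sets, stratified by label.

    `labels` maps a (hypothetical) patient ID to its outcome label.
    """
    # Group patient IDs by outcome so each class is split separately.
    by_class = defaultdict(list)
    for patient_id, label in labels.items():
        by_class[label].append(patient_id)

    rng = random.Random(seed)  # fixed seed makes the split reproducible
    learning, holdout = [], []
    for label in sorted(by_class):
        ids = sorted(by_class[label])
        rng.shuffle(ids)
        n_holdout = round(len(ids) * holdout_fraction)
        holdout.extend(ids[:n_holdout])
        learning.extend(ids[n_holdout:])
    return sorted(learning), sorted(holdout)


# Made-up data: 10 patients, the first 4 with a positive outcome.
toy_labels = {f"P{i:02d}": int(i < 4) for i in range(10)}
learning_ids, holdout_ids = split_learning_holdout(toy_labels, holdout_fraction=0.5)
```

Splitting each outcome class separately keeps the proportion of positive cases similar in both sets, which matters when the outcome is rare.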

{% hint style="info" %}
The goal of defining static time points is to simulate a longitudinal CDSS (Clinical Decision Support System) scenario using data aggregated over time. In [*Step 5 - Create Model*](https://medomicslab.gitbook.io/medomics-docs/medomicslab-docs-v0/test-with-mimic/step-5) of the Testing Phase, we will attempt to identify the point in time at which the model reaches sufficient predictive power (that is, the point at which, in real life, we could potentially intervene).
{% endhint %}
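
One way to picture "data aggregated over time" is the sketch below: for each time point, only the measurements recorded up to that point are kept and summarized per patient, giving one static table per time point. This is an illustration only and does not reflect how the *MEDprofiles* package is implemented; the function `aggregate_up_to`, the record layout, the use of a mean, and the 12/24/48-hour time points are all assumptions.

```python
from statistics import mean


def aggregate_up_to(records, cutoff_hours):
    """Average each patient's values recorded at or before `cutoff_hours`."""
    values = {}
    for patient_id, hours, value in records:
        if hours <= cutoff_hours:
            values.setdefault(patient_id, []).append(value)
    return {pid: mean(vals) for pid, vals in values.items()}


# Made-up records: (patient ID, hours since admission, feature value).
records = [
    ("P01", 2, 1.0),
    ("P01", 20, 3.0),
    ("P01", 40, 8.0),
    ("P02", 5, 2.0),
    ("P02", 30, 4.0),
]

# One static table per hypothetical time point (in hours).
static_tables = {t: aggregate_up_to(records, t) for t in (12, 24, 48)}
```

A model trained on each static table in turn shows how predictive power changes as more of the patient's history becomes available.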

## Recommendations

Before proceeding with *Step 3 - Prepare ML tables* of the MEDomicsLab Testing Phase, we recommend consulting the documentation of the [*Input Module*](https://medomicslab.gitbook.io/medomics-docs/medomicslab-docs-v0/tutorials/design/input-module).

{% content-ref url="../tutorials/design/input-module" %}
[input-module](https://medomicslab.gitbook.io/medomics-docs/medomicslab-docs-v0/tutorials/design/input-module)
{% endcontent-ref %}

## Instructions for Step 3 - Prepare ML tables

{% embed url="<https://youtu.be/lVvszz01cDk?si=c2BqFgJ4Kw03RSIl>" %}

{% hint style="info" %}
**Reminder**: Make sure to save your datasets when updating column names by pressing the 'Save' button icon (an example is shown at 16:08 in the video above).

If you do not press the 'Save' button icon after modifying a CSV file in the app, the changes will not be applied in your workspace.
{% endhint %}

**Content**

Intro [0:00](https://www.youtube.com/watch?v=lVvszz01cDk\&t=0s)

Reduce extracted features [0:50](https://www.youtube.com/watch?v=lVvszz01cDk\&t=50s)

Merge all our data [8:24](https://www.youtube.com/watch?v=lVvszz01cDk\&t=504s)

Visualize *MEDprofiles* [10:52](https://www.youtube.com/watch?v=lVvszz01cDk\&t=652s)

Define static time points [12:01](https://www.youtube.com/watch?v=lVvszz01cDk\&t=721s)

Create learning and holdout sets [14:13](https://www.youtube.com/watch?v=lVvszz01cDk\&t=853s)

***

{% hint style="info" %}
We acknowledge that using Spearman correlation with the target variable to massively reduce the feature set dimension **on the whole dataset** is not best practice in machine learning, because the feature selection then sees data that is later held out for evaluation.

If Spearman correlation is needed as a feature set reduction method, it should normally be performed "on-the-fly" on the training sets of the *Learning set* (and ideally, the PCA process too).

Here, we decided to use Spearman correlation on the whole dataset during the *Reduce extracted features* process to work around current difficulties in handling large feature sets in downstream processes.

We are actively working on improving the scalability of our application so that applying Spearman correlation to the whole dataset will no longer be needed.
{% endhint %}
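
For readers who want to see what "on-the-fly" selection looks like, the following is a minimal plain-Python sketch, not MEDomicsLab code. Features are ranked by absolute Spearman correlation with the target using training rows only, and the selected columns are then applied unchanged to the held-out rows. The helper names (`ranks`, `pearson`, `spearman`, `select_features`) and the toy data are hypothetical.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def select_features(train_rows, train_target, k):
    """Return indices of the k columns most correlated with the target."""
    scores = []
    for col in range(len(train_rows[0])):
        column = [row[col] for row in train_rows]
        scores.append((abs(spearman(column, train_target)), col))
    scores.sort(key=lambda s: (-s[0], s[1]))
    return sorted(col for _, col in scores[:k])


# Made-up training fold: columns 0 and 2 track the target, column 1 is noise.
train_rows = [[1, 5, 9], [2, 1, 8], [3, 4, 7], [4, 2, 3], [5, 6, 2], [6, 3, 1]]
train_target = [0, 0, 0, 1, 1, 1]
holdout_rows = [[7, 2, 0], [0, 4, 10]]

selected = select_features(train_rows, train_target, k=2)
# The held-out rows never influence which columns are kept.
reduced_holdout = [[row[c] for c in selected] for row in holdout_rows]
```

In cross-validation, `select_features` would be called again inside every fold, so each validation split is scored with features chosen without access to it.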
