# Step 4: Explore Data

<figure><img src="https://content.gitbook.com/content/7cVTUTkb3KodRR4EGOZH/blobs/OE2kWidggRpuORmTuQFO/MEDomicsLab-TestingPhase-12.png" alt=""><figcaption><p>Step 4 - Explore Data</p></figcaption></figure>

{% hint style="info" %}
If you completed [*Step 3 - Prepare ML tables*](https://medomicslab.gitbook.io/medomics-docs/medomicslab-docs-v0/test-with-mimic/step-3), you have data ready for *Step 4 - Explore Data*.

However, before proceeding to *Step 4 - Explore Data*, we recommend that you replace your own output data from [*Step 3 - Prepare ML tables*](https://medomicslab.gitbook.io/medomics-docs/medomicslab-docs-v0/test-with-mimic/step-3) (the *MEDprofiles*/*timePoints* folder) with the data that we prepared for you (*MEDomicsLab\_TestingPhase\_Step4.zip*). This ensures that results are consistent across all participants of the Testing Phase.

An invitation to access the *MEDomicsLab\_TestingPhase\_Step4.zip* data was sent by email.
{% endhint %}

{% hint style="info" %}
The *MEDomicsLab\_TestingPhase\_Step4.zip* archive also includes a *new\_demographic\_embeddings* CSV file that you will need for *Step 4 - Explore Data*. The rationale for providing this new file is explained in the "Set the demographics in T1 only" section ([5:41](https://www.youtube.com/watch?v=EbZ3xhG16pg\&t=341s)) of the video.
{% endhint %}

*Step 4 - Explore Data* is divided into seven parts, in which you explore the learning set obtained from [*Step 3 - Prepare ML tables*](https://medomicslab.gitbook.io/medomics-docs/medomicslab-docs-v0/test-with-mimic/step-3) of the Testing Phase:

1. **Analyze the learning set using *YData profiling*:** Use the YData profiling tool from the [*Exploratory Module*](https://medomicslab.gitbook.io/medomics-docs/medomicslab-docs-v0/tutorials/design/exploratory-module) to examine your learning set. Record the percentages of missing values for each class across all time points.
2. **Set demographic embeddings in T1 only:** Based on the insights from part 1, eliminate demographic embeddings from all time point CSV files, including learning and holdout sets. Consolidate all demographic data into the T1 time point using the *new\_demographic\_embeddings* CSV file. Conduct this operation in the [*Input Module*](https://medomicslab.gitbook.io/medomics-docs/medomicslab-docs-v0/tutorials/design/input-module).
3. **Remove chart events from T1:** Referring to the analysis in part 1, eliminate chart events from the T1 datasets (both learning and holdout) using the [*Input Module*](https://medomicslab.gitbook.io/medomics-docs/medomicslab-docs-v0/tutorials/design/input-module).
4. **Transform procedure events:** Building on the findings from part 1, transform the procedure events columns in all time point datasets (learning and holdout sets) using the [*Input Module*](https://medomicslab.gitbook.io/medomics-docs/medomicslab-docs-v0/tutorials/design/input-module).
5. **Analyze the learning set using *D-Tale*:** Use the *D-Tale* tool from the [*Exploratory Module*](https://medomicslab.gitbook.io/medomics-docs/medomicslab-docs-v0/tutorials/design/exploratory-module) to scrutinize your learning set. Explore the inter-variable correlation matrices for each time point.
6. **Analyze the learning set using *SweetViz*:** Use the *SweetViz* tool from the [*Exploratory Module*](https://medomicslab.gitbook.io/medomics-docs/medomicslab-docs-v0/tutorials/design/exploratory-module) to study your learning set. Identify sets of variables exhibiting a high correlation rate, considering the observations made with *D-Tale*.
7. **Remove highly correlated columns from the time point datasets:** In the [*Input Module*](https://medomicslab.gitbook.io/medomics-docs/medomicslab-docs-v0/tutorials/design/input-module), eliminate the variables identified in part 6 as having high correlation rates from the time point datasets (learning and holdout sets).
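To cross-check the missing-value percentages from part 1 outside the application, a minimal pandas sketch could look like the following. It is not part of MEDomicsLab: the function name `missing_percent_by_group`, the example column names, and the assumption that the text before the first underscore identifies a column's class are all hypothetical and should be adapted to your actual time point CSV files.

```python
import numpy as np
import pandas as pd


def missing_percent_by_group(df: pd.DataFrame, sep: str = "_") -> pd.Series:
    """Return the percentage of missing cells per column group.

    A group is the text before the first `sep` in the column name.
    This naming convention is an assumption made for illustration.
    """
    groups = df.columns.str.split(sep).str[0]
    # isna().mean() is the fraction of missing cells in each column
    per_column = df.isna().mean() * 100
    # All columns have the same number of rows, so averaging the
    # per-column percentages gives the percentage for the whole group
    return per_column.groupby(groups.to_numpy()).mean().round(1)


# Hypothetical stand-in for one time point table (e.g. T1)
t1 = pd.DataFrame(
    {
        "demographic_0": [0.1, 0.2, 0.3, 0.4],
        "demographic_1": [1.0, np.nan, 3.0, 4.0],
        "chartevent_0": [np.nan, np.nan, np.nan, 5.0],
        "chartevent_1": [np.nan, np.nan, 2.0, 1.0],
    }
)

result = missing_percent_by_group(t1)
print(result)
```

With real data, you would load each time point file with `pd.read_csv` and compare the output with the figures shown in the YData profiling report.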

{% hint style="info" %}
Please note that the CSV files for the time points obtained from [*Step 3 - Prepare ML tables*](https://medomicslab.gitbook.io/medomics-docs/medomicslab-docs-v0/test-with-mimic/step-3) are already tagged. This is done by the [*MEDprofiles Module*](https://medomicslab.gitbook.io/medomics-docs/medomicslab-docs-v0/tutorials/design/input-module/medprofiles) when exporting the data as time points.
{% endhint %}

{% hint style="info" %}
We encourage you not only to follow the video but also to use the [*Exploratory Module*](https://medomicslab.gitbook.io/medomics-docs/medomicslab-docs-v0/tutorials/design/exploratory-module) on your own to explore the learning set. This self-directed analysis will prove valuable in *Step 8 - Challenge*. :wink:
{% endhint %}

## Recommendations

Before proceeding with *Step 4 - Explore Data* of the MEDomicsLab Testing Phase, we recommend consulting the documentation of the [*Input Module*](https://medomicslab.gitbook.io/medomics-docs/medomicslab-docs-v0/tutorials/design/input-module) and the [*Exploratory Module*](https://medomicslab.gitbook.io/medomics-docs/medomicslab-docs-v0/tutorials/design/exploratory-module).

{% content-ref url="../tutorials/design/input-module" %}
[input-module](https://medomicslab.gitbook.io/medomics-docs/medomicslab-docs-v0/tutorials/design/input-module)
{% endcontent-ref %}

{% content-ref url="../tutorials/design/exploratory-module" %}
[exploratory-module](https://medomicslab.gitbook.io/medomics-docs/medomicslab-docs-v0/tutorials/design/exploratory-module)
{% endcontent-ref %}

{% hint style="warning" %}
Please consider the warnings mentioned on the [*Input Module*](https://medomicslab.gitbook.io/medomics-docs/medomicslab-docs-v0/tutorials/design/input-module) page: we are continuously working on enhancing the MEDomicsLab platform.
{% endhint %}

## Instructions for Step 4 - Explore Data

{% embed url="<https://youtu.be/EbZ3xhG16pg?si=7aJkXHytZ3RAf58K>" %}

**Content**

Intro [0:00](https://www.youtube.com/watch?v=EbZ3xhG16pg\&t=0s)

YData-profiling [0:54](https://www.youtube.com/watch?v=EbZ3xhG16pg\&t=54s)

Set the demographics in T1 only [5:41](https://www.youtube.com/watch?v=EbZ3xhG16pg\&t=341s)

Remove chart events from T1 [12:04](https://www.youtube.com/watch?v=EbZ3xhG16pg\&t=724s)

Transform procedure events [14:00](https://www.youtube.com/watch?v=EbZ3xhG16pg\&t=840s)

D-Tale [18:50](https://www.youtube.com/watch?v=EbZ3xhG16pg\&t=1130s)

SweetViz [23:24](https://www.youtube.com/watch?v=EbZ3xhG16pg\&t=1404s)

Remove Highly Correlated Columns [28:26](https://www.youtube.com/watch?v=EbZ3xhG16pg\&t=1706s)

{% hint style="danger" %}
The last part, *Remove Highly Correlated Columns*, is **optional** because it can be time-consuming. We are actively working on the *Delete Columns* tool in the *Input Module* to speed up this procedure in the future. Even if you choose to skip this last part, we will provide you with our own output data from [*Step 4 - Explore Data*](https://medomicslab.gitbook.io/medomics-docs/medomicslab-docs-v0/test-with-mimic/step-4) for use in [*Step 5 - Create Model*](https://medomicslab.gitbook.io/medomics-docs/medomicslab-docs-v0/test-with-mimic/step-5).
{% endhint %}
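To illustrate the idea behind removing highly correlated columns, here is a minimal pandas sketch. It is not the MEDomicsLab *Delete Columns* tool and does not reproduce the selection made in the video: the function name `columns_to_drop`, the example data, and the 0.9 threshold are assumptions chosen for illustration. The columns are selected from the learning set only, and the same list is then removed from both the learning and holdout sets.

```python
import numpy as np
import pandas as pd


def columns_to_drop(learning: pd.DataFrame, threshold: float = 0.9) -> list:
    """List columns whose absolute correlation with an earlier column
    exceeds `threshold` (an illustrative value, not a recommendation)."""
    corr = learning.select_dtypes("number").corr().abs()
    # Keep only the strict upper triangle so each pair is examined once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    # NaN entries (masked cells) compare as False and are ignored
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# Hypothetical learning and holdout tables for one time point
learning = pd.DataFrame(
    {
        "a": [1, 2, 3, 4, 5],
        "b": [2, 4, 6, 8, 10.5],  # nearly a multiple of "a"
        "c": [5, 1, 4, 2, 3],
    }
)
holdout = pd.DataFrame({"a": [6, 7], "b": [12, 14], "c": [2, 5]})

to_drop = columns_to_drop(learning)
reduced_learning = learning.drop(columns=to_drop)
reduced_holdout = holdout.drop(columns=to_drop)
print(to_drop)
```

Deriving the list from the learning set alone keeps the holdout set out of the selection decision while ensuring both sets end up with identical columns.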
