
2. Input Module

This page documents the Input Module step of the demo, where we will perform two processing steps on our new "homr_any_visit_10pct.csv" file before model training in the Learning Module.

The Input Module provides the key data processing tools needed for various tasks within the MEDomics platform. In this proof of concept, we will use two of them: the Column Tagging Tools and the Holdout Set Creation Tools. We will also use the MEDomics editor to delete a column.


Column Deletion

Before moving into the Input Module, we have to delete the "CSO" column from our dataset file. This column is not used in the original study, so we don't need to keep it.

Double-click the homr_any_visit_10pct.csv file in your workspace to open it in the MEDomics editor. Then click the bin icon above the "CSO" column to delete it.
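For readers who prefer to script this step, the same deletion can be sketched in Python. This is a minimal sketch using only the standard library, not a description of what the MEDomics editor does internally; the helper name drop_column and the sample rows are hypothetical.

```python
import csv
import io


def drop_column(src, dst, column):
    """Copy CSV rows from src to dst, omitting one named column."""
    reader = csv.DictReader(src)
    fieldnames = reader.fieldnames or []
    if column not in fieldnames:
        raise KeyError(f"column {column!r} not found")
    kept = [name for name in fieldnames if name != column]
    # extrasaction="ignore" silently skips the dropped column in each row.
    writer = csv.DictWriter(dst, fieldnames=kept, extrasaction="ignore")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return kept


# Tiny in-memory stand-in for homr_any_visit_10pct.csv (values are made up).
sample = io.StringIO("age_original,gender,CSO,oym\n71,F,1,0\n64,M,0,1\n")
output = io.StringIO()
remaining = drop_column(sample, output, "CSO")
print(remaining)
```

With real files, src and dst would be file objects opened with newline="" as recommended by the csv module documentation.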

Data in MEDomics editor

Column Tagging

This tool is a core component of the MEDomics platform, as it enables the use of the MEDomics Standard data format.

Follow the steps on the figure below to access the Column Tagging tool in the Input Module.

Steps to the Column Tagging tools

In MEDomics, a tag represents a group of columns. Each tag corresponds to a coherent subset of features sharing a common meaning or role (e.g., administrative data, demographic variables, clinical diagnoses). Column selection for tags is defined by the user based on data understanding and domain knowledge.

The MEDomics Standard format is built on this tagging mechanism. Rather than relying on a fixed dataset schema, MEDomics allows users to define multiple semantic views over the same dataset through tags. This design provides flexibility while preserving consistency and traceability.

Dataset Structure

The predictors in our dataset include:

  • Demographics (age and sex at birth) – 2 variables

  • Admission characteristics – 10 variables

  • Comorbidity diagnoses – 85 binary variables

  • Admission diagnoses – 147 binary variables

This results in a total of 244 predictors.

Predictor Sets in the POYM Study

The POYM study defines two predictor sets for model training and evaluation:

  • AdmDemo → Adm (Admission characteristics) + Demo (Demographics)

  • AdmDemoDx → Adm (Admission characteristics) + Demo (Demographics) + Dx (Comorbidity diagnoses + Admission diagnoses)

For this proof of concept, we represent these predictor sets using three tags:

  • Adm → Admission characteristics (10 variables)

  • Demo → Demographics (2 variables: age_original, gender)

  • Dx → Comorbidity diagnoses (85) + Admission diagnoses (147)
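The relationship between tags and predictor sets can be checked with a few lines of Python. This is only an illustrative sketch: the dictionary names TAG_SIZES and PREDICTOR_SETS are hypothetical, and the sizes are the ones stated in this section.

```python
# Tag sizes as stated in this section.
TAG_SIZES = {"adm": 10, "demo": 2, "dx": 85 + 147}

# Predictor sets from the POYM study, expressed as combinations of tags.
PREDICTOR_SETS = {
    "AdmDemo": ("adm", "demo"),
    "AdmDemoDx": ("adm", "demo", "dx"),
}


def predictor_count(set_name):
    """Total number of predictors in a predictor set."""
    return sum(TAG_SIZES[tag] for tag in PREDICTOR_SETS[set_name])


counts = {name: predictor_count(name) for name in PREDICTOR_SETS}
print(counts)  # {'AdmDemo': 12, 'AdmDemoDx': 244}
```

The AdmDemoDx total matches the 244 predictors listed under Dataset Structure.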

To assign tags to variables:

  1. Open the Input Module from the left navigation panel.

  2. Under Data Organization, select Structuring & Tagging.

  3. Click on Column Tagging Tools.

This tool allows you to assign the appropriate tag (adm, demo, or dx) to each variable according to the study definition.

Variable Mapping by Tag

Tag: Adm
Description: Admission characteristics
Number of variables: 10

  • ed_visit_count

  • ho_ambulance_count

  • total_duration

  • flu_season

  • living_status

  • admission_group

  • is_ambulance

  • is_icu_start_ho

  • is_urg_readm

  • service_group

To tag these columns, copy and paste their names into the tagging tool.

Tag: Demo
Description: Demographics
Number of variables: 2

  • age_original

  • gender

To tag these columns, copy and paste their names into the tagging tool.

Tag: Dx
Description: Comorbidity diagnoses + Admission diagnoses (the rest of the columns)
Number of variables: 232

As with the other tags, copy and paste the corresponding column names into the tagging tool.
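Because Dx is defined as "the rest of the columns", its column list can be derived from the file header rather than typed by hand. The sketch below illustrates the idea in Python under stated assumptions: the helper build_tag_map is hypothetical, the two dx_example_* names are placeholders rather than real dataset columns, and it assumes the target oym is the only non-predictor column left after deleting "CSO".

```python
ADM = [
    "ed_visit_count", "ho_ambulance_count", "total_duration", "flu_season",
    "living_status", "admission_group", "is_ambulance", "is_icu_start_ho",
    "is_urg_readm", "service_group",
]
DEMO = ["age_original", "gender"]


def build_tag_map(header, non_predictors=("oym",)):
    """Assign each column to adm, demo, or dx; skip non-predictor columns."""
    tags = {"adm": [], "demo": [], "dx": []}
    for column in header:
        if column in non_predictors:
            continue  # assumption: the target is not tagged as a predictor
        if column in ADM:
            tags["adm"].append(column)
        elif column in DEMO:
            tags["demo"].append(column)
        else:
            tags["dx"].append(column)  # everything else is a diagnosis column
    return tags


# Toy header: the dx_example_* names are placeholders, not real columns.
header = DEMO + ADM + ["dx_example_1", "dx_example_2", "oym"]
tags = build_tag_map(header)
print({tag: len(columns) for tag, columns in tags.items()})
```

On the real header, the dx list should contain 232 names if the assumption about non-predictor columns holds; a different count would indicate extra columns that need to be excluded.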


The figure below illustrates the process of assigning tags to dataset columns using the Column Tagging Tools.

  1. Select the dataset (homr_any_visit_10pct.csv).

  2. Create the three required tags (adm, demo, and dx) by entering their names one after the other and pressing Enter after each.

  3. Copy and paste the column names corresponding to each tag from the table above.

  4. Choose the appropriate tag to apply.

  5. Click Apply tags to validate the configuration.


The third step presents two alternative ways to assign columns to their corresponding tags.

This can be done either by:

  • Pasting the column names manually, or

  • Selecting the columns directly from the displayed dataset.

In this example, the variables age_original and gender are assigned to the demo tag.

Create the "adm", "demo" and "dx" tags

You can visualize the tags within the dataset in the MEDomics editor.


Holdout Set Creation

After creating our tags, the final step is to split our data into a learning set and a holdout set.

For this task, we will use the Holdout Set Creation Tools. To access this tool, select Sampling under the Data Wrangling section in the Input Module.

Sampling in the Data Wrangling section

After selecting the dataset (homr_any_visit_10pct.csv):

  1. Enable Shuffle and Stratify.

  2. Select oym as the target column.

  3. Set the split percentage to 20%.

  4. Choose "drop" as the empty cells cleaning method.

  5. Activate the Keep tags toggle.

  6. Click the Save icon to create the Learning and Holdout sets.
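To make these settings concrete, here is a minimal Python sketch of a shuffled, stratified split using only the standard library. It is not the MEDomics implementation. It assumes that the 20% split percentage is the holdout share and that the "drop" cleaning method removes rows containing an empty cell. The function name stratified_holdout, the seed, and the sample rows are hypothetical.

```python
import random
from collections import defaultdict


def stratified_holdout(rows, target, holdout_fraction=0.2, seed=0):
    """Return (learning, holdout) with class proportions roughly preserved."""
    # Assumed meaning of "drop": discard rows that contain any empty cell.
    complete = [
        row for row in rows
        if all(value not in ("", None) for value in row.values())
    ]
    # Group rows by target class so each class is split separately (stratify).
    by_class = defaultdict(list)
    for row in complete:
        by_class[row[target]].append(row)

    rng = random.Random(seed)
    learning, holdout = [], []
    for group in by_class.values():
        rng.shuffle(group)  # shuffle within each class before cutting
        n_holdout = round(len(group) * holdout_fraction)
        holdout.extend(group[:n_holdout])
        learning.extend(group[n_holdout:])
    rng.shuffle(learning)
    rng.shuffle(holdout)
    return learning, holdout


# Made-up data: 80 negatives, 20 positives, plus one row with an empty cell.
rows = [{"age_original": str(50 + i % 40), "oym": "0"} for i in range(80)]
rows += [{"age_original": str(60 + i % 30), "oym": "1"} for i in range(20)]
rows.append({"age_original": "", "oym": "1"})

learning, holdout = stratified_holdout(rows, target="oym")
print(len(learning), len(holdout))
```

Stratifying on oym keeps the proportion of positive cases the same in both sets, which matters when the outcome is imbalanced.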

These steps are illustrated in the figure below.

Create Learning and Holdout sets from our dataset

With the creation of the Holdout and the Learning sets, we conclude our Input Module steps, and we can now start the machine learning phase.

This step ensures that the dataset is properly prepared for the demo and ready to be used in a complete end-to-end workflow within MEDomics, including the Learning, Evaluation, and Application modules. In the next section, we will use the homr_any_visit_10pct.csv dataset (with the applied tags preserved, of course!) to run machine learning experiments and replicate the POYM study.

This concludes our Input Module section. Now our data is ready for model training!
