Analysis

This page explains the Analysis Box functionality, including the Analysis Mode, and how it enables comprehensive model performance evaluation through detailed metrics and diagnostic tools.

The Analysis Box

The Analysis Box (Figure below) is the final component in the learning module pipeline, positioned immediately after the Training Box. It serves as the dedicated component for model evaluation, accepting inputs from:

Individual trained models (via Train Model nodes)
Model ensembles (via Combine Models nodes)

Key Characteristics:

Node-Free Design: Unlike other boxes, this is a preconfigured analysis terminal that cannot contain additional nodes.
PyCaret Integration: Implements plot_model() with the following parameter controls:
- Plot Metric (plot parameter):
  - Sets the evaluation visualization type (default: 'auc')
  - Options include: confusion matrix, feature importance, ROC curve, etc.
- Scale (scale parameter):
  - Adjusts output figure resolution (range: 0-1)
  - Higher values increase image quality and file size
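As a rough illustration of how these two settings reach PyCaret, here is a minimal sketch. The helper names build_plot_kwargs and plot_trained_model are hypothetical and not part of the application. The sketch also assumes a classification experiment, and that a scale of 0 is rejected because it would produce an empty figure.

```python
def build_plot_kwargs(plot="auc", scale=1.0):
    """Collect the two Analysis Box settings as plot_model() keyword arguments.

    Hypothetical helper: the (0, 1] check mirrors the range described above
    and is an assumption about how the application validates the value.
    """
    if not 0 < scale <= 1:
        raise ValueError("scale is expected to lie in the (0, 1] range")
    return {"plot": plot, "scale": scale}


def plot_trained_model(model, plot="auc", scale=1.0):
    """Call PyCaret's plot_model() on an already trained model.

    Assumes setup() has been run in the current session; PyCaret is imported
    lazily so the helper above can be used without it installed.
    """
    from pycaret.classification import plot_model

    return plot_model(model, **build_plot_kwargs(plot, scale))
```

For example, plot_trained_model(model, plot="confusion_matrix", scale=0.5) would request a confusion matrix at half the default resolution.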

The Analysis Box represents the "Analysis" section of the machine learning workflow:

The Analysis Mode

If you prefer a quick summary, jump to the "Summary of the Analysis Mode" section at the end of this page.

The Analysis Mode becomes available after a successful experiment execution. When activated, a results panel appears at the bottom of the interface, displaying results for all pipelines in the current scene. This mode provides a detailed breakdown of results organized by pipeline and node.

Pipeline Results Structure:

Each pipeline, identifiable by its customizable name, presents results through the following node-specific information:

Dataset Node: Displays the training data table and all parameters applied through PyCaret's setup function.

Clean Node: Shows the preprocessing parameters configured in PyCaret's setup.

Split Node: Presents detailed split statistics, including sample counts per fold/iteration and class distribution metrics.

Model Node: Contains the complete set of performance metrics for the model.

Combine Models Node: Provides evaluation metrics for the combined model output.

Analysis Node: Displays the plot selected in the Analysis Box.

PyCaret ROC (Receiver Operating Characteristic)/AUC (Area Under the Curve) plots

The AUC plots generated by the PyCaret library are derived from the YellowBrick Python package, which extends the scikit-learn API. By default, the plot displays multiple curves:

One ROC curve per class, computed using the one-vs-rest method (the class under consideration is treated as the positive class and all other classes as the negative class).
The micro-average curve, computed by pooling the true positives and false positives of all classes into a single binary problem.
The macro-average curve, computed as the unweighted average of the per-class curves.
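To make the difference between these three kinds of curves more concrete, the sketch below summarizes each one by its AUC value on a small, made-up 3-class example, using plain Python. The binary_auc helper is hypothetical and written for this illustration only; it is not how PyCaret or YellowBrick compute their plots. The macro value is taken here as the mean of the per-class AUCs, which is a simplification of averaging the curves themselves.

```python
from itertools import product


def binary_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Made-up data: true class of each sample and predicted class probabilities.
classes = [0, 1, 2]
y_true = [0, 0, 1, 1, 2, 2]
y_prob = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],
    [0.1, 0.2, 0.7],
    [0.2, 0.3, 0.5],
]

# One-vs-rest: class c is positive, every other class is negative.
per_class_auc = {}
for c in classes:
    ovr_labels = [1 if y == c else 0 for y in y_true]
    ovr_scores = [row[c] for row in y_prob]
    per_class_auc[c] = binary_auc(ovr_labels, ovr_scores)

# Macro-average: unweighted mean over classes.
macro_auc = sum(per_class_auc.values()) / len(classes)

# Micro-average: pool every (sample, class) pair into one binary problem.
micro_labels = [1 if y == c else 0 for y in y_true for c in classes]
micro_scores = [p for row in y_prob for p in row]
micro_auc = binary_auc(micro_labels, micro_scores)

print(per_class_auc, macro_auc, micro_auc)
```

In a binary problem the two one-vs-rest curves mirror each other, which is one reason the default plot can look redundant in that case.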

We acknowledge that these curves can be a bit confusing, especially with binary classification.

When using the YellowBrick package directly, parameters can be set to display only the classic ROC curve. However, we have not yet found a way to pass these parameters through PyCaret from our application. We are currently working on fixing this issue.
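For reference, a minimal sketch of the direct YellowBrick usage mentioned above might look like the following. It runs outside the application, the function name plot_classic_roc is hypothetical, and it assumes a binary classifier together with a YellowBrick version whose ROCAUC visualizer accepts the per_class, micro and macro arguments.

```python
# Settings that switch off the per-class, micro- and macro-average curves.
CLASSIC_ROC_KWARGS = {"per_class": False, "micro": False, "macro": False}


def plot_classic_roc(estimator, X_train, y_train, X_test, y_test):
    """Draw a single ROC curve for a binary classifier with YellowBrick."""
    # Imported lazily so this module can be loaded without yellowbrick installed.
    from yellowbrick.classifier import ROCAUC

    visualizer = ROCAUC(estimator, **CLASSIC_ROC_KWARGS)
    visualizer.fit(X_train, y_train)   # fit the wrapped estimator
    visualizer.score(X_test, y_test)   # compute the ROC curve on held-out data
    visualizer.show()                  # render the figure
    return visualizer
```

The estimator can be any scikit-learn compatible binary classifier, for example LogisticRegression().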

Finalize & Save Model

The 'Finalize & Save Model' button, available for a selected pipeline, performs two functions through PyCaret:

Model Finalization: Retrains the selected model on the complete dataset using PyCaret's finalize_model() function.
Model Saving: Saves the finalized model as a pickle file via PyCaret's save_model() function. The saved model appears in the experiment's models subfolder, named after the model's class name, or after the Model node's ID if that ID has been changed from the default ('Model').

The process requires no parameter configuration, automatically preserving all training parameters from the original experiment.
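The two steps and the naming rule described above can be sketched as follows. The helpers saved_model_name and finalize_and_save are hypothetical, the use of the classification module is an assumption, and the application's actual implementation may differ.

```python
import os

DEFAULT_NODE_ID = "Model"  # default ID of a Model node, as stated above


def saved_model_name(model, node_id=DEFAULT_NODE_ID):
    """Use the node's ID if it was renamed, otherwise the model's class name."""
    if node_id != DEFAULT_NODE_ID:
        return node_id
    return type(model).__name__


def finalize_and_save(model, models_dir, node_id=DEFAULT_NODE_ID):
    """Retrain on the complete dataset, then pickle the result.

    Assumes setup() and model training already ran in the current PyCaret
    session; PyCaret is imported lazily.
    """
    from pycaret.classification import finalize_model, save_model

    final_model = finalize_model(model)
    path = os.path.join(models_dir, saved_model_name(model, node_id))
    save_model(final_model, path)  # PyCaret appends the ".pkl" extension
    return path + ".pkl"
```

A model saved this way can later be reloaded with PyCaret's load_model() by passing the same path without the ".pkl" extension.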

Additionally, this button represents the "Final Model" section of the machine learning workflow, as shown in the following figure:

The Generate Feature

The Generate functionality exports the complete pipeline configuration as executable Python code in Jupyter Notebook format. You can generate a Jupyter Notebook using the Generate button next to a selected pipeline. The generated notebook, which mirrors the pipeline's structure and parameters, appears in the experiment's notebooks subfolder using the pipeline's current name as its file identifier.

This feature enables:

Deeper investigation of the training process
Custom code modifications for performance optimization
Enhanced reproducibility
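To give an idea of what "a notebook named after the pipeline" means on disk, here is a minimal sketch that writes a list of code strings as a Jupyter Notebook file using only the standard library. The function write_notebook and its arguments are hypothetical; the application's own generator is not shown here and almost certainly produces richer output.

```python
import json
from pathlib import Path


def write_notebook(pipeline_name, code_cells, notebooks_dir):
    """Write code strings as a minimal nbformat 4 notebook.

    The file is named after the pipeline, mirroring the convention
    described above.
    """
    notebook = {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": None,
                "metadata": {},
                "outputs": [],
                "source": source,
            }
            for source in code_cells
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }
    path = Path(notebooks_dir) / f"{pipeline_name}.ipynb"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(notebook, indent=1), encoding="utf-8")
    return path
```

Calling write_notebook("pipeline_1", ["print('hello')"], "notebooks") would create notebooks/pipeline_1.ipynb, which Jupyter or VS Code can open.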

You can also launch any generated notebook directly from the application by double-clicking the file. Alternatively, you can right-click the file and select the "Open in..." option to open the notebook in VS Code.

An example of a generated notebook, opened in VS Code, is shown below.

Pipeline naming conventions directly affect this feature. Check out the next section for more details.

Manage Pipelines

The Manage Pipelines interface serves two primary purposes:

Pipeline Overview: Displays a structured summary of all nodes comprising each pipeline and their connections.
Naming Control: Allows pipeline renaming, which simultaneously updates:
- The notebook filename in the Generate feature
- All experiment tracking references

The node's selection box

In both Analysis and Results modes, a checkbox is available at the top of each runnable node. Use this control to selectively display results for specific nodes, hiding the output of others. A green checkbox indicates that the node is a mandatory component of all pipelines; consequently, its results will always be displayed.

In the following example, only the results of the checked node Clean2 are displayed, while the results of the other nodes are hidden.

The highlighting feature

This feature enhances navigation in both Analysis and Results modes by dynamically applying distinct color codes to selected nodes and pipelines. It highlights the entire execution path of a chosen pipeline, making it easy to distinguish from others. The system uses the following color scheme to indicate status:

Orange: Used for non-executed nodes and the connecting edges of a non-executed pipeline.
Green: Indicates a selected and successfully executed node.
Blue: Highlights all elements (nodes and edges) of the currently selected pipeline.

This functionality is particularly valuable in complex scenes with multiple pipelines, as it simplifies the process of tracking and comparing results. The following figure illustrates these color codes in the context of different user interactions.

Summary of the Analysis Mode

A full breakdown of the Analysis Mode is presented in the following figure:

On the next page, you will learn more about the new scene type 'Experimental' and how you can use it as a testing environment for your machine learning experiments.
