Features

A feature set is a collection of features that can be used to build models or perform other Intelligence Tasks. By default, two feature sets are created, "All features" and "Base features":

  • "All features" includes all uploaded features, including the target, as well as the derived ones (binned continuous and/or merged categoricals); all derived features have to be created from this feature set. 
  • "Base features" is the default feature set for building models. It includes all uploaded features except the target and any features removed in the Audit.

Clicking into any feature set shows a page outlining:

  • The name of each feature: either the feature name or the alias if a data dictionary has been used.
  • Type: Features can be one of three types:
    • Continuous: the feature is entirely numerical, with more distinct values than the "Categorical Threshold" set in the "Encoding Options".
    • Categorical: the feature only contains different categories/classes.
    • Mixed: a continuous feature but with one or more categories, for example an "Unknown", "NaN" or "blank" category, representing missing values. Other categories representing non-blank values are allowed.
  • Number of unique values in the field
  • PSI (Population Stability Index): a comparison of the distribution between training and testing sets.
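  • IV (Information Value): a measure of the relationship between the feature and the score. $$IV = \sum (Good\% - Bad\%) \times \ln(Good\% / Bad\%)$$ Here the sum runs over the bins or categories of the feature, and Good% and Bad% are each bin's share of all good and all bad outcomes.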
  • Gini (Gini coefficient): a measure of how well the feature can predict the outcome compared to that achieved by chance. A value of 1 indicates perfect predictions, whereas 0 indicates the relationship is random.
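
As a rough illustration of the three metrics above, the sketch below computes them in plain Python. The IV function follows the formula given above. The PSI function uses the commonly quoted definition (the same sum, taken over training and testing shares per bin), and the Gini function uses 2 × AUC − 1. The source does not say how the platform computes PSI or Gini, so both are assumptions, as are the function names and the small floor `EPS` used to avoid taking the logarithm of zero.

```python
import math

EPS = 1e-6  # assumption: floor for empty bins so log() and division stay defined


def _shares(counts):
    """Turn raw counts per bin into proportions, floored at EPS."""
    total = sum(counts)
    return [max(c / total, EPS) for c in counts]


def information_value(good_counts, bad_counts):
    """IV = sum((Good% - Bad%) * ln(Good% / Bad%)) over the bins of one feature."""
    good = _shares(good_counts)
    bad = _shares(bad_counts)
    return sum((g - b) * math.log(g / b) for g, b in zip(good, bad))


def population_stability_index(train_counts, test_counts):
    """PSI over matching bins of the training and testing sets (common definition)."""
    train = _shares(train_counts)
    test = _shares(test_counts)
    return sum((t - r) * math.log(t / r) for r, t in zip(train, test))


def gini(feature_values, outcomes):
    """Gini = 2 * AUC - 1, where AUC is the chance that a random positive
    instance has a higher feature value than a random negative one
    (ties count as half)."""
    pos = [v for v, y in zip(feature_values, outcomes) if y == 1]
    neg = [v for v, y in zip(feature_values, outcomes) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))
    return 2 * auc - 1


# Two bins: counts of good and bad outcomes in each bin.
iv = information_value([80, 20], [40, 60])
# Identical bin counts in training and testing give a PSI of zero.
psi = population_stability_index([50, 50], [50, 50])
# A feature that separates the two outcomes perfectly gives a Gini of 1.
g = gini([1, 2, 3, 4], [0, 0, 1, 1])
```

Note that this Gini sketch returns a negative value when higher feature values go with the negative outcome. A display that only ranges from 0 to 1 would need the absolute value, which is again a guess about the platform rather than documented behaviour.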

Clicking on a feature name itself will take you to a detailed screen which shows the data contained within the column. This includes:

  • The current feature being inspected.
  • Feature type (continuous, categorical or mixed).
  • The number of unique values in the feature.
  • If the feature is mixed or continuous, then the 5 smallest and largest values along with the counts.
  • The 5 most common values, and the 5 least common values. If more than 5 values tie for least common, 5 of them are selected at random.

You can also create a binned feature from a continuous or mixed feature by setting the bucket boundaries that the values fall into.
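
To make the idea of bucket boundaries concrete, here is a minimal sketch of how values could be assigned to bins. It is not the platform's implementation: the function name `assign_bin`, the left-closed/right-open interval convention, the "Missing" label, and the choice to keep non-numeric entries of a mixed feature as their own category are all assumptions.

```python
from bisect import bisect_right


def assign_bin(value, boundaries, missing_label="Missing"):
    """Return a bin label for one value, given sorted bucket boundaries.

    Assumption: each bin is closed on the left and open on the right, and
    non-numeric entries (the categorical part of a mixed feature) keep
    their own category.
    """
    if value is None:
        return missing_label
    if isinstance(value, str):
        return value
    index = bisect_right(boundaries, value)
    lower = boundaries[index - 1] if index > 0 else float("-inf")
    upper = boundaries[index] if index < len(boundaries) else float("inf")
    return f"[{lower}, {upper})"


# A mixed feature: numbers plus an "Unknown" category and a blank.
ages = [17, 25, 40, "Unknown", 65, None]
binned = [assign_bin(a, [18, 30, 50]) for a in ages]
```

Three boundaries produce four numeric bins, because the values below the first boundary and above the last one each get a bin of their own.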

If the feature is mixed or continuous, the fuzzy set boundaries for the membership functions are outlined. This displays values for the upper and lower membership functions, which indicate the overlapping regions (where adjacent sets do overlap) and the core regions (when they do not).
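
The relationship between core and overlapping regions can be sketched with a trapezoidal membership function. The shape actually used by the platform is not stated here, so the trapezoid, the function name, and the example boundaries are assumptions chosen only for illustration.

```python
def trapezoid_membership(x, a, b, c, d):
    """Degree (0 to 1) to which x belongs to a fuzzy set.

    The set has support [a, d] and core [b, c], with a <= b <= c <= d.
    Inside the core the membership is 1; between a and b it rises
    linearly, and between c and d it falls linearly.
    """
    if b <= x <= c:
        return 1.0
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)


# Two adjacent sets on a 0 to 100 scale (hypothetical boundaries).
low = (0, 0, 20, 40)       # core 0 to 20, fades out between 20 and 40
high = (20, 40, 100, 100)  # fades in between 20 and 40, core 40 to 100

in_core = (trapezoid_membership(10, *low), trapezoid_membership(10, *high))
in_overlap = (trapezoid_membership(30, *low), trapezoid_membership(30, *high))
```

In this example the value 10 lies in the core of the first set and belongs to it alone, while 30 lies in the overlapping region between 20 and 40 and belongs partly to both sets.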

Finally, the 5 features most strongly correlated with the current feature are displayed, along with their Pearson correlation coefficient and whether each correlation is positive or negative.
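
For readers who want to reproduce this kind of ranking outside the platform, the sketch below computes Pearson coefficients and returns the strongest ones. The function names, the dictionary-of-columns layout, and the decision to treat a constant column as uncorrelated are assumptions made for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant column as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def top_correlated(feature_name, columns, k=5):
    """Return up to k (name, coefficient, direction) tuples, strongest first."""
    results = []
    for name, values in columns.items():
        if name == feature_name:
            continue
        r = pearson(columns[feature_name], values)
        results.append((name, r, "positive" if r >= 0 else "negative"))
    # Rank by strength, ignoring the sign of the coefficient.
    results.sort(key=lambda item: abs(item[1]), reverse=True)
    return results[:k]


# Hypothetical columns, keyed by feature name.
columns = {
    "income": [1, 2, 3, 4],
    "spend": [2, 4, 6, 8],
    "debt": [4, 3, 2, 1],
    "noise": [1, 3, 2, 4],
}
ranking = top_correlated("income", columns)
```

Sorting by the absolute value of the coefficient is what lets a strongly negative correlation rank above a weak positive one.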

Clicking on a feature's data tab will take you to a detailed screen which shows the data in either a bar chart or a histogram, depending on whether the feature is continuous or categorical. If the feature is mixed, there will be bars for both numerical and discrete values.

You can toggle between grid view and chart view in the top left corner. In the chart view, you can hover over a bin or categorical value to see its basic information.

The bars are stacked by quantity and outcome, and the green line over them displays the outcome ratio. If the data is continuous or mixed, you can click on the fuzzy set to see the boundaries. If you hover over a set, it will display the numbers associated with the set boundaries.

In regression projects, there is no stacking of charts, and the green line shows the average value of the instances in a bin.
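
The quantities behind the bars and the green line can be sketched as a simple group-by. The sketch assumes that the "outcome ratio" is the share of positive outcomes in a bin, which for a 0/1 target equals the bin's mean target; that reading, and the function name, are assumptions. For a numeric target the same mean is the bin average described for regression projects.

```python
from collections import defaultdict


def per_bin_summary(bin_labels, targets):
    """Group target values by bin and return {bin: (count, mean_target)}."""
    grouped = defaultdict(list)
    for label, target in zip(bin_labels, targets):
        grouped[label].append(target)
    return {
        label: (len(values), sum(values) / len(values))
        for label, values in grouped.items()
    }


# Binary project: the mean of a 0/1 target is the share of positives per bin.
bins = ["low", "low", "high", "high", "high", "low"]
outcome = [0, 1, 1, 1, 0, 0]
binary_summary = per_bin_summary(bins, outcome)

# Regression project: the mean is the average target value per bin.
regression_summary = per_bin_summary(["a", "a", "b"], [10.0, 20.0, 5.0])
```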

In multiclass projects, the bars are stacked by quantity and outcome, and the green line over them displays the outcome ratio. If the data is continuous or mixed, you can click on the fuzzy set to see the boundaries. If you hover over a set, it will display the numbers associated with the set boundaries.

There are three views to toggle between: stacked (default), scaled, and grouped.

You can visit the Encoding tab to alter the encoding type of the column. The current encoding type is displayed by default.

Encoding options for a continuous column.

Encoding options for a categorical column.

Encoding options for a mixed column.

Clicking on a column header lets you sort the features in ascending or descending order, or return them to their original order.

Hovering over a column name also shows a burger drop-down icon. Clicking on it opens options to filter and sort the columns.

The 2nd tab is used to filter the values of a specific column. If the column displays numerical values, this tab can be used to filter the data based on a logical condition.

The 3rd tab from the options can be used to select the metrics to be displayed.

Creating a new feature set

Click on the New Feature Set button in the top left to create a new feature set. Relevant metrics are displayed alongside the features to assist with your choices. Once you have selected the features you would like in the set, save it using the Save Features button.

As with project names, please give it a unique name.

Once the feature set has been created, you can Rename, Duplicate, or Delete it. The features within it can also be altered after creation using the Edit Features button. New models can be built from this feature set using the New Model button. You can also make it the default feature set for building models using the Set as Default button. Any new models will then use this group of features, unless the default feature set is changed or a different feature set is manually chosen at the modelling stage for those models.

Any stage of a feature selection process may also be used as a feature set to build models upon, or duplicated and modified as above. The feature selection steps themselves cannot be modified or removed.

As of Version 2022.10, you can sort features by their inclusion status by clicking "Included".

You can also edit the feature set as text using the 'Edit as text' button.

Editing as text is helpful for copy-pasting the list of features from an external source (e.g. an Excel file or a Python list), especially if the number of features in your dataset is large.
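
If your list of features lives in Python, a small helper like the one below can turn it into text ready for pasting. The expected format of the text box is not described here, so the one-name-per-line separator is an assumption (it is exposed as a parameter so you can change it), and the function name is hypothetical.

```python
def features_to_text(features, separator="\n"):
    """Join feature names into one text block, dropping blanks and
    duplicates while keeping the original order."""
    seen = set()
    cleaned = []
    for name in features:
        name = name.strip()
        if name and name not in seen:
            seen.add(name)
            cleaned.append(name)
    return separator.join(cleaned)


# Hypothetical feature names, with a stray space, a duplicate and a blank.
selected = ["age", "income ", "age", "", "months_on_book"]
text = features_to_text(selected)
print(text)
```

Cleaning the list before pasting avoids stray whitespace or repeated names, which are easy to pick up when copying from a spreadsheet.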
