Data Overview and Segmentation

Data Source

A top-level view of the project is provided. This contains information on the volume and contents of the uploaded data, as well as any sampling and partitioning that have been performed.

The name of the uploaded dataset is shown, as well as the number of rows and columns. The "Download Uploaded Data" button allows the user to download the dataset.

Sampling details are displayed if sampling has been used. The "Configure Sampling" button allows the user to configure sampling, and if the data has been sampled, a button to download the sampled data appears.

Data is automatically partitioned into three subsets for the purpose of model-building:

  • Training: used to fit the parameters of the model.
  • Validation: used during the training process to prevent the model from overfitting the training data, that is, learning the training subset very well but failing to generalise to future unseen instances. To guard against this, the performance of the model is measured against the validation set at every epoch during training.
  • Testing: used as a final holdout sample to assess the final model's performance on unseen data. The model's performance on this set therefore gives the best indication of how accurate the model is likely to be on unseen data.
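
As a sketch of how such a three-way split works (an illustrative 80/10/10 split, not necessarily the platform's defaults):

```python
import numpy as np

def partition(n_rows, train=0.8, val=0.1, test=0.1, seed=0):
    """Shuffle row indices and split them into training, validation,
    and testing subsets. Proportions here are illustrative."""
    assert abs(train + val + test - 1.0) < 1e-9
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_rows)          # shuffle all row indices once
    n_train = int(n_rows * train)
    n_val = int(n_rows * val)
    return (idx[:n_train],                 # training rows
            idx[n_train:n_train + n_val],  # validation rows
            idx[n_train + n_val:])         # testing rows

train_idx, val_idx, test_idx = partition(1000)
# 800 / 100 / 100 rows respectively
```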

The "Configure Partitions" button allows the user to modify the partitioning settings. If models have been built, the partitioning is fixed and the button is greyed out.

Columns in the uploaded file are displayed on the right side of the page, along with their data type and number of unique values.

Sampling

If the dataset has more records than the sampling threshold, you must sample the dataset before proceeding further. The sampling threshold defaults to 500,000 rows and can be edited in the user profile settings.

In the first step, you must decide on a Sample Size. This may be 100%, but it must be declared explicitly at that amount so that a very large dataset is not accidentally processed in the platform.

In the second step, you must pick the individual sample sizes of the target classes. These default to the class distribution of the data, but can be adjusted if you would like to under-sample the majority class. Any split is permitted as long as the class proportions sum to 100% and the requested count for each class does not exceed the number of rows of that class in the raw file.
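
A per-class sample along these lines can be sketched as follows (function and parameter names are illustrative, not the platform's API):

```python
import numpy as np

def stratified_sample(labels, class_counts, seed=0):
    """Draw a per-class sample: class_counts maps each target class to
    the number of rows to keep. Raises if a class is over-requested."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    keep = []
    for cls, n in class_counts.items():
        cls_idx = np.flatnonzero(labels == cls)
        if n > len(cls_idx):
            raise ValueError(f"requested {n} rows of class {cls!r}, "
                             f"only {len(cls_idx)} available")
        # sample without replacement within the class
        keep.append(rng.choice(cls_idx, size=n, replace=False))
    return np.concatenate(keep)

labels = ["pos"] * 100 + ["neg"] * 900
# under-sample the majority "neg" class to a 50/50 split
idx = stratified_sample(labels, {"pos": 100, "neg": 100})
```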

Partitioning

After the dataset has been processed, you can set the proportion of the records to be segmented into training, validation, and testing populations.

"Shuffling" will randomise the instance order of the segments, and "Balance" will ensure that the target classes are evenly distributed across the training, validation and test sets, weighted by the data-split. "Random" reshuffles the data every time a model is built.

If Shuffling is set to No, Balance is automatically set to No.
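
The combined effect of Shuffling and Balance can be sketched as a stratified split: each target class is shuffled and partitioned separately, so every subset keeps the overall class distribution (names and proportions below are illustrative):

```python
import numpy as np

def balanced_partition(labels, train=0.8, val=0.1, seed=0):
    """Shuffle and split each target class separately so that the
    training, validation, and testing sets all keep the class
    distribution of the full dataset. A sketch, not the platform code."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    parts = {"train": [], "validation": [], "testing": []}
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_tr = int(len(idx) * train)
        n_va = int(len(idx) * val)
        parts["train"].append(idx[:n_tr])
        parts["validation"].append(idx[n_tr:n_tr + n_va])
        parts["testing"].append(idx[n_tr + n_va:])
    return {k: np.concatenate(v) for k, v in parts.items()}

splits = balanced_partition(["a"] * 500 + ["b"] * 500)
```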

You may "Use Testing data for validation" if you have an out-of-sample testing set external to the platform. When only a small number of records is available, you may also want to use the testing data for validation to ensure there are sufficient instances of the minority class; however, this is not advisable if it can be avoided.

Still need help? Contact Us