Create New Project

Uploading

To start a new project, click the button on the top left of the landing page.

You then begin the four-step process of setting up a new project:

  1. Data Source
  2. Columns
  3. Intelligence Task
  4. Project Name

First, you must define the data source, which can be done by uploading a new file from your device or choosing an existing file.

You can upload delimited text files, Excel spreadsheets, and zipped data files, up to a maximum size of 1 GB.

After a file has been selected, it is processed to extract information about the data for display, such as the inter-feature correlations used in the Audit stage to decide which features to include or exclude.

Errors during processing are commonly caused by the following:

  • More data columns than column headers. A common example is an unnamed index column at the start of each row (especially if you use Python to generate the file).
  • Stray delimiters within a cell, which shift values into the wrong columns.
  • Delimiters contained within the data. Please ensure that any cell containing a delimiter is enclosed in double quotes.
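
The checks above can be run locally before uploading. The following is a minimal sketch using only Python's standard `csv` module; the function names `find_misaligned_rows` and `write_quoted_csv` are illustrative and not part of the platform. If you generate the file with pandas, passing `index=False` to `DataFrame.to_csv` omits the index column.

```python
import csv
import io


def find_misaligned_rows(text, delimiter=","):
    """Return the row numbers (header = row 1) whose field count differs from the header."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    misaligned = []
    for row_number, row in enumerate(reader, start=2):
        if len(row) != len(header):
            misaligned.append(row_number)
    return misaligned


def write_quoted_csv(rows, delimiter=","):
    """Serialise rows to delimited text, double-quoting any cell that contains the delimiter."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, quoting=csv.QUOTE_MINIMAL)
    writer.writerows(rows)
    return buffer.getvalue()


# A stray delimiter in row 3 produces one more field than the header has.
broken = "name,note\nalice,ok\nbob,hello, world\n"
print(find_misaligned_rows(broken))

# Writing the same data through csv.writer quotes the affected cell.
fixed = write_quoted_csv([["name", "note"], ["alice", "ok"], ["bob", "hello, world"]])
print(find_misaligned_rows(fixed))
```

A non-empty result from `find_misaligned_rows` points to rows that are likely to be shifted on upload.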

Column selection

After the dataset has been uploaded or imported, the platform displays a list of all the columns within it.

Here you can select the columns you would like included in your project.

If there are no valid columns for the relevant output type, the process will be halted. There must be at least one binary column for a binary prediction, or one continuous column for a regression task.

The columns can also be edited in a text box if you have a predefined list of columns to include. The Reset button will repopulate the text box so that all columns in the dataset are included.

Intelligence task

Choose the feature that is the target (the output) for this intelligence task.

If the chosen problem type is binary classification, the drop-down list will show all the columns that have exactly two unique values.

If the chosen problem type is regression, the drop-down list will only show the columns with continuous values.

If the chosen problem type is multiclass, the drop-down list will show all the columns that have more than two unique values.

By default, the very last valid target column will be pre-selected.
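
The selection rules above can be sketched as follows. This is an illustration only, not the platform's implementation: the function names are hypothetical, and the test for "continuous" (all values numeric, with more unique values than a hypothetical `categorical_threshold`) is an assumption, since the article does not define it.

```python
def _is_number(value):
    """True if the value can be parsed as a float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def valid_targets(columns, task, categorical_threshold=10):
    """List candidate target columns for a task.

    columns: dict mapping column name -> list of values, in dataset order.
    categorical_threshold: hypothetical default, used only for the assumed
    definition of a continuous column.
    """
    candidates = []
    for name, values in columns.items():
        unique = set(values)
        if task == "binary":
            ok = len(unique) == 2
        elif task == "multiclass":
            ok = len(unique) > 2
        elif task == "regression":
            # Assumption: continuous = numeric with many distinct values.
            ok = all(_is_number(v) for v in unique) and len(unique) > categorical_threshold
        else:
            raise ValueError(f"unknown task: {task}")
        if ok:
            candidates.append(name)
    return candidates


def default_target(columns, task, **kwargs):
    """Return the last valid target column, or None if there is none."""
    candidates = valid_targets(columns, task, **kwargs)
    return candidates[-1] if candidates else None


data = {
    "age": [23, 35, 41, 58, 62],
    "churned": ["yes", "no", "no", "yes", "no"],
    "plan": ["a", "b", "c", "a", "b"],
    "active": [1, 0, 1, 1, 0],
}
print(valid_targets(data, "binary"))
print(default_target(data, "binary"))
```

When `default_target` returns `None`, there is no valid column for the chosen task, which corresponds to the case where project creation is halted.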

After the project has been processed, information and statistics on the target can be viewed under "Intelligence Task" on the sidebar. 

When building a regression project, there are no classes.

When building a multiclass project, the information and statistics of all the Target Classes can be viewed under "Intelligence Task".

Project name 

Give your project a name. It cannot be the same as any other currently existing project.

The default project name suggested by the platform will be in the format "<Filename> - <Target Name>".

The categorical threshold is the maximum number of unique numerical values a field can have and still be treated as categorical.

The split value threshold is the maximum share of occurrences, as a percentage, that a category can have and still be grouped into "Other" in the One-Hot encoding.
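
As a rough illustration of these two thresholds, the sketch below classifies a field as categorical and groups rare categories into "Other". The function names are hypothetical, and treating both thresholds as inclusive (`<=`) is an assumption; the platform's exact comparison is not stated here.

```python
from collections import Counter


def is_categorical(values, categorical_threshold):
    """A field is treated as categorical if it has at most this many unique values."""
    return len(set(values)) <= categorical_threshold


def group_rare_categories(values, split_value_threshold_pct, other_label="Other"):
    """Replace every category whose share of rows is at or below the threshold with 'Other'."""
    counts = Counter(values)
    total = len(values)
    rare = {
        category
        for category, count in counts.items()
        if 100.0 * count / total <= split_value_threshold_pct
    }
    return [other_label if value in rare else value for value in values]


colours = ["red"] * 90 + ["blue"] * 8 + ["teal"] * 2
print(is_categorical(colours, categorical_threshold=5))
print(Counter(group_rare_categories(colours, split_value_threshold_pct=5)))
```

In this example "teal" covers 2% of the rows, so with a 5% threshold it is merged into "Other", while "blue" (8%) keeps its own One-Hot column.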

The correlation threshold is the minimum correlation, expressed as a percentage, at which features will be flagged as being too correlated with each other in the Audit stage after a project has been created.
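
To show how such a threshold behaves, here is a minimal sketch that flags feature pairs whose correlation meets a percentage threshold. It assumes the Pearson coefficient and compares its absolute value; the correlation measure the platform actually uses is not stated in this article, and the function names are hypothetical.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: correlation is undefined, treated as 0 here
    return cov / math.sqrt(var_x * var_y)


def flag_correlated(features, correlation_threshold_pct):
    """Return the pairs of feature names whose |correlation| * 100 reaches the threshold."""
    flagged = []
    for (name_a, xs), (name_b, ys) in combinations(features.items(), 2):
        if abs(pearson(xs, ys)) * 100 >= correlation_threshold_pct:
            flagged.append((name_a, name_b))
    return flagged


features = {
    "x": [1, 2, 3, 4, 5],
    "double_x": [2, 4, 6, 8, 10],
    "noise": [3, 1, 4, 1, 5],
}
print(flag_correlated(features, correlation_threshold_pct=90))
```

Lowering the threshold flags more pairs; raising it flags only near-duplicate features.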

If the dataset is large, generating the statistics from a subset of the population will significantly reduce the processing time before the project is available for building models on the platform. The subset size field is only visible if the checkbox is ticked.

Regression Sampling

The "regression sampling" algorithm aims to sample the original data in such a way that the distribution of the sampled output is almost uniform. Experimental testing performed by our team has shown that, in some cases, the resulting dataset (and the rules/dominances derived from it) can offer a significant improvement when modelling regression problems using Fuzzy Logic models.

The algorithm requires two parameters:
- Number of bins to divide the output distribution (H).
- Number of desired samples per bin (N).

The algorithm is quite simple and works as follows:
- Divide the dataset into a histogram over the output values, containing as many equally spaced bins as specified by H.
- From each bin, draw as many instances as specified by the parameter N.
    - If the bin contains more instances than N, then a simple sampling without replacement is performed.
    - If the bin contains fewer instances than N, then all of them are included at least once; the remaining samples are randomly drawn from the bin again with replacement.

The resulting dataset is then utilised as the main file in the project.
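
The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the platform's code: the function name `regression_sample`, the row format (a list of dicts), the `seed` parameter, and the handling of empty bins (they are skipped) are all choices made for this example.

```python
import random


def regression_sample(rows, target, num_bins, per_bin, seed=None):
    """Resample rows so the target is roughly uniform across equally spaced bins.

    rows: list of dicts; target: key of the numeric output column;
    num_bins: H; per_bin: N.
    """
    rng = random.Random(seed)
    values = [row[target] for row in rows]
    low, high = min(values), max(values)
    width = (high - low) / num_bins

    # Assign each row to one of H equally spaced bins over the target range.
    bins = [[] for _ in range(num_bins)]
    for row in rows:
        if width == 0:
            index = 0  # all target values identical
        else:
            index = min(int((row[target] - low) / width), num_bins - 1)
        bins[index].append(row)

    sample = []
    for members in bins:
        if not members:
            continue  # assumption: an empty bin contributes nothing
        if len(members) > per_bin:
            # More than N instances: sample without replacement.
            sample.extend(rng.sample(members, per_bin))
        else:
            # N or fewer: keep every instance once, then top up with replacement.
            sample.extend(members)
            sample.extend(rng.choices(members, k=per_bin - len(members)))
    return sample


rows = [{"y": i} for i in range(10)]
resampled = regression_sample(rows, "y", num_bins=5, per_bin=3, seed=0)
print(len(resampled))
```

When no bin is empty, the result contains exactly `num_bins * per_bin` rows; under the empty-bin assumption made here, it contains fewer otherwise.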


To use regression sampling, select 'Use uniform sampling for this project' on the Project Name tab, then enter the parameters "Number of bins for the sampling" (H) and "Number of instances per bin" (N). With the default parameters of H = 20 and N = 500, the new dataset will contain 20 × 500 = 10,000 instances.

After the project creation step begins, the page will redirect to the landing page and display the new project.
