Create New Project
Uploading
To start a new project, click the button on the top left of the landing page.
You then begin the four-step process of setting up a new project:
- Data Source
- Columns
- Intelligence Task
- Project Name
First, you must define the data source, which can be done by uploading a new file from your device or choosing an existing file.
You can upload delimited text files, Excel spreadsheets, and zipped data files up to a maximum size of 1 GB.
Once a file has been selected, it is processed to extract information about the data for display, such as the inter-feature correlations used to audit which features to include or exclude.
Errors during processing commonly arise for the following reasons:
- There are more data columns than column headers. A common example is an index column at the start of each row (especially if you use Python to generate the file).
- Stray delimiters within a cell shift the following columns out of alignment.
- Cells that contain the delimiter are not double-quoted. If delimiters appear within the data, please ensure that those cells are double-quoted.
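The checks above can be run locally before uploading. The following is a minimal sketch using only Python's standard `csv` module; the helper names `write_delimited` and `find_ragged_rows` are hypothetical and are not part of the platform. The first helper writes a file in which any cell containing the delimiter is double-quoted, and the second reports rows whose cell count differs from the header.

```python
import csv
import io


def write_delimited(rows, header, delimiter=","):
    """Serialise rows to delimited text, double-quoting cells that need it."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        delimiter=delimiter,
        quotechar='"',
        quoting=csv.QUOTE_MINIMAL,  # quote only cells containing special characters
        lineterminator="\n",
    )
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def find_ragged_rows(text, delimiter=","):
    """Return (line_number, cell_count) for rows whose width differs from the header."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, quotechar='"')
    header = next(reader)
    return [(reader.line_num, len(row)) for row in reader if len(row) != len(header)]


# A cell containing the delimiter is quoted on output, so the row stays aligned.
good = write_delimited([["Smith, J", 42]], ["name", "age"])
print(find_ragged_rows(good))

# The same cell without quotes shifts the columns and is reported.
print(find_ragged_rows("name,age\nSmith, J,42\n"))
```

An empty list means every row has the same number of cells as the header. Line numbers are reliable only when cells do not contain embedded line breaks.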
Column selection
After the dataset has been uploaded or imported, the platform displays a list of all the columns within it.
Here you can select the columns you would like included in your project.
If there are no valid columns for the relevant output type, the process will be halted. There must be at least one binary column for a binary prediction, or one continuous column for a regression task.
The columns can also be edited in a text box if you have a predefined list of columns to include. The Reset button will repopulate the text box so that all columns in the dataset are included.
Intelligence task
Choose the feature that is the target (the output) for this particular intelligence task.
For a binary classification task, the drop-down list shows all the columns that have exactly two unique values.
For a regression task, the drop-down list shows only the columns with continuous values.
For a multiclass task, the drop-down list shows all the columns that have more than two unique values.
By default, the very last valid target column will be pre-selected.
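The selection rules above can be sketched as follows. This is an illustration only, not the platform's implementation: the function names are hypothetical, and the test for "continuous" (a numeric column with more unique values than a hypothetical `categorical_threshold`) is an assumption.

```python
def valid_targets(columns, problem_type, categorical_threshold=10):
    """Return the names of columns eligible as targets, in dataset order.

    `columns` maps column name -> list of values. The definition of
    "continuous" used for regression is an assumption made for this sketch.
    """
    valid = []
    for name, values in columns.items():
        unique = set(values)
        if problem_type == "binary":
            eligible = len(unique) == 2
        elif problem_type == "multiclass":
            eligible = len(unique) > 2
        elif problem_type == "regression":
            numeric = all(
                isinstance(v, (int, float)) and not isinstance(v, bool)
                for v in unique
            )
            eligible = numeric and len(unique) > categorical_threshold
        else:
            raise ValueError(f"unknown problem type: {problem_type!r}")
        if eligible:
            valid.append(name)
    return valid


def default_target(columns, problem_type, categorical_threshold=10):
    """Pre-select the last valid target; fail if there is none."""
    valid = valid_targets(columns, problem_type, categorical_threshold)
    if not valid:
        raise ValueError(f"no valid target column for a {problem_type} task")
    return valid[-1]


columns = {
    "income": [10.5, 20.0, 31.2, 45.9, 52.3],
    "churned": [0, 1, 0, 1, 1],
    "plan": ["a", "b", "c", "a", "b"],
    "active": ["y", "n", "y", "y", "n"],
}
print(valid_targets(columns, "binary"))    # ['churned', 'active']
print(default_target(columns, "binary"))   # 'active'
```

Raising an error when no column qualifies mirrors the behaviour described earlier, where the process halts if there is no valid column for the chosen output type.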
After the project has been processed, information and statistics on the target can be viewed under "Intelligence Task" on the sidebar.
When building a regression project, there are no classes.
When building a multiclass project, the information and statistics of all the Target Classes can be viewed under "Intelligence Task".
Project name
Give your project a name. The name must be unique: it cannot be the same as that of any existing project.
The default project name suggested by the platform will be in the format "<Filename> - <Target Name>".
The categorical threshold is the maximum number of unique values a numerical field can have and still be treated as categorical.
The split value threshold is the maximum percentage of occurrences a category can have and still be grouped into "Other" in the one-hot encoding.
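As an illustration of how such a threshold could work, the sketch below groups infrequent categories into "Other" before one-hot encoding. The function name `one_hot_with_other` is hypothetical, and treating the threshold as inclusive (a category at exactly the threshold is grouped) is an assumption rather than documented platform behaviour.

```python
from collections import Counter


def one_hot_with_other(values, split_threshold_pct):
    """One-hot encode `values`, grouping infrequent categories into "Other".

    A category is grouped when its share of rows is at or below
    `split_threshold_pct` (the inclusive boundary is an assumption).
    """
    counts = Counter(values)
    total = len(values)
    kept = sorted(
        category
        for category, n in counts.items()
        if 100 * n / total > split_threshold_pct
    )
    # Add the "Other" column only if at least one category was grouped.
    labels = kept + (["Other"] if len(kept) < len(counts) else [])
    rows = []
    for value in values:
        label = value if value in kept else "Other"
        rows.append([1 if label == column else 0 for column in labels])
    return labels, rows


values = ["a"] * 6 + ["b"] * 3 + ["c"]
labels, rows = one_hot_with_other(values, split_threshold_pct=10)
print(labels)    # ['a', 'b', 'Other']
print(rows[-1])  # [0, 0, 1]
```

Here "c" accounts for 10% of the rows, so with a threshold of 10 it is encoded in the "Other" column instead of receiving a column of its own.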
The correlation threshold is the minimum correlation, expressed as a percentage, at which features are flagged as too strongly correlated with each other in the Audit stage after a project has been created.
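A rough sketch of this kind of check is shown below. It assumes Pearson correlation and compares the absolute value against the threshold; the measure the platform actually uses is not stated here, and the function names are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant column has no defined correlation; treat as 0
    return cov / (spread_x * spread_y)


def flag_correlated(features, correlation_threshold_pct):
    """Return (name_a, name_b, r) for pairs at or above the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(features.items(), 2):
        r = pearson(a, b)
        if abs(r) * 100 >= correlation_threshold_pct:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


features = {
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],    # exactly 2 * x
    "z": [1, -1, 1, -1],
}
print(flag_correlated(features, correlation_threshold_pct=90))
```

With a threshold of 90, only the pair ("x", "y") is flagged, because "y" is an exact multiple of "x" while "z" is only weakly correlated with either.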
If the dataset is large, generating the statistics from a subset of the population will significantly reduce the processing time before the project is available for building models on the platform. The subset size field is visible only when the checkbox is ticked.
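The idea of computing statistics on a subset can be sketched as follows. This is an illustration under assumed behaviour (uniform random sampling without replacement, with a hypothetical `subset_size` parameter); the platform's sampling method is not described here.

```python
import random
import statistics


def sample_statistics(values, subset_size, seed=0):
    """Summary statistics from a random subset of `values`.

    Falls back to the full data when `subset_size` is not smaller than it.
    Uniform sampling without replacement is an assumption of this sketch.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same subset
    if subset_size >= len(values):
        sample = list(values)
    else:
        sample = rng.sample(list(values), subset_size)
    return {
        "n": len(sample),
        "mean": statistics.fmean(sample),
        "stdev": statistics.stdev(sample),
    }


population = list(range(1000))
subset_stats = sample_statistics(population, subset_size=100)
full_stats = sample_statistics(population, subset_size=5000)
print(subset_stats["n"], full_stats["n"])  # 100 1000
```

The subset statistics approximate those of the full population while touching only a fraction of the rows, which is where the reduction in processing time comes from.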
Regression Sampling
After the project creation step begins, the page redirects to the landing page and displays the new project.