Create a Dataset


This topic describes how to create datasets.

Create Datasets from External Data Sources

The dataset types supported by each data source connection are as follows:

Data Source Connection    Supported Dataset Types
----------------------    -----------------------
MySQL                     Tabular Dataset
Hive                      Tabular Dataset
Blob                      Tabular Dataset (Delimited Text or ORC file format), File Dataset
S3                        Tabular Dataset (Delimited Text or ORC file format), File Dataset
HDFS                      Tabular Dataset (Delimited Text or ORC file format), File Dataset
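
The mapping above can be expressed as a small lookup, which may be handy when validating a dataset request before opening the console. This is a minimal sketch only: the names `SUPPORTED_DATASET_TYPES`, `TABULAR_FILE_FORMATS`, and `is_supported` are hypothetical and are not part of any EnOS API.

```python
# Hypothetical lookup that mirrors the table of supported dataset types.
# None of these names come from EnOS; they exist only for illustration.
SUPPORTED_DATASET_TYPES = {
    "MySQL": {"tabular"},
    "Hive": {"tabular"},
    "Blob": {"tabular", "file"},
    "S3": {"tabular", "file"},
    "HDFS": {"tabular", "file"},
}

# File formats accepted for tabular datasets built from Blob, S3, or HDFS.
TABULAR_FILE_FORMATS = {"Delimited Text", "ORC"}


def is_supported(source: str, dataset_type: str) -> bool:
    """Return True if the data source connection supports the dataset type."""
    return dataset_type in SUPPORTED_DATASET_TYPES.get(source, set())
```

For example, `is_supported("Hive", "file")` is false because Hive connections support tabular datasets only.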

Create Datasets from MySQL or Hive Data Sources

To create a dataset from a MySQL or Hive data source connection (using a MySQL data source as an example):

  1. Log in to the EnOS Management Console and select Enterprise Analytics Platform > Machine Intelligence Studio > Dataset Management.

  2. Click New Dataset > Create from Data Sources and complete the following basic information for the dataset:

    • Dataset Name: Enter the name of the dataset.

    • Dataset Alias: Enter the alias of the dataset, which is displayed as the dataset name in the dataset list.

    • Data Source: Select the name of a data source connection registered on the Resource Configuration > Connection Configuration page. The system checks the connection automatically.

    • Dataset Type: Select tabular.

    • Tags: Enter one or more tags for the dataset. Tags can be in Chinese or English and can be used to search for datasets in the dataset list.

    • Description: Enter a description of the dataset.

    ../_images/creating_dataset_1.png
  3. On the Data Source page, enter the SQL query statement and the query timeout value (1 to 600 seconds). Note that a Tabular dataset created from a data source connection supports only a single SQL query statement.

    ../_images/creating_dataset_2.png
  4. On the Data Preview page, click Preview Data to view the query results. Only the first 50 records of the query result are displayed.

    ../_images/creating_dataset_3.png
  5. On the SCHEMA Settings page, specify the alias, attributes, data type, and description of each data field as needed. To restore the default Schema settings, click Reset.

    ../_images/creating_dataset_4.png
  6. On the Confirmation page, check the completeness of the dataset configuration. Click Finish to create the dataset. The created dataset will be displayed in the dataset list.

    ../_images/creating_dataset_5.png
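
Before pasting a query into the Data Source page, the two constraints from step 3 (a single SQL statement and a timeout of 1 to 600 seconds) can be checked locally. The sketch below is illustrative only: `validate_query_settings` is a hypothetical helper, not an EnOS function, and the single-statement check is deliberately naive (it rejects any semicolon that is not trailing, including one inside a string literal).

```python
def validate_query_settings(sql: str, timeout_seconds: int) -> None:
    """Raise ValueError if the settings break the limits described in step 3.

    Hypothetical helper for illustration; the console performs its own checks.
    """
    # The documented timeout range is 1 to 600 seconds, inclusive.
    if not 1 <= timeout_seconds <= 600:
        raise ValueError("query timeout must be between 1 and 600 seconds")

    # Drop surrounding whitespace and any trailing semicolons.
    statement = sql.strip().rstrip(";").strip()
    if not statement:
        raise ValueError("SQL query statement is empty")

    # Naive check: a remaining semicolon suggests more than one statement.
    # This also rejects semicolons inside string literals.
    if ";" in statement:
        raise ValueError("only a single SQL query statement is allowed")


# Passes silently: one statement, timeout within range.
validate_query_settings("SELECT id, value FROM sensor_data LIMIT 100;", 60)
```

The table name `sensor_data` is a made-up example; substitute a query that is valid for your own data source.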

Create Datasets from Blob, S3, or HDFS Data Sources

To create a dataset from a Blob, S3, or HDFS data source connection (using an HDFS data source to create a File dataset as an example):

  1. Log in to the EnOS Management Console and select Enterprise Analytics Platform > Machine Intelligence Studio > Dataset Management.

  2. Click New Dataset > Create from Data Sources and complete the following basic information for the dataset:

    • Dataset Name: Enter the name of the dataset.

    • Dataset Alias: Enter the alias of the dataset, which is displayed as the dataset name in the dataset list.

    • Data Source: Select the name of a data source connection registered on the Resource Configuration > Connection Configuration page. The system checks the connection automatically.

    • Dataset Type: Select file.

    • Tags: Enter one or more tags for the dataset. Tags can be in Chinese or English and can be used to search for datasets in the dataset list.

    • Description: Enter a description of the dataset.

    ../_images/creating_file_dataset_1.png
  3. On the Data Source page, enter the path of the file to be used.

    ../_images/creating_file_dataset_2.png
  4. On the Confirmation page, check the completeness of the dataset configuration. Click Finish to create the dataset. The created dataset will be displayed in the dataset list.

    ../_images/creating_file_dataset_3.png

Create Datasets from Local Files

To create a dataset from a local file:

  1. Log in to the EnOS Management Console and select Enterprise Analytics Platform > Machine Intelligence Studio > Dataset Management.

  2. Click New Dataset > Create from Local Files and complete the following basic information for the dataset:

    • Dataset Name: Enter the name of the dataset.

    • Dataset Alias: Enter the alias of the dataset, which is displayed as the dataset name in the dataset list.

    • Dataset Type: Select tabular or file:

      • tabular type: Select the corresponding file type (Delimited Text or ORC) and make sure that the uploaded files can be parsed correctly.

      • file type: No file type selection is needed.

    • Tags: Enter one or more tags for the dataset. Tags can be in Chinese or English and can be used to search for datasets in the dataset list.

    • Description: Enter a description of the dataset.

  3. On the Upload File page, select one or more files to upload for creating the dataset. The selected files are displayed in the file list with their file names and sizes.

  4. After uploading the files, complete the file configuration, data preview, Schema settings, and confirmation steps according to the selected dataset type. For detailed steps, see Create Datasets from External Data Sources.

Note

  • Uploading new files overwrites the files that were uploaded previously. To add more files after a successful upload, upload all the files again.

  • A single batch of uploaded files must not exceed 1 GB, and the total uploaded file size must not exceed 10 GB. For larger files, consider uploading them to HDFS, Blob, or S3 storage and creating the dataset from that data source.
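
The size limits in the note can be checked before an upload is attempted. The following is a rough sketch under two assumptions that the source does not confirm: "GB" is read as 2**30 bytes, and the 10 GB limit is read as the sum over all batches. The function name `check_upload_plan` is hypothetical.

```python
GIB = 1024 ** 3  # assumption: "GB" in the note is read as 2**30 bytes

MAX_BATCH_BYTES = 1 * GIB    # limit for a single batch of uploaded files
MAX_TOTAL_BYTES = 10 * GIB   # limit for the total uploaded file size


def check_upload_plan(batches):
    """Return a list of problems for a planned upload.

    `batches` is a list of batches; each batch is a list of file sizes in bytes.
    An empty result means the plan fits both documented limits.
    """
    problems = []
    for index, sizes in enumerate(batches, start=1):
        batch_total = sum(sizes)
        if batch_total > MAX_BATCH_BYTES:
            problems.append(
                f"batch {index} is {batch_total} bytes, over the 1 GB batch limit"
            )

    # Assumption: the 10 GB limit applies to the sum of all batches.
    grand_total = sum(sum(sizes) for sizes in batches)
    if grand_total > MAX_TOTAL_BYTES:
        problems.append(
            f"total size is {grand_total} bytes, over the 10 GB total limit"
        )
    return problems
```

In practice the file sizes could be collected with `os.path.getsize` for each local file before calling the function.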

Create Datasets from Operator Output/Input

To create a dataset from an operator input or output file:

  1. Log in to the EnOS Management Console and select Enterprise Analytics Platform > Machine Intelligence Studio > Dataset Management.

  2. Click New Dataset > Create from input/output files of operators and complete the following basic information for the dataset:

    • Dataset Name: Enter the name of the dataset.

    • Dataset Alias: Enter the alias of the dataset, which is displayed as the dataset name in the dataset list.

    • Dataset Type: Select tabular or file:

      • tabular type: Select the corresponding file type (Delimited Text or ORC) and make sure that the files can be parsed correctly.

      • file type: No file type selection is needed.

    • Tags: Enter one or more tags for the dataset. Tags can be in Chinese or English and can be used to search for datasets in the dataset list.

    • Description: Enter a description of the dataset.

  3. On the File Selection page, enter the MinIO path of the file. For information about where to find the MinIO path, see View the Basic Information and Details of Running Instances.

  4. On the File Configuration page, set the column delimiter, character set, escape character, quote character, and other parsing options.

  5. On the Data Preview page, click Preview Data to view the parsed data. Only the first 50 records are displayed.

  6. On the SCHEMA Settings page, specify the alias, attributes, data type, and description of each data field as needed. To restore the default Schema settings, click Reset.

  7. On the Confirmation page, check the completeness of the dataset configuration. Click Finish to create the dataset. The created dataset will be displayed in the dataset list.
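
To see how the File Configuration options in step 4 affect parsing, a delimited text file can be previewed locally with Python's standard `csv` module. This is only a sketch of the idea, not how the platform parses files: `preview_delimited_text` and its parameter names are hypothetical, and the 50-row cap simply mirrors the preview limit in step 5 (the header line, if any, counts as a row here).

```python
import csv
import io
from itertools import islice


def preview_delimited_text(raw, delimiter=",", charset="utf-8",
                           escapechar=None, quotechar='"', limit=50):
    """Parse delimited text from bytes and return at most `limit` rows.

    Hypothetical helper: the parameters correspond loosely to the column
    delimiter, character set, escape character, and quote character settings.
    """
    text = raw.decode(charset)
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter=delimiter,
        quotechar=quotechar,
        escapechar=escapechar,
    )
    # islice stops after `limit` rows without reading the rest of the input.
    return list(islice(reader, limit))


# A pipe-delimited sample in which one quoted field contains the delimiter.
sample = b'id|name\n1|"a|b"\n2|c\n'
rows = preview_delimited_text(sample, delimiter="|")
```

If the preview shows fields split in the wrong places, the delimiter or quote character is usually the setting to revisit.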