Unit 1. Preparing Data

Before training a machine learning model, you need to prepare the sample data that the model will be trained on.

Prerequisites

Before you start, upload the wind power algorithm files and the weather data files collected from the wind farm to EnOS HDFS, so that the data tables required for model training can be created from them. In this tutorial, the HDFS path for storing the files is hdfs://azbeta/user/hive/warehouse/data_o15632609593521.db/kongmingmldemo/new/.
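
The Hive statements in this tutorial run against the database that owns the data_o15632609593521.db warehouse directory. As a minimal sketch (assuming the Hive database name matches that directory name, which is not stated in this tutorial), you can switch to it before creating any tables:

use data_o15632609593521; -- assumed database name, inferred from the .db warehouse directory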

Prepare Wind Power Algorithm Data

Map the files uploaded to EnOS HDFS to a Hive partitioned table, using the wind farm ID (masterid) as the partition key to simulate the actual scenario, by following these steps.

  1. Use the powers and speeds files to create a Hive partitioned table (enos_power_speeds) that is partitioned by site (masterid) and type. For the powers type, value is the power; for the speeds type, value is the wind speed.

    create external table enos_power_speeds(utcTimeStamp timestamp, value double)
    partitioned by (masterid string,type string)
    row format delimited fields terminated by ','
    lines terminated by '\n'
    location 'hdfs://azbeta/user/hive/warehouse/data_o15632609593521.db/kongmingmldemo/new'
    tblproperties ("skip.header.line.count"="1");
    
  2. Add the partitions:

    alter table enos_power_speeds add partition(masterid='CGNWF0046',type='powers')
    location 'hdfs://azbeta/user/hive/warehouse/data_o15632609593521.db/kongmingmldemo/new/CGNWF0046/powers';
    
    alter table enos_power_speeds add partition(masterid='CGNWF0046',type='speeds')
    location 'hdfs://azbeta/user/hive/warehouse/data_o15632609593521.db/kongmingmldemo/new/CGNWF0046/speeds';
    
  3. Specify the batch data processing queue name in Hive (requested through resource management):

    set mapreduce.job.queuename=root.test_enos_01;
    
  4. Check and verify the data (a further spot-check sketch follows this list):

    select count(1) from enos_power_speeds where masterid='CGNWF0046' and type='powers';
    
  5. Use the file speed_to_power.csv to create the table speed_to_power:

    create external table speed_to_power(speed double,power double)
    row format delimited fields terminated by ','
    lines terminated by '\n'
    location 'hdfs://azbeta/user/hive/warehouse/data_o15632609593521.db/kongmingmldemo/new/speed_to_power'
    tblproperties ("skip.header.line.count"="1");
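
Optionally, you can spot-check a few rows of both tables to confirm that the CSV columns map onto the table schemas as expected; this is the sketch referenced in step 4 (the limit values are arbitrary):

select * from enos_power_speeds where masterid='CGNWF0046' and type='speeds' limit 10;
select * from speed_to_power limit 10;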
    

Prepare External Weather Data

Create an external weather data table by following these steps:

  1. Create the table external_weather, partitioned by weather data source type (EC and GFS):

    create external table external_weather(utcTimeStamp timestamp, pres double,tmp double,wd double,ws double, weatherpublishstarttime timestamp)
    partitioned by (type string)
    row format delimited fields terminated by ','
    lines terminated by '\n'
    location 'hdfs://azbeta/user/hive/warehouse/data_o15632609593521.db/kongmingmldemo/new/112.647/21.938'
    tblproperties ("skip.header.line.count"="1");
    
  2. Add the partitions:

    alter table external_weather add partition(type='EC')
    location 'hdfs://azbeta/user/hive/warehouse/data_o15632609593521.db/kongmingmldemo/new/112.647/21.938/EC';
    
    alter table external_weather add partition(type='GFS')
    location 'hdfs://azbeta/user/hive/warehouse/data_o15632609593521.db/kongmingmldemo/new/112.647/21.938/GFS';
    
  3. Group and aggregate the weather data by type and timestamp:

    create table external_weather1 as
    select
      utctimestamp,
      avg(pres) as pres,
      avg(tmp) as tmp,
      avg(wd) as wd,
      avg(ws) as ws,
      max(weatherpublishstarttime) as weatherpublishstarttime,
      type
    from external_weather
    group by type, utctimestamp;
    
  4. Check whether the aggregation took effect (a per-type row-count sketch also follows this list):

    select * from external_weather1 where utcTimeStamp='2018-07-03';
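
As mentioned in step 4, a quick way to confirm that both the EC and GFS partitions were aggregated is to count the rows per weather source (optional sketch):

select type, count(1) as row_count from external_weather1 group by type;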
    

Integrate Final Data

Create the following Hive table (kmmlds2) to integrate the data:

create table kmmlds2 as
select
  t1.utctimestamp as X_basic_time,
  hour(t1.utctimestamp) as X_basic_hour,
  current_timestamp as X_basic_forecast_time,
  (row_number() over(partition by 1) - 1) % 49 as X_basic_horizon,
  datediff(t1.utctimestamp,'2018-07-01') as i_set,
  t3.weatherpublishstarttime as EC_nwp_time,
  t3.ws as EC_ws,
  t3.wd as EC_wd,
  t3.tmp as EC_tmp,
  t3.pres as EC_press,
  t3.rho as EC_rho,
  t3.dist as EC_dist,
  t4.weatherpublishstarttime as GFS_nwp_time,
  t4.ws as GFS_ws,
  t4.wd as GFS_wd,
  t4.tmp as GFS_tmp,
  t4.pres as GFS_press,
  t4.rho as GFS_rho,
  t4.dist as GFS_dist,
  t1.value as speed,
  t2.value as power
from
  (select utctimestamp, value from enos_power_speeds where masterid='CGNWF0046' and type='speeds' and value is not null) t1
left join
  (select utctimestamp, value from enos_power_speeds where masterid='CGNWF0046' and type='powers' and value is not null) t2
on t1.utctimestamp = t2.utctimestamp
left join
  (select utctimestamp, pres, tmp, wd, ws, weatherpublishstarttime, type, pres*sqrt(2)/(287.05 * tmp+273.15) as rho, row_number() over(partition by 1) + 12 as dist from external_weather1 where type='EC') t3
on t1.utctimestamp = t3.utctimestamp
left join
  (select utctimestamp, pres, tmp, wd, ws, weatherpublishstarttime, type, pres*sqrt(2)/(287.05 * tmp+273.15) as rho, row_number() over(partition by 1) + 12 as dist from external_weather1 where type='GFS') t4
on t1.utctimestamp = t4.utctimestamp;
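
Before copying the result out of the warehouse directory, you may want to confirm that kmmlds2 was populated; a simple row count is enough (optional sketch):

select count(1) from kmmlds2;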


Copy the kmmlds2 result file to the CGNWF0046 partition directory of the kmmlds1 table:

hdfs dfs -cp /user/hive/warehouse/data_o15632609593521.db/kmmlds2/000000_0 /user/hive/warehouse/data_o15632609593521.db/kmmlds1/CGNWF0046/


Create the external partitioned table kmmlds1 over the copied data:

CREATE external TABLE `kmmlds1`(
`x_basic_time` timestamp,
`x_basic_hour` int,
`x_basic_forecast_time` timestamp,
`x_basic_horizon` int,
`i_set` int,
`ec_nwp_time` timestamp,
`ec_ws` double,
`ec_wd` double,
`ec_tmp` double,
`ec_press` double,
`ec_rho` double,
`ec_dist` int,
`gfs_nwp_time` timestamp,
`gfs_ws` double,
`gfs_wd` double,
`gfs_tmp` double,
`gfs_press` double,
`gfs_rho` double,
`gfs_dist` int,
`speed` double,
`power` double)
partitioned by(masterid string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'/user/hive/warehouse/data_o15632609593521.db/kmmlds1';

Then add a partition for each wind farm:

alter table kmmlds1 add if not exists partition(masterid='ABCDE0001')
location '/user/hive/warehouse/data_o15632609593521.db/kmmlds1/ABCDE0001';

alter table kmmlds1 add if not exists partition(masterid='CGNWF0046')
location '/user/hive/warehouse/data_o15632609593521.db/kmmlds1/CGNWF0046';
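
To verify that the external table picks up the copied file, you can list its partitions and count the rows in the CGNWF0046 partition (optional sketch):

show partitions kmmlds1;
select count(1) from kmmlds1 where masterid='CGNWF0046';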

Use External Weather Data

Create the table kmmlds91x, which stores the feature (X) columns keyed by a sequence number:

CREATE external TABLE `kmmlds91x`(
`sequence` int,
`i_set` int,
`x_basic_forecast_time` timestamp,
`x_basic_horizon` int,
`x_basic_time` timestamp,
`x_basic_hour` int,
`ec_nwp_time` timestamp,
`ec_dist` int,
`ec_ws` double,
`ec_wd` double,
`ec_rho` double,
`ec_press` double,
`ec_tmp` double,
`gfs_nwp_time` timestamp,
`gfs_dist` int,
`gfs_ws` double,
`gfs_wd` double,
`gfs_rho` double,
`gfs_press` double,
`gfs_tmp` double
)
row format delimited fields terminated by ','
lines terminated by '\n'
location 'hdfs://azbeta/user/hive/warehouse/data_o15632609593521.db/kmmlds91x'
tblproperties ("skip.header.line.count"="1");


Create the table kmmlds91y, which stores the target (Y) columns (speed and power) keyed by the same sequence number:

CREATE external TABLE `kmmlds91y`(
`sequence` int,
`speed` double,
`power` double
)
row format delimited fields terminated by ','
lines terminated by '\n'
location 'hdfs://azbeta/user/hive/warehouse/data_o15632609593521.db/kmmlds91y'
tblproperties ("skip.header.line.count"="1");
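
The two tables share the sequence column, so once they are populated, features and targets can be joined back together. A minimal sketch (the limit value is arbitrary):

select x.sequence, x.ec_ws, x.gfs_ws, y.speed, y.power
from kmmlds91x x
join kmmlds91y y on x.sequence = y.sequence
limit 10;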

Create Status Table

  1. Use the following command to create a status table that records daily data updates (a query sketch follows this list):

    create table status_tbl(masterid string, updatetime timestamp, flag int);
    
  2. Insert a status flag:

    insert into status_tbl values('ABCDE0001','2020-06-13',1);
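
As mentioned in step 1, downstream jobs can read this table to find the latest update for each wind farm. A minimal query sketch, assuming that flag = 1 marks a completed update:

select masterid, max(updatetime) as last_update
from status_tbl
where flag = 1
group by masterid;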
    

Description of Integrated Data Structure

The field names, data types, and descriptions of the integrated data are shown in the following table:

Field name            | Type      | Description
x_basic_time          | timestamp | Time field; one record with a UTC time label is generated per hour
x_basic_hour          | int       | Hour extracted from x_basic_time
x_basic_forecast_time | timestamp | Forecast time; generally the zero point (00:00) of the next day is taken
x_basic_horizon       | int       | Horizon number: 0~48, repeating
i_set                 | int       | Set number corresponding to the horizon sequence (0, 1, 2, ...)
ec_nwp_time           | timestamp | Numerical weather prediction time of the EC source; the forecast is generally issued at 12:00 every day for the next day's weather
ec_ws                 | double    | Wind speed
ec_wd                 | double    | Wind direction
ec_tmp                | double    | Temperature
ec_press              | double    | Air pressure
ec_rho                | double    | Air density
ec_dist               | int       | Counter with no practical meaning
gfs_nwp_time          | timestamp | Numerical weather prediction time of the GFS source; the forecast is generally issued at 12:00 every day for the next day's weather
gfs_ws                | double    | Wind speed
gfs_wd                | double    | Wind direction
gfs_tmp               | double    | Temperature
gfs_press             | double    | Air pressure
gfs_rho               | double    | Air density
gfs_dist              | int       | Counter with no practical meaning
speed                 | double    | Actual wind speed
power                 | double    | Actual power value
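
For reference, the following optional query reads a few integrated records for the CGNWF0046 wind farm so that you can inspect the fields described above:

select x_basic_time, x_basic_horizon, ec_ws, gfs_ws, speed, power
from kmmlds1
where masterid='CGNWF0046'
limit 10;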

Next Unit

Designing a Pipeline