Unit 1. Preparing Data¶
Before training a machine learning model, you need to prepare sample data for model training.
Prerequisites¶
You should upload the wind power algorithm files and weather data files accumulated in the wind field to EnOS HDFS in advance to create the data tables required for model training. In this tutorial, the HDFS path for storing files is hdfs://azbeta/user/hive/warehouse/data_o15632609593521.db/kongmingmldemo/new/
.
Prepare Wind Power Algorithm Data¶
You can use map the files uploaded to EnOS HDFS to the Hive partition table, and use the wind farm ID (masterid) as the partition key to simulate the actual scenarios by following these steps.
Use
powers
andspeeds
files to create a Hive partition table (enos_power_speeds) for partition by site (masterid) and type. The value corresponding to thepowers
type ispower
(power), while the value corresponding to thespeeds
type isspeed
(wind speed).create external table enos_power_speeds(utcTimeStamp timestamp, value double) partitioned by (masterid string,type string) row format delimited fields terminated by ',' lines terminated by 'n' location 'hdfs://azbeta/user/hive/warehouse/data_o15632609593521.db/kongmingmldemo/new' tblproperties ("skip.header.line.count"="1");
Add a partition
alter table enos_power_speeds add partition(masterid='CGNWF0046',type='powers') location 'hdfs://azbeta/user/hive/warehouse/data_o15632609593521.db/kongmingmldemo/new/CGNWF0046/powers'; alter table enos_power_speeds add partition(masterid='CGNWF0046',type='speeds') location 'hdfs://azbeta/user/hive/warehouse/data_o15632609593521.db/kongmingmldemo/new/CGNWF0046/speeds';
Specify the batch data processing queue name in Hive (requested through resource management):
set mapreduce.job.queuename=root.test_enos_01;
Check and verification:
select count(1) from enos_power_speeds where masterid='CGNWF0046' and type='powers'
Use the file
speed_to_power.csv
to create the tablespeed_to_power
:create external table speed_to_power(speed double,power double) row format delimited fields terminated by ',' lines terminated by 'n' location 'hdfs://azbeta/user/hive/warehouse/data_o15632609593521.db/kongmingmldemo/new/speed_to_power' tblproperties ("skip.header.line.count"="1");
Prepare External Weather Data¶
Create an external weather data table by following these steps:
Create the table
external_weather
and perform partitioned by weather data source type (EC and GFS):create external table external_weather(utcTimeStamp timestamp, pres double,tmp double,wd double,ws double, weatherpublishstarttime timestamp) partitioned by (type string) row format delimited fields terminated by ',' lines terminated by 'n' location 'hdfs://azbeta/user/hive/warehouse/data_o15632609593521.db/kongmingmldemo/new/112.647/21.938' tblproperties ("skip.header.line.count"="1");
Add a partition
alter table external_weather add partition(type='EC') location 'hdfs://azbeta/user/hive/warehouse/data_o15632609593521.db/kongmingmldemo/new/112.647/21.938/EC'; alter table external_weather add partition(type='GFS') location 'hdfs://azbeta/user/hive/warehouse/data_o15632609593521.db/kongmingmldemo/new/112.647/21.938/GFS';
Create a group:
create table external_weather1 as select utctimestamp, avg(pres) as pres, avg(tmp) as tmp, avg(wd) as wd, avg(ws) as ws, max(weatherpublishstarttime) as weatherpublishstarttime, type from external_weather group by type, utctimestamp;
Check if the group processing is effective:
select * from external_weather1 where utcTimeStamp='2018-07-03';
Integrate Final Data¶
Create the following Hive table to integrate data:
create table kmmlds2 as
select
t1.utctimestamp as X_basic_time,
hour(t1.utctimestamp) as X_basic_hour,
current_timestamp as X_basic_forecast_time,
(row_number() over(partition by 1) - 1)%49 as X_basic_horizon,
datediff(t1.utctimestamp,'2018-07-01') as i_set,
t3.weatherpublishstarttime as EC_nwp_time,
t3.ws as EC_ws,
t3.wd as EC_wd,
t3.tmp as EC_tmp,
t3.pres as EC_press,
t3.rho as EC_rho,
t3.dist as EC_dist,
t4.weatherpublishstarttime as GFS_nwp_time,
t4.ws as GFS_ws,
t4.wd as GFS_wd,
t4.tmp as GFS_tmp,
t4.pres as GFS_press,
t4.rho as GFS_rho,
t4.dist as GFS_dist,
t1.value as speed,
t2.value as power
from
(select utctimestamp,value from enos_power_speeds where masterid='CGNWF0046' and type='speeds' and value is not null) t1
left join
(select utctimestamp, value from enos_power_speeds where masterid='CGNWF0046' and type='powers' and value is not null ) t2
on t1.utctimestamp =t2.utctimestamp
left join
(select utctimestamp,pres, tmp, wd, ws, weatherpublishstarttime,type,pres*sqrt(2)/(287.05 * tmp+273.15) as rho,row_number() over(partition by 1) +12 as dist from external_weather1 where type='EC') t3
on t1.utctimestamp =t3.utctimestamp
left join
(select utctimestamp,pres, tmp, wd, ws, weatherpublishstarttime,type,pres*sqrt(2)/(287.05 * tmp+273.15) as rho,row_number() over(partition by 1) +12 as dist from external_weather1 where type='GFS') t4
on t1.utctimestamp =t4.utctimestamp
hdfs dfs -cp /user/hive/warehouse/data_o15632609593521.db/kmmlds2/000000_0 /user/hive/warehouse/data_o15632609593521.db/kmmlds1/CGNWF0046/
CREATE external TABLE `kmmlds1`(
`x_basic_time` timestamp,
`x_basic_hour` int,
`x_basic_forecast_time` timestamp,
`x_basic_horizon` int,
`i_set` int,
`ec_nwp_time` timestamp,
`ec_ws` double,
`ec_wd` double,
`ec_tmp` double,
`ec_press` double,
`ec_rho` double,
`ec_dist` int,
`gfs_nwp_time` timestamp,
`gfs_ws` double,
`gfs_wd` double,
`gfs_tmp` double,
`gfs_press` double,
`gfs_rho` double,
`gfs_dist` int,
`speed` double,
`power` double)
partitioned by(masterid string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'/user/hive/warehouse/data_o15632609593521.db/kmmlds1';
alter table kmmlds1 add if not exists partition(masterid='ABCDE0001')
location '/user/hive/warehouse/data_o15632609593521.db/kmmlds1/ABCDE0001';
alter table kmmlds1 add if not exists partition(masterid='CGNWF0046')
location '/user/hive/warehouse/data_o15632609593521.db/kmmlds1/CGNWF0046';
Use External Weather Data¶
CREATE external TABLE `kmmlds91x`(
`sequence` int,
`i_set` int,
`x_basic_forecast_time` timestamp,
`x_basic_horizon` int,
`x_basic_time` timestamp,
`x_basic_hour` int,
`ec_nwp_time` timestamp,
`ec_dist` int,
`ec_ws` double,
`ec_wd` double,
`ec_rho` double,
`ec_press` double,
`ec_tmp` double,
`gfs_nwp_time` timestamp,
`gfs_dist` int,
`gfs_ws` double,
`gfs_wd` double,
`gfs_rho` double,
`gfs_press` double,
`gfs_tmp` double
)
row format delimited fields terminated by ','
lines terminated by 'n'
location 'hdfs://azbeta/user/hive/warehouse/data_o15632609593521.db/kmmlds91x'
tblproperties ("skip.header.line.count"="1");
CREATE external TABLE `kmmlds91y`(
`sequence` int,
`speed` double,
`power` double
)
row format delimited fields terminated by ','
lines terminated by 'n'
location 'hdfs://azbeta/user/hive/warehouse/data_o15632609593521.db/kmmlds91y'
tblproperties ("skip.header.line.count"="1");
Create Status Table¶
Use the following command to create a status table to record daily data updates:
create table status_tbl(masterid string, updatetime timestamp, flag int);
Insert status flag:
insert into status_tbl values('ABCDE0001','2020-06-13',1);
Description of Integrated Data Structure¶
The data type and description of the integrated data are shown in the following table:
Field name |
Type |
Description |
---|---|---|
x_basic_time |
timestamp |
Time field, where one piece of data with universal time label is generated per hour |
x_basic_hour |
int |
Hours that are gotten from x_basic_time |
x_basic_forecast_time |
timestamp |
The value at the zero point on the second day is taken generally |
x_basic_horizon |
int |
Number number: 0~48 (repeated) |
i_set |
int |
Numbers corresponding to horizon (0, 1, 2…) |
ec_nwp_time |
timestamp |
Numerical weather prediction, which refers to the EC weather forecast time, generally at 12:00 every day to forecast temporary weather information on the next day |
ec-ws |
double |
wind speed |
ec-wd |
double |
Wind direction |
ec-tmp |
double |
temperature |
ec-pres |
double |
Air pressure |
ec-rho |
double |
Air density |
ec-dist |
double |
Count that has no practical meaning |
gfs_nwp_time |
timestamp |
Numerical weather prediction, which refers to the GFS weather forecast time, generally at 12:00 every day to forecast temporary weather information on the next day |
gfs-ws |
double |
Wind speed |
gfs-wd |
double |
Wind direction |
gfs-tmp |
double |
Temperature |
gfs-pres |
double |
Air pressure |
gfs-rho |
double |
Air density |
gfs-dist |
double |
Count that has no practical meaning |
speed |
double |
Actual wind speed |
power |
double |
Actual power value |