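Machine Learning: Regression

Preface

These notes follow Hung-yi Lee's machine learning course. I went through Andrew Ng's course once before, but unfortunately I took no notes at the time, and since I have not written much code lately, the material has gradually slipped away. So I chose Hung-yi Lee's course for another pass; as the saying goes, reviewing the old teaches you something new.

Regression is one of the most commonly used families of machine learning methods. Everyday tasks such as predicting a person's height or a stock price can be viewed, roughly, as regression problems.

An example

As kids we played a game called Seer (赛尔号), which features all kinds of sprites. Suppose we have caught a Lei Yi (雷伊); we can feed it experience points to level it up. This Lei Yi has an attack power, and we want to model the relationship between its various attributes and that attack power.

So let its hit points be \(x_{hp}\), its weight \(x_{w}\), its height \(x_{h}\), its species \(x_s\), and its combat power \(x_{cp}\), and predict its combat power \(y\) after evolution. This gives \(y=b+\sum w_ix_i\), where \(w_i\) are the weights and \(b\) is the bias. If we predict from \(x_{cp}\) alone, the model is \(y=b+w\cdot x_{cp}\).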
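
As a minimal sketch of the linear model above, the snippet below evaluates \(y=b+\sum w_ix_i\) with NumPy. The feature values, weights, and bias are made-up numbers chosen only for illustration, and the categorical species feature \(x_s\) is left out to keep the example short.

```python
import numpy as np

# Hypothetical feature vector [x_hp, x_w, x_h, x_cp] (made-up values)
x_features = np.array([50.0, 10.0, 1.2, 300.0])
# Hypothetical weights and bias (not learned from any data)
weights = np.array([0.1, 0.0, 0.0, 2.0])
bias = 5.0

def predict(x, w, b):
    """Linear model: y = b + sum_i w_i * x_i."""
    return b + np.dot(w, x)

y_pred = predict(x_features, weights, bias)
print(y_pred)
```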

Once the data has been collected, we can fit the model. To measure how good a candidate function is, we use a loss function:

\[ L(f) = \sum^{10}_{n=1}(\hat y^n-f(x^n_{cp}))^2\\ L(f) = \sum^{10}_{n=1}(\hat y^n-(b+w\cdot x^n_{cp}))^2 \]
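
To make the loss concrete, here is a small sketch that evaluates the sum of squared errors for the single-feature model. The three data points are invented for illustration (the text sums over ten samples), and the function name `loss` is just a label for this example.

```python
import numpy as np

def loss(w, b, x_cp, y_hat):
    """Sum of squared errors: sum_n (y_hat^n - (b + w * x_cp^n))^2."""
    residual = y_hat - (b + w * x_cp)
    return np.sum(residual ** 2)

# Made-up data generated from y = 1 + 2x
x_cp = np.array([1.0, 2.0, 3.0])
y_hat = np.array([3.0, 5.0, 7.0])

perfect = loss(2.0, 1.0, x_cp, y_hat)  # the generating parameters
worse = loss(1.0, 0.0, x_cp, y_hat)    # a poorer choice of w and b
print(perfect, worse)
```

The parameters that generated the data give a loss of zero, while any other choice gives a larger value, which is what makes the loss usable as a ranking of candidate functions.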

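We want the loss to be as small as possible, so we need to pick the best function, that is, the best parameters. Here we optimize with gradient descent, iterating as follows, where \(\alpha\) is the learning rate: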

\[ w^1 = w^0-\alpha\frac{dL}{dw}|_{w=w^0}\\ w^2 = w^1-\alpha\frac{dL}{dw}|_{w=w^1} \]

For a function with two parameters, the same iteration applies to each parameter:

\[ w^1 = w^0-\alpha\frac{\partial L}{\partial w}|_{w=w^0,b=b^0},b^1 = b^0-\alpha\frac{\partial L}{\partial b}|_{w=w^0,b=b^0} \]

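Each step moves against the gradient, so the loss decreases gradually until it reaches a minimum. The partial derivative with respect to \(w\) is: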

\[ \frac{\partial L}{\partial w}=\sum^{10}_{n=1}2(\hat y^n-(b+w\cdot x^n_{cp}))(-x^n_{cp}) \]
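
For completeness, applying the same chain rule to the loss with respect to \(b\) gives the other partial derivative needed by the two-parameter update (the inner derivative is \(-1\) instead of \(-x^n_{cp}\)):

\[ \frac{\partial L}{\partial b}=\sum^{10}_{n=1}2(\hat y^n-(b+w\cdot x^n_{cp}))(-1) \]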

After enough iterations we obtain suitable values of \(w\) and \(b\), which gives us the fitted equation.
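
The whole procedure can be sketched in a few lines. This is an illustrative implementation on made-up data, not the course's reference code; the learning rate `alpha=0.01` and the step count are assumptions that happen to work for this toy data set.

```python
import numpy as np

def gradient_descent(x_cp, y_hat, alpha=0.01, steps=5000):
    """Fit y = b + w * x_cp by gradient descent on the sum-of-squares loss."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        residual = y_hat - (b + w * x_cp)
        grad_w = np.sum(2 * residual * (-x_cp))  # dL/dw
        grad_b = np.sum(2 * residual * (-1.0))   # dL/db
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b

# Made-up data generated from y = 1 + 2x
x_cp = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = 1.0 + 2.0 * x_cp
w, b = gradient_descent(x_cp, y_hat)
print(w, b)
```

If `alpha` is chosen too large the iteration diverges instead of converging, which is why the learning rate usually has to be tuned per data set.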

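Practice

The practical exercise is a PM2.5 prediction task.

The data comes from observation records at the Fengyuan (丰原) station and is split into a train set and a test set. The train set contains all records from the first 20 days of each month at the Fengyuan station; the test set is sampled from the remaining records.
train.csv: complete records for the first 20 days of each month.
test.csv: from the remaining records, runs of 10 consecutive hours are sampled as one entry. All observations from the first nine hours serve as features, and the PM2.5 value of the tenth hour is the answer. In total there are 240 distinct test entries; predict the PM2.5 value for each of them from the features.
The data contains 18 observed quantities: AMB_TEMP, CH4, CO, NMHC, NO, NO2, NOx, O3, PM10, PM2.5, RAINFALL, RH, SO2, THC, WD_HR, WIND_DIREC, WIND_SPEED, WS_HR.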


import sys
import pandas as pd
import numpy as np


data = pd.read_csv('work/hw1_data/train.csv',encoding='big5')
data[data=='NR']=0
data

	日期	測站	測項	0	1	2	3	4	5	6	...	14	15	16	17	18	19	20	21	22	23
0	2014/1/1	豐原	AMB_TEMP	14	14	14	13	12	12	12	...	22	22	21	19	17	16	15	15	15	15
1	2014/1/1	豐原	CH4	1.8	1.8	1.8	1.8	1.8	1.8	1.8	...	1.8	1.8	1.8	1.8	1.8	1.8	1.8	1.8	1.8	1.8
2	2014/1/1	豐原	CO	0.51	0.41	0.39	0.37	0.35	0.3	0.37	...	0.37	0.37	0.47	0.69	0.56	0.45	0.38	0.35	0.36	0.32
3	2014/1/1	豐原	NMHC	0.2	0.15	0.13	0.12	0.11	0.06	0.1	...	0.1	0.13	0.14	0.23	0.18	0.12	0.1	0.09	0.1	0.08
4	2014/1/1	豐原	NO	0.9	0.6	0.5	1.7	1.8	1.5	1.9	...	2.5	2.2	2.5	2.3	2.1	1.9	1.5	1.6	1.8	1.5
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
4315	2014/12/20	豐原	THC	1.8	1.8	1.8	1.8	1.8	1.7	1.7	...	1.8	1.8	2	2.1	2	1.9	1.9	1.9	2	2
4316	2014/12/20	豐原	WD_HR	46	13	61	44	55	68	66	...	59	308	327	21	100	109	108	114	108	109
4317	2014/12/20	豐原	WIND_DIREC	36	55	72	327	74	52	59	...	18	311	52	54	121	97	107	118	100	105
4318	2014/12/20	豐原	WIND_SPEED	1.9	2.4	1.9	2.8	2.3	1.9	2.1	...	2.3	2.6	1.3	1	1.5	1	1.7	1.5	2	2
4319	2014/12/20	豐原	WS_HR	0.7	0.8	1.8	1	1.9	1.7	2.1	...	1.3	1.7	0.7	0.4	1.1	1.4	1.3	1.6	1.8	2

4320 rows × 27 columns

raw_data = data.iloc[:,3:]
raw_data

	0	1	2	3	4	5	6	7	8	9	...	14	15	16	17	18	19	20	21	22	23
0	14	14	14	13	12	12	12	12	15	17	...	22	22	21	19	17	16	15	15	15	15
1	1.8	1.8	1.8	1.8	1.8	1.8	1.8	1.8	1.8	1.8	...	1.8	1.8	1.8	1.8	1.8	1.8	1.8	1.8	1.8	1.8
2	0.51	0.41	0.39	0.37	0.35	0.3	0.37	0.47	0.78	0.74	...	0.37	0.37	0.47	0.69	0.56	0.45	0.38	0.35	0.36	0.32
3	0.2	0.15	0.13	0.12	0.11	0.06	0.1	0.13	0.26	0.23	...	0.1	0.13	0.14	0.23	0.18	0.12	0.1	0.09	0.1	0.08
4	0.9	0.6	0.5	1.7	1.8	1.5	1.9	2.2	6.6	7.9	...	2.5	2.2	2.5	2.3	2.1	1.9	1.5	1.6	1.8	1.5
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
4315	1.8	1.8	1.8	1.8	1.8	1.7	1.7	1.8	1.8	1.8	...	1.8	1.8	2	2.1	2	1.9	1.9	1.9	2	2
4316	46	13	61	44	55	68	66	70	66	85	...	59	308	327	21	100	109	108	114	108	109
4317	36	55	72	327	74	52	59	83	106	105	...	18	311	52	54	121	97	107	118	100	105
4318	1.9	2.4	1.9	2.8	2.3	1.9	2.1	3.7	2.8	3.8	...	2.3	2.6	1.3	1	1.5	1	1.7	1.5	2	2
4319	0.7	0.8	1.8	1	1.9	1.7	2.1	2	2	1.7	...	1.3	1.7	0.7	0.4	1.1	1.4	1.3	1.6	1.8	2

4320 rows × 24 columns

mouth_data = {}
for mouth in range(12):
    # Each month has 20 days of 24 hours, i.e. 480 hours of data,
    # and each day has 18 measured items (18 consecutive rows in the CSV)
    sample = np.empty([18,480])
    for day in range(20):
        # Lay the month's 20 days side by side: 20*24 = 480 hourly values per row
        sample[:,day*24:(day+1)*24]=raw_data[18*(mouth*20+day):18*(mouth*20+day+1)]
    mouth_data[mouth] = sample

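As the task describes, we use the data from the previous nine hours to predict the value at the tenth hour, so the data has to be sliced again: every nine-hour window is an x, and the hour right after it is the corresponding y. Each month has 480 hours, so each month yields 480-9 = 471 values of y, and each y has 18*9 features.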
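
The windowing can be illustrated on a toy array before applying it to the real data. The sizes below (2 features, 12 hours) and the choice of `target_row` are assumptions made for this sketch only; the real data has 18 features and 480 hours per month.

```python
import numpy as np

n_features, n_hours, window = 2, 12, 9   # toy sizes, not the real 18 and 480
target_row = 1                           # toy stand-in for the PM2.5 row
series = np.arange(n_features * n_hours, dtype=float).reshape(n_features, n_hours)

n_samples = n_hours - window             # 12 - 9 = 3, analogous to 480 - 9 = 471
x = np.empty([n_samples, n_features * window])
y = np.empty([n_samples, 1])
for start in range(n_samples):
    # Nine consecutive hours of every feature, flattened into one row
    x[start, :] = series[:, start:start + window].reshape(1, -1)
    # The target is the value of the chosen row in the following hour
    y[start, 0] = series[target_row, start + window]

print(x.shape, y.ravel())
```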

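x = np.empty([12*471,18*9],dtype=float)
y = np.empty([12*471,1],dtype=float)
for mouth in range(12):
    for day in range(20):
        for hour in range(24):
            # Windows starting in the last 9 hours of the month have no tenth hour to predict
            if day==19 and hour>14:
                continue
            # Features: all 18 items over 9 consecutive hours, flattened into one row
            x[mouth*471+day*24+hour,:] = mouth_data[mouth][:,day*24+hour:day*24+hour+9].reshape(1,-1)
            # Target: PM2.5 (row 9) in the tenth hour
            y[mouth*471+day*24+hour,0] = mouth_data[mouth][9,day*24+hour+9]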

x now has 12 * 471 rows and 18*9 columns.

y now has 12 * 471 rows and 1 column.

Normalize x

mean_x = np.mean(x,axis=0) # mean of each column
std_x = np.std(x,axis=0) # standard deviation of each column
for i in range(12*471):
    for j in range(18*9):
        if std_x[j]!=0:
            x[i][j] = (x[i][j]-mean_x[j])/std_x[j]
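
The element-by-element loop can also be written with NumPy broadcasting. The sketch below is one possible vectorized form on made-up data; the array `features` and the mask name `nonzero` are introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(loc=5.0, scale=2.0, size=(6, 4))  # made-up data
features[:, 2] = 3.0                                    # constant column, std == 0

mean = np.mean(features, axis=0)
std = np.std(features, axis=0)
nonzero = std != 0                                      # columns that are safe to divide
normalized = features.copy()
normalized[:, nonzero] = (features[:, nonzero] - mean[nonzero]) / std[nonzero]

print(normalized.mean(axis=0), normalized.std(axis=0))
```

Columns with zero standard deviation are left unchanged, matching the `std_x[j]!=0` guard in the loop.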

# Split the data into a training set and a validation set at a ratio of 4:1
import math
x_train_set = x[:math.floor(len(x)*0.8),:]
y_train_set = y[:math.floor(len(y)*0.8),:]

x_validation = x[math.floor(len(x)*0.8):,:]
y_validation = y[math.floor(len(y)*0.8):,:]
print(len(x_train_set),len(y_train_set),len(x_validation),len(y_validation))
# Prepend a column of ones so that w[0] acts as the bias term
x = np.concatenate((np.ones([12*471,1]),x),axis=1).astype(float)

4521 4521 1131 1131
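
One possible use of a held-out split like `x_validation` is to measure RMSE on data the model was not fitted on. The sketch below shows the idea on synthetic data; it uses `np.linalg.lstsq` (ordinary least squares) as a stand-in for the training loop, and all sizes, weights, and names such as `x_tr` and `rmse` are assumptions for illustration.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 100, 3              # toy sizes, not the real 5652 x 162
x_all = rng.normal(size=(n_samples, n_features))
true_w = np.array([[1.0], [-2.0], [0.5]])   # made-up weights
y_all = x_all @ true_w + 4.0                # noiseless target with bias 4.0

# Prepend the bias column first, then split 80/20 in order
x_all = np.concatenate((np.ones([n_samples, 1]), x_all), axis=1)
split = math.floor(n_samples * 0.8)
x_tr, y_tr = x_all[:split], y_all[:split]
x_val, y_val = x_all[split:], y_all[split:]

# Stand-in for gradient descent: fit on the training part only
w_fit, *_ = np.linalg.lstsq(x_tr, y_tr, rcond=None)

def rmse(x, y, w):
    """Root-mean-square error of the linear prediction x @ w against y."""
    return np.sqrt(np.mean((x @ w - y) ** 2))

val_rmse = rmse(x_val, y_val, w_fit)
print(val_rmse)
```

Adding the bias column before splitting keeps the training and validation matrices the same width, so the same weight vector can be applied to both.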

Train with gradient descent

dim = 18 * 9 + 1
w = np.zeros([dim, 1])
learning_rate = 0.000001
iter_time = 1000
adagrad = np.zeros([dim, 1])
eps = 0.0000000001
for t in range(iter_time):
    loss = np.sqrt(np.sum(np.power(np.dot(x, w) - y, 2))/471/12)#rmse
    if(t%100==0):
        print(str(t) + ":" + str(loss))
    gradient = 2 * np.dot(x.transpose(), np.dot(x, w) - y) #dim*1
#     adagrad += gradient ** 2
#     w = w - learning_rate * gradient / np.sqrt(adagrad + eps)
    w = w - learning_rate * gradient
np.save('work/weight.npy', w)

0:23.067503022281024
100:16.01469450959162
200:15.785217268902825
300:15.667044002058859
400:15.59344540214558
500:15.54253683834305
600:15.504902944004627
700:15.475801107300377
800:15.452554803514973
900:15.433523145338306
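
The two commented-out lines in the training loop are an Adagrad update, which scales each weight's step by the accumulated squared gradients. A minimal sketch of that variant on synthetic data is shown below; the data, the true weights, and `learning_rate = 1.0` are assumptions for this toy problem and not tuned values for the PM2.5 data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 200, 3
# Synthetic design matrix with a leading bias column of ones
x = np.concatenate((np.ones([n, 1]), rng.normal(size=(n, dim - 1))), axis=1)
true_w = np.array([[3.0], [1.0], [-2.0]])   # made-up weights
y = x @ true_w

w = np.zeros([dim, 1])
learning_rate = 1.0
adagrad = np.zeros([dim, 1])                # running sum of squared gradients
eps = 1e-10
losses = []
for t in range(2000):
    error = x @ w - y
    losses.append(np.sqrt(np.mean(error ** 2)))   # RMSE before the update
    gradient = 2 * x.T @ error                    # shape (dim, 1)
    adagrad += gradient ** 2
    w = w - learning_rate * gradient / np.sqrt(adagrad + eps)

print(losses[0], losses[-1])
```

Because each coordinate is divided by its own gradient history, Adagrad tolerates a much larger nominal learning rate than the plain update.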

Read the test data

testdata = pd.read_csv('work/hw1_data/test.csv',header=None, encoding='big5')
testdata = testdata.iloc[:,2:]
testdata[testdata=='NR']=0
test_data = testdata.to_numpy()
test_x = np.empty([240,18*9],dtype=float)
for i in range(240):
    test_x[i,:] = test_data[18*i:18*(i+1),:].reshape(1,-1)
# Normalize with the mean and standard deviation computed on the training data
for i in range(len(test_x)):
    for j in range(len(test_x[0])):
        if std_x[j]!=0:
            test_x[i][j] = (test_x[i][j]-mean_x[j])/std_x[j]
test_x = np.concatenate((np.ones([240,1]),test_x),axis=1).astype(float)

Make predictions


w = np.load('work/weight.npy')
ans_y = np.dot(test_x,w)
ans_y

Save to a CSV file

import csv
with open('work/submit.csv',mode='w',newline='') as submit_file:
    csv_writer = csv.writer(submit_file)
    header = ['id','value']
    csv_writer.writerow(header)
    for i in range(240):
        row = ['id_'+str(i),ans_y[i][0]]
        csv_writer.writerow(row)
        print(row)

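That completes the prediction. I have not yet found ground-truth values to compare against; once I do, I will measure how accurate the predictions are.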