Linear Regression
Applications of Linear Regression
- House price prediction
- Sales volume prediction
- Finance: loan amount prediction; using linear regression and its coefficients to analyze contributing factors
What Is Linear Regression
Linear regression is an analysis method that uses a regression equation (function) to model the relationship between one or more independent variables (features) and a dependent variable (target).
Characteristics: the case with a single independent variable is called univariate regression; with more than one independent variable it is called multiple regression.
Linear Regression Formula
General formula:
h(w) = w_1 x_1 + w_2 x_2 + w_3 x_3 + \dots + b = w^T x
where w and x can be read as column vectors, with the bias b folded into w and a constant 1 into x:
w = \begin{pmatrix} b \\ w_{1} \\ w_{2} \end{pmatrix}, \quad x = \begin{pmatrix} 1 \\ x_{1} \\ x_{2} \end{pmatrix}
So how should we read this? Let's look at a couple of examples:
- Final grade = 0.7 × exam score + 0.3 × coursework score
- House price = 0.02 × distance to the city center + 0.04 × urban nitric oxide concentration + (-0.12 × average price of owner-occupied homes) + 0.254 × town crime rate
In both examples a relationship is established between the feature values and the target value; this relationship can be understood as the regression equation.
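To make the formula concrete, here is a minimal NumPy sketch of the hypothesis h(w) = w^T x using the augmented vectors above; all numbers are invented for illustration.

import numpy as np

# Augmented vectors matching the matrices above: w = (b, w1, w2)^T, x = (1, x1, x2)^T.
w = np.array([0.5, 0.7, 0.3])    # b = 0.5, w1 = 0.7, w2 = 0.3 (made-up values)
x = np.array([1.0, 80.0, 90.0])  # the leading 1 absorbs the intercept b

h = w @ x  # w^T x = b + w1*x1 + w2*x2
print(h)   # 0.5 + 0.7*80 + 0.3*90 = 83.5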
Loss Function
J(\theta) = (h_w(x_1)-y_1)^2 + (h_w(x_2)-y_2)^2 + \dots + (h_w(x_m)-y_m)^2 = \sum_{i=1}^{m}(h_w(x_i)-y_i)^2
- y_i is the true value of the i-th training sample
- h_w(x_i) is the predicted value for the i-th training sample's feature combination
- Also known as the least-squares criterion
How do we reduce this loss and make predictions more accurate? Machine learning is said to learn automatically, and linear regression shows this especially well: optimization methods (essentially differentiation from calculus) are used to minimize the total regression loss.
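As a sketch of the loss defined above (a direct NumPy transcription, with invented sample values):

import numpy as np

def total_squared_error(y_pred, y_true):
    """J = sum_i (h_w(x_i) - y_i)^2, the loss defined above."""
    return np.sum((y_pred - y_true) ** 2)

y_true = np.array([3.0, 5.0, 7.0])  # invented true values
y_pred = np.array([2.5, 5.5, 6.0])  # invented predictions
print(total_squared_error(y_pred, y_true))  # 0.25 + 0.25 + 1.0 = 1.5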
Optimization Algorithms
How do we solve for the model's W so that the loss is minimized? (The goal is to find the W values corresponding to the smallest loss.)
Linear regression commonly uses two optimization algorithms: the normal equation and gradient descent.
Normal Equation
w = (X^T X)^{-1} X^T y
Interpretation: X is the matrix of feature values and y the matrix of target values; this solves directly for the best result.
Drawback: when there are many, complex features, solving is too slow (inverting X^T X is roughly cubic in the number of features) and may fail to produce a result.
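A minimal NumPy sketch of the normal equation on an invented toy dataset; in practice np.linalg.lstsq or np.linalg.pinv is numerically safer than an explicit inverse.

import numpy as np

def normal_equation(X, y):
    """Closed-form solution w = (X^T X)^{-1} X^T y; a column of ones
    is prepended so that w[0] plays the role of the intercept."""
    X = np.column_stack([np.ones(len(X)), X])
    return np.linalg.inv(X.T @ X) @ X.T @ y

X = np.array([[1.0], [2.0], [3.0]])  # toy feature values
y = np.array([3.0, 5.0, 7.0])        # exactly y = 2x + 1
print(normal_equation(X, y))         # approximately [1. 2.]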
Gradient Descent
w_1 := w_1 - \alpha \frac{\partial \, cost(w_0 + w_1 x_1)}{\partial w_1}
w_0 := w_0 - \alpha \frac{\partial \, cost(w_0 + w_1 x_1)}{\partial w_0}
Interpretation: α is the learning rate, which must be specified manually (a hyperparameter); the gradient term multiplied by α gives the direction of the update.
By repeatedly stepping in the direction in which the function decreases, we eventually reach the lowest point of the valley, updating the W values along the way.
Use: for tasks with very large training sets, gradient descent can find a reasonably good solution.
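A minimal sketch of the update rules above for a one-feature model w_0 + w_1 x; the learning rate, iteration count, and data are hand-picked for illustration.

import numpy as np

# Toy data generated from y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1

w0, w1 = 0.0, 0.0  # initial parameters
alpha = 0.05       # learning rate (hand-picked hyperparameter)

for _ in range(2000):
    y_pred = w0 + w1 * x
    # Partial derivatives of the mean-squared-error cost
    grad_w0 = 2 * np.mean(y_pred - y)
    grad_w1 = 2 * np.mean((y_pred - y) * x)
    w0 -= alpha * grad_w0
    w1 -= alpha * grad_w1

print(w0, w1)  # converges toward (1, 2)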
Linear Regression API
- sklearn.linear_model.LinearRegression(fit_intercept=True)
  - Optimized via the normal equation
  - fit_intercept: whether to fit the intercept (bias)
  - LinearRegression.coef_: regression coefficients
  - LinearRegression.intercept_: intercept (bias)
- sklearn.linear_model.SGDRegressor(loss="squared_loss", fit_intercept=True, learning_rate='invscaling', eta0=0.01)
  - The SGDRegressor class implements stochastic gradient descent learning; it supports different loss functions and regularization penalties for fitting linear regression models.
  - loss: loss type
    - loss="squared_loss": ordinary least squares
  - fit_intercept: whether to fit the intercept (bias)
  - learning_rate: string, optional
    - the learning-rate schedule
    - 'constant': eta = eta0
    - 'optimal': eta = 1.0 / (alpha * (t + t0))
    - 'invscaling': eta = eta0 / pow(t, power_t) [default for SGDRegressor]
      - power_t=0.25: defined in the parent class
    - For a constant learning rate, use learning_rate='constant' and set the rate with eta0.
  - SGDRegressor.coef_: regression coefficients
  - SGDRegressor.intercept_: intercept (bias)
sklearn provides these two implementation APIs; choose whichever fits the task.
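A minimal usage sketch of the two estimators; the synthetic data and its coefficients below are fabricated for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor

# Synthetic data: y = 3*x1 - 2*x2 + 4 plus a little noise
rng = np.random.RandomState(0)
X = rng.rand(100, 2)
y = 3 * X[:, 0] - 2 * X[:, 1] + 4 + rng.randn(100) * 0.05

lr = LinearRegression(fit_intercept=True).fit(X, y)
print(lr.coef_, lr.intercept_)    # roughly [3, -2] and 4

sgd = SGDRegressor(learning_rate="constant", eta0=0.01, max_iter=10000).fit(X, y)
print(sgd.coef_, sgd.intercept_)  # similar values, found iteratively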
Regression Performance Evaluation
Evaluation via Mean Squared Error (MSE):
MSE = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2
Note: \hat{y}_i is the predicted value for sample i and y_i is the true value.
- sklearn.metrics.mean_squared_error(y_true, y_pred)
  - Mean squared error regression loss
  - y_true: true values
  - y_pred: predicted values
  - return: a float
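A short sketch comparing the formula above with sklearn.metrics.mean_squared_error, on invented values:

import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 7.0])  # invented true values
y_pred = np.array([2.5, 5.5, 6.0])  # invented predictions

mse_manual = np.mean((y_pred - y_true) ** 2)  # the formula above
print(mse_manual)                             # 0.5
print(mean_squared_error(y_true, y_pred))     # same result: 0.5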
Gradient Descent Variants: GD, SGD, SAG
GD
Gradient Descent (GD): the original gradient descent needs every sample to compute the gradient, which is computationally expensive; this motivated a series of later improvements.
SGD
Stochastic Gradient Descent (SGD) is an optimization method that considers a single training sample per iteration.
Advantages of SGD:
- Efficient
- Easy to implement
Disadvantages of SGD:
- SGD requires a number of hyperparameters, such as the regularization parameter and the number of iterations.
- SGD is sensitive to feature standardization.
SAG
Stochastic Average Gradient (SAG): because plain SGD converges slowly, SAG and other gradient-descent-based algorithms were proposed.
In scikit-learn, ridge regression (Ridge) and logistic regression offer SAG as a solver (solver='sag'), as sketched below.
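A minimal sketch using Ridge with solver='sag' (a real scikit-learn option); the dataset is synthetic and invented for illustration.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data, invented for illustration
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X @ np.array([1.5, -2.0, 0.5]) + 3 + rng.randn(200) * 0.1

# SAG is sensitive to feature scaling, so standardize first
X = StandardScaler().fit_transform(X)

ridge = Ridge(alpha=0.5, solver="sag", max_iter=10000).fit(X, y)
print(ridge.coef_, ridge.intercept_)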
Example
Code
# A linear model is linear in the parameters or in the input variables (first power),
# so "linear model" covers both linear and some nonlinear input-output relationships.
# A linear relationship is always a linear model; the converse does not hold.
# There are two optimization methods: the normal equation and gradient descent.
# This script trains models to predict house prices.
from sklearn.linear_model import LinearRegression, SGDRegressor, Ridge, RidgeCV
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error  # mean squared error
def load_data():
boston_data = load_boston()
print("特征数量为:(样本数,特征数)", boston_data.data.shape)
x_train, x_test, y_train, y_test = train_test_split(boston_data.data,
boston_data.target, random_state=22)
return x_train, x_test, y_train, y_test
# Normal equation
def linear_Regression():
"""
正规方程的优化方法
不能解决拟合问题
一次性求解
针对小数据
:return:
"""
x_train, x_test, y_train, y_test = load_data()
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)
estimator = LinearRegression()
estimator.fit(x_train, y_train)
print("正规方程_权重系数为: ", estimator.coef_)
print("正规方程_偏置为:", estimator.intercept_)
y_predict = estimator.predict(x_test)
error = mean_squared_error(y_test, y_predict)
print("正规方程_房价预测:", y_predict)
print("正规方程_均分误差:", error)
return None
# Gradient descent
def linear_SGDRegressor():
"""
梯度下降的优化方法
迭代求解
针对大数据
:return:
"""
x_train, x_test, y_train, y_test = load_data()
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)
    # See the API docs for this class; the commented-out values are the defaults:
    # estimator = SGDRegressor(loss="squared_loss", fit_intercept=True, eta0=0.01,
    #                          power_t=0.25)
    estimator = SGDRegressor(learning_rate="constant", eta0=0.01, max_iter=10000)
    # estimator = SGDRegressor(penalty='l2', loss="squared_loss")  # equivalent to ridge regression, but Ridge is preferred
estimator.fit(x_train, y_train)
print("梯度下降_权重系数为: ", estimator.coef_)
print("梯度下降_偏置为:", estimator.intercept_)
y_predict = estimator.predict(x_test)
error = mean_squared_error(y_test, y_predict)
print("梯度下降_房价预测:", y_predict)
print("梯度下降_均分误差:", error)
return None
def linear_Ridge():
"""
Ridge: 岭回归方法
:return:
"""
x_train, x_test, y_train, y_test = load_data()
    transfer = StandardScaler()  # standardizing the data is recommended
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)
    estimator = Ridge(max_iter=10000, alpha=0.5)  # ridge regression
    # estimator = RidgeCV(alphas=[0.1, 0.2, 0.3, 0.5])  # ridge regression with cross-validation
estimator.fit(x_train, y_train)
print("岭回归_权重系数为: ", estimator.coef_)
print("岭回归_偏置为:", estimator.intercept_)
y_predict = estimator.predict(x_test)
error = mean_squared_error(y_test, y_predict)
print("岭回归_房价预测:", y_predict)
print("岭回归_均分误差:", error)
return None
if __name__ == '__main__':
linear_Regression()
linear_SGDRegressor()
linear_Ridge()
Results
Data shape (n_samples, n_features): (506, 13)
Normal equation - coefficients: [-0.64817766 1.14673408 -0.05949444 0.74216553 -1.95515269 2.70902585
-0.07737374 -3.29889391 2.50267196 -1.85679269 -1.75044624 0.87341624
-3.91336869]
Normal equation - intercept: 22.62137203166228
Normal equation - predicted prices: [28.22944896 31.5122308 21.11612841 32.6663189 20.0023467 19.07315705
21.09772798 19.61400153 19.61907059 32.87611987 20.97911561 27.52898011
15.54701758 19.78630176 36.88641203 18.81202132 9.35912225 18.49452615
30.66499315 24.30184448 19.08220837 34.11391208 29.81386585 17.51775647
34.91026707 26.54967053 34.71035391 27.4268996 19.09095832 14.92742976
30.86877936 15.88271775 37.17548808 7.72101675 16.24074861 17.19211608
7.42140081 20.0098852 40.58481466 28.93190595 25.25404307 17.74970308
38.76446932 6.87996052 21.80450956 25.29110265 20.427491 20.4698034
17.25330064 26.12442519 8.48268143 27.50871869 30.58284841 16.56039764
9.38919181 35.54434377 32.29801978 21.81298945 17.60263689 22.0804256
23.49262401 24.10617033 20.1346492 38.5268066 24.58319594 19.78072415
13.93429891 6.75507808 42.03759064 21.9215625 16.91352899 22.58327744
40.76440704 21.3998946 36.89912238 27.19273661 20.97945544 20.37925063
25.3536439 22.18729123 31.13342301 20.39451125 23.99224334 31.54729547
26.74581308 20.90199941 29.08225233 21.98331503 26.29101202 20.17329401
25.49225305 24.09171045 19.90739221 16.35154974 15.25184758 18.40766132
24.83797801 16.61703662 20.89470344 26.70854061 20.7591883 17.88403312
24.28656105 23.37651493 21.64202047 36.81476219 15.86570054 21.42338732
32.81366203 33.74086414 20.61688336 26.88191023 22.65739323 17.35731771
21.67699248 21.65034728 27.66728556 25.04691687 23.73976625 14.6649641
15.17700342 3.81620663 29.18194848 20.68544417 22.32934783 28.01568563
28.58237108]
Normal equation - MSE: 20.6275137630954
Data shape (n_samples, n_features): (506, 13)
Gradient descent - coefficients: [-1.04364677 1.04012133 -0.32795148 1.50810169 -1.80799894 3.09581791
-0.2448096 -3.53801925 2.30450936 -1.97822406 -2.02534199 1.16418724
-4.57273239]
Gradient descent - intercept: [22.84757248]
Gradient descent - predicted prices: [29.68343132 33.4649779 21.65534113 37.2744436 20.2168242 18.04268748
21.59750332 20.06731514 20.00246026 34.58427777 21.31221344 27.30835118
14.69640661 19.56967973 39.18122401 18.80184486 8.2635644 18.38495869
32.70417996 25.2530655 17.99073665 36.43375433 34.00286866 15.41240004
37.28675323 27.76783217 36.7588099 28.86921782 17.23043897 14.94613895
32.72206925 14.58519033 39.74531945 1.17929187 15.79689483 15.24682753
2.88288234 18.79504169 46.57756401 30.9978196 26.38109308 16.25789249
43.50267386 3.27378129 21.02810353 26.12907329 24.02543035 20.40637932
19.42557732 25.20037357 6.40268287 28.92483902 34.9999772 13.97527427
5.69462171 38.11385304 33.62010227 22.88758356 17.41771121 22.68927359
23.99646882 24.89873183 20.66010263 41.27511165 26.10099596 18.57889113
11.33387277 2.76703102 47.97552732 22.43924355 14.26382243 23.72700453
44.14770686 22.07638935 39.70975664 28.56167544 21.86372232 19.92235624
26.82787882 23.7074095 33.27924587 20.85367214 27.4932459 33.15910854
27.85949469 20.03189974 30.58283679 23.02442435 27.39497594 19.64573775
26.01073143 27.23507187 18.82268992 11.08717524 12.69571468 17.36183044
25.82223584 14.28948517 19.69879807 27.95863231 19.37527804 16.43066936
24.77348089 23.95527634 21.54063409 42.00723682 15.49339583 22.4190624
34.7342756 37.27889691 20.99417894 27.31720563 26.4511438 17.42328726
22.43170768 22.11483073 28.86388862 25.99491102 24.4952229 12.63038333
11.431302 -0.59333438 30.86738312 19.96951551 23.05014807 29.56912465
29.60285661]
Gradient descent - MSE: 24.38311211436002
Data shape (n_samples, n_features): (506, 13)
Ridge - coefficients: [-0.64193209 1.13369189 -0.07675643 0.74427624 -1.93681163 2.71424838
-0.08171268 -3.27871121 2.45697934 -1.81200596 -1.74659067 0.87272606
-3.90544403]
Ridge - intercept: 22.62137203166228
Ridge - predicted prices: [28.22536271 31.50554479 21.13191715 32.65799504 20.02127243 19.07245621
21.10832868 19.61646071 19.63294981 32.85629282 20.99521805 27.5039205
15.55295503 19.79534148 36.87534254 18.80312973 9.39151837 18.50769876
30.66823994 24.3042416 19.08011554 34.10075629 29.79356171 17.51074566
34.89376386 26.53739131 34.68266415 27.42811508 19.08866098 14.98888119
30.85920064 15.82430706 37.18223651 7.77072879 16.25978968 17.17327251
7.44393003 19.99708381 40.57013125 28.94670553 25.25487557 17.75476957
38.77349313 6.87948646 21.78603146 25.27475292 20.4507104 20.47911411
17.25121804 26.12109499 8.54773286 27.48936704 30.58050833 16.56570322
9.40627771 35.52573005 32.2505845 21.8734037 17.61137983 22.08222631
23.49713296 24.09419259 20.15174912 38.49803353 24.63926151 19.77214318
13.95001219 6.7578343 42.03931243 21.92262496 16.89673286 22.59476215
40.75560357 21.42352637 36.88420001 27.18201696 21.03801678 20.39349944
25.35646095 22.27374662 31.142768 20.39361408 23.99587493 31.54490413
26.76213545 20.8977756 29.0705695 21.99584672 26.30581808 20.10938421
25.47834262 24.08620166 19.90788343 16.41215513 15.26575844 18.40106165
24.82285704 16.61995784 20.87907604 26.70640134 20.75218143 17.88976552
24.27287641 23.36686439 21.57861455 36.78815164 15.88447635 21.47747831
32.80013402 33.71367379 20.61690009 26.83175792 22.69265611 17.38149366
21.67395385 21.67101719 27.6669245 25.06785897 23.73251233 14.65355067
15.19441045 3.81755887 29.1743764 20.68219692 22.33163756 28.01411044
28.55668351]
Ridge - MSE: 20.641771606180907