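Random Forest

Ensemble learning methods

Ensemble learning solves a single prediction problem by combining several models. It works by training multiple classifiers/models, each of which learns and predicts independently. Their predictions are then combined into one prediction, which is usually better than the prediction of any single classifier on its own.

Random forest

In machine learning, a random forest is a classifier made up of many decision trees; the class it outputs is the mode (the most common value) of the classes output by the individual trees.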
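The "mode of the individual trees" rule can be sketched in a few lines. This is only an illustration: `majority_vote` and the `votes` list are made-up names and data, not part of any library. Note also that scikit-learn's RandomForestClassifier combines its trees by averaging their predicted class probabilities rather than by a hard vote like this one.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the most common class label among the per-tree predictions."""
    label, _count = Counter(tree_predictions).most_common(1)[0]
    return label


# Hypothetical predictions from five trees for a single sample
votes = [1, 0, 1, 1, 0]
print(majority_vote(votes))  # 1 (three votes against two)
```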

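How a random forest is built

  • Let N be the number of training examples (samples) and M the number of features.
    • 1. Pick one sample at random and repeat this N times (the same sample may be picked more than once).
    • 2. Randomly select m features, with m << M, and build a decision tree on them.
  • Use bootstrap sampling

    Random training set: draw N samples from the N available samples, at random and with replacement
    Bootstrap: random sampling with replacement
    Random features: draw m of the M features
    Because m << M, each tree works on far fewer features, which acts somewhat like dimensionality reduction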
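The two sources of randomness above can be sketched with NumPy. This follows the per-tree description in this section and is only an illustration: `draw_tree_sample` and the toy arrays are assumed names and data. scikit-learn itself draws a fresh candidate feature subset at every split rather than once per tree.

```python
import numpy as np


def draw_tree_sample(X, y, m, rng):
    """Draw one bootstrap sample of rows and a random subset of m feature columns."""
    n_samples, n_features = X.shape
    # Rows: N draws out of N, with replacement (duplicates are expected)
    row_idx = rng.integers(0, n_samples, size=n_samples)
    # Columns: m distinct features out of M
    col_idx = rng.choice(n_features, size=m, replace=False)
    return X[row_idx][:, col_idx], y[row_idx], col_idx


rng = np.random.default_rng(0)
X = np.arange(40).reshape(8, 5)          # toy data: N = 8 samples, M = 5 features
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])
X_sub, y_sub, cols = draw_tree_sample(X, y, m=2, rng=rng)
print(X_sub.shape)  # (8, 2)
```

Each tree in the forest would get its own call to this function, and therefore its own rows and columns.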

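Why use bootstrap sampling

  • Why sample the training set at random?
    • Without random sampling, every tree would be trained on the same training set, so the trained trees would all produce the same classifications.
  • Why sample with replacement?
    • If the samples were not drawn with replacement (for example, if the data were split into disjoint pieces, one per tree), every tree would be trained on different samples with no overlap at all. Each tree would then be "biased" and entirely "one-sided" (this wording may not be exact); in other words, the trained trees would differ from each other a great deal. The final classification of a random forest, however, depends on a vote among many trees (weak classifiers), which works better when the trees share a common view of the data.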
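A quick simulation shows how much overlap sampling with replacement gives. For large N, a bootstrap sample contains roughly 1 - 1/e ≈ 63.2% of the distinct original samples, so any two trees see overlapping but not identical data. The sample size and seed below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# One bootstrap sample: n draws out of n indices, with replacement
idx = rng.integers(0, n, size=n)

# Fraction of the original samples that appear at least once
unique_fraction = np.unique(idx).size / n
print(f"{unique_fraction:.3f}")  # expected to be close to 1 - 1/e, about 0.632
```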

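API

  • class sklearn.ensemble.RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=None, bootstrap=True, random_state=None, min_samples_split=2)
    • Random forest classifier
    • n_estimators: integer, optional (default = 100; the default was 10 before scikit-learn 0.22). Number of trees in the forest. Values worth trying: 120, 200, 300, 500, 800, 1200
    • criterion: string, optional (default = "gini"). Function used to measure the quality of a split
    • max_depth: integer or None, optional (default = None). Maximum depth of a tree. Values worth trying: 5, 8, 15, 25, 30
    • max_features: number of features considered when looking for the best split (default = "sqrt" for the classifier in recent releases)
      • If "sqrt", then max_features=sqrt(n_features).
      • If "log2", then max_features=log2(n_features).
      • If None, then max_features=n_features.
      • "auto" was an older alias for "sqrt"; it was deprecated in scikit-learn 1.1 and removed in 1.3.
    • bootstrap: boolean, optional (default = True). Whether to sample with replacement when building the trees
    • min_samples_split: minimum number of samples required to split a node
    • min_samples_leaf: minimum number of samples required at a leaf node
  • Hyperparameters to tune: n_estimators, max_depth, min_samples_split, min_samples_leaf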
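As a minimal sketch of how these parameters fit together, the following trains a forest on synthetic data. The dataset from `make_classification` and the specific parameter values are assumptions chosen only to keep the example self-contained and fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the Titanic data used later)
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(
    n_estimators=120,       # number of trees
    criterion="gini",       # split quality measure
    max_depth=5,            # limit the depth of each tree
    max_features="sqrt",    # features considered at each split
    bootstrap=True,         # sample rows with replacement
    min_samples_split=2,
    min_samples_leaf=1,
    random_state=0,
)
clf.fit(X_train, y_train)

test_accuracy = clf.score(X_test, y_test)   # accuracy on held-out data
print(len(clf.estimators_))                 # 120 fitted trees
print(test_accuracy)
```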

Code

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction import DictVectorizer
from sklearn.ensemble import RandomForestClassifier

# A random forest is a collection of trees; the final decision is the one
# that the majority of trees vote for.
# The randomness comes from two sources:
# 1: every tree is trained on a different training set
# 2: the features are randomized: each tree gets a random subset of the features


def load_data():
    data = pd.read_csv("titanic.csv")
    titanic = data.copy()

    # Option 1: drop the rows whose age is missing (accuracy was slightly higher)
    # real_data = titanic[["pclass", "age", "sex", "survived"]].dropna(subset=["age"])
    # x = real_data[["pclass", "age", "sex"]].to_dict(orient="records")
    # y = real_data["survived"]

    # Option 2: fill the missing ages with a non-zero value (the mean age)
    x = titanic[["pclass", "age", "sex"]].copy()  # keep only these features
    y = titanic["survived"]  # target
    # Assign the result back instead of using inplace=True on a slice
    x["age"] = x["age"].fillna(x["age"].mean())
    x = x.to_dict(orient="records")

    x_train, x_test, y_train, y_test = train_test_split(x, y.astype("int"), random_state=22)
    return x_train, x_test, y_train, y_test


def titanic_random_forest_test():
    x_train, x_test, y_train, y_test = load_data()

    # One-hot encode the dict records (the "sex" column becomes two columns)
    transfer = DictVectorizer()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)

    # bootstrap defaults to True, i.e. sampling with replacement by default
    estimator = RandomForestClassifier()

    param_dict = {"n_estimators": [120, 200, 300, 500, 800, 1200],
                  "max_depth": [5, 8, 15, 25, 30]}
    estimator = GridSearchCV(estimator, param_grid=param_dict, cv=3)
    estimator.fit(x_train, y_train)  # training features and targets

    # Predict on the test set with the fitted estimator
    y_predict = estimator.predict(x_test)
    print("Predicted values:", y_predict, "\nTrue values:", y_test, "\nComparison:", y_test == y_predict)
    # Note: this score is computed on the training set; pass (x_test, y_test)
    # instead to measure accuracy on held-out data.
    score = estimator.score(x_train, y_train)
    print("Accuracy (training set): ", score)
    # ------------------
    print("Best parameters:\n", estimator.best_params_)
    print("Best score:\n", estimator.best_score_)
    print("Best estimator:\n", estimator.best_estimator_)
    print("Cross-validation results:\n", estimator.cv_results_)

    return None


if __name__ == '__main__':
    titanic_random_forest_test()


Results

Predicted values: [0 0 0 0 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 1 0 1 0 0 0
0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0
0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1
0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 1
1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 1 0 0 0 0 0 1 1 1 1 0 0 0 0 0
0 0 0 1 1 1 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 1 0
0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 0 0 1 0 0 0 0 1 0 1 0 0 0 0 1 0
0 1 1 1 0 0 1 1 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 1 0 0 0 0 1]
True values: 831 0
261 0
1210 0
1155 0
255 1
..
1146 0
1125 1
386 0
1025 1
337 1
Name: survived, Length: 329, dtype: int32
Comparison: 831 True
261 True
1210 True
1155 True
255 True
...
1146 True
1125 False
386 True
1025 False
337 True
Name: survived, Length: 329, dtype: bool
Accuracy (training set): 0.8556910569105691
Best parameters:
{'max_depth': 5, 'n_estimators': 200}
Best score:
0.8373983739837398
Best estimator:
RandomForestClassifier(max_depth=5, n_estimators=200)
Cross-validation results:
{'mean_fit_time': array([0.1512816 , 0.25164779, 0.3653659 , 0.59906618, 0.9846971 ,
1.5066336 , 0.16621105, 0.29521887, 0.42253629, 0.70842552,
1.12597275, 1.67518139, 0.17654061, 0.29887748, 0.47307809,
0.78291416, 1.23968085, 1.91818857, 0.19313955, 0.31615305,
0.50930373, 0.84141755, 1.27824593, 1.95342708, 0.20079589,
0.32812317, 0.51229493, 0.8038369 , 1.27159723, 2.06746443]), 'std_fit_time': array([0.00618176, 0.01674228, 0.00287362, 0.02233938, 0.02145797,
0.0500205 , 0.006941 , 0.02374298, 0.02001825, 0.01697455,
0.02561221, 0.02773495, 0.00706825, 0.0097203 , 0.01953892,
0.03003017, 0.02405401, 0.0600764 , 0.01076438, 0.00826463,
0.03126825, 0.04336479, 0.04766072, 0.0891865 , 0.0072951 ,
0.01954262, 0.03557324, 0.01625109, 0.04509546, 0.10045004]), 'mean_score_time': array([0.0136106 , 0.02394764, 0.03123895, 0.05086056, 0.08178171,
0.1479373 , 0.01330829, 0.02061256, 0.03191408, 0.05386551,
0.07881149, 0.12000084, 0.01428199, 0.0236036 , 0.03556124,
0.05583986, 0.09441503, 0.14031251, 0.01430535, 0.02460194,
0.03723288, 0.06083274, 0.09740559, 0.13731035, 0.01529272,
0.02394597, 0.03588239, 0.05917549, 0.11535589, 0.15924136]), 'std_score_time': array([4.55860355e-04, 1.42495688e-03, 4.77847671e-04, 1.43162038e-03,
1.62907906e-03, 3.30000195e-02, 9.48899495e-04, 4.71360921e-04,
9.79807218e-07, 2.83507804e-03, 7.99722526e-04, 1.86659964e-03,
1.88923948e-03, 9.40436806e-04, 1.69784821e-03, 2.80456373e-03,
3.29008749e-03, 8.03975026e-03, 4.62616384e-04, 1.24356052e-03,
1.24470740e-03, 7.24240892e-03, 2.34932205e-03, 7.31379512e-03,
9.65791932e-04, 8.13474906e-04, 8.27468889e-04, 9.40380665e-04,
3.75948645e-02, 3.26989808e-02]), 'param_max_depth': masked_array(data=[5, 5, 5, 5, 5, 5, 8, 8, 8, 8, 8, 8, 15, 15, 15, 15, 15,
15, 25, 25, 25, 25, 25, 25, 30, 30, 30, 30, 30, 30],
mask=[False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False],
fill_value='?',
dtype=object), 'param_n_estimators': masked_array(data=[120, 200, 300, 500, 800, 1200, 120, 200, 300, 500, 800,
1200, 120, 200, 300, 500, 800, 1200, 120, 200, 300,
500, 800, 1200, 120, 200, 300, 500, 800, 1200],
mask=[False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False],
fill_value='?',
dtype=object), 'params': [{'max_depth': 5, 'n_estimators': 120}, {'max_depth': 5, 'n_estimators': 200}, {'max_depth': 5, 'n_estimators': 300}, {'max_depth': 5, 'n_estimators': 500}, {'max_depth': 5, 'n_estimators': 800}, {'max_depth': 5, 'n_estimators': 1200}, {'max_depth': 8, 'n_estimators': 120}, {'max_depth': 8, 'n_estimators': 200}, {'max_depth': 8, 'n_estimators': 300}, {'max_depth': 8, 'n_estimators': 500}, {'max_depth': 8, 'n_estimators': 800}, {'max_depth': 8, 'n_estimators': 1200}, {'max_depth': 15, 'n_estimators': 120}, {'max_depth': 15, 'n_estimators': 200}, {'max_depth': 15, 'n_estimators': 300}, {'max_depth': 15, 'n_estimators': 500}, {'max_depth': 15, 'n_estimators': 800}, {'max_depth': 15, 'n_estimators': 1200}, {'max_depth': 25, 'n_estimators': 120}, {'max_depth': 25, 'n_estimators': 200}, {'max_depth': 25, 'n_estimators': 300}, {'max_depth': 25, 'n_estimators': 500}, {'max_depth': 25, 'n_estimators': 800}, {'max_depth': 25, 'n_estimators': 1200}, {'max_depth': 30, 'n_estimators': 120}, {'max_depth': 30, 'n_estimators': 200}, {'max_depth': 30, 'n_estimators': 300}, {'max_depth': 30, 'n_estimators': 500}, {'max_depth': 30, 'n_estimators': 800}, {'max_depth': 30, 'n_estimators': 1200}], 'split0_test_score': array([0.82926829, 0.82926829, 0.82621951, 0.83536585, 0.82926829,
0.83231707, 0.81402439, 0.80792683, 0.81402439, 0.81402439,
0.80792683, 0.81402439, 0.79573171, 0.79878049, 0.79573171,
0.79573171, 0.79573171, 0.79573171, 0.78963415, 0.79268293,
0.79573171, 0.79573171, 0.79268293, 0.79268293, 0.79573171,
0.80487805, 0.79878049, 0.79878049, 0.80182927, 0.79268293]), 'split1_test_score': array([0.85365854, 0.85670732, 0.85060976, 0.85060976, 0.85060976,
0.85060976, 0.85060976, 0.85670732, 0.85365854, 0.84756098,
0.85365854, 0.85365854, 0.84146341, 0.85365854, 0.84756098,
0.85365854, 0.85365854, 0.85060976, 0.84756098, 0.85060976,
0.85670732, 0.85365854, 0.85365854, 0.85365854, 0.85060976,
0.84756098, 0.85365854, 0.84756098, 0.85365854, 0.85670732]), 'split2_test_score': array([0.82621951, 0.82621951, 0.82621951, 0.82621951, 0.82621951,
0.82621951, 0.80487805, 0.80182927, 0.80182927, 0.80182927,
0.80182927, 0.80182927, 0.80487805, 0.80487805, 0.79573171,
0.80487805, 0.80487805, 0.80182927, 0.80182927, 0.80487805,
0.80182927, 0.80182927, 0.80487805, 0.80487805, 0.80182927,
0.80182927, 0.80487805, 0.80182927, 0.80182927, 0.80487805]), 'mean_test_score': array([0.83638211, 0.83739837, 0.83434959, 0.83739837, 0.83536585,
0.83638211, 0.82317073, 0.82215447, 0.82317073, 0.82113821,
0.82113821, 0.82317073, 0.81402439, 0.81910569, 0.81300813,
0.81808943, 0.81808943, 0.81605691, 0.81300813, 0.81605691,
0.81808943, 0.81707317, 0.81707317, 0.81707317, 0.81605691,
0.81808943, 0.81910569, 0.81605691, 0.81910569, 0.81808943]), 'std_test_score': array([0.01227952, 0.0137101 , 0.01149767, 0.01006046, 0.01085069,
0.01036386, 0.01975836, 0.02455904, 0.02212555, 0.01933567,
0.02312969, 0.02212555, 0.01975836, 0.02455904, 0.02443255,
0.02542682, 0.02542682, 0.02455904, 0.02493464, 0.02493464,
0.0274202 , 0.02598925, 0.02634447, 0.02634447, 0.02455904,
0.02087667, 0.02455904, 0.02231148, 0.02443255, 0.02775711]), 'rank_test_score': array([ 3, 1, 6, 1, 5, 3, 7, 10, 8, 11, 11, 8, 28, 14, 29, 16, 16,
24, 29, 24, 16, 21, 21, 21, 24, 16, 14, 24, 13, 16])}
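The raw cv_results_ dictionary above is hard to read. One common way to inspect it is to load it into a pandas DataFrame and keep only a few columns. The sketch below uses a small synthetic dataset and a reduced parameter grid (both assumptions, chosen so it runs quickly); the column names are the ones that appear in the dictionary above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data and a deliberately tiny grid
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 20], "max_depth": [3, 5]},
    cv=3,
)
search.fit(X, y)

# One row per parameter combination, best-ranked first
results = pd.DataFrame(search.cv_results_)
summary = results[
    ["param_max_depth", "param_n_estimators",
     "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(summary.to_string(index=False))
```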

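Summary

  • Among the algorithms in common use, random forests often achieve very good accuracy
  • They run efficiently on large datasets and can handle input samples with high-dimensional features, without requiring dimensionality reduction first
  • They can estimate how important each feature is to the classification problem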
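The last point can be illustrated with the feature_importances_ attribute of a fitted scikit-learn forest. This is a sketch on synthetic data; the `feature_names` list holds made-up labels used only for display.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 5 features, only 2 of which carry signal (an assumption)
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(5)]  # hypothetical labels

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances, normalized so that they sum to 1
ranked = sorted(
    zip(feature_names, clf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

With the Titanic code above, the names could instead come from the fitted DictVectorizer, and the forest from the grid search's best_estimator_.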