【Python手写决策树并应对过度拟合问题】介绍决策树是一种用于监督学_傻大方

按关键词阅读：

Python手写决策树并应对过度拟合问题文章插图
介绍决策树是一种用于监督学习的算法。它使用树结构，其中包含两种类型的节点：决策节点和叶节点。决策节点通过在要素上询问布尔值将数据分为两个分支。叶节点代表一个类。训练过程是关于在具有特定特征的特定特征中找到"最佳"分割。预测过程是通过沿着路径的每个决策节点回答问题来从根到达叶节点。
基尼不纯度和熵术语"最佳"拆分是指拆分之后，两个分支比任何其他可能的拆分更"有序" 。我们如何定义更多有序的？这取决于我们选择哪种指标。通常，度量有两种类型：基尼不纯度和熵。这些指标越小，数据集就越"有序" 。
Python手写决策树并应对过度拟合问题文章插图
这两个指标之间的差异非常微妙。但在大多数应用中，两个指标的行为类似。以下是用于计算每个指标的代码。
def gini_impurity(y): # calculate gini_impurity given labels/classes of each example m = y.shape[0] cnts = dict(zip(*np.unique(y, return_counts = True))) impurity = 1 - sum((cnt/m)**2 for cnt in cnts.values()) return impurity def entropy(y): # calculate entropy given labels/classes of each example m = y.shape[0] cnts = dict(zip(*np.unique(y, return_counts = True))) disorder = - sum((cnt/m)*log(cnt/m) for cnt in cnts.values()) return disorder
构建树训练过程实质上是在树上建树。关键步骤是确定"最佳"分配。过程如下：我们尝试按每个功能中的每个唯一值分割数据，然后选择混乱程度最小的最佳数据。现在我们可以将此过程转换为python代码。
def get_split(X, y):# loop through features and values to find best combination with the most information gainbest_gain, best_index, best_value = http://kandian.youth.cn/index/0, None, Nonecur_gini = gini_impurity(y)n_features = X.shape[1]for index in range(n_features):values = np.unique(X[:, index], return_counts = False)for value in values:left, right = test_split(index, value, X, y)if left['y'].shape[0] == 0 or right['y'].shape[0] == 0:continuegain = info_gain(left['y'], right['y'], cur_gini)if gain > best_gain:best_gain, best_index, best_value = http://kandian.youth.cn/index/gain, index, valuebest_split = {'gain': best_gain, 'index': best_index, 'value': best_value}return best_splitdef test_split(index, value, X, y):# split a group of examples based on given index (feature) and valuemask = X[:, index] < valueleft = {'X': X[mask, :], 'y': y[mask]}right = {'X': X[~mask, :], 'y': y[~mask]}return left, rightdef info_gain(l_y, r_y, cur_gini):# calculate the information gain for a certain splitm, n = l_y.shape[0], r_y.shape[0]p = m / (m + n)return cur_gini - p * gini_impurity(l_y) - (1 - p) * gini_impurity(r_y)在构建树之前，我们先定义决策节点和叶节点。决策节点指定将在其上拆分的特征和值。它还指向左，右子项。叶节点包括类似于Counter对象的字典，该字典显示每个类有多少训练示例。这对于计算训练的准确性很有用。另外，它导致到达该叶子的每个示例的结果预测。
class Decision_Node:# define a decision nodedef __init__(self, index, value, left, right):self.index, self.value = http://kandian.youth.cn/index/index, valueself.left, self.right = left, rightclass Leaf:# define a leaf nodedef __init__(self, y):self.counts = dict(zip(*np.unique(y, return_counts = True)))self.prediction = max(self.counts.keys(), key = lambda x: self.counts[x])鉴于其结构，通过递归构造树是最方便的。递归的出口是叶节点。当我们无法通过拆分提高数据纯度时，就会发生这种情况。如果我们可以找到"最佳"拆分，则这将成为决策节点。然后，我们对其左，右子级递归执行相同的操作。
def decision_tree(X, y):# train the decision tree model with a datasetcorrect_prediction = 0def build_tree(X, y):# recursively build the treesplit = get_split(X, y)if split['gain'] == 0:nonlocal correct_predictionleaf = Leaf(y)correct_prediction += leaf.counts[leaf.prediction]return leafleft, right = test_split(split['index'], split['value'], X, y)left_node = build_tree(left['X'], left['y'])right_node = build_tree(right['X'], right['y'])return Decision_Node(split['index'], split['value'], left_node, right_node)root = build_tree(X, y)return correct_prediction/y.shape[0], root预测现在我们可以遍历树直到叶节点来预测一个示例。
def predict(x, node):if isinstance(node, Leaf):return node.predictionif x[node.index] < node.value:return predict(x, node.left)else:return predict(x, node.right)事实证明，训练精度为100％，决策边界看起来很奇怪！显然，该模型过度拟合了训练数据。好吧，如果考虑到这一点，如果我们继续拆分直到数据集变得更纯净，决策树将过度适合数据。换句话说，如果我们不停止分裂，该模型将正确分类每个示例！训练准确性为100％（除非具有完全相同功能的不同类别的示例），这丝毫不令人惊讶。

稿源：(未知)

【傻大方】网址：http://www.shadafang.com/c/111T314032020.html

标题：Python手写决策树并应对过度拟合问题

傻大方

Python手写决策树并应对过度拟合问题

扬子晚报|墙砖脱落砸坏违停车辆，谁该为6万元修车费“买单”？

每日经济新闻|古鳌科技：董事姜小丹、章祥余减持计划届满，未减持

「央视」菲律宾驻瑞典名誉总领事因感染新冠肺炎去世

为啥有些动画很受欢迎光盘卖的也很好，但就是不出第二季

盗墓笔记|盗墓笔记秦岭神树剧情解密原著的坑终于填上了

可爱|抓史莱姆的萝莉来了，看颜值和可爱度，萌王也会答应

百年战争英法百年战争是如何爆发的?

直播|诸暨主播太给力！光一样特产，线上销售就突破了100亿！

『天乐爱游戏』高热度英雄吕布为何胜率不高？提高胜率你需要学会合理开团

研究人员开发AI国际象棋引擎“玛雅” 可帮棋手识别技能错误