Semi-Supervised Learning with Only Positive and Unlabeled Data (PU Learning), Part 2. Author: Alon Agmon. Translated by: ronghuaiya

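(3) Use the classifier trained in step (1) to estimate the probability that sample k is labeled, P(s=1|k).
(4) Once we have estimated P(s=1|k), divide it by P(s=1|y=1), which was estimated in step (2), to get the actual probability that k belongs to the positive class.
Now let's write the code and test it. Steps 1-4 can be implemented as follows: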
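The division in step (4) can be justified with a short derivation. This is a sketch under two assumptions that are not restated in this part of the article: only positive samples are ever labeled (so s=1 implies y=1), and the labeled positives are chosen at random from all positives, independently of their features.

```latex
\begin{aligned}
P(s=1 \mid k) &= P(y=1,\, s=1 \mid k)
  && \text{(only positives are labeled)} \\
&= P(y=1 \mid k)\, P(s=1 \mid y=1,\, k)
  && \text{(chain rule)} \\
&= P(y=1 \mid k)\, P(s=1 \mid y=1)
  && \text{(labeling is independent of } k \text{ given } y=1\text{)} \\
\Longrightarrow\quad
P(y=1 \mid k) &= \frac{P(s=1 \mid k)}{P(s=1 \mid y=1)}
\end{aligned}
```

If the second assumption does not hold for your data, the constant P(s=1|y=1) no longer applies uniformly to every sample and the ratio is only an approximation.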
    # prepare data
    x_data = ...  # the training set
    y_data = ...  # target var (1 for the positives, a value other than 1 for the rest)
    # fit the classifier and estimate P(s=1|y=1)
    # Estimator() stands in for whichever classifier you choose
    classifier, ps1y1 = fit_PU_estimator(x_data, y_data, 0.2, Estimator())
    # estimate the prob that x_data is labeled, P(s=1|X)
    predicted_s = classifier.predict_proba(x_data)
    # estimate the actual probabilities that X is positive
    # by calculating P(s=1|X) / P(s=1|y=1)
    predicted_y = predicted_s / ps1y1

Let's start with the fit_PU_estimator() method.
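fit_PU_estimator() does two main things: it fits the classifier of your choice on a training set of positive and unlabeled samples, and then it estimates the probability that a positive sample is labeled. Accordingly, it returns the fitted classifier (which has learned to estimate the probability that a given sample is labeled) and the estimated probability P(s=1|y=1). After that, all we need to do is find P(s=1|x), the probability that x is labeled. Since that is exactly what the classifier was trained to do, we only need to call its predict_proba() method. Finally, to actually classify sample x, we divide the result by the P(s=1|y=1) we have already found. In code, this looks like: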
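    import xgboost as xgb

    pu_estimator, probs1y1 = fit_PU_estimator(x_train, y_train, 0.2, xgb.XGBClassifier())
    # predict_proba returns one column per class; column 1 is P(s=1|x)
    predicted_s = pu_estimator.predict_proba(x_train)
    predicted_s = predicted_s[:, 1]
    predicted_y = predicted_s / probs1y1

Implementing the fit_PU_estimator() method itself is quite simple: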
    import numpy as np


    def fit_PU_estimator(X, y, hold_out_ratio, estimator):
        # The training set will be divided into a fitting set that will be used
        # to fit the estimator in order to estimate P(s=1|X), and a held-out set
        # of positive samples that will be used to estimate P(s=1|y=1)
        # --------
        # find the indices of the positive/labeled elements
        assert isinstance(y, np.ndarray), "Must pass np.ndarray rather than list as y"
        positives = np.where(y == 1.)[0]
        # hold_out_size = the *number* of positive/labeled samples
        # that we will use later to estimate P(s=1|y=1)
        hold_out_size = int(np.ceil(len(positives) * hold_out_ratio))
        np.random.shuffle(positives)
        # hold_out = the *indices* of the positive elements
        # that we will later use to estimate P(s=1|y=1)
        hold_out = positives[:hold_out_size]
        # the actual positive *elements* that we will keep aside
        X_hold_out = X[hold_out]
        # remove the held-out elements from X and y
        X = np.delete(X, hold_out, 0)
        y = np.delete(y, hold_out)
        # We fit the estimator on the unlabeled samples + (part of the) positive
        # and labeled ones, in order to estimate P(s=1|X), i.e. the probability
        # that an element is *labeled*
        estimator.fit(X, y)
        # We then use the estimator to predict the positive held-out set
        # in order to estimate P(s=1|y=1)
        hold_out_predictions = estimator.predict_proba(X_hold_out)
        # take the probability that it is 1
        hold_out_predictions = hold_out_predictions[:, 1]
        # save the mean probability
        c = np.mean(hold_out_predictions)
        return estimator, c


    def predict_PU_prob(X, estimator, prob_s1y1):
        prob_pred = estimator.predict_proba(X)
        prob_pred = prob_pred[:, 1]
        return prob_pred / prob_s1y1

To test this, I used the [Bank Note Authentication dataset](+ Authentication), which is based on 4 features extracted from images of genuine and forged banknotes. First, I trained a classifier on the fully labeled dataset to set a baseline. Then I removed the labels from 75% of the positive samples (leaving 153 of 610 labeled) to test how the approach performs on a P&U dataset. As the output shows, this dataset is not the hardest to classify, but you can see that although the PU classifier only "knows" about 153 positive samples, while the remaining 1219 samples are unlabeled, it performs almost as well as the classifier that had access to all the labels. It did, however, lose about 17 percentage points of recall, and so missed quite a few positive samples. Either way, I believe these results are quite satisfying compared with the alternatives.
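    ===>> load data set <<===
    data size: (1372, 5)
    Target variable (fraud or not):
    0    762
    1    610
    ===>> create baseline classification results <<===
    Classification results:
    f1: 99.57%
    roc: 99.57%
    recall: 99.15%
    precision: 100.00%
    ===>> classify on all the data set <<===
    Target variable (labeled or not):
    -1    1219
     1     153
    Classification results:
    f1: 90.24%
    roc: 91.11%
    recall: 82.62%
    precision: 99.41%

A few key points. First, the performance of this approach depends heavily on the size of the dataset. In this example I used about 150 positive samples and about 1200 unlabeled ones, which is far from an ideal dataset for this approach. If we had only 100 samples, for example, our classifier would perform very poorly. Second, as the attached notebook shows, there are several variables to tune (such as the size of the sample to set aside and the probability threshold used for classification), but the most important are probably the chosen classifier and its parameters. I chose XGBoost because it performs relatively well on small datasets with few features, but it will not perform best in every scenario, so it is important to test for the right classifier.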
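The effect of the probability threshold can be explored with a small sweep. The sketch below is illustrative only: `pu_probabilities` and `precision_recall` are hypothetical helpers, not functions from the notebook, and the scores, true classes, and the value 0.25 for P(s=1|y=1) are made up. It also caps the scaled scores at 1, because dividing P(s=1|x) by an estimated constant can produce values above 1.

```python
import numpy as np


def pu_probabilities(predicted_s, prob_s1y1):
    """Scale P(s=1|x) by 1 / P(s=1|y=1) and cap the result at 1."""
    predicted_s = np.asarray(predicted_s, dtype=float)
    return np.clip(predicted_s / prob_s1y1, 0.0, 1.0)


def precision_recall(y_true, y_pred):
    """Precision and recall for binary 0/1 arrays."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return float(precision), float(recall)


# Made-up values for illustration, not results from the notebook.
predicted_s = np.array([0.30, 0.24, 0.15, 0.10, 0.05, 0.02])  # P(s=1|x)
y_true = np.array([1, 1, 1, 0, 1, 0])  # true classes, known only in a benchmark
prob_s1y1 = 0.25  # assumed estimate of P(s=1|y=1)

predicted_y = pu_probabilities(predicted_s, prob_s1y1)

# A lower threshold recovers more positives at the cost of precision.
for threshold in (0.15, 0.5, 0.7):
    y_pred = (predicted_y >= threshold).astype(int)
    precision, recall = precision_recall(y_true, y_pred)
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
```

A sweep like this requires ground-truth labels, so in practice it is only possible on a benchmark where the labels were hidden deliberately, as in the banknote experiment above.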