NLP: Math with Words (Part 3)

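Specifically, the number of times a word occurs in a given document is called its term frequency, usually abbreviated TF. In some examples, the count of a word is divided by the total number of terms in the document to give a normalized term frequency [1].
In the example above, the four top-ranked terms or tokens are "the", ",", "harry", and "faster". But "the" and the punctuation mark "," carry little information about the intent of the document, and such uninformative tokens are likely to show up many times during our quick exploration. For this example, we strip them out using a standard English stop word list and a punctuation list. We won't always do this later, but for now it simplifies the problem. That leaves "harry" and "faster" as the top-ranked tokens in the term frequency vector (bag of words).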
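The filtering step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the sentence, the regex tokenizer, and the tiny `stop_words` set are hypothetical stand-ins, not the stop word list or tokenizer the text actually uses.

```python
import re
import string
from collections import Counter

# Illustrative sentence (an assumption; the original sentence is not shown in this section).
sentence = ("The faster Harry got to the store, the faster Harry, "
            "the faster, would get home.")

# Simple regex tokenizer: runs of word characters, or single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", sentence.lower())

# Tiny hypothetical stop word list; a standard English list is much longer.
stop_words = {"the", "to", "would", "a", "an", "and", "of"}
punctuation = set(string.punctuation)

# Raw bag of words, then the same bag with stop words and punctuation removed.
raw_counts = Counter(tokens)
filtered_counts = Counter(
    tok for tok in tokens
    if tok not in stop_words and tok not in punctuation
)

print(raw_counts.most_common(4))
print(filtered_counts.most_common(2))  # [('faster', 3), ('harry', 2)]
```

With the uninformative tokens gone, the content-bearing words rise to the top of the ranking.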
Next, compute the term frequency of "harry" from the Counter object (bag_of_words) defined above.
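>>> times_harry_appears = bag_of_words['harry']
>>> num_unique_words = len(bag_of_words)  # number of unique tokens in the original sentence (not the total token count)
>>> tf = times_harry_appears / num_unique_words
>>> round(tf, 4)
0.1818

Let's pause here and take a closer look at normalized term frequency, a term used throughout this book. It is the word count tempered by the length of the document. But why temper it? Suppose the word "dog" occurs 3 times in document A and 100 times in document B. Clearly, "dog" seems more important to document B. But wait! Document A is a 30-word email to a veterinarian, while document B is War and Peace, a tome of roughly 580,000 words. So our initial analysis should be exactly reversed: "dog" is more important to document A. The following calculations take document length into account:
TF("dog", document A) = 3/30 = 0.1
TF("dog", document B) = 100/580,000 ≈ 0.00017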
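The arithmetic above can be reproduced with a small helper. This is a minimal sketch; the function name `normalized_tf` is hypothetical and simply divides a raw count by the document length in words.

```python
def normalized_tf(term_count: int, doc_length: int) -> float:
    """Return the raw count of a term divided by the document length."""
    if doc_length <= 0:
        raise ValueError("doc_length must be positive")
    return term_count / doc_length

# Document A: a 30-word email in which "dog" occurs 3 times.
tf_a = normalized_tf(3, 30)
# Document B: a novel of about 580,000 words in which "dog" occurs 100 times.
tf_b = normalized_tf(100, 580_000)

print(round(tf_a, 4))  # 0.1
print(round(tf_b, 5))  # 0.00017
print(tf_a > tf_b)     # True
```

Although the raw count is far higher in document B, the normalized value is several hundred times larger for document A.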
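Now we have something that describes the two documents, their relationship to the word "dog", and their relationship to each other. So instead of raw word counts, we use normalized term frequencies to describe the documents in a corpus. Similarly, we can compute the relative importance of each word to a document. Clearly, the protagonist Harry and his need for speed are central to the story in the document. We have done a lot of work to turn text into numbers, going beyond merely recording whether a particular word is present. Admittedly, this is a contrived example, but it quickly shows how meaningful the results of this approach can be. Now consider a longer piece of text, the first few paragraphs of the Wikipedia article on kites: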
A kite is traditionally a tethered heavier-than-air craft with wing surfaces that react against the air to create lift and drag. A kite consists of wings, tethers, and anchors. Kites often have a bridle to guide the face of the kite at the correct angle so the wind can lift it. A kite’s wing also may be so designed so a bridle is not needed; when kiting a sailplane for launch, the tether meets the wing at a single point. A kite may have fixed or moving anchors. Untraditionally in technical kiting, a kite consists of tether-set-coupled wing sets; even in technical kiting, though, a wing in the system is still often called the kite.
The lift that sustains the kite in flight is generated when air flows around the kite’s surface, producing low pressure above and high pressure below the wings. The interaction with the wind also generates horizontal drag along the direction of the wind. The resultant force vector from the lift and drag force components is opposed by the tension of one or more of the lines or tethers to which the kite is attached. The anchor point of the kite line may be static or moving (such as the towing of a kite by a running person, boat, free-falling anchors as in paragliders and fugitive parakites or vehicle).
The same principles of fluid flow apply in liquids and kites are also used under water. A hybrid tethered craft comprising both a lighter-than-air balloon as well as a kite lifting surface is called a kytoon.
Kites have a long and varied history and many different types are flown individually and at festivals worldwide. Kites may be flown for recreation, art or other practical uses. Sport kites can be flown in aerial ballet, sometimes as part of a competition. Power kites are multi-line steerable kites designed to generate large forces which can be used to power activities such as kite surfing, kite landboarding, kite fishing, kite buggying and a new trend snow kiting. Even Man-lifting kites have been made.
Source: Wikipedia
Then assign the text to a variable:
>>> from collections import Counter
>>> from nltk.tokenize import TreebankWordTokenizer
>>> tokenizer = TreebankWordTokenizer()
>>> from nlpia.data.loaders import kite_text  # same text as above: kite_text = "A kite is traditionally …"
>>> tokens = tokenizer.tokenize(kite_text.lower())
>>> token_counts = Counter(tokens)
>>> token_counts
Counter({'the': 26, 'a': 20, 'kite': 16, ',': 15, ...})
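For readers without the nlpia package installed, the same counting idea can be tried on a short excerpt. The sketch below rests on assumptions: it uses only the first two sentences of the kite text, and a simple regex tokenizer stands in for TreebankWordTokenizer, so its counts will not match the full output shown above. It also turns the raw counts into normalized term frequencies by dividing by the total number of tokens.

```python
import re
from collections import Counter

# First two sentences of the kite text quoted above (an excerpt, not the full text).
excerpt = (
    "A kite is traditionally a tethered heavier-than-air craft with wing "
    "surfaces that react against the air to create lift and drag. "
    "A kite consists of wings, tethers, and anchors."
)

# Stand-in tokenizer (an assumption): hyphenated words stay whole,
# and each remaining non-space, non-letter character is its own token.
tokens = re.findall(r"[a-z]+(?:-[a-z]+)*|[^\sa-z]", excerpt.lower())

token_counts = Counter(tokens)
total_tokens = len(tokens)

# Normalized term frequency: raw count divided by the total token count.
tf = {tok: count / total_tokens for tok, count in token_counts.items()}

for tok, count in token_counts.most_common(3):
    print(tok, count, round(tf[tok], 4))
```

Because every token's count is divided by the same total, the normalized frequencies sum to 1 for the excerpt.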