Creating Anime Stories with Deep Learning Models: Comparing LSTM and GPT2 Text Generation

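This project grew out of wanting to see how far NLP has come in just a few years, especially when it comes to generating creative content. By generating anime synopses, I explore two text generation techniques: first a comparatively old LSTM, then a fine-tuned GPT2.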

In this article you will see how the AI went from producing this nonsense...

A young woman capable : a neuroi laborer of the human , where one are sent back home ? after defeating everything being their resolve the school who knows if all them make about their abilities . however of those called her past student tar barges together when their mysterious high artist are taken up as planned while to eat to fight !

...to this piece of art:

A young woman named Haruka is a high school student who has a crush on a mysterious girl named Miki. She is the only one who can remember the name of the girl, and she is determined to find out who she really is.

To follow this article, you should be familiar with the following:

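- Python programming
- PyTorch
- How RNNs work
- Transformers

All right, let's look at some code!

Data description

The data used here was scraped from myanimelist. It originally contained over 16,000 data points and was a very messy dataset, so I took the following steps to clean it:

- Removed all the odd anime genres (if you are an anime fan, you will know what I am talking about).
- Each synopsis ended with its source (for example, source: myanimelist, source: crunchyroll), so I removed that as well.
- Anime based on video games, spin-offs, or adaptations had very short synopses, so I removed every synopsis shorter than 30 words, along with every synopsis containing "spin-off", "based on", "music video", or "adaptation". The reasoning is that these kinds of anime will not really help our model be creative.
- Removed anime whose synopsis was longer than 300 words. This is only to make training easier (see the GPT2 section for details).
- Removed symbols.
- Some descriptions also contained Japanese text, so that was removed too.

The LSTM way

The traditional approach to text generation uses recurrent LSTM units. LSTM (Long Short-Term Memory) is designed specifically to capture long-term dependencies in sequential data, which plain RNNs struggle to do. It does this with several gates that control what information is passed from one time step to the next. Intuitively, at a given time step the information reaching the LSTM cell passes through these gates, which decide whether the stored information needs updating; if it is updated, the old information is forgotten and the new value is sent on to the next time step. For a more detailed look at LSTMs, see some of the other articles we have published.

Creating the dataset

Before building the model architecture, we have to tokenize the synopses and process them into a form the model accepts. In text generation the input and output are the same sequence, except that the output tokens are shifted one step to the right, so the target at each position is the token that follows the corresponding input token. This basically means the model takes the past words as input and predicts the next word. Input and output tokens are passed to the model in batches, and each batch has a fixed sequence length. I followed these steps to create the dataset: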

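- Create a config class.
- Join all the synopses together.
- Tokenize the combined text.
- Define the number of batches.
- Create the vocabulary and the word-to-index and index-to-word dictionaries.
- Create the output tokens by shifting the input tokens to the right.
- Create a generator function that yields input and output sequences batch by batch.

    # code courtesy: machinetalk/2019/02/08/text-generation-with-pytorch/
    import nltk
    import numpy as np
    import torch
    import torch.nn as nn

    class config:
        # stores the required hyperparameters and the tokenizer for easy access
        tokenizer = nltk.word_tokenize
        batch_size = 32
        seq_len = 30
        emb_dim = 100
        epochs = 15
        hidden_dim = 512
        model_path = 'lm_lrdecay_drop.bin'

    def create_dataset(synopsis, batch_size, seq_len):
        np.random.seed(0)
        # lowercase every synopsis and join them into one long text
        synopsis = synopsis.apply(lambda x: str(x).lower()).values
        synopsis_text = ' '.join(synopsis)
        tokens = config.tokenizer(synopsis_text)
        global num_batches
        num_batches = int(len(tokens) / (seq_len * batch_size))
        # drop the tail so the tokens divide evenly into batches
        tokens = tokens[:num_batches * batch_size * seq_len]
        words = sorted(set(tokens))
        w2i = {w: i for i, w in enumerate(words)}
        i2w = {i: w for i, w in enumerate(words)}
        tokens = [w2i[tok] for tok in tokens]
        # the target is the input shifted by one token (wrapping around at the end)
        target = np.zeros_like((tokens))
        target[:-1] = tokens[1:]
        target[-1] = tokens[0]
        input_tok = np.reshape(tokens, (batch_size, -1))
        target_tok = np.reshape(target, (batch_size, -1))
        print(input_tok.shape)
        print(target_tok.shape)
        vocab_size = len(i2w)
        return input_tok, target_tok, vocab_size, w2i, i2w

    def create_batches(input_tok, target_tok, batch_size, seq_len):
        num_batches = np.prod(input_tok.shape) // (batch_size * seq_len)
        # step along the time axis, yielding seq_len columns at a time
        for i in range(0, num_batches * seq_len, seq_len):
            yield input_tok[:, i:i + seq_len], target_tok[:, i:i + seq_len]

Model architecture

Our model consists of an embedding layer, a stack of LSTM layers (I used 3 here), a dropout layer, and a final linear layer that outputs a score for each vocabulary token. We have not used a softmax layer yet; you will see why shortly. Because the LSTM cell also outputs hidden states, the model returns them too, so they can be passed back to the model at the next time step (the next batch of word sequences). In addition, after each epoch we need to reset the hidden state to 0, because the first time step of the current epoch does not need information from the last time step of the previous epoch, so we also have a "zero_state" function.

    class LSTMModel(nn.Module):
        def __init__(self, hid_dim, emb_dim, vocab_size, num_layers=1):
            super(LSTMModel, self).__init__()
            self.hid_dim = hid_dim
            self.emb_dim = emb_dim
            self.num_layers = num_layers
            self.vocab_size = vocab_size + 1
            self.embedding = nn.Embedding(self.vocab_size, self.emb_dim)
            self.lstm = nn.LSTM(self.emb_dim, self.hid_dim, batch_first=True, num_layers=self.num_layers)
            self.drop = nn.Dropout(0.3)
            self.linear = nn.Linear(self.hid_dim, vocab_size)  # from here we will randomly sample a word

        def forward(self, x, prev_hid):
            x = self.embedding(x)
            x, hid = self.lstm(x, prev_hid)
            x = self.drop(x)
            x = self.linear(x)  # raw scores per token, no softmax
            return x, hid

        def zero_state(self, batch_size):
            # fresh (hidden, cell) state pair filled with zeros
            return (torch.zeros(self.num_layers, batch_size, self.hid_dim),
                    torch.zeros(self.num_layers, batch_size, self.hid_dim))

Training

Then we just need to define the training function, store the loss for each epoch, and save the model with the lowest loss. We also call the zero_state function before each epoch to reset the hidden state. The loss function is cross-entropy loss, which is why we did not pass the output through an explicit softmax layer: the loss function computes the softmax internally. All training was done on a GPU; below are the parameters used (as provided in the config class):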
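A brief aside on the softmax point above, before the parameters: the sketch below shows, in plain Python, why a cross-entropy loss can take the raw scores from the linear layer directly. It is an illustration only, not the author's training code; the helper names `log_softmax` and `cross_entropy_from_logits` and the sample scores are hypothetical. The loss for one prediction is the negative log-softmax of the score at the correct index, so applying softmax first and then taking the log gives the same number.

```python
import math

def log_softmax(logits):
    """Return log-probabilities for one row of raw scores (numerically stable)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum_exp for x in logits]

def cross_entropy_from_logits(logits, target):
    """Cross-entropy for a single prediction: -log softmax(logits)[target]."""
    return -log_softmax(logits)[target]

# Hypothetical scores for a 4-word vocabulary; the correct next word is index 2.
logits = [1.0, 2.0, 3.0, 0.5]
target = 2

# Loss computed straight from the raw scores, as a cross-entropy loss does.
loss = cross_entropy_from_logits(logits, target)

# The same loss computed the long way, with an explicit softmax first.
exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]
loss_explicit = -math.log(probs[target])

print(loss, loss_explicit)
```

Both routes agree up to floating-point error, which is why adding a softmax layer to the model before a loss of this kind would apply the normalization twice.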