Novel-Generation LLM

Rough day. I put together a first draft of a novel-generation framework around an LLM. The idea is to use a prompt_config to carry state between the model calls of each iteration, with the following contents set in prompt_config:

prompt_config = {
    "model_prompt": "Write a continuation of the novel LLM, focusing on the unfolding events and character developments.",
    "background": "In a dystopian future, where society is divided by technology and natural resources are dwindling,",
    "characters": {
        "Alex": "a tech-savvy rebel fighting against the oppressive regime",
        "Jordan": "a loyalist to the regime, torn between duty and a growing sense of injustice",
        "Mia": "a mysterious figure with knowledge that could tip the scales of power"
    },
    "history_summary": "After a daring raid on a regime supply depot, Alex and their group find themselves pursued by an elite force led by Jordan. Amidst the chaos, they encounter Mia, who reveals a secret that could change everything.",
    "preview_summary": "The trio must navigate their conflicting loyalties and the dangers of a society on the brink of collapse to bring hope to the oppressed."
}

If this alone is turned into a prompt and passed to the model, the model shouldn't need to look back over a huge amount of prior context: the prompt is rebuilt from prompt_config every time, so each prompt passed in should be roughly the same size. If the model can generate the first chunk from it, then in principle it should be able to keep generating a novel of arbitrary length from this prompt_config.

But GPU memory still ran out.

The novel-generation flow is roughly as follows:

  1. The user first sets up prompt_config: the story background and characters, what has already happened, and what is going to happen next.
  2. A prompt is then built from prompt_config:
    def build_prompt(prompt_config):
        prompt = prompt_config["model_prompt"]
        prompt += f" here is the background you need to follow, background is {prompt_config['background']},"
        prompt += f" and there is a history story: {prompt_config['history_summary']}."
        prompt += f" the story will going on as: {prompt_config['preview_summary']}."
        prompt += " here are some characters you may use in the story"
        for character, description in prompt_config["characters"].items():
            prompt += f" {character} is a {description},"
        # Removing the last comma for proper grammar
        prompt = prompt.rstrip(',')
        return prompt
  3. The prompt is passed to the model to generate the next chunk of the novel; based on the generated text, the model then summarizes what has happened so far and writes out what is going to happen next (a wiring sketch of how this gets called follows after this list).
    from tqdm import tqdm

    def write_novel(model, prompt_config, iteration_num):
        prompt = build_prompt(prompt_config)
        novel = ""

        for i in tqdm(range(iteration_num)):
            input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
            output = model.generate(
                input_ids,
                max_new_tokens=1000,      # cap on the length of each generated chunk
                temperature=0.9,          # higher temperature for more creativity
                top_k=50,
                top_p=0.95,
                repetition_penalty=1.2,   # discourage repeated content
                pad_token_id=tokenizer.eos_token_id
            )
            generate = tokenizer.decode(output[0], skip_special_tokens=True) + "\n"
            novel += generate
            summ = [summarize_story(model, generate), generate_preview(model, generate)]
            prompt_config = upgrade_prompt(summ, prompt_config)
            prompt = build_prompt(prompt_config)

        return novel
  4. Both the summary and the preview of what comes next are produced from the novel text generated in the current iteration.

    To keep the model from producing a muddled summary, only the newly generated text is passed in for summarization, without the earlier history.

    def summarize_story(model, context):
        prompt = f"there was a text that you should summarize it, you should summarize what is going on, you dont need to mention this prompt here is the text = {context}"
        input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
        output = model.generate(
            input_ids,
            max_new_tokens=500,       # cap on the summary length
            temperature=0.3,          # low temperature for a more faithful summary
            top_k=50,
            top_p=0.95,
            repetition_penalty=1.2,   # discourage repeated content
            pad_token_id=tokenizer.eos_token_id
        )
        generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
        split_text = "**Summary:**"
        return extract_summary(generated_text, split_text)

    def generate_preview(model, context):
        prompt = f"there was a text that you should tell me what will happen in the futuer, text is {context}"
        input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
        output = model.generate(
            input_ids,
            max_new_tokens=500,       # cap on the preview length
            temperature=0.7,          # moderate temperature for some creativity
            top_k=50,
            top_p=0.95,
            repetition_penalty=1.2,   # discourage repeated content
            pad_token_id=tokenizer.eos_token_id
        )
        generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
        # split_text = "**Summary:**"
        # return extract_summary(generated_text, split_text)
        return generated_text

    def upgrade_prompt(summarize, prompt_config):
        prompt_config["history_summary"] = summarize[0]
        prompt_config["preview_summary"] = summarize[1]
        return prompt_config

    Because the model's output contains several parts, the specific content has to be extracted from it:
    def extract_summary(text, split_text):
        # Find the position where "Summary:" occurs
        start_index = text.find(split_text)
        if start_index != -1:
            # Extract everything after "Summary:"
            summary_text = text[start_index + len(split_text):].strip()
            return summary_text
        else:
            return f"{split_text} not found."
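
To tie the pieces together, the wiring looks roughly like the sketch below. The model name is only a placeholder (not necessarily the model actually used here); it just assumes a standard Hugging Face causal LM plus the global tokenizer that the functions above rely on.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder only; swap in whatever causal LM is actually being used
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)  # move to GPU with .cuda() if memory allows

# Quick check of the assembled prompt before running the full loop
print(build_prompt(prompt_config)[:300])

# Run a few iterations and inspect the result
novel = write_novel(model, prompt_config, iteration_num=3)
print(novel[:1000])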

Even though the only thing passed in each time is the prompt_config, the prompt kept getting bigger and bigger, and in the end it simply ran out of GPU memory.

I don't quite understand where the problem actually is.
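
One thing worth checking (just a guess at where the growth might come from, not something I've verified): for a decoder-only model, model.generate returns the input tokens followed by the new tokens, so tokenizer.decode(output[0], ...) still contains the original prompt. The summary and preview built from that text would then fold the old prompt back into history_summary and preview_summary, making the next prompt longer. A small helper like the hypothetical one below could at least confirm whether the prompt really grows each iteration:

def log_prompt_size(tag, prompt):
    # Hypothetical debugging helper: print how many tokens the current prompt occupies.
    n_tokens = len(tokenizer(prompt, return_tensors="pt").input_ids[0])
    print(f"{tag}: {n_tokens} tokens")

# e.g. call log_prompt_size(f"iteration {i}", prompt) at the top of the loop in write_novel.
# If the count climbs every iteration, the growth has to be coming from history_summary /
# preview_summary, since build_prompt only ever concatenates the fields of prompt_config.

If that turns out to be the case, decoding only the newly generated tokens (output[0][input_ids.shape[1]:]) might be one thing to try.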

Time to eat, so I've decided to optimize this next time I'm free. This afternoon I'll look at some papers and see whether I can come up with something else interesting to work on.

A bit about prompt learning, and some thoughts

I didn't do much in the afternoon, just kept watching the LLM tutorial videos. This part covered prompt learning; apparently the approach works better when fine-tuning models of 80B parameters and above. Prompt templates can be either hand-crafted or automatically generated, though I don't know the details yet. I feel that to really follow this material I should write an LLM by hand myself; there seem to be videos on that too, so I can start looking at them tomorrow.

I hope to finish this LLM tutorial series within the week. Papers, i.e. things related to my research direction, I plan to start reading gradually next month, so the goal for this month and next is to finish the tutorials and get hands-on practice with LLMs.

todo

  • [ ] Finish Tsinghua's LLM tutorial videos
  • [ ] Find some videos on writing an LLM from scratch by hand