Notes on problems I ran into while running inference with Hugging Face LLaMA.

First, I ran inference with the following code:

import logging

from transformers import AutoConfig, LlamaForCausalLM, LlamaTokenizerFast

# Turn on DEBUG logging for transformers to trace what generate() does.
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.DEBUG)
logger = logging.getLogger('transformers')
logger.setLevel(logging.DEBUG)

tokenizer = LlamaTokenizerFast.from_pretrained("/root/workspace/llama_test/llama-tokenizer")
prompt = "My name is Mariama, my favorite"
inputs = tokenizer(prompt, return_tensors="pt")
print(inputs)

config = AutoConfig.from_pretrained("/data/llama-65b-hf/")
config.torch_dtype = "float32"
config.use_cache = False  # no KV cache, so every step re-runs the full sequence
print(config)

model = LlamaForCausalLM.from_pretrained("/data/llama-65b-hf/", config=config)

print("model init!")
generate_ids = model.generate(inputs.input_ids, max_new_tokens=32)
print(generate_ids)

The tokenized input here is input_ids = tensor([[ 1, 1619, 1024, 338, 1085, 2829, 29874, 29892, 590, 25448]]), and attention_mask is tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]).

The generate process

Load the model's generation_config.json

GenerationConfig {
  "bos_token_id": 0,
  "eos_token_id": 1,
  "max_new_tokens": 32,
  "pad_token_id": -1,
  "use_cache": false
}

Set the maximum length

generation_config.max_length = generation_config.max_new_tokens + input_ids_length, which here is 32 + 10 = 42.
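The length arithmetic above can be sketched as follows. This is only an illustration of the formula, not the library's code; the helper name `resolve_max_length` is hypothetical.

```python
def resolve_max_length(input_ids_length: int, max_new_tokens: int) -> int:
    """Total sequence length cap: prompt tokens plus newly generated tokens."""
    return max_new_tokens + input_ids_length


# The prompt above tokenizes to 10 ids (including the leading BOS token).
input_ids = [1, 1619, 1024, 338, 1085, 2829, 29874, 29892, 590, 25448]
max_length = resolve_max_length(len(input_ids), max_new_tokens=32)
print(max_length)  # 42
```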

greedy_search

Based on the generation strategy, generate enters greedy_search to run inference.

Inference loop

Prepare inputs
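As a rough sketch of what a greedy decoding loop does (this is not the actual greedy_search implementation in transformers), the example below uses a hypothetical toy function `toy_logits` in place of the model's forward pass. Because use_cache is False, the whole sequence is fed back in at every step.

```python
from typing import Callable, List


def greedy_decode(
    input_ids: List[int],
    next_token_logits: Callable[[List[int]], List[float]],
    max_length: int,
    eos_token_id: int,
) -> List[int]:
    """Repeatedly append the argmax token until max_length or EOS is reached."""
    ids = list(input_ids)
    while len(ids) < max_length:
        logits = next_token_logits(ids)  # scores for the token after ids[-1]
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
        if next_id == eos_token_id:
            break
    return ids


# Hypothetical stand-in for the model: over a 5-token vocabulary it always
# scores (last_id + 1) % 5 highest.
def toy_logits(ids: List[int]) -> List[float]:
    target = (ids[-1] + 1) % 5
    return [1.0 if i == target else 0.0 for i in range(5)]


generated = greedy_decode([2], toy_logits, max_length=6, eos_token_id=0)
print(generated)  # [2, 3, 4, 0]
```

The loop stops on whichever comes first: the EOS token or the max_length cap computed earlier.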

'input_ids':
tensor([[    1,  1619,  1024,   338,  1085,  2829, 29874, 29892,   590, 25448]])
'position_ids':
tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
'past_key_values':
None
'use_cache':
False
'attention_mask':
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
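The position_ids shown above line up with a running count over the attention mask. Below is a minimal sketch in plain Python, assuming padded positions are given a dummy position of 1; the helper name `positions_from_mask` is hypothetical, and this is not the library's code.

```python
from itertools import accumulate
from typing import List


def positions_from_mask(attention_mask: List[int], pad_position: int = 1) -> List[int]:
    """Running count of attended tokens minus one; masked slots get a dummy position."""
    running = accumulate(attention_mask)
    return [
        count - 1 if mask == 1 else pad_position
        for count, mask in zip(running, attention_mask)
    ]


print(positions_from_mask([1] * 10))         # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(positions_from_mask([0, 0, 1, 1, 1]))  # [1, 1, 0, 1, 2]
```

With no padding, as in this run, the result is simply 0 through 9.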

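Enter forward
Decoder layers. The output hidden_states has shape torch.Size([1, 10, 8192]).
Apply the lm head, giving logits of shape torch.Size([1, 10, 32000]).
Take the last position of the logits, torch.Size([1, 32000]), and pick the highest-probability token; here I got "color". Taking the argmax at every position instead gives "# name is Ktha and and I friends color". I had assumed that feeding in <s> xxx would produce xxxy, with y then taken as the output. It turns out the earlier positions are predicted as well: position i outputs the model's guess for token i+1, so the front part of the output is also "imagined" by the model rather than copied from the input.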
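To illustrate why the per-position argmax reads like a shifted, partly wrong copy of the prompt: in a causal language model, the logits at position i score the token that follows position i. The toy sketch below uses a hypothetical 4-word vocabulary and hand-written logits, not real model output.

```python
from typing import List


def argmax(row: List[float]) -> int:
    """Index of the largest score in one row of logits."""
    return max(range(len(row)), key=row.__getitem__)


vocab = ["<s>", "my", "favorite", "color"]  # hypothetical toy vocabulary
prompt_ids = [0, 1, 2]                      # "<s> my favorite"

# One row of logits per input position, shape [seq_len, vocab_size].
logits = [
    [0.1, 2.0, 0.3, 0.2],  # after "<s>"      -> predicts "my"
    [0.0, 0.1, 1.5, 0.4],  # after "my"       -> predicts "favorite"
    [0.2, 0.1, 0.3, 3.0],  # after "favorite" -> predicts "color"
]

per_position = [vocab[argmax(row)] for row in logits]
next_token = vocab[argmax(logits[-1])]  # only the last row yields the new token

print(per_position)  # ['my', 'favorite', 'color']
print(next_token)    # color
```

In this toy case every earlier prediction happens to match the prompt; with a real model they often differ, which is what the "# name is Ktha and and I friends color" output shows.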
