How to get a probability distribution over tokens in a Hugging Face model?

I’m following this tutorial on getting predictions for masked words. I’m using this one because it seems to work with several masked words simultaneously, while the other approaches I tried could only handle one masked word at a time.

The code:

from transformers import RobertaTokenizer, RobertaForMaskedLM
import torch

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaForMaskedLM.from_pretrained('roberta-base')

sentence = "Tom has fully ___ ___ ___ illness."


def get_prediction(sent):
    token_ids = tokenizer.encode(sent, return_tensors='pt')
    # Positions of all <mask> tokens in the input
    masked_position = (token_ids.squeeze() == tokenizer.mask_token_id).nonzero()
    masked_pos = [mask.item() for mask in masked_position]

    with torch.no_grad():
        output = model(token_ids)

    # output[0] contains the MLM logits, shape (seq_len, vocab_size);
    # these are raw scores, not a probability distribution
    logits = output[0].squeeze()

    list_of_list = []
    for index, mask_index in enumerate(masked_pos):
        mask_logits = logits[mask_index]
        # Indices of the 5 highest-scoring vocabulary tokens for this mask
        idx = torch.topk(mask_logits, k=5, dim=0)[1]
        words = [tokenizer.decode(i.item()).strip() for i in idx]
        list_of_list.append(words)
        print("Mask ", index + 1, "Guesses : ", words)

    # Concatenate the top guess for each mask
    best_guess = ""
    for j in list_of_list:
        best_guess = best_guess + " " + j[0]

    return best_guess


print ("Original Sentence: ",sentence)
sentence = sentence.replace("___","<mask>")
print ("Original Sentence replaced with mask: ",sentence)
print ("n")

predicted_blanks = get_prediction(sentence)
print ("nBest guess for fill in the blank :::",predicted_blanks)

How can I get the probability distribution over the top 5 tokens instead of just their indices? That is, something like the score this approach returns (I used it before, but it raises an error once I switch to multiple masked tokens):

from transformers import pipeline

# Initialize MLM pipeline
mlm = pipeline('fill-mask')

# Get mask token
mask = mlm.tokenizer.mask_token

# Get result for particular masked phrase
phrase = f'Read the rest of this {mask} to understand things in more detail'
result = mlm(phrase)

# Print result
print(result)

[{
    'sequence': 'Read the rest of this article to understand things in more detail',
    'score': 0.35419148206710815,
    'token': 1566,
    'token_str': ' article'
},...
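From what I understand, applying torch.softmax to the logits before torch.topk should turn them into a proper probability distribution over the vocabulary. Here is a minimal, untested sketch of what I have in mind (the sentence and variable names are just mine for illustration), in case it clarifies the question:

from transformers import RobertaTokenizer, RobertaForMaskedLM
import torch

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaForMaskedLM.from_pretrained('roberta-base')

sent = "Tom has fully <mask> <mask> <mask> illness."
token_ids = tokenizer.encode(sent, return_tensors='pt')
masked_pos = (token_ids.squeeze() == tokenizer.mask_token_id).nonzero().squeeze(-1)

with torch.no_grad():
    logits = model(token_ids)[0].squeeze()

for i, mask_index in enumerate(masked_pos.tolist()):
    # softmax turns the raw logits into probabilities that sum to 1 over the vocab
    probs = torch.softmax(logits[mask_index], dim=0)
    top_probs, top_idx = torch.topk(probs, k=5)
    for p, idx in zip(top_probs.tolist(), top_idx.tolist()):
        print("Mask", i + 1, ":", tokenizer.decode(idx).strip(), "with probability", round(p, 4))

If I understand the pipeline output above correctly, its 'score' field is this same kind of softmax probability.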
