
Token Decoding (Inference Loop) #190


Description

@thejhh

Take the output logits from the forward pass and decode the model's prediction:

- Apply a softmax to the logits to obtain a probability distribution over the vocabulary. For greedy decoding you can skip this and pick the argmax of the logits directly, since softmax does not change which token has the highest score.
- Convert the selected token ID back to a text string using the tokenizer from step 4.
- To generate multi-token outputs (as is typical in language model inference), implement a generation loop: append the predicted token to the input sequence and feed the last $N$ tokens (or the entire sequence, if it is still under 4096 tokens) back into the model to compute the next token. Repeat until an end-of-sequence token is produced or the desired output length is reached.
- Make sure the context never exceeds 4096 tokens; once it would, drop the oldest tokens (a sliding window, which also enables streaming generation).

This step yields the final decoded text output from the model.
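
A minimal sketch of the greedy decoding loop, assuming a `model.forward(token_ids)` call that returns the logits for the last position and a tokenizer exposing `encode`/`decode`. These names are placeholders for whatever interfaces the earlier steps actually provide; NumPy is used here only for the softmax/argmax.

```python
import numpy as np

MAX_CONTEXT = 4096  # context limit from the task description


def softmax(logits):
    """Numerically stable softmax over the vocabulary dimension."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()


def generate(model, tokenizer, prompt, max_new_tokens=256, eos_id=None):
    """Greedy decoding: feed tokens, pick the argmax token, append, repeat."""
    token_ids = list(tokenizer.encode(prompt))    # hypothetical tokenizer API
    for _ in range(max_new_tokens):
        # Sliding window: keep only the most recent MAX_CONTEXT tokens.
        context = token_ids[-MAX_CONTEXT:]
        logits = model.forward(context)           # hypothetical model API: logits for the last position
        probs = softmax(logits)                   # optional; argmax(logits) selects the same token
        next_id = int(np.argmax(probs))
        if eos_id is not None and next_id == eos_id:
            break
        token_ids.append(next_id)
    return tokenizer.decode(token_ids)            # hypothetical tokenizer API
```

The `token_ids[-MAX_CONTEXT:]` slice is what enforces the 4096-token limit: the oldest tokens are dropped from the model input while the full sequence is still kept for the final decode.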


Labels: bitnet, BitNet implementation, task
