Description
Currently `MLXLMCommon` has some basic support for a KV cache, but it isn't persisted across calls to `generate()`.
Although it appears there could be a way to pass a `KVCache` into `generate()`, the cache would ultimately have to cross a `Sendable` boundary if the app is to manage it. That isn't possible, since `MLXArray` is not `Sendable`, and it isn't desirable or necessary either.
A prompt cache could be managed by the `ModelContainer` actor and stored in its context as `ModelContext.promptCache`. Note that the prompt cache is an array of `KVCache`. In mlx_lm, the `PromptCache` object also stores the token ids of the cached prompt and the model key, used to check whether the model has changed.
We could implement a similar struct:
```swift
public struct PromptCache {
    public let cache: [KVCache]
    public let modelKey: String
    public let tokens: MLXArray
}
```

The `PromptCache` struct could also have functions for trimming.
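For example, the trimming helpers might look something like the sketch below. This is illustrative only: the fields are made `var` so the cache can be updated in place, and the `isTrimmable()` / `trim(_:)` calls on `KVCache` mirror mlx_lm's cache API — they are assumptions about what `MLXLMCommon` would need to expose, not existing API.

```swift
import MLX
import MLXLMCommon

public struct PromptCache {
    public var cache: [KVCache]
    public let modelKey: String
    public var tokens: MLXArray

    /// Number of leading tokens the cached prompt shares with a new prompt.
    public func commonPrefixLength(with prompt: [Int]) -> Int {
        let cached = tokens.asArray(Int.self)
        var n = 0
        while n < min(cached.count, prompt.count), cached[n] == prompt[n] {
            n += 1
        }
        return n
    }

    /// Trim the cache back to `length` tokens so it matches a shorter prefix,
    /// e.g. when a new prompt diverges partway through the cached one.
    public mutating func trim(toLength length: Int) {
        let dropCount = tokens.dim(0) - length
        guard dropCount > 0 else { return }
        for kv in cache {
            // Hypothetical mlx_lm-style API on KVCache.
            if kv.isTrimmable() {
                kv.trim(dropCount)
            }
        }
        tokens = tokens[0 ..< length]
    }
}
```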
Functions analogous to mlx_lm's `get_prompt_cache` could go in the `ModelContainer` actor.
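A rough sketch of what that could look like, assuming `ModelContext` gains a mutable `promptCache: PromptCache?` property as proposed above and that the actor stores its context in a mutable `context` property (the method name and signature are illustrative, not existing API):

```swift
import MLX
import MLXLMCommon

extension ModelContainer {
    /// Hypothetical analog of mlx_lm's get_prompt_cache: reuse the existing
    /// cache only if it was built for the same model, otherwise start fresh.
    public func getPromptCache(modelKey: String, makeCache: () -> [KVCache]) -> PromptCache {
        if let existing = context.promptCache, existing.modelKey == modelKey {
            return existing
        }
        let fresh = PromptCache(
            cache: makeCache(),
            modelKey: modelKey,
            tokens: MLXArray([Int32]())
        )
        context.promptCache = fresh
        return fresh
    }
}
```

Because `ModelContainer` is an actor, callers would reach this through `await`, which keeps the non-`Sendable` `MLXArray` contents safely isolated inside the actor.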
I'm currently having a go at implementing this. Interested in any suggestions on the best approach.