In the future, when we model long videos, we will be able to sample frames densely instead of sparsely. Dense frames are a burden for attention-based models, but for these new layers they are a blessing! An idea that has been around for more than a year has finally been realized. As the author recalled, over the past year the team has been developing a new architecture with linear complexity and a more expressive hidden state for modeling long contexts. The idea of training at test time has been studied for more than 20 years; the author clearly remembers discussing it at the very start of his postdoc.
That discussion was the starting point for this research. Sequence models store historical context in a hidden state. RNN-style layers compress the context into a fixed-size state over time; they are very efficient, but their performance is limited by the expressiveness of that state. The attention mechanism, by contrast, keeps a KV cache that grows over time: this state does not compress any historical context at all, and it becomes increasingly expensive as the context gets longer. The team asked: why not compress the context into the weights of a model instead, just as a model compresses Internet data during pre-training? Such a "hidden state that is itself a model" keeps a fixed size over time while being far more expressive.
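To make that trade-off concrete, here is a rough back-of-the-envelope comparison of state sizes. The dimensions, depth, and the assumption that the weights-as-state take the form of a d-by-d matrix per layer are illustrative choices, not figures from the source:

```python
# Rough state-size accounting (illustrative numbers only)
d = 4096            # assumed model dimension
n_layers = 32       # assumed depth
T = 128_000         # assumed context length in tokens

rnn_state = n_layers * d              # fixed-size hidden state: O(d) per layer
kv_cache = n_layers * 2 * T * d       # keys + values: grows as O(T * d)
weights_as_state = n_layers * d * d   # hidden state as a model's weights: fixed, far more expressive

print(f"RNN-style state  : {rnn_state:>15,d} numbers (fixed)")
print(f"KV cache         : {kv_cache:>15,d} numbers (grows with context)")
print(f"weights-as-state : {weights_as_state:>15,d} numbers (fixed)")
```

The point of the comparison: the weights-as-state option stays fixed in size like an RNN state, but it has far more capacity than a single vector per layer.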
The researchers use self-supervised learning to update the weights of this hidden state, performing one gradient-descent step for each token. By the time a sequence has been processed, the state has effectively been "trained" on the tokens in its context window. It is worth noting that this hidden-state model is only one layer inside the end-to-end architecture; the other components, such as the projection matrices, are learned during pre-training with the standard cross-entropy objective. The end-to-end architecture is therefore meta-learning the best way to compress the context so that it can better predict the next token, that is, "learning how to learn at test time".
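A minimal sketch of this idea, assuming the inner model is linear and the self-supervised task is squared-error reconstruction between two learned views of each token. All names (TTTLinearSketch, proj_train, proj_label, proj_query, lr) and the zero initialization of the state are illustrative assumptions, not the authors' implementation:

```python
import torch


class TTTLinearSketch(torch.nn.Module):
    """Sketch of a layer whose hidden state is the weight matrix W of a
    small linear model, updated by one gradient-descent step per token.

    The projection layers below are ordinary parameters, learned end-to-end
    during pre-training (the outer, meta-learning loop)."""

    def __init__(self, d_model: int, lr: float = 0.1):
        super().__init__()
        self.d = d_model
        self.lr = lr  # inner-loop step size (could itself be learned)
        # Learned projections defining the self-supervised task:
        self.proj_train = torch.nn.Linear(d_model, d_model, bias=False)  # "input" view
        self.proj_label = torch.nn.Linear(d_model, d_model, bias=False)  # "target" view
        self.proj_query = torch.nn.Linear(d_model, d_model, bias=False)  # view used to read the state

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (seq_len, d_model), a single sequence processed token by token.
        W = torch.zeros(self.d, self.d, device=x.device)  # the hidden state
        outputs = []
        for x_t in x:
            k_t = self.proj_train(x_t)
            v_t = self.proj_label(x_t)
            q_t = self.proj_query(x_t)

            # Inner-loop loss: 0.5 * ||W k_t - v_t||^2, one gradient step on W.
            err = W @ k_t - v_t
            grad_W = torch.outer(err, k_t)
            W = W - self.lr * grad_W

            # Read the freshly updated state to produce this token's output.
            outputs.append(W @ q_t)
        return torch.stack(outputs)
```

Usage would look like `y = TTTLinearSketch(d_model=64)(torch.randn(128, 64))`. Because the inner update is differentiable, gradients of the pre-training loss flow back through it into the projection matrices, which is what makes the outer training a form of meta-learning.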
The layer directly replaces the attention mechanism in the architecture.
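As a sketch of what "directly replaces attention" could mean in practice, the block below keeps the usual residual structure and simply swaps the sequence-mixing module; the Block class and its components are assumptions for illustration, not the source's architecture:

```python
import torch.nn as nn


class Block(nn.Module):
    """One residual block with a pluggable sequence mixer: pass in
    self-attention or a hidden-state-as-model layer such as the
    TTTLinearSketch above."""

    def __init__(self, d_model: int, seq_mixer: nn.Module):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.seq_mixer = seq_mixer  # the only part being swapped
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        x = x + self.seq_mixer(self.norm1(x))  # sequence mixing (attention or new layer)
        x = x + self.mlp(self.norm2(x))        # per-token MLP, unchanged
        return x
```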