In the future, when we model long videos, we will be able to sample frames densely instead of sparsely. Dense frames are a burden for attention-based models, but for these new layers they are a blessing! An idea that has been around for more than a year has finally been realized. As the author recalled, over the past year the team has been developing a new architecture with linear complexity and a more expressive hidden state for modeling long contexts. The idea of training at test time has been studied for more than 20 years; the author clearly remembers discussing it at the very start of his postdoc.
That discussion was the starting point for this research. Sequence models store historical context in a hidden state. RNN-style layers compress the context into a fixed-size state over time; they are very efficient, but their performance is limited by the expressiveness of that state. The attention mechanism, by contrast, keeps a KV cache that grows over time: this state does not compress any historical context at all, and it becomes increasingly expensive as the context gets longer. The team asked: why not compress the context into the weights of a model instead, just as a model compresses Internet data during pre-training? Such a "hidden state that is itself a model" keeps a fixed size over time while being far more expressive.
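To make that trade-off concrete, here is a rough back-of-the-envelope comparison of state sizes. The dimensions, depth, and the assumption that the weights-as-state take the form of a d-by-d matrix per layer are illustrative choices, not figures from the source:

```python
# Rough state-size accounting (illustrative numbers only)
d = 4096            # assumed model dimension
n_layers = 32       # assumed depth
T = 128_000         # assumed context length in tokens

rnn_state = n_layers * d              # fixed-size hidden state: O(d) per layer
kv_cache = n_layers * 2 * T * d       # keys + values: grows as O(T * d)
weights_as_state = n_layers * d * d   # hidden state as a model's weights: fixed, far more expressive

print(f"RNN-style state  : {rnn_state:>15,d} numbers (fixed)")
print(f"KV cache         : {kv_cache:>15,d} numbers (grows with context)")
print(f"weights-as-state : {weights_as_state:>15,d} numbers (fixed)")
```

The point of the comparison: the weights-as-state option stays fixed in size like an RNN state, but it has far more capacity than a single vector per layer.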
The researchers use self-supervised learning to update the weights of this hidden state, performing one gradient-descent step for each token. By the time a sequence has been processed, the state has effectively been "trained" on the tokens in its context window. It is worth noting that this hidden-state model is only one layer inside the end-to-end architecture; the other components, such as the projection matrices, are learned during pre-training with the standard cross-entropy objective. The end-to-end architecture is therefore meta-learning the best way to compress the context so that it can better predict the next token, that is, "learning how to learn at test time".
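A minimal sketch of this idea, assuming the inner model is linear and the self-supervised task is squared-error reconstruction between two learned views of each token. All names (TTTLinearSketch, proj_train, proj_label, proj_query, lr) and the zero initialization of the state are illustrative assumptions, not the authors' implementation:

```python
import torch


class TTTLinearSketch(torch.nn.Module):
    """Sketch of a layer whose hidden state is the weight matrix W of a
    small linear model, updated by one gradient-descent step per token.

    The projection layers below are ordinary parameters, learned end-to-end
    during pre-training (the outer, meta-learning loop)."""

    def __init__(self, d_model: int, lr: float = 0.1):
        super().__init__()
        self.d = d_model
        self.lr = lr  # inner-loop step size (could itself be learned)
        # Learned projections defining the self-supervised task:
        self.proj_train = torch.nn.Linear(d_model, d_model, bias=False)  # "input" view
        self.proj_label = torch.nn.Linear(d_model, d_model, bias=False)  # "target" view
        self.proj_query = torch.nn.Linear(d_model, d_model, bias=False)  # view used to read the state

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (seq_len, d_model), a single sequence processed token by token.
        W = torch.zeros(self.d, self.d, device=x.device)  # the hidden state
        outputs = []
        for x_t in x:
            k_t = self.proj_train(x_t)
            v_t = self.proj_label(x_t)
            q_t = self.proj_query(x_t)

            # Inner-loop loss: 0.5 * ||W k_t - v_t||^2, one gradient step on W.
            err = W @ k_t - v_t
            grad_W = torch.outer(err, k_t)
            W = W - self.lr * grad_W

            # Read the freshly updated state to produce this token's output.
            outputs.append(W @ q_t)
        return torch.stack(outputs)
```

Usage would look like `y = TTTLinearSketch(d_model=64)(torch.randn(128, 64))`. Because the inner update is differentiable, gradients of the pre-training loss flow back through it into the projection matrices, which is what makes the outer training a form of meta-learning.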
The layer directly replaces the attention mechanism in the architecture.
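As a sketch of what "directly replaces attention" could mean in practice, the block below keeps the usual residual structure and simply swaps the sequence-mixing module; the Block class and its components are assumptions for illustration, not the source's architecture:

```python
import torch.nn as nn


class Block(nn.Module):
    """One residual block with a pluggable sequence mixer: pass in
    self-attention or a hidden-state-as-model layer such as the
    TTTLinearSketch above."""

    def __init__(self, d_model: int, seq_mixer: nn.Module):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.seq_mixer = seq_mixer  # the only part being swapped
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        x = x + self.seq_mixer(self.norm1(x))  # sequence mixing (attention or new layer)
        x = x + self.mlp(self.norm2(x))        # per-token MLP, unchanged
        return x
```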