Three Controversial Hypotheses Concerning Computation in the Primate Cortex
- single algorithm to dominate them all
Next-token prediction in LLMs yields far less meaningful representations than next-frame prediction on videos.
Video models that:
- solve complex problems
- perform intricate reasoning
- make subtle inferences
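The contrast above can be made concrete: an LLM minimizes cross-entropy over a discrete vocabulary, while a next-frame video model typically minimizes a reconstruction error over continuous pixels. A minimal NumPy sketch of the two objectives (function names and toy shapes are my own illustration, not from the talk):

```python
import numpy as np

def next_token_loss(logits, target_ids):
    """Cross-entropy over a discrete vocabulary: the LLM objective."""
    # logits: (seq_len, vocab_size), target_ids: (seq_len,)
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numeric stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(target_ids)), target_ids].mean()

def next_frame_loss(predicted, actual):
    """Mean squared error over continuous pixels: a common video objective."""
    return ((predicted - actual) ** 2).mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 10))          # 5 tokens, vocabulary of 10
targets = rng.integers(0, 10, size=5)
frames_pred = rng.normal(size=(2, 4, 4))   # 2 frames of 4x4 "pixels"
frames_true = rng.normal(size=(2, 4, 4))

print(next_token_loss(logits, targets))
print(next_frame_loss(frames_pred, frames_true))
```

The discrete target forces the LLM to match human-produced symbols, whereas the continuous target only asks the video model to predict what the camera will see next.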
Question: what does a vision model doing reasoning look like?
LLMs learn by copying human behavior rather than acquiring the underlying capabilities: they reconstruct the human mind through the shadow it casts on the Internet. Very TOK vibe here.
Question: how do we define capabilities?
The difference between video and text is that text is mostly human-generated while video is camera-generated. Learning from video skips logical representations but could copy human mental representations instead.
Questions: do LLMs ever have capabilities? If they do, can we use an LLM to describe scenes and thereby imbue video models with language capabilities? What does it mean to learn like a human? To build complex understanding from simple understanding? Language as a placeholder. Neuro-symbolic? Memory.