Three Controversial Hypotheses Concerning Computation in the Primate Cortex

  • single algorithm to dominate them all

Next-token prediction in LLMs << meaningful representations for next-frame prediction on videos
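A minimal sketch (PyTorch, toy shapes; nothing here is from the talk itself) of the two training objectives this line contrasts: next-token prediction as cross-entropy over a discrete, human-authored vocabulary, versus next-frame prediction as regression against raw continuous frames.

```python
import torch
import torch.nn.functional as F

# Toy shapes, purely illustrative.
B, T, V = 2, 16, 50_000          # tokens: batch, context length, vocabulary size
C, H, W = 3, 64, 64              # frames: channels, height, width

# --- Next-token prediction (LLM objective) ---
# The model emits a distribution over a discrete vocabulary at each position.
token_logits = torch.randn(B, T, V)         # stand-in for model output
next_tokens = torch.randint(0, V, (B, T))   # shifted targets
token_loss = F.cross_entropy(
    token_logits.reshape(B * T, V), next_tokens.reshape(B * T)
)

# --- Next-frame prediction (video objective) ---
# The model regresses the next frame in a continuous space; there is no
# discrete, human-authored symbol set to predict over.
predicted_frame = torch.randn(B, C, H, W)   # stand-in for model output
next_frame = torch.rand(B, C, H, W)         # ground-truth next frame
frame_loss = F.mse_loss(predicted_frame, next_frame)

print(f"token loss: {token_loss.item():.3f}, frame loss: {frame_loss.item():.3f}")
```

The structural difference the later notes return to is visible in the targets: the token target is itself something a human wrote, while the frame target is raw sensor output.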

Video Models that:

  1. solve complex problems
  2. perform intricate reasoning
  3. make subtle inferences

Question: what does reasoning in a vision model look like?

LLMs learn by copying human behavior rather than the underlying capabilities: they reconstruct the human mind through the shadow it casts on the Internet. Very TOK vibe here.

Question: how do we define capabilities?

The difference between video and text is that text is mostly human-generated while video is camera-generated. Learning from text lets a model skip building logical representations and copy human mental representations instead.

Questions:

  • Do LLMs ever have capabilities? If they do, can we use an LLM to describe the scene and imbue a video model with language capabilities (see the sketch below)?
  • What does it mean to learn like a human? Build complex understanding from simple understanding?
  • Language as a placeholder. Neuro-symbolic? Memory.
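One concrete reading of the "describe the scene" question, offered only as a sketch: pair each clip with a caption produced by some off-the-shelf LLM/captioner and align clip and caption embeddings with a CLIP-style contrastive loss. Everything below (the VideoEncoder/TextEncoder stubs, the shapes, the temperature) is a hypothetical placeholder, not a method from these notes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoEncoder(nn.Module):
    """Hypothetical stand-in: maps a clip (B, T, C, H, W) to one embedding per clip."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(3 * 32 * 32, dim)
    def forward(self, clip):
        B, T, C, H, W = clip.shape
        return self.proj(clip.reshape(B, T, -1)).mean(dim=1)   # (B, dim)

class TextEncoder(nn.Module):
    """Hypothetical stand-in for embedding already-tokenized captions from an LLM/captioner."""
    def __init__(self, vocab=1000, dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
    def forward(self, tokens):
        return self.emb(tokens).mean(dim=1)                    # (B, dim)

# Contrastive alignment: pull each clip toward its own caption's embedding.
video_enc, text_enc = VideoEncoder(), TextEncoder()
clips = torch.rand(4, 8, 3, 32, 32)            # 4 clips of 8 frames each
captions = torch.randint(0, 1000, (4, 12))     # 4 captions, 12 tokens each
v, t = video_enc(clips), text_enc(captions)
logits = v @ t.T / 0.07                        # similarity matrix with temperature
labels = torch.arange(4)
loss = F.cross_entropy(logits, labels)         # CLIP-style InfoNCE, one direction
```

In practice the captions would come from a real captioning model over the video frames; the stubs only make the shape of the idea runnable.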