Components

| Component | Architecture |
|---|---|
| Input Processing | Standard Embedding Layer |
| Low-Level Module | Encoder-Only Transformer Block |
| High-Level Module | Encoder-Only Transformer Block |
| Output Processing | Standard Linear Layer + Softmax |
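The table above can be wired up as a tiny pure-Python pipeline — a hedged sketch only, with toy stand-ins (the function names, the 2-d embedding table, and the `* 0.5` "block" are all illustrative, not from the paper):

```python
def embed(tokens):                      # Input Processing: embedding lookup
    table = {0: [0.0, 1.0], 1: [1.0, 0.0]}
    return [table[t] for t in tokens]

def transformer_block(states):          # stand-in for an encoder-only block
    # a real block applies self-attention + MLP; here we just scale
    return [[x * 0.5 for x in s] for s in states]

def output_head(states):                # Output Processing: linear + argmax
    return [max(range(len(s)), key=lambda i: s[i]) for s in states]

def forward(tokens):
    z = embed(tokens)
    z = transformer_block(z)            # Low-Level Module
    z = transformer_block(z)            # High-Level Module
    return output_head(z)

print(forward([0, 1]))                  # → [1, 0]
```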

Responsibility

  1. H-Module → directs the overall problem-solving strategy
  2. L-Module → executes the intensive search or refinement

  • forward residual? full self-attention?
  • Adaptive Computation Time mechanism → control over the H-Module and the L-Module
  • 'recurrent' = a cycle in the computation graph

Deep Supervision

  • break one long forward pass into smaller segments ← how?
    • take an optimization step per segment?
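One plausible reading of "smaller passes" — a hedged sketch, not the paper's implementation: run the recurrence as segments and attach a loss after each one, cutting gradients at segment boundaries. All names and the toy update rule here are hypothetical; with autograd you would call `state.detach()` where the comment marks it.

```python
def run_segment(state, steps=3):
    for _ in range(steps):              # one short chunk of the recurrence
        state = state * 0.9 + 0.1       # toy update, converges toward 1.0
    return state

def loss(state, target=1.0):
    return (state - target) ** 2

def deeply_supervised(state, num_segments=4):
    losses = []
    for _ in range(num_segments):
        state = run_segment(state)
        losses.append(loss(state))      # per-segment supervision signal
        # state = state.detach()  <- gradient cut between segments goes here
    return state, losses
```

Running `deeply_supervised(0.0)` yields four losses that shrink as the state approaches the target, so every segment gets its own training signal instead of one loss at the very end.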

Adaptive Computation Time (ACT)

  • How many segments should a module run for?
  • acts like an RL agent?
    • Q-Head
      1. Q-Halt: expected reward if we stop thinking now and give the current answer
      2. Q-Continue: expected reward if we spend more computation and run for another segment
    • Decision Rule: halt once Q-Halt > Q-Continue; otherwise run another segment
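The halting rule can be sketched as a loop over segments — a minimal illustration with stubbed Q-values (a real Q-head would be a learned layer over the hidden state; the numbers below are made up):

```python
def q_values(segment):
    q_halt = 0.2 * segment             # pretend confidence grows with compute
    q_continue = 0.5                   # flat stub for illustration
    return q_halt, q_continue

def run_with_act(max_segments=10):
    for segment in range(1, max_segments + 1):
        q_halt, q_continue = q_values(segment)
        if q_halt > q_continue:        # Decision Rule: Q-Halt > Q-Continue
            return segment             # stop thinking, emit current answer
    return max_segments                # hard cap on computation
```

With these stubs the model halts after 3 segments, the first point where `0.2 * segment` exceeds `0.5`.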

The Recurrent Operations

  1. Low-Level Module takes T steps (T is definitely learnable)
  2. High-Level Module takes 1 step
  3. Repeat
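The three steps above as a nested loop — a hedged sketch where `l_step`/`h_step` are toy scalar stand-ins for the two transformer blocks, not the actual updates:

```python
def l_step(z_l, z_h):
    return 0.5 * z_l + 0.5 * z_h       # L refines, conditioned on H's state

def h_step(z_h, z_l):
    return z_h + z_l                   # H integrates L's final result

def recurrent_forward(z_h=1.0, z_l=0.0, cycles=2, T=4):
    for _ in range(cycles):            # 3. repeat
        for _ in range(T):             # 1. L-module takes T steps
            z_l = l_step(z_l, z_h)
        z_h = h_step(z_h, z_l)         # 2. H-module takes 1 step
    return z_h
```

The key structural point: L iterates many times against a frozen H-state, and H updates only once per cycle — the "cycle in the graph" from the notes above.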

To Read

  1. Universal Transformer