NLP
encoder-decoder
text -> encoder -> embedding vectors (also called hidden state or context)
fixed-length context vector vs. passing all encoder states
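A minimal toy sketch (my own PyTorch code, not from any particular paper) showing both options: the encoder returns the full series of per-token states as well as the single final state that can serve as a fixed-length context.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len)
        embedded = self.embed(token_ids)             # (batch, seq_len, embed_dim)
        all_states, last_state = self.rnn(embedded)  # per-token states + final hidden state
        return all_states, last_state

encoder = Encoder()
tokens = torch.randint(0, 1000, (2, 7))             # fake batch: 2 sentences, 7 tokens each
all_states, context = encoder(tokens)
print(all_states.shape)  # torch.Size([2, 7, 128])  -> all encoder states
print(context.shape)     # torch.Size([1, 2, 128])  -> single fixed-length context
```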
attention
The encoder creates a series of states instead of a single hidden state. Using all of them directly would be a huge input for the decoder. Instead, the decoder assigns a different importance (a weight, the attention) to each encoder state, as in the sketch below.
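A hedged sketch of that weighting using plain dot-product scores (one of several possible scoring functions): the decoder state scores every encoder state, the scores become softmax weights, and the weighted sum is the context the decoder actually consumes.

```python
import torch
import torch.nn.functional as F

def attend(decoder_state, encoder_states):
    # decoder_state: (batch, hidden); encoder_states: (batch, seq_len, hidden)
    scores = torch.bmm(encoder_states, decoder_state.unsqueeze(-1)).squeeze(-1)  # (batch, seq_len)
    weights = F.softmax(scores, dim=-1)        # attention weights, sum to 1 per example
    context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)         # (batch, hidden)
    return context, weights

encoder_states = torch.randn(2, 7, 128)
decoder_state = torch.randn(2, 128)
context, weights = attend(decoder_state, encoder_states)
print(context.shape, weights.shape)  # torch.Size([2, 128]) torch.Size([2, 7])
```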
This is much better but still sequential. Transformers make it parallel.
But how?
Self-attention: every token attends to all the other tokens in the same sequence, and the whole computation is matrix multiplications, so all positions are processed at once.
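A rough sketch of single-head scaled dot-product self-attention; the projection matrices here are random stand-ins just to show the shapes and the fact that every position is handled in one matrix multiply rather than step by step.

```python
import math
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (batch, seq_len, dim); w_q / w_k / w_v: (dim, dim) projection matrices
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))  # (batch, seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)                       # each token's weights over all tokens
    return weights @ v                                        # (batch, seq_len, dim)

dim = 64
x = torch.randn(2, 7, dim)
out = self_attention(x, torch.randn(dim, dim), torch.randn(dim, dim), torch.randn(dim, dim))
print(out.shape)  # torch.Size([2, 7, 64])
```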
transfer learning: start from a model pretrained on a large corpus and fine-tune it on the target task instead of training from scratch
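A hedged example of what that looks like in practice with the Hugging Face transformers library; the checkpoint name and the two-label setup are just placeholders for whatever model and task you actually use.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased"   # example pretrained checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# pretrained body + a new, randomly initialized classification head to fine-tune
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

batch = tokenizer(["great movie", "terrible movie"], padding=True, return_tensors="pt")
outputs = model(**batch)
print(outputs.logits.shape)  # torch.Size([2, 2])
```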