Meta Learning with
Memory-Augmented
Neural Networks
ICML 2016, citations: 16
Katy@DataLab
2017.03.28
Background
• Memory-Augmented Neural Network (MANN) refers
to the class of networks equipped with external
memory, as opposed to architectures that rely only on
internal memory (such as LSTMs)
Motivation
• Some problems of interest (e.g., motor control) require
rapid inference from small quantities of data.
• This kind of flexible adaptation is a celebrated aspect
of human learning.
Related Work
• Graves, Alex, Greg Wayne, and Ivo Danihelka.
"Neural Turing Machines." arXiv preprint
arXiv:1410.5401 (2014).
Related Work
• Lake, Brenden M., Ruslan Salakhutdinov, and Joshua B. Tenenbaum.
"Human-level concept learning through probabilistic program induction."
Science 350.6266 (2015): 1332-1338.
Main Idea
• Learn to do classification on unseen classes
• Learn the sample-class binding in memory
instead of in the weights
• Let the weights learn higher-level knowledge
Model
• y_t (the label) is presented in a temporally offset manner: the input at step t is the pair (x_t, y_{t-1})
• Labels are shuffled from episode to episode, which prevents the network from slowly learning sample-class bindings in its weights
• The network must hold data samples in memory until the appropriate labels are presented at the next time step, after which the sample-class information can be bound and stored for later use (a sketch of this episode setup follows)
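A minimal sketch of how such an episode could be assembled. This is illustrative only: the dict images_by_class (mapping each class to a list of at least 10 equally sized image arrays), the function name, and the "null" first label are assumptions, not the paper's code.

import random

import numpy as np


def make_episode(images_by_class, n_classes=5, n_instances=10):
    # Sample n_classes classes and give each a label that is re-shuffled
    # every episode, so class-label bindings cannot settle into the weights.
    classes = random.sample(list(images_by_class), n_classes)
    perm = np.random.permutation(n_classes)
    label_of = {c: int(perm[i]) for i, c in enumerate(classes)}

    samples = [(img, label_of[c])
               for c in classes
               for img in random.sample(images_by_class[c], n_instances)]
    random.shuffle(samples)

    xs = np.stack([img for img, _ in samples])   # x_t, the input images
    ys = np.array([lbl for _, lbl in samples])   # true label of x_t
    # Temporally offset targets: at step t the network receives (x_t, y_{t-1})
    # and must predict y_t before that label arrives at the next step.
    y_prev = np.roll(ys, 1)
    y_prev[0] = 0  # placeholder "null" label for the first step (assumption)
    return xs, y_prev, ys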
Model
Basically the same as the Neural Turing Machine (NTM)
• Read from memory using the same content-based
approach as the NTM
• Write to memory using Least Recently Used
Access (LRUA)
• Least used: is this location rarely used?
• Recently used: was it just read from or written to?
Model
Content-based addressing: the controller produces a key k_t, which is compared with each memory row by cosine similarity; the read weights are a softmax over these similarities, and the read vector is the corresponding weighted sum of memory rows (a sketch follows)
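A minimal numpy sketch of this content-based read. The function name and signature are illustrative, not the paper's code.

import numpy as np


def content_read(memory, key, eps=1e-8):
    # memory: (N, M) matrix of N slots; key: (M,) key k_t from the controller.
    # Cosine similarity K(k_t, M_t(i)) between the key and each memory row.
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + eps)
    # Read weights w^r_t: softmax over the similarities.
    w_r = np.exp(sims - sims.max())
    w_r /= w_r.sum()
    # Read vector r_t: weighted sum of the memory rows.
    return w_r @ memory, w_r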
Least Recently Used
Access (LRUA)
• Usage weights w^u_t keep track of the locations most recently read or written to: w^u_t = gamma * w^u_{t-1} + w^r_t + w^w_t
• gamma is the decay parameter
• Least-used weights w^lu_t: w^lu_t(i) = 1 if w^u_t(i) <= m(w^u_t, n), and 0 otherwise
• m(v, n) denotes the nth smallest element of the vector v
• Here n is set to the number of reads from memory
• Write weights: w^w_t = sigma(alpha) * w^r_{t-1} + (1 - sigma(alpha)) * w^lu_{t-1}
• alpha is a learnable parameter (sigma is the sigmoid)
• Prior to writing to memory, the least-used memory location is set to zero (see the sketch below)
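A minimal numpy sketch of one LRUA write step under the rules above. The function names, the gamma default, and the split into a separate usage update are illustrative choices, not the paper's code.

import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lrua_write(memory, key, w_r_prev, w_u_prev, alpha, n_reads=1):
    # Least-used weights w^lu_{t-1}: 1 at the n_reads smallest usage entries, else 0.
    w_lu_prev = np.zeros_like(w_u_prev)
    w_lu_prev[np.argsort(w_u_prev)[:n_reads]] = 1.0

    # Write weights: w^w_t = sigma(alpha) * w^r_{t-1} + (1 - sigma(alpha)) * w^lu_{t-1}
    w_w = sigmoid(alpha) * w_r_prev + (1.0 - sigmoid(alpha)) * w_lu_prev

    # Prior to writing, the least-used location is zeroed; then the key is
    # added in proportion to the write weights: M_t = M_{t-1} + w^w_t * k_t.
    memory = memory.copy()
    memory[np.argmin(w_u_prev)] = 0.0
    memory += np.outer(w_w, key)
    return memory, w_w


def update_usage(w_u_prev, w_r, w_w, gamma=0.95):
    # Usage decay after reading and writing: w^u_t = gamma * w^u_{t-1} + w^r_t + w^w_t
    return gamma * w_u_prev + w_r + w_w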
Experiments
• Dataset: Omniglot
• 1623 classes with only
a few examples per
class (the transpose of
MNIST)
• 1200 training classes
• 423 test classes
Experiments
• Train for 100,000 episodes; each episode contains five randomly chosen classes with five randomly assigned labels, and 10 instances of each class
• Test on never-before-seen classes
Machine vs. Human
Class Representation
• A different approach for labeling classes was
employed so that the number of classes presented in
a given episode could be arbitrarily increased.
• Characters for each label were uniformly sampled
from the set {‘a’, ‘b’, ‘c’, ‘d’, ‘e’}, producing random
strings such as ‘ecdba’
Class Representation
• This combinatorial approach allows for 5^5 = 3125
possible labels, which is nearly twice the number of
classes in the dataset (a short sketch follows).
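A tiny sketch of this labeling scheme; the names are illustrative.

import random

ALPHABET = "abcde"

def random_label(length=5):
    # Each character is sampled uniformly and independently from the alphabet.
    return "".join(random.choices(ALPHABET, k=length))

assert len(ALPHABET) ** 5 == 3125   # number of possible label strings
print(random_label())               # e.g. 'ecdba'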
(Figures: test accuracy curves for the LSTM and the MANN, with 5 classes per episode and with 15 classes per episode)
Experiment with Different
Algorithms
Experiment with Different
Algorithms
• kNN (single nearest neighbour) has an unlimited
amount of memory and can automatically store
and retrieve all previously seen examples
• The MANN still outperforms kNN
• Using an LSTM as the controller works better than
using a feedforward NN
Experiment on Memory
Wiping
• A good strategy is to wipe the external memory
from episode to episode, since each episode
contains unique classes with unique labels.
(Figures: accuracy without wiping vs. with wiping the memory between episodes)
Experiment on Curriculum
Training
• Gradually increase the number of classes per episode
Experiment on Curriculum
Training
Conclusion
• Gradual, incremental learning encodes
background knowledge that spans tasks, while a
more flexible memory resource binds information
particular to newly encountered tasks (the external
memory is wiped between episodes in this
experiment)
• Demonstrates the ability of a memory-augmented
neural network to do meta-learning
• Introduces a new method to access external memory (LRUA)
Conclusion
• The controller and its slowly updated weights are like the
CPU / the neocortex, in charge of long-term knowledge
• The external memory is like the RAM / the hippocampus,
in charge of short-term memory and newly arriving
information
Conclusion
• As machine learning researchers, the lesson we
can glean from this is that it is acceptable for our
learning algorithms to suffer from forgetting, but
they may need complementary algorithms to
reduce the information loss.
Goodfellow, Ian J., et al. "An empirical investigation of catastrophic forgetting in gradient-based neural
networks." arXiv preprint arXiv:1312.6211 (2013).
Why do Memory-Augmented Neural
Networks work well in general?
1. Information must be stored in memory in a
representation that is both stable (so that it can be
reliably accessed when needed) and element-wise
addressable (so that relevant pieces of information
can be accessed selectively).
2. The number of parameters should not be tied to
the size of the memory (LSTMs do not fulfil this);
see the sketch below.
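A back-of-the-envelope sketch of point 2, with an arbitrary illustrative input size of 400: an LSTM's trainable parameter count grows with its internal memory (the hidden/cell size), while enlarging a MANN's external memory matrix adds no trainable parameters.

def lstm_params(input_size, hidden_size):
    # 4 gates, each with input weights, recurrent weights, and a bias.
    return 4 * (input_size * hidden_size + hidden_size ** 2 + hidden_size)

for h in (128, 256, 512):
    print(f"LSTM hidden size {h}: {lstm_params(400, h):,} parameters")

# An external memory of shape (N, M) is state, not weights: growing N or M
# leaves the controller's trainable parameter count unchanged.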
