=explanation =programming =machine learning
Convolutional neural networks start with convolution: sliding
a small "kernel" over your image. This produces a greyscale filtered
image of how well that kernel matches each point. You do this for
several kernels, each of which detects something simple like a line
or a corner.
This is done for many
features, and you can think of the output for each feature as being like a
color channel of the output image, except there are usually far more than 3
such "colors".
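Here's a minimal sketch of that step in Python/numpy, just to show the
mechanics. The kernels, names, and sizes are made up for illustration, and
real frameworks implement this far more efficiently:

```
import numpy as np

def convolve2d(image, kernel):
    """Slide a kernel over a 2D image (stride 1, no padding). Each
    output pixel is the dot product of the kernel with the image patch
    under it: a score for how well the kernel matches that spot."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

# Two toy 3x3 kernels: one responds to vertical edges, one to horizontal.
vertical_edge = np.array([[-1, 0, 1]] * 3, dtype=float)
horizontal_edge = vertical_edge.T

image = np.random.rand(28, 28)  # stand-in for a greyscale image
# One filtered image per kernel, stacked as the "color" channels.
features = np.stack([convolve2d(image, k)
                     for k in (vertical_edge, horizontal_edge)], axis=-1)
print(features.shape)  # (26, 26, 2): two feature channels
```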
The next step is usually max-pooling. You shrink the
filtered image by dividing it into small blocks (often 2x2) and keeping only
the maximum value of each block. Each pixel now indicates something like
"there is probably a line somewhere in this area". Shrinking images like
this not only reduces the amount of data to process at the next step, it
also gives some looseness to the next stage of feature detection: you can
now shift a feature by a pixel and the next stage will get basically the
same input.
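Continuing the sketch above, here's 2x2 max-pooling in the same style (the
block size of 2 is the common choice, but it's a free parameter):

```
def max_pool(channel, size=2):
    """Shrink a 2D feature map: each output pixel is the max of one
    size-by-size block, i.e. "this feature fired somewhere in here"."""
    h, w = channel.shape
    h, w = h - h % size, w - w % size  # trim so blocks divide evenly
    blocks = channel[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

pooled = max_pool(features[..., 0])
print(pooled.shape)  # (13, 13), down from (26, 26)
```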
The same process is
repeated, except now instead of detecting patterns of input pixels, you're
detecting patterns of features. For example, maybe the next stage ends up
detecting large approximate circles, by checking for circular patterns of
line or corner features. After each stage, the image is again shrunk with
max-pooling, providing a hierarchical looseness to detection of hierarchical
features.
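Continuing the sketches above, a second stage is the same convolve-and-pool
step applied to the pooled feature channels instead of raw pixels, so its
kernels describe spatial arrangements of features. A real network would
learn many stage-2 kernels; this shows just one, with random weights
standing in for learned ones:

```
# Stage 1: one pooled channel per edge kernel, shape (13, 13, 2).
stage1 = np.stack([max_pool(convolve2d(image, k))
                   for k in (vertical_edge, horizontal_edge)], axis=-1)

# A stage-2 kernel spans all input channels: shape (3, 3, n_channels).
# Its response is the sum of per-channel convolutions.
kernel2 = np.random.rand(3, 3, stage1.shape[-1])
response = sum(convolve2d(stage1[..., c], kernel2[..., c])
               for c in range(stage1.shape[-1]))
stage2 = max_pool(response)
print(stage2.shape)  # (5, 5): smaller again, higher-level features
```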
A common optimization in modern systems is "1x1
convolution" where at each pixel, the vector of features is multiplied by a
matrix to reduce it to a smaller number of features. This works because the
features are often correlated enough that they can be compressed into a
smaller set without losing much information.
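Since a 1x1 convolution applies the same matrix to every pixel's feature
vector independently, it's a one-liner in the numpy style above (the
channel counts here are arbitrary):

```
h, w, c_in, c_out = 13, 13, 64, 16
feature_maps = np.random.rand(h, w, c_in)
W = np.random.rand(c_in, c_out)  # the 1x1 "kernel" is just a matrix

reduced = feature_maps @ W  # applied independently at each pixel
print(reduced.shape)  # (13, 13, 16): same image size, fewer features
```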
You keep shrinking the image and
increasing the number of features, and eventually you've shrunk things to 1
pixel. At this point, you're using an ordinary non-convolutional
(fully-connected) neural network, often called a "DNN" for "deep neural
network" even if it's not very deep.
With a DNN, each set of features is a vector without any spatial meaning,
and it's multiplied by a matrix to get the next feature set. But you could
just combine those matrices into a single matrix and multiply by that, so it
doesn't matter how many layers you have, you're still just doing a single
matrix multiplication. To avoid this, people run each feature through a
nonlinear function between the matrix multiplications. Traditionally this
has been a sigmoidal function, but more recently people have mainly used
"rectification" (ReLU), which just sets the value to 0 if it's negative.
If you're trying to tell
what's in an image, this sort of system works relatively well, but you
shouldn't get overconfident about its capabilities. You're detecting certain
patterns of patterns, but if you're detecting something like "rough and
yellow in the middle, red on top" for a certain type of bird, the system
could be confused by a tomato on top of a pineapple, where a human wouldn't
be. If you put, say, stickers on things with features the neural network
strongly responds to, that can wreck the recognition.
I remember talking
to a professor a few years back who told me that neural networks had
surpassed human performance at detecting a type of cancer in pictures. I
was skeptical, but I didn't change his mind because there was a paper
published in a journal he trusted. It turned out the paper in question had a
neural network trained on human-labeled images with
inaccurate labels, so
when they asked doctors to look at the images, the doctors didn't match the
(inaccurate) labels as well as a neural network trained on those labels.
One rule of thumb for neural network performance is that they do better
relative to humans when the relevant patterns are small. For example, if
you're detecting birds in
a photo of a beach, where the birds are just a few pixels in the sky, then
neural networks can do about as well as humans.
Similarly, neural
networks are adequate for board games like Chess or Go because the number of
squares and possible moves is relatively small. (But even then, while
AlphaZero is impressive work, it loses to specialized Chess programs when
processing power is limited and equal.) For larger state spaces, neural
networks become less effective.