=neural networks =machine learning
Why is it possible to train
neural networks? Gradient descent works well for convex problems, but the
problems that neural networks are used for have many poor local minima.
One approach to answering this question is to find some change to neural
networks that doesn't break their functionality but does break their
trainability. Sparsity is such a change.
It's possible to remove most parameters from a trained neural network without significantly reducing its performance. Iterative magnitude pruning, where a network repeatedly has its smallest weights removed and is retrained from an earlier state, can give similar performance with over 95% of parameters removed. However, which parameters can be removed depends on the particular random initialization used. The hypothesis that such trainable sparse subnetworks exist within dense random initializations is called the "lottery ticket hypothesis". Such good sparse initial states are very rare: in general, training a sparse network from scratch doesn't reach the performance of a dense or pruned network.
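As a rough sketch of how such a procedure can look, here is iterative magnitude pruning with rewinding to the initial weights, written in PyTorch. This is illustrative rather than the exact method of any particular paper: `train` is an assumed helper that trains the model while keeping masked weights at zero, and the pruning fraction and number of rounds are arbitrary.

```python
import copy
import torch

def iterative_magnitude_pruning(model, train, rounds=5, prune_frac=0.2):
    """Repeatedly train, prune the smallest-magnitude weights, and rewind
    the surviving weights to their initial values. Returns the sparse
    initial state (model + masks), which can then be trained with the masks."""
    initial_state = copy.deepcopy(model.state_dict())
    masks = {name: torch.ones_like(p)
             for name, p in model.named_parameters() if p.dim() > 1}

    for _ in range(rounds):
        train(model, masks)  # assumed: trains while keeping masked weights at zero

        # Prune the smallest prune_frac of the weights that are still alive.
        for name, p in model.named_parameters():
            if name not in masks:
                continue
            alive = p.detach().abs()[masks[name].bool()]
            threshold = alive.quantile(prune_frac)
            masks[name] *= (p.detach().abs() > threshold).float()

        # Rewind surviving weights to the original initialization.
        model.load_state_dict(initial_state)
        with torch.no_grad():
            for name, p in model.named_parameters():
                if name in masks:
                    p *= masks[name]

    return model, masks
```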
Suppose you're at a local minimum, blocked by an obstacle like case A: a
localized hill. In 1D, it's hard to cross the hill to a lower point, but in
2D, it's probably possible to go around it. Now suppose the energy landscape
instead looks like case B: the barrier extends through the new dimension as
a wall. Here, adding an extra dimension doesn't help.
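A toy numerical illustration of the case-A situation, with a made-up loss: a quadratic bowl whose minimum is at x = 2, plus a localized Gaussian bump near the origin acting as the hill. Restricted to one dimension, gradient descent from x = -1 gets stuck behind the bump; given one extra dimension, the same descent goes around it and reaches the bowl's minimum.

```python
import numpy as np

def loss(p):
    # Quadratic bowl with its minimum at x = 2 (other coordinates at 0),
    # plus a localized Gaussian bump near the origin acting as the barrier.
    return (p[0] - 2.0) ** 2 + np.sum(p[1:] ** 2) + 3.0 * np.exp(-5.0 * np.sum(p ** 2))

def grad(p):
    g = 2.0 * p.copy()
    g[0] -= 4.0  # d/dx of (x - 2)^2 is 2x - 4
    g += 3.0 * np.exp(-5.0 * np.sum(p ** 2)) * (-10.0 * p)  # bump gradient
    return g

def descend(p, lr=0.05, steps=300):
    for _ in range(steps):
        p = p - lr * grad(p)
    return p

p1 = descend(np.array([-1.0]))        # 1D: stuck behind the bump, near x = -0.45
p2 = descend(np.array([-1.0, 0.5]))   # 2D: curves around the bump, ends near (2, 0)
print("1D:", p1, loss(p1))
print("2D:", p2, loss(p2))
```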
In the case of neural networks, the
energy landscape is somewhere between these extremes. Increasing
dimensionality increases problem convexity, allowing better minima to be
found, and my view is that this is key to neural networks reaching useful
levels of performance. However, even neural networks with many times the
minimum number of parameters needed don't reach global minima.
One
implication of this is that it's possible to add and remove parameters
repeatedly during training to allow a network to escape local minima.
However, if a barrier in 1D becomes a wall in 2D, then adding a dimension
hasn't helped - in other words, for extra dimensions to increase convexity,
they can't be entirely orthogonal, which means they can't be cleanly removed
without retraining.
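As a sketch of what repeatedly removing and adding parameters during training can look like, here's one step of SET-style dynamic sparse training: drop the weakest active connections in a layer and regrow the same number elsewhere. `weight` and `mask` are assumed to be a layer's parameter tensor and its binary sparsity mask, with the mask applied after every optimizer step; the drop fraction and random regrowth rule are illustrative.

```python
import torch

def prune_and_regrow(weight, mask, frac=0.1):
    """One dynamic-sparsity step: deactivate the smallest-magnitude active
    weights, then activate the same number of previously-inactive positions."""
    with torch.no_grad():
        active = mask.bool()  # snapshot of the current mask
        inactive_idx = (~active).view(-1).nonzero(as_tuple=True)[0]
        n_swap = min(int(frac * active.sum().item()), inactive_idx.numel())
        if n_swap == 0:
            return mask

        # Drop: the smallest |w| among currently active weights.
        scores = weight.abs().masked_fill(~active, float("inf"))
        drop_idx = torch.topk(scores.view(-1), n_swap, largest=False).indices
        mask.view(-1)[drop_idx] = 0.0
        weight.view(-1)[drop_idx] = 0.0  # restart from zero if regrown later

        # Regrow: random positions that were inactive before this step.
        perm = torch.randperm(inactive_idx.numel(), device=weight.device)
        mask.view(-1)[inactive_idx[perm[:n_swap]]] = 1.0
    return mask
```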
Neural networks have a large capacity for memorization. Theoretical bounds
on that capacity involve the minimum distance between points in a
high-dimensional space.
In other words, full utilization of a dense neural network
involves evenly spaced points in the representation space. I believe
sparsity can be considered a loss of memorization capacity in regions of the
representation space, which is acceptable and even regularizing in practice
because real items are clustered.
For example, for an animal-categorizing network, it's useful to have a
cluster of points for different types of cats and another for different
types of horses. But the larger region of representation space between the
"horse" and "cat" clusters doesn't need to be used, and if an image is
somehow placed there, it's probably a mistake, so preventing that is
regularizing.
Sparsity can also be generalized beyond certain parameters being zero:
- A mixture-of-experts system can be considered a larger neural network with
adaptive sparsity (see the sketch after this list).
- A spiking neural network can be considered a form of temporal sparsity.
This only reduces the number of computations needed, not the number of
parameters, but in biological neural networks the potential energy savings
from temporal sparsity are obviously significant.
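To make the adaptive-sparsity framing of mixture-of-experts concrete, here's a minimal top-k routed MoE layer in PyTorch: for each input, a router selects k of the experts, so only an input-dependent fraction of the layer's parameters participates in that forward pass. The class name, sizes, and routing details are illustrative, not any particular published design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal mixture-of-experts layer: a router picks k experts per input,
    so only a small, input-dependent subset of parameters is used per example."""
    def __init__(self, dim, n_experts=8, k=2, hidden_mult=4):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden_mult * dim), nn.ReLU(),
                          nn.Linear(hidden_mult * dim, dim))
            for _ in range(n_experts)])

    def forward(self, x):                       # x: (batch, dim)
        scores = self.router(x)                 # (batch, n_experts)
        top_vals, top_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(top_vals, dim=-1)   # mix only the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                chosen = top_idx[:, slot] == e  # inputs routed to expert e in this slot
                if chosen.any():
                    out[chosen] += weights[chosen, slot].unsqueeze(1) * expert(x[chosen])
        return out

layer = TopKMoE(dim=16)
print(layer(torch.randn(4, 16)).shape)          # torch.Size([4, 16])
```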
Neural network sparsity is also interesting for reasons beyond what it reveals about dense neural networks:
- Biological
neural networks are very sparse, so understanding the effects of sparsity
could provide insights into them.
- Sparse neural networks can
theoretically provide large improvements in efficiency if hardware is
designed to support sparsity well.