=neural networks =machine learning
Why is it possible to train
neural networks? Gradient descent works well for convex problems, but the
problems that neural networks are used for have many poor local minima.
One approach to answering this question is to find some change to neural
networks that doesn't break their functionality but does break their
trainability. Sparsity is such a change.
It's possible to remove most parameters from a trained neural network without significantly reducing its performance. Iterative magnitude pruning, where a network repeatedly has its smallest weights removed and is retrained from an earlier state, can give similar performance with over 95% of parameters removed. However, which parameters can be removed depends on the particular random initialization used. The hypothesis that such trainable sparse subnetworks exist within dense random initializations is called the "lottery ticket hypothesis". Such good sparse initial states are very rare: in general, training a sparse network from scratch doesn't reach the performance of a dense or pruned network.
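As a rough sketch of how such a procedure can look, here is iterative magnitude pruning with rewinding to the initial weights, written in PyTorch. This is illustrative rather than the exact method of any particular paper: `train` is an assumed helper that trains the model while keeping masked weights at zero, and the pruning fraction and number of rounds are arbitrary.

```python
import copy
import torch

def iterative_magnitude_pruning(model, train, rounds=5, prune_frac=0.2):
    """Repeatedly train, prune the smallest-magnitude weights, and rewind
    the surviving weights to their initial values. Returns the sparse
    initial state (model + masks), which can then be trained with the masks."""
    initial_state = copy.deepcopy(model.state_dict())
    masks = {name: torch.ones_like(p)
             for name, p in model.named_parameters() if p.dim() > 1}

    for _ in range(rounds):
        train(model, masks)  # assumed: trains while keeping masked weights at zero

        # Prune the smallest prune_frac of the weights that are still alive.
        for name, p in model.named_parameters():
            if name not in masks:
                continue
            alive = p.detach().abs()[masks[name].bool()]
            threshold = alive.quantile(prune_frac)
            masks[name] *= (p.detach().abs() > threshold).float()

        # Rewind surviving weights to the original initialization.
        model.load_state_dict(initial_state)
        with torch.no_grad():
            for name, p in model.named_parameters():
                if name in masks:
                    p *= masks[name]

    return model, masks
```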
Suppose you're at a local minimum, blocked by an obstacle like case A: a
localized hill. In 1D, it's hard to cross the hill to a lower point, but in
2D, it's probably possible to go around it. Now suppose the energy landscape
instead looks like case B: the barrier extends through the new dimension as
a wall. Here, adding an extra dimension doesn't help.
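A toy numerical illustration of the case-A situation, with a made-up loss: a quadratic bowl whose minimum is at x = 2, plus a localized Gaussian bump near the origin acting as the hill. Restricted to one dimension, gradient descent from x = -1 gets stuck behind the bump; given one extra dimension, the same descent goes around it and reaches the bowl's minimum.

```python
import numpy as np

def loss(p):
    # Quadratic bowl with its minimum at x = 2 (other coordinates at 0),
    # plus a localized Gaussian bump near the origin acting as the barrier.
    return (p[0] - 2.0) ** 2 + np.sum(p[1:] ** 2) + 3.0 * np.exp(-5.0 * np.sum(p ** 2))

def grad(p):
    g = 2.0 * p.copy()
    g[0] -= 4.0  # d/dx of (x - 2)^2 is 2x - 4
    g += 3.0 * np.exp(-5.0 * np.sum(p ** 2)) * (-10.0 * p)  # bump gradient
    return g

def descend(p, lr=0.05, steps=300):
    for _ in range(steps):
        p = p - lr * grad(p)
    return p

p1 = descend(np.array([-1.0]))        # 1D: stuck behind the bump, near x = -0.45
p2 = descend(np.array([-1.0, 0.5]))   # 2D: curves around the bump, ends near (2, 0)
print("1D:", p1, loss(p1))
print("2D:", p2, loss(p2))
```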
In the case of neural networks, the
energy landscape is somewhere between these extremes. Increasing
dimensionality increases problem convexity, allowing better minima to be
found, and my view is that this is key to neural networks reaching useful
levels of performance. However, even neural networks with many times the
minimum number of parameters needed don't reach global minima.
One
implication of this is that it's possible to add and remove parameters
repeatedly during training to allow a network to escape local minima.
However, if a barrier in 1D becomes a wall in 2D, then adding a dimension
hasn't helped - in other words, for extra dimensions to increase convexity,
they can't be entirely orthogonal, which means they can't be cleanly removed
without retraining.
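As a sketch of what repeatedly removing and adding parameters during training can look like, here's one step of SET-style dynamic sparse training: drop the weakest active connections in a layer and regrow the same number elsewhere. `weight` and `mask` are assumed to be a layer's parameter tensor and its binary sparsity mask, with the mask applied after every optimizer step; the drop fraction and random regrowth rule are illustrative.

```python
import torch

def prune_and_regrow(weight, mask, frac=0.1):
    """One dynamic-sparsity step: deactivate the smallest-magnitude active
    weights, then activate the same number of previously-inactive positions."""
    with torch.no_grad():
        active = mask.bool()  # snapshot of the current mask
        inactive_idx = (~active).view(-1).nonzero(as_tuple=True)[0]
        n_swap = min(int(frac * active.sum().item()), inactive_idx.numel())
        if n_swap == 0:
            return mask

        # Drop: the smallest |w| among currently active weights.
        scores = weight.abs().masked_fill(~active, float("inf"))
        drop_idx = torch.topk(scores.view(-1), n_swap, largest=False).indices
        mask.view(-1)[drop_idx] = 0.0
        weight.view(-1)[drop_idx] = 0.0  # restart from zero if regrown later

        # Regrow: random positions that were inactive before this step.
        perm = torch.randperm(inactive_idx.numel(), device=weight.device)
        mask.view(-1)[inactive_idx[perm[:n_swap]]] = 1.0
    return mask
```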
Neural networks have a large capacity for memorization. Theoretical bounds
on that capacity involve the minimum distance between points in a
high-dimensional space.
In other words, full utilization of a dense neural network
involves evenly spaced points in the representation space. I believe
sparsity can be considered a loss of memorization capacity in regions of the
representation space, which is acceptable and even regularizing in practice
because real items are clustered.
For example, for an animal-categorizing network, it's useful to have a
cluster of points for different types of cats and another for different
types of horses. But the larger region of representation space between the
"horse" and "cat" clusters doesn't need to be used, and if an image is
somehow placed there, it's probably a mistake, so preventing that is
regularizing.
Sparsity can also be generalized beyond certain parameters being zero:
- A mixture-of-experts system can be considered a larger neural network with
adaptive sparsity (see the sketch after this list).
- A spiking neural network can be considered a form of temporal sparsity.
This only reduces the number of computations needed, not the number of
parameters, but in biological neural networks the potential energy savings
from temporal sparsity are obviously significant.
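To make the adaptive-sparsity framing of mixture-of-experts concrete, here's a minimal top-k routed MoE layer in PyTorch: for each input, a router selects k of the experts, so only an input-dependent fraction of the layer's parameters participates in that forward pass. The class name, sizes, and routing details are illustrative, not any particular published design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal mixture-of-experts layer: a router picks k experts per input,
    so only a small, input-dependent subset of parameters is used per example."""
    def __init__(self, dim, n_experts=8, k=2, hidden_mult=4):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden_mult * dim), nn.ReLU(),
                          nn.Linear(hidden_mult * dim, dim))
            for _ in range(n_experts)])

    def forward(self, x):                       # x: (batch, dim)
        scores = self.router(x)                 # (batch, n_experts)
        top_vals, top_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(top_vals, dim=-1)   # mix only the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                chosen = top_idx[:, slot] == e  # inputs routed to expert e in this slot
                if chosen.any():
                    out[chosen] += weights[chosen, slot].unsqueeze(1) * expert(x[chosen])
        return out

layer = TopKMoE(dim=16)
print(layer(torch.randn(4, 16)).shape)          # torch.Size([4, 16])
```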
Neural network sparsity is also interesting for reasons beyond what it reveals about dense neural networks:
- Biological
neural networks are very sparse, so understanding the effects of sparsity
could provide insights into them.
- Sparse neural networks can
theoretically provide large improvements in efficiency if hardware is
designed to support sparsity well.