Feed-forward neural networks
Since the revival of neural networks around 2014/2015, they have been used ever more commonly for many kinds of classification and auto-association tasks.
Classification task – what is that?
An often-used example is the recognition of hand-written postal codes (digits) on an envelope by a computer program. Mail sorters have done the job of routing each letter to the correct city and postal district for decades. A mail sorter performs a classification task: it labels each set of pixels representing one digit as one of 10 categories: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.
What is the problem for a conventional computer algorithm? It requires an exact pixel-by-pixel match. Each of us writes each numeric digit a bit differently: my hand-written ‘7’ differs from the ‘7’ written by you. An exact algorithm would have to learn, by example, every possible pixel pattern associated with each digit. Only if the bit patterns match – the representation of the digit stored in computer memory matches the scanned and digitized image of that digit – is the digit ‘7’ recognized.
Such an approach is unworkable: hand-written digits are unsuited to recognition by an exact algorithm.
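A minimal sketch of this brittleness, in Python, with two made-up 3×3 ‘images’ of the digit ‘7’ that differ in a single pixel:

```python
# Two tiny, made-up 3x3 binary "images" of the digit '7'.
# They differ in a single pixel, yet an exact pixel-by-pixel
# comparison treats them as completely different patterns.
template_7 = [
    [1, 1, 1],
    [0, 0, 1],
    [0, 1, 0],
]

scanned_7 = [
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 0],  # one extra pixel set: a slightly different hand
]

def exact_match(a, b):
    """Return True only if every pixel is identical."""
    return all(row_a == row_b for row_a, row_b in zip(a, b))

print(exact_match(template_7, template_7))  # True
print(exact_match(template_7, scanned_7))   # False: one pixel off, no match
```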
Instead, computer science looked to statistics in the 1960s, and the field of pattern recognition, with its pattern classifiers, was born. The word ‘pattern’ refers to the 2-D pixel image associated with, for example, a hand-written digit or a written letter of the alphabet. The first algorithms were based on statistical discriminant analysis or on the nearest-neighbor algorithm. Such algorithms compute, for each of the 10 known digits, the probability that the pattern (‘digit’) is that digit.
Choose the digit (0, 1, 2, …) with the highest probability, and use that as the recognized digit. For example, when P(pattern = ‘7’) = 0.81, all other 9 digits (0, 1, 2, 3, 4, 5, 6, 8, 9) are less likely than ‘7’, so the classifier assigns the label ‘7’ to the pixel pattern.
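The decision rule itself is tiny. A sketch in Python, with made-up class probabilities matching the example above:

```python
# Made-up posterior probabilities P(pattern = digit) for digits 0..9,
# matching the example: P(pattern = '7') = 0.81.
probabilities = [0.01, 0.02, 0.01, 0.03, 0.02, 0.04, 0.02, 0.81, 0.02, 0.02]

# The classifier assigns the label with the highest probability.
recognized_digit = max(range(10), key=lambda d: probabilities[d])
print(recognized_digit)  # 7
```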
In the second half of the 1980s, feed-forward neural networks were successfully trained to recognize postal codes; see, for example, the neural network by LeCun.
Feed-forward neural classifier
As shown by the application developed by LeCun, feed-forward neural networks are well suited as classifiers. A typical layout is shown below. For the postal code recognition task, the input nodes (left) are the input variables, and the output nodes are the categories to distinguish, such as the 10 hand-written digits.
How does a neural network classifier work – and why does it work?
The output of node j in a one-hidden-layer feed-forward neural network is computed from

$$ y_j = f\!\left( \sum_{h=1}^{H} w_{jh}\, f\!\left( \sum_{i=1}^{I} v_{hi}\, x_i + b_h \right) + b_j \right), $$

where f is typically the logistic function $f(a) = 1/(1 + e^{-a})$, $v_{hi}$ and $w_{jh}$ are the input-to-hidden and hidden-to-output weights, $b_h$ and $b_j$ are bias terms, H is the number of hidden nodes, and I is the number of input variables (for the postal code classifier task, I would be the total number of input pixels).
The output $y_j$ is interpreted as a probability: it indicates how likely it is that the pixel image belongs to category j, with j = 0, 1, …, 9.
Below, a neural network is shown with 4 input nodes, 2 hidden nodes, and 3 output nodes (categories).
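As an illustration of the formula, here is a sketch of the forward pass for such a 4–2–3 network in Python; the weights and the input vector are made up, and the logistic activation is assumed:

```python
import math

def f(a):
    """Logistic (sigmoid) activation, f(a) = 1 / (1 + e^-a)."""
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, V, b_hidden, W, b_out):
    """One-hidden-layer feed-forward pass:
    y_j = f( sum_h w_jh * f( sum_i v_hi * x_i + b_h ) + b_j )."""
    hidden = [f(sum(v_hi * x_i for v_hi, x_i in zip(v_h, x)) + b_h)
              for v_h, b_h in zip(V, b_hidden)]
    return [f(sum(w_jh * h for w_jh, h in zip(w_j, hidden)) + b_j)
            for w_j, b_j in zip(W, b_out)]

# Made-up weights for a 4-input, 2-hidden, 3-output network.
V = [[0.5, -0.2, 0.1, 0.4],     # hidden node 1: one weight per input
     [-0.3, 0.8, -0.5, 0.2]]    # hidden node 2
b_hidden = [0.0, 0.1]
W = [[1.2, -0.7],               # output node 1: one weight per hidden node
     [-0.4, 0.9],
     [0.3, 0.3]]
b_out = [0.0, -0.1, 0.05]

x = [0.9, 0.1, 0.4, 0.7]        # an example input vector
y = forward(x, V, b_hidden, W, b_out)
print(y)                         # one score per category; pick the largest
```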
Explanations – why this NN outcome?
Why does the input vector x most likely belong to a particular category j, the one with the maximal probability?
Explanations at two levels can be given:
- Input node influence
- Hidden node influence
Explanations of the first kind are useful when the neural network is applied to meaningful, individual input variables. Explanations of the second kind make the most sense when the neural network is applied to a signal-processing task, for example with a convolutional neural network.
Both types of explanation are statistical in nature – they rely on statistical associations between the input variables (x) and the n categories to distinguish, j = 1, …, n.
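One simple way to estimate input-node influence is a sensitivity analysis: nudge one input and observe how much the output for category j changes. A sketch, reusing the forward function and the made-up 4–2–3 network from the earlier example (the finite-difference measure used here is just one of several possible influence measures):

```python
def input_influence(x, j, eps=1e-4, **net):
    """Estimate d y_j / d x_i for each input i by finite differences:
    how much does output j move when input i is nudged by eps?"""
    base = forward(x, **net)[j]
    influences = []
    for i in range(len(x)):
        x_plus = list(x)
        x_plus[i] += eps
        influences.append((forward(x_plus, **net)[j] - base) / eps)
    return influences

net = dict(V=V, b_hidden=b_hidden, W=W, b_out=b_out)
print(input_influence(x, j=0, **net))  # sensitivity of output 0 to each input
```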
Determining the influence and meaning of the input variables has been addressed in earlier work on explanation in neural networks.
The explanation of the outcome – the most likely class – is based on principles of causality, as outlined in the readings collected by Sosa.
The approach to explanation entails combining the principles of causality with knowledge of the complete multivariate distribution of the input variables x, via Bayes' rule:

$$ P(c_1, c_2, \ldots \mid x) = \frac{P(c_1, c_2, \ldots)\, P(x \mid c_1, c_2, \ldots)}{P(x)}. $$
Hence, the complete multivariate distribution of x needs to be known. In addition, the rules of causation need to be applied, following the framework referenced below. These two aspects – the multivariate distribution and the rules of causation – must be joined in order for causal explanations to be generated from the outcome of a neural network.
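A sketch of this Bayes computation, under the strong simplifying assumption that the class-conditional densities P(x | c) are known; here they are invented one-dimensional Gaussians, whereas in practice the full multivariate distribution of x would have to be modeled:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian; stands in for the (in reality
    multivariate) class-conditional distribution P(x | c)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Made-up priors P(c) and class-conditional parameters for 3 categories.
priors = [0.5, 0.3, 0.2]
params = [(0.0, 1.0), (2.0, 1.5), (-1.0, 0.5)]  # (mu, sigma) per class

def posterior(x):
    """P(c | x) = P(c) P(x | c) / P(x), with P(x) = sum_c P(c) P(x | c)."""
    joint = [p * gaussian_pdf(x, mu, s) for p, (mu, s) in zip(priors, params)]
    evidence = sum(joint)           # P(x), the normalizing constant
    return [j / evidence for j in joint]

print(posterior(0.3))  # posterior probability of each category given x = 0.3
```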
One needs to model the semantic relations between the input variables, as well as the input–output relations, in order for causal explanations to be possible.
Reference
E. Sosa (ed.), Causation and Conditionals, Oxford Readings in Philosophy, Oxford University Press, Oxford, 1975.