XOR using a Multilayer Perceptron
28 Oct 2016

Historically, almost every non-tree-based ML algorithm created only linearly separable decision boundaries, and XOR simply cannot be modeled with them. Enter the multilayer perceptron (MLP), and everything changed. With an elegant chaining of linear combinations of inputs, we can obtain almost any number of linear decision boundaries. So I thought I'd build a robust XOR gate using a simple MLP.
| x1 | x2 | y |
|----|----|---|
| 0  | 0  | 0 |
| 1  | 0  | 1 |
| 0  | 1  | 1 |
| 1  | 1  | 0 |
The goal was to create a network that can learn these four points with as minimal a setup as possible: one layer each for input, hidden, and output nodes, where the hidden layer has only 2 neurons.
To fit four points using a neural network is simple. Right?
It turns out, no. While it did classify correctly most of the time, there were runs where the network simply wouldn't converge.
Why can't it fit four points? Was Andrew Ng wrong when he claimed that a simple MLP can predict XOR? The answer: it depends on how the weights are initialized. The initialization dictates which local minimum the network converges to.
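This is easy to reproduce. Here is a minimal sketch of the experiment, assuming a 2-2-1 sigmoid network trained with full-batch gradient descent on mean-squared error; the seed range, learning rate, and epoch count are arbitrary choices of mine, not from any particular reference:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The four XOR points and their labels.
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def train_xor(seed, epochs=10000, lr=1.0):
    """Train a 2-2-1 sigmoid MLP on XOR with full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    W1 = rng.uniform(-1, 1, (2, 2)); b1 = np.zeros((1, 2))
    W2 = rng.uniform(-1, 1, (2, 1)); b2 = np.zeros((1, 1))
    for _ in range(epochs):
        h = sigmoid(X @ W1 + b1)          # hidden layer
        out = sigmoid(h @ W2 + b2)        # output layer
        # Backpropagate the squared-error loss.
        d_out = (out - y) * out * (1 - out)
        d_h = (d_out @ W2.T) * h * (1 - h)
        W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(0)
        W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(0)
    # Fraction of the four points classified correctly.
    return ((out > 0.5) == y).mean()

accuracies = [train_xor(seed) for seed in range(10)]
print(accuracies)
```

Running this, you can watch different seeds end up at different minima: some fit all four points, others settle on a boundary that misclassifies one or two of them.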
To dig deeper into this, I created a 3D graph of y vs. (x1, x2), where the z-axis is y, the output of our simple MLP. Depending on the weights, the MLP produces different linear separators and different heights for each plateau.
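The surface is cheap to compute. A sketch, assuming the same 2-2-1 architecture with an arbitrary random draw of weights (the ranges and grid extent are my choices for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Random 2-2-1 weights; every draw gives different separators and plateau heights.
rng = np.random.default_rng(42)
W1, b1 = rng.uniform(-5, 5, (2, 2)), rng.uniform(-5, 5, 2)
W2, b2 = rng.uniform(-5, 5, (2, 1)), rng.uniform(-5, 5, 1)

# Evaluate the network on a grid around the unit square.
x1, x2 = np.meshgrid(np.linspace(-1, 2, 60), np.linspace(-1, 2, 60))
grid = np.stack([x1.ravel(), x2.ravel()], axis=1)
z = sigmoid(sigmoid(grid @ W1 + b1) @ W2 + b2).reshape(x1.shape)

# To visualize the plateaus:
#   from matplotlib import pyplot as plt
#   ax = plt.figure().add_subplot(projection="3d")
#   ax.plot_surface(x1, x2, z)
#   plt.show()
```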
The interesting observation is that a pair of plateaus on opposite corners necessarily has the highest and lowest z values. This would mean a simple MLP can never fit an XOR gate, since the ideal gate should have opposite quadrants at the same height. This got me confused: where does Andrew Ng's claim fit into the picture? The answer is that his weights were hand-built to fit the data, and that needn't happen with random initialization.
This made me rethink my model: XOR is not about fitting four points, but about working with the entire Cartesian plane.
| x1       | x2       | y |
|----------|----------|---|
| negative | negative | 0 |
| positive | negative | 1 |
| negative | positive | 1 |
| positive | positive | 0 |
So I generated thousands of pairs of real numbers (x1, x2), each between -10 and 10, along with their corresponding y values. Almost every training run reached only about 85% accuracy.
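Generating such a dataset takes a few lines; a sketch, with the sample size as an arbitrary choice of mine:

```python
import numpy as np

# Real-valued XOR: the label is 1 exactly when x1 and x2 have opposite signs.
rng = np.random.default_rng(0)
X = rng.uniform(-10, 10, size=(5000, 2))
y = ((X[:, 0] > 0) != (X[:, 1] > 0)).astype(int)

print(y.mean())  # close to 0.5 by symmetry of the quadrants
```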
This once again shows that a simple MLP cannot fit XOR data. The workaround is to increase the number of nodes in the hidden layer, which in turn creates multiple plateaus that will only fit the data and not generalize.
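A sketch of that workaround, reusing the same training setup with a wider hidden layer; the hidden size, input scaling, learning rate, and epoch count are all arbitrary choices of mine:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Real-valued XOR data: label 1 exactly when x1 and x2 have opposite signs.
rng = np.random.default_rng(1)
X = rng.uniform(-10, 10, size=(2000, 2))
y = ((X[:, 0] > 0) != (X[:, 1] > 0)).astype(float).reshape(-1, 1)
Xs = X / 10.0  # scale inputs to [-1, 1] so the sigmoids don't saturate immediately

def train(hidden, epochs=3000, lr=0.5):
    """Full-batch gradient descent on a 2-hidden-1 sigmoid MLP with MSE loss."""
    W1 = rng.uniform(-1, 1, (2, hidden)); b1 = np.zeros(hidden)
    W2 = rng.uniform(-1, 1, (hidden, 1)); b2 = np.zeros(1)
    losses = []
    for _ in range(epochs):
        h = sigmoid(Xs @ W1 + b1)
        out = sigmoid(h @ W2 + b2)
        losses.append(float(((out - y) ** 2).mean()))
        d_out = (out - y) * out * (1 - out) / len(Xs)
        d_h = (d_out @ W2.T) * h * (1 - h)
        W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(0)
        W1 -= lr * Xs.T @ d_h;  b1 -= lr * d_h.sum(0)
    acc = float(((out > 0.5) == (y > 0.5)).mean())
    return losses, acc

losses, acc = train(hidden=8)
print(acc)
```

Each extra hidden unit contributes another separating line, and hence more plateaus, to the output surface. That added capacity is exactly what lets the network carve out the quadrants of the training data, and exactly why it can overfit rather than generalize.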