XOR using a Multilayer Perceptron

Historically, almost every non-tree-based ML algorithm created only linearly separable decision boundaries, and XOR simply cannot be modeled with them. Enter the multilayer perceptron (MLP), and everything changed. With an elegant chaining of linear combinations of the inputs, each passed through a nonlinear activation, we can obtain almost any number of linear decision boundaries. So I thought I’d build a robust XOR gate using a simple MLP.

x1  x2  y
0   0   0
1   0   1
0   1   1
1   1   0
Truth table of XOR

The goal was to create a network that can learn these four points with as minimal a setup as possible: one layer each for input, hidden, and output nodes, where the hidden layer has only 2 neurons.

Simple MLP!
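Here is a minimal sketch of that network in code. I’m assuming PyTorch and sigmoid activations here; the framework and activation function aren’t essential to the argument, and any equivalent setup would do.

```python
import torch
import torch.nn as nn

# The four XOR points and their labels
X = torch.tensor([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

# 2 inputs -> 2 hidden neurons -> 1 output, sigmoid everywhere
model = nn.Sequential(
    nn.Linear(2, 2),
    nn.Sigmoid(),
    nn.Linear(2, 1),
    nn.Sigmoid(),
)
```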

Fitting four points with a neural network is simple. Right?
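Training it on the four points is just a standard loop. The optimizer, learning rate, and step count below are arbitrary choices, not anything special to XOR:

```python
loss_fn = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

for step in range(2000):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

print(loss.item())                 # often close to zero...
print(model(X).detach().round())   # ...with all four points classified correctly
```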

Loss on the XOR data. Done and dusted!

It turns out, no. While it did classify correctly most of the time, there were instances when the network simply wouldn’t converge.

Why you no converge to zero?

Why can’t it fit four points? Is Andrew Ng wrong when he claims that a simple MLP can predict XOR? The answer: it depends on how the weights are initialized. This dictates which local minimum the network will converge to.
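One way to see this is to rerun the same training loop under different random seeds and count how many runs get stuck. The 0.1 loss threshold for calling a run “stuck” is an arbitrary cutoff of mine:

```python
def final_loss(seed: int) -> float:
    torch.manual_seed(seed)
    net = nn.Sequential(nn.Linear(2, 2), nn.Sigmoid(),
                        nn.Linear(2, 1), nn.Sigmoid())
    opt = torch.optim.Adam(net.parameters(), lr=0.1)
    loss_fn = nn.BCELoss()
    for _ in range(2000):
        opt.zero_grad()
        loss = loss_fn(net(X), y)
        loss.backward()
        opt.step()
    return loss.item()

losses = [final_loss(seed) for seed in range(20)]
print(sum(l > 0.1 for l in losses), "of 20 runs got stuck at a high loss")
```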

To dig deeper into this, I created a 3D graph of y vs. (x1, x2), where the z-axis is y, the output of our simple MLP. Depending on the weights, the MLP will have different linear separators and a different height for each plateau.
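A surface like this can be produced by evaluating the trained network on a grid and handing it to matplotlib. This is just a sketch; the grid range around the four points is an arbitrary choice:

```python
import numpy as np
import matplotlib.pyplot as plt

x1, x2 = np.meshgrid(np.linspace(-1, 2, 100), np.linspace(-1, 2, 100))
grid = torch.tensor(np.stack([x1.ravel(), x2.ravel()], axis=1),
                    dtype=torch.float32)
with torch.no_grad():
    z = model(grid).numpy().reshape(x1.shape)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(x1, x2, z)
ax.set_xlabel("x1")
ax.set_ylabel("x2")
ax.set_zlabel("y")
plt.show()
```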

There are four plateaus for a simple MLP

The interesting observation is that the pair of plateaus on opposite corners necessarily takes the highest and the lowest z values. This means a simple MLP can never fit an XOR gate exactly, since the ideal gate should have the opposite quadrants at the same height. This got me confused. Where does Andrew Ng’s claim fit into the picture? The answer is that in that example the weights were built to fit the data, and this needn’t happen with random initializations.

Notice how (0,1) and (1,0) are in the valley while (0,0) and (1,1) are on the plateaus

This made me rethink my model. XOR is not about fitting four points, but about working over the entire Cartesian plane.

x1        x2        y
negative  negative  0
positive  negative  1
negative  positive  1
positive  positive  0
Modified XOR

With this in mind, I generated thousands of pairs of real numbers (x1, x2), each between -10 and 10, along with their corresponding y values. Almost every training run gave only about 85% accuracy.
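A sketch of that experiment follows. The dataset size, optimizer settings, and step count are arbitrary picks, not the exact values behind the 85% figure:

```python
# Real-valued XOR: label is 1 when x1 and x2 have opposite signs
n = 5000
X_real = torch.empty(n, 2).uniform_(-10, 10)
y_real = ((X_real[:, 0] > 0) != (X_real[:, 1] > 0)).float().unsqueeze(1)

net = nn.Sequential(nn.Linear(2, 2), nn.Sigmoid(),
                    nn.Linear(2, 1), nn.Sigmoid())
opt = torch.optim.Adam(net.parameters(), lr=0.05)
loss_fn = nn.BCELoss()
for _ in range(3000):
    opt.zero_grad()
    loss_fn(net(X_real), y_real).backward()
    opt.step()

with torch.no_grad():
    preds = (net(X_real) > 0.5).float()
    acc = (preds == y_real).float().mean().item()
print(f"accuracy: {acc:.1%}")  # with 2 hidden units this stays well short of 100%
```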

Ideal XOR vs. computed.
The red circles on the right mark the points that a simple 2-node MLP misclassifies.

This once again shows that a simple MLP cannot fit the XOR data. The workaround is to increase the number of nodes in the hidden layer, which in turn creates more plateaus that fit the data but do not generalize.
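For reference, the widened version only changes the hidden layer; 8 units here is an arbitrary choice, and anything larger than 2 adds more plateaus:

```python
wide_net = nn.Sequential(
    nn.Linear(2, 8),   # 8 hidden units instead of 2 (arbitrary)
    nn.Sigmoid(),
    nn.Linear(8, 1),
    nn.Sigmoid(),
)
# Training this on X_real / y_real with the same loop as above pushes the
# training accuracy much higher, but the extra plateaus are carved around
# the sampled points rather than around the underlying gate.
```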