Neural networks are powerful supervised learning algorithms that can be used for both classification and regression, although in most cases we deal with a classification problem. It would not be wrong to say that neural networks are the reason behind much of the hype around machine learning. Neural networks, or artificial neural networks, are also known as **universal function approximators**. Read __this__ chapter of the __neural networks and deep learning__ book. In this part, we will see what neural networks are and how they work. We will also implement a neural network using __TensorFlow__.

A neural network is an information processing system: we pass some input to the network, some processing happens, and we get some output. Neural networks are inspired by the biological connections between neurons and by how information is processed in the brain. For more on biological neural networks see __here__ and __here__. A very simple neural network architecture is shown below.

A neural network is made up of many neurons. We can think of each neuron as a single processing unit that accepts some input and produces some output. As the figure above shows, there are three layers: *input layer, hidden layer and output layer*. You might wonder why the middle layer is called hidden. It is called hidden because it is not visible from outside the network (it is preceded by the input layer and followed by the output layer). Each edge in the figure has a weight associated with it, which we can think of as the importance of a particular feature.

We pass an input vector or matrix to the input layer, where it is multiplied by a weight matrix. The result is then passed through an **activation function**, which introduces non-linearity into the network. With a sigmoid activation the value is squashed into the range (0, 1), and with tanh into (-1, 1). Common choices for activation functions are sigmoid, tanh and ReLU.
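As a minimal sketch, the three activations named in this section (sigmoid, tanh and ReLU) can be written in plain Python and applied to the weighted sum computed by a single neuron. The input values, the weights and the helper name `neuron` are made up for illustration, and the bias term is left out.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: maps a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Hyperbolic tangent: maps a real number into (-1, 1)."""
    return math.tanh(z)

def relu(z):
    """Rectified linear unit: zero for negative inputs, identity otherwise."""
    return max(0.0, z)

def neuron(inputs, weights, activation):
    """Weighted sum of the inputs followed by a non-linearity (bias omitted)."""
    z = sum(w * x for w, x in zip(weights, inputs))
    return activation(z)

x = [0.5, -1.0, 2.0]   # hypothetical input features
w = [0.1, 0.4, -0.3]   # hypothetical weights, one per feature

for name, f in [("sigmoid", sigmoid), ("tanh", tanh), ("relu", relu)]:
    print(name, neuron(x, w, f))
```

The weighted sum here is negative, so ReLU returns zero while sigmoid and tanh return values inside their respective ranges. Note that this naive `sigmoid` can overflow for very large negative inputs; library implementations guard against that.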

The most commonly used activation function is ReLU, since it has some nice properties: networks that use it tend to converge quickly, and it also helps with regularization. In addition, **SELU (Scaled Exponential Linear Unit)**, a more recent addition, has started getting a lot of buzz. Check out __this Github repo__ for an implementation and a comparison of SELU with other activation functions.
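For reference, SELU is short to write down. The sketch below uses the two constants published in the SELU paper (Klambauer et al., 2017); the names `SELU_ALPHA`, `SELU_SCALE` and `selu` are my own, and this is an illustration of the formula rather than the code from the repo linked above.

```python
import math

# Constants from the SELU paper (Klambauer et al., 2017).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def relu(z):
    """ReLU, for comparison: negative inputs are clipped to zero."""
    return max(0.0, z)

def selu(z):
    """Scaled exponential linear unit.

    Positive inputs are scaled linearly; negative inputs follow a scaled
    exponential curve that saturates at -SELU_SCALE * SELU_ALPHA.
    """
    if z > 0:
        return SELU_SCALE * z
    return SELU_SCALE * SELU_ALPHA * (math.exp(z) - 1.0)

for z in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    print(z, relu(z), selu(z))
```

Unlike ReLU, SELU keeps a non-zero output (and a non-zero gradient) for negative inputs.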

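Now, let us dive deeper into neural networks. Suppose we have an input vector *x_{1}, x_{2}, x_{3}*, and let us denote the hidden units by *h_{1}, h_{2}, h_{3}, h_{4}* and the output units by *o_{1}, o_{2}*. Writing $w_{ij}$ for the weight on the edge from input $i$ to hidden unit $j$, the hidden activations are then

$$h_j = f\left(\sum_{i=1}^{3} w_{ij}\, x_i\right), \quad j = 1, \dots, 4$$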

Here, *f* is the activation function; we will use ReLU. These equations are pretty straightforward. Now that we have the activations of the hidden layer neurons, we need to find the output. For classification, we usually have C output neurons, where C is the number of classes, and each output neuron gives the probability that the input belongs to a particular class. In this example we have two classes and hence two neurons in the output layer. We can find the activations of the output layer in the same way. Now, how do we map the output layer activations to probabilities? For this, we use the **softmax function**. The softmax function squashes a K-dimensional vector of real values into a K-dimensional vector of real values in the range (0, 1) that add up to 1. The softmax function can be written as:

$$\mathrm{softmax}(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \quad j = 1, \dots, K$$
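To make the forward pass concrete, here is a small sketch of the 3-4-2 network described above in plain Python. The weight values and the helper name `layer` are hypothetical, the bias is omitted as in the text, and ReLU is applied to the output layer as well because the text computes the output activations the same way as the hidden ones.

```python
import math

def relu(z):
    return max(0.0, z)

def layer(inputs, weights, activation):
    """Compute one layer. weights[i][j] connects input i to unit j (no bias)."""
    n_units = len(weights[0])
    return [
        activation(sum(inputs[i] * weights[i][j] for i in range(len(inputs))))
        for j in range(n_units)
    ]

def softmax(z):
    """Map a vector of real values to probabilities that sum to 1."""
    m = max(z)                            # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

x = [1.0, 2.0, 3.0]                       # input vector x1, x2, x3

w1 = [[0.1, -0.2, 0.3, 0.0],              # hypothetical 3 x 4 input-to-hidden weights
      [0.0, 0.1, -0.1, 0.2],
      [0.2, 0.0, 0.1, -0.3]]

w2 = [[0.5, -0.5],                        # hypothetical 4 x 2 hidden-to-output weights
      [0.1, 0.2],
      [-0.3, 0.8],
      [0.7, 0.1]]

h = layer(x, w1, relu)                    # hidden activations h1..h4
o = layer(h, w2, relu)                    # output activations o1, o2
probs = softmax(o)                        # class probabilities

print("hidden:", h)
print("output:", o)
print("probabilities:", probs)
```

Subtracting the maximum before exponentiating does not change the result of softmax, but it avoids overflow when the activations are large.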

Here, the denominator acts as a normalizer so that the output vector adds up to 1. So now we have our output. Note that there are two passes in the neural network algorithm: forward and backward. In the forward pass we calculate the output, and in the backward pass we calculate the gradient of the cost function with respect to the parameters of the network. But how does learning work in a neural network? We use gradient descent for learning, and the popular backpropagation algorithm to find the gradients. Before that, we need to define our cost function. Here we will use cross-entropy, which we define as:

$$L = -\sum_{c=1}^{C} y_c \log \hat{y}_c$$

where $y_c$ is 1 for the true class and 0 otherwise, and $\hat{y}_c$ is the softmax probability assigned to class $c$.
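The cross-entropy cost and its gradient can be sketched in a few lines. A well-known result is that, when softmax is followed by cross-entropy with a one-hot target, the gradient with respect to the pre-softmax values is simply the predicted probabilities minus the target. The sketch below checks that claim against a finite-difference estimate; the example values and function names are hypothetical.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target):
    """Cross-entropy between predicted probabilities and a one-hot target."""
    return -sum(t * math.log(p) for p, t in zip(probs, target) if t > 0)

def grad_wrt_logits(logits, target):
    """Analytic gradient of cross_entropy(softmax(logits), target)."""
    return [p - t for p, t in zip(softmax(logits), target)]

def numerical_grad(logits, target, eps=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grads = []
    for k in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[k] += eps
        down[k] -= eps
        diff = cross_entropy(softmax(up), target) - cross_entropy(softmax(down), target)
        grads.append(diff / (2 * eps))
    return grads

logits = [2.0, -1.0]    # hypothetical output-layer values before softmax
target = [0.0, 1.0]     # one-hot label: the true class is the second one

loss = cross_entropy(softmax(logits), target)
analytic = grad_wrt_logits(logits, target)
numeric = numerical_grad(logits, target)

print("loss:", loss)
print("analytic gradient:", analytic)
print("numerical gradient:", numeric)
```

Backpropagation applies the chain rule to push this gradient further back through the hidden layer to every weight.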

If you remember, we update each parameter by subtracting the gradient of the cost function with respect to that parameter, multiplied by a learning rate. I am not going into the details of the gradient calculation here, but it is easy to understand if you have basic knowledge of differentiation and the chain rule. Check this out if you are curious. Essentially, we calculate the gradient of the cost function with respect to each parameter. Note that, for efficiency, we do not deal with individual weights or parameters; instead we use vectors or matrices. For example, we represent the weights from the input layer to the hidden layer as a single matrix. Also, I omitted the bias term in the above equations for simplicity. Now we will implement a vanilla neural network using TensorFlow. If you haven't used TensorFlow before, then check out __Github repo__.
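Before moving on to the TensorFlow code, here is a minimal sketch of the update rule described above, applied to a whole parameter vector at once. The cost is a toy quadratic chosen only because its gradient is easy to write by hand; it is not the network's cross-entropy cost, and all names and values are hypothetical.

```python
def gradient_step(params, grads, learning_rate):
    """One gradient descent update: subtract learning_rate * gradient."""
    return [p - learning_rate * g for p, g in zip(params, grads)]

def toy_cost(params, target):
    """Toy quadratic cost: squared distance between params and target."""
    return sum((p - t) ** 2 for p, t in zip(params, target))

def toy_gradient(params, target):
    """Gradient of toy_cost with respect to each parameter."""
    return [2.0 * (p - t) for p, t in zip(params, target)]

target = [1.0, -2.0, 0.5]    # the minimum of the toy cost
params = [0.0, 0.0, 0.0]     # initial parameters
learning_rate = 0.1

history = []
for step in range(50):
    history.append(toy_cost(params, target))
    params = gradient_step(params, toy_gradient(params, target), learning_rate)

print("first cost:", history[0])
print("last cost:", history[-1])
print("params:", params)
```

In the TensorFlow code below, `tf.train.GradientDescentOptimizer` performs this same kind of update for every variable, using gradients obtained by backpropagation.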

```python
# Note: this code uses the TensorFlow 1.x API (placeholders, sessions and
# tensorflow.examples.tutorials), which is not part of TensorFlow 2.x.
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

# read the dataset
mnist = input_data.read_data_sets("MNIST_data", one_hot=True)

# create placeholders to store input and output data
X = tf.placeholder(tf.float32, shape=[None, 784])  # 28 * 28 = 784
y = tf.placeholder(tf.float32, shape=[None, 10])   # 10 classes

# create weights and bias for the input to hidden layer
w1 = tf.Variable(tf.truncated_normal([784, 50], stddev=0.5))
b1 = tf.Variable(tf.ones([50]))

# for hidden to output layer
w2 = tf.Variable(tf.truncated_normal([50, 10], stddev=0.5))
b2 = tf.Variable(tf.ones([10]))

# forward pass; ReLU is applied to the output layer too, as described in the text
h = tf.nn.relu(tf.matmul(X, w1) + b1)
o = tf.nn.relu(tf.matmul(h, w2) + b2)

# cost function
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=o))
step = tf.train.GradientDescentOptimizer(0.2).minimize(cost)

# find accuracy
correct_prediction = tf.equal(tf.argmax(o, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)

for i in range(30000):  # increase the number of iterations
    train_data = mnist.train.next_batch(128)
    _, t_loss = sess.run([step, cost], feed_dict={X: train_data[0], y: train_data[1]})
    if i % 500 == 0:
        acc = sess.run([accuracy], feed_dict={X: mnist.test.images, y: mnist.test.labels})
        print("Step = {}, Accuracy = {}".format(i, acc))
```

Output:

```
Step = 0, Accuracy = [0.1367]
Step = 500, Accuracy = [0.63599998]
Step = 1000, Accuracy = [0.65100002]
Step = 1500, Accuracy = [0.66369998]
Step = 2000, Accuracy = [0.82440001]
Step = 2500, Accuracy = [0.83740002]
Step = 3000, Accuracy = [0.84259999]
Step = 3500, Accuracy = [0.8488]
Step = 4000, Accuracy = [0.85290003]
Step = 4500, Accuracy = [0.85439998]
Step = 5000, Accuracy = [0.85579997]
......
Step = 24500, Accuracy = [0.86970001]
Step = 25000, Accuracy = [0.87040001]
Step = 25500, Accuracy = [0.87010002]
Step = 26000, Accuracy = [0.87110001]
Step = 26500, Accuracy = [0.86970001]
Step = 27000, Accuracy = [0.87]
Step = 27500, Accuracy = [0.87]
Step = 28000, Accuracy = [0.87080002]
Step = 28500, Accuracy = [0.87080002]
Step = 29000, Accuracy = [0.86940002]
Step = 29500, Accuracy = [0.8696]
```

This was a very simple neural network. Note that a few changes to our implementation can push the accuracy above 95% (maybe even 99%). Some tweaks that help are: use a different optimizer (e.g. Adam), use dropout (to prevent overfitting), use learning rate decay (see __this__ for more), or use a convolutional neural network ( CNN Tutorial ).
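As an illustration of one of these tweaks, learning rate decay can be sketched as a small schedule function. The version below uses the common exponential form, in which the rate is multiplied by a fixed factor every `decay_steps` steps. The starting rate of 0.2 matches the optimizer above, while the function name and the values of `decay_steps` and `decay_rate` are assumptions chosen for illustration, not tuned settings.

```python
def exponential_decay(initial_rate, step, decay_steps, decay_rate, staircase=False):
    """Exponentially decayed learning rate.

    rate = initial_rate * decay_rate ** (step / decay_steps)
    With staircase=True the exponent is floored, so the rate drops in
    discrete jumps instead of shrinking a little at every step.
    """
    exponent = step // decay_steps if staircase else step / decay_steps
    return initial_rate * decay_rate ** exponent

initial_rate = 0.2     # same starting value as the optimizer above
decay_steps = 1000     # hypothetical: apply one decay factor per 1000 steps
decay_rate = 0.96      # hypothetical decay factor

for step in [0, 1000, 5000, 10000, 29500]:
    rate = exponential_decay(initial_rate, step, decay_steps, decay_rate)
    print("Step = {}, learning rate = {}".format(step, rate))
```

A large rate early on lets training move quickly, and a smaller rate later helps it settle instead of bouncing around a minimum.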