# Neural network training

I need help training my neural network. For some reason, whenever I train it, all the weights go to 0.

Main tab:


Network n = new Network(1, 1, 1);

float lr = 0.07;

float[] in = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1};
float[] out = {-0.1, -0.2, -0.3, -0.4, -0.5, -0.6, -0.7, -0.8, -0.9, -1};

void setup() {
size(800, 800);
}

void draw() {
background(0);
float[] guess = {0.1};
println(n.guess(guess));
}

void mousePressed() {
for (int i = 0; i < in.length; i++) {
float[] TIn = {in[i]};
float[] TOut = {out[i]};
n.train(TIn, TOut);
}
}



Network class:

class Network {
Neuron[] Layer1;
Neuron[] Layer2;
Neuron[] Layer3;

Network(int l1, int l2, int l3) {
Layer1 = new Neuron[l1];
Layer2 = new Neuron[l2];
Layer3 = new Neuron[l3];

for (int i = 0; i < Layer1.length; i++) {
Layer1[i] = new Neuron(l1, 8);
for (int j = 0; j < Layer1[i].w.length; j++) {
Layer1[i].w[j] = 1;
}
}
for (int i = 0; i < Layer2.length; i++) {
Layer2[i] = new Neuron(l1, 8);
}
for (int i = 0; i < Layer3.length; i++) {
Layer3[i] = new Neuron(l2, 8);
}
}

float[] guess(float[] input) {
float[] l1_guess = new float[Layer1.length];

for (int i = 0; i < l1_guess.length; i++) {
l1_guess[i] = Layer1[i].WSum(input);
}

float[] l2_guess = new float[Layer2.length];
for (int i = 0; i < l2_guess.length; i++) {
l2_guess[i] = Layer2[i].WSum(l1_guess);
}

float[] l3_guess = new float[Layer3.length];
for (int i = 0; i < l3_guess.length; i++) {
l3_guess[i] = Layer3[i].WSum(l2_guess);
}

return l3_guess;
}

void train(float[] inputs, float[] a) {
float[] guess = this.guess(inputs);
float[] errors = new float[a.length];

for (int i = 0; i < errors.length; i++) {
errors[i] = a[i] - guess[i];
}
/*println(guess);
println(inputs);
println(errors);*/

for (int i = 0; i < Layer3.length; i++) {
}

for (int i = 0; i < Layer2.length; i++) {

float sum = 0;
for (int j = 0; j < Layer3.length; j++) {
sum += errors[j] * Layer3[j].w[i];
}
}
}

float[][] Array2d(float[] array) {
float[][] result = new float[array.length][1];
for (int i = 0; i < array.length; i++) {
result[i][0] = array[i];
}
return result;
}
}



Neuron class:


class Neuron {
float[] w;
float SIG_a;

Neuron(int size, float a) {
w = new float[size];
SIG_a = a;

for (int i = 0; i < w.length; i++) {
w[i] = random(-1, 1);
}
}

int WSize() {
return w.length;
}

float WSum(float[] input) {
float sum = 0;

for (int i = 0; i < w.length; i++) {
sum += input[i] * w[i];
}
return sig(sum, SIG_a);
}

void train(float[] input, float a) {
float WSum = WSum(input);
float error = a - WSum;

for (int i = 0; i < w.length; i++) {
w[i] += error * input[i] * lr;
}
}

void trainError(float[] input,float error) {

for (int i = 0; i < w.length; i++) {
w[i] += error * input[i] * lr;
}
}
}

float sig(float x, float a) {
return (2 / (1 + exp(-x * a))) - 1;
}

float invSig(float x){
return x * (1 - x);
}

float sign(float n) {
return n >= 0? 1 : -1;
}



I made a few tweaks that might help with debugging.

Network n = new Network(1, 1, 1);
int epoch = 0;

float lr = 0.07;

float[] in = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1};
float[] out = {-0.1, -0.2, -0.3, -0.4, -0.5, -0.6, -0.7, -0.8, -0.9, -1};

void setup() {
size(800, 800);
noLoop();
}

void draw() {
background(0);
float[] guess = {0.1};
println("Guess");
println(n.guess(guess));
println();
}

void mousePressed() {
epoch++;
println("Epoch " + epoch);

for (int i = 0; i < in.length; i++) {
float[] TIn = {in[i]};
float[] TOut = {out[i]};
n.train(TIn, TOut);
}

println("Layer1");
for (Neuron ron : n.Layer1) {
printArray(ron.w);
}
println("Layer2");
for (Neuron ron : n.Layer2) {
printArray(ron.w);
}
println("Layer3");
for (Neuron ron : n.Layer3) {
printArray(ron.w);
}

redraw();
}


How does the state of your network during each epoch differ from what you expect to see?


Layer2 approaches 0
Layer3 stays at around 0.56

The error is probably here:

 void train(float[] inputs, float[] a) {
float[] guess = this.guess(inputs);
float[] errors = new float[a.length];

for (int i = 0; i < errors.length; i++) {
errors[i] = a[i] - guess[i];
}
/*println(guess);
println(inputs);
println(errors);*/

for (int i = 0; i < Layer3.length; i++) {
}

for (int i = 0; i < Layer2.length; i++) {

float sum = 0;
for (int j = 0; j < Layer3.length; j++) {
sum += errors[j] * Layer3[j].w[i];
}
}
}


I don't understand much about backpropagation.


I'm still developing intuition for backpropagation myself. One of the simplest explanations I've seen is NanoNeuron by Oleksii Trekhleb, so I remixed it. Let's go!

## Data

The linear function we're going to learn is f(x)=1.8x+32, which converts temperatures from Celsius to Fahrenheit.

float celsiusToFahrenheit(float x) {
float w = 1.8;
float b = 32;
float z = w * x + b;

return z;
}


The training data set will consist of two arrays for x and y values.

float[] xTrain = new float[100];
float[] yTrain = new float[100];

void generateDataset() {
for (int i = 0; i < xTrain.length; i++) {
xTrain[i] = i;
yTrain[i] = celsiusToFahrenheit(xTrain[i]);
}
}


## Prediction

The neurons in the network are simple linear units. During forward propagation, each Neuron object multiplies its input by a weight and adds a bias.

class Neuron {
float w;
float b;

Neuron() {
w = random(-1, 1);
b = 0;
}

float forwardProp(float a) {
float z = w * a + b;

return z;
}
}


The first network we're going to train consists of a single Neuron we'll call Layer1. I'm not going to apply an activation function like sigmoid, tanh, or ReLU, but I'm still going to call the output of Layer1 its activation a1.

class Network {
Neuron Layer1;

float a1 = 0;

Network() {
Layer1 = new Neuron();
}

float forwardProp(float x) {
a1 = Layer1.forwardProp(x);

return a1;
}
}


## Cost

OK, so we can call forwardProp on a Network object and make a prediction. But Layer1's weight was randomly initialized and its bias is 0, so any prediction is probably trash. We need information on how badly a network is performing in order to nudge its weights and biases towards more optimal values. The measure of "badness" is called cost, and we want to minimize it.

There are a handful of standard ways to calculate cost. We'll use a riff on mean squared error, starting with C=\frac{1}{2}(a_{1}-y)^{2} for each training example. Since we're training on a data set with m=100 examples, the average cost after each epoch of training would be:

averageCost=\frac{1}{m} \displaystyle\sum_{i=1}^m{C_{i}}
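As a rough illustration of the two formulas above, here is a minimal plain-Java sketch. The class and method names (`CostSketch`, `cost`, `averageCost`) and the sample numbers are my own placeholders, not part of the Processing sketch in this post.

```java
// Hypothetical helper class; names and sample values are illustrative only.
class CostSketch {
    // Cost of a single prediction: C = 1/2 * (a1 - y)^2
    static double cost(double a1, double y) {
        double diff = a1 - y;
        return 0.5 * diff * diff;
    }

    // Average of the per-example costs over m examples.
    static double averageCost(double[] predictions, double[] targets) {
        int m = predictions.length;
        double sum = 0;
        for (int i = 0; i < m; i++) {
            sum += cost(predictions[i], targets[i]);
        }
        return sum / m;
    }

    public static void main(String[] args) {
        double[] predictions = {30.0, 50.0};
        double[] targets = {32.0, 50.0};
        // Per-example costs are 0.5 * 2^2 = 2 and 0, so the average is 1.
        System.out.println(averageCost(predictions, targets));
    }
}
```

A perfect prediction contributes 0, and the squared difference makes large misses count much more than small ones.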

So far we have data, prediction, and cost.

Now how do we figure out which way to nudge our neurons? Cost is a function of the weights and biases in the network, and we want to minimize cost, so we need to move toward the minimum of the cost function.

Let's pretend the cost function looks like the blue parabola below. If we were at the point (3,-3), then we'd need to move left towards the minimum.

[Figure: a tangent line. Courtesy OpenStax]

Imagine zooming into the point (3,-3). No matter how closely you look, the orange tangent line y=2x-9 only intersects the blue curve at precisely the point (3,-3).

The slope of this particular tangent line is +2; this is our old friend rise over run \frac{\Delta{y}}{\Delta{x}}. At this particular point on the graph of f the slope is positive, so increasing the input x increases the output f(x). If we want to minimize cost, then we'd better decrease x by moving to the left.

## Calculus!

It's time for a little bit of calculus. The slope of the line tangent to a function f at any given point is known as the derivative. The derivative of f with respect to a variable x is sometimes written \frac{df}{dx} or f'(x).

What is the derivative (slope) of a constant function like f(x)=5?

We know the derivative of a linear function f(x)=mx+b is \frac{df}{dx}=m. The first result from calculus I'll use without any explanation whatsoever is this: the derivative of a quadratic function f(x)=ax^{2} is \frac{df}{dx}=2ax. The exponent 2 drops down and multiplies the base x along with whatever coefficient a happens to be out front.

If the function also had linear and constant terms, f(x)=ax^{2}+bx+c, then we'd be looking at f'(x)=2ax+b. The derivative of a sum is the sum of the derivatives.
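If you'd rather not take the 2ax+b rule on faith, one way to sanity-check it is to measure rise over run across a tiny interval and compare. The sketch below is plain Java; the coefficients, the step size `h`, and all the names are arbitrary choices of mine for illustration.

```java
// Hypothetical check of f'(x) = 2ax + b against a measured slope.
class SlopeSketch {
    // Arbitrary example coefficients, chosen only for illustration.
    static final double A = 3.0, B = -2.0, C = 7.0;

    static double f(double x) {
        return A * x * x + B * x + C;
    }

    // Derivative according to the rule in the text.
    static double fPrime(double x) {
        return 2 * A * x + B;
    }

    // Rise over run across a tiny interval centred on x.
    static double numericSlope(double x) {
        double h = 1e-4;
        return (f(x + h) - f(x - h)) / (2 * h);
    }

    public static void main(String[] args) {
        double x = 1.5;
        System.out.println("rule:    " + fPrime(x));
        System.out.println("numeric: " + numericSlope(x));
    }
}
```

The two numbers should agree to several decimal places; any small difference is floating-point rounding.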

Just for kicks, find the derivative of f(x)=x^{2}-4x, then evaluate it at x=3.

### The Chain Rule

The second result from calculus I'm just going to use is called the chain rule. Some functions f can be written as the composition of two other functions g and h, as in f(x)=g(h(x)). The original input x is first passed into h, then that function's output h(x) becomes the input to g.

x -> h -> h(x) -> g -> g(h(x))


Given such a composition, the chain rule says:

f'(x)=g'(h(x))h'(x)

Yikes... let's see that in action. I'll rewrite

f(x)=x^{2}-4x

as

f(x)=(x-2)^{2}-4

Now if we set g(x)=x^{2}-4 and h(x)=x-2, we have f(x)=g(h(x))=(x-2)^{2}-4.

Alright, moment of truth.

g'(h(x))=2(x-2)
h'(x)=1

So what is g'(h(x))h'(x)? And does it equal the derivative you found a moment ago?
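For anyone who wants to check their answer, here is the calculation written out. The last line assumes the blue parabola in the figure is f(x)=x^{2}-4x, which is consistent with the point (3,-3) and the tangent line y=2x-9.

```latex
\begin{aligned}
g'(h(x))\,h'(x) &= 2(x-2)\times 1 = 2x-4 \\
f'(x) &= \frac{d}{dx}\left(x^{2}-4x\right) = 2x-4 \\
f'(3) &= 2(3)-4 = 2
\end{aligned}
```

Both routes give 2x-4, and the value at x=3 matches the tangent slope of +2 from earlier.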

## Backpropagation

Back to the Network. Cost is a function of both w and b, so we need to figure out how cost changes with respect to each variable. More formally, we need to find the partial derivatives \frac{\partial{C}}{\partial{w}} and \frac{\partial{C}}{\partial{b}}. Let's follow one training example through the Network.

a_{1}=wx+b
C=\frac{1}{2}(a_{1}-y)^{2}

It looks like C could easily be rewritten as C=\frac{1}{2}(wx+b-y)^{2}. Remember, x and y are known training examples, not variables. The quantities we're actually changing are w and b, and we'll consider one at a time.

\frac{\partial{C}}{\partial{w}}=2\times{\frac{1}{2}}\times{(wx+b-y)}\times{x}
\frac{\partial{C}}{\partial{b}}=2\times{\frac{1}{2}}\times{(wx+b-y)}\times{1}

Let's simplify and streamline the notation a bit.

dw=(a_{1}-y)x
db=a_{1}-y

This is the heart of backpropagation. Here is a tiny implementation we can add to the Network class that uses a FloatDict for convenience.

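  FloatDict backProp(float x, float y) {
    FloatDict grads = new FloatDict();
    grads.set("dw", (a1 - y) * x); // dC/dw for this example
    grads.set("db", a1 - y);       // dC/db for this example
    return grads;
  }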


We'll calculate both partial derivatives for each training example (x_{i},y_{i}) and sum them up, then we'll use their averages to update w and b like so.

w:=w-\alpha{dw}
b:=b-\alpha{db}

The partial derivatives form the Network's gradient, which points "uphill" for any given function. Since we're trying to minimize cost, we nudge our neurons in the opposite direction by subtracting the partial derivatives.

And that \alpha term? It's the learning rate, a multiplier that determines how fast we go downhill toward the minimum. Turns out this hyperparameter is a big deal.
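To get a feel for why the learning rate matters, here is a small plain-Java experiment on the toy parabola f(x)=x^{2}-4x from the calculus section rather than on the network itself. The class name, the starting point x=3, and the two alpha values are my own picks for illustration.

```java
// Hypothetical demo: the same descent rule with two different learning rates.
class LearningRateSketch {
    // Gradient descent on f(x) = x^2 - 4x, whose derivative is f'(x) = 2x - 4.
    // Returns x after the given number of steps, starting from x = 3.
    static double descend(double alpha, int steps) {
        double x = 3.0;
        for (int i = 0; i < steps; i++) {
            double slope = 2 * x - 4;
            x -= alpha * slope; // step against the slope
        }
        return x;
    }

    public static void main(String[] args) {
        // Small steps settle near the minimum at x = 2.
        System.out.println("alpha = 0.1: " + descend(0.1, 100));
        // Steps that are too large overshoot further each time and blow up.
        System.out.println("alpha = 1.1: " + descend(1.1, 100));
    }
}
```

With the smaller value each step shrinks the distance to the minimum; with the larger one each step lands further away on the opposite side, so the result runs off instead of converging.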

  void train(float[] x, float[] y) {
    int m = x.length;
    averageCost = 0;
    float dw = 0;
    float db = 0;
    for (int i = 0; i < m; i++) {
      forwardProp(x[i]);
      float predictionCost = 0.5 * pow(a1 - y[i], 2);
      averageCost += predictionCost;
      // Accumulate this example's partial derivatives.
      FloatDict grads = backProp(x[i], y[i]);
      dw += grads.get("dw");
      db += grads.get("db");
    }
    averageCost /= m;
    dw /= m;
    db /= m;
    Layer1.w -= alpha * dw;
    Layer1.b -= alpha * db;
  }


## All together now

### Neuron

class Neuron {
float w;
float b;

Neuron() {
w = random(-1, 1);
b = 0;
}

float forwardProp(float a) {
float z = w * a + b;

return z;
}
}


### Network

class Network {
  Neuron Layer1;

  float a1 = 0;
  float averageCost = 0;

  Network() {
    Layer1 = new Neuron();
  }

  float forwardProp(float x) {
    a1 = Layer1.forwardProp(x);

    return a1;
  }

  // Partial derivatives of the cost for one training example.
  FloatDict backProp(float x, float y) {
    FloatDict grads = new FloatDict();
    grads.set("dw", (a1 - y) * x);
    grads.set("db", a1 - y);
    return grads;
  }

  void train(float[] x, float[] y) {
    int m = x.length;
    averageCost = 0;
    float dw = 0;
    float db = 0;
    for (int i = 0; i < m; i++) {
      forwardProp(x[i]);
      float predictionCost = 0.5 * pow(a1 - y[i], 2);
      averageCost += predictionCost;
      // Accumulate this example's partial derivatives.
      FloatDict grads = backProp(x[i], y[i]);
      dw += grads.get("dw");
      db += grads.get("db");
    }
    averageCost /= m;
    dw /= m;
    db /= m;
    Layer1.w -= alpha * dw;
    Layer1.b -= alpha * db;
  }
}


### Sketch

// Based on NanoNeuron by Oleksii Trekhleb
float celsiusToFahrenheit(float x) {
float w = 1.8;
float b = 32;
float z = w * x + b;

return z;
}

float[] xTrain = new float[100];
float[] yTrain = new float[100];

void generateDataset() {
for (int i = 0; i < xTrain.length; i++) {
xTrain[i] = i;
yTrain[i] = celsiusToFahrenheit(xTrain[i]);
}
}

Network n = new Network();
float alpha = 0.0005;
int epoch = 0;
FloatList cost = new FloatList();

void setup() {
size(100, 100);
generateDataset();
noLoop();
}

void draw() {
background(0);
for (int i = 0; i < 1000; i++) {
epoch++;
n.train(xTrain, yTrain);
cost.append(n.averageCost); // icanhaz visualization?
}
println("Epoch: " + epoch);
println("f(x) = " + n.Layer1.w + " * x + " + n.Layer1.b);
float x = 10;
float y = celsiusToFahrenheit(x);
println("Guess: " + n.forwardProp(x));
println("Expected: " + y);
println();
}

void mousePressed() {
redraw();
}
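If you want to watch the cost fall without opening Processing, the same idea fits in a short plain-Java program. This is only a sketch under my own assumptions: the class name `CelsiusSketch`, the fixed starting weight of 0.5 (the Processing version uses random(-1, 1)), and the use of double instead of float are all choices made for this example.

```java
// Hypothetical plain-Java version of the single-neuron training loop.
class CelsiusSketch {
    double w = 0.5; // fixed start so runs are repeatable (an assumption)
    double b = 0.0;

    static double celsiusToFahrenheit(double c) {
        return 1.8 * c + 32;
    }

    // One epoch of batch gradient descent.
    // Returns the average cost measured before the update is applied.
    double trainEpoch(double[] x, double[] y, double alpha) {
        int m = x.length;
        double cost = 0, dw = 0, db = 0;
        for (int i = 0; i < m; i++) {
            double a1 = w * x[i] + b;          // forward pass
            double diff = a1 - y[i];
            cost += 0.5 * diff * diff;
            dw += diff * x[i];                 // dC/dw for this example
            db += diff;                        // dC/db for this example
        }
        w -= alpha * dw / m;                   // step against the averaged gradient
        b -= alpha * db / m;
        return cost / m;
    }

    public static void main(String[] args) {
        double[] x = new double[100];
        double[] y = new double[100];
        for (int i = 0; i < x.length; i++) {
            x[i] = i;
            y[i] = celsiusToFahrenheit(x[i]);
        }
        CelsiusSketch net = new CelsiusSketch();
        for (int epoch = 1; epoch <= 1000; epoch++) {
            double cost = net.trainEpoch(x, y, 0.0005);
            if (epoch == 1 || epoch % 250 == 0) {
                System.out.println("epoch " + epoch + "  cost " + cost);
            }
        }
        System.out.println("f(x) = " + net.w + " * x + " + net.b);
    }
}
```

Expect the cost to drop quickly at first and then slowly. With inputs as large as 99 and this learning rate, the bias should creep toward 32 over many thousands of epochs, which is why the Processing sketch keeps training on every click.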


Layers? OK!

x -> f1 -> f1(x) -> f2 -> f2(f1(x))

a_1=w_1x+b_1
a_2=w_2a_1+b_2

The cost of each prediction in this case is now based on a_2.

C=\frac{1}{2}(a_2-y)^2

Try applying the chain rule to come up with the partial derivatives for the weights and biases. Start from Layer2 and work backwards.

\frac{\partial{C}}{\partial{w_2}}=\frac{\partial{C}}{\partial{a_2}}\times\frac{\partial{a_2}}{\partial{w_2}}
\frac{\partial{C}}{\partial{b_2}}=\frac{\partial{C}}{\partial{a_2}}\times\frac{\partial{a_2}}{\partial{b_2}}
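Here is one way the full set of partial derivatives works out, using only the definitions of a_1, a_2 and C above. For Layer1 the chain gets one link longer, because a change in w_1 or b_1 reaches the cost through a_1 and then a_2.

```latex
\begin{aligned}
\frac{\partial C}{\partial a_2} &= a_2-y, \qquad
\frac{\partial a_2}{\partial w_2} = a_1, \qquad
\frac{\partial a_2}{\partial b_2} = 1, \qquad
\frac{\partial a_2}{\partial a_1} = w_2 \\
\frac{\partial C}{\partial w_2} &= (a_2-y)\,a_1 \\
\frac{\partial C}{\partial b_2} &= a_2-y \\
\frac{\partial C}{\partial w_1} &= \frac{\partial C}{\partial a_2}\times\frac{\partial a_2}{\partial a_1}\times\frac{\partial a_1}{\partial w_1} = (a_2-y)\,w_2\,x \\
\frac{\partial C}{\partial b_1} &= \frac{\partial C}{\partial a_2}\times\frac{\partial a_2}{\partial a_1}\times\frac{\partial a_1}{\partial b_1} = (a_2-y)\,w_2
\end{aligned}
```

Note that the Layer1 partials both carry the factor w_2: the error is scaled by Layer2's weight on its way back.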

### Neuron

class Neuron {
float w;
float b;

Neuron() {
w = random(-1, 1);
b = 0;
}

float forwardProp(float a) {
float z = w * a + b;

return z;
}
}


### Network

class Network {
  Neuron Layer1;
  Neuron Layer2;

  float a1 = 0;
  float a2 = 0;
  float averageCost = 0;

  Network() {
    Layer1 = new Neuron();
    Layer2 = new Neuron();
  }

  float forwardProp(float x) {
    a1 = Layer1.forwardProp(x);
    a2 = Layer2.forwardProp(a1);

    return a2;
  }

  // Partial derivatives of the cost for one training example,
  // worked backwards from Layer2 to Layer1 with the chain rule.
  FloatDict backProp(float x, float y) {
    FloatDict grads = new FloatDict();
    grads.set("dw2", (a2 - y) * a1);
    grads.set("db2", a2 - y);
    grads.set("dw1", (a2 - y) * Layer2.w * x);
    grads.set("db1", (a2 - y) * Layer2.w);
    return grads;
  }

  void train(float[] x, float[] y) {
    int m = x.length;
    averageCost = 0;
    float dw2 = 0;
    float db2 = 0;
    float dw1 = 0;
    float db1 = 0;
    for (int i = 0; i < m; i++) {
      forwardProp(x[i]);
      float predictionCost = 0.5 * pow(a2 - y[i], 2);
      averageCost += predictionCost;
      // Accumulate this example's partial derivatives.
      FloatDict grads = backProp(x[i], y[i]);
      dw2 += grads.get("dw2");
      db2 += grads.get("db2");
      dw1 += grads.get("dw1");
      db1 += grads.get("db1");
    }
    averageCost /= m;
    dw2 /= m;
    db2 /= m;
    dw1 /= m;
    db1 /= m;
    Layer2.w -= alpha * dw2;
    Layer2.b -= alpha * db2;
    Layer1.w -= alpha * dw1;
    Layer1.b -= alpha * db1;
  }
}


### Sketch

// Based on NanoNeuron by Oleksii Trekhleb
float celsiusToFahrenheit(float x) {
float w = 1.8;
float b = 32;
float z = w * x + b;

return z;
}

float[] xTrain = new float[100];
float[] yTrain = new float[100];

void generateDataset() {
for (int i = 0; i < xTrain.length; i++) {
xTrain[i] = i;
yTrain[i] = celsiusToFahrenheit(xTrain[i]);
}
}

Network n = new Network();
float alpha = 0.00001;
int epoch = 0;
FloatList cost = new FloatList();

void setup() {
size(100, 100);
generateDataset();
noLoop();
}

void draw() {
background(0);
for (int i = 0; i < 1000; i++) {
epoch++;
n.train(xTrain, yTrain);
cost.append(n.averageCost); // icanhaz visualization?
}
println("Epoch: " + epoch);
println("f1(x) = " + n.Layer1.w + " * x + " + n.Layer1.b);
println("f2(x) = " + n.Layer2.w + " * x + " + n.Layer2.b);
float x = 10;
float y = celsiusToFahrenheit(x);
println("Guess: " + n.forwardProp(x));
println("Expected: " + y);
println();
}

void mousePressed() {
redraw();
}
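Hand-derived partials are easy to get slightly wrong, so a handy habit is a gradient check: nudge each parameter a tiny bit, measure how the cost changes, and compare with the chain-rule formula. The plain-Java sketch below does that for the two-neuron chain. The class name, the parameter ordering, the step size, and the test values are all made up for this example.

```java
// Hypothetical gradient check for a1 = w1*x + b1, a2 = w2*a1 + b2.
class GradCheckSketch {
    // Cost of one example; p = {w1, b1, w2, b2}.
    static double cost(double[] p, double x, double y) {
        double a1 = p[0] * x + p[1];
        double a2 = p[2] * a1 + p[3];
        return 0.5 * (a2 - y) * (a2 - y);
    }

    // Chain-rule partials in the same order: {dw1, db1, dw2, db2}.
    static double[] analytic(double[] p, double x, double y) {
        double a1 = p[0] * x + p[1];
        double a2 = p[2] * a1 + p[3];
        double d = a2 - y;
        return new double[] { d * p[2] * x, d * p[2], d * a1, d };
    }

    // Central-difference estimate: nudge one parameter at a time.
    static double[] numeric(double[] p, double x, double y) {
        double h = 1e-5;
        double[] grad = new double[p.length];
        for (int i = 0; i < p.length; i++) {
            double[] up = p.clone();
            double[] down = p.clone();
            up[i] += h;
            down[i] -= h;
            grad[i] = (cost(up, x, y) - cost(down, x, y)) / (2 * h);
        }
        return grad;
    }

    public static void main(String[] args) {
        double[] p = {0.3, 0.1, -0.7, 0.2}; // arbitrary test values
        double[] a = analytic(p, 2.0, 5.0);
        double[] n = numeric(p, 2.0, 5.0);
        String[] names = {"dw1", "db1", "dw2", "db2"};
        for (int i = 0; i < a.length; i++) {
            System.out.println(names[i] + "  analytic " + a[i] + "  numeric " + n[i]);
        }
    }
}
```

If a formula and its measured counterpart disagree by more than rounding error, the formula is the first place to look.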


If you read this far, get out of here and watch The Coding Train and 3Blue1Brown discuss neural networks in depth.
