# Neural network training

I need help with training my neural network. For some reason, whenever I train it, all the weights go to 0.

Main tab:


Network n = new Network(1, 1, 1);

float lr = 0.07;

float[] in = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1};
float[] out = {-0.1, -0.2, -0.3, -0.4, -0.5, -0.6, -0.7, -0.8, -0.9, -1};

void setup() {
size(800, 800);
}

void draw() {
background(0);
float[] guess = {0.1};
println(n.guess(guess));
}

void mousePressed() {
for (int i = 0; i < in.length; i++) {
float[] TIn = {in[i]};
float[] TOut = {out[i]};
n.train(TIn, TOut);
}
}



Network class:

class Network {
Neuron[] Layer1;
Neuron[] Layer2;
Neuron[] Layer3;

Network(int l1, int l2, int l3) {
Layer1 = new Neuron[l1];
Layer2 = new Neuron[l2];
Layer3 = new Neuron[l3];

for (int i = 0; i < Layer1.length; i++) {
Layer1[i] = new Neuron(l1, 8);
for (int j = 0; j < Layer1[i].w.length; j++) {
Layer1[i].w[j] = 1;
}
}
for (int i = 0; i < Layer2.length; i++) {
Layer2[i] = new Neuron(l1, 8);
}
for (int i = 0; i < Layer3.length; i++) {
Layer3[i] = new Neuron(l2, 8);
}
}

float[] guess(float[] input) {
float[] l1_guess = new float[Layer1.length];

for (int i = 0; i < l1_guess.length; i++) {
l1_guess[i] = Layer1[i].WSum(input);
}

float[] l2_guess = new float[Layer2.length];
for (int i = 0; i < l2_guess.length; i++) {
l2_guess[i] = Layer2[i].WSum(l1_guess);
}

float[] l3_guess = new float[Layer3.length];
for (int i = 0; i < l3_guess.length; i++) {
l3_guess[i] = Layer3[i].WSum(l2_guess);
}

return l3_guess;
}

void train(float[] inputs, float[] a) {
float[] guess = this.guess(inputs);
float[] errors = new float[a.length];

for (int i = 0; i < errors.length; i++) {
errors[i] = a[i] - guess[i];
}
/*println(guess);
println(inputs);
println(errors);*/

for (int i = 0; i < Layer3.length; i++) {
}

for (int i = 0; i < Layer2.length; i++) {

float sum = 0;
for (int j = 0; j < Layer3.length; j++) {
sum += errors[j] * Layer3[j].w[i];
}
}
}

float[][] Array2d(float[] array) {
float[][] result = new float[array.length][1];
for (int i = 0; i < array.length; i++) {
result[i][0] = array[i];
}
return result;
}
}



Neuron class:


class Neuron {
float[] w;
float SIG_a;

Neuron(int size, float a) {
w = new float[size];
SIG_a = a;

for (int i = 0; i < w.length; i++) {
w[i] = random(-1, 1);
}
}

int WSize() {
return w.length;
}

float WSum(float[] input) {
float sum = 0;

for (int i = 0; i < w.length; i++) {
sum += input[i] * w[i];
}
return sig(sum, SIG_a);
}

void train(float[] input, float a) {
float WSum = WSum(input);
float error = a - WSum;

for (int i = 0; i < w.length; i++) {
w[i] += error * input[i] * lr;
}
}

void trainError(float[] input,float error) {

for (int i = 0; i < w.length; i++) {
w[i] += error * input[i] * lr;
}
}
}

float sig(float x, float a) {
return (2 / (1 + exp(-x * a))) - 1;
}

float invSig(float x){
return x * (1 - x);
}

float sign(float n) {
return n >= 0? 1 : -1;
}



I made a few tweaks that might help with debugging.

Network n = new Network(1, 1, 1);
int epoch = 0;

float lr = 0.07;

float[] in = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1};
float[] out = {-0.1, -0.2, -0.3, -0.4, -0.5, -0.6, -0.7, -0.8, -0.9, -1};

void setup() {
size(800, 800);
noLoop();
}

void draw() {
background(0);
float[] guess = {0.1};
println("Guess");
println(n.guess(guess));
println();
}

void mousePressed() {
epoch++;
println("Epoch " + epoch);

for (int i = 0; i < in.length; i++) {
float[] TIn = {in[i]};
float[] TOut = {out[i]};
n.train(TIn, TOut);
}

println("Layer1");
for (Neuron ron : n.Layer1) {
printArray(ron.w);
}
println("Layer2");
for (Neuron ron : n.Layer2) {
printArray(ron.w);
}
println("Layer3");
for (Neuron ron : n.Layer3) {
printArray(ron.w);
}

redraw();
}


How does the state of your network during each epoch differ from what you expect to see?


Layer2 approaches 0.
Layer3 stays at around 0.56.

The error is probably here:

 void train(float[] inputs, float[] a) {
float[] guess = this.guess(inputs);
float[] errors = new float[a.length];

for (int i = 0; i < errors.length; i++) {
errors[i] = a[i] - guess[i];
}
/*println(guess);
println(inputs);
println(errors);*/

for (int i = 0; i < Layer3.length; i++) {
}

for (int i = 0; i < Layer2.length; i++) {

float sum = 0;
for (int j = 0; j < Layer3.length; j++) {
sum += errors[j] * Layer3[j].w[i];
}
}
}


I don't understand much about backpropagation.


I’m still developing intuition for backpropagation myself. One of the simplest explanations I’ve seen is NanoNeuron by Oleksii Trekhleb, so I remixed it. Let’s go!

## Data

The linear function we’re going to learn is $f(x)=1.8x+32$, which converts temperatures from Celsius to Fahrenheit.

float celsiusToFahrenheit(float x) {
float w = 1.8;
float b = 32;
float z = w * x + b;

return z;
}


The training data set will consist of two arrays for x and y values.

float[] xTrain = new float[100];
float[] yTrain = new float[100];

void generateDataset() {
for (int i = 0; i < xTrain.length; i++) {
xTrain[i] = i;
yTrain[i] = celsiusToFahrenheit(xTrain[i]);
}
}


## Prediction

The neurons in the network are simple linear units. During forward propagation, each Neuron object multiplies its input by a weight and adds a bias.

class Neuron {
float w;
float b;

Neuron() {
w = random(-1, 1);
b = 0;
}

float forwardProp(float a) {
float z = w * a + b;

return z;
}
}


The first network we’re going to train consists of a single Neuron we’ll call Layer1. I’m not going to apply an activation function like sigmoid, tanh, or ReLU, but I’m still going to call the output of Layer1 its activation a1.

class Network {
Neuron Layer1;

float a1 = 0;

Network() {
Layer1 = new Neuron();
}

float forwardProp(float x) {
a1 = Layer1.forwardProp(x);

return a1;
}
}


## Cost

OK, so we can call forwardProp on a Network object and make a prediction. But Layer1’s weight was randomly initialized and its bias is 0, so any prediction is probably trash. We need information on how badly a network is performing in order to nudge its weights and biases towards more optimal values. The measure of “badness” is called cost and we want to minimize it.

There are a handful of standard ways to calculate cost. We’ll use a riff on mean squared error, starting with $C=\frac{1}{2}(a_{1}-y)^{2}$ for each training example. Since we’re training on a data set with $m=100$ examples, the average cost after each epoch of training would be:

$$averageCost=\frac{1}{m}\sum_{i=1}^{m}C_{i}$$
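As a rough illustration of the two formulas above, here is a minimal sketch in plain Java rather than Processing. The class name `CostSketch` and the sample numbers are made up for the example and are not part of the sketch built later in this post.

```java
// Hypothetical helper, only for illustrating the cost formulas.
class CostSketch {
    // Cost of one prediction: C = 1/2 * (a1 - y)^2
    static double cost(double a1, double y) {
        double diff = a1 - y;
        return 0.5 * diff * diff;
    }

    // Average of the per-example costs over all m examples.
    static double averageCost(double[] predictions, double[] targets) {
        double sum = 0;
        for (int i = 0; i < predictions.length; i++) {
            sum += cost(predictions[i], targets[i]);
        }
        return sum / predictions.length;
    }

    public static void main(String[] args) {
        double[] predictions = {30.0, 50.0}; // made-up network outputs
        double[] targets = {32.0, 50.0};     // made-up expected values
        // Costs are 2.0 and 0.0, so the average is 1.0.
        System.out.println(averageCost(predictions, targets));
    }
}
```

A perfect prediction contributes a cost of 0, and the cost grows with the square of the miss, so large errors dominate the average.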


Now how do we figure out which way to nudge our neurons? Cost is a function of the weights and biases in the network, and we want to minimize cost, so we need to move toward the minimum of the cost function.

Let’s pretend the cost function looks like the blue parabola below. If we were at the point $(3,-3)$, then we’d need to move left towards the minimum.

[Figure: a tangent line. Courtesy OpenStax]

Imagine zooming into the point $(3,-3)$. No matter how closely you look, the orange tangent line $y=2x-9$ only intersects the blue line at precisely the point $(3,-3)$.

The slope of this particular tangent line is $+2$; this is our old friend rise over run, $\frac{\Delta{y}}{\Delta{x}}$. At this particular point on the graph of $f$, the slope is positive, so increasing the input $x$ increases the output $f(x)$. If we want to minimize cost, then we’d better decrease $x$ by moving to the left.

## Calculus!

It’s time for a little bit of calculus. The slope of the line tangent to a function $f$ at any given point is known as the derivative. The derivative of $f$ with respect to a variable $x$ is sometimes written $\frac{df}{dx}$ or $f'(x)$.

What is the derivative (slope) of a constant function like $f(x)=5$?

We know the derivative of a linear function $f(x)=mx+b$ is $\frac{df}{dx}=m$. The first result from calculus I’ll use without any explanation whatsoever is this: the derivative of a quadratic function $f(x)=ax^{2}$ is $\frac{df}{dx}=2ax$. The exponent $2$ drops down and multiplies the base $x$ along with whatever coefficient $a$ happens to be out front.

If the function also had linear and constant terms, $f(x)=ax^{2}+bx+c$, then we’d be looking at $f'(x)=2ax+b$. The derivative of a sum is the sum of the derivatives.

Just for kicks, find the derivative of $f(x)=x^{2}-4x$, then input $3$.

### The Chain Rule

The second result from calculus I’m just going to use is called the chain rule. Some functions $f$ can be written as the composition of two other functions $g$ and $h$, as in $f(x)=g(h(x))$. The original input $x$ is first passed into $h$, then that function’s output $h(x)$ becomes the input to $g$.

x -> h -> h(x) -> g -> g(h(x))


Given such a composition, the chain rule says:

$$f'(x)=g'(h(x))h'(x)$$

Yikes… let’s see that in action. I’ll rewrite

$$f(x)=x^{2}-4x$$

as

$$f(x)=(x-2)^{2}-4$$

Now if we set $g(x)=x^{2}-4$ and $h(x)=x-2$, we have $f(x)=g(h(x))=(x-2)^{2}-4$.

Alright, moment of truth.

$$g'(h(x))=2(x-2)$$
$$h'(x)=1$$

So what is $g'(h(x))h'(x)$? And does it equal the derivative you found a moment ago?
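If you want to check your answer numerically, one option is to compare the chain-rule product against a finite-difference estimate of the slope. The sketch below is plain Java rather than Processing. The class name `ChainRuleCheck`, the method names, and the step size `1e-5` are assumptions chosen for the example.

```java
// Hypothetical helper for checking the chain rule on f(x) = x^2 - 4x.
class ChainRuleCheck {
    static double f(double x) {
        return x * x - 4 * x;
    }

    // Central-difference estimate of the slope of f at x.
    static double numericalDerivative(double x) {
        double step = 1e-5; // small, arbitrary step size
        return (f(x + step) - f(x - step)) / (2 * step);
    }

    // g'(h(x)) * h'(x) with g(u) = u^2 - 4 and h(x) = x - 2.
    static double chainRuleDerivative(double x) {
        double gPrimeAtH = 2 * (x - 2);
        double hPrime = 1;
        return gPrimeAtH * hPrime;
    }

    public static void main(String[] args) {
        double x = 3;
        // The two values should agree up to floating-point error.
        System.out.println("Finite difference: " + numericalDerivative(x));
        System.out.println("Chain rule:        " + chainRuleDerivative(x));
    }
}
```

Both numbers should come out very close to the slope of the tangent line from the figure discussion earlier.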

## Backpropagation

Back to the Network, cost is a function of both $w$ and $b$, so we need to figure out how cost changes with respect to each variable. More formally, we need to find the partial derivatives $\frac{\partial{C}}{\partial{w}}$ and $\frac{\partial{C}}{\partial{b}}$. Let’s follow one training example through the Network.

$$a_{1}=wx+b$$
$$C=\frac{1}{2}(a_{1}-y)^{2}$$

It looks like $C$ could easily be rewritten as $C=\frac{1}{2}(wx+b-y)^{2}$. Remember, $x$ and $y$ are known training examples, not variables. The quantities we’re actually changing are $w$ and $b$, and we’ll consider one at a time.

$$\frac{\partial{C}}{\partial{w}}=2\times{\frac{1}{2}}\times{(wx+b-y)}\times{x}$$
$$\frac{\partial{C}}{\partial{b}}=2\times{\frac{1}{2}}\times{(wx+b-y)}\times{1}$$
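The trailing factors $x$ and $1$ come from the chain rule. Here is a sketch of the intermediate steps, using only the definitions of $a_{1}$ and $C$ given above:

```latex
\begin{aligned}
\frac{\partial C}{\partial a_{1}} &= 2 \times \frac{1}{2} \times (a_{1}-y) = a_{1}-y \\
\frac{\partial a_{1}}{\partial w} &= x, \qquad \frac{\partial a_{1}}{\partial b} = 1 \\
\frac{\partial C}{\partial w} &= \frac{\partial C}{\partial a_{1}} \times \frac{\partial a_{1}}{\partial w} = (a_{1}-y) \times x \\
\frac{\partial C}{\partial b} &= \frac{\partial C}{\partial a_{1}} \times \frac{\partial a_{1}}{\partial b} = (a_{1}-y) \times 1
\end{aligned}
```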

Let’s simplify and streamline the notation a bit.

$$dw=(a_{1}-y)x$$
$$db=a_{1}-y$$

This is the heart of backpropagation. Here is a tiny implementation we can add to the Network class that uses a FloatDict for convenience.

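FloatDict backProp(float x, float y) {
  // Partial derivatives of the cost for one training example.
  FloatDict grad = new FloatDict();
  grad.set("dw", (a1 - y) * x);
  grad.set("db", a1 - y);

  return grad;
}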


We’ll calculate both partial derivatives for each training example $(x_{i},y_{i})$ and sum them up, then we’ll use their averages to update $w$ and $b$ like so.

$$w:=w-\alpha{dw}$$
$$b:=b-\alpha{db}$$

The partial derivatives form the Network’s gradient which points “uphill” for any given function. Since we’re trying to minimize cost, we nudge our neurons in the opposite direction by subtracting the partial derivatives.

And that $\alpha$ term? It’s the learning rate, a multiplier that determines how fast we go downhill toward the minimum. Turns out this hyperparameter is a big deal.
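To make the update rule concrete, here is a minimal sketch of a single gradient descent step on one training example, written in plain Java rather than Processing. The class name `UpdateStepSketch`, the starting values for `w` and `b`, and the learning rate are assumptions picked for illustration. The example pair (10, 50) is 10 °C converted to Fahrenheit.

```java
// Hypothetical helper showing one update of w and b on a single example.
class UpdateStepSketch {
    // Cost of the prediction w * x + b against the target y.
    static double cost(double w, double b, double x, double y) {
        double diff = (w * x + b) - y;
        return 0.5 * diff * diff;
    }

    // Returns {w, b} after one gradient descent step.
    static double[] step(double w, double b, double x, double y, double alpha) {
        double a1 = w * x + b;    // forward pass
        double dw = (a1 - y) * x; // partial derivative of cost w.r.t. w
        double db = a1 - y;       // partial derivative of cost w.r.t. b
        // Move against the gradient, scaled by the learning rate.
        return new double[] {w - alpha * dw, b - alpha * db};
    }

    public static void main(String[] args) {
        double w = 0.5, b = 0.0;   // arbitrary starting point
        double x = 10.0, y = 50.0; // 10 degrees Celsius is 50 degrees Fahrenheit
        double alpha = 0.0005;     // assumed learning rate

        double[] updated = step(w, b, x, y, alpha);
        System.out.println("Cost before: " + cost(w, b, x, y));
        System.out.println("Cost after:  " + cost(updated[0], updated[1], x, y));
    }
}
```

With these numbers the cost should drop after the step. Trying a much larger `alpha` is a quick way to see a step overshoot the minimum and make the cost grow instead.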

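void train(float[] x, float[] y) {
  int m = x.length;
  averageCost = 0;
  float dw = 0;
  float db = 0;
  for (int i = 0; i < m; i++) {
    forwardProp(x[i]);
    float predictionCost = 0.5 * pow(a1 - y[i], 2);
    averageCost += predictionCost;
    // Accumulate this example's partial derivatives.
    FloatDict grad = backProp(x[i], y[i]);
    dw += grad.get("dw");
    db += grad.get("db");
  }
  averageCost /= m;
  dw /= m;
  db /= m;
  // Step against the averaged gradient.
  Layer1.w -= alpha * dw;
  Layer1.b -= alpha * db;
}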


## All together now

### Neuron

class Neuron {
float w;
float b;

Neuron() {
w = random(-1, 1);
b = 0;
}

float forwardProp(float a) {
float z = w * a + b;

return z;
}
}


### Network

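class Network {
  Neuron Layer1;

  float a1 = 0;
  float averageCost = 0;

  Network() {
    Layer1 = new Neuron();
  }

  float forwardProp(float x) {
    a1 = Layer1.forwardProp(x);

    return a1;
  }

  // Partial derivatives of the cost for one training example.
  FloatDict backProp(float x, float y) {
    FloatDict grad = new FloatDict();
    grad.set("dw", (a1 - y) * x);
    grad.set("db", a1 - y);

    return grad;
  }

  void train(float[] x, float[] y) {
    int m = x.length;
    averageCost = 0;
    float dw = 0;
    float db = 0;
    for (int i = 0; i < m; i++) {
      forwardProp(x[i]);
      float predictionCost = 0.5 * pow(a1 - y[i], 2);
      averageCost += predictionCost;
      // Accumulate this example's partial derivatives.
      FloatDict grad = backProp(x[i], y[i]);
      dw += grad.get("dw");
      db += grad.get("db");
    }
    averageCost /= m;
    dw /= m;
    db /= m;
    // Step against the averaged gradient.
    Layer1.w -= alpha * dw;
    Layer1.b -= alpha * db;
  }
}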


### Sketch

// Based on NanoNeuron by Oleksii Trekhleb
float celsiusToFahrenheit(float x) {
float w = 1.8;
float b = 32;
float z = w * x + b;

return z;
}

float[] xTrain = new float[100];
float[] yTrain = new float[100];

void generateDataset() {
for (int i = 0; i < xTrain.length; i++) {
xTrain[i] = i;
yTrain[i] = celsiusToFahrenheit(xTrain[i]);
}
}

Network n = new Network();
float alpha = 0.0005;
int epoch = 0;
FloatList cost = new FloatList();

void setup() {
size(100, 100);
generateDataset();
noLoop();
}

void draw() {
background(0);
for (int i = 0; i < 1000; i++) {
epoch++;
n.train(xTrain, yTrain);
cost.append(n.averageCost); // icanhaz visualization?
}
println("Epoch: " + epoch);
println("f(x) = " + n.Layer1.w + " * x + " + n.Layer1.b);
float x = 10;
float y = celsiusToFahrenheit(x);
println("Guess: " + n.forwardProp(x));
println("Expected: " + y);
println();
}

void mousePressed() {
redraw();
}


Layers? OK!

x -> f1 -> f1(x) -> f2 -> f2(f1(x))

$$a_1=w_1x+b_1$$
$$a_2=w_2a_1+b_2$$

The cost of each prediction in this case is now based on $a_2$.

$$C=\frac{1}{2}(a_2-y)^2$$

Try applying the chain rule to come up with the partial derivatives for the weights and biases. Start from Layer2 and work backwards.

$$\frac{\partial{C}}{\partial{w_2}}=\frac{\partial{C}}{\partial{a_2}}\times\frac{\partial{a_2}}{\partial{w_2}}$$
$$\frac{\partial{C}}{\partial{b_2}}=\frac{\partial{C}}{\partial{a_2}}\times\frac{\partial{a_2}}{\partial{b_2}}$$
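Working one layer further back adds one more link to the chain, because $a_2$ depends on $w_1$ and $b_1$ only through $a_1$. Here is a sketch of the full set of partial derivatives, derived from the definitions of $a_1$, $a_2$, and $C$ above:

```latex
\begin{aligned}
\frac{\partial C}{\partial a_2} &= a_2 - y \\
\frac{\partial C}{\partial w_2} &= (a_2 - y) \times a_1, \qquad
\frac{\partial C}{\partial b_2} = (a_2 - y) \times 1 \\
\frac{\partial C}{\partial w_1} &= \frac{\partial C}{\partial a_2} \times \frac{\partial a_2}{\partial a_1} \times \frac{\partial a_1}{\partial w_1} = (a_2 - y) \times w_2 \times x \\
\frac{\partial C}{\partial b_1} &= \frac{\partial C}{\partial a_2} \times \frac{\partial a_2}{\partial a_1} \times \frac{\partial a_1}{\partial b_1} = (a_2 - y) \times w_2 \times 1
\end{aligned}
```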

### Neuron

class Neuron {
float w;
float b;

Neuron() {
w = random(-1, 1);
b = 0;
}

float forwardProp(float a) {
float z = w * a + b;

return z;
}
}


### Network

class Network {
  Neuron Layer1;
  Neuron Layer2;

  float a1 = 0;
  float a2 = 0;
  float averageCost = 0;

  Network() {
    Layer1 = new Neuron();
    Layer2 = new Neuron();
  }

  float forwardProp(float x) {
    a1 = Layer1.forwardProp(x);
    a2 = Layer2.forwardProp(a1);

    return a2;
  }

  // Partial derivatives of the cost for one example, via the chain rule.
  FloatDict backProp(float x, float y) {
    FloatDict grad = new FloatDict();
    grad.set("dw2", (a2 - y) * a1);
    grad.set("db2", a2 - y);
    grad.set("dw1", (a2 - y) * Layer2.w * x);
    grad.set("db1", (a2 - y) * Layer2.w);

    return grad;
  }

  void train(float[] x, float[] y) {
    int m = x.length;
    averageCost = 0;
    float dw2 = 0;
    float db2 = 0;
    float dw1 = 0;
    float db1 = 0;
    for (int i = 0; i < m; i++) {
      forwardProp(x[i]);
      float predictionCost = 0.5 * pow(a2 - y[i], 2);
      averageCost += predictionCost;
      // Accumulate this example's partial derivatives.
      FloatDict grad = backProp(x[i], y[i]);
      dw2 += grad.get("dw2");
      db2 += grad.get("db2");
      dw1 += grad.get("dw1");
      db1 += grad.get("db1");
    }
    averageCost /= m;
    dw2 /= m;
    db2 /= m;
    dw1 /= m;
    db1 /= m;
    // Step against the averaged gradient.
    Layer2.w -= alpha * dw2;
    Layer2.b -= alpha * db2;
    Layer1.w -= alpha * dw1;
    Layer1.b -= alpha * db1;
  }
}


### Sketch

// Based on NanoNeuron by Oleksii Trekhleb
float celsiusToFahrenheit(float x) {
float w = 1.8;
float b = 32;
float z = w * x + b;

return z;
}

float[] xTrain = new float[100];
float[] yTrain = new float[100];

void generateDataset() {
for (int i = 0; i < xTrain.length; i++) {
xTrain[i] = i;
yTrain[i] = celsiusToFahrenheit(xTrain[i]);
}
}

Network n = new Network();
float alpha = 0.00001;
int epoch = 0;
FloatList cost = new FloatList();

void setup() {
size(100, 100);
generateDataset();
noLoop();
}

void draw() {
background(0);
for (int i = 0; i < 1000; i++) {
epoch++;
n.train(xTrain, yTrain);
cost.append(n.averageCost); // icanhaz visualization?
}
println("Epoch: " + epoch);
println("f1(x) = " + n.Layer1.w + " * x + " + n.Layer1.b);
println("f2(x) = " + n.Layer2.w + " * x + " + n.Layer2.b);
float x = 10;
float y = celsiusToFahrenheit(x);
println("Guess: " + n.forwardProp(x));
println("Expected: " + y);
println();
}

void mousePressed() {
redraw();
}


If you read this far, get out of here and watch The Coding Train and 3Blue1Brown discuss neural networks in depth.
