Wednesday, August 19, 2009

Delta Rule

Also known by the names:
Adaline Rule
Widrow-Hoff Rule
Least Mean Squares (LMS) Rule
Change from Perceptron:
Replace the step function in the perceptron with a continuous (differentiable) activation function, e.g. a linear function.
For classification problems, use the step function only to determine the class and not to update the weights.


Note: this is the same algorithm we saw for regression. All that really differs is how the classes are determined.

Delta Rule:
Training by Gradient Descent, Revisited. Construct a cost function E that measures how well the network has learned.


For example (one output node):

E = (1/2) Σ_{i=1..n} (t_i - y_i)^2

where
n = number of examples
t_i = desired target value associated with the i-th example
y_i = output of the network when the i-th input pattern is presented to the network
To train the network, we adjust the weights in the network so as to decrease the cost (this is where we require differentiability). This is called gradient descent.

Algorithm
Initialize the weights with some small random value
Until E is within the desired tolerance, update the weights according to

W(new) = W(old) - m ∇E

where ∇E is evaluated at W(old) and m is the learning rate. For a linear activation, the gradient has components

∂E/∂w_k = -Σ_{i=1..n} (t_i - y_i) x_ik
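As an illustration, the algorithm above can be sketched in Python with a linear activation. The AND training set, learning rate, and epoch count below are made-up choices for the example, not part of the original notes:

```python
import numpy as np

# Toy training set: the AND problem, with a bias input fixed at 1.
X = np.array([[0.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
t = np.array([0.0, 0.0, 0.0, 1.0])   # desired targets

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=3)     # small random initial weights
m = 0.1                               # learning rate

for epoch in range(500):
    y = X @ w                         # linear (differentiable) activation
    grad = -(t - y) @ X               # gradient of E = 0.5 * sum((t - y)**2)
    w -= m * grad                     # gradient descent step

E = 0.5 * np.sum((t - X @ w) ** 2)

# The step function is used only to assign classes, never to update weights:
classes = (X @ w > 0.5).astype(int)
```

Note that with a linear unit E settles at its least-squares minimum rather than zero; the thresholded outputs still classify the four patterns correctly.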





More than Two Classes.
If there are more than two classes we could still use the same network, but instead of having a binary target we can let the target take on discrete values. For example, if there are 5 classes, we could have t = 1, 2, 3, 4, 5 or t = -2, -1, 0, 1, 2. It turns out, however, that the network has a much easier time if we have one output per class. We can think of each output node as trying to solve a binary problem (the example is either in the given class or it isn't).
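With five classes, "one output per class" amounts to one-hot target vectors; a small sketch (the labels below are made up for illustration):

```python
import numpy as np

# Hypothetical class labels for 4 examples drawn from 5 classes.
labels = np.array([2, 0, 4, 1])
n_classes = 5

# One target per class: 1 for the true class, 0 for all others.
T = np.eye(n_classes)[labels]

# Each column of T is now an independent binary problem
# ("in this class or not") for one output node.
```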




Long Short-Term Memory

In a recurrent network, information is stored in two distinct ways. The activations of the units are a function of the recent history of the model, and so form a short-term memory. The weights too form a memory, as they are modified based on experience, but the timescale of the weight change is much slower than that of the activations; we call these a long-term memory. The Long Short-Term Memory model [1] is an attempt to allow the unit activations to retain important information over a much longer period of time than the 10 to 12 time steps that are the limit of RTRL or BPTT models.
This figure below shows a maximally simple LSTM network, with a single input, a single output, and a single memory block in place of the familiar hidden unit. Each block has two associated gate units (details below). Each layer may, of course, have multiple units or blocks. In a typical configuration, the first layer of weights is provided from input to the blocks and gates. There are then recurrent connections from one block to other blocks and gates. Finally there are weights from the blocks to the outputs. The next figure shows the details of the memory block in more detail.
The hidden units of a conventional recurrent neural network have now been replaced by memory blocks, each of which contains one or more memory cells. At the heart of the cell is a simple linear unit with a single self-recurrent connection with weight set to 1.0. In the absence of any other input, this connection serves to preserve the cell's current state from one moment to the next. In addition to the self-recurrent connection, cells receive input from input units and other cells and gates. While the cells are responsible for maintaining information over long periods of time, the responsibility for deciding what information to store, and when to apply that information, lies with the input and output gating units, respectively.
The input to the cell is passed through a non-linear squashing function (g(x), typically the logistic function, scaled to lie within [-2,2]), and the result is then multiplied by the output of the input gating unit. The activation of the gate ranges over [0,1], so if its activation is near zero, nothing can enter the cell. Only if the input gate is sufficiently active is the signal allowed in. Similarly, nothing emerges from the cell unless the output gate is active. As the internal cell state is maintained in a linear unit, its activation range is unbounded, and so the cell output is again squashed when it is released (h(x), typical range [-1,1]). The gates themselves are nothing more than conventional units with sigmoidal activation functions ranging over [0,1], and they each receive input from the network input units and from other cells.
Thus we have:
Cell output:
y_cj(t) = y_outj(t) h(s_cj(t))
where y_outj(t) is the activation of the output gate, and the state s_cj(t) is given by
s_cj(0) = 0, and
s_cj(t) = s_cj(t-1) + y_inj(t) g(net_cj(t)) for t > 0.
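A minimal Python sketch of one cell update under these equations. The gate activations are supplied directly as arguments here for clarity; in the real network they come from sigmoid units with their own incoming weights:

```python
import math

def squash(x, lo, hi):
    """Logistic function rescaled to the range (lo, hi)."""
    return lo + (hi - lo) / (1.0 + math.exp(-x))

def lstm_cell_step(s_prev, net_c, y_in, y_out):
    """One step of the original (forget-gate-free) LSTM memory cell.

    s_prev : previous cell state s_cj(t-1)
    net_c  : net input to the cell
    y_in   : input gate activation in [0, 1]
    y_out  : output gate activation in [0, 1]
    """
    g = squash(net_c, -2.0, 2.0)   # input squashing g(x), range [-2, 2]
    s = s_prev + y_in * g          # linear state, self-recurrent weight 1.0
    h = squash(s, -1.0, 1.0)       # output squashing h(x), range [-1, 1]
    y = y_out * h                  # cell output y_cj(t)
    return s, y

# With the input gate closed (y_in = 0), the state is preserved exactly,
# no matter how strong the incoming signal is:
s, y = lstm_cell_step(s_prev=0.7, net_c=5.0, y_in=0.0, y_out=1.0)
```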
This division of responsibility---the input gates decide what to store, the cell stores information, and the output gate decides when that information is to be applied---has the effect that salient events can be remembered over arbitrarily long periods of time. Equipped with several such memory blocks, the network can effectively attend to events at multiple time scales.
Network training uses a combination of RTRL and BPTT, and we won't go into the details here. However, consider an error signal being passed back from the output unit. If it is allowed into the cell (as determined by the activation of the output gate), it is now trapped, and it gets passed back through the self-recurrent connection indefinitely. It can only affect the incoming weights, however, if it is allowed to pass by the input gate.
On selected problems, an LSTM network can retain information over arbitrarily long periods of time; over 1000 time steps in some cases. This gives it a significant advantage over RTRL and BPTT networks on many problems. For example, a Simple Recurrent Network can learn the Reber Grammar, but not the Embedded Reber Grammar. An RTRL network can sometimes, but not always, learn the Embedded Reber Grammar after about 100 000 training sequences. LSTM always solves the Embedded problem, usually after about 10 000 sequence presentations.
One of us is currently training LSTM networks to distinguish between different spoken languages based on speech prosody (roughly: the melody and rhythm of speech).
References
Hochreiter, Sepp and Schmidhuber, Jürgen (1997). "Long Short-Term Memory", Neural Computation, Vol. 9(8), pp. 1735-1780.

Monday, August 3, 2009

A Project for Teaching the C Language

Dear friends…

A few months ago, in spring and summer 2009, I designed an expert system. The project was proposed by my advisor (Dr. Montazeri). It is a virtual teacher, which I designed for the expert systems course.

Explanations:
This program is written in C#. It teaches the C language and uses an Access database. In the program you have an account in which your information is registered.
After you log in, the program retrieves your information from the database and the teaching begins. During a lesson, you can pause or continue, and you can ask the virtual teacher questions about the current step (lesson).

To understand it better, run it…

This program can be a starting point for designing virtual teachers. I tried to make it useful and application-oriented.
Finally, I thank my dear advisor (Dr. Montazeri), who helped me with Artificial Intelligence & Expert Systems.


I have uploaded the program for you. If you are eager to upgrade it or see its code (in C#), please contact me and I will send it to you …

Download (EXE file)