(1) Architectural Diagram Of LSTM Network.

(2) Inputs For LSTM Network.

  • An LSTM cell takes three inputs at each time step.
  1. Previous Time Step Cell State Value (Ct-1).
  2. Previous Time Step Hidden State Value (ht-1).
  3. Current Time Step Input (Xt).

(2) Outputs For LSTM Network.

  • An LSTM cell produces two outputs at each time step.
  1. Current Cell State Value (Ct).
  2. Current Hidden State Value (ht).

(3) Core Idea Of LSTM Network.

  • An LSTM unit is like a box whose main purpose is to store important information (for example, the important words of a sentence) and give it as an output.
  • Which information to store is decided by the processing units (the gates) inside it.
  • Inside the LSTM unit, two main things happen at each time step.
    • Update The Cell State.
    • Calculate The Hidden State.

(4) Components Inside LSTM Network.

ht and Ct

  • ht and Ct are both vectors, and they have the same dimension.
  • Example: ht = [0.1, 0.9, 0.5] and Ct = [0.21, 0.59, 0.35].


Xt

  • Xt is the input word that we pass to the LSTM network. An LSTM can’t understand English words directly, so we first have to convert each word into a vector.
  • Xt can be of any dimension; it does not have to match the dimension of ht and Ct.

ft, It, ~Ct, Ot

  • ft, It, ~Ct, Ot are the internal vectors: the forget gate, the input gate, the candidate cell state, and the output gate. Their dimensions are the same as those of the ht and Ct vectors.
  • For example, if ht and Ct are (3*1) vectors, then ft, It, ~Ct, Ot will also be (3*1) vectors.

Point Wise Operation:

  • A pointwise (element-wise) operation works on vectors one position at a time. When it combines two vectors, both must have the same dimension.

Point Wise Multiplication:

  • Pointwise multiplication multiplies the numbers that sit at the same position in two vectors.

Point Wise Addition:

  • Pointwise addition adds the numbers that sit at the same position in two vectors.

Point Wise tanh:

  • Pointwise tanh applies the tanh function to each number of a single vector, so every number ends up between -1 and 1.
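
The three pointwise operations above can be sketched in a few lines of plain Python. The helper names (pointwise_mul, pointwise_add, pointwise_tanh) and the sample vectors are made up for illustration and do not come from any library.

```python
import math

def pointwise_mul(a, b):
    """Multiply two equal-length vectors position by position."""
    assert len(a) == len(b), "vectors must have the same dimension"
    return [x * y for x, y in zip(a, b)]

def pointwise_add(a, b):
    """Add two equal-length vectors position by position."""
    assert len(a) == len(b), "vectors must have the same dimension"
    return [x + y for x, y in zip(a, b)]

def pointwise_tanh(a):
    """Apply tanh to every number of a single vector."""
    return [math.tanh(x) for x in a]

# Illustrative vectors (not taken from the text)
a = [1.0, 2.0, 3.0]
b = [0.5, 0.0, 2.0]

print(pointwise_mul(a, b))   # [0.5, 0.0, 6.0]
print(pointwise_add(a, b))   # [1.5, 2.0, 5.0]
print(pointwise_tanh([0.0, 1.0, -1.0]))  # every value lies between -1 and 1
```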

Neural Network Layers:

  • Each gate inside the LSTM unit contains a dense (fully connected) neural network layer.
  • The work of these layers is to decide which information to forget from, and which to add to, the cell state and hidden state.
  • They are like the brain of the LSTM unit.
  • The number of nodes in a layer is flexible, i.e. it is decided by you. It is a hyperparameter.
  • Whatever number you pick, say 4, is the same for every dense layer inside the LSTM unit, and it also fixes the dimension of ht and Ct.
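
As a rough sketch of what one such dense layer computes: the code below assumes the common convention in which the layer receives ht-1 and Xt joined into one vector. The weights are arbitrary placeholder values, and the names dense, W and b are illustrative only.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def dense(W, b, v, activation):
    """One dense layer: each row of W is one node, giving activation(row . v + bias)."""
    return [activation(sum(w * x for w, x in zip(row, v)) + bias)
            for row, bias in zip(W, b)]

hidden_size = 3   # number of nodes: the hyperparameter you choose
input_size = 2    # dimension of Xt, which may differ from hidden_size

h_prev = [0.1, 0.9, 0.5]   # ht-1, length hidden_size
x_t = [1.0, -1.0]          # Xt, length input_size (made-up values)
v = h_prev + x_t           # assumed convention: concatenate [ht-1, Xt]

# Placeholder weights, NOT trained values: one row per node
W = [[0.1] * (hidden_size + input_size) for _ in range(hidden_size)]
b = [0.0] * hidden_size

out = dense(W, b, v, sigmoid)
print(len(out))  # 3, one output per node, the same size as ht and Ct
```

Each of the dense layers in the unit has its own weights but the same number of nodes, which is why all internal vectors end up with the dimension of ht and Ct.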

(5) GATES Inside LSTM Network.

Forget Gate:

  • The first step in our LSTM is to decide what information we’re going to throw away from the cell state. 
  • This decision is made by a sigmoid layer called the “forget gate layer.” 
  • It looks at ht-1 and xt and outputs a number between 0 and 1 for each number in the cell state Ct-1.
  • A 1 represents “completely keep this” while a 0 represents “completely get rid of this.”
  • Inside the ‘Forget Gate’ we have a dense neural network layer, which decides which words to keep inside the cell state and which words to remove from it.
  • The sigmoid layer outputs numbers between zero and one, describing how much each component should be let through.
  • A value of zero means “let nothing through,” while a value of one means “let everything through!”
  • Inside the Forget Gate, we are doing 2 operations.
    • Calculate ft. (It decides whether to keep each word or forget it.)
    • Calculate Ct-1 * ft. (It performs the forget operation.)

Calculate ft:

  • Suppose you set the number of neurons in the dense neural network layer to 3.
  • Then the size of Ct-1 and ht-1 will also be 3.
  • Each node inside the dense layer gives an output between 0 and 1.
  • Collecting the outputs of all the nodes gives the vector ft.
  • ft decides which parts of the cell state to keep and which to forget.
  • Suppose we got the output ft = [0, 0.5, 1].

Calculate Ct-1 * ft:

  • Ct-1 * ft is the pointwise multiplication of Ct-1 and ft.
  • Suppose Ct-1 = [0.82, 0.64, 0.53]
  • ft = [0, 0.5, 1]
  • Ct-1 * ft = [(0.82*0), (0.64*0.5), (0.53*1)] = [0, 0.32, 0.53]
  • Here the first value is completely forgotten (multiplied by 0), the second is halved (multiplied by 0.5), and the third is completely kept (multiplied by 1).
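
The two forget-gate operations can be sketched as follows. The multiplication uses the numbers from the example above. The weights W_f and b_f and the input x_t are hypothetical values chosen only so that the gate comes out close to 1, exactly 0.5, and close to 0. A sigmoid never reaches exactly 0 or 1, so the clean ft = [0, 0.5, 1] above is an idealised example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forget_gate(W_f, b_f, h_prev, x_t):
    """ft = sigmoid(W_f . [ht-1, Xt] + b_f): one value in (0, 1) per cell-state entry."""
    v = h_prev + x_t  # assumed convention: concatenate ht-1 and Xt
    return [sigmoid(sum(w * x for w, x in zip(row, v)) + bias)
            for row, bias in zip(W_f, b_f)]

# Operation 2, with the numbers from the text: Ct-1 * ft
c_prev = [0.82, 0.64, 0.53]
f_t = [0.0, 0.5, 1.0]
kept = [c * f for c, f in zip(c_prev, f_t)]
print(kept)  # [0.0, 0.32, 0.53]

# Operation 1, with hypothetical (untrained) weights
h_prev = [0.1, 0.9, 0.5]
x_t = [1.0, -1.0]
W_f = [[5.0] * 5, [0.0] * 5, [-5.0] * 5]
b_f = [0.0, 0.0, 0.0]
f_t_computed = forget_gate(W_f, b_f, h_prev, x_t)
print([round(f, 3) for f in f_t_computed])  # [0.999, 0.5, 0.001]
```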

Forget Gate Calculation:

Input Gate:

  • The next step is to decide what new information we’re going to store in the cell state.
  • This has two parts.
  • First, a sigmoid layer called the “input gate layer” decides which values we’ll update.
  • Next, a tanh layer creates a vector of new candidate values, ~Ct, that could be added to the state. In the next step, we’ll combine these two to create an update to the state.
  • The Input Gate works in 3 stages.
    • Calculate the Candidate Cell State (~Ct). (The new word that may go into the cell state.)
    • Calculate It. (It decides how much of the candidate value to store in the cell state.)
    • Calculate Ct. (Ct = Ct-1 * ft + It * ~Ct, using pointwise multiplication and pointwise addition.)
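
A small worked sketch of the cell state update, Ct = Ct-1 * ft + It * ~Ct. The values of Ct-1 and ft are the ones from the forget gate example. The values of It and ~Ct are invented for illustration. In a real LSTM they would come from a sigmoid dense layer and a tanh dense layer respectively.

```python
def update_cell_state(c_prev, f_t, i_t, c_cand):
    """New cell state: forget part of the old state, then add the gated candidate."""
    return [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_cand)]

c_prev = [0.82, 0.64, 0.53]   # Ct-1, from the forget gate example
f_t = [0.0, 0.5, 1.0]         # ft, from the forget gate example
i_t = [1.0, 0.5, 0.0]         # It: hypothetical values between 0 and 1
c_cand = [0.6, -0.4, 0.9]     # ~Ct: hypothetical values between -1 and 1

c_t = update_cell_state(c_prev, f_t, i_t, c_cand)
print([round(c, 2) for c in c_t])  # [0.6, 0.12, 0.53]
```

Reading the result position by position: the first entry is fully replaced by the candidate, the second is a blend of half the old value and half the candidate, and the third keeps the old value and ignores the candidate.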

Output Gate:

  • Finally, we need to decide what we’re going to output. This output will be based on our cell state but will be a filtered version.
  • First, the current input Xt and the previous hidden state ht-1 are passed into the third sigmoid layer, which produces Ot.
  • Then the new cell state Ct is passed through the tanh function (to push the values between −1 and 1).
  • These two outputs are multiplied point-by-point so that we only output the parts we decided to.
  • This hidden state is used for prediction.
  • The output for the current time step is dependent on the Current Cell State value.
  • ht is derived from the long-term memory or cell state Ct.
  • ht is calculated in three steps.
    • Calculate tanh(Ct), the element-wise tanh of Ct. It brings the values between -1 and 1.
    • Calculate Ot, the filter to apply to tanh(Ct).
    • Apply the filter by point-wise multiplication: ht = Ot * tanh(Ct).
  • Finally, the new cell state and new hidden state are carried over to the next time step.
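
The three steps for ht can be sketched as below. The cell state Ct and the filter Ot are made-up values. In a real LSTM, Ot is produced by a sigmoid dense layer.

```python
import math

def hidden_state(c_t, o_t):
    """ht = Ot * tanh(Ct), computed position by position."""
    squashed = [math.tanh(c) for c in c_t]           # step 1: values between -1 and 1
    return [o * s for o, s in zip(o_t, squashed)]    # step 3: apply the filter Ot

c_t = [0.6, 0.12, 0.53]   # illustrative new cell state
o_t = [1.0, 0.0, 0.5]     # step 2 (hypothetical result): let through all, nothing, half

h_t = hidden_state(c_t, o_t)
print([round(h, 3) for h in h_t])
# The first entry equals tanh(0.6), the second is 0.0 (blocked),
# and the third is half of tanh(0.53).
```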


  • To conclude, the forget gate decides which information from the prior steps is still relevant and should be kept.
  • The input gate decides which relevant information from the current step is added, and the output gate determines the next hidden state.
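
Putting the three gates together, one LSTM time step can be sketched as below. This is a minimal illustration, not a trained model or a library implementation. It assumes every dense layer sees ht-1 and Xt joined into one vector, and it uses small random placeholder weights. The names lstm_step, init_params and the W_*/b_* keys are invented for this sketch.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(W, b, v, activation):
    """One dense layer: activation(row . v + bias) for every node (row of W)."""
    return [activation(sum(w * x for w, x in zip(row, v)) + bias)
            for row, bias in zip(W, b)]

def lstm_step(params, c_prev, h_prev, x_t):
    """Take Ct-1, ht-1 and Xt; return Ct and ht."""
    v = h_prev + x_t                                           # [ht-1, Xt]
    f_t = dense(params["W_f"], params["b_f"], v, sigmoid)      # forget gate
    i_t = dense(params["W_i"], params["b_i"], v, sigmoid)      # input gate
    c_cand = dense(params["W_c"], params["b_c"], v, math.tanh) # candidate ~Ct
    o_t = dense(params["W_o"], params["b_o"], v, sigmoid)      # output gate
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_cand)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return c_t, h_t

def init_params(hidden_size, input_size, seed=0):
    """Random placeholder weights; a real network would learn these."""
    rng = random.Random(seed)
    params = {}
    for name in ("f", "i", "c", "o"):
        params["W_" + name] = [
            [rng.uniform(-0.5, 0.5) for _ in range(hidden_size + input_size)]
            for _ in range(hidden_size)
        ]
        params["b_" + name] = [0.0] * hidden_size
    return params

hidden_size, input_size = 3, 2
params = init_params(hidden_size, input_size)

# Start from zero states and feed a short made-up sequence of input vectors
c = [0.0] * hidden_size
h = [0.0] * hidden_size
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for x_t in sequence:
    c, h = lstm_step(params, c, h, x_t)   # states carry over to the next step

print(len(c), len(h))  # 3 3
```

The loop shows the carry-over described above: the cell state and hidden state returned by one step become the Ct-1 and ht-1 of the next.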

