CNN Seq2Seq Learning

Tapan Mittal
6 min read · Jan 28, 2021


Background

Convolutional Neural Networks (CNNs) are typically used for image processing, but as more research is conducted, CNNs have been replacing RNNs in NLP tasks as well.

A convolutional layer uses filters. These filters have a width (and also a height in images, but usually not in text). If a filter has a width of 3, then it can see 3 consecutive tokens. Each convolutional layer has many of these filters (1024 in this tutorial). Each filter will slide across the sequence, from beginning to end, looking at 3 consecutive tokens at a time. The idea is that each of these 1024 filters will learn to extract a different feature from the text. The result of this feature extraction will then be used by the model — potentially as input to another convolutional layer. This can then all be used to extract features from the source sentence to translate it into the target language.
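As a rough sketch (the embedding dimension here is an assumption, not the tutorial's exact configuration), a single layer of 1024 width-3 filters sliding over a 6-token sequence of embeddings looks like this in PyTorch:

```python
# 1024 width-3 filters over a sequence of token embeddings. Each filter sees
# 3 consecutive tokens at a time and produces one feature value per position.
import torch
import torch.nn as nn

emb_dim, n_filters, filter_width = 256, 1024, 3
conv = nn.Conv1d(in_channels=emb_dim, out_channels=n_filters,
                 kernel_size=filter_width)

embedded = torch.randn(1, emb_dim, 6)   # one sentence of 6 embedded tokens
features = conv(embedded)               # [1, 1024, 6 - (3 - 1)] = [1, 1024, 4]
print(features.shape)                   # torch.Size([1, 1024, 4])
```

Note that without padding the output is shorter than the input; we will come back to this below.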

The model is made of an encoder and decoder as shown in the image below. The encoder encodes the input sentence, in the source language, into a context vector. The decoder decodes the context vector to produce the output sentence in the target language.

Encoder

The RNN-based models had an encoder that compressed an entire input sentence into a single context vector, z. The convolutional sequence-to-sequence model is a little different — it produces two context vectors for each token in the input sentence. So, if our input sentence had 6 tokens, we would get 12 context vectors, two for each token.

The two context vectors per token are a conved vector and a combined vector. The conved vector is the result of each token being passed through a few convolutional layers. The combined vector is the sum of the conved vector and the embedding of that token. Both of these are returned by the encoder to be used by the decoder.

The image below shows the result of an input sentence — zwei menschen fechten (two people fencing) — being passed through the encoder.

Encoder

First, the tokens are passed through a token embedding layer — which is standard for neural networks in natural language processing. However, as there are no recurrent connections in this model, it has no idea about the order of the tokens within a sequence. To rectify this we have a second embedding layer, the positional embedding layer. This is a standard embedding layer where the input is not the token itself but the position of the token within the sequence — starting with the first token, the `<sos>` (start of sequence) token, in position 0.

Next, the token and positional embeddings are elementwise summed together to get a vector which contains information about the token and also its position within the sequence — which we simply call the embedding vector. This is followed by a linear layer which transforms the embedding vector into a vector with the required hidden dimension size.
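Here is a minimal sketch of this step; the vocabulary size, maximum sequence length, and dimensions are illustrative assumptions, not the tutorial's exact settings:

```python
# Token and positional embeddings are summed elementwise, then projected
# from emb_dim to hid_dim by a linear layer.
import torch
import torch.nn as nn

vocab_size, max_len, emb_dim, hid_dim = 10_000, 100, 256, 512

tok_embedding = nn.Embedding(vocab_size, emb_dim)
pos_embedding = nn.Embedding(max_len, emb_dim)
emb2hid = nn.Linear(emb_dim, hid_dim)

src = torch.randint(0, vocab_size, (1, 6))        # [batch, seq_len]
pos = torch.arange(0, src.shape[1]).unsqueeze(0)  # positions [[0, 1, 2, 3, 4, 5]]

embedded = tok_embedding(src) + pos_embedding(pos)  # [1, 6, emb_dim]
conv_input = emb2hid(embedded)                      # [1, 6, hid_dim]
```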

The next step is to pass this hidden vector into convolutional blocks. This is where the “magic” happens in this model. After passing through the convolutional blocks, the vector is then fed through another linear layer to transform it back from the hidden dimension size into the embedding dimension size. This is our conved vector — and we have one of these per token in the input sequence.

Finally, the conved vector is elementwise summed with the embedding vector via a residual connection to get a combined vector for each token. Again, there is a combined vector for each token in the input sequence.
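To make this concrete, here is a small sketch of the end of the encoder, using a stand-in tensor for the convolutional blocks' output (sizes are illustrative):

```python
# A linear layer maps the conv blocks' output back from hid_dim to emb_dim
# (the conved vector), and a residual sum with the embedding vector gives
# the combined vector.
import torch
import torch.nn as nn

emb_dim, hid_dim, seq_len = 256, 512, 6
hid2emb = nn.Linear(hid_dim, emb_dim)

embedded = torch.randn(1, seq_len, emb_dim)     # token + positional embedding vectors
conv_output = torch.randn(1, seq_len, hid_dim)  # stand-in for the conv blocks' output

conved = hid2emb(conv_output)                   # [1, 6, emb_dim], one per token
combined = conved + embedded                    # [1, 6, emb_dim], residual connection
```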

Convolution Blocks

So, how do these convolutional blocks work? The image below shows 2 convolutional blocks with a single filter (blue) sliding across the tokens within the sequence. In the actual implementation we will have 10 convolutional blocks with 1024 filters in each block.

Convolution block

First, the input sentence is padded. This is because the convolutional layers will reduce the length of the input sentence, and we want the length of the sentence coming into the convolutional blocks to equal the length of it coming out of the convolutional blocks. Without padding, the sequence coming out of a convolutional layer will be filter_size - 1 elements shorter than the sequence entering it. For example, if we had a filter size of 3, the sequence would be 2 elements shorter. Thus, we pad the sentence with one padding element on each side. We can calculate the amount of padding on each side by simply doing (filter_size - 1)/2 for odd-sized filters - we will not cover even-sized filters in this tutorial.
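The arithmetic can be checked with a small sketch (the hidden dimension and lengths here are illustrative):

```python
# Padding check: with an odd filter size and (filter_size - 1) // 2 padding
# on each side, the output sequence length equals the input sequence length.
import torch
import torch.nn as nn

hid_dim, filter_size, seq_len = 512, 3, 6
pad = (filter_size - 1) // 2              # 1 element on each side for width 3

conv = nn.Conv1d(hid_dim, hid_dim, kernel_size=filter_size, padding=pad)
x = torch.randn(1, hid_dim, seq_len)      # [batch, hid_dim, seq_len]
print(conv(x).shape)                      # torch.Size([1, 512, 6]): length preserved
```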

These filters are designed so that their output hidden dimension is twice the input hidden dimension. In computer vision terminology these hidden dimensions are called channels — but we will stick to calling them hidden dimensions. Why do we double the size of the hidden dimension leaving the convolutional filter? This is because we are using a special activation function called the gated linear unit (GLU). GLUs have a gating mechanism (similar to LSTMs and GRUs) contained within the activation function and actually halve the size of the hidden dimension — whereas activation functions usually keep the hidden dimension the same size.
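A minimal example of the GLU step; the hidden dimension below is an assumption:

```python
# GLU: the convolution outputs 2 * hid_dim channels; F.glu splits them into a
# "value" half and a "gate" half and returns value * sigmoid(gate), halving
# the channel dimension back to hid_dim.
import torch
import torch.nn.functional as F

hid_dim, seq_len = 512, 6
conved = torch.randn(1, 2 * hid_dim, seq_len)  # output of a convolutional filter bank

gated = F.glu(conved, dim=1)                   # split along the channel dimension
print(gated.shape)                             # torch.Size([1, 512, 6])
```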

After passing through the GLU activation, the hidden dimension size for each token is the same as it was when it entered the convolutional block. This output is then elementwise summed with its own input vector from before it was passed through the convolutional layer — a residual connection.

This concludes a single convolutional block. Subsequent blocks take the output of the previous block and perform the same steps. Each block has its own parameters; they are not shared between blocks. The output of the last block goes back to the main encoder — where it is fed through a linear layer to get the conved output and then elementwise summed with the embedding of the token to get the combined output.
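Putting these pieces together, here is a hedged sketch of the stacked encoder blocks; the full model includes further details (such as dropout) that this sketch leaves out, and the sizes are illustrative:

```python
# One convolutional block = pad, convolve to 2 * hid_dim channels, apply GLU,
# then add the block's own input as a residual connection. Blocks are stacked,
# each with its own (unshared) parameters.
import torch
import torch.nn as nn
import torch.nn.functional as F

hid_dim, filter_size, n_blocks, seq_len = 512, 3, 10, 6

convs = nn.ModuleList([
    nn.Conv1d(hid_dim, 2 * hid_dim, kernel_size=filter_size,
              padding=(filter_size - 1) // 2)
    for _ in range(n_blocks)
])

x = torch.randn(1, hid_dim, seq_len)      # [batch, hid_dim, seq_len]
for conv in convs:
    gated = F.glu(conv(x), dim=1)         # back down to hid_dim channels
    x = gated + x                         # residual: sum with the block's input
# x is now returned to the main encoder (linear layer to emb_dim, residual, etc.)
```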

Decoder

The decoder takes in the actual target sentence and tries to predict it. This model differs from the recurrent neural network models previously detailed in these tutorials in that it predicts all tokens within the target sentence in parallel. There is no sequential processing, i.e. no decoding loop. This will be detailed further later in the tutorial.

The decoder is similar to the encoder, with a few changes to both the main model and the convolutional blocks inside the model.

Decoder

First, the embeddings do not have a residual connection that connects after the convolutional blocks and the transformation. Instead the embeddings are fed into the convolutional blocks to be used as residual connections there.

Second, to feed the decoder information from the encoder, the encoder conved and combined outputs are used — again, within the convolutional blocks.

Finally, the output of the decoder is a linear layer from the embedding dimension to the output (vocabulary) dimension. This is used to make a prediction about what the next word in the translation should be.
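As a sketch, this is just a linear layer whose output size is the target vocabulary size (the sizes below are illustrative):

```python
# Final decoder projection: map each position's emb_dim vector to a score for
# every token in the target vocabulary.
import torch
import torch.nn as nn

emb_dim, output_dim, trg_len = 256, 12_000, 7      # illustrative sizes
fc_out = nn.Linear(emb_dim, output_dim)

decoder_conved = torch.randn(1, trg_len, emb_dim)  # stand-in decoder output
logits = fc_out(decoder_conved)                    # [1, trg_len, output_dim]
```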

Decoder Convolutional Blocks

Again, these are similar to the convolutional blocks within the encoder, with a few changes.

Decoder Conv Block

First, the padding. Instead of padding equally on each side to ensure the length of the sentence stays the same throughout, we only pad at the beginning of the sentence. As we are processing all of the targets simultaneously in parallel, and not sequentially, we need a method of only allowing the filter translating token i to look at tokens before token i. If it were allowed to look at token i+1 (the token it should be outputting), the model would simply learn to output the next word in the sequence by directly copying it, without actually learning how to translate.
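A small sketch of this one-sided padding (padding with zeros here for simplicity; sizes are illustrative):

```python
# Decoder-side padding: pad only the start of the sequence with filter_size - 1
# positions, so the filter producing the output at position i never sees tokens
# to the right of i.
import torch
import torch.nn as nn
import torch.nn.functional as F

hid_dim, filter_size, trg_len = 512, 3, 7
conv = nn.Conv1d(hid_dim, 2 * hid_dim, kernel_size=filter_size)  # no built-in padding

x = torch.randn(1, hid_dim, trg_len)       # [batch, hid_dim, trg_len]
x_padded = F.pad(x, (filter_size - 1, 0))  # pad the left (start) side only
out = F.glu(conv(x_padded), dim=1)         # [1, 512, 7]: length preserved
```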
