Attention Mechanism

PyTorch building blocks for the multi-head attention mechanism.
import torch
from torch import nn
def addition(a, b):
    "Adds two numbers together"
    return a + b

We take the sum of a and b, which is written in Python with the "+" operator.
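For example:

addition(2, 3)    # returns 5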



from fastcore.foundation import docs
@docs
class PrepareForMultiHeadAttention(nn.Module):
    def __init__(self, d_model, heads, d_k, bias):
        super().__init__()
        # One linear layer produces the projections for all heads at once
        self.linear = nn.Linear(d_model, heads * d_k, bias=bias)
        self.heads = heads
        self.d_k = d_k

    def forward(self, x):
        # Keep every dimension except the last (e.g. seq_len and batch_size)
        head_shape = x.shape[:-1]

        x = self.linear(x)
        # Split the projected features into `heads` vectors of size `d_k`
        x = x.view(*head_shape, self.heads, self.d_k)

        return x

    _docs = dict(cls_doc="Linearly projects a tensor and splits the last dimension into separate heads.",
                 forward="Project `x` and reshape the result to `(*x.shape[:-1], heads, d_k)`.")

Multi-head Attention

In practice, we don't compute the attention score for each query–key pair one at a time. Instead, we stack all the queries into one matrix and do the same for the keys and values, then compute all the attention scores with a single matrix operation.

\[\operatorname{Attention}(Q, K, V)=\underset{\text { seq }}{\operatorname{softmax}}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V\]
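As a rough sketch of this batched computation in plain PyTorch (masking and dropout are omitted, and the tensor sizes are illustrative):

import math
import torch

# q, k, v hold the queries, keys, and values for every position at once
q = torch.randn(32, 10, 64)   # (batch_size, seq_len, d_k)
k = torch.randn(32, 10, 64)
v = torch.randn(32, 10, 64)

scores = q @ k.transpose(-2, -1) / math.sqrt(64)   # (batch_size, seq_len, seq_len)
attn = scores.softmax(dim=-1)                      # softmax over the key positions ("seq")
out = attn @ v                                     # (batch_size, seq_len, d_k)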

  • d_model: the number of features in the query, key, and value vectors.
  • heads: the number of attention heads.
  • d_k: the number of features per head, i.e. the dimension of the query, key, and value vectors within each head.
class MultiHeadAttention(nn.Module):
    def __init__(
        self,
        heads: int,
        d_model: int,
        dropout_prop: float = 0.1,
        bias: bool = True
    ):
        super().__init__()
        # Split the model dimension evenly across the heads
        self.d_k = d_model // heads
        self.heads = heads
        # Project the queries and split them into heads
        self.query = PrepareForMultiHeadAttention(d_model, heads, self.d_k, bias)

source

MultiHeadAttention

 MultiHeadAttention (heads:int, d_model:int, dropout_prop:float=0.1,
                     bias:bool=True)

Combines heads attention heads. It uses PrepareForMultiHeadAttention to project the inputs and split them into heads of size d_k = d_model // heads, then computes the scaled dot-product attention defined above over all positions at once; dropout_prop is the dropout probability used inside the module.
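The constructor above stops after the query projection. Below is a minimal sketch of how the remaining pieces can be put together to implement the attention formula above; the key and value projections, the dropout placement, and the final output projection are assumptions made for illustration, not taken from the listing above.

import math
import torch
from torch import nn

class MultiHeadAttentionSketch(nn.Module):
    "Illustrative sketch; everything beyond the query projection is an assumption."
    def __init__(self, heads: int, d_model: int, dropout_prop: float = 0.1, bias: bool = True):
        super().__init__()
        self.d_k = d_model // heads
        self.heads = heads
        self.query = PrepareForMultiHeadAttention(d_model, heads, self.d_k, bias)
        self.key = PrepareForMultiHeadAttention(d_model, heads, self.d_k, bias)    # assumed
        self.value = PrepareForMultiHeadAttention(d_model, heads, self.d_k, bias)  # assumed
        self.output = nn.Linear(d_model, d_model)   # assumed final projection
        self.dropout = nn.Dropout(dropout_prop)     # assumed use of dropout_prop

    def forward(self, query, key, value):
        # query/key/value: (seq_len, batch_size, d_model)
        seq_len, batch_size, _ = query.shape
        # Project and split into heads: (seq_len, batch_size, heads, d_k)
        q, k, v = self.query(query), self.key(key), self.value(value)
        # Attention scores for every pair of positions: (batch, heads, seq_q, seq_k)
        scores = torch.einsum('ibhd,jbhd->bhij', q, k) / math.sqrt(self.d_k)
        attn = self.dropout(scores.softmax(dim=-1))   # softmax over the key positions
        # Weighted sum of the values, then merge the heads back together
        out = torch.einsum('bhij,jbhd->ibhd', attn, v)
        out = out.reshape(seq_len, batch_size, -1)
        return self.output(out)

With d_model=512 and heads=8, this maps a (seq_len, batch_size, 512) tensor to a tensor of the same shape.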