How Is Attention Calculated?

In Korea, there is a tradition of “study” groups where engineers from various companies or backgrounds gather to explore specific engineering concepts. Participants typically select a book or an online curriculum to follow, and each week, one member is responsible for studying a chapter in depth and presenting the ideas to the group. I recently joined a study group focused on understanding how Large Language Models (LLMs) function. Last week, I was tasked with presenting the concepts of Attention and Transformers within LLMs. While I am familiar with these concepts, explaining them to others proved to be a distinct challenge compared to understanding them on my own. Here is my perspective on explaining how the Attention Score is calculated.

What Is an Attention Score?

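An Attention Score is a way of assigning weights so that the model focuses more on particular parts of the input. It is computed by quantifying the relationship between a given Query and a Key: the larger the value, the more closely that Key is considered to be related to the Query.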

Deriving Query (Q), Key (K), and Value (V)

In the attention mechanism, Q (Query), K (Key), and V (Value) are vectors obtained from the input through linear transformations. In a Transformer model, the input is usually represented as word embeddings or a variant of them, and these vectors are mapped to $Q$, $K$, and $V$.

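For example, take the sentence “어제 밤에 배가 고파서 치킨을 시켜서 먹었다” ("Last night I was hungry, so I ordered chicken and ate it"). The word “배” could mean any of the following:

a pear (the fruit)
a belly (of a person)
a boat (the means of transport)

To decide which of these is meant, the model can use Attention Scores to learn, from context, how “배” relates to the words around it.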

1. Preparing the Input Data

Split the sentence into words and convert each word into an embedding vector.
For example, suppose each word is represented as a 4-dimensional vector:

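“어제” (yesterday) → [0.1, 0.2, 0.1, 0.3]
“밤에” (at night) → [0.0, 0.1, 0.2, 0.4]
“배” (the ambiguous word) → [0.5, 0.3, 0.2, 0.1]
“가” (subject particle) → [0.1, 0.1, 0.1, 0.2]
“고파서” (being hungry) → [0.2, 0.3, 0.1, 0.0]
“치킨을” (chicken, as object) → [0.4, 0.0, 0.3, 0.2]
“시켜서” (ordered) → [0.3, 0.1, 0.4, 0.1]
“먹었다” (ate) → [0.5, 0.2, 0.0, 0.1]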

Stacking these vectors row by row gives the input matrix $\mathbf{X}$:

\mathbf{X} = \begin{bmatrix}
0.1 & 0.2 & 0.1 & 0.3 \\ % 어제
0.0 & 0.1 & 0.2 & 0.4 \\ % 밤에
0.5 & 0.3 & 0.2 & 0.1 \\ % 배
0.1 & 0.1 & 0.1 & 0.2 \\ % 가
0.2 & 0.3 & 0.1 & 0.0 \\ % 고파서
0.4 & 0.0 & 0.3 & 0.2 \\ % 치킨을
0.3 & 0.1 & 0.4 & 0.1 \\ % 시켜서
0.5 & 0.2 & 0.0 & 0.1 % 먹었다
\end{bmatrix}
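As a minimal sketch of this step, the snippet below builds $\mathbf{X}$ from a lookup table. The names `tokens` and `embeddings` are hypothetical, and the token list is written out by hand because splitting on spaces would not separate “배” from “가”; a real model would use a trained tokenizer and a learned embedding table.

```python
# Hypothetical pre-tokenized sentence (assumption: the tokenizer has
# already split "배가" into "배" and "가").
tokens = ["어제", "밤에", "배", "가", "고파서", "치킨을", "시켜서", "먹었다"]

# Toy 4-dimensional embedding table, copied from the values in the text.
embeddings = {
    "어제":   [0.1, 0.2, 0.1, 0.3],
    "밤에":   [0.0, 0.1, 0.2, 0.4],
    "배":     [0.5, 0.3, 0.2, 0.1],
    "가":     [0.1, 0.1, 0.1, 0.2],
    "고파서": [0.2, 0.3, 0.1, 0.0],
    "치킨을": [0.4, 0.0, 0.3, 0.2],
    "시켜서": [0.3, 0.1, 0.4, 0.1],
    "먹었다": [0.5, 0.2, 0.0, 0.1],
}

# One row per token: X has shape (8 tokens, 4 dimensions).
X = [embeddings[token] for token in tokens]

print(len(X), "x", len(X[0]))
print(X[tokens.index("배")])
```

Row order in `X` follows word order in the sentence, which is why “배” ends up in row 3 (index 2 when counting from zero).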

2. Computing Q and K

To transform $\mathbf{X}$ into Query (Q) and Key (K), we use learnable weight matrices $W_Q$ and $W_K$.

For convenience, we define the weight matrices as follows.

\mathbf{W}_Q = \begin{bmatrix} 0.5 & 0.1 & 0.2 & 0.2 \\ 0.2 & 0.3 & 0.1 & 0.4 \\ 0.1 & 0.5 & 0.3 & 0.1 \\ 0.3 & 0.1 & 0.4 & 0.2 \end{bmatrix}, \quad \mathbf{W}_K = \begin{bmatrix} 0.4 & 0.2 & 0.1 & 0.3 \\ 0.1 & 0.3 & 0.2 & 0.5 \\ 0.2 & 0.4 & 0.5 & 0.1 \\ 0.3 & 0.2 & 0.1 & 0.4 \end{bmatrix}

Q and K are then obtained by matrix multiplication.

\mathbf{Q} = \mathbf{X} \mathbf{W}_Q, \quad \mathbf{K} = \mathbf{X} \mathbf{W}_K

\mathbf{Q} = \begin{bmatrix} 0.1 & 0.2 & 0.1 & 0.3 \\ 0.0 & 0.1 & 0.2 & 0.4 \\ 0.5 & 0.3 & 0.2 & 0.1 \\ 0.1 & 0.1 & 0.1 & 0.2 \\ 0.2 & 0.3 & 0.1 & 0.0 \\ 0.4 & 0.0 & 0.3 & 0.2 \\ 0.3 & 0.1 & 0.4 & 0.1 \\ 0.5 & 0.2 & 0.0 & 0.1 \end{bmatrix} \begin{bmatrix} 0.5 & 0.1 & 0.2 & 0.2 \\ 0.2 & 0.3 & 0.1 & 0.4 \\ 0.1 & 0.5 & 0.3 & 0.1 \\ 0.3 & 0.1 & 0.4 & 0.2 \end{bmatrix} = \begin{bmatrix} 0.17 & 0.24 & 0.17 & 0.22 \\ 0.17 & 0.28 & 0.23 & 0.20 \\ 0.36 & 0.27 & 0.26 & 0.25 \\ 0.11 & 0.15 & 0.13 & 0.14 \\ 0.16 & 0.18 & 0.12 & 0.16 \\ 0.28 & 0.16 & 0.26 & 0.24 \\ 0.31 & 0.18 & 0.27 & 0.25 \\ 0.32 & 0.19 & 0.20 & 0.22 \end{bmatrix}

\mathbf{K} = \begin{bmatrix} 0.1 & 0.2 & 0.1 & 0.3 \\ 0.0 & 0.1 & 0.2 & 0.4 \\ 0.5 & 0.3 & 0.2 & 0.1 \\ 0.1 & 0.1 & 0.1 & 0.2 \\ 0.2 & 0.3 & 0.1 & 0.0 \\ 0.4 & 0.0 & 0.3 & 0.2 \\ 0.3 & 0.1 & 0.4 & 0.1 \\ 0.5 & 0.2 & 0.0 & 0.1 \end{bmatrix} \begin{bmatrix} 0.4 & 0.2 & 0.1 & 0.3 \\ 0.1 & 0.3 & 0.2 & 0.5 \\ 0.2 & 0.4 & 0.5 & 0.1 \\ 0.3 & 0.2 & 0.1 & 0.4 \end{bmatrix} = \begin{bmatrix} 0.15 & 0.22 & 0.20 & 0.28 \\ 0.18 & 0.30 & 0.28 & 0.32 \\ 0.33 & 0.31 & 0.27 & 0.30 \\ 0.10 & 0.15 & 0.13 & 0.18 \\ 0.14 & 0.20 & 0.16 & 0.19 \\ 0.28 & 0.18 & 0.23 & 0.26 \\ 0.27 & 0.24 & 0.25 & 0.27 \\ 0.26 & 0.24 & 0.20 & 0.24 \end{bmatrix}
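The two projections can be reproduced with a few lines of plain Python. This is only a sketch: `matmul` is a hypothetical helper written for this example, and the matrices are the toy values from the text. Because the printed values come straight from $\mathbf{X}$, $W_Q$, and $W_K$, they can be used to cross-check the hand-worked matrices above entry by entry.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

# Input embeddings, one row per word (values from the text).
X = [
    [0.1, 0.2, 0.1, 0.3],  # 어제
    [0.0, 0.1, 0.2, 0.4],  # 밤에
    [0.5, 0.3, 0.2, 0.1],  # 배
    [0.1, 0.1, 0.1, 0.2],  # 가
    [0.2, 0.3, 0.1, 0.0],  # 고파서
    [0.4, 0.0, 0.3, 0.2],  # 치킨을
    [0.3, 0.1, 0.4, 0.1],  # 시켜서
    [0.5, 0.2, 0.0, 0.1],  # 먹었다
]

# Toy projection matrices (values from the text, not trained weights).
W_Q = [
    [0.5, 0.1, 0.2, 0.2],
    [0.2, 0.3, 0.1, 0.4],
    [0.1, 0.5, 0.3, 0.1],
    [0.3, 0.1, 0.4, 0.2],
]
W_K = [
    [0.4, 0.2, 0.1, 0.3],
    [0.1, 0.3, 0.2, 0.5],
    [0.2, 0.4, 0.5, 0.1],
    [0.3, 0.2, 0.1, 0.4],
]

Q = matmul(X, W_Q)  # shape (8, 4)
K = matmul(X, W_K)  # shape (8, 4)

for name, matrix in (("Q", Q), ("K", K)):
    print(name)
    for row in matrix:
        print([round(value, 2) for value in row])
```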

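3. Computing the Attention Score

To determine the meaning of “배”, we compare the Query of “배” $(Q_{\text{배}})$ with the Keys of all words $(K_{\text{all words}})$ to compute the Attention Score. In attention, the Query (Q) represents what a word is "asking" about or looking for, and the Key (K) represents the features or context that each word offers. This can be pictured as follows:

Query $(Q_\text{배})$ : “Let me find the words that are contextually related to me.”
Key $(K_\text{밤에})$ : “I am a word that provides temporal information.”
Key $(K_\text{치킨})$ : “I am a word related to food.”

The Attention Score is calculated with the formula below.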

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

Since $d_k = 4$, we have $\sqrt{d_k} = 2$.

First, computing $\mathbf{Q} \mathbf{K}^T$ gives:

\mathbf{Q} \mathbf{K}^T = \begin{bmatrix} 0.17 & 0.24 & 0.17 & 0.22 \\ 0.17 & 0.28 & 0.23 & 0.20 \\ 0.36 & 0.27 & 0.26 & 0.25 \\ 0.11 & 0.15 & 0.13 & 0.14 \\ 0.16 & 0.18 & 0.12 & 0.16 \\ 0.28 & 0.16 & 0.26 & 0.24 \\ 0.31 & 0.18 & 0.27 & 0.25 \\ 0.32 & 0.19 & 0.20 & 0.22 \end{bmatrix} \begin{bmatrix} 0.15 & 0.18 & 0.33 & 0.10 & 0.14 & 0.28 & 0.27 & 0.26 \\ 0.22 & 0.30 & 0.31 & 0.15 & 0.20 & 0.18 & 0.24 & 0.24 \\ 0.20 & 0.28 & 0.27 & 0.13 & 0.16 & 0.23 & 0.25 & 0.20 \\ 0.28 & 0.32 & 0.30 & 0.18 & 0.19 & 0.26 & 0.27 & 0.24 \end{bmatrix}

\mathbf{Q} \mathbf{K}^T = \begin{bmatrix} 0.156 & 0.212 & 0.199 & 0.119 & 0.158 & 0.211 & 0.229 & 0.217 \\ 0.170 & 0.236 & 0.220 & 0.131 & 0.175 & 0.233 & 0.248 & 0.230 \\ 0.274 & 0.374 & 0.342 & 0.204 & 0.266 & 0.352 & 0.375 & 0.340 \\ 0.114 & 0.154 & 0.143 & 0.085 & 0.110 & 0.147 & 0.158 & 0.144 \\ 0.131 & 0.174 & 0.159 & 0.095 & 0.126 & 0.166 & 0.176 & 0.159 \\ 0.239 & 0.315 & 0.288 & 0.173 & 0.227 & 0.302 & 0.321 & 0.293 \\ 0.264 & 0.346 & 0.316 & 0.189 & 0.249 & 0.328 & 0.349 & 0.319 \\ 0.258 & 0.339 & 0.311 & 0.186 & 0.243 & 0.320 & 0.340 & 0.313 \end{bmatrix}

To scale the result, divide it by $\sqrt{d_k} = 2$:

\mathbf{QK^T}_{\text{scaled}} = \begin{bmatrix} 0.078 & 0.106 & 0.100 & 0.060 & 0.079 & 0.106 & 0.115 & 0.109 \\ 0.085 & 0.118 & 0.110 & 0.065 & 0.087 & 0.117 & 0.124 & 0.115 \\ 0.137 & 0.187 & 0.171 & 0.102 & 0.133 & 0.176 & 0.188 & 0.170 \\ 0.057 & 0.077 & 0.071 & 0.043 & 0.055 & 0.074 & 0.079 & 0.072 \\ 0.065 & 0.087 & 0.080 & 0.048 & 0.063 & 0.083 & 0.088 & 0.080 \\ 0.120 & 0.158 & 0.144 & 0.087 & 0.114 & 0.151 & 0.160 & 0.146 \\ 0.132 & 0.173 & 0.158 & 0.095 & 0.125 & 0.164 & 0.175 & 0.160 \\ 0.129 & 0.170 & 0.156 & 0.093 & 0.121 & 0.160 & 0.170 & 0.157 \end{bmatrix}
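The scaled scores for a single query can be sketched as below, under the assumption that only the row for “배” is of interest. `matmul` and the index name `bae` are hypothetical names introduced for this example; the matrices are the toy values from the text, so the output is a way to double-check the “배” row of the scaled matrix.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

X = [
    [0.1, 0.2, 0.1, 0.3], [0.0, 0.1, 0.2, 0.4],
    [0.5, 0.3, 0.2, 0.1], [0.1, 0.1, 0.1, 0.2],
    [0.2, 0.3, 0.1, 0.0], [0.4, 0.0, 0.3, 0.2],
    [0.3, 0.1, 0.4, 0.1], [0.5, 0.2, 0.0, 0.1],
]
W_Q = [[0.5, 0.1, 0.2, 0.2], [0.2, 0.3, 0.1, 0.4],
       [0.1, 0.5, 0.3, 0.1], [0.3, 0.1, 0.4, 0.2]]
W_K = [[0.4, 0.2, 0.1, 0.3], [0.1, 0.3, 0.2, 0.5],
       [0.2, 0.4, 0.5, 0.1], [0.3, 0.2, 0.1, 0.4]]

Q = matmul(X, W_Q)
K = matmul(X, W_K)

d_k = len(K[0])  # key dimension, 4 in this example
bae = 2          # zero-based row index of "배" (row 3 in the text)

# Dot product of the "배" query with every key: one row of Q K^T.
raw = [sum(q * k for q, k in zip(Q[bae], key)) for key in K]

# Divide by sqrt(d_k) to get the scaled attention scores.
scaled = [score / math.sqrt(d_k) for score in raw]

print([round(score, 3) for score in scaled])
```

Dividing by $\sqrt{d_k}$ keeps the dot products from growing with the vector dimension, which would otherwise push softmax toward very peaked outputs.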

The scaled matrix $\mathbf{QK^T}_{\text{scaled}}$ is called the Attention Score. The next step is to apply softmax. Since we are interested in what the word “배” means, we work through row 3 as the representative calculation.

\text{Softmax}(\text{Row 3})_i = \frac{\exp(x_i)}{\sum_{j}\exp(x_j)}, \quad x_i \in \text{Row 3}

\exp(\text{Row 3}) = \begin{bmatrix} 1.147 & 1.205 & 1.187 & 1.108 & 1.143 & 1.193 & 1.207 & 1.186 \end{bmatrix}

Summing the entries gives:

\text{sum} = 1.147 + 1.205 + 1.187 + 1.108 + 1.143 + 1.193 + 1.207 + 1.186 = 9.376

Normalizing each entry by this sum gives:

\text{Softmax}(\text{Row 3}) = \begin{bmatrix} 0.122 & 0.129 & 0.127 & 0.118 & 0.122 & 0.127 & 0.129 & 0.127 \end{bmatrix}

Applying softmax to every other row as well gives the Attention Weights below.

\mathbf{AW_i} = \begin{bmatrix} 0.123 & 0.127 & 0.126 & 0.121 & 0.123 & 0.127 & 0.128 & 0.127 \\ 0.123 & 0.127 & 0.126 & 0.120 & 0.123 & 0.127 & 0.128 & 0.126 \\ 0.122 & 0.129 & 0.127 & 0.118 & 0.122 & 0.127 & 0.129 & 0.127 \\ 0.124 & 0.127 & 0.126 & 0.122 & 0.124 & 0.126 & 0.127 & 0.126 \\ 0.124 & 0.127 & 0.126 & 0.122 & 0.124 & 0.126 & 0.127 & 0.126 \\ 0.122 & 0.127 & 0.125 & 0.118 & 0.122 & 0.126 & 0.127 & 0.125 \\ 0.123 & 0.128 & 0.126 & 0.118 & 0.122 & 0.127 & 0.128 & 0.126 \\ 0.123 & 0.128 & 0.126 & 0.119 & 0.122 & 0.127 & 0.128 & 0.126 \end{bmatrix}
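The softmax step for row 3 can be checked with the short sketch below. The input is the scaled row for “배” exactly as printed in the text; the function name `softmax` and the max-subtraction trick are choices made for this example, not something the text prescribes.

```python
import math

def softmax(row):
    """Turn a list of scores into weights that are positive and sum to 1."""
    # Subtracting the maximum does not change the result but avoids
    # overflow in exp() when scores are large.
    shift = max(row)
    exps = [math.exp(value - shift) for value in row]
    total = sum(exps)
    return [e / total for e in exps]

# Scaled scores for "배" (row 3), copied from the matrix in the text.
row_3 = [0.137, 0.187, 0.171, 0.102, 0.133, 0.176, 0.188, 0.170]

weights = softmax(row_3)

print([round(w, 3) for w in weights])
print(round(sum(weights), 6))
```

Because all eight scores lie within about 0.09 of each other, the resulting weights stay close to the uniform value of 1/8 = 0.125.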

4. Computing the Final Output

The final output is computed by multiplying the Attention Weights by the Values (V):

\text{Output}_{\text{배}} = \sum_{i} \text{Attention Weight}_i \cdot V_i

Value (V) is the vector that holds the information each word will "pass along".
Using the same $\mathbf{X}$ as before together with a learned $W_V$, V is computed as follows:

\mathbf{W}_V = \begin{bmatrix} 0.3 & 0.1 & 0.2 & 0.4 \\ 0.1 & 0.4 & 0.3 & 0.2 \\ 0.4 & 0.2 & 0.1 & 0.3 \\ 0.2 & 0.3 & 0.4 & 0.1 \end{bmatrix}

\mathbf{V} = \mathbf{X} \mathbf{W}_V = \begin{bmatrix} 0.1 & 0.2 & 0.1 & 0.3 \\ 0.0 & 0.1 & 0.2 & 0.4 \\ 0.5 & 0.3 & 0.2 & 0.1 \\ 0.1 & 0.1 & 0.1 & 0.2 \\ 0.2 & 0.3 & 0.1 & 0.0 \\ 0.4 & 0.0 & 0.3 & 0.2 \\ 0.3 & 0.1 & 0.4 & 0.1 \\ 0.5 & 0.2 & 0.0 & 0.1 \end{bmatrix} \begin{bmatrix} 0.3 & 0.1 & 0.2 & 0.4 \\ 0.1 & 0.4 & 0.3 & 0.2 \\ 0.4 & 0.2 & 0.1 & 0.3 \\ 0.2 & 0.3 & 0.4 & 0.1 \end{bmatrix} = \begin{bmatrix} 0.17 & 0.18 & 0.18 & 0.18 \\ 0.22 & 0.20 & 0.18 & 0.18 \\ 0.32 & 0.26 & 0.23 & 0.21 \\ 0.13 & 0.13 & 0.13 & 0.12 \\ 0.17 & 0.17 & 0.15 & 0.12 \\ 0.30 & 0.21 & 0.23 & 0.20 \\ 0.32 & 0.23 & 0.24 & 0.21 \\ 0.29 & 0.22 & 0.20 & 0.18 \end{bmatrix}

Now we compute the Attention Output for “배”:

\mathbf{A}_{\text{Row 3}} \times \mathbf{V} = [0.122, 0.129, 0.127, 0.118, 0.122, 0.127, 0.129, 0.127] \times \begin{bmatrix} 0.17 & 0.18 & 0.18 & 0.18 \\ 0.22 & 0.20 & 0.18 & 0.18 \\ 0.32 & 0.26 & 0.23 & 0.21 \\ 0.13 & 0.13 & 0.13 & 0.12 \\ 0.17 & 0.17 & 0.15 & 0.12 \\ 0.30 & 0.21 & 0.23 & 0.20 \\ 0.32 & 0.23 & 0.24 & 0.21 \\ 0.29 & 0.22 & 0.20 & 0.18 \end{bmatrix} =

\mathbf{A}_{\text{Row 3}} \times \mathbf{V} = [0.275, 0.222, 0.206, 0.183]

Computing this for every row gives the Attention Output below:

\text{Attention Output} = \begin{bmatrix} 0.267 & 0.219 & 0.204 & 0.181 \\ 0.267 & 0.219 & 0.204 & 0.181 \\ 0.275 & 0.222 & 0.206 & 0.183 \\ 0.263 & 0.218 & 0.203 & 0.180 \\ 0.263 & 0.218 & 0.203 & 0.180 \\ 0.270 & 0.221 & 0.205 & 0.182 \\ 0.272 & 0.221 & 0.205 & 0.182 \\ 0.271 & 0.220 & 0.204 & 0.181 \end{bmatrix}
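Putting the steps together, a single-head version of $\text{softmax}(QK^T/\sqrt{d_k})V$ might look like the sketch below. It uses only the standard library; `matmul`, `softmax`, and `attention` are hypothetical helper names for this example, and the matrices are the toy values from the text rather than trained weights. Running it is one way to cross-check the hand-worked output matrix.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    shift = max(row)
    exps = [math.exp(value - shift) for value in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention; returns (output, weights)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]           # transpose of K
    scores = [[s / math.sqrt(d_k) for s in row]    # Q K^T / sqrt(d_k)
              for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]     # each row sums to 1
    return matmul(weights, v), weights             # weighted sum of values

X = [
    [0.1, 0.2, 0.1, 0.3], [0.0, 0.1, 0.2, 0.4],
    [0.5, 0.3, 0.2, 0.1], [0.1, 0.1, 0.1, 0.2],
    [0.2, 0.3, 0.1, 0.0], [0.4, 0.0, 0.3, 0.2],
    [0.3, 0.1, 0.4, 0.1], [0.5, 0.2, 0.0, 0.1],
]
W_Q = [[0.5, 0.1, 0.2, 0.2], [0.2, 0.3, 0.1, 0.4],
       [0.1, 0.5, 0.3, 0.1], [0.3, 0.1, 0.4, 0.2]]
W_K = [[0.4, 0.2, 0.1, 0.3], [0.1, 0.3, 0.2, 0.5],
       [0.2, 0.4, 0.5, 0.1], [0.3, 0.2, 0.1, 0.4]]
W_V = [[0.3, 0.1, 0.2, 0.4], [0.1, 0.4, 0.3, 0.2],
       [0.4, 0.2, 0.1, 0.3], [0.2, 0.3, 0.4, 0.1]]

output, weights = attention(X, W_Q, W_K, W_V)

bae = 2  # zero-based row index of "배"
print("weights for 배:", [round(w, 3) for w in weights[bae]])
print("output for 배: ", [round(o, 3) for o in output[bae]])
```

Each output row is a weighted average of the rows of V, so every entry has to fall between the smallest and largest value in the corresponding column of V.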

5. Determining Meaning from the Attention Score

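The final output for “배”, $[0.275, 0.222, 0.206, 0.183]$, is a weighted mix of every word's Value vector, with the mixing proportions given by the Attention Weights. In a trained model, we would expect “배” to attend most strongly to “고파서”, the most important word in this context, so the output would mainly carry information from that word. The matrices $W_Q$ and $W_K$ in this example were chosen for convenience rather than learned, which is why the weights in row 3 come out nearly uniform.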

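This output plays the following roles:

Providing contextual meaning: it carries the contextual information that “배” is associated with “고파서” and is therefore likely to be read as a person's belly.
Use by later layers: the next layers of the Transformer build on this information to learn the contextual role of “배” more precisely.
Reflected in the final task: it helps the contextual meaning of “배” carry through accurately in NLP tasks such as translation, summarization, and question answering.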

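The final output vector is used by the subsequent layers of the Transformer, which determine the meaning of “배” in the following ways:

Context-based representation learning: the output vector is the result of learning how “배” relates to the other words in context.
- Relationship between “배” and “고파서”, “치킨” → likely a person's belly.
- Relationship between “배” and “밤에”, “치킨” → possibly a pear (the fruit).
- Relationship between “배” and “밤에” → not very likely, but possibly a boat.
Performing the downstream task:
- Question answering: explain clearly what “배” refers to.
- Translation: translate “배” into the appropriate word in another language depending on the situation.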

Now that we understand how the Attention Score is calculated, the next step is to look at how a Transformer model is trained on top of it.

Dec 2, 2024

AI Enthusiast and a Software Engineer
Jason Kang on LinkedIn