Introduction

The latest model making waves in the Open Source Language Model Sweepstakes is Mixtral, an 8 x 7B mixture of experts model from the folks at Mistral AI. If you look at the LMSys Chatbot Arena Leaderboard, Mistral’s models sit third or fourth, behind only the OpenAIs, Googles, and Anthropics of the world. Mixtral is the best open-source model according to these metrics:

Leaderboard MoE Arch

There are also persistent rumors that GPT-4 (by far the reigning champion) is an 8 x 220B mixture of experts model. All of this leads to an important question: what even is a mixture of experts model?

Note that there’s a bunch of MoE-shaped work outside of the LLM domain; in this blog, we’re going to stick to LLM MoEs.

Motivating Mixture of Expert Architectures

Let’s start with a simple question: why would you build an 8x7B Frankenstein when you could build a 56B dense model?

There are two different goals to keep in mind:

  1. Ensemble learning: When you train multiple different versions of the same base architecture in slightly different ways, you’re able to create a set of models that complement each other. This is the idea behind random forests or multi-head attention.
  2. Conditional computation: If you can retain the same level of performance while activating fewer parameters per iteration at training or inference time, your model runs faster and is more compute efficient. Turing Award winner Yoshua Bengio has some really interesting work in this domain.

MoEs are intended to bring both of these benefits to large language models.

How do MoEs Work?

An MoE in concept looks something like this:

Moe Diagram

It is useful to think of an MoE as a replacement for an MLP block in the network, so you can stack up as many MoE blocks as needed in the model itself. Here’s another visualization that might help encapsulate this idea:

MoE Visualization
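
To make that concrete, here is a minimal sketch (my own illustration, not Mixtral’s actual code) of a pre-norm Transformer block where the feed-forward slot can hold either a dense MLP or an MoE layer, since both map `(tokens, d_model) -> (tokens, d_model)`. The names `Block` and `ffn` are just for this example.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model: int, n_heads: int, ffn: nn.Module):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = ffn  # a dense MLP here gives a dense block; an MoE layer gives an MoE block

    def forward(self, x):
        # Standard pre-norm block: attention, then the feed-forward slot.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        return x + self.ffn(self.norm2(x))

# Dense version; swapping `dense_mlp` for an MoE layer (sketched later in the post)
# is the only change needed.
dense_mlp = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
block = Block(d_model=64, n_heads=4, ffn=dense_mlp)
print(block(torch.randn(2, 10, 64)).shape)   # torch.Size([2, 10, 64])
```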

The Math Behind MoEs

The output of the MoE is a weighted sum of the “gating weights” and the “expert outputs”.

$$ y = \sum_{i=1}^{n} \underbrace{G(x)_i}_{\text{gating weight for gate $i$}} \overbrace{E_i(x)}^{\text{output of expert $i$}} $$

Here, each $E_i$ is an MLP itself. But what is $G$? There can be many choices. Let’s start with a super simple one:

$$ G_{\sigma}(x) = \text{Softmax}(x \cdot W_g) $$

The softmax function takes some vector of values and converts it so that every value is positive and they all sum to 1. This turns the vector into a probability distribution, where each entry can be read as the probability of picking the corresponding expert.

But there is a problem with softmax: its outputs are rarely exactly zero, so every expert gets some weight and still has to be evaluated. That makes it challenging to fulfill the promise of conditional computation, even if we set up a great ensemble learning network.
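
A quick numeric illustration of that point: even when one logit dominates, softmax assigns every entry a strictly positive weight, so every expert would still need to run.

```python
import torch

logits = torch.tensor([4.0, 1.0, -2.0, -5.0])   # router logits for 4 experts
probs = torch.softmax(logits, dim=-1)
print(probs)              # roughly [0.950, 0.047, 0.002, 0.0001]
print((probs > 0).all())  # tensor(True): every expert gets nonzero weight,
                          # so every expert still has to be evaluated
```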

So instead, we modify the equation a bit:

$$ G_{\sigma}(x) = \text{Softmax}(\text{TopK}(x \cdot W_g)) $$

Here, the output of the gating network is the softmax over only the top K logits, and the remaining gate values are set to 0 (effectively turning off those experts and allowing for conditional computation). K is a hyperparameter that we choose before training.

Top K Formalism
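
Here’s a small sketch of that top-K gating function in PyTorch. It masks the non-top-K logits to $-\infty$ before the softmax, which is equivalent to taking the softmax over only the top K values; the helper name `topk_gating` is just for this example.

```python
import torch
import torch.nn.functional as F

def topk_gating(x: torch.Tensor, W_g: torch.Tensor, k: int) -> torch.Tensor:
    """Sketch of Softmax(TopK(x @ W_g)): keep the k largest router logits,
    mask the rest so they become exactly 0 after the softmax."""
    logits = x @ W_g                                   # (num_tokens, num_experts)
    topk_vals, topk_idx = logits.topk(k, dim=-1)
    masked = torch.full_like(logits, float("-inf"))
    masked.scatter_(-1, topk_idx, topk_vals)           # non-top-k logits -> -inf
    return F.softmax(masked, dim=-1)                   # exact zeros outside the top-k

# Example: 3 tokens, model dim 8, 4 experts, K = 2
x = torch.randn(3, 8)
W_g = torch.randn(8, 4)
gates = topk_gating(x, W_g, k=2)
print(gates)   # each row has exactly 2 nonzero weights that sum to 1
```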

Here’s what Mixtral uses:

$$ y = \sum_{i=1}^{N} G(x)_i \cdot E_i(x) = \sum_{i=1}^{N} \text{Softmax}(\text{Top2}(x \cdot W_g))_i \cdot \text{SwiGLU}_i(x) $$

Let’s reiterate the value of this: it allows us to keep scaling the total number of experts while keeping the number of “activated” experts constant, which keeps training and inference latency low.
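
Putting the pieces together, here is a rough sketch of a Mixtral-style top-2 MoE layer with SwiGLU experts. This is my own illustration of the equation above, not Mixtral’s actual implementation; the class names, dimensions, and the simple per-expert loop are all assumptions for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """One expert: a SwiGLU feed-forward block."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)   # gate projection
        self.w3 = nn.Linear(d_model, d_ff, bias=False)   # up projection
        self.w2 = nn.Linear(d_ff, d_model, bias=False)   # down projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class MoELayer(nn.Module):
    """Top-2 mixture of SwiGLU experts, in the spirit of the equation above:
    y = sum_i Softmax(Top2(x @ W_g))_i * SwiGLU_i(x)."""
    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts, bias=False)   # W_g
        self.experts = nn.ModuleList(
            [SwiGLUExpert(d_model, d_ff) for _ in range(num_experts)]
        )

    def forward(self, x):                               # x: (..., d_model)
        shape = x.shape
        x = x.reshape(-1, shape[-1])                    # flatten to (num_tokens, d_model)
        logits = self.gate(x)                           # (num_tokens, num_experts)
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(topk_vals, dim=-1)          # softmax over the top-2 only
        y = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Each expert runs only on the tokens that were routed to it.
            token_ids, slot = (topk_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            y[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return y.reshape(shape)

# Example: a small batch through an 8-expert, top-2 layer
layer = MoELayer(d_model=16, d_ff=64)
out = layer(torch.randn(2, 5, 16))
print(out.shape)   # torch.Size([2, 5, 16])
```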

Limitations of Mixture of Experts

But of course, all of this sounds too good to be true… so let’s discuss the hiccups that MoEs introduce.

  1. Weights still need to be loaded at training/inference time.

    Since we don’t know which experts will be chosen at training or inference time, we need to load all of them into memory, which means our RAM requirements remain extremely high, along with the need for parallelism strategies like tensor and pipeline parallelism.

  2. Batch size is effectively reduced.

    In the average case, for a batch of $100$ datapoints with top-1 routing, each expert sees only about $100 / \text{num experts}$ datapoints (see the short arithmetic sketch after this list). In the worst case, an expert may never be chosen for an entire batch.

  3. Training can be imbalanced.

    Imagine 100 sequences and 10 experts. 91 of the sequences go to one expert, while the other 9 are spread across the remaining nine. This under-utilizes the model architecture and can destabilize training down the line.

  4. We haven’t yet demonstrated that, at the same total parameter count, MoE models outperform dense models.

    We have evidence that MoE models outperform dense models at the same number of activated parameters, but that comparison of course leans in favor of MoE models.
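
For point 2, the arithmetic is simple enough to spell out (illustrative numbers, assuming top-1 routing):

```python
batch_size, num_experts = 100, 8
print(batch_size // num_experts)   # 12 -- each expert sees roughly 1/8th of the batch on average
# Worst case: an unlucky expert receives 0 datapoints and gets no gradient signal this step.
```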

Load Balancing

Tactics for Load Balancing

Failure case: imagine 10 experts, and only one of them gets selected all the time. You end up really only training one of the experts.

The solution: load balancing. Here are some common load balancing tactics:

Auxiliary Loss

Add an auxiliary loss that keeps the number of tokens routed to each expert roughly even.

Auxiliary Loss
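
As a concrete example, here is a sketch of one common formulation of this auxiliary loss, the one described in the Switch Transformers paper: multiply the fraction of tokens routed to each expert by the mean router probability for that expert, sum over experts, and scale by the number of experts. It is minimized when routing is perfectly uniform. This is an illustrative implementation, not necessarily the exact loss any particular model uses.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor,
                        expert_idx: torch.Tensor,
                        num_experts: int) -> torch.Tensor:
    """num_experts * sum_i f_i * P_i, where f_i is the fraction of tokens whose
    top-1 choice is expert i and P_i is the mean router probability for expert i.
    Value is ~1.0 when routing is balanced, and grows as routing collapses."""
    probs = F.softmax(router_logits, dim=-1)                               # (tokens, experts)
    frac_tokens = F.one_hot(expert_idx, num_experts).float().mean(dim=0)   # f_i
    mean_prob = probs.mean(dim=0)                                          # P_i
    return num_experts * torch.sum(frac_tokens * mean_prob)

router_logits = torch.randn(1000, 8)
aux = load_balancing_loss(router_logits, router_logits.argmax(dim=-1), num_experts=8)
print(aux)   # near 1.0 when balanced; larger when tokens pile onto a few experts
```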

Random Routing

Set K=2, keep the top expert, and select the second expert randomly, with probability proportional to its router weight.
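
Here is a small sketch of that idea (my own reading of it, not any paper’s exact scheme): take the top-1 expert deterministically, then sample the second expert with probability proportional to the remaining router weights.

```python
import torch
import torch.nn.functional as F

def top1_plus_random_second(router_logits: torch.Tensor):
    """Deterministic top-1 expert, plus a second expert sampled in proportion
    to the remaining router weights (illustrative only)."""
    probs = F.softmax(router_logits, dim=-1)                  # (tokens, experts)
    first = probs.argmax(dim=-1)                              # deterministic top-1
    remaining = probs * (1 - F.one_hot(first, probs.shape[-1]).float())
    second = torch.multinomial(remaining, num_samples=1).squeeze(-1)
    return first, second

first, second = top1_plus_random_second(torch.randn(4, 8))
print(first.tolist(), second.tolist())   # two distinct experts per token
```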

Expert Capacity

Set a cap on how many tokens a single expert can process. If an expert overflows, the extra tokens are either passed through to the next layer or dropped.

Expert Capacity

This value (the capacity factor) is often pretty low! Between 1 and 1.25.
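
Here is a toy sketch of how a capacity factor translates into a per-expert token budget (illustrative numbers only):

```python
import torch

def expert_capacity(num_tokens: int, num_experts: int, capacity_factor: float = 1.25) -> int:
    """Each expert may process at most capacity_factor * (num_tokens / num_experts) tokens."""
    return int(capacity_factor * num_tokens / num_experts)

cap = expert_capacity(num_tokens=1024, num_experts=8)   # 160 tokens per expert
assignments = torch.randint(0, 8, (1024,))              # pretend these are top-1 routing choices
counts = torch.bincount(assignments, minlength=8)
overflow = (counts - cap).clamp(min=0)                  # these tokens overflow and get dropped
print(cap, counts.tolist(), overflow.tolist())
```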

Failures of Load Balancing

Imagine you use all of these load balancing strategies. You might still run into a failure case: the experts don’t specialize. They end up at roughly the same weights, and it doesn’t matter which expert you choose. Our load balancing efforts increase the likelihood of this scenario.

From ST-MoE: “due to token routing and load balancing, there is no single expert specialized in any given language.”

404: No (Human-Interpretable) Experts Found

So… does each of the sub-models in fact become an “expert” in some sub-domain? Do we have code experts and humanities experts and science experts?

tl;dr: It doesn’t seem so, at least not in any way that humans can read off the model weights.

  • From Mixtral: “To investigate this, we measure the distribution of selected experts on different subsets of The Pile validation dataset. Surprisingly, we do not observe obvious patterns in the assignment of experts based on the topic.”

Mistral Layer to Expert Map

MoE Models Outperform Dense Models (With a Catch)

At equal activated parameter counts, it’s clear that MoE models outperform dense models. From DeepSpeed-MoE:

Deepspeed model comparisons

And from Switch Transformers:

Switch transformer model comparisons

Consider this evidence with some context: a lot of it was collected in an earlier era of deep learning (T5-Large here is under a billion parameters). We do not know if this trend holds at even larger scales, and we have yet to see recent public evidence comparing MoE and dense models head to head.

Conclusions

MoEs follow a long line of work in machine learning that attempts to harness ensemble learning. In practice, though, the element of conditional computation and the lower latency it buys remains the critical draw, especially given the lack of definitive evidence that MoEs, as currently implemented, are simply better models.

While I don’t think MoEs will be what breaks the grasp that dense model scaling seems to have on industrial research right now, every time I see more work in this domain, it pleases me greatly. Looking forward to seeing where it goes!

Acknowledgements

Thank you to Anton Zabreyko, Helen Buchanan, Yuqing Wang, and the folks at Adept AI.