NeuroGates and Artificial Contextual Gates

This paper was written by Maxime, author of this site.

Please reference it / me if you want to use any idea or concept that originates from it.

Intro

This concept was inspired by this paper and others (links coming).
This isn’t a complete research paper; it’s a quick overview and doesn’t include any testing. The benchmarks will be found here (link coming).

Abstract

Artificial Neural Networks (ANNs) are supposed to mimic the natural processes of thinking and of dealing with information and stimuli. Yet, due to the determinism of non-quantum computing, most current ANNs react the exact same way every time to the same input and same internal state, no matter the context. A living being, on the other hand, will make different decisions based on its context and on itself, even if that context is simple. As shown here, this can take the form of memory, or of neurotransmitters affecting the way the neurons fire, or the path the data takes in the circuits. I’ve chosen the latter approach, as I feel neurotransmitters aren’t addressed enough. The first study I mentioned chose to change the way the neurons fire, or more precisely how they transmit the same received information, so I went for the other approach and tried to make it compatible, both in theory and in terms of implementation. The goal here is to change the pathways and circuits used depending on context, just like your biological brain wouldn’t use the same areas or hemispheres for different tasks. We can hence say this is a step towards un-narrowing artificial intelligence, since multiple tasks can be handled by the same network.

Theory

Now that we have a goal set, we have to determine how we’re going to achieve it.

We mentioned key concepts above; they’re the features we need to add:

  • Context

  • Neural Pathways

  • Adaptation

To understand how we are going to change the current techniques, we first need to understand them.
We can sum up an ANN in two ways, and I’m going to keep both all the way through: one that relies on math, and one that’s more visual. (If you don’t see anything, it’s because I haven’t had time to make pretty visuals yet.)
$$
ANN(\bar x, θ, Φ) \approx \bar y
$$
Here, θ is all the weights of the ANN, and sometimes we have Φ, which represents the internal state. I chose Φ because it’s like Theta, but with an I for Internal down the middle.
A standard, linear layer can be written like so:
$$
L(\bar x, θ, Φ) = \sigma(\bar x \cdot \theta_{weight} + \theta_{bias}) \approx \bar y
$$
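
For reference, here is that layer as code. This is a minimal sketch (PyTorch assumed; the names are mine, and $\sigma$ could be any activation):

import torch

def linear_layer(x, theta_weight, theta_bias, sigma=torch.sigmoid):
    # sigma(x · W + b), exactly the equation above
    return sigma(x @ theta_weight + theta_bias)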

Now, let’s write a pseudo-equation for every point we’d like to implement:

  • Context:

    $L(\bar x, θ, Φ, ξ) = \space??? \approx \bar y$

    Here, we’re adding ξ (the Greek letter xi) to represent the context. Don’t worry, it’s just another tensor we have to process.
  • Neural Pathways:

    $L(\bar x, θ, Φ) = \sigma(\bar x \cdot \theta_{weight} + \theta_{bias}) * ? \approx \bar y$

    The data should be treated differently depending on something.
  • Adaptation

    $L(\bar x, θ, Φ, ξ_i) \neq L(\bar x, θ, Φ, ξ_j)$ for different contexts $ξ_i$ and $ξ_j$

See, it’s all coming together quite nicely; we just have to make a connection between these three properties.

Idea

Everything below is in the context of a single layer $L$.

Analyst

First, let’s have a circuit processing $ξ$, and deciding what to do with it. We’ll call it the Analyst $A$. We can define $A$ like so:
$$
A(ξ, θ_A, Φ_A) = \vec r
$$
Here, we’ve defined everything but $\vec r$. It is a vector representing how the work is split between the:

Operators

Now that we know what to do with the context, we still need to process the input data. Each operator $O_i$ (there will be multiple ones, to have multiple paths) can be defined like so:
$$
O_i(\bar x, θ_{O_i}, Φ_{O_i}) \approx \bar y
$$
I know what you’re thinking: “Hey! I’ve seen this one! It’s just a basic layer L in disguise.” And you’d be right. The only differences are that it tends to be smaller, and that it is never called alone: every Operator runs alongside the others, under the supervision of the Analyst.

Interaction Between Both

Okay, we have the output of every component of our Gate; how do we get its actual output?
Simply take the output of every Operator, multiply it by its share of the job (in $\vec r$), and take the sum of all that.
It is strongly recommended to apply softmax to $\vec r$, so that the total amount of work done sums to 1.

Introducing a little syntax: $[E_k]^K_{k=0}$ means a list of elements $E_k$, with the index $k$ running from $0$ to $K$.

The mini-outputs would look like so (I’ve shortened the output of $A$ to $A_{output}$ or $\vec r$, and likewise for the $O$s):
$$
\vec O_{output} = [O_{output_0}, …, O_{output_N}] \text{ or } \big[ O_{output_k} \big]^{N}_{k=0}
$$
$$
\hat y = \vec O_{output} \cdot softmax(\vec r)
$$
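
As a quick worked example (the numbers are arbitrary): with three Operators and $\vec r = [2.0, 0.5, 0.5]$,
$$
softmax([2.0, 0.5, 0.5]) \approx [0.69, 0.15, 0.15] \implies \hat y \approx 0.69 \cdot O_{output_0} + 0.15 \cdot O_{output_1} + 0.15 \cdot O_{output_2}
$$
so the first Operator does most of the work for this context, while the other two still contribute a little.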

Layer Equation

Buckle up, this is… well, Math. Really, it’s just the expanded, general version of what we’ve done so far.

$$
Gate(\bar x, θ, Φ, ξ) = \sigma \bigg( softmax\big( A(ξ, θ_A, Φ_A) \big) \cdot \Big[O_i(\bar x, θ_{O_i}, Φ_{O_i})\Big]^N_{i=0} \bigg)
$$

I hope I detailed it enough. Basically, it gives more or less power to each Operator depending on the context.
It should work as long as all of these are satisfied:

  • $A$ is a function
  • the $O_i$ are functions
  • $A$ outputs a vector of size $O_{size}$ (i.e. $N$, the number of Operators)
  • all $O_i$ have the same input/output sizes

Which makes it a very powerful and versatile concept (a minimal code sketch follows).
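
To make those conditions concrete, here is a minimal functional sketch, assuming PyTorch (the names gate, analyst, and operators are mine); the Analyst and the Operators can be any callables that satisfy the size constraints:

import torch

def gate(x, ctx, analyst, operators, sigma=torch.sigmoid):
    # r: one weight per Operator, derived from the context
    r = torch.softmax(analyst(ctx), dim=-1)
    # every Operator sees the same input x; stack their outputs on a new axis
    outs = torch.stack([op(x) for op in operators], dim=-2)  # [..., N, out_size]
    # weighted sum of the Operator outputs, then the Gate's activation
    return sigma((r.unsqueeze(-1) * outs).sum(dim=-2))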

One Implementation

Okay, here’s the equation, then I’ll simply explain it (I suffered greatly trying to write this):
$$
Gate(\bar x, θ, ξ) = \sigma_{Gate} \Bigg( softmax( ξ \cdot \theta_{A_{weight}} + \theta_{A_{bias}} ) \cdot \Big[\sigma_O(\bar x \cdot \theta_{O_{i_{weight}}} + \theta_{O_{i_{bias}}})\Big]^N_{i=0} \Bigg)
$$

Now, this may look horrendous, and it does, but all I’ve done is fill in the blanks:
I simply chose a concrete function for $A$ and for each $O_i$ (a plain linear layer in both cases).

Now, how would you actually code Gates?

Here’s a snippet (PyTorch-style):

def forward(self, inputs, context):
    # the Analyst turns the context into one work share per Operator
    vector_r = torch.softmax(self.analyst(context), dim=-1)
    # run every Operator on the same inputs, stacked on dimension 1
    stacked_ops = torch.stack([op(inputs) for op in self.operators], dim=1)
    # reshape vector_r so it broadcasts over the Operator output dims
    r_reshaped = vector_r.reshape(-1, vector_r.shape[1], *([1] * (stacked_ops.dim() - 2)))
    # weight each Operator's output by its share of the work, then sum
    results = r_reshaped * stacked_ops
    output = results.sum(dim=1)
    return output

(in-code tested, tried and true)

or

def forward(self, inputs, context):
    vector_r = torch.softmax(self.analyst(context), dim=-1)
    # stack the Operator outputs on dimension 0 (no batch dimension here)
    stacked_ops = torch.stack([op(inputs) for op in self.operators], dim=0)
    # weighted sum of the Operators via a dot product over that dimension
    output = torch.tensordot(vector_r, stacked_ops, dims=1)
    return output

(mathematically tested, never implemented)

Now, you just have to provide layers / functions for your Analyst and your Operators.
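
For example, reusing the gate sketch from the Layer Equation section (the sizes here are arbitrary), a Gate with a linear Analyst and four small linear Operators:

import torch
import torch.nn as nn

# hypothetical sizes: 100-unit input, 25-unit output, context of size 10
analyst = nn.Linear(10, 4)                          # context -> 4 work shares
operators = [nn.Linear(100, 25) for _ in range(4)]  # 4 small parallel paths

x = torch.randn(32, 100)              # a batch of 32 inputs
ctx = torch.randn(32, 10)             # the matching batch of contexts
y = gate(x, ctx, analyst, operators)  # shape: [32, 25]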

Usage and Tips

Okay, now that we have Gates, what do we do with them?
Well, remember: their goal is to increase complexity and adaptability. Also, since every layer has multiple Operators, you can afford to make the Operators smaller.

  • Smaller $\theta$

    Let’s compute the size of $\theta$, first for a normal linear model, then for a Gated linear model (for simplicity, $L_i$ will denote the number of units in layer $i$):
    For the linear model, each layer is just the size of its weights plus the size of its biases, ${input} * {output} + {output}$:
    $$
    \theta_{size} = \sum_i L_{i} * ( L_{i-1} + 1 )
    $$
    The Gated model is a little more complicated to quantify, but overall we have the same thing, a linear model, except that every layer now contains the Operators plus the Analyst (here $O_{i_j}$ is the size of Operator $j$ in layer $i$, $O_{i-1}$ the output size of the previous Gate, and $A_i$ the Analyst’s output size):
    $$
    \theta_{size} = \sum_i \bigg( \sum_j \Big( O_{i_j} * (O_{i-1} + 1) \Big) + A_i * (ξ_{i} + 1) \bigg)
    $$
    This seems like more, right? But what if we made two models with the same amount of units $U$, where each gated layer is divided into $N$ operators?
    The normal model stays the same: $\theta_{size} = \sum_i U_{i} * ( U_{i-1} + 1 )$
    The Gated one, on the other hand, changes: $$\theta_{size} = \sum_i \bigg( \sum_j \Big( \frac{U_i}{N_i} * (\frac{U_{i-1}}{N_{i-1}} + 1) \Big) + N_i * (ξ_{i} + 1) \bigg)$$
    There are still a lot of symbols, sure, but see that $\frac{U_i}{N_i}$? This can make your model far tinier.
    Let’s plug in some numbers, then I’ll show you a few graphs.
    Imagine a 4-layered model, every layer being $U_i = 100$ for the sake of argument, and we’ll split each layer into 4 operators. The input and output sizes will stay frozen at $100$, or else it’d be cheating. The context size will be 10; again, an arbitrary value, but we can expect the context to be about 10x smaller than the input.
    The normal model would be: $\theta_{size} = \sum^4_i 100 * ( 100 + 1 ) = 4 * (100 * (100 + 1)) = 40400$
    For the Gated model, the input and output layers keep their frozen sizes, so their terms differ from those of the two hidden layers; summing the three groups:
    $$\theta_{size} = \underbrace{\big[4 * \frac{100}{4} * (100+1) + 4 * (10 + 1)\big]}_{\text{input layer}} + \underbrace{2 * \big[4 * \frac{100}{4} * (\frac{100}{4} + 1) + 4 * (10 + 1)\big]}_{\text{hidden layers}} + \underbrace{\big[4 * 100 * (\frac{100}{4} + 1) + 4 * (10 + 1)\big]}_{\text{output layer}} = 25876$$
    I think we can all agree that $40400 > 25876$. If you think this is cherry-picked, here are a few graphs (and a small script after this section reproduces both totals):
    This graph plots the size of $\theta$ for a Gated model (blue) and a Normal model (orange) as the number of layers grows, with the same rules as before: $U_i=100$ and $O_{size}=4$.
    [Figure: $\theta_{size}(L_{size})$ (GateThetaVsNormalThetaLayerWise)]
    This one plots the size of $\theta$ with the same rules, but now it’s the number of units $U_i$ that changes.
    [Figure: $\theta_{size}(U_i)$ (GateThetaVsNormalThetaUnitWise)]
  • More Complexity

    Now, because of the Neural Pathways we’ve created, after each and every layer, the data has multiple paths it can take.
    Let’s say that, instead of $softmax$, we had a function that sets the highest value to 1 and the others to 0: $\sigma_{max}([.2, .4, .8, .7]) = [0, 0, 1, 0]$. This reduces the amount of possible paths, and still, we’ll see it’s quite impressive.
    After each layer $L$, the data can enter $N$ different operators. Since $\sigma_{max}$ doesn’t allow the other Operators to affect the data, we only have $N$ different pathways per layer.
    Though, as you may know, with more layers these choices branch out, and for every path there are $N$ new ones.
    The amount of pathways the data can take using $\sigma_{max}$ is therefore $N^L$, and we can all agree that $N^L > 1$: a normal ANN only has one path.
    Now, it’s impossible to count how many paths can be used with $softmax$ instead of $\sigma_{max}$, because there are infinitely many: every Operator contributes a little, in continuously varying proportions. You can still get a solid idea of how much it branches out as you increase $N$ and $L$.
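    For instance, with $N = 4$ Operators per layer and $L = 4$ chained Gates (the model from the parameter-count example above), that’s already $4^4 = 256$ distinct hard pathways, against exactly one for the equivalent plain network.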

Other benchmarks tend to go in this direction, but here we’re focusing on theory.
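
If you want to check the parameter arithmetic from the Smaller $\theta$ point yourself, here is a small sketch that reproduces both totals (the function names are mine; the size conventions follow the worked example above):

def dense_params(sizes):
    # weights + biases of a plain linear stack: sum of output * (input + 1)
    return sum(o * (i + 1) for i, o in zip(sizes[:-1], sizes[1:]))

def gated_params(op_sizes, n_ops, ctx_size):
    # op_sizes: per-Operator output size of each Gate, starting with the input size;
    # every Gate holds n_ops Operators plus one Analyst fed the context
    total = 0
    for i, o in zip(op_sizes[:-1], op_sizes[1:]):
        total += n_ops * o * (i + 1)      # the Operators
        total += n_ops * (ctx_size + 1)   # the Analyst
    return total

print(dense_params([100, 100, 100, 100, 100]))      # 40400
print(gated_params([100, 25, 25, 25, 100], 4, 10))  # 25876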

Pros & Cons

  • Pros

    • Compatibility

      Gates can be used along with any type of function or DL technology, since, by design, all they need is an Analyst function and a list of Operator functions.
    • Speed

      Since $\theta_{size}$ is so much smaller, and every Operator is relatively cheap, the feedforward pass is far faster with Gates for the same amount of units.
    • Complexity

      Already explained above: the multiple paths easily allow for non-linear computations, and the number of possible pathways grows exponentially ($N^L$) with the number of chained Gates.
    • Adaptability

      You can train your model on a large amount of tasks by changing the context fed in.
    • Potential

      Coupled with some optimization technique that prevents forgetting, which is a common issue in DL, the model could learn an enormous amount of tasks and reuse its Operators according to the situation without ever forgetting too much, which would be a huge step towards AGI. Also, coupled with Neural Expansion, these models could have a lifelong learning experience that’s greatly simplified by their ease of adapting to context.
  • Cons

    • ANNs forge their own pathways

      You could argue that normal deep ANNs already forge their own pathways, and that this is overkill.
    • Gates only work in chains

      Even though they don’t lose complexity when not chained, shallow networks and any lone Gate lose the speed advantage they might’ve had.
    • Input / Output inefficiency

      Theory hasn’t figured out exactly why yet, but input and output layers using Gates seem to benefit less from the upgrade. One plausible reason, visible in the worked parameter count above, is that their frozen input/output sizes prevent the $\frac{U_i}{N_i}$ shrinkage.