
Building and Training Neural Networks in Emacs Lisp

Yes, you read that correctly. This blog post showcases nn.el, my (zero dependency) emacs lisp package that allows you to build and train neural networks. To answer the initial question as to “why on God’s green earth would you write a machine learning framework in emacs lisp?”, the answer is really just for fun (and to cause a bit of a splash; I mean, emacs lisp is the last thing you’d associate with AI right now). This particular flavour of lisp is typically only used to configure emacs, but you can actually do much more1, and that includes training neural networks. If you’re unfamiliar with neural networks, the section Primer on Neural Networks provides a quick overview of neural networks and their training; there are also vast resources available online on the topic.

Introducing nn.el

nn.el makes use of dynamic modules, that is, extra functionality written in C for use in emacs lisp programs. I’ve written dynamic modules that provide some basic functions to manipulate matrices, e.g. matrix multiplication, as well as some of the mathematical functions typically used by neural networks, e.g. softmax and ReLU.

The package consists of the following libraries:

matrix.c
This library (i.e. a dynamic emacs module) provides some basic functions to manipulate matrices (vectors of vectors in emacs lisp). This includes addition, subtraction, transpose, scalar multiplication, matrix multiplication, etc.
ops.c
This library provides some mathematical functions and operations typically used in machine learning. This includes operations like ReLU, softmax, etc.
nn.el
This is the main neural network library, written in emacs lisp. It uses the previous libraries to provide functionality to build neural networks, compute the output of a neural network for a given input, and train a neural network.

The package has zero dependencies besides emacs itself. The libraries matrix.c and ops.c are written in C so that outputs and gradients can be calculated reasonably fast. They have no external dependencies (e.g. no Python process is ever invoked!).
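
Because matrix.c and ops.c are dynamic modules, your Emacs needs to have been built with module support (the default in recent versions). One way to check is to inspect the built-in variable module-file-suffix, which is nil when modules are not supported:

;; Non-nil (e.g. ".so" on GNU/Linux) when this Emacs can load dynamic modules.
module-file-suffix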

The code is available here: https://github.com/namilus/nn.

Building Neural Networks with nn.el

nn.el has a simple API for building neural networks, modelled on TensorFlow’s Sequential model-building API. Below is an example:

(require 'nn)
(require 'matrix)
(require 'ops)

(setq model `(,(nn-layer 100 80 'ops-relu)
              ,(nn-layer 80 8 'ops-relu)
              ,(nn-layer 8 3)))

The above creates a 3-layer neural network that takes an input \(\textbf{x} \in \mathbb{R}^{100}\) and outputs a vector \(\in \mathbb{R}^3\). The first layer has 80 neurons, the second has 8, and the final layer outputs just 3. In total, this particular model has 8755 parameters, but nn.el allows you to build networks of arbitrary depth and width. The weights and biases of the model are initialised randomly from the uniform distribution \(U(-1,1)\).
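
If you want to double check that count: a layer built by nn-layer is a list whose first two elements are its weight matrix and bias (the gradient-descent helper later in this post relies on this), and matrices are vectors of vectors, so a small helper can sum up their sizes. The function below is my own sketch under those assumptions, not part of nn.el:

(defun my/nn-count-parameters (model)
  "Return the total number of scalar parameters in MODEL.
Assumes each layer is a list whose first two elements are the
weight matrix and bias, each represented as a vector of row vectors."
  (apply #'+
         (mapcar (lambda (layer)
                   (let ((w (nth 0 layer))
                         (b (nth 1 layer)))
                     (+ (* (seq-length w) (seq-length (seq-elt w 0)))
                        (* (seq-length b) (seq-length (seq-elt b 0))))))
                 model)))

;; (my/nn-count-parameters model) ; should return 8755 for the model above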

Next we see how we can calculate the output of the model on an input:

(setq x (matrix-random 100 1))
(setq y (matrix-transpose [[1 0 0]]))

(ops-softmax (nn-forward-layers x model))
[[0.2600744539110965] [7.306195822330744e-07] [0.7399248154693213]]

Here our \(\textbf{x}\) is a random vector, and the label is \([1 \ 0 \ 0]^T\). The output of our model on \(\textbf{x}\), given above, is not the same as our label; it is actually quite a bit off. Let’s calculate the loss between our model’s output and the label to see just how off it is:

(let* ((s (ops-softmax (nn-forward-layers x model)))
       (loss (nn-crossentropy y s)))
  (message "Initial loss: %s" loss))
Initial loss: [[1.346787327763626]]
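
For reference, a model that spread its probability uniformly over the 3 classes would have a cross entropy of \(-\ln \frac{1}{3} = \ln 3 \approx 1.10\), so our randomly initialised model is currently doing a bit worse than an uninformed guess:

(log 3) ;; ≈ 1.0986, versus our initial loss of ≈ 1.3468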

Training Neural Networks with nn.el

The initial loss value is high, so we’d like to train our model to better predict the label for our input. nn.el provides functionality to calculate the gradients of the loss with respect to a model’s parameters for a given batch of training examples. This is done with the nn-gradient function. We can use it to implement gradient descent and train our model:

(defun nn--apply-gradient-sgd-layer (grad layer)
  "Apply GRAD to LAYER using gradient descent."
  (let ((w (nth 0 layer))
        (b (nth 1 layer))
        (wg (nth 0 grad))
        (bg (nth 1 grad)))
    ;; 0.01 is the learning rate
    (setf (nth 0 layer) (matrix-subtract w (matrix-scalar-mul 0.01 wg)))
    (setf (nth 1 layer) (matrix-subtract b (matrix-scalar-mul 0.01 bg)))))

(dotimes (counter 100)
  ;; this is the train step; essentially just repeat this as many times as needed
  (let ((grads (nn-gradient x y model)))
    (seq-mapn #'nn--apply-gradient-sgd-layer grads model)))

The above code performs 100 gradient descent steps, tweaking the parameters of the model each time in order to improve its output on the input x. Now let’s see what the model outputs:

(ops-softmax (nn-forward-layers x model))
[[0.9990508146033623] [1.9576943624708923e-08] [0.00094916581969408]]

Wow! That’s much closer to our desired label than before! And it was reasonably fast2 too! Let’s see what the loss function returns:

(let* ((s (ops-softmax (nn-forward-layers x model)))
       (loss (nn-crossentropy y s)))
  (message "After training loss: %s" loss))
After training loss: [[0.0009496361583564819]]

Almost zero! Exactly what we wanted.
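
As an aside, the 0.01 learning rate and the 100 steps are hard-coded above. If you want to experiment with different values, they can be lifted into a small wrapper around nn-gradient; the function below is my own sketch, not part of nn.el’s API:

(defun my/nn-train (x y model steps lr)
  "Run STEPS rounds of gradient descent on MODEL for the example (X, Y).
LR is the learning rate used for both weights and biases."
  (dotimes (_ steps)
    (seq-mapn (lambda (grad layer)
                (setf (nth 0 layer)
                      (matrix-subtract (nth 0 layer)
                                       (matrix-scalar-mul lr (nth 0 grad))))
                (setf (nth 1 layer)
                      (matrix-subtract (nth 1 layer)
                                       (matrix-scalar-mul lr (nth 1 grad)))))
              (nn-gradient x y model)
              model)))

;; (my/nn-train x y model 100 0.01) reproduces the training loop above.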

Conclusion

We’ve now successfully built and trained a neural network with nn.el on some (albeit random) data. There’s nothing stopping us now from training on real, much larger datasets (e.g. the MNIST database of handwritten digits). I’ll save that for a future blog post, but the main functionality of nn.el has been showcased. You probably won’t be able to run an AI startup using it, but it is funny to say that your text editor can train neural networks. A longer blog post will follow that explains in more detail how nn.el works. The code is up on GitHub for anyone who is interested.

Primer on Neural Networks

What follows is a high level overview of neural networks and how they’re trained. If you’re already familiar, feel free to skim this section or skip ahead.

Neural networks consist of several layers of fully connected neurons. The output of an \(L\)-layer fully connected neural network with parameters consisting of weights and biases \(\textbf{W}_1, \textbf{b}_1, \textbf{W}_2, \textbf{b}_2, ..., \textbf{W}_L, \textbf{b}_L\) (each \(\textbf{W}_i \in \mathbb{R}^{d_{i-1} \times d_i}\), \(\textbf{b}_i \in \mathbb{R}^{d_i}\)) is given by \( \textbf{z}_L = \textbf{W}_L^T \textbf{a}_{L-1} + \textbf{b}_L\), where \(\textbf{a}_i = h(\textbf{z}_i)\), \(\textbf{z}_i = \textbf{W}_i^T\textbf{a}_{i-1} + \textbf{b}_i\), and \(h(x) = \mbox{max}(0,x)\) is the activation function (ReLU in this case).
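
To make the recursion concrete, here is a tiny self-contained toy example in plain emacs lisp (deliberately not using matrix.c or nn.el), computing the output of a 2-layer network on a 2-dimensional input; the weights and biases are made up purely for illustration:

(require 'cl-lib)

(defun toy-relu (v)
  "Apply max(0, x) elementwise to the list V."
  (mapcar (lambda (x) (max 0 x)) v))

(defun toy-dense (wt b a)
  "Return W^T a + b, where WT is W transposed, given as a list of rows."
  (cl-mapcar (lambda (row bi)
               (+ bi (apply #'+ (cl-mapcar #'* row a))))
             wt b))

;; x = (1, -2); a hidden layer with 2 ReLU neurons, then a linear output.
(let* ((a0 '(1.0 -2.0))
       (a1 (toy-relu (toy-dense '((0.5 0.1) (-0.3 0.8)) '(0.0 0.1) a0)))
       (z2 (toy-dense '((1.0 -1.0)) '(0.2) a1)))
  z2)
;; => (0.5): the hidden layer gives (0.3 0), then 1.0*0.3 - 1.0*0 + 0.2 = 0.5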

Let \(n = d_L\) denote the number of classes, and \(\textbf{a}_0 = \textbf{x} \in \mathbb{R}^{d_0}\), i.e., the input to the neural network. A training example for a neural network is a pair of input and label \((\textbf{x}, \textbf{y})\), where the input has \(d_0\) features, i.e., \(\textbf{x} \in \mathbb{R}^{d_0}\), and we want the neural network to classify this input as one of \(n\) classes. Typically, the label \(\textbf{y}\) is a vector of \(n\) elements, all zeros except for the true class, which is a one. These vectors are called one-hot vectors because they have a 1 at the index which encodes the label. For example, the label \(\textbf{y} = [0 \ 1 \ 0 \ 0]^T \in \mathbb{R}^4\) says that the corresponding input is labelled as the second class. These classes can encode anything, e.g., what is in an image \(\textbf{x}\), where \(\textbf{x}\) is a vector of the image's pixel values. In next-word prediction models like ChatGPT, these vectors encode what the next word should be, given the preceding words \(\textbf{x}\), encoded as a vector.

Training Neural Networks

Training a neural network on a training example \((\textbf{x}, \textbf{y})\) involves tweaking all the weights and biases \(\textbf{W}_i, \textbf{b}_i\) so that the output of the model on the input \(\textbf{x}\) matches the label \(\textbf{y}\). In order to do this, we first need a measure of how far off the model output is from the label. This is called the loss function, and cross entropy is typically used. Cross entropy measures how different two probability distributions are; in our case, the two distributions are the true label and the softmaxed model output. Softmax is a function that converts a vector into a probability distribution. More formally, the per-example cross entropy loss is denoted \(\ell(\textbf{x}, \textbf{y})\), and given a batch of training examples \( B = \{(\textbf{x}^{(1)},\textbf{y}^{(1)}), ..., (\textbf{x}^{(b)},\textbf{y}^{(b)})\}\) of size \(b\), the average loss is given by

\begin{equation} \mathcal{L}(B) = \frac{1}{b} \sum_{(\textbf{x},\textbf{y}) \in B} \ell(\textbf{x}, \textbf{y}) \end{equation}
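
Written out explicitly, softmax and the per-example cross entropy are typically defined as (with \(z_{L,j}\) denoting the \(j\)-th component of the final layer's output \(\textbf{z}_L\)):

\begin{equation} \mbox{softmax}(\textbf{z}_L)_j = \frac{e^{z_{L,j}}}{\sum_{k=1}^{n} e^{z_{L,k}}}, \qquad \ell(\textbf{x}, \textbf{y}) = -\sum_{j=1}^{n} y_j \log\left(\mbox{softmax}(\textbf{z}_L)_j\right) \end{equation}

Using the natural logarithm, this is consistent with the loss values printed by nn-crossentropy in the examples above.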

Training neural networks is done using gradient descent. This is an algorithm that allows us to find the values of \(\textbf{W}_i\) and \(\textbf{b}_i\) that minimise the loss function, i.e., the values of the parameters that result in an output matching the label. Gradient descent works as follows:

  1. Calculate the derivative of the loss with respect to the parameters (i.e., the gradients) \(\nabla_{\textbf{W}_i} \mathcal{L}(B)\) and \(\nabla_{\textbf{b}_i} \mathcal{L}(B)\).
  2. Update the parameters using the gradients, \[\textbf{W}_i := \textbf{W}_i - \eta\nabla_{\textbf{W}_i} \mathcal{L}(B)\] \[\textbf{b}_i := \textbf{b}_i - \eta\nabla_{\textbf{b}_i} \mathcal{L}(B)\]

The variable \(\eta\) is the learning rate, and dictates how fast we want to descend down the gradient. Typically, a value less than 1 is chosen. Gradient descent repeats steps 1 and 2 iteratively until the loss function returns a value that is sufficiently small e.g. close to zero.

Footnotes:

1

Emacs lisp is Turing complete.

2

Took around 2 seconds on a (reasonably) powerful i9 CPU.