Download EMNIST manually

EMNIST is a classic image data set for machine learning. Sometimes the automatic PyTorch download fails, which bugs me. Here’s a quick guide to downloading the EMNIST data set manually and making it work with PyTorch.

Published: April 15, 2024

This is a quick reference for my future self; maybe it’s helpful for you as well.

TL;DR: A manual EMNIST download needs its directories renamed before PyTorch will load it. The expected layout is ./EMNIST/raw/<binaries>.

Problem: Automatic EMNIST download failed

Earlier today, I wanted to reproduce the results of a machine learning paper that uses the EMNIST digits data set to train a PyTorch model. Normally, PyTorch makes loading and even downloading data sets extremely easy for us. The torchvision.datasets module provides a handful of commonly used data sets with a user-friendly API. Most importantly for us right now, the data set loaders come with the convenient download=True argument to download a data set automatically:

import torchvision

train_data = torchvision.datasets.EMNIST(
  root="./", 
  split="digits", 
  train=True,
  download=True
)

Unfortunately, that throws a RuntimeError:

RuntimeError: File not found or corrupted.

Next, I tried to download the data directly from a URL via torchvision.datasets.utils.download_url(...). I found a handful of EMNIST URLs on the internet, but each one gave me either the same old File not found or corrupted error or an SSL error.

Fix: Manual download and directory adjustments

Here’s a brief list of steps for downloading the EMNIST data manually and then preparing the directory for torchvision.datasets.EMNIST(..., download=False).

Step 1: Download the files

Go to the official EMNIST website (Link) and head to the download labeled “Binary format as the original MNIST dataset”. Alternatively, here’s the link: EMNIST Direct Download Link

That archive, with the great name gzip.zip, is approximately 500 MB in size.

Step 2: Unpack the gzip.zip archive

Head to your project’s data directory (or global data directory if you have that) and unpack the previously downloaded gzip.zip archive there. You will get a folder gzip/ that contains a whole lot of *.gz files:

.
└── gzip
    ├── emnist-balanced-mapping.txt
    ├── emnist-balanced-test-images-idx3-ubyte.gz
    ├── emnist-balanced-test-labels-idx1-ubyte.gz
    ├── emnist-balanced-train-images-idx3-ubyte.gz
    ├── emnist-balanced-train-labels-idx1-ubyte.gz
    ├── ...
    ├── emnist-digits-mapping.txt
    ├── emnist-digits-test-images-idx3-ubyte.gz
    ├── emnist-digits-test-labels-idx1-ubyte.gz
    ├── emnist-digits-train-images-idx3-ubyte.gz
    ├── emnist-digits-train-labels-idx1-ubyte.gz
    ├── ...
    ├── emnist-mnist-mapping.txt
    ├── emnist-mnist-test-images-idx3-ubyte.gz
    ├── emnist-mnist-test-labels-idx1-ubyte.gz
    ├── emnist-mnist-train-images-idx3-ubyte.gz
    └── emnist-mnist-train-labels-idx1-ubyte.gz

EMNIST splits

You’ll notice a structure: There are different splits, encoded in the filenames as emnist-<split>-.... This <split> corresponds to the split=... argument in torchvision.datasets.EMNIST. For this project, I only needed the digits split, so I deleted the files of all the other splits.
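If you’d rather not unpack everything and delete files by hand, Steps 2 and the split filtering can be combined. The following is only a minimal sketch using Python’s standard zipfile module. It assumes the archive members are stored as gzip/emnist-<split>-..., as in the tree above; extract_split is a helper name I made up for illustration, not part of torchvision.

```python
import zipfile
from pathlib import PurePosixPath


def extract_split(zip_path, dest_dir, split="digits"):
    """Extract only the archive members that belong to one EMNIST split.

    Hypothetical helper: returns the names of the extracted members.
    """
    prefix = f"emnist-{split}-"
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            # Zip member names use forward slashes, e.g. "gzip/emnist-digits-...".
            if PurePosixPath(member).name.startswith(prefix):
                # Keeps the "gzip/" folder from the archive below dest_dir.
                archive.extract(member, dest_dir)
                extracted.append(member)
    return extracted
```

Calling extract_split("gzip.zip", "./", split="digits") should leave you with a gzip/ folder that holds only the digits files, including emnist-digits-mapping.txt.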

Step 3: Unpack the individual .gz files

Unpack all the *.gz files that you need. On macOS, the built-in archive tools can handle .gz files; YMMV on other systems. Delete the *.gz files once you’re done unpacking. You should now have the following structure:

.
└── gzip
    ├── emnist-balanced-mapping.txt
    ├── emnist-balanced-test-images-idx3-ubyte
    ├── emnist-balanced-test-labels-idx1-ubyte
    ├── emnist-balanced-train-images-idx3-ubyte
    ├── emnist-balanced-train-labels-idx1-ubyte
    ├── ...
    ├── emnist-digits-mapping.txt
    ├── emnist-digits-test-images-idx3-ubyte
    ├── emnist-digits-test-labels-idx1-ubyte
    ├── emnist-digits-train-images-idx3-ubyte
    ├── emnist-digits-train-labels-idx1-ubyte
    ├── ...
    ├── emnist-mnist-mapping.txt
    ├── emnist-mnist-test-images-idx3-ubyte
    ├── emnist-mnist-test-labels-idx1-ubyte
    ├── emnist-mnist-train-images-idx3-ubyte
    └── emnist-mnist-train-labels-idx1-ubyte
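
If your system has no convenient tool for .gz files, the unpacking in Step 3 can also be done from Python. Here is a small sketch using the standard gzip and shutil modules; gunzip_all and its delete_archives flag are names I picked for this example, and it assumes all the archives sit directly inside one folder.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_all(folder, delete_archives=True):
    """Decompress every *.gz file in `folder`, writing the result next to it."""
    unpacked = []
    # sorted() materializes the listing before we start deleting files.
    for gz_path in sorted(Path(folder).glob("*.gz")):
        target = gz_path.with_suffix("")  # drops only the trailing ".gz"
        with gzip.open(gz_path, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)  # streams the data in chunks
        if delete_archives:
            gz_path.unlink()
        unpacked.append(target)
    return unpacked
```

Running gunzip_all("./gzip") should give you the tree shown above. The mapping .txt files are left alone because they are not compressed.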

Step 4: Adjust the directory structure for PyTorch

If we try to load the data set into PyTorch with the download=False argument now,

train_data = torchvision.datasets.EMNIST(
  root="./", 
  split="digits", 
  train=True,
  download=False
)

we get the following error:

RuntimeError: Dataset not found. You can use download=True to download it

Well, the whole point of downloading manually was to circumvent the problematic download=True call.

As you might expect, we have to make PyTorch find our downloaded EMNIST data. That’s a two-step process: (1) We will make the EMNIST data fit the format that PyTorch expects; and (2) we will point PyTorch to where our EMNIST data lives.

(1) Required directory tree

PyTorch wants the following structure:

DATASET_NAME
└── raw
    ├── ...-mapping.txt
    ├── ...-ubyte

To achieve this, we simply rename gzip to raw and wrap the entire raw folder in a parent folder called EMNIST. Now your file tree should look like this:

EMNIST
└── raw
    ├── emnist-balanced-mapping.txt
    ├── emnist-balanced-test-images-idx3-ubyte
    ├── emnist-balanced-test-labels-idx1-ubyte
    ├── emnist-balanced-train-images-idx3-ubyte
    ├── emnist-balanced-train-labels-idx1-ubyte
    ├── ...
    ├── emnist-digits-mapping.txt
    ├── emnist-digits-test-images-idx3-ubyte
    ├── emnist-digits-test-labels-idx1-ubyte
    ├── emnist-digits-train-images-idx3-ubyte
    ├── emnist-digits-train-labels-idx1-ubyte
    ├── ...
    ├── emnist-mnist-mapping.txt
    ├── emnist-mnist-test-images-idx3-ubyte
    ├── emnist-mnist-test-labels-idx1-ubyte
    ├── emnist-mnist-train-images-idx3-ubyte
    └── emnist-mnist-train-labels-idx1-ubyte
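
The rename-and-wrap step is easy to do in a file browser, but it can also be scripted. This is just a sketch with pathlib and shutil, assuming the unpacked gzip/ folder sits directly below your data root; arrange_for_torchvision is a made-up helper name.

```python
import shutil
from pathlib import Path


def arrange_for_torchvision(data_root="./"):
    """Move <data_root>/gzip to <data_root>/EMNIST/raw and return the new path."""
    data_root = Path(data_root)
    source = data_root / "gzip"
    target = data_root / "EMNIST" / "raw"
    if target.exists():
        # Be conservative: never merge into or overwrite an existing raw/ folder.
        raise FileExistsError(f"{target} already exists, refusing to overwrite it")
    target.parent.mkdir(parents=True, exist_ok=True)  # creates EMNIST/
    shutil.move(str(source), str(target))  # gzip/ becomes EMNIST/raw/
    return target
```

I let it refuse to run when EMNIST/raw already exists, so that calling it twice by accident does not nest folders inside each other.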

(2) Point PyTorch to the correct path

Finally, the call to torchvision.datasets.EMNIST works as intended because the EMNIST folder sits directly below my current working directory ./:

data_root = "./"

train_data = torchvision.datasets.EMNIST(
  root=data_root, 
  split="digits", 
  train=True,
  download=False
)

If your EMNIST/ folder lives somewhere else (e.g., in a dedicated data/ folder), simply adjust data_root.
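
If you still get Dataset not found, it can help to check the layout before blaming PyTorch. The sketch below only looks for the image and label files of one split, using the file names from the trees above. That these two files are what the loader needs for a given split and train/test choice is my assumption, and missing_emnist_files is a hypothetical helper, not a torchvision function.

```python
from pathlib import Path


def missing_emnist_files(data_root="./", split="digits", train=True):
    """Return the expected raw files that are absent under <data_root>/EMNIST/raw."""
    raw_dir = Path(data_root) / "EMNIST" / "raw"
    part = "train" if train else "test"
    expected = [
        raw_dir / f"emnist-{split}-{part}-images-idx3-ubyte",
        raw_dir / f"emnist-{split}-{part}-labels-idx1-ubyte",
    ]
    # Anything listed here is either missing or sitting in the wrong folder.
    return [path for path in expected if not path.is_file()]
```

An empty list from missing_emnist_files(data_root) means the files are where this post says they should be. Any path it returns tells you which file to look for.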

Step 5: Profit!

Now off you go and make some fancy machine learning stuff with EMNIST! ✨

–Marvin

