Thursday, September 08, 2011

notMNIST dataset

I've taken some publicly available fonts and extracted glyphs from them to make a dataset similar to MNIST. There are 10 classes, with letters A-J taken from different fonts. Here are some examples of the letter "A".

Judging by the examples, one would expect this to be a harder task than MNIST. This seems to be the case -- logistic regression on top of a stacked auto-encoder with fine-tuning gets about 89% accuracy, whereas the same approach got 98% on MNIST.

The dataset consists of a small hand-cleaned part, about 19k instances, and a large uncleaned part, 500k instances. The two parts have approximately 0.5% and 6.5% label error rates, respectively. I got these estimates by looking through glyphs and counting how often my guess of the letter didn't match its unicode value in the font file.

The Matlab version of the dataset (.mat file) can be accessed as follows:
for i=1:5
The zipped version is just a set of PNG images grouped by class. You can turn the zipped version of the dataset into the Matlab version as follows:
tar -xzf notMNIST_large.tar.gz
python notMNIST_large notMNIST_large.mat
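For readers not using Matlab, here is a minimal Python sketch of indexing the unpacked archive. It assumes the archive unpacks into one folder per class (for example notMNIST_large/A, notMNIST_large/B, and so on); the function name `index_png_folders` is hypothetical, and decoding the pixel data itself would need an image library, which is not shown.

```python
import os

def index_png_folders(root):
    """Return (class_names, samples), where samples is a list of (path, label)."""
    # Each subfolder name (assumed to be a letter A..J) is treated as one class.
    class_names = sorted(
        d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d))
    )
    samples = []
    for label, name in enumerate(class_names):
        folder = os.path.join(root, name)
        for filename in sorted(os.listdir(folder)):
            if filename.lower().endswith(".png"):
                samples.append((os.path.join(folder, filename), label))
    return class_names, samples
```

Calling `index_png_folders("notMNIST_large")` would then give integer labels 0 to 9 in alphabetical folder order.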
Approaching 0.5% error rate on notMNIST_small would be very impressive. If you run your algorithm on this dataset, please let me know your results.


Anonymous said...

What does your baseline get on the negated version of the dataset? In other words, make the "ink" pixels have intensity 1 and the non-ink pixels have intensity zero. I would be curious to know if your baseline does better on one version or the other.

Yaroslav Bulatov said...

I don't expect it to make any difference -- pixel-level features are learned by the stacked autoencoder, and there's nothing biasing the learner to prefer 0's or 1's to start with.

Anonymous said...

It makes a difference on MNIST, which is why I asked.
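For anyone who wants to try the negated version, the transform itself is tiny. This sketch assumes intensities have already been scaled to [0, 1] and that an image is a nested list of rows; which of 0 or 1 means "ink" in the released files is not stated in the post, so check before relying on it.

```python
def negate(image):
    """Swap ink and background for an image with intensities in [0, 1]."""
    return [[1.0 - pixel for pixel in row] for row in image]

glyph = [[0.0, 1.0], [0.25, 0.0]]
negated = negate(glyph)
```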

Anonymous said...

Can we use this data for research? For instance, if we obtain relatively good results or propose something novel, are we allowed to publish anything on it?

Yaroslav Bulatov said...

ML_random_guy -- that depends on whether your country has laws against publishing

Anonymous said...

Hello, it's nice to have such a new challenging dataset. Do you recommend a specific evaluation protocol (number of training/test images) ? Otherwise people will work on different subsets and results will not be directly comparable.

Yaroslav Bulatov said...

Train on the whole "dirty" dataset, evaluate on the whole "clean" dataset.

This is a better indicator of real-life performance of a system than a traditional 60/30 split because there is often a ton of low-quality ground truth and a small amount of high-quality ground truth. For this task, I can get millions, possibly billions of distinct digital glyph images with 5-10% of labels wrong, but I'm stuck with a small amount of near-perfectly labeled glyphs.

Anonymous said...

Thanks for the protocol info. Would it be possible to get a tar archive with PNG images of the small dataset like the huge one ? I'm not using Matlab.

Yaroslav Bulatov said...

Oops, small tar should've been in the directory to start with, fixed

osdf said...

I used this dataset to test some of my code and got about 3.8% error rate. Are there more results known for this dataset? A few lines of text are here.

Yaroslav Bulatov said...

Hey, that's pretty impressive! This is the highest accuracy I know. I'm working on a larger dataset to release publicly, but slowed down by some legal clearance hurdles

Yaroslav Bulatov said...

How do you do finetuning? Hinton's contrastive wake-sleep?

goodfellow.ian said...

What is unicode370k.tar.gz?

Yaroslav Bulatov said...

It's a bunch of characters taken from the tail end of unicode values

Nicholas Leonard said...

How did you split your dataset into train,valid,test to get 89%?

Nicholas Leonard said...

Hi, Zhen Zhou and I, from the LISA lab at Université de Montréal, trained a couple of 4-layer MLPs with 1024-300-50 hidden neurons respectively. We divided the noisy set into 5/6 train and 1/6 valid, and kept the clean set for testing. We got 97.1% accuracy on the test set at epoch 412 with early stopping, linear decay of the learning rate, a hard constraint on the norm of the weights and tanh activation units. We get approximately 93% on valid and 98% on train. The train set is easy to overfit (you can get 100% accuracy on train if you continue training). One could probably do better by pursuing hyperparameter optimization further. We used Torch 7.
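Two of the ingredients mentioned above, the hard norm constraint and the linear learning-rate decay, can be sketched in a few lines of plain Python. This is an illustration, not the Torch 7 code used in the comment: the function names, the norm threshold, and the learning-rate values are all hypothetical, and it assumes the constraint is applied to one weight vector at a time.

```python
import math

def clip_norm(weights, max_norm):
    """Rescale a weight vector so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= max_norm:
        return list(weights)
    scale = max_norm / norm
    return [w * scale for w in weights]

def linear_decay(initial_lr, final_lr, epoch, total_epochs):
    """Interpolate linearly from initial_lr to final_lr over total_epochs."""
    fraction = min(epoch / total_epochs, 1.0)
    return initial_lr + fraction * (final_lr - initial_lr)
```

In a training loop, `clip_norm` would be applied after each weight update and `linear_decay` queried once per epoch.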

Georg Friedrich said...

I used a simple neural network (784, 1024, 10), where the activation function was ReLU, followed by a normal softmax. Without activation decay, pre stop, dropout & co., and with 3001 iterations and a batch size of 128, I got 89.3% accuracy on the test set.

Step: 3000
Minibatch accuracy: 86.7%
Validation accuracy: 82.6%

Finish (after Step 3001):
Test accuracy: 89.3%

Pavlos Mitsoulis - Ntompos said...

Minibatch loss at step 3000: 55.872269
Minibatch accuracy: 79.7%
Validation accuracy: 84.4%
Test accuracy: 90.6%

With a neural network with a single hidden layer (1024 nodes), Relu and l2 regularization.

Phạm T. Lâm said...

Minibatch loss at step 10000: 123.963661
Minibatch accuracy: 45.3%
Validation accuracy: 85.3%
Test accuracy: 91.4%

With dropout, ReLU and an L2 regularizer; single hidden layer of 1024 nodes.

Alec Karfonta said...

Yaroslav Bulatov,

Thank you for the fun and challenging dataset.

How were the names of the files chosen?

I'm working on renaming each one to the phash value of the image. It looks like the names might already be the result of a hash.

Alec Karfonta said...

Check out the Udacity course in deep learning, made by Google. They use this dataset extensively and show some really powerful techniques. The goal of the last assignment was to experiment with these techniques to find the best accuracy using a regular multi-layer perceptron. I have a pretty beefy machine: 6600K OC, 2x GTX 970 OC, 16gb DDR4, Samsung 950 Pro; so I set up a decent sized network and let it train for a while.

My best network gets:

Test accuracy: 97.4%
Validation accuracy: 91.9%
Minibatch accuracy: 97.9%

First I applied a Phash to every image and removed any with direct collisions. Then I split the large folder into ~320k training and ~80k validation. I used ~17k in the small folder for testing. Trained on mini-batches using SGD on the cross-entropy, dropout between each layer and an exponentially decaying learning rate. The network has three hidden layers with RELU units, plus a standard softmax output layer.
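The deduplicate-then-split step can be sketched with the standard library alone. Note the substitution: this uses a SHA-1 digest of the file bytes, which only catches byte-identical files, whereas a perceptual hash (as used above) also catches visually identical images stored differently and needs an extra library. The function names and the 0.2 validation fraction (chosen to mirror the ~320k / ~80k split) are assumptions.

```python
import hashlib
import random

def drop_exact_duplicates(paths):
    """Keep the first file for each distinct content hash."""
    seen = set()
    unique = []
    for path in paths:
        with open(path, "rb") as f:
            digest = hashlib.sha1(f.read()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(path)
    return unique

def split_train_valid(items, valid_fraction=0.2, seed=0):
    """Shuffle a copy of items and split off a validation portion."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]
```

Deduplicating before splitting matters because a duplicate that lands in both train and validation inflates the validation score.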

Here are the parameters:
Mini-batch size: 1024
Hidden layer 1 size: 4096
Hidden layer 2 size: 2048
Hidden layer 3 size: 1024
Initial learning rate: 0.1
Dropout probability: 0.5
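To put the network size in context, the number of trainable parameters follows directly from the layer sizes. The sketch below assumes 784 inputs (28x28 images flattened, as other comments in this thread use) and 10 output classes; the helper name is hypothetical.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected network."""
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix plus bias vector
    return total

# 784 inputs (assumed), three hidden layers, 10 classes.
n_params = count_parameters([784, 4096, 2048, 1024, 10])
print(n_params)  # 13714442
```

That is roughly 13.7 million parameters, most of them in the 4096-to-2048 layer.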

I ran this for 150k iterations, which took an hour and a half using one GPU. Learning pretty much stopped at 60k, but the model never began to overfit. I believe that is because of the dropout and because the dataset is so large. Even at the end of all that training with a good-sized network, the mini-batch accuracy still did not reach 100%, so learning could continue, albeit slowly.

The next assignment is to use a convolutional network, which looks promising. I'll try to post those results too.

Anonymous said...

Could you make your code available? Or at least say which parameters you used for the exponentially decaying learning rate? Did you use L2 regularization (if yes, with which regularization factor)? I tried to use the same network as you did and it simply doesn't converge.

Gabriel said...

Test accuracy: 98.09%

With a CNN layout as follows:

3 x convolutional (3x3)
max pooling (2x2)
dropout (0.25)

3 x convolutional (3x3)
max pooling (2x2)
dropout (0.25)

dense (4*N)
dropout (0.5)
dense (2*N)
dropout (0.5)
dense (N)
dropout (0.5)
softmax (10)

N is the number of pixels in the images. All layers use ReLU activation. I also used some zero padding before each convolutional layer. The network was trained with Adadelta. It took ~45 iterations with early stopping at patience 10. As a final step I ran SGD with the same early stopping and a decaying learning rate starting at 0.1. It ran about 15 iterations. Evaluating the network on the training set, the accuracy was 99.07%, and 94.25% on the validation set.
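"Early stopping at patience 10" can be read as: stop once 10 consecutive iterations fail to improve on the best validation score seen so far. A minimal sketch of that rule follows; the function name is hypothetical, and the exact criterion used in the comment above (for example whether ties count as improvement) is not stated.

```python
def early_stopping_epoch(validation_scores, patience=10):
    """Return (best_epoch, stop_epoch) for a sequence of validation scores.

    Training stops once `patience` epochs have passed without beating the best score.
    """
    best_score = float("-inf")
    best_epoch = -1
    for epoch, score in enumerate(validation_scores):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(validation_scores) - 1
```

In practice the weights from `best_epoch` are the ones kept, which requires saving a checkpoint whenever the validation score improves.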

杨健程 said...

Minibatch loss at step 4999: 0.901939
Minibatch accuracy: 75.0%
Validation accuracy: 87.3%
Test accuracy: 93.3% @step=4999
Model saved in file: save/myconvnet_5000

I used an architecture similar to LeNet, and it seems to get better as the number of steps gets larger.

Stephen Haptonstahl said...

Where can I download notMNIST? The link above goes to an account that has been suspended.

Stephan Koenig said...

Not sure if this is the complete dataset, but the Udacity course on Deep Learning using notMNIST provides the following links:

Dongseong Hwang said...

Test accuracy: 96.98%

With a CNN layout with the following configuration, which is similar to LeNet5.
However, there is a little difference:

convolutional (3x3x8)
max pooling (2x2)
dropout (0.7)

convolutional (3x3x16)
max pooling (2x2)
dropout (0.7)

convolutional (3x3x32)
avg pooling (2x2): according to above article
dropout (0.7)

fully-connected layer (265 features)
dropout (0.7)

fully-connected layer (128 features)
dropout (0.7)

softmax (10)

decaying learning rate starting at 0.1
batch_size: 128

Training accuracy: 93.4%
Validation accuracy: 92.8%

bf123 said...

Accuracy: 96.1 without convolution (assignment 3 in TensorFlow course)
Using Xavier initialization significantly boosted my results. Network specifications:
1. Batch size = 2048
2. Hidden units: 4096, 2048, 1024
3. Adam optimizer with 0.0001 learning rate
4. Dropout on each hidden layer
5. Xavier initialization
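For reference, a sketch of the uniform variant of Xavier (Glorot) initialization, which draws weights from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)). Which variant (uniform or normal) was used for the result above is not stated, so treat this as one possible reading; the function name is hypothetical.

```python
import math
import random

def xavier_uniform(fan_in, fan_out, rng=random):
    """Sample a fan_in x fan_out weight matrix from U(-limit, limit)."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

weights = xavier_uniform(784, 16)  # small layer so the example runs quickly
```

The scale shrinks as layers get wider, which keeps activation variance roughly constant from layer to layer at the start of training.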

Swapnil Paranjape said...


I'm trying to use TensorFlow to do character recognition. I am able to use your dataset (A-J) and get some data from the char74k dataset (K to Z) to train on character data and predict, but the char74k set is pretty limited and is not enough to get good accuracy. Have you posted anything similar for the characters K to Z?

그리스 said...

no convolution, 1 hidden layer 94.4 % with test set

batch size 128
L2 regularization beta 0 (no L2 regularization)
initialize w with deviation 0.03
initialize bias with all 0
Learning rate 0.5 (fix, not decay)
single hidden layer unit # 1024
dropout_keepratio 1 (no dropout)

I'm following udacity tutorial.
It's strange that whenever I add L2 regularization, dropout or learning rate decay, the test accuracy falls. I can't figure out why.

Adwaith Gupta said...

The test accuracy will fall if you choose a wrong value of regularization parameters. A beta of .005 gives good results.

Adwaith Gupta said...

My accuracy falls slightly after using dropout. Is there a possibility of wrong implementation of tf.nn.dropout() or is it a possible scenario?
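A slight drop with dropout is a plausible outcome, especially for a small network or a short training run. One thing worth checking is that dropout is switched off (keep probability 1) when computing validation and test accuracy. The sketch below shows "inverted" dropout, the scheme in which kept units are scaled by 1 / keep_prob during training; it is a plain-Python illustration with a hypothetical function name, not TensorFlow's implementation.

```python
import random

def dropout(activations, keep_prob, rng=random):
    """Inverted dropout: zero each unit with probability 1 - keep_prob
    and scale the survivors by 1 / keep_prob."""
    if keep_prob >= 1.0:
        return list(activations)  # evaluation mode: nothing is dropped
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]
```

Because of the scaling, the expected value of each activation is unchanged, so no extra rescaling is needed at evaluation time.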

vaisakh said...

Multi Layer Neural Net without convolution - Test Accuracy = 94.4%

3 Layer Neural Network(No convolution) = input-784, hidden-526, output=10
L2- Regularization with lambda(regularization parameter) = .001
Number of steps = 3000
Batch size = 500

vaisakh said...

2 hidden layers (total 4 layers) - without convolution - Test Accuracy = 95.8 %

4 Layer Neural Network (No convolution) = input-784, hidden1-960, hidden2-650, output-10
L2- Regularization with lambda(regularization parameter) = .0005
Number of steps = 75000
Batch size = 1000

Mustafa Mustafa said...

Minibatch accuracy: 93.2%
Validation accuracy: 91.2%
Test accuracy: 96.3%
After 10000 steps.

Two hidden layers:
num_hidden_nodes = 1024
num_hidden_nodes_2 = 100

Both with Relu inputs. Cross entropy + L2 regularization (beta = 1.3e-4).
SGD, batch size 400.
Most importantly, weights were initialized with truncated normal distro. with sigma = 0.01.

Exponential decay starting at 0.5, 0.65 decay_rate every 1000 steps.
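The schedule described above can be written out as a small function. This sketch assumes the "staircase" form, where the rate drops by a factor of 0.65 at every multiple of 1000 steps; whether the commenter used the staircase or the smooth form is not stated, so both are supported. The function name is hypothetical.

```python
def exponential_decay(initial_lr, step, decay_steps, decay_rate, staircase=True):
    """Learning rate after `step` updates: initial_lr * decay_rate ** (step / decay_steps)."""
    exponent = step // decay_steps if staircase else step / decay_steps
    return initial_lr * decay_rate ** exponent

for step in (0, 1000, 2000, 10000):
    print(step, exponential_decay(0.5, step, decay_steps=1000, decay_rate=0.65))
```

With these settings the rate falls from 0.5 to 0.325 after the first 1000 steps and keeps shrinking geometrically from there.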

stef mt said...

Using Keras on an average gaming laptop with moderate GPU, training took less than 2' on the full (udacity) training set of 200.000 samples, using 10.000 validation samples and measuring accuracy on separate test set of 10.000 samples.
With a simple multilayer network, I reached 96.66%

With KERAS, the code for the network itself is really simple:

batch_size = 128
nb_classes = 10
nb_epoch = 20

model = Sequential()

model.add(Dense(1024, input_shape=(784,)))






history = model.fit(train_dataset, train_labels,
                    batch_size=batch_size, nb_epoch=nb_epoch,
                    verbose=1, validation_data=(valid_dataset, valid_labels))
score = model.evaluate(test_dataset, test_labels, verbose=0)

Zafar Takhirov said...

If you want to generate your own dataset like notMNIST, you should try not_notMNIST

Roman Shchekin said...

My final result is 96.23% accuracy. Network architecture (built with Keras):



dense(128, relu)
dense(64, relu)
dense(10, softmax)

I used SGD with default params. I also got 92.03% on the valid dataset and 92.24% on the train dataset. It seems to be a general tendency that the test score is higher.

Dominik said...

97.2% on a fully connected net.

At last iteration, 100k:
Minibatch accuracy: 99.0%
Validation accuracy: 92.2%
Test accuracy: 97.2%

3 hidden layers, 4096 - 3072 - 1024, with relu and 0.5 dropout
Xavier weight init
Batch size 200
Data sets original (200k train, 10k valid, 10k test), no further preprocessing
Loss: softmax_cross_entropy_with_logits + L2 regularization on weights with weight of 1e-4
Learning rate 0.3 with decay of 0.96 every 1000 iterations
Total 100k iterations
[edit - I forgot the dropout on first post]
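The loss described above (softmax cross-entropy plus an L2 penalty on the weights) can be sketched in plain Python. The penalty here uses the convention beta * 0.5 * sum(w^2), which matches what `tf.nn.l2_loss` computes before scaling; whether the commenter used exactly this convention is an assumption, and all function names are hypothetical.

```python
import math

def softmax_cross_entropy(logits, label):
    """Cross-entropy of one example, computed with the log-sum-exp trick."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]

def l2_penalty(weight_matrices, beta):
    """beta * 0.5 * sum of squared weights (biases excluded)."""
    total = sum(w * w for matrix in weight_matrices for row in matrix for w in row)
    return beta * 0.5 * total

def total_loss(batch_logits, batch_labels, weight_matrices, beta=1e-4):
    """Mean cross-entropy over the batch plus the L2 penalty."""
    data_loss = sum(softmax_cross_entropy(z, y)
                    for z, y in zip(batch_logits, batch_labels)) / len(batch_labels)
    return data_loss + l2_penalty(weight_matrices, beta)
```

Subtracting the largest logit before exponentiating leaves the result unchanged but avoids overflow for large logits.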

Sergey Kojoian said...

Test accuracy: 96.12% with only 5000 iterations on a convolutional network with two conv layers and a final fully connected layer.
Minibatch of 50 images was used.

Gianni Casagrande said...

On a very simple 1 hidden layer network without regularization I also get:

Minibatch accuracy: 89.8%
Validation accuracy: 82.9%
Test accuracy: 89.8%

I've seen many other users reporting Test accuracy which is significantly higher than validation accuracy.
Validation and Test are the same size in my case. Is a higher test score reasonable or is it just chance? Should I consider the worst between test and validation as the expected performance of my network?
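One possible contributor to this gap, offered as an assumption rather than a confirmed explanation: in several setups described in this thread, the validation set is carved out of the uncleaned data (about 6.5% label errors) while the test set is the hand-cleaned part (about 0.5%). A model can then be scored as wrong on validation examples simply because the label is wrong. The sketch below estimates the measured accuracy for a hypothetical model with 97% true accuracy, assuming label errors are independent of the model's mistakes and spread uniformly over the other classes.

```python
def measured_accuracy(true_accuracy, label_error, num_classes=10):
    """Expected agreement between predictions and noisy labels.

    Assumes label errors are independent of the model's mistakes and that
    wrong labels and wrong predictions are uniform over the other classes.
    """
    both_right = true_accuracy * (1.0 - label_error)
    both_wrong_same = (1.0 - true_accuracy) * label_error / (num_classes - 1)
    return both_right + both_wrong_same

for name, noise in (("uncleaned (6.5%)", 0.065), ("hand-cleaned (0.5%)", 0.005)):
    print(name, round(measured_accuracy(0.97, noise), 3))
```

Under these assumptions the same model scores about 90.7% against the noisy labels and about 96.5% against the clean ones, a gap of similar size to those reported here. If that is the cause, the test score is the better estimate of real performance.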


Anonymous said...

I use 2 hidden layers and GradientDescentOptimizer, but the loss is nan. Why?

Alec Karfonta said...

Try to reduce your learning rate
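A toy illustration of why the learning rate matters, not the commenter's network: gradient descent on f(w) = w**2 multiplies w by (1 - 2 * learning_rate) at every step, so it converges for small rates and blows up once the rate is too large. In a real network the same kind of blow-up eventually overflows floating point and shows up as inf or nan in the loss.

```python
def gradient_descent(learning_rate, steps=20, w=1.0):
    """Minimise f(w) = w**2 with plain gradient descent and return the final w."""
    for _ in range(steps):
        gradient = 2.0 * w
        w -= learning_rate * gradient
    return w

small = gradient_descent(0.1)   # each step multiplies w by 0.8, so it shrinks
large = gradient_descent(1.1)   # each step multiplies w by -1.2, so it grows
```

Other common causes of a nan loss, such as taking the log of a zero probability, are worth ruling out too.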

马真金 said...

96.2% on a fully connected net.

steps, 200000:
batch 200 accuracy: 94.0%
test accuracy: 96.2%

Unknown said...

test acc 98.3%
mini-batch train acc 95.7%
val acc 94.2%

Techniques: "shallow" resnet (used val set to select arch), dropout, horizontal + vertical shift data augmentation, reduce lr on plateaus.


sandipan said...

Test Accuracy 95.5%
with batch size = 128, number of iterations = 10k
Three 5x5 convolution layers of depth 16, 32, 64 respectively
Three hidden layers with number of hidden nodes 256, 128 and 64 respectively
Dropout 0.7
Learning decay starting with 0.2 learning rate