Getting the Million Song Dataset from an AWS-hosted snapshot

This post is related to my project at UVA: classifying songs into one of thirteen genres using song features and lyrical data. Most music genre classification projects use the GTZAN dataset, which consists of 1000 songs, with 100 songs for each of its 10 genres.

However, we are ambitious people. We found the Million Song Dataset, published by the LabROSA group at Columbia. It was freely available as a snapshot hosted on AWS.

I already knew how to connect to an AWS EC2 instance and run a Jupyter notebook on it. For more details, check out my post here.

The other part was figuring out how to attach the snapshot to the instance. It turned out to be fairly easy.

Here are the steps:

1. Choose the Amazon Machine Image of your choice. I chose the most basic one, since I knew I could install Anaconda at any time, although that is time-consuming. You can also go for one of the Deep Learning AMIs.
2. Go to Add Storage. Attach the Million Song Dataset to your EC2 instance by clicking Add New Volume and searching for the Million Song snapshot. Make a note of the device name. It's /dev/sdb, as shown here. On the instance itself, the kernel may expose the volume under a different name such as /dev/xvdb, which is the name used in the mount command below.
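Since the name chosen in the console (/dev/sdb) and the name used on the instance (/dev/xvdb) need not match, a quick check after logging in can save a failed mount. The helper below is only a sketch: find_volume is a hypothetical name, and the candidate paths you pass in are guesses that depend on your instance type and kernel. Running lsblk shows every block device the instance actually sees.

```shell
# Hypothetical helper: print the first candidate path that exists as a block device.
# The candidate names are assumptions; pass whichever ones apply to your instance.
find_volume() {
    for candidate in "$@"; do
        # -b is true only if the path exists and is a block device
        if [ -b "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    echo "none of the candidate devices exist: $*" >&2
    return 1
}

# Example (not run here):
#   DEVICE=$(find_volume /dev/sdb /dev/xvdb) && echo "mount $DEVICE"
```

If the function prints nothing and returns an error, the volume is probably not attached, or it shows up under a name that is not in your candidate list.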

Then enter the following commands one after another.

sudo mkdir /mnt/snap

sudo mount -t ext4 /dev/xvdb /mnt/snap

If all goes well it should look something like this:

Et voilà, you are done.
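If you would rather confirm the mount from the command line, something like the following should do. It is a minimal sketch: show_mount is a hypothetical name, and it assumes the mountpoint utility from util-linux is installed, which is the case on most Linux AMIs.

```shell
# Hypothetical helper: report disk usage for a directory only if it is an active mount point.
show_mount() {
    # mountpoint -q exits 0 when the directory is a mount point, non-zero otherwise
    if mountpoint -q "$1"; then
        df -h "$1"
    else
        echo "$1 is not mounted" >&2
        return 1
    fi
}

# Example (not run here):
#   show_mount /mnt/snap && ls /mnt/snap
```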

Now, see this Jupyter notebook for getting the data into a CSV file.
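Before opening the notebook, it can be handy to have a list of the track files on the mounted volume. The sketch below is not what the notebook does; it only builds a one-column CSV of file paths. It assumes the tracks are stored as HDF5 files with a .h5 extension somewhere under the mount directory, and index_h5 is a hypothetical name.

```shell
# Hypothetical helper: write a one-column CSV (header "path") of all .h5 files under a directory.
# Assumes file paths contain no commas or newlines, so no CSV quoting is done.
index_h5() {
    echo "path"
    find "$1" -type f -name '*.h5' | sort
}

# Example (not run here):
#   index_h5 /mnt/snap > ~/msd_files.csv
```

On the full dataset this walks a very large number of files, so expect it to take a while.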
