---
title: "Week 7"
date: 2022-03-20T12:40:18+01:00
draft: false
---

Now that I had successfully run my model without any runtime errors, the next step this week was to find some GPU compute so that I could train it on much more powerful hardware and speed up training.

My first idea was to use cloud computing. There are machine-learning-specific cloud services, but I didn't want to use them because I didn't want my code to depend on the particular structure each platform expects. Instead, I wanted a general-purpose VM with an attached GPU where I could run my workloads manually. I had already written Docker images containing all of my code's dependencies, which I could deploy to these VMs to ensure a reproducible and portable environment.
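To give a rough idea of what such an image looks like, here is a minimal sketch. The base image, file names and the presence of a `requirements.txt` are assumptions for illustration, not my actual setup:

```dockerfile
# Minimal sketch of a training image (base image, file names and
# requirements layout are assumptions, not the actual project setup)
FROM pytorch/pytorch:1.11.0-cuda11.3-cudnn8-runtime

WORKDIR /app

# Install the pinned Python dependencies first so they are cached
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the training code into the image
COPY . .

# Default command: run the training script
CMD ["python", "train.py"]
```

The same image can then be pulled and run on any VM (or compute server) with Docker and the NVIDIA runtime installed, which is what makes the environment portable.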

The first place I looked was Linode. However, when I contacted their support they said I needed at least $100 of transactions on my account in order to request access to their GPU instances. They also noted that I could make a deposit of $100 to start using them straight away. I wasn't sure yet whether training my model would actually use up $100, so I didn't want to risk it.

I then looked at Microsoft's Azure. I had used Azure during my industrial year and had previously passed the fundamentals and associate administrator exams for the platform, so I felt fairly confident using it for my project. Still, I ran into some issues I couldn't explain at the start of the week: for some reason no services were available to me, and I couldn't create any VMs, networks, disks and so on. It turned out that my UK account was defaulting to creating resources in a US region that I didn't have access to, so I had to manually set my region to the UK before I could create any resources.

While I was trying to work out the issue with Azure, I looked at using GCP. GCP gives new accounts a GPU quota of 0, meaning you can't attach one to any VM. The quota can be increased, but this requires contacting customer support; within 10 minutes I got a response and my quota was increased by 1. I then wrote Terraform IaC (Infrastructure as Code) to spin cloud resources up and down quickly.
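To show what that IaC looks like, here is a minimal sketch of a GCP VM with one attached GPU. The resource name, zone, machine type, accelerator type and image below are illustrative assumptions, not my actual configuration:

```hcl
# Minimal sketch of a GPU VM in Terraform (names, zone, machine type and
# accelerator are illustrative assumptions, not my actual configuration)
resource "google_compute_instance" "training_vm" {
  name         = "training-vm"
  machine_type = "n1-standard-4"
  zone         = "europe-west2-a"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-11"
      size  = 100
    }
  }

  guest_accelerator {
    type  = "nvidia-tesla-t4"
    count = 1
  }

  # GPU instances cannot live-migrate, so they must terminate on maintenance
  scheduling {
    on_host_maintenance = "TERMINATE"
  }

  network_interface {
    network = "default"
    access_config {}
  }
}
```

Tearing everything down again is then just `terraform destroy`, which makes it easy to avoid paying for an idle GPU between training runs.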

At this point I realised my code wasn't very portable, as it included hard-coded absolute paths among other things, so I refactored a lot of it to run on any machine. I was also granted access to the university's GPU compute servers, which let me train my model without paying any cloud fees. To make sure the refactoring worked and that the GPU was actually being utilised during training, I ran the training script on the uni's GPU compute server for 20 epochs. It successfully finished in ~13 minutes.
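As a sketch of the kind of change involved (the flag names, defaults and directory layout here are assumptions, not my actual code), the idea was to replace hard-coded absolute paths with paths passed in on the command line:

```python
# Sketch of the refactor: paths come from CLI flags instead of being
# hard-coded absolute paths (flag names and defaults are illustrative)
import argparse
from pathlib import Path

import torch


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Train the model")
    # Before: something like DATA_DIR = "/home/me/project/data"
    parser.add_argument("--data-dir", type=Path, default=Path("data"))
    parser.add_argument("--output-dir", type=Path, default=Path("outputs"))
    parser.add_argument("--epochs", type=int, default=20)
    return parser.parse_args()


def main() -> None:
    args = parse_args()
    args.output_dir.mkdir(parents=True, exist_ok=True)

    # Use the GPU when one is available, otherwise fall back to the CPU
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Training for {args.epochs} epochs on {device}, data in {args.data_dir}")


if __name__ == "__main__":
    main()
```

On any given machine the script can then be pointed at wherever the data actually lives via `--data-dir`, rather than needing the directory structure of my own laptop.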