Keeping the old blog/website around forever

I have a bad habit of registering domains and putting sites up. Some of them I want to keep around so I can share a trip or whatever with people, but I don’t want to bother with managing a server or keeping WordPress, Ghost, Django CMS, or whatever up to date. When a site gets to that point, it’s time to put it into internet archive mode.

Before you scrape

Before you get into scraping your site, you will want to shut down any dynamic parts of it, such as comments or user registration, since a static copy has no server behind it to process them. If you want to keep comments around, you may want to consider a hosted service like Disqus.

Screen scraping your own site

After doing a fair amount of research, and even after looking at building my own tool (just for educational purposes), I found out that it’s dead simple to crawl and scrape an entire website with wget. Here’s the exact command I use:

wget --recursive --convert-links --page-requisites example.com

You can read the docs for full details, but to help out with the specific flags used:

  • --recursive: Follows the links on your site and pulls down the pages they point to as well (up to five levels deep by default).
  • --convert-links: Rewrites the links in the downloaded files so they keep working locally or under a different path.
  • --page-requisites: Pulls down any additional files (images, stylesheets, scripts) necessary to properly display a page.
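
If you expect to archive more than one site, the command above can be wrapped in a small script. This is only a sketch: SITE, DRY_RUN, and mirror_site are names made up for this example, and the short flags are wget’s documented equivalents of the long ones. It defaults to printing the command so you can review it before crawling anything.

```shell
#!/bin/sh
# Sketch only: SITE, DRY_RUN and mirror_site are hypothetical names.
SITE="${SITE:-example.com}"

mirror_site() {
    # -r = --recursive, -k = --convert-links, -p = --page-requisites
    set -- wget -r -k -p "$SITE"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"   # dry run: print the command for review
    else
        "$@"        # real run: crawl the site
    fi
}

mirror_site
```

Run as-is, it prints the wget command. Run with DRY_RUN=0, it performs the crawl, and wget saves the files into a directory named after the host (./example.com here).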

Moving to Amazon S3

Take everything you’ve pulled down with wget and put it into an S3 bucket set up for static website hosting. If you want HTTPS support for your site, set up CloudFront to serve your site and use the AWS Certificate Manager to set up your certificate. The benefits here are pretty straightforward.

  1. You don’t have to manage a server or servers anymore.
  2. Traffic spikes are handled for you by S3 and CloudFront, so there are no more Auto Scaling groups to maintain.
  3. You don’t have to update your CMS or its plugins anymore to deal with vulnerabilities.
  4. No more expiring SSL certs! You don’t have to manage your own SSL certs anymore with Let’s Encrypt or another provider; ACM handles management and renewals automatically.
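
For the upload itself, one way to do it is with the AWS CLI. The sketch below assumes the CLI is installed and configured, and that the wget output is sitting in ./example.com. BUCKET, SRC_DIR, DRY_RUN, run, and publish are names invented for this example. It leaves out two things you would still need: a bucket policy (or CloudFront setup) that allows visitors to read the files, and the CloudFront and ACM configuration for HTTPS.

```shell
#!/bin/sh
# Sketch only: BUCKET and SRC_DIR are placeholder values.
BUCKET="${BUCKET:-my-archived-site}"
SRC_DIR="${SRC_DIR:-./example.com}"

# Print the command when DRY_RUN=1 (the default here), otherwise run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

publish() {
    # Create the bucket.
    run aws s3 mb "s3://$BUCKET"
    # Turn on static website hosting for it.
    run aws s3 website "s3://$BUCKET" --index-document index.html --error-document error.html
    # Upload the scraped files.
    run aws s3 sync "$SRC_DIR" "s3://$BUCKET"
}

publish
```

By default this only prints the three aws commands. Set DRY_RUN=0 once the bucket name and source directory look right.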