Thoughts on Origin Pull, S3 and CloudFront

I was chatting with Harper Reed last night about my recent migration of this blog from WordPress to Jekyll. Harper made a similar migration late last year (though his setup is hosted on Google App Engine, whereas mine is on a MediaTemple (ve) box). The general subject of CDNs, in particular Amazon CloudFront, came up when I mentioned that I had begun hosting all my post images on CloudFront and had seen an occasional issue where the resources just don't load.

Harper then introduced me to the concept of origin pull.

Origin Pull is the method of transferring data to the CDN automatically from a webserver as opposed to manually uploading the content. [...] For example, CNAME “httplinux.storagelayer.com” is associated with the origin “http://linuxorigin.storagelayer.com/cdn_content”. When a request is made for “http://httplinux.storagelayer.com/images/example.jpg”, the CDN will check its own cache for the file. If the file does not exist or the cache has expired on the file, it will request it directly from “http://linuxorigin.storagelayer.com/cdn_content/images/example.jpg” and will then cache the file based on your webservers cache control configuration. (Source: Softlayer)
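To make that flow concrete, here is a rough Python sketch of what an origin-pull edge does on each request. It is a toy model, not how any real CDN is implemented: the `OriginPullCache` class, the injected `fetch` function, and the fixed `ttl_seconds` are all my own assumptions for illustration (a real CDN would take the lifetime from the origin's cache headers).

```python
import time
from typing import Callable, Dict, Tuple


class OriginPullCache:
    """Toy model of a CDN edge that pulls from an origin on a cache miss."""

    def __init__(
        self,
        origin_base: str,
        fetch: Callable[[str], bytes],
        ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.origin_base = origin_base.rstrip("/")
        self.fetch = fetch              # hypothetical: performs the HTTP GET to the origin
        self.ttl_seconds = ttl_seconds  # assumption: fixed TTL instead of origin headers
        self.clock = clock
        self._cache: Dict[str, Tuple[float, bytes]] = {}  # path -> (expiry, body)

    def origin_url(self, path: str) -> str:
        """Map a request path on the CDN hostname to the matching origin URL."""
        return self.origin_base + "/" + path.lstrip("/")

    def get(self, path: str) -> bytes:
        now = self.clock()
        entry = self._cache.get(path)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: the origin is never contacted
        # Missing or expired: pull from the origin, then cache the result.
        body = self.fetch(self.origin_url(path))
        self._cache[path] = (now + self.ttl_seconds, body)
        return body
```

With the origin from the quote above, `OriginPullCache("http://linuxorigin.storagelayer.com/cdn_content", fetch).get("/images/example.jpg")` would hit the origin once and serve later requests from the cache until the TTL runs out.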

Until recently, you could not use custom origins with CloudFront. You would typically upload your data to an S3 bucket, create a CloudFront distribution pointing at that bucket, and attach a CNAME if you wished. That S3 bucket was the origin server for CloudFront, and that was the only way to do it.

Now you can set any server to be your origin server for a CloudFront distribution. For example, Cyberduck was updated to include Custom Origin Server support.

So what does this all mean for me? Well, I'm just uploading images, so not much. I like having my media on a separate host, but if I wanted to save the few bucks per month in S3 costs, storing it on the same server and using that as the origin server for CF would work well. Origin pull makes more sense for those with dynamic content, or for use cases where it is best to send media or files (which may be expensive to generate) out to the CDN only when they are first requested. For example, you could have a script that dynamically resizes images based on the request from the CDN and then lets the CDN host the result. You would just have to set the cache headers carefully so that repeated requests for the same resource are served from the CDN's cache instead of triggering redundant regeneration on the origin.
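Here is a minimal sketch of what the origin side of that resizing idea might look like. Everything in it is hypothetical: the `/resized/<width>x<height>/<name>` URL scheme, the `handle_resize_request` function, and the one-year `max-age` are assumptions of mine, and the actual image work is delegated to a `resize` callable you would have to supply yourself.

```python
import re
from typing import Callable, Dict, Tuple

# Hypothetical URL scheme: /resized/<width>x<height>/<name>
RESIZE_PATH = re.compile(r"^/resized/(\d+)x(\d+)/([\w.-]+)$")

# Assumption: resized images never change, so the CDN may keep them for a year.
ONE_YEAR = 365 * 24 * 60 * 60

Response = Tuple[int, Dict[str, str], bytes]


def handle_resize_request(
    path: str,
    resize: Callable[[str, int, int], bytes],
) -> Response:
    """Build the origin's response to a resize request pulled by the CDN."""
    match = RESIZE_PATH.match(path)
    if match is None:
        return 404, {}, b""
    width, height = int(match.group(1)), int(match.group(2))
    name = match.group(3)
    body = resize(name, width, height)  # the expensive step, run once per CDN miss
    headers = {
        # A long max-age tells the CDN to keep serving its cached copy
        # instead of coming back to the origin for the same image.
        "Cache-Control": f"public, max-age={ONE_YEAR}",
        "Content-Length": str(len(body)),
    }
    return 200, headers, body
```

The important part is the `Cache-Control` header: with a long lifetime on each generated image, the origin should only pay the resize cost when the CDN does not already have a fresh copy.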

Technically, I could host my entire blog on CloudFront, since it's all a bunch of static files and Amazon supports a default CF root object. However, I would take a huge SEO hit because I have a bunch of custom Apache .htaccess magic keeping my URLs sane and redirecting old permalink structures. On top of that, CF only works with CNAMEs, so I can't keep my preferred no-www URL structure without using a different DNS provider to handle that redirect. Unfortunately, Amazon's own Route 53 DNS service cannot do this (yet), but an Amazon engineer mentioned something about a "magic A record" and zone apex support (the latter only for Elastic Load Balancer users, boo) coming in the future.

The bad part about hypothetically hosting my entire blog on a CDN like CloudFront in the not-too-distant future? Really ugly and SEO-unfriendly 404 pages: "AccessDeniedAccess Denied76909F0642....".

Yeah, these kinds of random, whimsical "wouldn't it be neat if.." scenarios play out in my head all day long.

Do you use a CDN for any of your sites, web apps or projects?