How To: WordPress to Jekyll

Over the winter vacation I relaxed and spent time away from thinking about the startup fundraising process. I had a latent interest in Jekyll ever since Tom Preston-Werner (Cofounder of GitHub) created, released and wrote about it in Blogging Like a Hacker. I finally got the chance to sit down and fully grok Jekyll. I decided to migrate my site from WordPress to Jekyll. I designed a new layout, imported my database and created some custom features to get it all to my liking. If you're reading this in an RSS reader, you will want to click through to see the new site.

PaulStamatiou.com redesign with Jekyll Yup, still love my Kindle.

(Disclaimer: At over 8,000 words this is the longest post I have written (runner up). If you have any questions or if anything in here is presented in a confusing manner, or is just plain wrong, please don't hesitate to leave a comment or send me an email. I will clean up my Jekyll blog repo soon and make it public.)

Why the big move?

Like any other hacker I just wanted to learn a new tool.

I have been running this blog on WordPress for about 5.5 years. I first ran into WordPress when I was running a MediaWiki-powered website about computer modding and was curious about other CMSs. I fell in love with WordPress and set it up on my 1.42GHz G4 Mac Mini (which I had overclocked to a whopping 1.5GHz by unsoldering two SMT resistors). My first few months of blogging were surreal — several of my posts made it on digg and Lifehacker in 2005. I thought that was the coolest thing ever. I continued writing.

I soon received enough traffic to kill the 2.5-inch hard drive in my Mac Mini, which had been hosting this blog in my Georgia Tech dorm. It was around this time that I began modifying my site and picking up basic web design and development skills. I moved my blog over to Media Temple and have been happy there since. I am now on a developer-aimed VPS called the ProDev (ve) — 4GB of RAM on a dual quad-core 2.26GHz Xeon with lucid-flavored Ubuntu.

I am not leaving WordPress for any frustrations or problems. WordPress is a very capable and extensible CMS. Most anything you want done with WordPress is a search away. If you think your WP-powered site is slow, there are a number of fixes for the speed freaks, including generating static files and working with a CDN. The WordPress community has been amazingly helpful. I went to WordCamps, sponsored one and helped out other WP users through tutorials and forums. A big thank you to folks like Matt Mullenweg, Michael Heilemann and Mark Jaquith.

Content is king

I wanted to start from scratch and completely rebuild this blog. This time around I was going to focus on content. Over the last few years I have experimented with monetizing this site, as you may have noticed. I tried various affiliate programs, CPA ads, AdSense, Amazon Associates, RSS ads and private sales. It was starting to work. I wasn't the next John Chow but I was making $3,000 a month during this blog's financial peak around mid-2009.

Then it went down to a few hundred a month over the next year and a half. There are a few reasons for that:

I began posting less and less due to increasing startup obligations. That's to be expected. This blog is a great side-project and hobby but not a full-time job.
This blog had become known for in-depth, long-form content and as such I didn't blog unless I knew I had enough substance to make for an interesting article. Things I would have written about, but to a lesser extent, would go unpublished entirely.
I made a few SEO mistakes that knocked down my PageRank, and thus my traffic. The first was using an early version of bbPress to run forums here. That version did not add rel="nofollow" to website links on user profile pages. Tens of thousands of spammers signed up and those profile pages linked to all sorts of sites. Google did not like that. My second mistake was running a translation plugin that generated every article in many languages. It eventually got to the point where Russian-translated posts would rank higher than the English posts and it became impossible for readers to find articles. I couldn't even find my own posts on Google, even when I knew the exact title. Third, I had some mysterious redirection issue that plagued my site for too long. Pages would randomly redirect. I couldn't reproduce it, and neither could Media Temple. But I got tons of reports about it from my readers. I thought I had fixed it but it still occurred once every 50 or so page loads. Google ended up indexing articles with different URLs. It was a mess.
Google noted that my site loaded 88% slower than other sites. It was partially all the ads my site was running and partially all the images I put in my long posts.

Traffic also dwindled during this period. It was time to reboot. Whatever I was going to do needed to change all this and get me blogging more, one way or another. I fixed that with bits. More on that later.

What is Jekyll?

Oops, guess I didn't explain exactly what it is yet. Here's what the repository says:

Jekyll is a simple, blog aware, static site generator. It takes a template directory (representing the raw form of a website), runs it through Textile or Markdown and Liquid converters, and spits out a complete, static website suitable for serving with Apache or your favorite web server.

Jekyll is not really a CMS. There is no admin panel to edit, write and manage posts. But there is vim, emacs, TextMate, gEdit, Redcar, Notepad++ or your text editor of choice. In a nutshell: write your posts in markdown or textile (or in my case just keep using HTML like I've always been using in my posts and keep it future-proof), run jekyll and it will create a _site directory filled with beautiful static, flat HTML files. You don't even need a database on your server...

sudo apt-get remove mysql-server-core-5.1

I freaking love having my entire site in static files. It's a nerd feeling that's hard to explain. As a Mac user, I just need to activate Spotlight with two keystrokes and I can instantly find any old blog post.

Searching for Jekyll posts in Mac OS X Spotlight

You'll want to tell Spotlight to ignore the _site directory. Seductive wallpaper by Silk.

Or while I'm in my editor of choice I can activate PeepOpen to find and open any post. Or I can run Ack in Project in TextMate to find any string inside of my posts. Or maybe I was curious and wanted to list the top 5 posts by word count?

~/jekyll-blog/_posts(master)  wc -w * | sort | tail -n6
    4182 2010-03-05-review-23andme-dna-testing-for-health-disease-ancestry.markdown
    4189 2010-02-17-live-blogging-startup-riot-2010.markdown
    4777 2009-06-04-review-2009-lincoln-mks-with-microsoft-sync.markdown
    4888 2010-01-14-how-toreview-surf-securely-with-vyprvpn.markdown
    6393 2009-11-27-review-2011-ford-fiesta-and-the-fiesta-movement.markdown
  576265 total

Flat files are just cool.

The Plan

After I was sure that I wanted to embark on this journey I had to think about how this would all work and what sacrifices I would have to make. I would need to implement some custom stuff to get some features and pages I was used to with WordPress. It was also important that I kept the exact same URL structure.

Here was the initial list of tasks that had to be completed/built:

Import WordPress database and retain tags
Move all images to Amazon CloudFront and rewrite posts to use new image URL
Make HTML files for regular pages and 404/500
Make a search page using Google Custom Search
Be able to parse the <!--more--> tags used in WP posts and show the post up to that tag for previews.
Create entire new layout and use Typekit because it makes the wannabe-designer in me happy.
Figure out layouts and includes for various parts of the site
Tags and individual tag pages + fix Jekyll issue with not supporting tags with spaces
Archives listed by month/year and individual archive pages to keep URLs like /2011/01
Sitemap that lists posts, archive pages and tag pages
Create a feed template and ensure it correctly redirects to FeedBurner
Create include in feed so I can put RSS ads at some point
Compass for Sass
List related posts
Do lots of .htaccess work to make sure URLs are as close to the old structure as possible and use link rel="canonical" where appropriate.
Ditch Mint for web stats and go database-free with something like Chartbeat or Reinvigorate (I also have Google Analytics)
Migrate all comments to Disqus
Get next/previous post links working
Create new section of the site for short-form content, create separate archives page and feed
Be able to put custom meta descriptions from content in YAML front matter in posts if wanted.
Write a rakefile to ease some routine tasks like generation and deploy
miscellany...

A lot of work was ahead.

Getting started with Jekyll

The first thing I did was create a new GitHub repository for the blog. Then I had to begin creating the file and directory structure Jekyll expects. Here's what my Jekyll directory looks like:

~  tree ~/jekyll-blog/ -L 2
/Users/Stammy/jekyll-blog/
├── 404.html
├── 500.html
├── README.markdown
├── Rakefile
├── _bits
│   └── # some of my bit-style posts, not a standard jekyll feature
├── _config.yml
├── _drafts
│   └── # ideas for next blog posts/bits
├── _includes
│   ├── bit_listing.html
│   ├── comments.html
│   ├── cta.html
│   ├── header.html
│   ├── load_last_js.html
│   ├── nav.html
│   ├── post_footer.html
│   ├── post_listing.html
│   └── rss_footer.html
├── _layouts
│   ├── bit.html
│   ├── default.html
│   ├── home.html
│   ├── page.html
│   └── post.html
├── _lib
│   └── wordpress_import.rb
├── _posts
│   └── # lots of posts
├── _site
│   └── # generated site goes here. don't manually put anything here
├── about.html
├── apple-touch-icon.png
├── archives.html
├── bits-feed.xml
├── bits.html
├── config.rb
├── contact.html
├── favicon.ico
├── feed
│   └── index.xml
├── index.html
├── robots.txt
├── sass
│   └── screen.sass
├── search.html
├── sitemap.xml
└── stuff-i-use.html
11 directories, 2318 files

Rather than creating all these files from scratch, a good first step is to fork someone else's Jekyll blog and modify as you see fit. My Jekyll repository isn't public yet since I still have a lot of cleaning up to do, but Tom Preston-Werner's Jekyll repo is popular for forking. Just be sure to remove all of his posts and images before you publish your new site.

The directory tree listed above is not what you put in your web server's public html directory. Instead, you point the web server to the _site directory. That's where the entire generated site and static HTML files are stored. Posts written in markdown, textile or regular HTML go in _posts while _layouts and _includes are reserved for HTML/Liquid markup layouts for various content types (page, post, homepage, etc) and any necessary HTML fragments/partials, respectively. As you can imagine, the _drafts folder is where you can stash posts you don't want to be generated and published until you're ready to move them to the posts folder.

Jekyll pays close attention to files that contain YAML Front Matter and Liquid template tags. A YAML front matter block at the beginning of any file can contain custom page variables as well as predefined ones such as: layout, title, date, tags and categories.

---
layout: post
title: "How To: WordPress to Jekyll"
tags:
- wordpress
- jekyll
- ruby
- github
description: "This is a custom variable that I access via Liquid..."
---
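To make the mechanics concrete, here is a minimal Ruby sketch of how a file with front matter can be split into its YAML variables and its body. This is not Jekyll's actual implementation; the split_front_matter helper and the sample post are hypothetical and only illustrate the idea.

```ruby
require 'yaml'

# Hypothetical helper, not part of Jekyll: splits a file's text into
# a hash of front matter variables and the remaining body.
def split_front_matter(text)
  # Front matter must start on the first line and be fenced by '---' lines.
  match = text.match(/\A---\s*\n(.*?)\n---\s*\n/m)
  return [{}, text] unless match

  data = YAML.safe_load(match[1]) || {}
  [data, match.post_match]
end

sample = <<~POST
  ---
  layout: post
  title: "How To: WordPress to Jekyll"
  tags:
  - wordpress
  - jekyll
  ---
  <p>Post body goes here.</p>
POST

data, body = split_front_matter(sample)
puts data['title']        # the title string from the YAML block
puts data['tags'].inspect # the tags as a Ruby array
puts body                 # everything after the closing '---'
```

A file without a leading front matter block comes back with an empty hash and its text untouched, which roughly matches how Jekyll leaves such files alone.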

Liquid, on the other hand, is a templating language created by the Shopify folks that makes for easy layout creation. Liquid tags are bound either by curly braces and percent signs ({% ... %}) or by double curly braces ({{ ... }}). The latter is for outputting content while the former is for conditionals and setting up loops. Here is part of my index.html file:

Jekyll layout: liquid yaml example

You can see it has a bit of YAML front matter, then it includes a file (my yellow "call to action" bar) that is stored in the _includes directory, then creates a few post loops and outputs content. I have two loops here because I want the first post displayed differently (that's what the post_listing.html include is for) and then the rest displayed in a simple list.

Here's what the post_listing.html include looks like:

Jekyll liquid post include

While in the site.posts loop, this include has access to the template data for each post. There are some sections where variables are piped through filters. Several are included with Jekyll, such as date_to_string.

You can also specify no layout with "nil" and still access template data. For example, here's how I created my atom feed.

That's the basic gist of Jekyll layouts and templating. My particular setup is a bit more complex with 5 layouts and 9 includes, whereas Tom's blog contains just two layouts and no includes. Learn from his setup and expand as you see fit! I'm purposely being a bit brief here as getting the layout setup is pretty straightforward.

_config.yml

Jekyll's config file is a good place to start while building out your site. The default configuration and various configuration settings are explained on the Jekyll wiki. The default settings should be satisfactory, but you'll want to set markdown to rdiscount (explained later) and adjust the permalink style.

You may opt to put in various custom variables like I did with description and root_desc, which I use in various parts of my layouts. Also, base_url is handy. When you're developing locally you can keep it set to a forward slash so that the generated site links to other local posts, but when you're ready to deploy live, change it to your domain. There is no definitive argument for including the full domain versus using relative URLs (SEOs go both ways on it), but I simply prefer including the entire URL.

multiviews: true
#only works with https://github.com/stammy/jekyll, explained below

source:      .
destination: ./_site
includes:    ./_includes

pygments:    true
markdown:    rdiscount
permalink:   /:title.html

base_url: https://paulstamatiou.com
#base_url: /
description: "A blog covering Tech News, Reviews, Guides and Startups from developer and startup guy Paul Stamatiou."
root_desc: "PaulStamatiou.com - Tech News, Reviews and Guides"

rdiscount:
  extensions: []
  
exclude: ['Rakefile', 'README.markdown', 'config.rb']

Local Dev Environment

By now you'll want to actually install Jekyll itself. If you don't plan on doing any Jekyll hacking and will just be using it as is, you can just use the ruby gem:

sudo gem install jekyll

If you think you'll be doing some Jekyll hacking of your own, or using someone else's fork (there are tons of great Jekyll forks to be found), it's best to fork, clone and add the path to your bash profile. For example, cloning and running my Jekyll fork:

~  mkdir -p ~/Projects/jekyll-stammy
~  git clone git@github.com:stammy/jekyll.git ~/Projects/jekyll-stammy
Cloning into /Users/Stammy/Projects/jekyll-stammy...
remote: Counting objects: 3341, done.
remote: Compressing objects: 100% (1727/1727), done.
remote: Total 3341 (delta 2053), reused 2625 (delta 1482)
Receiving objects: 100% (3341/3341), 379.48 KiB, done.
Resolving deltas: 100% (2053/2053), done.

Now you just have to add that freshly-cloned Jekyll to your path.

vim ~/.bash_profile

Add this line near the end, editing the path and directory name for your fork accordingly:

# Jekyll (use dev version, not gem)
export PATH=/Users/Stammy/Projects/jekyll-stammy/bin:$PATH

Then run source ~/.bash_profile. Or you could use caret quick substitution: ^vim^source. It edits the last entered command but replaces "vim" with "source" and runs it. Handy tip for your bash repertoire.

RDiscount

Discount is a fast C implementation of John Gruber's Markdown markup language, while the RDiscount extension makes this Discount Markdown processor available via a Ruby C Extension library. If you write posts in Markdown, you will need RDiscount to process the markup and convert the post to HTML. I don't usually write in Markdown, but I like that it does basic things like add <p>'s for separate lines. There are other options like Maruku instead of RDiscount but I would not recommend it. Maruku was slow for me and didn't know how to render some of my post markup, resulting in errors like "REXML could not parse this XML/HTML". Stick with RDiscount and you should be fine:

sudo gem install rdiscount

If you use RDiscount you'll have to use Pygments for code syntax highlighting (what I use in this post). There are other options as well, such as CodeRay, which appears to only work with kramdown (a pure-Ruby Markdown converter that is slower than RDiscount, though CodeRay is faster than Pygments). Or you can just use GitHub Gist embeds for all your code needs.

I went with RDiscount + Pygments and I've been happy with it so far. You'll need easy_install (a package manager like RubyGems but for Python) to install Pygments if you do not have it yet.

sudo easy_install Pygments

jekyll.dev

The next thing I did was set up Jekyll's _site directory with Apache on my MacBook Pro for easy local development. Jekyll comes with its own web server that's great for local testing (jekyll --server) but I prefer setting up a vhost.

You probably know the drill with adding directories and virtual hosts to Apache but here's a refresher. Open up the Apache conf file in your editor of choice. This can be found at /etc/apache2/httpd.conf in OS X. Add the sections below, editing the name of your jekyll directory and ServerName as necessary. The VirtualHost section usually goes at the very end of httpd.conf. If you want to be in good form you can create another .conf file altogether instead of editing the main Apache conf, but I digress.

<Directory "/Library/WebServer/Documents/jekyll-blog/_site">
    Options Indexes FollowSymLinks MultiViews
    AllowOverride All
    Order allow,deny
    Allow from all
</Directory>
<VirtualHost *:80>
    ServerAdmin webmaster@dummy-host2.example.com
    DocumentRoot "/Library/WebServer/Documents/jekyll-blog/_site"
    ServerName jekyll.dev
    ErrorLog "/private/var/log/apache2/jekyll.dev-error_log"
    CustomLog "/private/var/log/apache2/jekyll.dev-access_log" combined
</VirtualHost>

The jekyll-blog directory doesn't actually live in /Library/WebServer/Documents/ on my setup; I prefer to have jekyll-blog in my home directory and just symlink it into the WebServer Documents folder.

/Library/WebServer/Documents  ln -s ~/jekyll-blog/ jekyll-blog
/Library/WebServer/Documents  ls -l
lrwxr-xr-x   1 Stammy  admin   26 Nov 19 13:44 jekyll-blog -> /Users/Stammy/jekyll-blog/

Then add the line below to the end of your /etc/hosts file.

127.0.0.1       jekyll.dev

Restart Apache and try heading to http://jekyll.dev in your browser. You should get some generic Apache page if you don't have any files in the _site directory yet. When you generate the site you can start browsing the complete site locally.

Compass for Sass

I'm a huge Sass advocate and have been using it to generate my CSS for the last two years. Compass, a popular "Sass-based CSS Meta-Framework", was one of the first things I set up when designing the new site. For those new to Sass, here's how the official site describes it:

Sass makes CSS fun again. Sass is an extension of CSS3, adding nested rules, variables, mixins, selector inheritance, and more. It’s translated to well-formatted, standard CSS using the command line tool or a web-framework plugin.

sudo gem install compass

Then create config.rb and place it in the site root.

http_path = "/"
css_dir = "_site/sass"
sass_dir = "sass"
output_style = :compressed

Run compass compile or compass watch to generate the CSS. Later in this post I share a rake task for site generation that compiles the Sass too.

Importing WordPress Posts

Now that you know how Jekyll processes files and generates the site, it's time to import your database. A set of migration scripts already comes with Jekyll and currently supports CSV, Drupal, Marley, Mephisto, MovableType, TextPattern, Typo, WordPress and WordPress.com. I ended up slightly modifying another user's custom WordPress migration script that added the ability to add tags to posts. My tweak dealt with rewriting image URLs in my posts:

# Process the content and replace URLs pointing to wp-content/uploads/
# with my CloudFront CNAME'd URL turbo.paulstamatiou.com/uploads/
def self.transformUrls(domain,content)
	baseurl = "%s/wp-content/uploads/" % domain
	return content.gsub(baseurl,"turbo.paulstamatiou.com/uploads/")
end
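To see what that rewrite does to a post, here is a standalone version you can run on its own. The ImageUrlRewriter module name and the sample image tag are made up for this example; only the substitution itself comes from my importer tweak.

```ruby
# Standalone copy of the rewrite step for experimentation; the module name
# and the sample HTML below are made up for this example.
module ImageUrlRewriter
  CDN_BASE = "turbo.paulstamatiou.com/uploads/"

  # Replaces every "<domain>/wp-content/uploads/" substring with the CDN host.
  def self.transform_urls(domain, content)
    base_url = "#{domain}/wp-content/uploads/"
    content.gsub(base_url, CDN_BASE)
  end
end

html = '<img src="http://paulstamatiou.com/wp-content/uploads/2010/03/kindle.jpg" />'
puts ImageUrlRewriter.transform_urls("paulstamatiou.com", html)
```

Because the match is a plain substring, a URL written with a www. prefix would keep that prefix in front of the new host, so it is worth searching the imported posts for leftovers afterwards.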

Here is the complete WordPress importer script I used. It's easiest to copy your database to whichever computer you'll be running the script from and importing it into MySQL then providing the script with those database credentials. After running the script, I had a new _posts folder filled with all of my posts in markdown files with the correct YAML front matter including tags and title. The date was not placed in the YAML but is present in the name of the file (ex: 2011-01-20-my-post-slug.markdown), which is used when generating the site. However, if you post many times per day, the date in the slug is not specific enough and you might run into issues where Jekyll doesn't know which order to display posts published on the same day. To fix that you'll want to edit the importer to include a timestamp in the YAML for each post. I believe Harper Reed's migration script does just that.
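If you do need that timestamp, the change to an importer could look roughly like the sketch below. I have not checked it against Harper Reed's script; front_matter_for is a hypothetical helper, and it assumes the importer already has each post's publish time as a Ruby Time object.

```ruby
require 'yaml'
require 'time'

# Hypothetical helper: builds the front matter for one imported post,
# including a full timestamp so same-day posts have a defined order.
def front_matter_for(title, tags, published_at)
  data = {
    'layout' => 'post',
    'title'  => title,
    'tags'   => tags,
    'date'   => published_at.strftime('%Y-%m-%d %H:%M:%S')
  }
  # to_yaml already emits the opening '---' line; add the closing one.
  data.to_yaml + "---\n"
end

puts front_matter_for('Morning post', ['jekyll'], Time.parse('2011-01-20 09:15:00'))
```

The post body would then be appended directly after the closing --- line when the importer writes the file.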

CloudFront for Images

As for how I was going to move 460MB of images from my server to Amazon S3 for use with CloudFront, I used a nifty command line tool on my server called s3cmd. But it can be done easily via drag and drop with something like Cyberduck or Transmit. Just remember to change the ACL such that all images are publicly viewable. If you opt for the s3cmd route, after installing via brew or apt-get run s3cmd --configure to get started.

PaulStamatiou.com redesign with Jekyll

Structure of my S3 bucket deployed as a CF distribution. For CF distribution details, type s3cmd cfinfo, find the distribution ID, then try s3cmd cfinfo cf://[ID]

After the initial big upload, I wrote a task in my rakefile to make it easy to upload new images for a post. When I write a post I usually have a temporary folder on my desktop called new_post where I put all the images I want to use in the new post. I often link images to larger versions of the images and wanted this task to detect similar file names (example_img.png and example_img_1200.png, with the latter being a 1200px wide version of the former) and generate the proper HTML for an image linked to a larger version.

Example of the types of filenames in the new_post folder:

~/Desktop/new_post  ls
new_screenshot.png
new_screenshot_1100.png
single_image.png
superlarge.png
superlarge_1900.png
test_screenshot_regular.png
test_screenshot_regular_1200.png
yet_another_single_image.png

Now I just run rake cloudfront to upload the images with the proper ACL, clean up the filename, insert alt/title tags, detect different versions of images and provide me with the code for easy copying in TextMate. I know this is tied to my particular blogging workflow and may not apply to everyone but I wanted to share as it saves me lots of time.

desc 'upload imgs to cloudfront'
task :cloudfront do
  puts 'uploading images in ~/Desktop/new_post/ to cf'
  post_dir = "/Users/Stammy/Desktop/new_post/"
  month = Time.new.strftime("%m")
  year = Time.new.strftime("%Y")
  sh "s3cmd put --acl-public --guess-mime-type #{post_dir}* s3://pstam-cloud/uploads/#{year}/#{month}/"
  
  # create URLs for handy copying
  # detect large version of same image and link it to smaller version
  # or just provide img src to orphan if no larger version
  # works b/c Dir.glob returns files alpha by extension
  # yes I know THIS IS DIRTY, I'll refactor later...
  puts "Uploaded. Here are your CF URLs \n\n"
  Dir.chdir(post_dir)
  img_urls = ''
  images = Dir.glob("*.{png,gif,jpg}")
  images.each_with_index do |image, index|
    cur = image
    desc = cur.gsub('pstam_','').gsub('_',' ')[0...-4].capitalize
    if !images[index+1].nil?
      nxt = images[index+1]
      if cur.gsub(/_1[0-9]00/,'')[0...-4] == nxt.gsub(/_1[0-9]00/,'')[0...-4]
        if /_1[0-9]00/.match(image)
          large = cur
          small = nxt
        elsif /_1[0-9]00/.match(nxt)
          large = nxt
          small = cur
        end
        img_urls += <<-HTML
<div class="center"><a href="https://turbo.paulstamatiou.com/uploads/#{year}/#{month}/#{large}" title="#{desc}"><img src="https://turbo.paulstamatiou.com/uploads/#{year}/#{month}/#{small}" alt="#{desc}"/></a></div>\n
HTML
      elsif !(/_1[0-9]00/.match(image))
        img_urls += <<-HTML
<div class="center"><img src="https://turbo.paulstamatiou.com/uploads/#{year}/#{month}/#{cur}" alt="#{desc}"/></div>\n
HTML
      end
    else
      # if last
      img_urls += <<-HTML
<div class="center"><img src="https://turbo.paulstamatiou.com/uploads/#{year}/#{month}/#{cur}" alt="#{desc}"/></div>\n
HTML
    end
  end
  puts img_urls
  filename = (0...8).map{65.+(rand(25)).chr}.join + "_imgurls_tmp.txt"
  path = File.join("/tmp", filename)
  File.open(path, 'w') do |file|
    file.puts img_urls
  end
  system "open -a textmate #{path}"
end
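Since I admitted the pairing logic above is dirty, here is one way it could be pulled out into a small pure function that is easier to test. This is only a sketch of a possible refactor, not what runs in my Rakefile; pair_images and LARGE_SUFFIX are names invented for it.

```ruby
# Matches the width suffix used for large versions, e.g. "_1200" in
# "example_img_1200.png" (same range as the rake task: _1000 to _1900).
LARGE_SUFFIX = /_1[0-9]00(?=\.[a-z]+\z)/

# Groups file names by their base name and returns one hash per image,
# with :large set only when a wider version of the same image exists.
def pair_images(filenames)
  groups = filenames.sort.group_by { |name| name.sub(LARGE_SUFFIX, '') }
  groups.values.map do |names|
    large = names.find { |name| name =~ LARGE_SUFFIX }
    small = names.find { |name| name !~ LARGE_SUFFIX } || large
    { small: small, large: (large unless large == small) }
  end
end

files = %w[
  new_screenshot.png new_screenshot_1100.png single_image.png
  superlarge.png superlarge_1900.png
]
pair_images(files).each do |pair|
  if pair[:large]
    puts "#{pair[:small]} links to #{pair[:large]}"
  else
    puts "#{pair[:small]} stands alone"
  end
end
```

Grouping by base name avoids depending on the order in which the files are listed, and a lone large image is simply treated as a standalone image.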

Disqus for Comments

Jekyll is all about static files so I can't do anything like serve my own commenting system. I decided to migrate my ~25,000 WordPress comments to the popular Disqus commenting system. I was worried this would be a long and painful process but was actually surprised at how easy it was. I simply installed their WordPress plugin and told it to migrate my comments to Disqus. The process did take a while, about 10 hours, until I noticed all the comments for each post were properly loading. Comments that were threaded in WordPress were properly threaded in Disqus. Sweet!

As long as I kept the post URLs the same, there would be no problem adding Disqus to the Jekyll site. I created a comments.html include of the Disqus embed code that I put in my post layout.

That's all there was to it! There is one slight drawback, or plus depending on how you view things, to this approach. Disqus loads all comments after page load, via JS. This means that comments will not be indexed by Google. That's good if you write with SEO prowess and don't want user comments mucking up your perfect mix of keywords. That's bad if you're like me and think that user comments add tremendous value and want others to be able to find posts while searching for something mentioned in a comment.

Website Analytics

For the last few years I ran both Google Analytics and Mint. Google Analytics tends to be my "backup" analytics logging tool. I don't really check it too often but I like knowing that it's there keeping track of everything. I used Mint to simply look at more recent traffic patterns, popular referrers for the day and so on. I would check it more often than Google Analytics; up to maybe 5 times on a new post day.

With this site migration I decided it was time to lay Mint and its MySQL database to rest. I didn't want to run a mysqld process anymore. I decided to sign up for both Chartbeat and Reinvigorate until I decided which one I liked more. Both cost roughly $10 per month at my tier. I have been using both for about a month. I'm not in love with either of them at the moment. Chartbeat has a neat dashboard with real-time data but makes it hard to get basic information like unique visitors and pageviews per day. I know that's not their target metric but it would be nice to add. It's like selling me a car that tells me current MPG but not average MPG.

Chartbeat

Reinvigorate on the other hand does not give off quite the real-time vibe as Chartbeat does (and for some reason Reinvigorate reports roughly half as many active visitors as Chartbeat does; guess they have different definitions for active visitor). Reinvigorate has loads of data, much like Google Analytics, and you can get access to hourly, daily, monthly traffic, heatmaps, visitor details, top referrers, keywords and more, but it's spread out over some 20 pages and will take you an afternoon of clicking to find what you're looking for.

Maybe I'll just stick to Google Analytics and invest the money saved in a nice low expense ratio index fund. Or a haircut.

Features vs Generation Time

In the end, even after I built out complete archives pages and tag pages, I ended up ditching them entirely. Why? For simplicity and in the interest of keeping site generation time minimal. With all these features and extra pages to generate, it took Jekyll 50 minutes to generate my site. That was 50 minutes between me and publishing a new post, changing something in the layout, et cetera. Running through 1,100+ posts and hundreds more archive and tag pages while processing markdown, pygments and liquid is no easy feat. Jekyll is not made for large sites.

I ended up taking that restriction and using it for the better. Did I really need tags and individual archive pages? I asked a bunch of people on Twitter whether they used tags for site navigation. It came back as a resounding no. Most people considered them clutter. Search is the killer app now, no need for tags in my opinion. I yanked them all out and 301 redirected tag and individual archive pages to my single archives page.

Site generation time went down to around 6 minutes on my 2.8GHz Core 2 Duo after I took them out.

~/jekyll-blog(master)  rake generate
(in /Users/Stammy/jekyll-blog)
time jekyll
Configuration from /Users/Stammy/jekyll-blog/_config.yml
Building site: . -> ./_site
Successfully generated site: . -> ./_site
      395.46 real       249.75 user        86.82 sys
compass compile
   exists _site/sass
  compile sass/screen.sass
   create _site/sass/screen.css

Generation is handled by this task in my Rakefile:

desc 'nuke, build and compass'
task :generate do
  sh 'rm -rf _site'
  jekyll
end

def jekyll
  # time to give me generation times
  # I'm just curious about how long it takes
  sh 'time jekyll'
  # compass already configured via config.rb in root
  sh 'compass compile'
end

Before this run-in with archives and tags I got related posts ("Latent Semantic Indexing") working after compiling and installing GSL with rb-gsl. It took a while to generate the list of related posts when I only had a handful of posts in my local Jekyll environment. When I put all my posts in and tried to generate them it took longer than 10 hours. I don't know exactly how long because I tried it twice and killed it after 10 hours — that wasn't going to fly and I decided to just list recent posts instead. I had considered spinning up a large EC2 instance to generate it but doing that each time I had a new post was going to be a pricey nuisance.

For those with fewer posts interested in implementing tag pages, I made a rake task for it as shown in this gist. Getting individual archive pages working required adding some of the archive support built by Mike West into my Jekyll fork. While I didn't end up using the full archive page support, it did allow me to organize the post listing in my single archives page by month and year (mentioned below).
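The month and year grouping can be approximated outside of Jekyll with plain Ruby working on post file names. The sketch below is not Mike West's archive code or what is in my fork; collate_posts is a hypothetical helper that only shows the shape of the data an archives page needs.

```ruby
require 'date'

# Hypothetical helper: turns post file names such as
# "2011-01-20-my-post-slug.markdown" into { year => { month => [slugs] } },
# newest first.
def collate_posts(filenames)
  collated = Hash.new do |years, year|
    years[year] = Hash.new { |months, month| months[month] = [] }
  end
  filenames.sort.reverse_each do |name|
    match = name.match(/\A(\d{4})-(\d{2})-(\d{2})-(.+)\.\w+\z/)
    next unless match # skip files that do not follow the post naming scheme

    collated[match[1].to_i][match[2].to_i] << match[4]
  end
  collated
end

posts = %w[
  2009-11-27-review-2011-ford-fiesta-and-the-fiesta-movement.markdown
  2010-02-17-live-blogging-startup-riot-2010.markdown
  2010-03-05-review-23andme-dna-testing-for-health-disease-ancestry.markdown
]
collate_posts(posts).each do |year, months|
  months.each do |month, slugs|
    puts "#{Date::MONTHNAMES[month]} #{year}"
    slugs.each { |slug| puts "  /#{slug}" }
  end
end
```

Because this only reads file names, it stays fast even with over a thousand posts.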

Custom Features

A few features I did end up implementing and keeping include a second post type called "bit", MultiViews support, a filter to recognize WordPress "more" tags and collated posts.
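As a taste of how small these can be, the "more" tag handling boils down to cutting a post at the marker WordPress leaves in the content. The sketch below is an illustration in plain Ruby rather than the exact filter in my fork; excerpt and MORE_TAG are names picked for the example.

```ruby
# Hypothetical filter: returns the part of a post before the WordPress
# "more" marker, or the whole post when no marker is present.
MORE_TAG = /<!--\s*more\s*-->/

def excerpt(content)
  # Split into at most two parts and keep the first one.
  content.split(MORE_TAG, 2).first.to_s.rstrip
end

post = "<p>Intro paragraph.</p>\n<!--more-->\n<p>The rest of the review.</p>"
puts excerpt(post)
```

Wrapped in a module and registered with Liquid, a method like this could be applied to post content in a listing template.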

MultiViews

There are two main ways of getting Jekyll to create permalinks. In the _config.yml you can either set permalink to something ending in .html or not. If the permalink structure ends in .html, Jekyll will end up generating posts as html files and dump them directly in _site and Apache will serve them as yoursite.com/your-post-slug.html. Jekyll will also link to posts on the site with the .html extension (that's what putting post.url in a posts loop will output).

If you set the permalink structure without any html extension, Jekyll will generate a ton of index.html files stored within their own directory named the slug of the post. Apache will serve it without any extensions as well, but will by nature keep a trailing slash since it is loading the index.html file inside of the directory. For example: _site/some-long-post-slug/index.html => yoursite.com/some-long-post-slug/
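The difference between the two styles is easier to see side by side. The following Ruby sketch mirrors the behavior described above rather than Jekyll's internal code; output_for is a hypothetical helper and only handles the :title placeholder.

```ruby
# Hypothetical illustration of the two permalink styles; this mirrors the
# behavior described above rather than Jekyll's internal code.
def output_for(slug, permalink)
  url = permalink.sub(':title', slug)
  if url.end_with?('.html')
    # With an extension: one flat file per post, served at the same path.
    { file: "_site#{url}", url: url }
  else
    # No extension: the post becomes a directory holding index.html,
    # and the served URL keeps a trailing slash.
    dir = url.chomp('/')
    { file: "_site#{dir}/index.html", url: "#{dir}/" }
  end
end

p output_for('some-long-post-slug', '/:title.html')
p output_for('some-long-post-slug', '/:title')
```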

Alright Paul, so where's the issue?

I don't want a bunch of long-named directories with index.html files. It makes it hard to search for posts locally if everything comes up as index.html. Just having post html files and fewer directories is much easier to deal with IMO.
I don't want permalinks to end in .html
I also don't want a trailing slash on permalinks (which is what happens with the index.html route)
I want Jekyll to generate post links without the .html extension even though I told it in the config to use .html

MultiViews is an Apache feature aimed at content negotiation — serving up files for resources that don't exist. So even though /long-post-slug doesn't exist, Apache will end up serving /long-post-slug.html.

The effect of MultiViews is as follows: if the server receives a request for /some/dir/foo, if /some/dir has MultiViews enabled, and /some/dir/foo does not exist, then the server reads the directory looking for files named foo.*, and effectively fakes up a type map which names all those files, assigning them the same media types and content-encodings it would have if the client had asked for one of them by name. Apache docs

I already set up MultiViews in the Apache configuration (you can also set it in .htaccess) so the only pieces left are 1) coaxing Jekyll into processing post urls without the html extension and then 2) having Apache redirect post-slug.html to the extension-free post-slug (otherwise both versions would load and Google would index both, detect duplicate content and spread PageRank amongst both.. not very canonical).

Fortunately both were a quick fix away. I ran across Henrik's Jekyll fork where he introduced a MultiViews setting in _config.yml and then rewrote the url method to remove the extension if multiviews is enabled and placed the url logic in another method. I applied the same method in an updated Jekyll (v0.10.0). Just set multiviews: true in the config file.
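The idea behind that change can be sketched in a few lines of Ruby. This is only an illustration of the technique, assuming a `multiviews` key read from the config; the `post_url` helper is a hypothetical name, and the real url method in Henrik's fork and in Jekyll is structured differently.

```ruby
# Hypothetical sketch: drop the .html extension from a generated post URL
# when a `multiviews` setting is on. Not the actual code from any fork.
def post_url(url, multiviews)
  return url unless multiviews
  url.sub(/\.html\z/, "") # only strips a trailing .html
end

config = { "multiviews" => true } # stands in for the parsed _config.yml
puts post_url("/some-long-post-slug.html", config["multiviews"])
# prints /some-long-post-slug
```

With the setting off the URL passes through untouched, so the files on disk keep their .html names either way and only the links change.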

And finally, a few .htaccess lines to take care of the duplicate urls:

# External redirect any /post-slug.html to /post-slug
RewriteCond %{THE_REQUEST} ^[A-Z]+\s([^\s]+)\.html\s
RewriteRule .* %1 [R=301,L]
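If you want to see what that pattern captures before trusting it on a live server, the same regex can be tried in Ruby. This is just a sketch: `redirect_target` is a hypothetical helper name, and it only mimics the match, not Apache's redirect handling.

```ruby
# Same pattern as the RewriteCond above, written as a Ruby regex.
# THE_REQUEST is the raw request line, e.g. "GET /post-slug.html HTTP/1.1".
HTML_REQUEST = /^[A-Z]+\s([^\s]+)\.html\s/

# Returns the redirect target (%1 in the RewriteRule), or nil when the
# requested path does not end in .html.
def redirect_target(request_line)
  match = HTML_REQUEST.match(request_line)
  match && match[1]
end

p redirect_target("GET /post-slug.html HTTP/1.1")
p redirect_target("GET /post-slug HTTP/1.1")
```

The first request line yields "/post-slug" as the redirect target, while the extension-free request yields nil, so it is served as-is and there is no redirect loop.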

`<!--more-->` content filter

This allows me to return just the part of the post before the more tag in my templates. For example, I wanted to use this on tag pages, archive pages and on new posts on the homepage. Alternatively, if you don't use or don't want to use the more tag in your posts, you can get the same effect with something like {{ post.content | truncatewords: 75 | textilize }}.

I added this to filters.rb in my Jekyll fork:

# Returns all content before the first-encountered WP-style MORE tag.
# Allows authors to mark the fold with an <!--more--> in their drafts.
# ex: {{ content | before_fold }}
def before_fold(input)
  input.split("<!--more-->").first
end

Bit post type

I wanted a post type similar to an aside but wanted it to remain entirely separate from regular posts. Bits would not share tags, be listed in the main RSS feed, et cetera. Something like this could have been done by adding another field to the YAML front matter in each bit and checking for the presence or exclusion of that value while looping through posts, or by simply making a bits folder and manually adding posts there, but then I wouldn't be able to loop through them for a bits archive, feed or the sitemap.

For ease of use, cleaner logic and faster generation times (less stuff for Liquid to do) I decided to make a Bit class. It's essentially a direct copy of the Post class with appropriate variable changes/additions made throughout Jekyll.

View bit.rb on GitHub.

Collated posts

And last but not least, I wanted slightly better archive pages. I didn't just want a list of every post. I wanted them broken up into sections for year and month.

This snippet, among some related archive code, was added to the render method in site.rb:

self.posts.reverse.each do |post|
  y, m, d = post.date.year, post.date.month, post.date.day
  unless self.collated.key? y
    self.collated[ y ] = {}
  end
  unless self.collated[y].key? m
    self.collated[ y ][ m ] = {}
  end
  unless self.collated[ y ][ m ].key? d
    self.collated[ y ][ m ][ d ] = []
  end
  self.collated[ y ][ m ][ d ] += [ post ]
end

That allowed me to use this crazy markup to create the archives page:

Jekyll Archives for PaulStamatiou.com View the complete archives file in this gist.
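Since the archives markup above is only shown as an image, here is a plain Ruby sketch of the same idea: build the year, month, day hash and then walk it to produce grouped output. The `Post` struct, `collate` and `archive_lines` are hypothetical stand-ins for illustration; the real template does this walk in Liquid.

```ruby
require "date"

# Hypothetical stand-in for Jekyll's Post; only title and date matter here.
Post = Struct.new(:title, :date)

# Builds the year -> month -> day -> [posts] shape used for the archives.
# Expects posts ordered newest first so sections come out in that order.
def collate(posts)
  collated = {}
  posts.each do |post|
    y, m, d = post.date.year, post.date.month, post.date.day
    ((collated[y] ||= {})[m] ||= {})[d] ||= []
    collated[y][m][d] << post
  end
  collated
end

# Walks the nested hash the way an archives template would.
def archive_lines(collated)
  lines = []
  collated.each do |year, months|
    lines << year.to_s
    months.each do |month, days|
      lines << "  #{Date::MONTHNAMES[month]}"
      days.each_value do |day_posts|
        day_posts.each { |post| lines << "    #{post.date.strftime('%d')} #{post.title}" }
      end
    end
  end
  lines
end

posts = [
  Post.new("Newest", Date.new(2011, 1, 24)),
  Post.new("Older", Date.new(2011, 1, 3)),
  Post.new("Oldest", Date.new(2010, 12, 30))
]
puts archive_lines(collate(posts))
```

Ruby hashes keep insertion order, so feeding the posts in newest first is enough to get the year and month sections in reverse chronological order without any extra sorting.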

Performance

By nature, Jekyll will be fast — as fast as your nginx, Apache or other web server setup can dish out tiny static html files. By offloading all image resources to a CDN, I reduced the number of HTTP requests the server has to reply to for a single page load. I could have also done the same thing with my Sass-compiled CSS but I change it so often that I prefer having it served from my server rather than dealing with CDN cache invalidation and versioning issues. I ended up keeping Apache as my web server; I don't get the kind of traffic to warrant an nginx setup. My box can handle a 50,000 pageview day no problem and that's the most I've seen from any sort of Hacker News/Reddit/&c fiasco.

I decided to install the much-hyped Apache 2.2 module by Google called mod_pagespeed:

# confirm this Ubuntu server is x86_64
~  uname -a
Linux paulstamatiou.com [...] x86_64 GNU/Linux	
# yup, good to go with 64-bit mod_pagespeed
~  wget https://dl-ssl.google.com/dl/linux/direct/mod-pagespeed-beta_current_amd64.deb
2011-01-24 19:43:02 (1.83 MB/s) - `mod-pagespeed-beta_current_amd64.deb' saved [764256/764256]
~  sudo dpkg -i mod-pagespeed-*.deb
~  sudo apt-get -f install

Pagespeed should be installed and you'll see some new files in the /etc/apache2/mods-enabled/ directory. Now let's take the red pill and see what kind of configuration options are available. Open up pagespeed.conf and uncomment/enter these lines:

ModPagespeed on
AddOutputFilterByType MOD_PAGESPEED_OUTPUT_FILTER text/html
AddOutputFilterByType MOD_PAGESPEED_OUTPUT_FILTER text/css
ModPagespeedRewriteLevel CoreFilters
ModPagespeedEnableFilters collapse_whitespace,elide_attributes

Read about more mod_pagespeed filters and settings in the docs; this is only scratching the surface. In particular, take a look at the rewrite_images filter as well as ModPagespeedDomain if you do any CDN stuff. Pagespeed can also provide basic statistics if you enable the following:

ModPagespeedEnableFilters add_instrumentation

<Location /mod_pagespeed_beacon>
      SetHandler mod_pagespeed_beacon
</Location>

<Location /mod_pagespeed_statistics>
   Order allow,deny
   Allow from localhost
   # Add your IP below and uncomment to be able to view remotely
   # Allow from XXX.XXX.XXX.XXX
   SetHandler mod_pagespeed_statistics
</Location>

After you're done fine-tuning pagespeed settings save the conf file and then restart Apache:

sudo /etc/init.d/apache2 restart

Fire up Chrome browser and load up your site. Right-click anywhere on the site, click Inspect Element, click the Network tab then refresh the site to fill up the network pane. Click on the name of the actual HTML for the page. Make sure it's not a 304 (that's generally good but if it's coming from cache you can't see the mod_pagespeed headers to confirm if properly installed). Load a page you haven't visited before, clear your cache or try enabling private mode. Once you're able to load a page with a 200 you should see X-Mod-Pagespeed at the end of the Headers pane:

Google mod_pagespeed enabled - header

If you click on the Content tab to view the source of any page you might also see some differences depending on your pagespeed configuration. I saw three markup changes: 1) a timer script at the beginning and end of the page for statistics, 2) collapsed whitespace and 3) an adjusted and processed CSS include.

Google mod_pagespeed enabled - markup changes

I'm just touching the tip of the iceberg with server optimizations. If you care about performance you'd end up tinkering with nginx in the first place. I might end up doing that later on but I'm content with Apache. That and I'm not looking forward to rewriting my .htaccess rules into nginx.conf.

htop showing how happy my server is with the new setup

While I was at it I decided to upgrade my box from Ubuntu 9.04 to 10.04 LTS. I did the typical sudo apt-get update, upgrade and then do-release-upgrade, but received this error:

Can not run the upgrade. This usually is caused by a system were /tmp is mounted noexec. Please remount without noexec and run the upgrade again.

It turns out this error is pretty common. It's safe to keep /tmp as noexec but it has the downside of restricting certain release upgrades like this. My first thought was to simply remount /tmp as exec as the nice error suggested:

~  sudo mount -o remount,exec /tmp

warning: can't open /lib/init/fstab: No such file or directory
mount: can't find /tmp in /etc/fstab or /etc/mtab

Hrm, no go. Let's try chroot:

sudo mkdir -p /root/chroot /root/tmp
sudo mount --bind / /root/chroot
sudo mount --bind /root/tmp /root/chroot/tmp
sudo chroot /root/chroot

I ran do-release-upgrade again and it worked. A few minutes later:

~  cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=10.04
DISTRIB_CODENAME=lucid
DISTRIB_DESCRIPTION="Ubuntu 10.04.1 LTS"

Writing

Alright, your Jekyll setup is pretty much all complete but you want to write a post about your new setup before deploying the new site. Now what? Simply create a new file in the format of "year-month-day-post-slug.markdown" in your _posts directory. Add the necessary YAML front matter to the top of the post and then generate the site.

I wrote a rake task to save a few seconds each time I have to create a new post. You will need to install the chronic gem first though. It creates the new file in the proper directory and then opens the file in TextMate.

# ignore the "bit" stuff.. only relevant to my custom jekyll fork
# rake new type=(bit|post) future=0 title="New post title goes here" slug="slug-override-title"
desc 'create new post or bit. args: type (post, bit), title, future (# of days)'
task :new do
  require 'rubygems'
  require 'chronic'
  
  type = ENV["type"] || "bit"
  title = ENV["title"] || "New Title"
  future = ENV["future"] || 0
  # fall back to the title before calling gsub, since ENV["slug"] is nil when no slug is passed
  slug = (ENV["slug"] || title).gsub(' ','-').downcase

  if type == "bit"
    TARGET_DIR = "_bits"
  elsif future.to_i < 3
    TARGET_DIR = "_posts"
  else
    TARGET_DIR = "_drafts"
  end

  if future.to_i.zero?
    filename = "#{Time.new.strftime('%Y-%m-%d')}-#{slug}.markdown"
  else
    stamp = Chronic.parse("in #{future} days").strftime('%Y-%m-%d')
    filename = "#{stamp}-#{slug}.markdown"
  end
  path = File.join(TARGET_DIR, filename)
  post = <<-HTML
--- 
layout: TYPE
title: "TITLE"
date: DATE
---

HTML
  # non-bang gsub: gsub! returns nil when nothing is replaced, which would break the chain
  post = post.gsub('TITLE', title).gsub('DATE', Time.new.to_s).gsub('TYPE', type)
  File.open(path, 'w') do |file|
    file.puts post
  end
  puts "new #{type} generated in #{path}"
  system "open -a textmate #{path}"
end

Pygments

What about getting code syntax highlighting working with pygments? Simply place your code in this curly brace liquid highlight block markup. To find out what lexers pygments supports, visit (and bookmark) this page. Here's a handy list of common language short names for use in pygments (they're guessable for the most part): python, perl, clojure, ruby, c, cpp, java, scala, csharp, common-lisp, erlang, haskell, console, mysql, cfm, django, css+php, css+ruby, erb, jsp, vim, actionscript3, css, haml, html, js, php, sass.

Jekyll liquid pygments This is an image because using liquid markup to show liquid markup is a PITA. There's a tricky solution though.

The last thing you'll need to do is get the CSS for the syntax highlighting and paste it in your stylesheet. Since I'm using Sass, I'll go ahead and convert it to Sass right off the bat. Pygments supports various styles you can use.

pygmentize -f html -S default > syntax.css && sass-convert syntax.css syntax.sass && mate syntax.sass

Paste the contents of syntax.sass into your primary sass file. Or you can ignore that line and just use the GitHub syntax style that you've come to know and love.

Deploying

There are a number of ways to deploy Jekyll to your live site. Use whichever method you prefer to move a bunch of html files to your web server's html directory. I went with a simple rsync solution added to my rakefile. This method requires that I generate the site locally and not on the server. I prefer this route for a few reasons:

No need to keep my server up-to-date on all the gems and related software I use
Easy to do a simple check locally and see if everything is all good, Sass correctly compiled, et cetera before deploy
rsync just works

desc 'deploy to pstam via rsync'
task :deploy do
  # uploads ALL files b/c I often do site-wide changes and prefer overwriting all
  puts 'DEPLOYING TO PAULSTAMATIOU.COM'
  # remove --rsh piece if not using 22
  sh "rsync -rtzh --progress --delete _site/ --rsh='ssh -pCUSTOM_PORT' user@domain.com:/var/www/domain.com/html/"
  puts 'done!'
end

I might end up writing another rake task to archive the previous deploy and give me the ability to rollback between dated deploys within seconds. Maybe even use anemone post-generation to crawl the site locally and check for dead links (which has happened before with dirty htaccess rules).

Thoughts

I'll be the first to say that this kind of blogging setup is most definitely not for everyone. Using Jekyll is pretty much just for developers and those that feel comfortable tinkering around with git, rsync, a ruby project and writing some basic Liquid markup for templates. In a lot of ways this can be seen as a step back from typical dynamic blogging systems. Having to generate static html pages? That might bring back horrors of early Perl-based Movable Type installs.

Running Jekyll for your publishing needs is akin to running vim for everything you do. It doesn't seem efficient to those not familiar but it's small enough (about 2000 lines of code excluding tests and blank lines, according to cloc) and easy enough for hackers to edit and streamline into their workflow as they see fit. Once you get it to your liking, that's it! You never have to update it, worry about security issues (aside from keeping your server up-to-date of course) or anything.

I'm extremely happy with my Jekyll setup. If I were to change anything, I might get around to hacking on the ability for Jekyll to do incremental generation. In the cases where I don't make site-wide layout changes and am only editing a single post, Jekyll won't end up needlessly generating the entire site. Oh wait, someone already did that. Time to fork.

What do you think? What is your current blogging setup? Would something like Jekyll ever tickle your blogging fancy or is it all just too much work to set up?