Posts tagged with heroku

5 Early Lessons from Rapid, High Availability Scaling with Rails

At Ello, we were blindsided by the amount of traffic we were receiving. Right time, right place, I guess. One week, we're seeing a few thousand daily sessions. The following week, a few million. This surge of users meant the software we built was contorted in directions we never thought possible.

Like anything viral, there's a massive influx of interest for a relatively short period of time, followed by a slow decline, leaving a wake of destruction as the subject settles into its new mold. Ello has since settled, so what better time than now to document some of the lessons learned while scaling during those critical weeks of virality. I want to ensure these lessons are not merely light takeaways but rather tangible advice that you can apply if you're ever fortunate/unfortunate enough to be put in a similar situation. As such, parts of this article will be specific to Ello and may not apply in other domains.

Lesson 1: Move the graph

One of our first scaling hurdles involved managing the graph of relationships between users. We didn't just intuitively say, "oh, the graph is slow," but it didn't take much prodding either. We're on a standard Rails stack, using Heroku and Postgres. We have a table called relationships which stores all data about how users are tied together. Have you friended, blocked, or unfriended someone? It's all stored in the relationships table.

We're building a social network. By definition, our relationships table is one of the hottest tables we have. How many people are you following in total? How many in friends? How many in noise? Who should be notified when you create a post? All of these questions rely on the relationships table for answers. Answers to these questions will be cached by Postgres, so only the initial query incurs the cost of calculating the results. Subsequent queries are fast. But Postgres' query cache alone becomes meager at scale. As a user on a new social network, accumulating relationships is a regular activity. Every new relationship formed busts Postgres' cache for queries on that data. This was a high-read, high-write table.

Since we're on Heroku, we had the phenomenal Heroku Postgres tools at our disposal. When thrown into the fire, one of the best extinguishers was heroku pg:outliers. This command illuminates the top 10 slowest queries. All 10 of ours were associated with the relationships table. We had all the right indexes in place, yet some queries were taking up to 10 seconds to produce results.

Resolving a problem like this is application specific, but in our case the best option was to denormalize the relationship data into a datastore that could more easily answer our pertinent and frequent questions about the social graph. We chose Redis. It was a bit of a knee-jerk reaction at the time but a technique we've had success with in the past. Only after having implemented this, did we stumble upon a reassuring article outlining how Pinterest uses Redis for their graph. To be clear, we didn't move the data entirely, we provided an additional layer of caching. All data is still stored in Postgres for durability and is cached in Redis for speed. In the event of a catastrophe, the Redis data can be rebuilt at any time.

We moved all of our hot queries against the relationships table into Redis. Since "followers" and "followings" are displayed on every profile and a count(*) was our top outlier, our first step was to cache these values in Redis counters. We used Redis Objects to make this simple and elegant. Any time a new relationship was created or destroyed, these counters are incremented and decremented. When looking at another user's profile, to render the UI we needed to answer the question "are you following this user? If so, in the friends or noise bucket?" To answer this and similar questions, we cached the user IDs of all people who you had in your friends bucket, your noise bucket, and the union of both.
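As a rough sketch of what that looks like with Redis Objects (the model, counter, and bucket names here are illustrative, not our exact schema):

class User < ActiveRecord::Base
  include Redis::Objects   # assumes Redis::Objects.redis is configured in an initializer

  counter :followers_count
  counter :followings_count

  set :friends_ids         # IDs of users in your friends bucket
  set :noise_ids           # IDs of users in your noise bucket
end

# When a friend relationship is created:
followed.followers_count.increment
follower.followings_count.increment
follower.friends_ids << followed.id

# Rendering another user's profile:
current_user.friends_ids.member?(other_user.id)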

With our graph data in Redis, we can now query the graph in ways that would be prohibitively expensive with Postgres. In particular, we use it to influence our recommendation system. "Give me all the users that are being followed by people I'm following, but I'm not yet following." Using Redis set intersections, unions, and diffs, we can begin to derive new and interesting uses of the same data.
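A friend-of-friend recommendation, for instance, can be sketched with plain Redis set commands; the key names here are hypothetical:

redis = Redis.current  # assumes a configured redis-rb connection

following_key  = "users:#{user.id}:following"
candidate_keys = redis.smembers(following_key).map { |id| "users:#{id}:following" }

recommended_ids =
  if candidate_keys.empty?
    []
  else
    tmp_key = "tmp:recommendations:#{user.id}"
    # Union everyone my followings follow, then subtract the people I already follow (and me).
    redis.sunionstore(tmp_key, *candidate_keys)
    ids = redis.sdiff(tmp_key, following_key) - [user.id.to_s]
    redis.del(tmp_key)
    ids
  end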

The real lesson here is this: every product has a core pillar that supports the core offering. Ello's is a social graph. When your core pillar begins to buckle under its own weight, it is critical to cache that data (or move it entirely) and continue providing your core offering.

Lesson 2: Create indexes early, or you're screwed

No really, you'll be chasing down these indexes for months. The previous section outlined how we scaled our relationships table. This and subsequent sections will detail how we scaled our activities table, the denormalized table that backs everyone's main activity feed. The activity feed contains any posts that people you follow have created, notifications for when someone follows you, notifications for mentions, and the like. Everything that you need to be notified about ends up in this table, and we forgot some indexes.

Prior to Ello, I fell into the camp that created indexes only once the data proved they were needed. Sure, you can predict usage patterns, but since indexes can consume a lot of memory, I would rather have created them when I knew they were necessary. Big mistake here.

The first type of index that we forgot was just a plain old btree on a field that was queried regularly. An index like this can be created easily if nobody is writing to the table or downtime is feasible. This is high availability scaling, so downtime is not an option, and everything was writing to this table. Since the activities table was experiencing extremely high writes, concurrently building these indexes would never finish. While an index is being built concurrently (that is, without downtime), new records in the table are also added to the index. If the speed at which new records are added outpaces the speed at which Postgres can index hundreds of millions of existing rows, you're shit out of luck.
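For reference, kicking off a concurrent build in Rails looks roughly like this (a sketch assuming the Rails 4 migration API and an illustrative column name):

class AddUserIdIndexToActivities < ActiveRecord::Migration
  # CREATE INDEX CONCURRENTLY can't run inside a transaction.
  disable_ddl_transaction!

  def change
    add_index :activities, :user_id, algorithm: :concurrently
  end
end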

The solution? If downtime is not an option, you'll have to build a chokepoint in your application. All writes to a particular table must be funneled through this chokepoint so that if you want to stop writes, you constrict the chokepoint. In our case, we are using Sidekiq. We use Sidekiq jobs as our chokepoint, which means that if we ever want to stop all writes to the activities table, we spin down all Sidekiq workers for the queue that pertains to activity writes. Unworked jobs would get backed up and remain idle until we spun the workers back up, hence preventing writes to the activities table. Doing this for a couple minutes endowed Postgres with enough breathing room to work hard on building the index from existing records. Since Sidekiq jobs run asynchronously, this should have little impact on users. In our case, the worst that would happen is a user creates a post, refreshes the page, and sees that the post is not there because the activity record was not yet created. It's a tradeoff we made to keep the app highly available.
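A rough sketch of that chokepoint, with a hypothetical worker and an Activity model whose fields are illustrative:

class ActivityWriter
  include Sidekiq::Worker
  # All activity writes go through this queue; spinning its workers
  # down pauses writes to the activities table without dropping them.
  sidekiq_options queue: :activity_writes

  def perform(user_id, subject_id, kind)
    Activity.create!(user_id: user_id, subject_id: subject_id, kind: kind)
  end
end

# Callers never write to activities directly:
ActivityWriter.perform_async(follower.id, post.id, 'post_created')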

Situations like this are actually not the worst of it. The absolute worst is when you forget a unique index. Now your data is corrupt. We forgot a unique index, oops. When the level of concurrency necessary to run a rapidly scaling app reaches a point where you can't decipher whether a job is legitimate or erroneous, you need to rely on your database's ACID guarantees. This is why Postgres is awesome: something will happen once and only once regardless of concurrency. If two jobs try to accomplish the same thing in parallel, Postgres will ensure only one of them wins. But only if you have a unique index.

An astute reader might ask, "well, why would two jobs try to accomplish the same thing?" Let me quickly explain. It all stems from one bad piece of data that, when used, creates more bad data. For example, we didn't have a unique index on the relationships table. So, I could technically follow another user twice. When the user I follow creates a new post and it becomes time to ask, "who should receive this post in their feed?", if you're relying on the relationships table to answer that question, you're relying on bad data. The system will now create two duplicate activities. This is just one reason for duplicate jobs. Others include computers being stupid, computers failing, and computers trying to fix their own stupidity and failures. Fixing the source of the bad data, the non-unique relationships, was a great launching point towards stability.

So many of our scaling choices were derived from not having a unique index. It was crippling. Firstly, you can't create a unique index with non-unique values in the table. Just won't happen. You need to first remove duplicates, which is terrifying. You're deleting data, and you better hope you're caffeinated enough to do it correctly. I also recommend 48 hours of sleep before attempting. What constitutes a duplicate depends on the data, but this Postgres wiki page on deleting duplicates is an excellent resource for finding them.

So, you delete duplicates, great. What about the time between deleting duplicates and adding a unique index? If any duplicates were added in the meantime, the index won't build. So, you start from square one. Delete duplicates. Did the index build? No? Delete duplicates.
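Once the table finally comes up clean, the unique index itself is a small migration. Here's a sketch with illustrative column names for the relationships table:

class AddUniqueIndexToRelationships < ActiveRecord::Migration
  disable_ddl_transaction!

  def change
    # Fails if any duplicate (owner_id, subject_id) pairs snuck back in.
    add_index :relationships, [:owner_id, :subject_id],
              unique: true, algorithm: :concurrently
  end
end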

Lesson 3: Sharding is cool, but not that cool

We sharded ourselves five times. Tee hee hee. Laugh it up. Sharding is so cool and webscale, we did it five times. I mentioned earlier that a lot of our scaling choices derived from the lack of a unique index. It took us two months to build a unique index on the activities table. At the point when the index was built, there were about a billion records in the table. Sharding would reduce the write traffic to each database and ease the pain of most tasks, including building a unique index.

For completeness, I want to define sharding. Like most things in software, sharding has conflated definitions, but here's mine. Sharding is the process of taking one large thing and breaking it into smaller pieces. We had one large, 750M-record activities table that was becoming unwieldy. Prior to breaking down the activities table, we moved it out of our primary database (with users, posts, etc) into its own database, which is also a form of sharding. Moving a table to a different database is vertical sharding; splitting a single table's rows across databases is horizontal sharding, or partitioning. We received recommendations from highly respected parties to think about horizontally sharding when our table reached 100GB of data. We had about 200GB. We don't follow rules well.

I won't detail our sharding setup right now, but will mention that it took a lot of planning and practice to nail down. We used the Octopus gem to manage all of our ActiveRecord connection configurations, but that's certainly not the extent of it. Here are some articles you might find interesting: a general guide with diagrams, Braintree on MySQL, and Instagram on Postgres.

When sharding, say we have database A that is progressively slowing and needs to be broken down. Before sharding, users whose ID modulo 2 is 0 or 1 have their data in database A. After sharding, we want users whose ID modulo 2 is 0 to continue going to database A and those whose ID modulo 2 is 1 to go to a new database B. That way, we can spread the load between multiple databases and they will each grow at roughly half the speed. The general sharding process is this: set up a new replica/follower database B, stop all writes to A, sever the replica (A and B are now two exact duplicates), update the shard configuration so some data goes to A and some to B, resume writes, then prune antiquated data from both A and B.
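To make that concrete, here's a sketch of routing queries by user ID with Octopus; the shard names and the modulo-2 scheme are illustrative:

ACTIVITY_SHARDS = [:shard_a, :shard_b]

def activity_shard_for(user_id)
  ACTIVITY_SHARDS[user_id % ACTIVITY_SHARDS.size]
end

# Read a user's feed from the shard that owns their rows:
Octopus.using(activity_shard_for(user.id)) do
  Activity.where(user_id: user.id).order(created_at: :desc).limit(25)
end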

So cool, I love sharding.

Many highly respected and extremely expensive people told us we needed to shard. We trusted them. We planned out multiple approaches to sharding and converged on the technique outlined here, sharding by user ID. What nobody cared to consider was what would happen after we'd sharded. We thought there was a pot of gold. Nope.

We sharded for two reasons: so we wouldn't hit a ceiling while vertically scaling our Postgres boxes, and so our queries would perform better because each shard would hold less data after the prune step. Let's address the prune step.

In the example above, since data for users whose ID modulo 2 is 1 is no longer stored or referenced in database A, we can safely remove all of their data. You're going to need a second pair of underwear. The simplified query for pruning database A is, "delete all records for users whose ID modulo 2 is 1". The inverse is done on database B. In our case, we ended up removing almost exactly half of the records for each additional shard we created. This was our plan: if each time we shard the databases store half the data, each one needs only half the Postgres box to serve it.
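A sketch of that prune on database A, batched so one massive DELETE doesn't hold locks for hours (the shard name, batch size, and modulo scheme are illustrative):

Octopus.using(:shard_a) do
  loop do
    # Users whose ID modulo 2 is 1 now live on shard B, so their rows here are dead weight.
    ids = Activity.where("user_id % 2 = 1").limit(10_000).pluck(:id)
    break if ids.empty?
    Activity.where(id: ids).delete_all
  end
end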

Imagine we have four records in database A before sharding and pruning: [ W | X | Y | Z ]. After sharding and pruning, database A might look like this: [ W |      | Y |      ]. Database B might look like this: [      | X |      | Z ]. Notice the gaps. This equates to hard disk fragmentation. This started biting us in the ass and would have likely made our lives hell if we didn't already have other tricks up our sleeves.

If database A looks like this: [ W |      | Y |      ], then when I ask "give me all records for user ID 0", it should return W and Y. But W and Y are not in contiguous places on disk. So in order to service this query, Postgres must first seek to W, then seek to Y, skipping over the gaps in between. If W and Y lived next to each other on disk, the disk would not have to work so hard to fetch both records. The more work to be done, the longer the query.

Generally, when new data is added to the table, it's put in contiguous slots at the end of the disk (regardless of the user ID). We then ran a VACUUM ANALYZE on the table. Postgres now says, "oh, there's space between W and Y, I can put new data there!" So when new data is added and then fetched, Postgres needs to spin all the way back to the beginning of the disk to fetch some records, while other records for the same user are at the end of disk. Fragmentation coupled with running a VACUUM ANALYZE put us up shit creek. Users with a lot of activity simply couldn't load their feeds. The only sanctioned way to fix fragmentation is hours of downtime.

Ok, I hope you're still with me. The solution and lesson here are important. Firstly, if our Postgres boxes were on SSDs, maybe fragmentation wouldn't have been such a big deal. We weren't on SSDs. The solution for us was to build a covering index so that we could service index-only scans. Effectively, what this means is that all fields used to filter and fetch data from a table must be stored in an index. If it's all in the index, Postgres does not need to go to disk for the data. So we added a covering index for our hottest query and saw about a 100x improvement on average, up to 7000x improvements for users with a lot of activity.
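A sketch of such a migration; the column list is illustrative, the point being that every column the feed query filters on, sorts by, or returns has to be in the index for Postgres to choose an index-only scan:

class AddCoveringIndexToActivities < ActiveRecord::Migration
  disable_ddl_transaction!

  def change
    add_index :activities, [:user_id, :created_at, :subject_id, :kind],
              algorithm: :concurrently,
              name: 'index_activities_covering_feed'
  end
end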

The lesson here is twofold. Serving data from memory is orders of magnitude faster than serving it from disk. Be leery of serving data from disk at scale. The second lesson is equally important. We probably should have just scaled vertically as much as possible. Webscale was too sexy to avoid. "Shard all the things" is the meme I'm looking for. Sharding was challenging and a better long-term solution, but had we applied a covering index to the whole table before doing any horizontal sharding, I believe we could have saved tons of time and stress by simply adding more RAM as our database grew.

Lesson 4: Don't create bottlenecks, or do

Early on, we made a decision that would have a profound effect on how we scaled the platform. You could either see it as a terrible decision or a swift kick in the ass. We chose to create an Ello user that everyone automatically followed when they joined the network. It's pretty much the MySpace Tom of Ello. The intention was good: use the Ello user for announcements and interesting posts curated from the network, by the network. The problem is that most of our scaling problems originated with this user.

All of the scaling issues that would have been irrelevant for months or years were staring us right in the face within the first month of having a significant user base. Because everyone automatically followed the Ello user, just about every user would receive any content posted from that account. In effect, millions of records would be created every time the Ello user posted. This continues to be both a blessing and a curse. Database contention? The Ello user is probably posting. Backed up queues? The Ello user is probably posting. Luckily we control this account, and we actually had to disable it until sharding was complete and unique indexes were built.

What seemed like a benign addition at the time ended up having a prodigious impact on how we scale the platform. Posting to the Ello account puts more load on the system than anything else, and we use this to keep tabs on our future scaling plans. Culturally, it's important for us to be able to post from the Ello account. Technically, it's a huge burden. It means that we need to scale the platform in accordance with one user, which is silly. But in retrospect it's a godsend for keeping us on our toes and being proactive about scaling.

It makes me wonder if, on future projects, it would be a good idea to implement the equivalent of the Ello user. Through self-inflicted pain, we have a better infrastructure. So the lesson here is: if the platform must stay ahead of impending scaling challenges, it's probably a good idea to self-inflict the problems early and often.

Lesson 5: It always takes 10 times longer

In the above sections, I managed to breeze through some difficult scaling lessons. Caching, sharding and optimizing are non-trivial engineering objectives. Thus far, I've been taken aback by just how difficult these endeavors end up being in practice.

Take caching the graph in Redis as an example. Going into it, it felt like something that could have been accomplished in a few days. The data's there, all we need to do is put it in Redis, and start directing traffic to Redis. Great, so step one is to write the scripts to migrate the data, that's easy. Step two is to populate the data in Redis. Oh, that's right, there are tens of millions of records that we're caching in multiple ways. Well, it'll take at least a couple hours to work our way through that many records. Yeah, but what about capturing the data that was inserted, updated and deleted within those two hours? We have to capture that as well. Better not have a bug in the migration process or there goes a few days of your life you'll never get back.

The sheer amount of practice alone for sharding can't be accounted for with a point-based estimate. Don't mess it up or you'll lose millions of people's data. No pressure. But say you've practiced enough to get comfortable with the process and you're confident it will go well. Things will always arise. We added more shards and realized our pgbouncer pool size was maxed out. Since the system was live and new data was being written to the new shards, we couldn't revert the changes or we'd lose data. We had to figure out on the fly that the non-intuitive errors meant we needed to increase the pool size. We didn't predict that disk fragmentation was going to be a huge problem either, yet it ended up becoming a top priority.

While trying to apply a unique index to the activities table, who would have thought there were so many duplicates? The initial strategy was to attempt to build the index and, when it failed, let the error message tell us where we had to remove duplicates. Building an index is slow, duh, so that won't scale if we have to attempt to build the index thousands of times. Ok, so write a query to remove the duplicates first. But wait, you can't just execute a blanket query across a billion records, it will never finish and will potentially hold heavy locks for hours at a time. Ok, so page through all the users and scope the query so it only removes duplicates for a subset of users. That works, but unfortunately there were a ton of orphaned rows for users that no longer existed. So while paging through all the users that currently exist, the query never deletes records for users who no longer exist yet somehow have orphaned records. Ok, so write a query to remove all activities for orphaned users. But wait, since the activities table doesn't live in the same database as the users table, you can't join against the users table to determine which activities records are orphaned. Not that that would scale anyway.
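For what it's worth, the user-scoped pass looked roughly like the sketch below. It's illustrative only: the column names, batch size, and keep-the-lowest-id rule are assumptions, not our exact cleanup script.

User.select(:id).find_in_batches(batch_size: 1_000) do |users|
  user_ids = users.map(&:id)

  # Remove duplicate activities for this batch of users, keeping the lowest id.
  Activity.connection.execute(<<-SQL)
    DELETE FROM activities a
          USING activities b
    WHERE a.user_id IN (#{user_ids.join(',')})
      AND b.user_id    = a.user_id
      AND b.subject_id = a.subject_id
      AND b.kind       = a.kind
      AND b.id < a.id
  SQL
end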

Sorry for rambling, but you get the point. The lesson here is for your mental health during a time of rapid scale: plan on everything taking 10 times longer than you anticipate. It just will.

Closing thoughts

You may have noticed a recurring theme within this article: the quantity of data is a big offender. There are other scaling challenges including team size, DNS, bot prevention, responding to users, inappropriate content, and other forms of caching. All of these can be equally laborious, but without a stable and scalable infrastructure, the opportunity for solving them diminishes.

Really, the TL;DR here is pretty short: cache aggressively, be proactive, push data into memory and scale vertically, highlight scaling bottlenecks sooner rather than later, and set aside plenty of man hours.

Oh, and enjoy the ride.

Posted by Mike Pack on 12/22/2014 at 09:37AM

Tags: ruby, rails, heroku, postgres, ello, scaling


"Incorrect" Timezone on Heroku

This is just a friendly reminder that Time.now on Heroku will give you server time, i.e. PDT/PST, and not the local time for your app.

Since you usually want time calculations to be done with the correct zone for your app, just make sure to include 'zone' in your call to Time.now, like so:

Time.zone.now

I have the following in my config:

config/application.rb

config.time_zone = 'Mountain Time (US & Canada)'

Here's the difference for me in Heroku's console:

Time.now #=> 2011-08-20 18:53:06 -0700
Time.zone.now #=> Sat, 20 Aug 2011 19:53:08 MDT -06:00

So, remember, always use Time.zone.now instead of Time.now.

Happy time travel!

Posted by Mike Pack on 08/20/2011 at 07:56PM

Tags: heroku, time


Parsing Excel files on Heroku with roo

When it comes to processing Excel files in Ruby, your options are slim. A quick Google or Github search might reveal roo. roo is an interesting beast. It appears that the RubyGems.org gem has been taken over, and there are some large inconsistencies between the official gem and the Github codebase. The official gem is at version 1.9.5 as of this writing and the Github repo is still stuck at 1.3.11. Don't you hate when that happens?

This guide doesn't involve using the actual roo API, see the Github page or rubyforge page for that.

Getting roo to work on Heroku

Without tweaking, roo doesn't work on Heroku. It makes the (bad) assumption that the file system is writable. Check out this line of the library to see where the temporary directory gets set. This line assigns a directory within the current working directory as the temporary file store.

On Heroku, that won't be possible. While the roo gem is packaged in the dyno, its directory is read-only.

Directly after the linked line is another line that refers to the environment variable ROO_TMP. In version 1.2.0, ROO_TMP was added to alleviate issues where the current directory is inadequate to store temporary files.

To get roo to work on Heroku, create an initializer and set the environment variable to the Rails tmp directory (the only writable directory on Heroku):

config/initializers/roo.rb

ENV['ROO_TMP'] = "#{RAILS_ROOT}/tmp/"

Now, when your app is run either on Heroku or locally, roo will use the Rails tmp directory to store its files.

Happy rooing!

Posted by Mike Pack on 08/02/2011 at 08:21PM

Tags: heroku, roo, excel


FreshBooks on Rack...on Heroku - Part Three

In Part One of this series, we constructed a "hello world" Rack app. In Part Two of this series, we brought our app to life with the ruby-freshbooks library. In this part, we'll quickly finish out by deploying our app to Heroku.

For the entire source code, head to https://github.com/mikepack/freshbooks_on_rack.

Getting the app on Heroku

The next step is getting your app on Heroku. You'll need a Heroku account first, obviously, and you'll need to configure your Heroku account so that your SSH key is recognized.

First, create your git repository and commit your files:

cd fb_on_rack_dir
git init
git add .
git commit -m 'Here is the app!'

Create your Heroku app:

heroku create your-fb-heroku-app

Push your changes to Heroku:

git push heroku master

Check it out! Open your browser, head to http://your-fb-heroku-app.heroku.com and you should see your running Rack app iterating over all your projects.

Rack HTTP Basic Authentication

Most likely, with all those valuable numbers, you don't want your app exposed to the world. The most simple way to remedy this is to add HTTP Basic Auth to your app. Rack makes this dead easy.

Change your config.ru file to look like the following:

config.ru

require 'rack'
require 'fb_on_rack'

use Rack::Auth::Basic, "FreshBooks on Rack" do |user, pass|
  user == 'me' and pass == 'secret'
end

run FBOnRack.new

This will present you with the oh-so-familiar HTTP Basic Auth prompt. We've hardcoded the username and password to be me and secret, respectively. You're now protected from all those internet wanderers.

A quick note about simplicity

While I greatly appreciate elucid's work on the library, ruby-freshbooks isn't perfect. During the creation of this app, I got some unexpected values as I was iterating through result sets. For instance, one iteration would provide me with a hash and the next an array. Here's an error indicative of the inconsistent result set:

TypeError at /
can't convert String into Integer

Ruby /mnt/hgfs/share/freskbooks_on_rack/fb_on_rack.rb: in [], line 25 Web GET localhost/

If you see this error, check your result set and handle it appropriately.

In Ruby 1.9.2, I would receive the following error as I started the Rack app with rackup:

ruby-1.9.2-p136/lib/ruby/site_ruby/1.9.1/rubygems/custom_require.rb:36:in `require': no such file to load -- fb_on_rack (LoadError)

This error is due to changes in how require resolves paths in Ruby 1.9. To fix this...

config.ru

require 'fb_on_rack'

...should look like...

config.ru

require './fb_on_rack'

I wouldn't, by any stretch, call this a production-ready app. Instead of handling every error and use case, I kept it simple. My main goal is to show you the basics of working with the ruby-freshbooks gem and the potential it holds for aggregating important data.

Happy Freshbooking!

Posted by Mike Pack on 05/12/2011 at 05:26PM

Tags: freshbooks, rack, heroku, api


FreshBooks on Rack...on Heroku - Part Two

In Part One of this series, we constructed a "hello world" Rack app, so to speak. In this part, we'll dive right into using the ruby-freshbooks gem and a little metaprogramming to keep things DRY.

Once again, for the entire source code, head to https://github.com/mikepack/freshbooks_on_rack.

Working with the ruby-freshbooks gem

ruby-freshbooks maps API calls in a somewhat object-oriented fashion. You can call #create, #update, #list, etc on a good number of API entities (like time_entry, client and staff). Check the FreshBooks API docs for a full list of available methods. Generally, the methods are called like the following:

connection = FreshBooks::Client.new('youraccount.freshbooks.com', 'yourfreshbooksapitoken')
connection.client.list
connection.staff.get :staff_id => 1

You can authenticate with your API token (as shown above) or OAuth. For instruction on authenticating with OAuth, check the ruby-freshbooks docs.

Now, let's take a look at the full Rack app that simply prints out all the projects for any number of accounts and totals the number of hours along with the total income.

Again, FBOnRack#call is invoked upon a request to our Rack app. This method is the heart and soul of our app.

fb_on_rack.rb

require 'ruby-freshbooks'

class FBOnRack
  @cachable_entities = ['staff', 'task']

  def initialize
    @connections = [FreshBooks::Client.new('account1.freshbooks.com', 'apitoken1'),
                   FreshBooks::Client.new('account2.freshbooks.com', 'apitoken2')]
  end

  def call(env)
    res = Rack::Response.new
    res.write "<title>FreshBooks on Rack</title>"
   
    @connections.each do |connection|
      connection.project.list['projects']['project'].each do |project|
        res.write "<h1>Project: #{project['name']}</h1>"
        total_income = 0.0
        total_hours = 0.0

        connection.time_entry.list(:project_id => project['project_id'])['time_entries']['time_entry'].each do |entry|
          rate = get_rate(connection, project, entry)
          total_hours += entry['hours'].to_f
          total_income += rate.to_f * entry['hours'].to_f
        end
        res.write "Total hours: #{total_hours}<br />"
        res.write "Total income: #{total_income}<br />"
      end
    end

    res.finish
  end

private

  @cachable_entities.each do |entity_name|
    cache_var = instance_variable_set("@#{entity_name}_cache", {})
    get_entity = lambda do |connection, entity_id|
      if cache_var.has_key?(entity_id) # Check if the entity is already cached
        cache_var[entity_id]
      else
        entity = connection.send(entity_name).get(("#{entity_name}_id").to_sym => entity_id)[entity_name] # Make the API call for whatever entity
        cache_var[entity_id] = entity # Cache the API call
      end
    end
    define_method(("get_#{entity_name}").to_sym, get_entity)
  end

  def get_rate(connection, project, entry)
    case project['bill_method']
    when 'project-rate'
      project['rate']
    when 'staff-rate'
      get_staff(connection, entry['staff_id'])['rate']
    when 'task-rate'
      get_task(connection, entry['task_id'])['rate']
    end
  end
end

Let's break this down a little.

The first thing we should tackle is that strange loop at the start of the private definitions:

@cachable_entities.each do |entity_name|
  cache_var = instance_variable_set("@#{entity_name}_cache", {})
  get_entity = lambda do |connection, entity_id|
    if cache_var.has_key?(entity_id) # Check if the entity is already cached
      cache_var[entity_id]
    else
      entity = connection.send(entity_name).get(("#{entity_name}_id").to_sym => entity_id)[entity_name] # Make the API call for whatever entity
      cache_var[entity_id] = entity # Cache the API call
    end
  end
  define_method(("get_#{entity_name}").to_sym, get_entity)
end

The gist of this is to define a caching mechanism so we're not slamming the FreshBooks API. If we fetch an entity by the entity's ID, cache the result for that ID. Let's break this chunk of code down once we're inside the loop:

cache_var = instance_variable_set("@#{entity_name}_cache", {})

This does what you would expect: it defines a class instance variable. @cachable_entities contains the two entities we want to cache, staff and task. So in the end we have two class instance variables which act as in-memory caches: @staff_cache = {} and @task_cache = {}.

get_entity = lambda do |connection, entity_id|
  if cache_var.has_key?(entity_id) # Check if the entity is already cached
    cache_var[entity_id]
  else
    entity = connection.send(entity_name).get(("#{entity_name}_id").to_sym => entity_id)[entity_name] # Make the API call for whatever entity
    cache_var[entity_id] = entity # Cache the API call
  end
end

Here we define a closure which will fetch our result either from the cache or make a call to the API. After retrieving our entity from the API, we cache it.

define_method(("get_#{entity_name}").to_sym, get_entity)

Here we define our methods (#get_staff(connection, staff_id) and #get_task(connection, task_id)) as private instance methods. The get_entity parameter here is our lambda closure we defined above.

#get_staff and #get_task are called within our #get_rate method (but could be used elsewhere). #get_rate returns the rate which should be used for a given time entry. Rates can be project-based, staff-based or task-based. We need to find the appropriate rate based on the project['bill_method'].

Modify this code to your needs, restart your Rack server, visit http://localhost:9292/ and you should see all your projects, the total time spent on each and the total income from each.

If you've made it this far, give yourself a pat on the rear because this part in the series is definitely the hardest. Let me know if you have any issues understanding the FBOnRack class above. In Part Three of this series, we'll finish off by deploying to Heroku and baking a cake.

Posted by Mike Pack on 05/04/2011 at 11:42AM

Tags: freshbooks, rack, heroku, api


FreshBooks on Rack...on Heroku - Part One

I love FreshBooks. It makes my time tracking incredibly easy and my invoicing hassle free. That's not all; their website is extremely powerful but feels lightweight and friendly to use. As a software engineer, I really appreciate and expect my invoicing tool to be Web 2.0 and fun.

For its free account, FreshBooks only allows you to have 3 clients. You can have any number of projects under those 3 clients, but they set a cap in hopes you'll pay for their service. Well, I have more than 3 clients and I love freemium. FreshBooks allows you to have any number of freemium accounts. While it's a pain in the ass to have to switch between accounts for invoicing, it's worth the $20/month I'm saving...for now.

Another annoying thing about working with numerous freemium accounts is that you can't quickly calculate numbers across all of your projects. For instance, the total income from all projects or the total projected income for a single month. To remedy this, I wanted to create a lightweight Heroku app which would poll all my FreshBooks freemium accounts and calculate some numbers, specifically the total hours spent and the total income for each project. FreshBooks has an API which allows me to do just that. +1 FreshBooks!

In this three-part series we'll build the app from the ground up, starting with Rack and then deploying to Heroku. Let's get started.

For the source of the entire project, head to https://github.com/mikepack/freshbooks_on_rack.

Rails to Rack

One consideration for this little project was which framework to use, or whether a framework was appropriate at all. I sought the following:

  • Rack based...It needs to run on Heroku
  • Lightweight...I don't have a model so I don't need MVC

I love frameworks. They make most arduous tasks simple. I'm a Rails programmer but I realize for a project this small, Rails is way overkill. Other potential overkill frameworks include Merb, Sinatra and even Camping. I stumbled upon Waves, a resource oriented, ultra-lightweight framework. Unfortunately, Waves has been dead since 2009. So I decided to ditch the framework and go straight to Rack.

Rack Setup

If you don't have Rack yet, install the gem:

gem install rack

Create a directory for your project:

mkdir freshbooks_on_rack
cd freshbooks_on_rack

Rack expects a configuration file to end in .ru:

config.ru

require 'rack'

You should now be able to run your Rack application from the command line:

rackup

Visit http://localhost:9292/ to see your app in action. It won't do anything yet (it'll throw an error), but the Rack basics are there.

FreshBooks Setup

FreshBooks has a great API. There's also a great corresponding gem which ruby-fies the API responses. The ruby-freshbooks gem is an isomorphic, flexible library with very few dependencies. freshbooks.rb is another FreshBooks API wrapper gem, but it has one major gripe: it uses a global connection, so you can't (naturally) pull in information from different FreshBooks accounts.

Heroku uses Bundler so create a Gemfile (you don't need the rack gem in your Gemfile):

Gemfile

source 'http://rubygems.org'
gem 'ruby-freshbooks'

Install the gem with Bundler:

bundle

Creating the Rack App

Let's create a class which will handle our Rack implementation:

fb_on_rack.rb

require 'ruby-freshbooks'

class FBOnRack
  def call(env)
    res = Rack::Response.new
    res.write '<title>FreshBooks on Rack</title>'
    res.finish
  end
end

All our class does so far is respond to a Rack call with a title tag. I use Rack::Response here to make things a little easier when it comes to the expected return value of our #call method. Take a look at this example for a reference on how to respond without using Rack::Response.

Now lets update our config.ru file to call our FBOnRack class:

config.ru

require 'rack'
require 'fb_on_rack'

run FBOnRack.new

FBOnRack#call will now be invoked when a request is made to our Rack app. Restart your Rack app (you'll need to do this on every code change), visit http://localhost:9292/ and you should see a blank page with a title of "FreshBooks on Rack".

Tada! You've just created a sweet Rack app, you should feel proud. In Part Two of this series, we'll take a look at how to work with the ruby-freshbooks gem. We'll generate simple, yet useful data and get our hands dirty in some metaprogramming.

Posted by Mike Pack on 05/01/2011 at 02:37PM

Tags: freshbooks, rack, heroku, api