[ WARNING:  EXTREMELY PRELIMINARY! ]

GearMan: A distributed job system
Brad Whitaker <whitaker@danga.com>

==================================

TODO: error responses to malformed/unexpected packets?
      priorities, expirations of old/irrelevant jobs
      upper-layer handling of async system going down + reopulation of jobs

Architecture:

    [ job_server_1 ] \
    [ job_server_2 ]  | <====> application
    [ job_server_n ] /

         \ | /
          -*-    persistent tcp connections between job servers/workers
         / | \ 

      [ worker_1 ]
      [ worker_2 ]
      [ worker_n ]


Guarantees: 

1) Each job server will kill dups within that job server (not global)
2) Jobs will be retried as specified, as long as the job server is running
3) Each worker will have exactly one task at a time
   ... ?

Non-Guarantees:

1) Global duplicate checking is not provided
2) No job is guaranteed to complete
3) Loss of any job server will lose any jobs registered with it
   ... ?


Work flow:

1) Job server starts up, all workers connect and announce their 

2) Application sends job to random job server, noting the job record so it can
   be recreated in the future if necessary.

   a) Synchronous: Application stays connected and waits for response
   b) Asynchronous: 

3) Job server handles job:

   Possible messages:

      [worker => job server]
      "here's what i can do."
      "goodbye."
      "i'm about to sleep."
      "got a job for me?"
      "i'm 1/100 complete now."
      "i've completed my job."

      [job server => worker]
      "noop." / "wake up."
      "here's a job to do."

      [application => job server]
      "create this new job."
      "how far along is this job?"
      "is this job finished?"

      [job server => application]
      "okay (here is its handle)."
      "job is 1/100 complete."
      "job completed as follows: ..."


   Request/Response cycles:

      [ worker <=> job server ]
      "here's what i can do"       => (announcement)
      "goodbye"                    => (announcement)
      "i'm about to sleep"         => (announcement)
      "i'm 1/100 complete now"     => (announcement)
      "i've completed my job"      => (announcement)
      "got a job for me?"          => "here's a job to do."

      [ application <=> job server ]
      "create this new job."       => "okay (here is its handle)."
      "how far along is this job?" => "job is 1/100 complete."
      "is this job finished?"      => "job completed as follows: ..."   

      [ job server <=> worker ]
      "wake up."                   => (worker wakes up from sleep)
      "here is a job to do" => "i'm 1/100 complete now."
                            => "i've completed my job"

      [ job server <=> application ]
      (only speaks in response to application requests)      


   Best case conversation example:

      worker_n     => job_server_n: "got a job for me?"
      job_server_n => worker_n:     "yes, here is a job i've locked for you"
      worker_n     => job_server_n: "here is the result"

   Worse case:

      while ($time < $sleep_threshold) {
         for $js (1..n) {
            worker => job_server_$js: "got a job for me?"
            job_server_$js => worker: "no, sorry"
         }
      }

      worker => all_job_servers: "going to sleep"

      [ worker receives wakeup packet ] or [ t seconds elapse ]

      worker wakes up and resumes loop

      
Packet types:

   Generic header:

      [ 4 byte magic
        4 byte packet type       
        4 byte length ]

      Magic:
         4 opaque bytes to verify state machine "\0REQ" or "\0RES"

      Packet type:
         (see Gearman::Util)

      Length:
         Post-header data length


Properties of a job:

     func -- what function name
     opaque scalar arg  (passed through end-to-end, no interpretation by libraries/server)
     uniq key -- for merging  (default: don't merge, "-" means merge on opaque scalar)

     retry count
     fail after time  -- treat a timeout as a failure
     do job if dequeued and no listeners ("submit_job_bg")
     priority ("submit_job_high")
     on_* handlers

     behavior when there's no worker registered for that job type?


Notes:

   -- document whether on_fail gets called on all failures, or just last one, when retry_count is in use
   -- document that uniq merging isn't guaranteed, just that it's permitted.  if two tasks must not run
      at the same time, the task itself needs to do appropriate locking with itself / other tasks.
   -- the uniq merging will generally work in practice with multiple Job servers because the client
      hashes the (func + opaque_arg) onto the set of servers



Task summary:

1) mail
   name    => mail
   dupkey  => '' (don't check dups)
   type    => async+handle
   args    => storable MIME::Lite/etc

2) gal resize
   name    => resize
   dupkey  => uid-gallid-w-h
   type    => async+handle
   args    => storable of images to resize/how   

3) thumb pregen
   name    => thumbgen
   dupkey  => uid-upicid-w-h
   type    => async+handle
   args    => storable of images to resize

4) LJ crons
   name    => pay_updateaccounts/etc
   dupkey  => '' (no dup checking)
   type    => async
   args    => @ARGV to pass to job?

6) Dirty Flushing
   name    => dirty
   dupkey  => friends-uid, backup-uid, etc
   type    => async+handle
   args    => none

7) CmdBuffer jobs
   name    => weblogscom
   dupkey  => uid
   type    => async+throw-away
   args    => none

8) RSS fetching
   name    => rss
   dupkey  => uid
   type    => async+handle
   args    => none

9) captcha generation
   name    => captcha
   dupkey  => dd-hh ? maybe '1' or something
   type    => async+throw-away
   args    => none

10) birthday emails
   name    => bday
   dupkey  => yyyy-mm-dd
   type    => async+handle
   args    => none

11) restart web nodes
   -- ask brad about this?