[ WARNING: EXTREMELY PRELIMINARY! ]
GearMan: A distributed job system
Brad Whitaker <whitaker@danga.com>
==================================
TODO: error responses to malformed/unexpected packets?
priorities, expirations of old/irrelevant jobs
upper-layer handling of async system going down + repopulation of jobs
Architecture:
    [ job_server_1 ] \
    [ job_server_2 ]  | <====> application
    [ job_server_n ] /
           \ | /
           -*- persistent tcp connections between job servers/workers
           / | \
      [ worker_1 ]
      [ worker_2 ]
      [ worker_n ]
Guarantees:
1) Each job server will kill dups within that job server (not global)
2) Jobs will be retried as specified, as long as the job server is running
3) Each worker will have exactly one task at a time
... ?
Non-Guarantees:
1) Global duplicate checking is not provided
2) No job is guaranteed to complete
3) Loss of any job server will lose any jobs registered with it
... ?
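The difference between guarantee 1 and non-guarantee 1 can be illustrated with a minimal sketch. It assumes each job server keeps an in-memory table of pending jobs keyed by (func, uniq key); the class and method names are hypothetical and not taken from the GearMan code.

```python
import itertools


class JobServer:
    """Toy job server: merges duplicate jobs locally, knows nothing of its peers."""

    def __init__(self, name):
        self.name = name
        self._handles = {}                  # (func, uniq) -> handle of the pending job
        self._counter = itertools.count(1)

    def submit(self, func, arg, uniq=None):
        """Return a handle; reuse the pending handle when (func, uniq) matches."""
        if uniq is not None and (func, uniq) in self._handles:
            return self._handles[(func, uniq)]    # dup killed on this server only
        handle = f"{self.name}:{next(self._counter)}"
        if uniq is not None:
            self._handles[(func, uniq)] = handle
        return handle


a, b = JobServer("js1"), JobServer("js2")
h1 = a.submit("resize", b"img-data", uniq="42-7-100-100")
h2 = a.submit("resize", b"img-data", uniq="42-7-100-100")   # same server: merged
h3 = b.submit("resize", b"img-data", uniq="42-7-100-100")   # other server: not merged
```

Here h1 and h2 are the same handle, while h3 is a separate job because js2 never sees what js1 holds.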
Work flow:
1) Job server starts up, all workers connect and announce their abilities ("here's what i can do.")
2) Application sends job to random job server, noting the job record so it can
be recreated in the future if necessary.
a) Synchronous: Application stays connected and waits for response
b) Asynchronous:
3) Job server handles job:
Possible messages:
[worker => job server]
"here's what i can do."
"goodbye."
"i'm about to sleep."
"got a job for me?"
"i'm 1/100 complete now."
"i've completed my job."
[job server => worker]
"noop." / "wake up."
"here's a job to do."
[application => job server]
"create this new job."
"how far along is this job?"
"is this job finished?"
[job server => application]
"okay (here is its handle)."
"job is 1/100 complete."
"job completed as follows: ..."
Request/Response cycles:
[ worker <=> job server ]
"here's what i can do" => (announcement)
"goodbye" => (announcement)
"i'm about to sleep" => (announcement)
"i'm 1/100 complete now" => (announcement)
"i've completed my job" => (announcement)
"got a job for me?" => "here's a job to do."
[ application <=> job server ]
"create this new job." => "okay (here is its handle)."
"how far along is this job?" => "job is 1/100 complete."
"is this job finished?" => "job completed as follows: ..."
[ job server <=> worker ]
"wake up." => (worker wakes up from sleep)
"here's a job to do." => "i'm 1/100 complete now."
                      => "i've completed my job."
[ job server <=> application ]
(only speaks in response to application requests)
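The cycles above can be written down as a small lookup table. This is only a sketch: the strings are the informal messages used in this document rather than wire formats, the function name is hypothetical, and raising an error on an unexpected message is an assumption, since error responses are still an open TODO.

```python
# Requests that expect a reply from the job server, keyed by (sender, message).
REPLIES = {
    ("worker", "got a job for me?"): "here's a job to do.",
    ("application", "create this new job."): "okay (here is its handle).",
    ("application", "how far along is this job?"): "job is 1/100 complete.",
    ("application", "is this job finished?"): "job completed as follows: ...",
}

# Worker messages the job server only records; no reply is sent.
ANNOUNCEMENTS = {
    "here's what i can do.",
    "goodbye.",
    "i'm about to sleep.",
    "i'm 1/100 complete now.",
    "i've completed my job.",
}


def job_server_reply(sender, message):
    """Return the job server's reply, or None for a pure announcement."""
    if sender == "worker" and message in ANNOUNCEMENTS:
        return None
    try:
        return REPLIES[(sender, message)]
    except KeyError:
        # Placeholder behaviour; the real error handling is undecided.
        raise ValueError(f"unexpected message from {sender}: {message!r}")
```

The table only covers the case where a job is available; a job server with nothing queued would answer "got a job for me?" with "no, sorry" instead.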
Best case conversation example:
worker_n => job_server_n: "got a job for me?"
job_server_n => worker_n: "yes, here is a job i've locked for you"
worker_n => job_server_n: "here is the result"
Worst case:
while ($time < $sleep_threshold) {
for $js (1..n) {
worker => job_server_$js: "got a job for me?"
job_server_$js => worker: "no, sorry"
}
}
worker => all_job_servers: "going to sleep"
[ worker receives wakeup packet ] or [ t seconds elapse ]
worker wakes up and resumes loop
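The polling loop can be sketched in a few lines. This assumes each job server is represented by a callable that returns a job or None ("no, sorry"), and it takes the clock as a parameter so the example runs without real waiting; all names here are hypothetical.

```python
def poll_until_sleep(job_servers, sleep_threshold, clock):
    """Ask every job server for work until one answers or the threshold passes.

    job_servers: list of callables returning a job, or None for "no, sorry".
    clock: callable returning the current time in seconds.
    Returns the job, or None when the worker should announce that it sleeps.
    """
    start = clock()
    while clock() - start < sleep_threshold:
        for grab_job in job_servers:
            job = grab_job()           # "got a job for me?"
            if job is not None:
                return job             # "here's a job to do."
    return None                        # caller sends "i'm about to sleep."


# Fake clock that advances one second per reading.
ticks = iter(range(1000))
clock = lambda: next(ticks)

idle = [lambda: None, lambda: None]
result_idle = poll_until_sleep(idle, sleep_threshold=5, clock=clock)

busy = [lambda: None, lambda: "resize:42"]
result_busy = poll_until_sleep(busy, sleep_threshold=5, clock=clock)
```

With only idle servers the function returns None after the threshold, which is the point where the worker would tell all job servers it is going to sleep and wait for a wakeup packet or a timeout.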
Packet types:
Generic header:
    [ 4 byte magic
      4 byte packet type
      4 byte length ]
Magic:
    4 opaque bytes to verify state machine: "\0REQ" or "\0RES"
Packet type:
    (see Gearman::Util)
Length:
    Post-header data length
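A minimal sketch of reading and writing the 12-byte generic header follows. It assumes the packet type and length are unsigned integers in network (big-endian) byte order, which this document does not state, and it uses a made-up packet type number; the real values are in Gearman::Util.

```python
import struct

HEADER = struct.Struct(">4sII")   # magic, packet type, length; big-endian is an assumption
MAGIC_REQ = b"\0REQ"
MAGIC_RES = b"\0RES"


def pack_packet(magic, packet_type, data=b""):
    """Prefix data with the 12-byte generic header."""
    return HEADER.pack(magic, packet_type, len(data)) + data


def unpack_packet(buf):
    """Split one complete packet into (magic, packet_type, data)."""
    magic, packet_type, length = HEADER.unpack_from(buf)
    if magic not in (MAGIC_REQ, MAGIC_RES):
        raise ValueError(f"bad magic: {magic!r}")
    data = buf[HEADER.size:HEADER.size + length]
    if len(data) != length:
        raise ValueError("truncated packet")
    return magic, packet_type, data


EXAMPLE_TYPE = 1   # hypothetical value, for illustration only
packet = pack_packet(MAGIC_REQ, EXAMPLE_TYPE, b"resize")
```

Checking the magic first lets a reader detect that it has lost its place in the byte stream before trusting the length field.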
Properties of a job:
func -- what function name
opaque scalar arg (passed through end-to-end, no interpretation by libraries/server)
uniq key -- for merging (default: don't merge, "-" means merge on opaque scalar)
retry count
fail after time -- treat a timeout as a failure
do job if dequeued and no listeners ("submit_job_bg")
priority ("submit_job_high")
on_* handlers
behavior when there's no worker registered for that job type?
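The properties listed above could be collected into a single record along these lines. The field names and defaults are hypothetical; only the meanings in the comments come from the list above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Job:
    func: str                               # what function name
    arg: bytes = b""                        # opaque scalar, never interpreted in transit
    uniq: Optional[str] = None              # None: don't merge; "-": merge on arg
    retry_count: int = 0
    fail_after: Optional[float] = None      # seconds; a timeout counts as a failure
    background: bool = False                # do the job even with no listeners
    high_priority: bool = False
    handlers: Dict[str, Callable] = field(default_factory=dict)   # on_* callbacks

    def merge_key(self):
        """Key used for merging, or None when this job must not be merged."""
        if self.uniq is None:
            return None
        if self.uniq == "-":
            return (self.func, self.arg)
        return (self.func, self.uniq)


mail = Job("mail", b"message-body")
resize = Job("resize", b"img-data", uniq="-")
thumb = Job("thumbgen", b"img-data", uniq="uid-upicid-w-h")
```

In this sketch mail.merge_key() is None, resize merges on its argument, and thumb merges on its explicit uniq key.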
Notes:
-- document whether on_fail gets called on all failures, or just the last one, when retry_count is in use
-- document that uniq merging isn't guaranteed, just that it's permitted. If two tasks must not run
at the same time, the task itself needs to do appropriate locking against itself / other tasks.
-- uniq merging will generally work in practice with multiple job servers because the client
hashes the (func + opaque_arg) onto the set of servers
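The last note can be sketched as follows. The choice of CRC32, the separator byte, and the server addresses are assumptions made for illustration; the document only says that the client hashes (func + opaque_arg) onto the set of servers.

```python
import zlib


def pick_job_server(servers, func, opaque_arg):
    """Map (func + opaque_arg) onto one server; the same input picks the same server."""
    key = func.encode() + b"\0" + opaque_arg
    return servers[zlib.crc32(key) % len(servers)]


servers = ["js1:7003", "js2:7003", "js3:7003"]   # hypothetical addresses
first = pick_job_server(servers, "resize", b"uid-gallid-w-h")
second = pick_job_server(servers, "resize", b"uid-gallid-w-h")
```

Both calls land on the same job server, so that server's local duplicate check can merge them. If the server list changes, keys may map to different servers, which is one reason merging is permitted rather than guaranteed.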
Task summary:
1) mail
name => mail
dupkey => '' (don't check dups)
type => async+handle
args => storable MIME::Lite/etc
2) gal resize
name => resize
dupkey => uid-gallid-w-h
type => async+handle
args => storable of images to resize/how
3) thumb pregen
name => thumbgen
dupkey => uid-upicid-w-h
type => async+handle
args => storable of images to resize
4) LJ crons
name => pay_updateaccounts/etc
dupkey => '' (no dup checking)
type => async
args => @ARGV to pass to job?
6) Dirty Flushing
name => dirty
dupkey => friends-uid, backup-uid, etc
type => async+handle
args => none
7) CmdBuffer jobs
name => weblogscom
dupkey => uid
type => async+throw-away
args => none
8) RSS fetching
name => rss
dupkey => uid
type => async+handle
args => none
9) captcha generation
name => captcha
dupkey => dd-hh ? maybe '1' or something
type => async+throw-away
args => none
10) birthday emails
name => bday
dupkey => yyyy-mm-dd
type => async+handle
args => none
11) restart web nodes
-- ask brad about this?