Scheduler exposes an HTTP-server that gives a gateway to the scheduler-worker-cluster. HTTP requests to the scheduler api is translated and proxied into workers.
- Scheduler runs an HTTP server as well as a GRPC server.
- Scheduler keeps record of the available workers in the cluster.
- Workers can register and deregister on the scheduler.
Data Structures
When a new worker registers to the scheduler, a worker object is created. Pointer to this worker object is kept in the workers map with key being the worker id and value being the pointer to the worker. Worker objects holds the id of the worker and the address of the worker's GRPC server to start/end/query jobs.
workers
var (
workersMutex = &sync.Mutex{}
workers = make(map[string]*worker)
)
// worker holds the information about registered workers
// - id: uuid assigned when the worker first register.
// - addr: workers network address, later used to create grpc client to the worker
type worker struct {
id string
addr string
}
Configuration for the scheduler is done through the config.toml file.
I wrote about how I approach configuration in Go projects. You can read more hereHandling configuration in Go
grpc_server:
- addr: Address on which the GRPC server will be run.
- use_tls: Whether the GRPC server should use TLS. If true, crt_file and key_file should be provided.
- crt_file: Path to the certificate file for TLS.
- key_file: Path to the key file for TLS.
http_server:
- addr: Address on which the HTTP server will be run on.
scheduler/config.toml
[grpc_server]
addr = "127.0.0.1:50000"
use_tls = false
crt_file = "server.pem"
key_file = "server.key"
[http_server]
addr = "127.0.0.1:3000"
GRPC Server
Only 2 GRPC requests required in the scheduler GRPC server which are used to register and deregister workers on the scheduler.
Scheduler GRPC server definitions
service Scheduler {
rpc RegisterWorker(RegisterReq) returns (RegisterRes) {}
rpc DeregisterWorker(DeregisterReq) returns (DeregisterRes) {}
}
message RegisterReq {
string address = 1;
}
message RegisterRes {
bool success = 1;
string workerID = 2;
}
message DeregisterReq {
string workerID = 1;
}
message DeregisterRes {
bool success = 1;
}
RegisterWorker is called by any worker that comes online, the scheduler then assigns a unique worker ID to that worker and registers the worker in the workers map with the id and the GRPC address passed in the request. This address will later be used to start/stop/query jobs on the worker.
DeregisterWorker is called by any worker that goes offline, the workerID specified in the deregister request will be removed from known workers.
HTTP API
There are 3 requests that can be made to the scheduler via HTTP. These are to start/stop/query jobs on a specific worker.
A job is calling a specified file via a specified command on a worker. For example: if you have a python worker, scheduler can call a python script via python3 and query the results later.
Start job:
POST "/start"
Request:
- "Run which file on which worker with what command"
- Provide: command / path / worker_id
Response:
- returns job_id: which can be used to query or stop the job later.
example start request
{
"worker_id": "d5feef60-3029-11e9-b210-d663bd873d93",
"command": "bash",
"path": "worker/scripts/count.sh"
}
example start response - success
{
"job_id": "4c5ced1c-5ea9-40f8-90ce-63d09cea26f6"
}
example start response - fail
{
"error": "worker not found"
}
Stop job:
POST "/stop"
Request:
- "Stop which job on which worker"
- Provide: worker_id / job_id
Response:
- returns success: boolean response.
example stop request
{
"worker_id": "d5feef60-3029-11e9-b210-d663bd873d93",
"job_id": "4c5ced1c-5ea9-40f8-90ce-63d09cea26f6"
}
example stop response - success
{
"success": true
}
example stop response - fail
{
"error": "worker not found"
}
Query job:
POST "/stop"
Request:
- "Query which job on which worker"
- Provide: worker_id / job_id
Response:
- returns done: boolean for if the job is done.
- returns error: boolean for if the job had an error.
- returns error_text: string for the error thrown by the job.
example query request
{
"worker_id": "d5feef60-3029-11e9-b210-d663bd873d93",
"job_id": "4c5ced1c-5ea9-40f8-90ce-63d09cea26f6"
}
example query response - success
{
"done": true,
"error": true,
"error_text": "signal: killed"
}
example query response - fail
{
"error": "worker not found"
}
HTTP request and response structs on scheduler
// apiStartJobReq expected API payload for `/start`
type apiStartJobReq struct {
Command string `json:"command"`
Path string `json:"path"`
WorkerID string `json:"worker_id"`
}
// apiStartJobRes returned API payload for `/start`
type apiStartJobRes struct {
JobID string `json:"job_id"`
}
// apiStopJobReq expected API payload for `/stop`
type apiStopJobReq struct {
JobID string `json:"job_id"`
WorkerID string `json:"worker_id"`
}
// apiStopJobRes returned API payload for `/stop`
type apiStopJobRes struct {
Success bool `json:"success"`
}
// apiQueryJobReq expected API payload for `/query`
type apiQueryJobReq struct {
JobID string `json:"job_id"`
WorkerID string `json:"worker_id"`
}
// apiQueryJobRes returned API payload for `/query`
type apiQueryJobRes struct {
Done bool `json:"done"`
Error bool `json:"error"`
ErrorText string `json:"error_text"`
}
// apiError is used as a generic api response error
type apiError struct {
Error string `json:"error"`
}