Executor
The Valve Infra executor is the service that coordinates different services to enable time-sharing test machines, AKA devices under test (DUTs).
This service can be interacted with using executorctl, our client, and/or our REST API.
The executor tracks the state of each DUT. Here is a flow chart of the transitions between the different states.
Let’s see what every state of a DUT means:
IDLE
: The device is available (but powered down to save energy), waiting for a job.
TRAINING
: The device is being tested for boot reliability (20 rounds by default).
RETIRED
: The device is undergoing maintenance, and cannot accept jobs.
QUICK_CHECK
: The device is verifying that its current configuration matches what is described in the database.
QUEUED
: The device has been chosen to execute a job, but the executor isn’t ready just yet (expected to last <1s).
RUNNING
: The device is running a job.
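As a minimal sketch, the states listed above can be modelled as an enumeration. The names DutState and can_accept_job are hypothetical and only illustrate the descriptions; they are not taken from the executor's code, and the transitions between states are not modelled here.

```python
from enum import Enum


class DutState(Enum):
    """DUT states named in the list above; values paraphrase the descriptions."""

    IDLE = "available, powered down, waiting for a job"
    TRAINING = "being tested for boot reliability"
    RETIRED = "undergoing maintenance, cannot accept jobs"
    QUICK_CHECK = "verifying its configuration against the database"
    QUEUED = "chosen for a job, executor not ready yet"
    RUNNING = "running a job"


def can_accept_job(state: DutState) -> bool:
    """Hypothetical helper: only an IDLE machine is waiting for a job."""
    return state is DutState.IDLE
```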
Configuration
The executor service is configured through the use of environment variables.
Here are the options relevant to most deployments:
BOOTS_TFTP_ROOT
: Cache folder for boot-related artifacts (default: /mnt/tmp/boots/tftp)
BOOTS_DEFAULT_*
: See Default boot configuration for more details.
EXECUTOR_URL
: HTTP url of the executor service, reachable locally and from the test machines (default: http://ci-gateway)
EXECUTOR_ARTIFACT_CACHE_ROOT
: Folder to use as a cache for the kernel/initrd artifacts used by the jobs (recommended, default: None)
FARM_NAME
: Name of the test farm (mandatory, default: None)
GITLAB_CONF_FILE
: Path to the gitlab runner configuration file, which will be overwritten as new test machines are added to the farm (default: /etc/gitlab-runner/config.toml)
GITLAB_CONF_TEMPLATE_FILE
: Template to use for the creation of the gitlab runner configuration file (default: $package_dir/templates/gitlab_runner_config.toml.j2)
MARS_DB_FILE
: Path to the database (default: /mnt/permanent/mars_db.yaml)
MINIO_URL
: URL to the local minio service, accessible both locally and by test machines (default: http://ci-gateway:9000)
MINIO_ROOT_USER
: Admin username for the local minio service (default: minioadmin)
MINIO_ROOT_PASSWORD
: Admin password for the local minio service (default: minio-root-password)
PRIVATE_INTERFACE
: Network interface connected to the DUTs’ network (default: private)
SALAD_URL
: URL to the salad service (default: http://ci-gateway:8005)
SERGENT_HARTMAN_BOOT_COUNT
: How many rounds of testing should be used to qualify a test machine (default: 100)
SERGENT_HARTMAN_QUALIFYING_BOOT_COUNT
: How many successful rounds of testing should be used to qualify a test machine (default: 100)
SERGENT_HARTMAN_REGISTRATION_RETRIAL_DELAY
: How many seconds should be waited after an unsuccessful registration attempt before trying another one (default: 120)
And here are the lower-level options:
BOOTS_DISABLE_SERVERS
: Set to a non-empty value to disable netbooting services (DHCP and TFTP). (default: None)
CONSOLE_PATTERN_DEFAULT_MACHINE_UNFIT_FOR_SERVICE_REGEX
: Automatically tag a DUT as unfit for service if it generates a line matched by this regular expression (default: None)
EXECUTOR_HOST
: Binding address for the HTTP service (default: 0.0.0.0)
EXECUTOR_PORT
: Binding port for the HTTP service (default: 80)
EXECUTOR_REGISTRATION_JOB
: Local path to the registration job (default: $package_dir/job_templates/register.yml.j2)
EXECUTOR_BOOTLOOP_JOB
: Local path to the bootloop job (default: $package_dir/job_templates/bootloop.yml.j2)
EXECUTOR_VPDU_ENDPOINT
: Automatically add a virtual PDU for local testing (format: host:port, default: None)
MINIO_ADMIN_ALIAS
: Alias set up by the executor to refer to the minio instance specified by MINIO_URL, MINIO_ROOT_USER, and MINIO_ROOT_PASSWORD (default: local)
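To illustrate how these options behave, here is a minimal sketch that reads a few of them from the environment and falls back to the defaults documented above. The ExecutorSettings class and the load_settings function are hypothetical names for this example and do not describe how the executor itself parses its configuration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ExecutorSettings:
    """Hypothetical container for a subset of the documented options."""

    farm_name: str
    executor_url: str
    minio_url: str
    salad_url: str
    mars_db_file: str
    artifact_cache_root: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> ExecutorSettings:
    """Read the options from env, using the defaults listed in this section."""
    farm_name = env.get("FARM_NAME")
    if not farm_name:
        # FARM_NAME is documented as mandatory, with no default.
        raise ValueError("FARM_NAME is mandatory and has no default")
    return ExecutorSettings(
        farm_name=farm_name,
        executor_url=env.get("EXECUTOR_URL", "http://ci-gateway"),
        minio_url=env.get("MINIO_URL", "http://ci-gateway:9000"),
        salad_url=env.get("SALAD_URL", "http://ci-gateway:8005"),
        mars_db_file=env.get("MARS_DB_FILE", "/mnt/permanent/mars_db.yaml"),
        # Recommended but optional: stays None when unset.
        artifact_cache_root=env.get("EXECUTOR_ARTIFACT_CACHE_ROOT"),
    )
```

Passing a plain dictionary instead of os.environ makes it easy to check a planned deployment configuration before starting the service.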
Default boot configuration
When an unsolicited boot request is received by the executor (e.g. an admin added a new test machine), it needs to know which kernel/initrd/cmdline this test machine needs to run in order to complete its registration.
Here are the most relevant options:
BOOTS_DEFAULT_KERNEL
: Default kernel to use to boot unknown test machines (default: http://ci-gateway:9000/boot/default_kernel)
BOOTS_DEFAULT_INITRD
: Default initramfs to use to boot unknown test machines (default: http://ci-gateway:9000/boot/default_boot2container.cpio.xz)
BOOTS_DEFAULT_CMDLINE
: Default kernel command line to use to boot unknown test machines (default: b2c.container="-ti --tls-verify=false docker://ci-gateway:8002/mupuf/valve-infra/machine_registration:latest register" b2c.ntp_peer="ci-gateway" b2c.cache_device=none loglevel=6)
However, since no single kernel/initramfs may be suitable for all the possible DUTs, the executor will look for the most suitable value by checking its environment variables in the following order:
BOOTS_DEFAULT_${BOOTLOADER}_${ARCH}_${PLATFORM}_[KERNEL|INITRD|CMDLINE]
BOOTS_DEFAULT_${ARCH}_${PLATFORM}_[KERNEL|INITRD|CMDLINE]
BOOTS_DEFAULT_${BOOTLOADER}_${ARCH}_[KERNEL|INITRD|CMDLINE]
BOOTS_DEFAULT_${ARCH}_[KERNEL|INITRD|CMDLINE]
BOOTS_DEFAULT_${BOOTLOADER}_[KERNEL|INITRD|CMDLINE]
BOOTS_DEFAULT_[KERNEL|INITRD|CMDLINE]
With the variables taking the following values:
${BOOTLOADER}
: IPXE
${ARCH}
: I386, X86_64, ARM32, ARM64
${PLATFORM}
: PCBIOS, EFI
Example: The following options specify how to boot x86_64 (PCBIOS or EFI) and ARM64 (EFI-only) test machines. Note that the same command line is used for all configurations, and that the ARM64 architecture only has a kernel specified for the EFI platform, while the same x86_64 kernel will be served for both the EFI and PCBIOS platforms.
BOOTS_DEFAULT_X86_64_KERNEL
: https://ci-gateway:9000/boot/default_x86_64_kernel
BOOTS_DEFAULT_X86_64_INITRD
: https://ci-gateway:9000/boot/default_x86_64_initrd
BOOTS_DEFAULT_ARM64_EFI_KERNEL
: https://ci-gateway:9000/boot/default_arm64_kernel.efi
BOOTS_DEFAULT_ARM64_INITRD
: https://ci-gateway:9000/boot/default_arm64_initrd
BOOTS_DEFAULT_CMDLINE
: b2c.container="-ti --tls-verify=false docker://ci-gateway:8002/mupuf/valve-infra/machine_registration:latest register" b2c.ntp_peer="ci-gateway" b2c.cache_device=none loglevel=6
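The lookup order described in this section can be sketched as follows. This is an illustration only: candidate_names and resolve_boot_default are hypothetical helpers, the built-in defaults of the executor are not modelled, and treating an empty value as unset is an assumption.

```python
from typing import List, Mapping, Optional


def candidate_names(field: str, bootloader: str, arch: str, platform: str) -> List[str]:
    """Return the variable names to try, in the documented order."""
    prefixes = [
        f"{bootloader}_{arch}_{platform}",
        f"{arch}_{platform}",
        f"{bootloader}_{arch}",
        arch,
        bootloader,
        "",  # plain BOOTS_DEFAULT_<field>
    ]
    return [
        f"BOOTS_DEFAULT_{prefix}_{field}" if prefix else f"BOOTS_DEFAULT_{field}"
        for prefix in prefixes
    ]


def resolve_boot_default(
    env: Mapping[str, str], field: str, bootloader: str, arch: str, platform: str
) -> Optional[str]:
    """Return the first variable that is set, or None if none of them is."""
    for name in candidate_names(field, bootloader, arch, platform):
        value = env.get(name)
        if value:  # assumption: an empty value counts as unset
            return value
    return None
```

With only the variables from the example above set, an X86_64 machine gets BOOTS_DEFAULT_X86_64_KERNEL on both PCBIOS and EFI, while an ARM64 machine only matches a kernel variable on EFI.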
Executor client - executorctl
The executor client can be found in git under executor/client and can be installed with pip.
It can be used to queue a job on a DUT from the command line, when its state is IDLE:
$ executorctl run -t $machine_tag /path/to/job/file
Here is an extract of the command line for executorctl run:
usage: Executor client run [-h] [-w] [-c CALLBACK] [-t MACHINE_TAGS] [-i MACHINE_ID] [-s SHARE_DIRECTORY] [-j JOB_ID] [-a MINIO_AUTH] [-g MINIO_GROUP] job
positional arguments:
  job                   Job that should be run
options:
  -h, --help            show this help message and exit
  -w, --wait            Wait for a machine to become available if all are busy
  -c CALLBACK, --callback CALLBACK
                        Hostname that the executor will use to connect back to this client, useful for non-trivial routing to the test device
  -t MACHINE_TAGS, --machine-tag MACHINE_TAGS
                        Tag of the machine that should be running the job. Overrides the job's target.
  -i MACHINE_ID, --machine-id MACHINE_ID
                        ID of the machine that should run the job. Overrides the job's target.
  -s SHARE_DIRECTORY, --share-directory SHARE_DIRECTORY
                        Directory that will be forwarded to the job, and whose changes will be forwarded back to
  -j JOB_ID, --job-id JOB_ID
                        Identifier for the job, if you have one already.
  -a MINIO_AUTH, --minio-auth MINIO_AUTH
                        MinIO credentials that has access to all the groups specified using '-g'
  -g MINIO_GROUP, --minio-group MINIO_GROUP
                        Add the MinIO job user to the specified group. Requires valid credentials specified using '--minio-auth' which already have access this group
Examples of jobs that can be run under vivian can be found at job_templates
TODO Properly document the job description and file format
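When scripting job submissions, the executorctl run command line can be assembled programmatically. The sketch below only uses flags listed in the help extract above; build_run_command is a hypothetical helper, and the tag and job path are placeholder values.

```python
import shlex
from typing import List, Optional


def build_run_command(
    job: str,
    machine_tag: Optional[str] = None,
    machine_id: Optional[str] = None,
    wait: bool = False,
) -> List[str]:
    """Assemble an `executorctl run` argument list from the documented flags."""
    cmd = ["executorctl", "run"]
    if wait:
        cmd.append("--wait")  # wait for a machine if all are busy
    if machine_tag is not None:
        cmd += ["--machine-tag", machine_tag]
    if machine_id is not None:
        cmd += ["--machine-id", machine_id]
    cmd.append(job)  # positional argument: the job file
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(shlex.join(build_run_command("job.yml", machine_tag="amdgpu:codename:RENOIR", wait=True)))
```

The resulting list can be passed to subprocess.run on a host where executorctl is installed.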
REST API
The executor includes a REST API with various endpoints available.
Endpoint /duts
Method: GET
Lists the available machines and their information (IP address, tags, …)
curl localhost:8000/api/v1/duts
Endpoint /dut/
Method: POST, PUT
Adds a new machine to MARS_DB_FILE. If there is a discovery process on-going, it will use the discovery data to set the PDU and port_id.
This endpoint is used by the machine_registration.py script.
Endpoint /dut/<machine_id>
Method: GET
Lists all the information of a selected machine. machine_id is the MAC Address.
curl localhost:8000/api/v1/dut/<machine_id>
curl localhost:8000/api/v1/dut/52:54:00:11:22:0a
Method: DELETE
Remove the machine from the database, and all its associated GitLab runner tokens.
curl -X DELETE localhost:8000/api/v1/dut/<machine_id>
Method: PATCH
Updates the pdu_off_delay (a number of seconds) and/or the comment (a string) of the machine.
curl -X PATCH localhost:8000/api/v1/dut/52:54:00:11:22:0a \
-H 'Content-Type: application/json' \
-d '{"pdu_off_delay": 10, "comment": "this is an example comment"}'
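The same PATCH request can be prepared from Python with the standard library. This is a minimal sketch: build_dut_patch is a hypothetical helper, and the base URL simply reuses the host and port of the curl examples.

```python
import json
import urllib.request
from typing import Optional

API_ROOT = "http://localhost:8000/api/v1"  # host and port taken from the curl examples


def build_dut_patch(
    machine_id: str,
    pdu_off_delay: Optional[int] = None,
    comment: Optional[str] = None,
) -> urllib.request.Request:
    """Prepare (but do not send) a PATCH request for one machine."""
    payload = {}
    if pdu_off_delay is not None:
        payload["pdu_off_delay"] = pdu_off_delay  # seconds
    if comment is not None:
        payload["comment"] = comment
    return urllib.request.Request(
        f"{API_ROOT}/dut/{machine_id}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )
```

Passing the returned object to urllib.request.urlopen sends it to a running executor.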
Endpoint /duts/<machine_id>/boot.ipxe
Method: GET
TODO: To be documented.
Endpoint /dut/<machine_id>/quick_check
Method: GET
Returns true if a quick check of the machine has been queued, false otherwise.
curl localhost:8000/api/v1/dut/<machine_id>/quick_check
Method: POST
Queue a quick check on the machine. No parameters are needed.
curl -X POST localhost:8000/api/v1/dut/<machine_id>/quick_check
Endpoint /dut/discover
Method: GET
Shows if there is a discovery process on-going and the data of this discovery: pdu, port_id and start date.
curl localhost:8000/api/v1/dut/discover
Method: POST
Launches a discovery process: it will boot the machine behind a given PDU/port_id and store this data in discover_data, to be used by the machine_registration.py script.
curl -X POST localhost:8000/api/v1/dut/discover \
-H 'Content-Type: application/json' \
-d '{"pdu": "VPDU", "port_id": 10}'
If no machine shows up, the discovery process will automatically time out after 150 seconds by default. This value can be changed using the timeout parameter:
curl -X POST localhost:8000/api/v1/dut/discover \
-H 'Content-Type: application/json' \
-d '{"pdu": "VPDU", "port_id": 10, "timeout": 60}'
Method: DELETE
Erases all the discovery data, discover_data will be emptied.
curl -X DELETE localhost:8000/api/v1/dut/discover
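For scripted enrolment, the discovery request can also be prepared from Python. In this sketch, build_discovery_request is a hypothetical helper, the base URL reuses the host and port of the curl examples, and sending port_id and timeout as JSON numbers is an assumption based on those examples.

```python
import json
import urllib.request
from typing import Optional

API_ROOT = "http://localhost:8000/api/v1"  # host and port taken from the curl examples


def build_discovery_request(
    pdu: str, port_id: int, timeout: Optional[int] = None
) -> urllib.request.Request:
    """Prepare (but do not send) the POST that starts a discovery process."""
    payload = {"pdu": pdu, "port_id": port_id}
    if timeout is not None:
        # Omitted by default so that the server-side default applies.
        payload["timeout"] = timeout
    return urllib.request.Request(
        f"{API_ROOT}/dut/discover",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```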
Endpoint /dut/<machine_id>/retire
Method: POST
Marks a machine as retired. machine_id is the MAC Address.
curl -X POST localhost:8000/api/v1/dut/<machine_id>/retire
curl -X POST localhost:8000/api/v1/dut/52:54:00:11:22:0a/retire
Endpoint /dut/<machine_id>/activate
Method: POST
Removes the retired mark from a machine. machine_id is the MAC Address.
curl -X POST localhost:8000/api/v1/dut/<machine_id>/activate
curl -X POST localhost:8000/api/v1/dut/52:54:00:11:22:0a/activate
Endpoint /dut/<machine_id>/cancel_job
Method: POST
Cancels the job running on a machine. machine_id is the MAC Address.
curl -X POST localhost:8000/api/v1/dut/<machine_id>/cancel_job
curl -X POST localhost:8000/api/v1/dut/52:54:00:11:22:0a/cancel_job
Endpoint /dut/<machine_id>/retrain
Method: POST
Forces a machine to re-run its training. machine_id is the MAC Address.
curl -X POST localhost:8000/api/v1/dut/<machine_id>/retrain
curl -X POST localhost:8000/api/v1/dut/52:54:00:11:22:0a/retrain
Endpoint /dut/<machine_id>/skip_training
Method: POST
Forces a machine to skip its training. machine_id is the MAC Address.
curl -X POST localhost:8000/api/v1/dut/<machine_id>/skip_training
curl -X POST localhost:8000/api/v1/dut/52:54:00:11:22:0a/skip_training
Endpoint /pdus
Method: GET
Lists the available PDUs and the list of their port_ids with some information such as label or state.
curl localhost:8000/api/v1/pdus
Endpoint /pdu/<pdu_name>
Method: GET
Lists all the information of a selected PDU
curl localhost:8000/api/v1/pdu/<pdu_name>
curl localhost:8000/api/v1/pdu/VPDU
Endpoint /pdu/<pdu_name>/port/<port_id>
Method: GET
Lists the information of a port_id: label, min_off_time and state
curl localhost:8000/api/v1/pdu/<pdu_name>/port/<port_id>
curl localhost:8000/api/v1/pdu/VPDU/port/10
Method: PATCH
Turns a port OFF or ON.
curl -X PATCH localhost:8000/api/v1/pdu/VPDU/port/10 \
-H 'Content-Type: application/json' \
-d '{"state": "on"}'
Reserve or un-reserve a port. Use true to reserve, false to un-reserve (JSON booleans are lowercase).
curl -X PATCH localhost:8000/api/v1/pdu/VPDU/port/10 \
-H 'Content-Type: application/json' \
-d '{"reserved": true}'
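Both PATCH bodies for a PDU port can be generated from Python, where json.dumps takes care of serialising booleans as valid JSON. build_port_patch is a hypothetical helper, the base URL reuses the host and port of the curl examples, and the lowercase "off" value is an assumption mirroring the documented "on".

```python
import json
import urllib.request
from typing import Optional

API_ROOT = "http://localhost:8000/api/v1"  # host and port taken from the curl examples


def build_port_patch(
    pdu_name: str,
    port_id: int,
    state: Optional[str] = None,
    reserved: Optional[bool] = None,
) -> urllib.request.Request:
    """Prepare (but do not send) a PATCH request for one PDU port."""
    payload = {}
    if state is not None:
        payload["state"] = state  # "on" as in the example; "off" is assumed
    if reserved is not None:
        payload["reserved"] = reserved  # serialised as JSON true/false
    return urllib.request.Request(
        f"{API_ROOT}/pdu/{pdu_name}/port/{port_id}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )
```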
Endpoint /full-state
Method: GET
Provides all the information from the endpoints /pdus, /duts, and /dut/discover in a single call.
Endpoint /jobs
Method: POST
Used to submit jobs. To be documented.
MarsDB
MarsDB is the database for all the runtime data of the CI instance:
List of PDUs connected
List of test machines
List of GitLab instances on which to expose the test machines
Its location is set using the MARS_DB_FILE environment variable, and it is live-editable. This means you can edit the file directly and changes will be reflected instantly in the executor.
Machines can be added to MarsDB by POSTing or PUTing to the /api/v1/dut/ REST endpoint. Fields in the REST API match the ones found in the database, but some fields cannot be set when the machine is created. This is a safety measure: we want to enforce a separation between the fields that are meant to be auto-generated and the ones that are meant to be manually configured (denoted by the (MANUAL) tag in the DB file description below).
The most prominent manual fields are pdu and pdu_port_id, which means that a newly-added machine won’t be usable until it is associated with its PDU port by manually editing the DB file. An easier way to enroll a new machine is to use the discovery process, by POSTing the pdu and pdu_port_id fields to the /api/v1/dut/discover endpoint. This will initiate the discovery sequence, where the executor will turn this port ON, wait for the machine to register itself, then automatically associate the machine with the PDU port specified in the discovery process. Using the discovery process allows a machine to go through the TRAINING process without further manual intervention.
Here is an annotated sample file, where AUTO means you should not modify this value (or any of its children), while MANUAL means that you are expected to set these values by editing the DB file manually or through the REST interface. All the other values should be machine-generated, for example using the machine_registration container:
pdus: # List of all the power delivery units (MANUAL)
  APC: # Name of the PDU
    driver: apc_masterswitch # The [driver of your PDU](pdu/README.md)
    config: # The configuration of the driver (driver-dependent)
      hostname: 10.0.0.2
  VPDU: # A virtual PDU, spawning virtual machines
    driver: vpdu
    config:
      hostname: localhost:9191
    reserved_port_ids: [] # List of reserved ports in the PDU where no virtual DUT can be added (DASHBOARD)
duts: # List of all the test machines
  de:ad:be:ef:ca:fe: # MAC address of the machine
    base_name: gfx9 # Most significant characteristic of the machine. Basis of the auto-generated name
    tags: # List of tags representing the machine
      - amdgpu:architecture:GCN5.1
      - amdgpu:family:RV
      - amdgpu:codename:RENOIR
      - amdgpu:gfxversion:gfx9
      - amdgpu:APU
      - amdgpu:pciid:0x1002:0x1636
    ip_address: 192.168.0.42 # IP address of the machine
    local_tty_device: ttyUSB0 # Test machine's serial port to talk to the gateway
    gitlab: # List of GitLab instances to expose this runner on
      freedesktop: # Parameters for the `freedesktop` GitLab instance
        token: <token> # Token given by the registration process (AUTO)
        exposed: true # Should this machine be exposed on `freedesktop`? (MANUAL)
        runner_id: 4242 # GitLab's runner ID associated to this machine
    pdu: APC # Name of the PDU to contact to turn ON/OFF this machine (MANUAL/DASHBOARD)
    pdu_port_id: 1 # ID of the port where the machine is connected (MANUAL/DASHBOARD)
    pdu_off_delay: 30 # How long should the PDU port be off when rebooting the machine? (DASHBOARD)
    ready_for_service: true # The machine has been tested and can now be used by users (AUTO/DASHBOARD)
    is_retired: false # The user specified that the machine is no longer in use
    first_seen: 2021-12-22 16:57:08.146275 # When was the machine first seen in CI (AUTO)
    comment: null # Field used to add a quick note about a DUT for admins (MANUAL/DASHBOARD)
gitlab: # Configuration of anything related to exposing the machines on GitLab (MANUAL)
  freedesktop: # Name of the gitlab instance
    url: https://gitlab.freedesktop.org/ # URL of the instance
    registration_token: <token> # Registration token, as found in your GitLab project/group/instance settings
    access_token: <token> # A read-only API token, used to verify consistency between the local and gitlab state
    expose_runners: true # Expose the test machines on this instance? Handy for quickly disabling all machines
    maximum_timeout: 21600 # Maximum timeout allowed for any job running on our test machines
    gateway_runner: # Expose a runner that will run locally, and not on test machines
      token: <token> # Token given by the registration process (AUTO)
      exposed: true # Should the gateway runner be exposed?
      runner_id: 4243 # GitLab's runner ID associated to this machine
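Since a machine cannot be used until its pdu and pdu_port_id fields are set, a quick consistency check over the database can be handy. The sketch below assumes the file has already been parsed into nested dictionaries following the sample layout above (parsing itself is not shown); duts_missing_pdu is a hypothetical helper, not part of the executor.

```python
from typing import Any, List, Mapping


def duts_missing_pdu(db: Mapping[str, Any]) -> List[str]:
    """Return the MAC addresses of machines without a PDU association."""
    missing = []
    for mac, dut in (db.get("duts") or {}).items():
        # Both fields are needed for the executor to power the machine on and off.
        if dut.get("pdu") is None or dut.get("pdu_port_id") is None:
            missing.append(mac)
    return missing
```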
Frequently asked questions
How do I move runners from one GitLab project to another?
There is currently no easy way of doing so. The best solution is to call the following command line for every runner in MarsDB:
$ curl -X DELETE "https://gitlab.example.com/api/v4/runners" --form "token=<token>"
The executor will periodically check the validity of the tokens, and upon seeing they got deleted, it will re-create them in the new project.
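Repeating the command for every runner can be scripted. The sketch below collects the runner tokens from a MarsDB structure already parsed into dictionaries (following the sample layout of the MarsDB section) and prepares one DELETE request per token. runner_tokens and build_runner_delete are hypothetical helpers, and sending the token as a URL-encoded body rather than the multipart form produced by curl --form is an assumption about what the GitLab API accepts.

```python
import urllib.parse
import urllib.request
from typing import Any, List, Mapping


def runner_tokens(db: Mapping[str, Any], instance: str) -> List[str]:
    """Collect the DUT runner tokens, then the gateway runner token, for one instance."""
    tokens = []
    for dut in (db.get("duts") or {}).values():
        token = ((dut.get("gitlab") or {}).get(instance) or {}).get("token")
        if token:
            tokens.append(token)
    instance_cfg = (db.get("gitlab") or {}).get(instance) or {}
    gateway_token = (instance_cfg.get("gateway_runner") or {}).get("token")
    if gateway_token:
        tokens.append(gateway_token)
    return tokens


def build_runner_delete(gitlab_url: str, token: str) -> urllib.request.Request:
    """Prepare (but do not send) the DELETE request shown in the curl command."""
    body = urllib.parse.urlencode({"token": token}).encode("ascii")
    return urllib.request.Request(
        gitlab_url.rstrip("/") + "/api/v4/runners", data=body, method="DELETE"
    )
```

Each prepared request can then be sent with urllib.request.urlopen, after which the executor is expected to notice the missing tokens as described above.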