Executor

The Valve Infra executor is the service that coordinates different services to enable time-sharing test machines, AKA devices under test (DUTs).

This service can be interacted with using executorctl, our client, and/or our REST API.

The executor manages the state of every DUT. The following flow chart shows the transitions between the different states.

graph TD
    subgraph "DUT state machine"
        START --> |is retired?| RETIRED
        START --> |is marked ready for service?| QUICK_CHECK
        QUICK_CHECK --> |Success| IDLE
        QUICK_CHECK --> |Failed| TRAINING
        TRAINING --> |Failed| TRAINING
        TRAINING --> |Success| IDLE
        RETIRED --> |Activate| QUICK_CHECK
        IDLE --> |Retire| RETIRED
        IDLE --> |Job received| QUEUED
        QUEUED --> RUNNING
        RUNNING --> IDLE
    end

DUT state machine

Let’s see what every state of a DUT means:
  • IDLE: The device is available (but powered down to save energy), waiting for a job.

  • TRAINING: The device is being tested for boot reliability (20 rounds by default).

  • RETIRED: The device is undergoing maintenance, and cannot accept jobs.

  • QUICK_CHECK: The device is verifying that its current configuration matches what is described in the database.

  • QUEUED: The device has been chosen to execute a job, but the executor isn’t ready just yet (expected to last <1s).

  • RUNNING: The device is running a job.
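
The flow chart and state list above can be transcribed into a small transition table. This is only an illustrative sketch, not the executor's implementation: the event names (including executor_ready and job_done for the two unlabeled arrows) and the next_state helper are assumptions made for the example.

```python
from enum import Enum


class DutState(Enum):
    START = "START"
    RETIRED = "RETIRED"
    QUICK_CHECK = "QUICK_CHECK"
    TRAINING = "TRAINING"
    IDLE = "IDLE"
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"


# (current state, event) -> next state, transcribed from the flow chart.
# Event names are illustrative labels (assumptions), not executor identifiers.
TRANSITIONS = {
    (DutState.START, "is_retired"): DutState.RETIRED,
    (DutState.START, "ready_for_service"): DutState.QUICK_CHECK,
    (DutState.QUICK_CHECK, "success"): DutState.IDLE,
    (DutState.QUICK_CHECK, "failed"): DutState.TRAINING,
    (DutState.TRAINING, "failed"): DutState.TRAINING,
    (DutState.TRAINING, "success"): DutState.IDLE,
    (DutState.RETIRED, "activate"): DutState.QUICK_CHECK,
    (DutState.IDLE, "retire"): DutState.RETIRED,
    (DutState.IDLE, "job_received"): DutState.QUEUED,
    (DutState.QUEUED, "executor_ready"): DutState.RUNNING,
    (DutState.RUNNING, "job_done"): DutState.IDLE,
}


def next_state(state: DutState, event: str) -> DutState:
    """Return the state reached from `state` on `event`, or raise ValueError."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event!r} is not allowed in state {state.name}") from None
```

For instance, next_state(DutState.IDLE, "job_received") yields DutState.QUEUED, while asking a RUNNING machine to retire raises ValueError because the chart only allows retiring from IDLE.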

Configuration

The executor service is configured through the use of environment variables.

Here are the options relevant to most deployments:

  • BOOTS_TFTP_ROOT: Cache folder for boot-related artifacts (default: /mnt/tmp/boots/tftp)

  • BOOTS_DEFAULT_*: See Default boot configuration for more details.

  • EXECUTOR_URL: HTTP URL of the executor service, reachable locally and from the test machines (default: http://ci-gateway)

  • EXECUTOR_ARTIFACT_CACHE_ROOT: Folder to use as a cache for the kernel/initrd artifacts used by the jobs (recommended, default: None)

  • FARM_NAME: Name of the test farm (mandatory, default: None)

  • GITLAB_CONF_FILE: Path to the gitlab runner configuration file, which will be overwritten as new test machines are added to the farm (default: /etc/gitlab-runner/config.toml)

  • GITLAB_CONF_TEMPLATE_FILE: Template to use for the creation of the gitlab runner configuration file (default: $package_dir/templates/gitlab_runner_config.toml.j2)

  • MARS_DB_FILE: Path to the database (default: /mnt/permanent/mars_db.yaml)

  • MINIO_URL: URL to the local minio service, accessible both locally and by test machines (default: http://ci-gateway:9000)

  • MINIO_ROOT_USER: Admin username for the local minio service (default: minioadmin)

  • MINIO_ROOT_PASSWORD: Admin password for the local minio service (default: minio-root-password)

  • PRIVATE_INTERFACE: Network interface connected to the DUTs’ network (default: private)

  • SALAD_URL: URL to the salad service (default: http://ci-gateway:8005)

  • SERGENT_HARTMAN_BOOT_COUNT: How many rounds of testing should be used to qualify a test machine (default: 100)

  • SERGENT_HARTMAN_QUALIFYING_BOOT_COUNT: How many successful rounds of testing should be used to qualify a test machine (default: 100)

  • SERGENT_HARTMAN_REGISTRATION_RETRIAL_DELAY: How many seconds should be waited after an unsuccessful registration attempt before trying another one (default: 120)
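
As an illustration of how these options combine, here is a minimal sketch of a pre-flight check that a deployment script could run before starting the executor. It is not the executor's own code: the load_config helper is hypothetical, and only a handful of the options and defaults listed above are reproduced.

```python
import os

# A subset of the options above, with the defaults given in this document.
DEFAULTS = {
    "EXECUTOR_URL": "http://ci-gateway",
    "MARS_DB_FILE": "/mnt/permanent/mars_db.yaml",
    "MINIO_URL": "http://ci-gateway:9000",
    "PRIVATE_INTERFACE": "private",
    "SALAD_URL": "http://ci-gateway:8005",
}


def load_config(environ=os.environ):
    """Return the effective value of each option, failing if FARM_NAME is unset."""
    farm_name = environ.get("FARM_NAME")
    if not farm_name:
        # FARM_NAME is documented as mandatory and has no default.
        raise RuntimeError("FARM_NAME is mandatory")
    config = {name: environ.get(name, default) for name, default in DEFAULTS.items()}
    config["FARM_NAME"] = farm_name
    return config
```

Passing a plain dictionary instead of os.environ, for example load_config({"FARM_NAME": "my-farm"}), makes the check easy to try without touching the real environment.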

And here are the lower-level options:

  • BOOTS_DISABLE_SERVERS: Set to a non-empty value to disable netbooting services (DHCP and TFTP). (default: None)

  • CONSOLE_PATTERN_DEFAULT_MACHINE_UNFIT_FOR_SERVICE_REGEX: Automatically tag a DUT as unfit for service if it generates a line matched by this regular expression (default: None)

  • EXECUTOR_HOST: Binding address for the HTTP service (default: 0.0.0.0)

  • EXECUTOR_PORT: Binding port for the HTTP service (default: 80)

  • EXECUTOR_REGISTRATION_JOB: Local path to the registration job (default: $package_dir/job_templates/register.yml.j2)

  • EXECUTOR_BOOTLOOP_JOB: Local path to the bootloop job (default: $package_dir/job_templates/bootloop.yml.j2)

  • EXECUTOR_VPDU_ENDPOINT: Automatically add a virtual PDU for local testing (format: host:port, default: None)

  • MINIO_ADMIN_ALIAS: Alias set up by the executor to refer to the minio instance specified by MINIO_URL, MINIO_ROOT_USER, and MINIO_ROOT_PASSWORD (default: local)

Default boot configuration

When an unsolicited boot request is received by the executor (e.g. an admin added a new test machine), it needs to know which kernel/initrd/cmdline this test machine needs to run in order to complete its registration.

Here are the most relevant options:

  • BOOTS_DEFAULT_KERNEL: Default kernel to use to boot unknown test machines (default: http://ci-gateway:9000/boot/default_kernel)

  • BOOTS_DEFAULT_INITRD: Default initramfs to use to boot unknown test machines (default: http://ci-gateway:9000/boot/default_boot2container.cpio.xz)

  • BOOTS_DEFAULT_CMDLINE: Default kernel command line to use to boot unknown test machines (default: b2c.container="-ti --tls-verify=false docker://ci-gateway:8002/mupuf/valve-infra/machine_registration:latest register" b2c.ntp_peer="ci-gateway" b2c.cache_device=none loglevel=6)

However, since no single kernel/initramfs may be suitable for all the possible DUTs, the executor will look for the most suitable value by checking its environment variables in the following order:

  1. BOOTS_DEFAULT_${BOOTLOADER}_${ARCH}_${PLATFORM}_[KERNEL|INITRD|CMDLINE]

  2. BOOTS_DEFAULT_${ARCH}_${PLATFORM}_[KERNEL|INITRD|CMDLINE]

  3. BOOTS_DEFAULT_${BOOTLOADER}_${ARCH}_[KERNEL|INITRD|CMDLINE]

  4. BOOTS_DEFAULT_${ARCH}_[KERNEL|INITRD|CMDLINE]

  5. BOOTS_DEFAULT_${BOOTLOADER}_[KERNEL|INITRD|CMDLINE]

  6. BOOTS_DEFAULT_[KERNEL|INITRD|CMDLINE]

With the variables taking the following values:

  • ${BOOTLOADER}: IPXE

  • ${ARCH}: I386, X86_64, ARM32, ARM64

  • ${PLATFORM}: PCBIOS, EFI

Example: The following options specify how to boot x86_64 (PCBIOS or EFI) and ARM64 (EFI-only) test machines. Note how the same command line is used for all configurations, and how the ARM64 architecture only has a kernel specified for the EFI platform, while the same x86_64 kernel is served for both the EFI and PCBIOS platforms.

  • BOOTS_DEFAULT_X86_64_KERNEL: https://ci-gateway:9000/boot/default_x86_64_kernel

  • BOOTS_DEFAULT_X86_64_INITRD: https://ci-gateway:9000/boot/default_x86_64_initrd

  • BOOTS_DEFAULT_ARM64_EFI_KERNEL: https://ci-gateway:9000/boot/default_arm64_kernel.efi

  • BOOTS_DEFAULT_ARM64_INITRD: https://ci-gateway:9000/boot/default_arm64_initrd

  • BOOTS_DEFAULT_CMDLINE: b2c.container="-ti --tls-verify=false docker://ci-gateway:8002/mupuf/valve-infra/machine_registration:latest register" b2c.ntp_peer="ci-gateway" b2c.cache_device=none loglevel=6
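
The lookup order described above can be sketched as follows. This is an illustration of the documented precedence rather than the executor's actual code: the candidate_names and resolve helpers are hypothetical, and the command line in the sample environment is shortened for brevity.

```python
def candidate_names(bootloader, arch, platform, option):
    """List the variable names to check, from most to least specific."""
    qualifiers = [
        (bootloader, arch, platform),  # 1. BOOTS_DEFAULT_${BOOTLOADER}_${ARCH}_${PLATFORM}_*
        (arch, platform),              # 2. BOOTS_DEFAULT_${ARCH}_${PLATFORM}_*
        (bootloader, arch),            # 3. BOOTS_DEFAULT_${BOOTLOADER}_${ARCH}_*
        (arch,),                       # 4. BOOTS_DEFAULT_${ARCH}_*
        (bootloader,),                 # 5. BOOTS_DEFAULT_${BOOTLOADER}_*
        (),                            # 6. BOOTS_DEFAULT_*
    ]
    return ["_".join(("BOOTS_DEFAULT", *q, option)) for q in qualifiers]


def resolve(environ, bootloader, arch, platform, option):
    """Return the first matching value, or None if no variable is set."""
    for name in candidate_names(bootloader, arch, platform, option):
        if name in environ:
            return environ[name]
    return None


# The example options from this section (command line shortened).
EXAMPLE_ENV = {
    "BOOTS_DEFAULT_X86_64_KERNEL": "https://ci-gateway:9000/boot/default_x86_64_kernel",
    "BOOTS_DEFAULT_X86_64_INITRD": "https://ci-gateway:9000/boot/default_x86_64_initrd",
    "BOOTS_DEFAULT_ARM64_EFI_KERNEL": "https://ci-gateway:9000/boot/default_arm64_kernel.efi",
    "BOOTS_DEFAULT_ARM64_INITRD": "https://ci-gateway:9000/boot/default_arm64_initrd",
    "BOOTS_DEFAULT_CMDLINE": "b2c.cache_device=none loglevel=6",
}
```

With this environment, an ARM64 machine on the EFI platform gets its kernel through rule 2, an x86_64 machine on either platform through rule 4, and every machine gets its command line through rule 6. An ARM64 machine on PCBIOS finds no kernel in the sample environment, so the sketch returns None; a real deployment would fall back to the built-in BOOTS_DEFAULT_KERNEL default.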

Executor client - executorctl

The executor client can be found in git under executor/client and installed with pip.

It can be used to queue a job on a DUT from the command line, when its state is IDLE:

$ executorctl run -t $machine_tag /path/to/job/file

Here is an extract of the command line for executorctl run:

usage: Executor client run [-h] [-w] [-c CALLBACK] [-t MACHINE_TAGS] [-i MACHINE_ID] [-s SHARE_DIRECTORY] [-j JOB_ID] [-a MINIO_AUTH] [-g MINIO_GROUP] job

positional arguments:
job                   Job that should be run

options:
-h, --help            show this help message and exit
-w, --wait            Wait for a machine to become available if all are busy
-c CALLBACK, --callback CALLBACK
                        Hostname that the executor will use to connect back to this client, useful for non-trivial routing to the test device
-t MACHINE_TAGS, --machine-tag MACHINE_TAGS
                        Tag of the machine that should be running the job. Overrides the job's target.
-i MACHINE_ID, --machine-id MACHINE_ID
                        ID of the machine that should run the job. Overrides the job's target.
-s SHARE_DIRECTORY, --share-directory SHARE_DIRECTORY
                        Directory that will be forwarded to the job, and whose changes will be forwarded back to
-j JOB_ID, --job-id JOB_ID
                        Identifier for the job, if you have one already.
-a MINIO_AUTH, --minio-auth MINIO_AUTH
                        MinIO credentials that has access to all the groups specified using '-g'
-g MINIO_GROUP, --minio-group MINIO_GROUP
                        Add the MinIO job user to the specified group. Requires valid credentials specified using '--minio-auth' which already have access this group
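
When jobs are queued from a script rather than from an interactive shell, the command line can be assembled programmatically. The sketch below only uses flags listed in the help text above (-w, -t, -i); the executorctl_run_argv helper is hypothetical, and it assumes executorctl is available on the PATH.

```python
import shlex


def executorctl_run_argv(job, machine_tag=None, machine_id=None, wait=False):
    """Build the argument list for `executorctl run` from the documented flags."""
    argv = ["executorctl", "run"]
    if wait:
        argv.append("-w")  # wait for a machine if all are busy
    if machine_tag is not None:
        argv += ["-t", machine_tag]  # overrides the job's target
    if machine_id is not None:
        argv += ["-i", machine_id]  # overrides the job's target
    argv.append(job)
    return argv


argv = executorctl_run_argv("job.yml", machine_tag="amdgpu:codename:RENOIR", wait=True)
print(shlex.join(argv))
# To actually queue the job: subprocess.run(argv, check=True)
```

Passing the arguments as a list avoids shell quoting problems with tags or paths that contain special characters.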

Examples of jobs that can be run under vivian can be found in job_templates.

TODO Properly document the job description and file format

REST API

The executor includes a REST API with various endpoints available.

Endpoint /duts

Method: GET

Lists the available machines and their information (IP address, tags, …)

curl localhost:8000/api/v1/duts

Endpoint /dut/

Method: POST, PUT

Adds a new machine to MARS_DB_FILE. If a discovery process is ongoing, its data is used to set the PDU and port_id of the machine.

This endpoint is used by the machine_registration.py script.

Endpoint /dut/<machine_id>

Method: GET

Lists all the information of a selected machine. machine_id is the MAC Address.

curl localhost:8000/api/v1/dut/<machine_id>
curl localhost:8000/api/v1/dut/52:54:00:11:22:0a

Method: DELETE

Removes the machine, and all its associated GitLab runner tokens, from the database.

curl -X DELETE localhost:8000/api/v1/dut/<machine_id>

Method: PATCH

Updates the pdu_off_delay (a number of seconds) and/or the comment (a string) of the machine.

curl -X PATCH localhost:8000/api/v1/dut/52:54:00:11:22:0a \
    -H 'Content-Type: application/json' \
    -d '{"pdu_off_delay": 10, "comment": "this is an example comment"}'
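
The same request can be prepared from Python using only the standard library. This is a minimal sketch: the build_dut_patch helper is hypothetical, and the base URL is the one used by the curl examples in this section.

```python
import json
import urllib.request

API = "http://localhost:8000/api/v1"  # base URL used by the curl examples


def build_dut_patch(machine_id, **fields):
    """Prepare a PATCH request updating the given fields of a machine."""
    return urllib.request.Request(
        f"{API}/dut/{machine_id}",
        data=json.dumps(fields).encode("utf-8"),
        method="PATCH",
        headers={"Content-Type": "application/json"},
    )


req = build_dut_patch(
    "52:54:00:11:22:0a",
    pdu_off_delay=10,
    comment="this is an example comment",
)
# To send it: urllib.request.urlopen(req)
```

Building the request separately from sending it makes it possible to inspect the method, URL and body before contacting the executor.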

Endpoint /duts/<machine_id>/boot.ipxe

Method: GET

TODO: To be documented.

Endpoint /dut/<machine_id>/quick_check

Method: GET

Returns true if a quick check of the machine has been queued, false otherwise.

curl localhost:8000/api/v1/dut/<machine_id>/quick_check

Method: POST

Queues a quick check on the machine. No parameters are needed.

curl -X POST localhost:8000/api/v1/dut/<machine_id>/quick_check

Endpoint /dut/discover

Method: GET

Shows if there is a discovery process on-going and the data of this discovery: pdu, port_id and start date.

curl localhost:8000/api/v1/dut/discover

Method: POST

Launches a discovery process: the executor boots the machine behind the given PDU/port_id and stores this data in discover_data, to be used by the machine_registration.py script.

curl -X POST localhost:8000/api/v1/dut/discover \
    -H 'Content-Type: application/json' \
    -d '{"pdu": "VPDU", "port_id": 10}'

If no machine shows up, the discovery process automatically times out after 150 seconds by default. This value can be changed using the timeout parameter:

curl -X POST localhost:8000/api/v1/dut/discover \
    -H 'Content-Type: application/json' \
    -d '{"pdu": "VPDU", "port_id": 10, "timeout": 60}'

Method: DELETE

Erases all the discovery data: discover_data is emptied.

curl -X DELETE localhost:8000/api/v1/dut/discover

Endpoint /dut/<machine_id>/retire

Method: POST

Marks a machine as retired. machine_id is the MAC Address.

curl -X POST localhost:8000/api/v1/dut/<machine_id>/retire
curl -X POST localhost:8000/api/v1/dut/52:54:00:11:22:0a/retire

Endpoint /dut/<machine_id>/activate

Method: POST

Removes the retired mark from a machine. machine_id is the MAC Address.

curl -X POST localhost:8000/api/v1/dut/<machine_id>/activate
curl -X POST localhost:8000/api/v1/dut/52:54:00:11:22:0a/activate

Endpoint /dut/<machine_id>/cancel_job

Method: POST

Cancels the job running on a machine. machine_id is the MAC Address.

curl -X POST localhost:8000/api/v1/dut/<machine_id>/cancel_job
curl -X POST localhost:8000/api/v1/dut/52:54:00:11:22:0a/cancel_job

Endpoint /dut/<machine_id>/retrain

Method: POST

Forces a machine to re-run its training. machine_id is the MAC Address.

curl -X POST localhost:8000/api/v1/dut/<machine_id>/retrain
curl -X POST localhost:8000/api/v1/dut/52:54:00:11:22:0a/retrain

Endpoint /dut/<machine_id>/skip_training

Method: POST

Forces a machine to skip its training. machine_id is the MAC Address.

curl -X POST localhost:8000/api/v1/dut/<machine_id>/skip_training
curl -X POST localhost:8000/api/v1/dut/52:54:00:11:22:0a/skip_training
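
The per-machine actions documented above all follow the same POST /dut/<machine_id>/<action> pattern, which a small helper can capture. This is a sketch under assumptions: the build_dut_action helper is hypothetical, and the base URL is the one used by the curl examples.

```python
import urllib.request

API = "http://localhost:8000/api/v1"  # base URL used by the curl examples

# Actions that accept a POST without parameters, as documented above.
DUT_ACTIONS = (
    "quick_check",
    "retire",
    "activate",
    "cancel_job",
    "retrain",
    "skip_training",
)


def build_dut_action(machine_id, action):
    """Prepare a POST request triggering `action` on the given machine."""
    if action not in DUT_ACTIONS:
        raise ValueError(f"unknown DUT action: {action!r}")
    return urllib.request.Request(f"{API}/dut/{machine_id}/{action}", method="POST")


retire_req = build_dut_action("52:54:00:11:22:0a", "retire")
# To send it: urllib.request.urlopen(retire_req)
```

Rejecting unknown actions locally catches typos before any request reaches the executor.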

Endpoint /pdus

Method: GET

Lists the available PDUs and their port_ids, along with some information such as label or state.

curl localhost:8000/api/v1/pdus

Endpoint /pdu/<pdu_name>

Method: GET

Lists all the information of a selected PDU.

curl localhost:8000/api/v1/pdu/<pdu_name>
curl localhost:8000/api/v1/pdu/VPDU

Endpoint /pdu/<pdu_name>/port/<port_id>

Method: GET

Lists the information of a port_id: label, min_off_time and state.

curl localhost:8000/api/v1/pdu/<pdu_name>/port/<port_id>
curl localhost:8000/api/v1/pdu/VPDU/port/10

Method: PATCH

Turns a port OFF or ON.

curl -X PATCH localhost:8000/api/v1/pdu/VPDU/port/10 \
    -H 'Content-Type: application/json' \
    -d '{"state": "on"}'

Reserves or un-reserves a port. Use true to reserve, false to un-reserve.

curl -X PATCH localhost:8000/api/v1/pdu/VPDU/port/10 \
    -H 'Content-Type: application/json' \
    -d '{"reserved": true}'
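
JSON spells its booleans in lowercase, so when the request body is produced from Python it is safer to let json.dumps do the conversion than to write the body by hand. A minimal sketch, where the port_patch_body helper is hypothetical and "off" is assumed to be the counterpart of the "on" value shown above:

```python
import json


def port_patch_body(state=None, reserved=None):
    """Build the JSON body for a PATCH on /pdu/<pdu_name>/port/<port_id>."""
    body = {}
    if state is not None:
        if state not in ("on", "off"):  # "off" is an assumed counterpart of "on"
            raise ValueError(f"unexpected port state: {state!r}")
        body["state"] = state
    if reserved is not None:
        body["reserved"] = bool(reserved)  # serialized as true/false
    return json.dumps(body)


print(port_patch_body(reserved=True))
print(port_patch_body(state="on"))
```

The returned string can be passed as-is to curl's -d option or used as the body of any HTTP client request.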

Endpoint /full-state

Method: GET

Provides all the information from the endpoints /pdus, /duts, and /dut/discover in a single call.

Endpoint /jobs

Method: POST

Used to submit jobs. To be documented.

MarsDB

MarsDB is the database for all the runtime data of the CI instance:

  • List of PDUs connected

  • List of test machines

  • List of GitLab instances on which to expose the test machines

Its location is set using the MARS_DB_FILE environment variable, and the file is live-editable: you can edit it directly and changes will be reflected instantly in the executor.

Machines can be added to MarsDB by POSTing or PUTing to the /api/v1/dut/ REST endpoint. Fields in the REST API match the ones found in the database, but some fields cannot be set at the creation of the machine for safety reasons as we want to enforce a separation between fields that are meant to be auto-generated and the ones that are meant to be manually-configured (denoted by the (MANUAL) tag in the DB file description below).

The most prominent manual fields are pdu and pdu_port_id, which means that a newly-added machine won’t be usable until it is manually associated to its PDU port by editing the DB file. An easier way to enroll a new machine is to use the discovery process, by POSTing the pdu and port_id fields to the /api/v1/dut/discover endpoint. This initiates the discovery sequence, in which the executor turns this port ON, waits for the machine to register itself, then automatically associates the machine to the PDU port specified in the discovery process. Using the discovery process allows a machine to go through the TRAINING process without further manual intervention.

Here is an annotated sample file, where AUTO means you should not be modifying this value (and all children of it) while MANUAL means that you are expected to set these values by editing the DB file manually, or through the REST interface. All the other values should be machine-generated, for example using the machine_registration container:

pdus:                                        # List of all the power delivery units (MANUAL)
  APC:                                       # Name of the PDU
    driver: apc_masterswitch                 # The [driver of your PDU](pdu/README.md)
    config:                                  # The configuration of the driver (driver-dependent)
      hostname: 10.0.0.2
  VPDU:                                      # A virtual PDU, spawning virtual machines
    driver: vpdu
    config:
      hostname: localhost:9191
    reserved_port_ids: []                    # List of reserved ports in the PDU where no virtual DUT can be added (DASHBOARD)
duts:                                        # List of all the test machines
  de:ad:be:ef:ca:fe:                         # MAC address of the machine
    base_name: gfx9                          # Most significant characteristic of the machine. Basis of the auto-generated name
    tags:                                    # List of tags representing the machine
    - amdgpu:architecture:GCN5.1
    - amdgpu:family:RV
    - amdgpu:codename:RENOIR
    - amdgpu:gfxversion:gfx9
    - amdgpu:APU
    - amdgpu:pciid:0x1002:0x1636
    ip_address: 192.168.0.42                 # IP address of the machine
    local_tty_device: ttyUSB0                # Test machine's serial port to talk to the gateway
    gitlab:                                  # List of GitLab instances to expose this runner on
      freedesktop:                           # Parameters for the `freedesktop` GitLab instance
        token: <token>                       # Token given by the registration process (AUTO)
        exposed: true                        # Should this machine be exposed on `freedesktop`? (MANUAL)
        runner_id: 4242                      # GitLab's runner ID associated to this machine
    pdu: APC                                 # Name of the PDU to contact to turn ON/OFF this machine (MANUAL/DASHBOARD)
    pdu_port_id: 1                           # ID of the port where the machine is connected (MANUAL/DASHBOARD)
    pdu_off_delay: 30                        # How long should the PDU port be off when rebooting the machine? (DASHBOARD)
    ready_for_service: true                  # The machine has been tested and can now be used by users (AUTO/DASHBOARD)
    is_retired: false                        # The user specified that the machine is no longer in use
    first_seen: 2021-12-22 16:57:08.146275   # When was the machine first seen in CI (AUTO)
    comment: null                            # Field used to add a quick note about a DUT for admins (MANUAL/DASHBOARD)
gitlab:                                      # Configuration of anything related to exposing the machines on GitLab (MANUAL)
  freedesktop:                               # Name of the gitlab instance
    url: https://gitlab.freedesktop.org/     # URL of the instance
    registration_token: <token>              # Registration token, as found in your GitLab project/group/instance settings
    access_token: <token>                    # A read-only API token, used to verify consistency between the local and gitlab state
    expose_runners: true                     # Expose the test machines on this instance? Handy for quickly disabling all machines
    maximum_timeout: 21600                   # Maximum timeout allowed for any job running on our test machines
    gateway_runner:                          # Expose a runner that will run locally, and not on test machines
      token: <token>                         # Token given by the registration process (AUTO)
      exposed: true                          # Should the gateway runner be exposed?
      runner_id: 4243                        # GitLab's runner ID associated to this machine
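
Since machines are unusable until they are associated to a PDU port, it can be handy to list the ones still missing that association. The sketch below works on the database once parsed into Python dictionaries (for example with PyYAML's yaml.safe_load, which is not used here to keep the example self-contained). The machines_missing_pdu helper is hypothetical, and it assumes that unset fields are either absent or null.

```python
def machines_missing_pdu(db):
    """Return the MAC addresses of machines without a usable PDU association."""
    known_pdus = db.get("pdus") or {}
    missing = []
    for mac, dut in (db.get("duts") or {}).items():
        pdu = dut.get("pdu")
        port = dut.get("pdu_port_id")
        # Unusable if either field is unset, or if the PDU is not declared.
        if pdu is None or port is None or pdu not in known_pdus:
            missing.append(mac)
    return missing


# Reduced version of the sample file above, plus one unassociated machine.
sample_db = {
    "pdus": {"APC": {"driver": "apc_masterswitch", "config": {"hostname": "10.0.0.2"}}},
    "duts": {
        "de:ad:be:ef:ca:fe": {"pdu": "APC", "pdu_port_id": 1},
        "52:54:00:11:22:0a": {"pdu": None, "pdu_port_id": None},
    },
}
print(machines_missing_pdu(sample_db))
```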

Frequently asked questions

  • How do I move runners from one GitLab project to another?

There is currently no easy way of doing so. The best solution is to run the following command for every runner in MarsDB:

$ curl -X DELETE "https://gitlab.example.com/api/v4/runners" --form "token=<token>"

The executor will periodically check the validity of the tokens, and upon seeing they got deleted, it will re-create them in the new project.
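
Running the command for every runner can be scripted. The sketch below collects the runner tokens of one GitLab instance from the database, once parsed into Python dictionaries, and produces one curl invocation per token. The helper names are hypothetical, and including the gateway runner's token is an assumption about what "every runner" covers.

```python
def runner_tokens(db, instance):
    """Collect the runner tokens stored in the database for a GitLab instance."""
    tokens = []
    for dut in (db.get("duts") or {}).values():
        token = (dut.get("gitlab") or {}).get(instance, {}).get("token")
        if token:
            tokens.append(token)
    # Assumption: the gateway runner should be moved along with the machines.
    gateway = (db.get("gitlab") or {}).get(instance, {}).get("gateway_runner") or {}
    if gateway.get("token"):
        tokens.append(gateway["token"])
    return tokens


def delete_commands(db, instance):
    """Build one `curl -X DELETE` argument list per runner token."""
    base = db["gitlab"][instance]["url"].rstrip("/")
    return [
        ["curl", "-X", "DELETE", f"{base}/api/v4/runners", "--form", f"token={token}"]
        for token in runner_tokens(db, instance)
    ]
```

Each returned list can be passed to subprocess.run(argv, check=True), or printed first for review before anything is deleted.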