Running the step-based photogrammetry workflow
Recommended Workflow
This is the recommended workflow for photogrammetry processing. It provides optimized resource allocation, cost savings, and better monitoring compared to the original monolithic workflow.
This guide describes how to run the OFO step-based photogrammetry workflows, which split Metashape processing and postprocessing into two independent workflows with optimized CPU/GPU node allocation. The workflows use automate-metashape for photogrammetry and a separate postprocessing container for derived products.
The two workflows are:
- metashape-workflow.yaml - Runs the 10 Metashape processing steps plus S3 upload
- postprocessing-workflow.yaml - Runs postprocessing (CHMs, COGs, thumbnails) on completed Metashape outputs
This separation allows you to run postprocessing independently (e.g., rerun with different settings) without redoing expensive Metashape processing.
Key Benefits
- GPU steps (match_photos, build_depth_maps, build_mesh) run on expensive GPU nodes only when needed
- CPU steps (align_cameras, build_point_cloud, build_dem_orthomosaic, etc.) run on cheaper CPU nodes
- Disabled steps are completely skipped (no pod creation, no resource allocation)
- Fine-grained monitoring - Track progress of each individual step in the Argo UI
- Flexible GPU usage - Configure whether GPU-capable steps use GPU or CPU nodes
- Cost optimization - Reduce GPU usage by 60-80% compared to monolithic workflow
Prerequisites
Before running the workflow, ensure you have:
- Installed and set up the openstack and kubectl utilities
- Installed the Argo CLI
- Added the appropriate type and number of nodes to the cluster
- Set up your kubectl authentication env var (part of instructions for adding nodes). Quick reference:
source ~/venv/openstack/bin/activate
source ~/.ofocluster/app-cred-ofocluster-openrc.sh
export KUBECONFIG=~/.ofocluster/ofocluster.kubeconfig
Workflow overview
Metashape Workflow (metashape-workflow.yaml)
Executes 10 separate Metashape processing steps as individual containerized tasks, followed by S3 upload:
- setup (CPU) - Initialize project, add photos, calibrate reflectance
- match_photos (GPU/CPU configurable) - Generate tie points for camera alignment
- align_cameras (CPU) - Align cameras, add GCPs, optimize, filter sparse points
- build_depth_maps (GPU) - Create depth maps for dense reconstruction
- build_point_cloud (CPU) - Generate dense point cloud from depth maps
- build_mesh (GPU/CPU configurable) - Build 3D mesh model
- build_dem_orthomosaic (CPU) - Create DEMs and orthomosaic products
- match_photos_secondary (GPU/CPU configurable, optional) - Match secondary photos if provided
- align_cameras_secondary (CPU, optional) - Align secondary cameras if provided
- finalize (CPU) - Cleanup, generate reports
- rclone-upload-task - Upload Metashape outputs to S3
- cleanup-project - Remove temporary project directory
Postprocessing Workflow (postprocessing-workflow.yaml)
Runs independently on projects that have completed Metashape processing:
- postprocessing-task - Generate CHMs, clip to boundaries, create COGs and thumbnails, upload to S3
- cleanup-project - Remove temporary project directory
Sequential Execution
Steps execute sequentially within each mission to prevent conflicts with shared Metashape project files. However, multiple missions process in parallel, each with its own step sequence.
Automatic Cleanup: After each workflow completes successfully for a project, it automatically removes the temporary project directory ({TEMP_WORKING_DIR}/{workflow-name}/{project-name}/) to free disk space.
Conditional Execution
Steps disabled in your config file are completely skipped - no container is created and no resources are allocated. This is more efficient than the original workflow where disabled operations still ran inside a single long-running container.
Project Name Requirements
Project names must be safe for shell and filesystem use:
- Must start and end with alphanumeric characters
- Can contain alphanumeric characters, dots, hyphens, and underscores
- Pattern:
^[a-zA-Z0-9][a-zA-Z0-9._-]*[a-zA-Z0-9]$
The project name is used directly for working directories: {TEMP_WORKING_DIR}/{workflow-name}/{project-name}/
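As an illustration, a project name can be checked against the documented pattern before it is used to build a working directory. This is a minimal Python sketch, not the workflow's own validation code; is_valid_project_name and project_working_dir are hypothetical helper names, and any workflow name you pass in is up to you. Note that the pattern requires at least two characters, so a one-character name is rejected.

```python
import posixpath
import re

# Pattern copied from the project name requirements above.
PROJECT_NAME_PATTERN = r"^[a-zA-Z0-9][a-zA-Z0-9._-]*[a-zA-Z0-9]$"


def is_valid_project_name(name: str) -> bool:
    """Return True if `name` matches the documented project name pattern."""
    # fullmatch rejects a trailing newline, which `$` alone would tolerate.
    return re.fullmatch(PROJECT_NAME_PATTERN, name) is not None


def project_working_dir(temp_working_dir: str, workflow_name: str, project_name: str) -> str:
    """Build {TEMP_WORKING_DIR}/{workflow-name}/{project-name}/ (hypothetical helper)."""
    if not is_valid_project_name(project_name):
        raise ValueError(f"unsafe project name: {project_name!r}")
    return posixpath.join(temp_working_dir, workflow_name, project_name) + "/"
```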
Setup
Prepare inputs
Before running the workflow, you need to prepare three types of inputs on the cluster's shared storage:
- Drone imagery datasets (JPEG images)
- Metashape configuration files
- A config list file specifying which configs to process
All inputs must be placed in /ofo-share/argo-data/.
Add drone imagery datasets
To add new drone imagery datasets to be processed using Argo, transfer files from your local machine (or the cloud) to the /ofo-share volume. Put the drone imagery datasets to be processed in their own directory in /ofo-share/argo-data/argo-input/datasets (or another folder within argo-input).
One data transfer method is the scp command-line tool:
scp -r <local/directory/drone_image_dataset/> exouser@<vm.ip.address>:/ofo-share/argo-data/argo-input/datasets
Replace <vm.ip.address> with the IP address of a cluster node that has the share mounted.
Specify Metashape parameters
Config Structure Requirement
The step-based workflow requires an updated config structure with:
- Global settings under the project: section
- Each operation as a top-level config section with an enabled flag
- Separate match_photos and align_cameras sections (not a combined alignPhotos)
- Separate build_dem and build_orthomosaic sections
See the updated config example for the full structure.
Metashape processing parameters are specified in configuration YAML files which should be placed somewhere within /ofo-share/argo-data/argo-input.
Every project to be processed needs to have its own standalone configuration file.
Setting the photo_path: Within the project: section of the config YAML, you must specify photo_path, which is the location of the drone imagery dataset. When running via Argo workflows, this path refers to the location inside the Docker container. The /ofo-share/argo-data directory is mounted at /data inside the container, so, for example, if your drone images are at /ofo-share/argo-data/argo-input/datasets/dataset_1, the photo_path should be written as /data/argo-input/datasets/dataset_1.
Downloading imagery from S3 (optional)
Instead of pre-staging imagery on the shared PVC, you can have the workflow automatically download and extract imagery zip files from S3 at runtime. This is useful for:
- Cloud-native workflows: Process imagery stored in S3 without manual uploads
- One-time processing: Imagery that doesn't need to persist after the workflow
- Remote collaboration: Team members can trigger workflows without PVC access
When to use S3 imagery download
Use this feature when:
- Your imagery is already stored as zip files in S3
- You want to avoid manual file transfers to the cluster
- You're processing imagery that won't be reused
Don't use this feature when:
- Your imagery is already on the PVC (use direct paths instead)
- You need to reprocess the same imagery multiple times (pre-staging is more efficient)
- Your zip files are very large and bandwidth is a concern
Configuration
Add the following to the argo: section of your config file:
argo:
  # List of S3 zip files to download (can also be a single string)
  s3_imagery_zip_download:
    - ofo-public/drone/missions_01/000558/images/000558_images.zip
    - ofo-public/drone/missions_01/000559/images/000559_images.zip
  # Whether to delete downloaded imagery after workflow completes (default: true)
  cleanup_downloaded_imagery: true
| Parameter | Description | Default |
|---|---|---|
| s3_imagery_zip_download | S3 path(s) of zip files to download. Can be a single string or a list. Format: bucket/path/file.zip. The S3 endpoint and credentials are configured in the cluster's s3-credentials Kubernetes secret. | (none) |
| cleanup_downloaded_imagery | If true, downloaded imagery is deleted after photogrammetry completes to free disk space | true |
Path syntax: The __DOWNLOADED__ prefix
When using S3 imagery download, reference downloaded files in photo_path using the __DOWNLOADED__ prefix:
project:
  project_name: my_forest_plot
  photo_path:
    - __DOWNLOADED__/000558_images/000558-01
    - __DOWNLOADED__/000558_images/000558-02
    - __DOWNLOADED__/000559_images/000559-01
The workflow automatically replaces __DOWNLOADED__ with the actual download location before photogrammetry begins.
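Conceptually, this substitution is a prefix replacement on each photo_path entry. The sketch below is illustrative only, not the workflow's transform-config code; it assumes __DOWNLOADED__ is recognized only as the first path component, and the download directory passed in is a hypothetical value (the workflow chooses the real per-project location).

```python
from __future__ import annotations

DOWNLOADED_PREFIX = "__DOWNLOADED__"


def resolve_photo_paths(photo_path: str | list[str], download_dir: str) -> list[str]:
    """Replace a leading __DOWNLOADED__ component with the actual download directory."""
    # photo_path may be a single string or a list; normalize to a list.
    paths = [photo_path] if isinstance(photo_path, str) else list(photo_path)
    base = download_dir.rstrip("/")
    resolved = []
    for path in paths:
        if path == DOWNLOADED_PREFIX or path.startswith(DOWNLOADED_PREFIX + "/"):
            resolved.append(base + path[len(DOWNLOADED_PREFIX):])
        else:
            # Direct paths such as /data/... are passed through unchanged.
            resolved.append(path)
    return resolved
```

Entries that do not use the prefix are left alone, which is why mixing direct /data/... paths with downloaded ones would still resolve cleanly in this sketch.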
Zip file structure requirements
The zip filename (without .zip extension) becomes the extraction folder name. Plan your photo_path entries accordingly:
Example: Downloading 000558_images.zip containing:
000558_images.zip
├── 000558-01/
│   ├── IMG_0001.jpg
│   └── IMG_0002.jpg
└── 000558-02/
    ├── IMG_0001.jpg
    └── IMG_0002.jpg
Results in this structure after extraction:
{download_dir}/
└── 000558_images/ ← folder name from zip filename
    ├── 000558-01/
    │   ├── IMG_0001.jpg
    │   └── IMG_0002.jpg
    └── 000558-02/
        ├── IMG_0001.jpg
        └── IMG_0002.jpg
Reference these paths as __DOWNLOADED__/000558_images/000558-01 and __DOWNLOADED__/000558_images/000558-02.
Complete example configuration
argo:
  # S3 imagery download settings
  s3_imagery_zip_download:
    - ofo-public/drone/missions_01/000558/images/000558_images.zip
  cleanup_downloaded_imagery: true
  # Standard workflow settings
  match_photos:
    gpu_enabled: true
    gpu_resource: "nvidia.com/mig-1g.5gb"
    cpu_request: "4"
    memory_request: "16Gi"
  build_depth_maps:
    gpu_resource: "nvidia.com/mig-2g.10gb"
project:
  project_name: mission_000558
  # Reference downloaded imagery with __DOWNLOADED__ prefix
  photo_path:
    - __DOWNLOADED__/000558_images/000558-01
    - __DOWNLOADED__/000558_images/000558-02
# ... rest of Metashape config sections ...
match_photos:
  enabled: true
  # ...
How it works
When s3_imagery_zip_download is specified, the workflow adds these steps before photogrammetry:
- download-imagery: Downloads each zip file from S3 using rclone and extracts it
- transform-config: Replaces __DOWNLOADED__ in photo_path with the actual download location
After all processing completes (including upload and postprocessing):
- cleanup-imagery (if enabled): Deletes the downloaded imagery to free disk space
Each project gets its own isolated download directory to prevent collisions when processing multiple projects in parallel.
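The zip naming rule (filename minus .zip becomes the extraction folder) and the bucket/path/file.zip format expected for s3_imagery_zip_download can be checked up front. This is a hypothetical helper sketch, not part of the workflow; in particular, rejecting any path that contains a colon is only a heuristic for catching an rclone remote prefix of the form name:.

```python
def extraction_folder(s3_zip_path: str) -> str:
    """Folder name a zip is extracted into: the zip filename without '.zip'."""
    # Heuristic: a colon usually means an rclone remote prefix (name:bucket/...),
    # which s3_imagery_zip_download entries should not include.
    if ":" in s3_zip_path:
        raise ValueError(f"drop the remote prefix: {s3_zip_path!r}")
    parts = [p for p in s3_zip_path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"expected bucket/path/file.zip: {s3_zip_path!r}")
    filename = parts[-1]
    if not filename.lower().endswith(".zip"):
        raise ValueError(f"not a .zip file: {s3_zip_path!r}")
    return filename[: -len(".zip")]


def downloaded_prefix(s3_zip_path: str) -> str:
    """The photo_path prefix that points at this zip's extracted contents."""
    return f"__DOWNLOADED__/{extraction_folder(s3_zip_path)}"
```

Every photo_path entry for a given zip should then start with the string downloaded_prefix returns.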
Troubleshooting S3 imagery download
"Config validation failed: DOWNLOADED prefix used but no downloads specified"
Cause: Your photo_path contains __DOWNLOADED__ but s3_imagery_zip_download is empty or missing.
Solution: Either add s3_imagery_zip_download entries, or change photo_path to use direct paths (e.g., /data/...).
"Config validation failed: Downloads specified but no DOWNLOADED paths found"
Cause: You specified s3_imagery_zip_download but your photo_path entries don't use the __DOWNLOADED__ prefix.
Solution: Update photo_path to use __DOWNLOADED__/... paths that reference your downloaded zip contents.
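Both validation errors come down to whether the download list and the __DOWNLOADED__ paths are present together. A minimal sketch of that consistency check follows; the function names are hypothetical and the error messages are paraphrased, not the workflow's exact text.

```python
from __future__ import annotations

DOWNLOADED_PREFIX = "__DOWNLOADED__"


def as_list(value) -> list[str]:
    """Normalize a config value that may be a string, a list, or missing (None)."""
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def check_download_config(argo_cfg: dict, project_cfg: dict) -> None:
    """Raise ValueError for the two mismatches described above."""
    downloads = as_list(argo_cfg.get("s3_imagery_zip_download"))
    paths = as_list(project_cfg.get("photo_path"))
    uses_prefix = any(p.startswith(DOWNLOADED_PREFIX) for p in paths)
    if uses_prefix and not downloads:
        raise ValueError("__DOWNLOADED__ prefix used but no downloads specified")
    if downloads and not uses_prefix:
        raise ValueError("downloads specified but no __DOWNLOADED__ paths found")
```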
Download fails with "Failed to copy" or timeout errors
Possible causes:
- Incorrect S3 path format (should be bucket/path/file.zip without a remote prefix)
- S3 credentials not configured in the cluster's s3-credentials secret
- Network issues or S3 endpoint unavailable
- Zip file doesn't exist at the specified path
Debug steps:
- Check the download-imagery step logs in the Argo UI
- Verify the S3 path is correct by listing files (requires rclone configured with the same credentials)
"Photo path not found" errors in setup step
Cause: The extracted zip structure doesn't match your photo_path entries.
Solution:
- Check what's actually inside your zip file
- Ensure photo_path matches the extracted folder structure
- Remember: zip filename (minus .zip) becomes the top-level folder
Disk space issues
Cause: Downloaded imagery fills up the shared storage.
Solutions:
- Ensure cleanup_downloaded_imagery: true (default) to auto-delete after completion
- Process fewer projects in parallel to reduce concurrent disk usage
- Monitor disk usage during workflow execution
Resource request configuration
All Argo workflow resource requests (GPU, CPU, memory) are configured in the top-level argo section of your automate-metashape config file. The defaults assume one or more JS2 m3.large CPU nodes and one or more mig1 (7-slice MIG g3.xl) GPU nodes (see cluster access and resizing).
Importantly, using well-selected resource requests may allow more than one workflow step to schedule simultaneously on the same compute node, without substantially extending the compute time of either, thus greatly increasing compute efficiency by requiring fewer compute nodes. The example config YAML includes suggested resource requests we have developed through extensive benchmarking.
GPU scheduling
Three steps support configurable GPU usage via argo.<step>.gpu_enabled parameters:
- argo.match_photos.gpu_enabled - If true, runs on GPU node; if false, runs on CPU node (default: true)
- argo.build_mesh.gpu_enabled - If true, runs on GPU node; if false, runs on CPU node (default: true)
- argo.match_photos_secondary.gpu_enabled - Inherits from match_photos unless explicitly set
The build_depth_maps step always runs on GPU nodes (gpu_enabled cannot be disabled) as it always benefits from GPU acceleration. However, you can configure the GPU resource type and count using gpu_resource and gpu_count.
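The node-type rules above can be summarized as a small decision function. This is a sketch of the documented behavior only; runs_on_gpu is a hypothetical name, and argo_cfg stands for the parsed argo: section of a config file.

```python
GPU_CONFIGURABLE = {"match_photos", "build_mesh", "match_photos_secondary"}


def runs_on_gpu(step: str, argo_cfg: dict) -> bool:
    """Return True if `step` would be scheduled on a GPU node under the rules above."""
    if step == "build_depth_maps":
        return True  # always GPU; gpu_enabled is not consulted
    if step not in GPU_CONFIGURABLE:
        return False  # every other step is CPU-only
    step_cfg = argo_cfg.get(step) or {}
    if step_cfg.get("gpu_enabled") is not None:
        return bool(step_cfg["gpu_enabled"])
    if step == "match_photos_secondary":
        return runs_on_gpu("match_photos", argo_cfg)  # inherit from the primary step
    return True  # documented default for match_photos and build_mesh
```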
GPU resource selection (MIG Support)
For GPU steps, you can specify which GPU resource to request using gpu_resource and gpu_count in the argo section. This allows using MIG (Multi-Instance GPU) partitions instead of full GPUs:
argo:
  match_photos:
    gpu_enabled: true
    gpu_resource: "nvidia.com/mig-1g.5gb" # Use smallest MIG partition
    gpu_count: 2 # Request 2 MIG slices for more parallelism
  build_depth_maps:
    gpu_resource: "nvidia.com/gpu" # Explicitly request full GPU (this is the default)
    # gpu_count defaults to 1 if omitted
  build_mesh:
    gpu_enabled: true
    gpu_resource: "nvidia.com/mig-3g.20gb" # Larger MIG partition for mesh building
    gpu_count: 1
Available GPU resources:
| Resource | Description | Pods per GPU |
|---|---|---|
| nvidia.com/gpu | Full GPU (default if gpu_resource omitted) | 1 |
| nvidia.com/mig-1g.5gb | 1/7 compute, 5GB VRAM | 7 |
| nvidia.com/mig-2g.10gb | 2/7 compute, 10GB VRAM | 3 |
| nvidia.com/mig-3g.20gb | 3/7 compute, 20GB VRAM | 2 |
Use gpu_count to request multiple MIG slices (e.g., gpu_count: 2 with mig-1g.5gb to get 2/7 compute power).
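The table above lends itself to a quick capacity calculation when choosing gpu_resource and gpu_count. The sketch assumes each physical GPU is partitioned uniformly into a single MIG profile (as the mig1-/mig2-/mig3- nodegroup naming suggests); the helper names are hypothetical.

```python
from fractions import Fraction

# resource -> (slices per physical GPU, compute sevenths per slice), from the table above.
GPU_RESOURCES = {
    "nvidia.com/gpu": (1, 7),
    "nvidia.com/mig-1g.5gb": (7, 1),
    "nvidia.com/mig-2g.10gb": (3, 2),
    "nvidia.com/mig-3g.20gb": (2, 3),
}


def compute_share(resource: str, gpu_count: int = 1) -> Fraction:
    """Fraction of one GPU's compute that a pod requests."""
    _, sevenths = GPU_RESOURCES[resource]
    return Fraction(sevenths * gpu_count, 7)


def pods_per_gpu(resource: str, gpu_count: int = 1) -> int:
    """How many identical pods fit on one GPU partitioned into `resource` slices."""
    slices, _ = GPU_RESOURCES[resource]
    return slices // gpu_count
```

For example, gpu_count: 2 with mig-1g.5gb gives each pod 2/7 of the compute, and three such pods fit on one GPU, with one slice left for a single-slice pod. Note also that with the 2g and 3g profiles the slices together cover six of the seven compute units.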
When to use MIG
Use MIG partitions when your GPU steps have low utilization. This allows multiple workflow steps to share a single physical GPU, reducing costs. In extensive benchmarking, we have found that we get the greatest efficiency with mig-1g.5gb nodes, potentially providing more than one slice to GPU-intensive pods.
Nodegroup requirement
MIG resources are only available on MIG-enabled nodegroups. Create a MIG nodegroup with a name containing mig1-, mig2-, or mig3- (see MIG nodegroups).
CPU and memory configuration
You can configure CPU and memory requests for all workflow steps (both CPU and GPU steps) using cpu_request and memory_request parameters in the argo section:
argo:
  # Optional: Set global defaults that apply to all steps
  defaults:
    cpu_request: "10" # Default CPU cores for all steps
    memory_request: "50Gi" # Default memory for all steps
  # Override for specific steps
  match_photos:
    cpu_request: "8" # Override default CPU request for this step
    memory_request: "32Gi" # Override default memory request for this step
  build_depth_maps:
    cpu_request: "6"
    memory_request: "24Gi"
  align_cameras:
    cpu_request: "15" # CPU-heavy step
    memory_request: "50Gi"
Default values (if not specified) are hard-coded into the workflow YAML under the CPU and GPU step templates.
Fallback order:
- Step-specific value (e.g., argo.match_photos.cpu_request)
- User default from argo.defaults (if specified)
- Hardcoded default (based on step type and GPU mode)
Using defaults as a template
You can leave step-level parameters blank/empty to use the defaults, which serves as a visual template:
argo:
  defaults:
    cpu_request: "8"
    memory_request: "40Gi"
  match_photos:
    cpu_request: # Blank = uses defaults.cpu_request → 8
    memory_request: # Blank = uses defaults.memory_request → 40Gi
  build_depth_maps:
    cpu_request: "12" # Override: uses 12 instead of defaults
    memory_request: # Blank = uses defaults.memory_request → 40Gi
Secondary photo processing
The match_photos_secondary and align_cameras_secondary steps inherit resource configuration from their primary steps unless explicitly overridden:
argo:
  match_photos:
    gpu_resource: "nvidia.com/mig-2g.10gb"
    cpu_request: "6"
    memory_request: "24Gi"
  # match_photos_secondary automatically inherits the above settings
  # unless you override them:
  match_photos_secondary:
    gpu_resource: "nvidia.com/mig-1g.5gb" # Override: use smaller GPU
    # cpu_request and memory_request still inherited from match_photos
This 4-level fallback applies: Secondary-specific → Primary step → User defaults → Hardcoded defaults
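The fallback order, including blank values and secondary-step inheritance, can be sketched as a first-non-blank lookup along a chain of config sections. This is a hypothetical helper for reasoning about which value a step will get; the real hardcoded defaults live in the workflow YAML templates, so here they are simply passed in.

```python
SECONDARY_TO_PRIMARY = {
    "match_photos_secondary": "match_photos",
    "align_cameras_secondary": "align_cameras",
}


def _is_set(value) -> bool:
    # A blank YAML value loads as None; treat empty strings as blank too.
    return value is not None and value != ""


def resolve_setting(argo_cfg: dict, step: str, key: str, hardcoded):
    """Secondary-specific -> primary step -> argo.defaults -> hardcoded default."""
    chain = [step]
    if step in SECONDARY_TO_PRIMARY:
        chain.append(SECONDARY_TO_PRIMARY[step])
    chain.append("defaults")
    for section in chain:
        value = (argo_cfg.get(section) or {}).get(key)
        if _is_set(value):
            return value
    return hardcoded
```

Applied to the blank-template example above, match_photos would resolve cpu_request to "8" and build_depth_maps would keep its explicit "12".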
Parameters handled by Argo: The project_path, output_path, and project_name configuration parameters are handled automatically by the Argo workflow:
- project_path and output_path are determined via CLI arguments passed to the automate-metashape container, derived from the TEMP_WORKING_DIR Argo workflow parameter (passed by the user on the command line when invoking argo submit)
- project_name is extracted from project.project_name in the config file (or from the filename of the config file if missing in the config) and passed by Argo via CLI to each step to ensure consistent project names per mission
Any values specified for project_path and output_path in the config.yml will be overridden by Argo CLI arguments.
Create a config list file
We use a text file, for example config-list.txt, to tell the Argo workflow which config files
should be processed in the current run. Place this file in the same directory as your config files, then list just the filenames (not full paths), one per line.
Example: If your configs are in /ofo-share/argo-data/argo-input/configs/, create a file at /ofo-share/argo-data/argo-input/configs/config-list.txt:
# Benchmarking missions
01_benchmarking-greasewood.yml
02_benchmarking-greasewood.yml
# Skipping emerald for now
# 01_benchmarking-emerald-subset.yml
# 02_benchmarking-emerald-subset.yml
03_production-run.yml # high priority
Features:
- Filenames only: List just the config filename; the directory is inferred from the config list's location
- Comments: Lines starting with # (after whitespace) are skipped
- Inline comments: Text after # on any line is ignored (e.g., config.yml # note)
- Blank lines: Empty lines are ignored for readability
- Backward compatibility: Absolute paths (starting with /) still work if needed
The project name will be automatically derived from the config filename (e.g., project-name.yml becomes project project-name), unless explicitly set in the config file at project.project_name (which takes priority).
You can create your own config list file and name it whatever you want, placing it anywhere within /ofo-share/argo-data/. Then specify the path to it within the container (using /data/XYZ to refer to /ofo-share/argo-data/XYZ) using the CONFIG_LIST parameter when submitting the workflow.
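The config list rules and project-name derivation described above can be sketched as follows. This is an illustrative reimplementation with hypothetical function names, not the workflow's parser; it works on container paths (/data/...) and, like the rules above, treats everything after # as a comment, so config filenames containing # would not survive.

```python
from __future__ import annotations

import posixpath


def parse_config_list(text: str, list_path: str) -> list[str]:
    """Return config paths from the contents of a config list file.

    Text after '#' is dropped, blank lines are skipped, bare filenames are
    resolved against the list file's directory, and absolute paths are kept.
    """
    base_dir = posixpath.dirname(list_path)
    configs = []
    for raw_line in text.splitlines():
        entry = raw_line.split("#", 1)[0].strip()
        if not entry:
            continue  # blank line or comment-only line
        configs.append(entry if entry.startswith("/") else posixpath.join(base_dir, entry))
    return configs


def project_name_for(config_path: str, config: dict) -> str:
    """project.project_name wins; otherwise use the config filename minus its extension."""
    explicit = (config.get("project") or {}).get("project_name")
    if explicit:
        return explicit
    return posixpath.splitext(posixpath.basename(config_path))[0]
```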
Determine the maximum number of projects to process in parallel
When tasked with parallelizing across multiple multi-step DAGs, Argo prioritizes breadth first. So when it has a choice, it will start on a new DAG (metashape project) rather than starting the next step of an existing one. This is unfortunately not customizable, and it is undesirable because the workflow involves storing in-process files (including raw imagery, metashape project, outputs) locally during processing. Our shared storage does not have the space to store all files locally at the same time. In addition, we have a limited number of Metashape licenses. So we need to restrict the number of parallel DAGs (metashape projects) it will attempt to run.
The workflow controls this via the parallelism field in the main template (line 66 in
metashape-workflow.yaml or postprocessing-workflow.yaml). To change the max parallel projects, edit this value
directly in the workflow file before submitting. The default is set to 10.
Why not a command-line parameter?
Argo Workflows doesn't support parameter substitution for integer fields like parallelism,
so this value must be hardcoded in the workflow file. This is a known issue with Argo; once it is
resolved upstream, we can expose parallelism as a command-line parameter.
Adjusting parallelism on a running workflow
If you need to increase or decrease parallelism while a workflow is already running, you can patch the workflow in place. First, find your workflow name (for example, with argo list -n argo).
Then patch the main template's parallelism (index 0):
kubectl patch workflow <workflow-name> -n argo --type='json' \
-p='[{"op": "replace", "path": "/spec/templates/0/parallelism", "value": 20}]'
The change takes effect immediately for any new pods that haven't started yet. Already-running pods are not affected.
Note
This only affects the running workflow instance. Future submissions will still use the value from the YAML file.
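If you patch parallelism from a script, building the JSON Patch document programmatically avoids the nested shell quoting in the kubectl command above. A minimal sketch; parallelism_patch is a hypothetical helper, and template index 0 is the main template as noted above.

```python
import json


def parallelism_patch(value: int, template_index: int = 0) -> str:
    """Build the JSON Patch string for `kubectl patch workflow ... --type=json -p <patch>`."""
    if isinstance(value, bool) or not isinstance(value, int) or value < 1:
        raise ValueError("parallelism must be a positive integer")
    ops = [{
        "op": "replace",
        "path": f"/spec/templates/{template_index}/parallelism",
        "value": value,
    }]
    return json.dumps(ops)
```

The returned string can be passed as the -p value of kubectl patch (for instance via subprocess with an argument list, so no shell quoting is involved).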
Submit the workflows
Once your cluster authentication is set up and your inputs are prepared, submit the workflows.
Metashape workflow
argo submit -n argo metashape-workflow.yaml \
-p CONFIG_LIST=/data/argo-input/configs/config-list.txt \
-p TEMP_WORKING_DIR=/data/argo-output/tmp/derek-0202 \
-p S3_BUCKET_INTERNAL=ofo-internal \
-p S3_PHOTOGRAMMETRY_DIR=photogrammetry-outputs_dytest02 \
-p PHOTOGRAMMETRY_CONFIG_ID=03 \
-p COMPLETION_LOG_PATH=/data/argo-input/config-lists/completion-log-default.jsonl \
-p WORKFLOW_UTILS_IMAGE_TAG=latest \
-p AUTOMATE_METASHAPE_IMAGE_TAG=latest
Naming your workflow
You can optionally pass --name "my-workflow-name" to give your workflow a custom name. If omitted, Argo will auto-generate a unique name.
Postprocessing workflow
Run after Metashape completes (or on projects with existing Metashape outputs in S3):
argo submit -n argo postprocessing-workflow.yaml \
-p CONFIG_LIST=/data/argo-input/configs/config-list.txt \
-p TEMP_WORKING_DIR=/data/argo-output/tmp/derek-0202 \
-p S3_BUCKET_INTERNAL=ofo-internal \
-p S3_PHOTOGRAMMETRY_DIR=photogrammetry-outputs_dytest02 \
-p PHOTOGRAMMETRY_CONFIG_ID=03 \
-p S3_BUCKET_PUBLIC=ofo-public \
-p S3_POSTPROCESSED_DIR=drone_dytest02 \
-p S3_BOUNDARY_DIR=drone_dytest02 \
-p COMPLETION_LOG_PATH=/data/argo-input/config-lists/completion-log-default.jsonl \
-p WORKFLOW_UTILS_IMAGE_TAG=latest \
-p POSTPROCESSING_IMAGE_TAG=latest
Postprocessing requires completion log
The postprocessing workflow uses --require-phase metashape internally, so it will only process projects that have a metashape completion entry in the log. Make sure COMPLETION_LOG_PATH is set when submitting the metashape workflow so completions are recorded.
Metashape workflow parameters
| Parameter | Description |
|---|---|
| CONFIG_LIST | Absolute path to text file listing metashape config files. Each line should be a config filename (resolved relative to the config list's directory) or an absolute path. Lines starting with # are comments. Example: /data/argo-input/configs/config-list.txt |
| TEMP_WORKING_DIR | Absolute path for temporary workflow files. Workflow creates {workflow-name}/{project-name}/ subdirectories automatically for each mission. Project directories are automatically deleted after successful upload to free disk space. Example: /data/argo-output/temp-runs/gillan_june27 |
| PHOTOGRAMMETRY_CONFIG_ID | Two-digit configuration ID (e.g., 01, 02) used to organize outputs into photogrammetry_NN subdirectories in S3. If not specified or set to NONE, products are stored without the photogrammetry_NN subfolder. |
| S3_BUCKET_INTERNAL | S3 bucket for internal/intermediate outputs where raw Metashape products (orthomosaics, point clouds, DEMs) are uploaded (typically ofo-internal). |
| S3_PHOTOGRAMMETRY_DIR | S3 directory name for raw Metashape outputs. When PHOTOGRAMMETRY_CONFIG_ID is set, products upload to {S3_BUCKET_INTERNAL}/{S3_PHOTOGRAMMETRY_DIR}/photogrammetry_{PHOTOGRAMMETRY_CONFIG_ID}/. Example: photogrammetry-outputs |
| WORKFLOW_UTILS_IMAGE_TAG | Docker image tag for the argo-workflow-utils container (default: latest). Use a specific branch name or tag to test development versions |
| AUTOMATE_METASHAPE_IMAGE_TAG | Docker image tag for the automate-metashape container (default: latest). Use a specific branch name or tag to test development versions |
| LICENSE_RETRY_INTERVAL | Seconds to wait between license acquisition retries (default: 300 = 5 minutes). See License Retry Behavior |
| LICENSE_MAX_RETRIES | Maximum license retry attempts. Default: 180 (~15 hours at 5-minute intervals). 0 = no retries (fail immediately), -1 = unlimited retries. See License Retry Behavior |
| LOG_HEARTBEAT_INTERVAL | Seconds between heartbeat status lines during Metashape processing (default: 60). Set to 0 to disable filtering and print all Metashape output (original behavior). See Heartbeat Logger and Progress Monitoring |
| LOG_BUFFER_SIZE | Number of recent output lines kept in memory for error context (default: 100). On failure, these lines are dumped to console for immediate debugging. See Heartbeat Logger and Progress Monitoring |
| PROGRESS_INTERVAL_PCT | Percentage interval for progress reporting during Metashape API calls (default: 1). Prints structured [progress] lines at each threshold (e.g., 1%, 2%, 3%). See Heartbeat Logger and Progress Monitoring |
| COMPLETION_LOG_PATH | Path to completion log file for tracking finished projects (default: ""). When set, the workflow logs completed projects and can skip already-completed work. See Completion Tracking and Skip-If-Complete |
| SKIP_IF_COMPLETE | Skip projects that already have a completed metashape phase in the completion log (default: "false"). See Completion Tracking and Skip-If-Complete |
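The S3 location for raw products follows from three of the parameters above. As a minimal sketch (raw_output_prefix is a hypothetical helper), the prefix under which products are uploaded could be computed like this; it does not attempt to model any per-project layout beneath that prefix.

```python
from __future__ import annotations


def raw_output_prefix(bucket: str, photogrammetry_dir: str, config_id: str | None) -> str:
    """{S3_BUCKET_INTERNAL}/{S3_PHOTOGRAMMETRY_DIR}/[photogrammetry_{ID}/]"""
    base = f"{bucket.strip('/')}/{photogrammetry_dir.strip('/')}"
    if config_id is None or config_id in ("", "NONE"):
        return base + "/"  # no photogrammetry_NN subfolder
    return f"{base}/photogrammetry_{config_id}/"
```

The postprocessing workflow reads from the same location, which is why it must be given the same S3_BUCKET_INTERNAL, S3_PHOTOGRAMMETRY_DIR and PHOTOGRAMMETRY_CONFIG_ID values.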
Postprocessing workflow parameters
| Parameter | Description |
|---|---|
| CONFIG_LIST | Same as metashape workflow - the same config list can be used for both |
| TEMP_WORKING_DIR | Absolute path for temporary postprocessing files |
| PHOTOGRAMMETRY_CONFIG_ID | Same config ID used for metashape - determines where to find raw products in S3 |
| S3_BUCKET_INTERNAL | S3 bucket where raw Metashape products are stored (read by postprocessing) |
| S3_PHOTOGRAMMETRY_DIR | S3 directory where raw Metashape outputs are stored |
| S3_BUCKET_PUBLIC | S3 bucket for public/final outputs (postprocessed products) and boundary files (typically ofo-public) |
| S3_POSTPROCESSED_DIR | S3 directory name for postprocessed outputs. Example: drone/missions_03 |
| S3_BOUNDARY_DIR | Parent directory in S3_BUCKET_PUBLIC where mission boundary polygons reside. Example: drone/missions_03 |
| WORKFLOW_UTILS_IMAGE_TAG | Docker image tag for argo-workflow-utils container (default: latest) |
| POSTPROCESSING_IMAGE_TAG | Docker image tag for the photogrammetry-postprocessing container (default: latest) |
| COMPLETION_LOG_PATH | Path to completion log file. Required: the postprocessing workflow uses this to find projects with a completed metashape phase |
| SKIP_IF_COMPLETE | Skip projects that already have a completed postprocess phase in the completion log (default: "false") |
Secrets configuration:
- S3 credentials: S3 access credentials, provider type, and endpoint URL are configured via the s3-credentials Kubernetes secret
- Agisoft license: Metashape floating license server address is configured via the agisoft-license Kubernetes secret
These secrets should have been created (within the argo namespace) during cluster creation.
License Retry Behavior
Metashape requires a floating license from the Agisoft license server. When multiple workflows compete for limited licenses, some pods may fail to acquire a license at startup. The workflow includes optional retry logic to handle this.
By default, retries are enabled (LICENSE_MAX_RETRIES=180), allowing up to 180 attempts (~15 hours at the default 5-minute interval). To disable retries, set LICENSE_MAX_RETRIES to 0. To retry indefinitely, set it to -1.
How it works (when retries are enabled):
- When a Metashape step starts, it checks for license availability in the first 20 lines of output
- If "license not found" is detected, the process terminates immediately (avoiding wasted compute)
- After waiting LICENSE_RETRY_INTERVAL seconds (default: 300 = 5 minutes), the step retries
- This continues until either a license is acquired or LICENSE_MAX_RETRIES is reached
LICENSE_MAX_RETRIES values:
| Value | Behavior |
|---|---|
| 180 (default) | Retry up to 180 times (~15 hours at 5-minute intervals) |
| 0 | No retries - fail immediately if no license |
| -1 | Unlimited retries |
| >0 | Retry up to that many times |
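The retry semantics in the table can be sketched as a loop around a single license-checked attempt. This is a minimal illustration, not the license-wrapper's code: it assumes LICENSE_MAX_RETRIES counts retries after the first attempt (the wrapper's exact off-by-one behavior is not specified here), treats any negative value as unlimited, and uses hypothetical function names. The worst-case wait for the defaults works out to 180 × 300 s = 54,000 s, i.e. the ~15 hours quoted above.

```python
from __future__ import annotations

import time
from typing import Callable, Iterable


def license_missing(first_lines: Iterable[str], window: int = 20) -> bool:
    """True if 'license not found' appears in the first `window` output lines."""
    for i, line in enumerate(first_lines):
        if i >= window:
            break
        if "license not found" in line.lower():
            return True
    return False


def run_with_license_retry(
    try_once: Callable[[], bool],
    max_retries: int = 180,
    interval_s: float = 300,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[bool, int]:
    """Call try_once until it returns True or the retry budget is spent.

    max_retries: 0 = no retries, negative = unlimited, N > 0 = up to N retries.
    Returns (succeeded, attempts_made).
    """
    attempts = 0
    while True:
        attempts += 1
        if try_once():
            return True, attempts
        retries_used = attempts - 1
        if max_retries == 0 or (max_retries > 0 and retries_used >= max_retries):
            return False, attempts
        sleep(interval_s)


def worst_case_wait_hours(max_retries: int, interval_s: float) -> float:
    """Total time spent sleeping before giving up (infinite if unlimited)."""
    if max_retries < 0:
        return float("inf")
    return max_retries * interval_s / 3600
```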
Example output when retries are disabled (LICENSE_MAX_RETRIES=0):
[license-wrapper] Starting Metashape workflow (attempt 1)...
No nodelocked license found
License server 149.165.171.237:5842: License not found
[license-wrapper] No license available and retries disabled (LICENSE_MAX_RETRIES=0)
Example output when retries are enabled:
[license-wrapper] Starting Metashape workflow (attempt 1)...
No nodelocked license found
License server 149.165.171.237:5842: License not found
[license-wrapper] No license available. Waiting 300s before retry...
[license-wrapper] Starting Metashape workflow (attempt 2)...
Example output when license is acquired:
[license-wrapper] Starting Metashape workflow (attempt 1)...
No nodelocked license found
License server 149.165.171.237:5842: OK
[license-wrapper] License check passed, proceeding with workflow...
When to adjust retries
- High contention (many parallel workflows): Keep the default (180) or set LICENSE_MAX_RETRIES=-1 for unlimited retries
- Low contention / debugging: Set LICENSE_MAX_RETRIES=0 to fail immediately if a license isn't available
Heartbeat Logger and Progress Monitoring
Metashape produces extremely verbose stdout during processing. With many projects running in parallel, this volume of logs taxes the Argo artifact store and k8s control plane. The heartbeat logger reduces console output to ~50-100 lines per multi-hour job while preserving full debugging context on errors.
How It Works
The system has two layers:
- Progress callbacks: Metashape API calls report progress at configurable intervals (controlled by PROGRESS_INTERVAL_PCT). In sparse mode, progress is folded into heartbeat lines rather than printed separately. In full output mode, structured [progress] step: X% lines print immediately.
- Output monitor: The license retry wrapper filters subprocess output, writing the full log to a file on the shared volume while only passing through important lines to the console
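The progress-callback layer can be sketched as a callable that only reports when a new PROGRESS_INTERVAL_PCT threshold is crossed. This is an illustrative sketch, not the workflow's code; it assumes the callback is invoked with a percentage between 0 and 100, and ProgressReporter is a hypothetical name.

```python
import math
from typing import Callable


class ProgressReporter:
    """Progress callback that reports at fixed percentage thresholds."""

    def __init__(self, step: str, interval_pct: float = 1, sparse: bool = True,
                 emit: Callable[[str], None] = print):
        self.step = step
        self.interval = interval_pct
        self.sparse = sparse  # sparse mode: remember progress, let heartbeats report it
        self.emit = emit
        self.latest = 0.0
        self.next_threshold = interval_pct

    def __call__(self, pct: float) -> None:
        self.latest = pct
        if pct < self.next_threshold:
            return
        # Report the highest threshold reached, even if several were skipped at once.
        reached = math.floor(pct / self.interval) * self.interval
        self.next_threshold = reached + self.interval
        if not self.sparse:
            self.emit(f"[progress] {self.step}: {reached:g}%")

    def heartbeat_fragment(self) -> str:
        """Text a heartbeat line could include, e.g. 'buildDepthMaps: 20%'."""
        return f"{self.step}: {self.latest:.0f}%"
```

Reporting only the highest threshold reached keeps a sudden jump (say from 12% to 45%) to a single line instead of one line per skipped percent.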
Operating Modes
The behavior is controlled by LOG_HEARTBEAT_INTERVAL:
Sparse mode (default, LOG_HEARTBEAT_INTERVAL > 0):
- Console shows only [license-wrapper] and [monitor] lines, plus periodic heartbeats
- Heartbeat includes timestamp, output line count, elapsed time, latest progress percentage, and the most recent Metashape output line
- Progress percentages are folded into heartbeat lines rather than printed separately
- Full log file written to disk with every line (no timestamps added, zero overhead)
- On failure, the last LOG_BUFFER_SIZE lines are dumped to console for immediate debugging
Full output mode (LOG_HEARTBEAT_INTERVAL=0):
- Every line printed to console (original behavior)
- [progress] milestones still appear at configured intervals
- Full log file still written to disk
- Error buffer still dumped on failure
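The two operating modes can be sketched as a line filter with a bounded buffer (collections.deque with maxlen) for the error dump. This is a simplified illustration under stated assumptions, not the wrapper's implementation: OutputMonitor is a hypothetical name, its heartbeat omits the wall-clock timestamp and progress fields, and it only checks the clock when a new line arrives, whereas a real monitor would likely use a timer so heartbeats continue during silent stretches.

```python
import time
from collections import deque
from typing import Callable

WRAPPER_PREFIXES = ("[license-wrapper]", "[monitor]")


class OutputMonitor:
    """Sparse console output with periodic heartbeats and an error-context buffer."""

    def __init__(self, log_write: Callable[[str], object], console: Callable[[str], None] = print,
                 heartbeat_interval: float = 60, buffer_size: int = 100,
                 clock: Callable[[], float] = time.monotonic):
        self.log_write = log_write  # e.g. the write method of an open log file
        self.console = console
        self.heartbeat_interval = heartbeat_interval  # LOG_HEARTBEAT_INTERVAL; 0 = full output
        self.recent = deque(maxlen=buffer_size)  # the LOG_BUFFER_SIZE most recent lines
        self.clock = clock
        self.start = self.last_heartbeat = clock()
        self.count = 0

    def feed(self, line: str) -> None:
        self.log_write(line + "\n")  # the full log gets every line, unmodified
        self.recent.append(line)
        self.count += 1
        if self.heartbeat_interval == 0:
            self.console(line)  # full output mode
            return
        if line.startswith(WRAPPER_PREFIXES):
            self.console(line)
        now = self.clock()
        if now - self.last_heartbeat >= self.heartbeat_interval:
            self.last_heartbeat = now
            self.console(f"[heartbeat] output lines: {self.count} | "
                         f"elapsed: {int(now - self.start)}s | last: {line}")

    def dump_error_context(self) -> None:
        self.console(f"[monitor] === Last {len(self.recent)} lines before error ===")
        for line in self.recent:
            self.console(line)
        self.console("[monitor] === End error context ===")
```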
Console Output Examples
Normal operation (sparse mode):
[license-wrapper] Starting Metashape workflow (attempt 1)...
No nodelocked license found
License server 149.165.171.237:5842: OK
[license-wrapper] License check passed, proceeding with workflow...
[monitor] Full log: /data/.../photogrammetry/metashape-build_depth_maps.log
[heartbeat] 14:32:15 | output lines: 247 | elapsed: 60s | buildDepthMaps: 20% | last: Processing depth map for camera 145...
[heartbeat] 14:33:15 | output lines: 512 | elapsed: 120s | buildDepthMaps: 45% | last: Building point cloud from depth maps... chunk 3/12
...
[monitor] SUCCESS | total output lines: 5247 | elapsed: 3847s
[monitor] Full metashape output log saved to: /data/.../photogrammetry/metashape-build_depth_maps.log
Error with buffer dump (sparse mode):
[heartbeat] 15:47:00 | output lines: 3100 | elapsed: 7200s | buildDepthMaps: 60% | last: Processing depth map for camera 3175...
[monitor] === Last 100 lines before error ===
2024-02-08 15:47:15 Processing depth map for camera 3180...
...
2024-02-08 15:47:45 Error: Insufficient memory for depth map computation
RuntimeError: Not enough memory
[monitor] === End error context ===
[monitor] FAILED (exit code 1) | total output lines: 3247 | elapsed: 7215s
[monitor] Full metashape output log saved to: /data/.../photogrammetry/metashape-build_depth_maps.log
Full Log Files
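Complete Metashape output is saved to the shared volume, at the path printed on the [monitor] Full log: console line (e.g., /data/.../photogrammetry/metashape-build_depth_maps.log for the build_depth_maps step).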
These files contain every line of output as-is (no timestamps added) and are available for download from the Argo UI artifacts or via direct filesystem access. They are automatically cleaned up by the existing cleanup step after workflow completion.
Configuration
All three parameters have sensible defaults and require no configuration for normal use:
| Parameter | Default | Description |
|---|---|---|
| LOG_HEARTBEAT_INTERVAL | 60 | Seconds between heartbeat lines; 0 = full output mode |
| LOG_BUFFER_SIZE | 100 | Lines kept in memory for error context dump |
| PROGRESS_INTERVAL_PCT | 1 | Progress reporting interval (%) |
To use full output mode (e.g., for debugging or initial validation):
argo submit -n argo metashape-workflow.yaml \
-p LOG_HEARTBEAT_INTERVAL=0 \
# ... other parameters ...
Migration path
Start with LOG_HEARTBEAT_INTERVAL=0 (full output mode) to validate that progress callbacks and log files work correctly. Then switch to the default sparse mode (60) once you're comfortable with the reduced console output. You can always set it back to 0 without any code changes.
Completion Tracking and Skip-If-Complete
The workflow includes a completion tracking system that logs finished projects and can automatically skip already-completed work. This is useful for:
- Resuming cancelled workflows: Resubmit a workflow and automatically skip projects that already completed
- Iterative processing: Re-run with different postprocessing settings without redoing Metashape processing
- Cost optimization: Avoid wasting compute resources on already-completed projects
- Partial reruns: Selectively reprocess only Metashape or only postprocessing steps
How It Works
When COMPLETION_LOG_PATH is set:
- Metashape workflow: Reads the log to skip already-completed projects (based on SKIP_IF_COMPLETE), and logs the metashape phase on completion
- Postprocessing workflow: Reads the log to find projects with a completed metashape phase (via --require-phase), skips already-postprocessed projects (based on SKIP_IF_COMPLETE), and logs the postprocess phase on completion
Both workflows share the same completion log file, enabling the postprocessing workflow to automatically gate on metashape completion.
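The gating between the two workflows relies on an ordering of phases, in which a recorded postprocess phase implies metashape has also finished. A minimal sketch of the --require-phase check, assuming the log has already been reduced to one highest phase per project (require_phase and PHASE_RANK are hypothetical names, not the workflow's code):

```python
from typing import Dict, List

# Ordering of phases: a recorded "postprocess" implies "metashape" is done too
PHASE_RANK = {"metashape": 1, "postprocess": 2}


def require_phase(projects: List[str], highest_phase: Dict[str, str],
                  required: str = "metashape") -> List[str]:
    """Return projects (in input order) whose recorded phase is at least `required`."""
    needed = PHASE_RANK[required]
    # Projects missing from the log rank 0 and are never eligible
    return [p for p in projects if PHASE_RANK.get(highest_phase.get(p), 0) >= needed]
```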
Completion Log Format
The completion log is a JSON Lines file (.jsonl) where each line represents a completed project phase:
{"project_name":"mission_001","phase":"postprocess","timestamp":"2024-01-15T10:30:00Z","workflow_name":"postprocessing-workflow-abc123"}
{"project_name":"mission_002","phase":"metashape","timestamp":"2024-01-15T11:45:00Z","workflow_name":"metashape-workflow-def456"}
Fields:
| Field | Description |
|---|---|
| project_name | Project identifier from config file |
| phase | Either "metashape" (Metashape processing complete) or "postprocess" (postprocessing complete) |
| timestamp | ISO 8601 UTC timestamp when the phase completed |
| workflow_name | Argo workflow name for traceability |
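Writing an entry in this format could look like the sketch below. append_completion is a hypothetical helper, not the workflow's actual code. It uses fcntl.flock for an exclusive advisory lock (Unix only); whether such locks are honored across pods depends on the filesystem backing the shared volume, so treat that part as an assumption.

```python
import fcntl
import json
from datetime import datetime, timezone
from typing import Optional


def append_completion(log_path: str, project_name: str, phase: str,
                      workflow_name: str, now: Optional[datetime] = None) -> str:
    """Append one completion entry as a JSON line and return the line written."""
    if phase not in ("metashape", "postprocess"):
        raise ValueError(f"unknown phase: {phase}")
    ts = (now or datetime.now(timezone.utc)).strftime("%Y-%m-%dT%H:%M:%SZ")
    # Compact separators match the one-object-per-line format shown above
    line = json.dumps({"project_name": project_name, "phase": phase,
                       "timestamp": ts, "workflow_name": workflow_name},
                      separators=(",", ":"))
    # Mode "a" creates the file if it does not exist yet
    with open(log_path, "a", encoding="utf-8") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # serialize concurrent writers
        try:
            f.write(line + "\n")
            f.flush()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
    return line
```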
Backward compatibility
Existing log entries using the legacy completion_level field are still supported. The reader checks for phase first, falling back to completion_level.
Key behavior:
- Use separate log files for different configs (e.g., completion-log-default.jsonl, completion-log-highres.jsonl)
- Each project can have at most two entries in a log file: one for metashape and one for postprocess
- If multiple entries exist for the same project, the highest phase is used (postprocess > metashape)
- The log file is created automatically if it doesn't exist
- Concurrent writes from parallel projects are handled safely with file locking
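The reading rules above (prefer phase, fall back to completion_level, keep the highest phase per project) can be sketched as follows. load_highest_phases is a hypothetical name, and the sketch assumes legacy completion_level values use the same "metashape"/"postprocess" strings, which the original text does not confirm.

```python
import json
from typing import Dict, Iterable

PHASE_RANK = {"metashape": 1, "postprocess": 2}


def load_highest_phases(lines: Iterable[str]) -> Dict[str, str]:
    """Map each project_name to the highest phase recorded for it."""
    highest: Dict[str, str] = {}
    for lineno, raw in enumerate(lines, start=1):
        if not raw.strip():
            continue
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            print(f"warning: skipping malformed line {lineno}")
            continue
        if not isinstance(entry, dict):
            continue
        # Prefer "phase"; fall back to the legacy "completion_level" field
        phase = entry.get("phase") or entry.get("completion_level")
        name = entry.get("project_name")
        if name is None or not isinstance(phase, str) or phase not in PHASE_RANK:
            continue
        # Keep the highest phase: postprocess > metashape
        if PHASE_RANK[phase] > PHASE_RANK.get(highest.get(name), 0):
            highest[name] = phase
    return highest
```

Passing an open file object works directly, since iterating a file yields its lines.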
Skip Modes
SKIP_IF_COMPLETE is a boolean ("true" or "false") that controls whether to skip projects whose phase is already recorded in the completion log. Each workflow automatically checks its own phase:
- Metashape workflow: Skips projects with a completed metashape (or postprocess) phase
- Postprocessing workflow: Skips projects with a completed postprocess phase
| Value | Behavior | Use Case |
|---|---|---|
| "false" (default) | Never skip any projects | Fresh processing run |
| "true" | Skip projects already completed for this workflow's phase | Resume after cancellation |
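The skip decision in the table can be expressed as a small function. This is a sketch under stated assumptions: should_skip, parse_bool, and the workflow keys "metashape" and "postprocessing" are hypothetical names, and SKIP_IF_COMPLETE is treated as the string parameter the table describes.

```python
from typing import Optional

PHASE_RANK = {"metashape": 1, "postprocess": 2}
# Phase that each workflow records on completion (hypothetical keys)
WORKFLOW_PHASE = {"metashape": "metashape", "postprocessing": "postprocess"}


def parse_bool(value: str) -> bool:
    """Accept only "true" or "false" (case-insensitive)."""
    v = value.strip().lower()
    if v not in ("true", "false"):
        raise ValueError(f"expected 'true' or 'false', got {value!r}")
    return v == "true"


def should_skip(workflow: str, recorded_phase: Optional[str],
                skip_if_complete: str = "false") -> bool:
    """Skip only when enabled and the recorded phase covers this workflow's phase."""
    if not parse_bool(skip_if_complete) or recorded_phase is None:
        return False
    return PHASE_RANK[recorded_phase] >= PHASE_RANK[WORKFLOW_PHASE[workflow]]
```

Because postprocess ranks above metashape, the Metashape workflow also skips projects that are already postprocessed, matching the rule listed above.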
Usage Examples
Resume a cancelled metashape workflow
If the metashape workflow was cancelled or failed partway through, resubmit to skip already-finished projects:
argo submit -n argo metashape-workflow.yaml \
-p CONFIG_LIST=/data/argo-input/configs/batch1.txt \
-p COMPLETION_LOG_PATH=/data/argo-input/config-lists/completion-log-default.jsonl \
-p SKIP_IF_COMPLETE=true \
-p TEMP_WORKING_DIR=/data/argo-output/tmp/batch1 \
# ... other parameters ...
Only projects that haven't completed metashape will run.
Run postprocessing on completed metashape projects
After metashape completes (or on projects with existing Metashape outputs):
argo submit -n argo postprocessing-workflow.yaml \
-p CONFIG_LIST=/data/argo-input/configs/batch1.txt \
-p COMPLETION_LOG_PATH=/data/argo-input/config-lists/completion-log-default.jsonl \
-p TEMP_WORKING_DIR=/data/argo-output/tmp/batch1-postprocess \
# ... other parameters ...
Only projects with a completed metashape phase will be included.
Re-run postprocessing with different settings
To rerun postprocessing (e.g., changed clipping boundaries) while skipping already-postprocessed projects:
argo submit -n argo postprocessing-workflow.yaml \
-p CONFIG_LIST=/data/argo-input/configs/batch1.txt \
-p COMPLETION_LOG_PATH=/data/argo-input/config-lists/completion-log-default.jsonl \
-p SKIP_IF_COMPLETE=true \
-p TEMP_WORKING_DIR=/data/argo-output/tmp/batch1-reprocess \
# ... other parameters ...
Force complete reprocessing
To reprocess everything regardless of completion log:
argo submit -n argo metashape-workflow.yaml \
-p CONFIG_LIST=/data/argo-input/configs/batch1.txt \
-p COMPLETION_LOG_PATH=/data/argo-input/config-lists/completion-log-default.jsonl \
# ... other parameters ...
All projects will run (since SKIP_IF_COMPLETE defaults to "false"), and completion will still be logged for future use.
Bootstrapping from Existing Products
If you have projects that were processed before completion tracking was implemented, you can generate a retroactive completion log by scanning S3 buckets for existing products.
Use the generate_retroactive_log.py utility script (requires boto3 Python package):
# Install dependency
pip install boto3
# Set S3 credentials (for non-AWS S3 like Ceph/MinIO)
export S3_ENDPOINT=https://s3.example.com
export AWS_ACCESS_KEY_ID=your-access-key
export AWS_SECRET_ACCESS_KEY=your-secret-key
# Generate log from existing S3 products for default config
python docker-workflow-utils/manually-run-utilities/generate_retroactive_log.py \
--internal-bucket ofo-internal \
--internal-prefix photogrammetry/default-run \
--public-bucket ofo-public \
--public-prefix postprocessed \
--output /data/argo-input/config-lists/completion-log-default.jsonl
# For a specific config (e.g., highres), use config-specific prefix and output file
python docker-workflow-utils/manually-run-utilities/generate_retroactive_log.py \
--internal-bucket ofo-internal \
--internal-prefix photogrammetry/default-run/photogrammetry_highres \
--public-bucket ofo-public \
--public-prefix postprocessed \
--output /data/argo-input/config-lists/completion-log-highres.jsonl
Script options:
| Option | Description |
|---|---|
| --internal-bucket | S3 bucket for internal/Metashape products |
| --internal-prefix | S3 prefix for Metashape products, including any config-specific subdirectories (e.g., photogrammetry/default-run for default config, or photogrammetry/default-run/photogrammetry_highres for highres config) |
| --public-bucket | S3 bucket for public/postprocessed products |
| --public-prefix | S3 prefix for postprocessed products |
| --phase | Which completion phases to detect: metashape, postprocess, or both (default: both) |
| --output | Output file path for completion log. Use config-specific names (e.g., completion-log-default.jsonl, completion-log-highres.jsonl) |
| --append | Append to existing log instead of overwriting |
| --dry-run | Preview what would be written without actually writing |
Example dry run to preview results:
python docker-workflow-utils/manually-run-utilities/generate_retroactive_log.py \
--internal-bucket ofo-internal \
--internal-prefix photogrammetry/default-run \
--public-bucket ofo-public \
--public-prefix postprocessed \
--dry-run \
--output /tmp/completion-log-default.jsonl
The script detects completed projects by looking for sentinel files:
- Metashape complete: *_report.pdf in the project folder
- Postprocess complete: <project_name>_ortho.tif in the public bucket
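A minimal sketch of that sentinel check follows, assuming the project name can be taken from the sentinel file name itself (the real script may derive it from folder names instead). All function names are hypothetical. list_keys uses boto3's list_objects_v2 paginator and imports boto3 lazily so the pure detection functions run without it; S3_ENDPOINT is not a variable boto3 reads on its own, so the endpoint is passed explicitly.

```python
from typing import Dict, Iterable, Iterator, Set

REPORT_SUFFIX = "_report.pdf"  # Metashape sentinel
ORTHO_SUFFIX = "_ortho.tif"    # postprocessing sentinel


def list_keys(bucket: str, prefix: str, endpoint_url: str) -> Iterator[str]:
    """Yield every object key under prefix (credentials come from AWS_* env vars)."""
    import boto3  # lazy import: only needed when actually listing S3
    s3 = boto3.client("s3", endpoint_url=endpoint_url)
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]


def projects_with_suffix(keys: Iterable[str], suffix: str) -> Set[str]:
    """Project names taken from file names ending in suffix (assumed naming)."""
    names = set()
    for key in keys:
        filename = key.rsplit("/", 1)[-1]
        if filename.endswith(suffix) and len(filename) > len(suffix):
            names.add(filename[: -len(suffix)])
    return names


def detect_phases(internal_keys: Iterable[str], public_keys: Iterable[str]) -> Dict[str, str]:
    """Highest detected phase per project: postprocess overrides metashape."""
    phases = {n: "metashape" for n in projects_with_suffix(internal_keys, REPORT_SUFFIX)}
    for name in projects_with_suffix(public_keys, ORTHO_SUFFIX):
        phases[name] = "postprocess"
    return phases
```

A call would then look like detect_phases(list_keys("ofo-internal", "photogrammetry/default-run", endpoint), list_keys("ofo-public", "postprocessed", endpoint)), with endpoint read from S3_ENDPOINT.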
Generating Remaining Configs After Cancellation
If you need to create a new config list containing only uncompleted projects (useful for manual workflow management):
python docker-workflow-utils/manually-run-utilities/generate_remaining_configs.py \
/data/argo-input/configs/batch1.txt \
/data/argo-input/config-lists/completion-log-default.jsonl \
--phase postprocess \
-o /data/argo-input/configs/batch1-remaining.txt
This reads the original config list, filters out completed projects, and outputs a new config list with only remaining projects. Note: Use the config-specific completion log file (e.g., completion-log-default.jsonl).
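The filtering step can be sketched as below, assuming the config list holds one config path per line and that highest_phase comes from parsing the completion log. project_name_of stands in for reading project.project_name from each config file; it, read_config_list, and remaining_configs are hypothetical names, not the script's API.

```python
from typing import Callable, Dict, List

PHASE_RANK = {"metashape": 1, "postprocess": 2}


def read_config_list(path: str) -> List[str]:
    """One config path per line; blank lines ignored (assumed format)."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def remaining_configs(config_paths: List[str], highest_phase: Dict[str, str],
                      project_name_of: Callable[[str], str],
                      phase: str = "postprocess") -> List[str]:
    """Keep configs whose project has not yet reached `phase`, preserving order."""
    needed = PHASE_RANK[phase]
    return [path for path in config_paths
            if PHASE_RANK.get(highest_phase.get(project_name_of(path)), 0) < needed]
```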
Troubleshooting Completion Tracking
Projects not being skipped when they should be
Possible causes:
- Wrong completion log file: Using the wrong config-specific log file
  - Solution: Ensure COMPLETION_LOG_PATH points to the correct config-specific log (e.g., completion-log-default.jsonl for default config, completion-log-highres.jsonl for highres config)
- Project name mismatch: The project name in the log doesn't match the config file's project name
  - Debug: Check the determine-projects step logs to see extracted project names
  - Solution: Ensure project.project_name in config matches the log entry
- Skip not enabled: SKIP_IF_COMPLETE is "false" (the default)
  - Solution: Set -p SKIP_IF_COMPLETE=true to skip already-completed projects
- Completion log path incorrect: The log file isn't where the workflow expects
  - Debug: Check workflow logs for "completion log not found" messages
  - Solution: Verify COMPLETION_LOG_PATH is correct and accessible from containers
Projects being skipped when they shouldn't be
Possible causes:
- Stale log entries: The log contains entries from previous runs that should be removed
  - Solution: Manually edit the .jsonl file to remove unwanted entries, or start with a fresh log
- Wrong log file: Using a log file from a different configuration
  - Solution: Verify you're using the correct config-specific log file (e.g., completion-log-default.jsonl for default config, not a log from highres config)
Completion log corruption or malformed entries
Symptoms: Warnings in determine-projects logs about "malformed line" or "skipping line"
Causes:
- Manual editing introduced invalid JSON
- Concurrent writes without proper locking (shouldn't happen with the workflow, but possible with external tools)
Solutions:
- Validate the JSON Lines file to locate malformed lines
- Regenerate from S3 using generate_retroactive_log.py
- Manual fix: Edit the .jsonl file with a text editor, ensuring each line is valid JSON
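One way to validate the log is a short script that reports the line number and problem for every bad entry. This is a sketch: validate_jsonl is a hypothetical helper, and its required-field list is taken from the Fields table above, with completion_level accepted in place of phase for legacy entries.

```python
import json
from typing import Iterable, List, Tuple

REQUIRED_FIELDS = ("project_name", "timestamp", "workflow_name")


def validate_jsonl(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, problem) pairs; an empty list means the file is clean."""
    problems: List[Tuple[int, str]] = []
    for n, raw in enumerate(lines, start=1):
        if not raw.strip():
            continue  # blank lines carry no entry
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems.append((n, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(entry, dict):
            problems.append((n, "not a JSON object"))
            continue
        missing = [k for k in REQUIRED_FIELDS if k not in entry]
        # Legacy entries may use completion_level instead of phase
        if "phase" not in entry and "completion_level" not in entry:
            missing.append("phase")
        if missing:
            problems.append((n, "missing " + ", ".join(missing)))
    return problems
```

For example, iterating validate_jsonl(open(path)) and printing each pair points directly at the lines to fix.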
Disk space issues with completion log
Unlikely scenario, but if the log grows very large (thousands of projects over many months):
- Solution: Archive or split old log entries by date/config_id
- Note: The log file size is minimal (~150 bytes per entry), so this is rarely a concern
Monitor the workflow
Using the Argo UI
The Argo UI is great for troubleshooting and checking individual step progress. Access it at argo.focal-lab.org, using the credentials from Vaultwarden under the record "Argo UI token".
Navigating the Argo UI
The Workflows tab on the left side menu shows all running workflows. Click a workflow to see a detailed DAG (directed acyclic graph) showing:
- Preprocessing task: The determine-projects step that reads config files
- Per-mission columns: Each mission shows as a separate column with all its processing steps
- Individual step status: Each of the 10+ steps shown with color-coded status
Step status colors:
- 🟢 Green (Succeeded): Step completed successfully
- 🔵 Blue (Running): Step currently executing
- ⚪ Gray (Skipped): Step was disabled in config or conditionally skipped
- 🔴 Red (Failed): Step encountered an error
- 🟡 Yellow (Pending): Step waiting for dependencies
Click on a specific step to see detailed information including:
- Which VM/node it's running on (CPU vs GPU node)
- Duration of the step
- Real-time logs
- Resource usage
- Input/output parameters
Viewing Step Logs
To view logs for a specific step:
- Click the workflow in Argo UI
- Click on the individual step node (e.g., match-photos-gpu, build-depth-maps)
- Click the "Logs" tab
- Logs will stream in real-time if the step is running
Multi-mission view
When processing multiple missions, the Argo UI shows all missions side-by-side. This makes it easy to:
- See which missions are at which step
- Identify if one mission is failing while others succeed
- Compare processing times across missions
- Monitor overall workflow progress
Understanding step names
Task names in the Argo UI follow the pattern process-projects-N.<step-name>:
- process-projects-0.setup - Setup step for first mission (index 0)
- process-projects-0.match-photos-gpu - Match photos on GPU for first mission
- process-projects-1.build-depth-maps - Build depth maps for second mission (index 1)
Finding Your Mission
To identify which mission corresponds to which index:
- Check the determine-projects step logs to see the order of missions in the JSON output
- Click on any task (e.g., process-projects-0.setup) and view the parameters to see the project-name value
- The project name appears in all file paths, logs, and processing outputs
GPU-capable steps show either a -gpu or a -cpu suffix, depending on the config.
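When scripting against workflow status, the naming pattern above can be parsed mechanically. A sketch, assuming task names follow process-projects-N.<step-name> exactly and that a trailing -gpu or -cpu marks the node type; parse_task_name and TaskName are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

TASK_RE = re.compile(r"^process-projects-(\d+)\.(.+)$")


class TaskName(NamedTuple):
    mission_index: int
    step: str
    node_type: Optional[str]  # "gpu", "cpu", or None when there is no suffix


def parse_task_name(name: str) -> Optional[TaskName]:
    """Split a per-mission task name into index, step, and node type."""
    m = TASK_RE.match(name)
    if m is None:
        return None  # e.g. the determine-projects preprocessing task
    index, step = int(m.group(1)), m.group(2)
    node_type = None
    for suffix in ("gpu", "cpu"):
        if step.endswith("-" + suffix):
            step, node_type = step[: -len(suffix) - 1], suffix
            break
    return TaskName(index, step, node_type)
```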
Using the CLI
View workflow status from the command line:
# Watch overall workflow progress
argo watch -n argo <workflow-name>
# List all workflows
argo list -n argo
# Show each step (node) of a workflow with its pod name (PODNAME column)
argo get -n argo <workflow-name>
# Get logs for all steps of a workflow
argo logs -n argo <workflow-name>
# Get logs for one step, e.g. determine-projects or process-projects-0.match-photos-gpu:
# pass the pod name that argo get lists for that step (-c selects a container, not a step)
argo logs -n argo <workflow-name> <pod-name>
# Follow logs in real-time
argo logs -n argo <workflow-name> <pod-name> -f
Workflow outputs
The final outputs will be written to S3:ofo-public in the following directory structure:
/S3:ofo-public/
└── <OUTPUT_DIRECTORY>/
    ├── dataset1/
    │   ├── images/
    │   ├── metadata-images/
    │   ├── metadata-mission/
    │   │   └── dataset1_mission-metadata.gpkg
    │   ├── photogrammetry_01/
    │   │   ├── full/
    │   │   │   ├── dataset1_cameras.xml
    │   │   │   ├── dataset1_chm-ptcloud.tif
    │   │   │   ├── dataset1_dsm-ptcloud.tif
    │   │   │   ├── dataset1_dtm-ptcloud.tif
    │   │   │   ├── dataset1_log.txt
    │   │   │   ├── dataset1_ortho-dtm-ptcloud.tif
    │   │   │   ├── dataset1_points.copc.laz
    │   │   │   └── dataset1_report.pdf
    │   │   └── thumbnails/
    │   │       ├── dataset1_chm-ptcloud.png
    │   │       ├── dataset1_dsm-ptcloud.png
    │   │       ├── dataset1_dtm-ptcloud.png
    │   │       └── dataset1-ortho-dtm-ptcloud.png
    │   └── photogrammetry_02/
    │       ├── full/
    │       │   ├── dataset1_cameras.xml
    │       │   ├── dataset1_chm-ptcloud.tif
    │       │   ├── dataset1_dsm-ptcloud.tif
    │       │   ├── dataset1_dtm-ptcloud.tif
    │       │   ├── dataset1_log.txt
    │       │   ├── dataset1_ortho-dtm-ptcloud.tif
    │       │   ├── dataset1_points.copc.laz
    │       │   └── dataset1_report.pdf
    │       └── thumbnails/
    │           ├── dataset1_chm-ptcloud.png
    │           ├── dataset1_dsm-ptcloud.png
    │           ├── dataset1_dtm-ptcloud.png
    │           └── dataset1-ortho-dtm-ptcloud.png
    └── dataset2/
This directory structure should already exist prior to running the Argo workflow.
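To check a run against this layout, a sketch like the following lists the object keys expected for one dataset and photogrammetry run, then reports which are absent from a key listing. The file suffixes are copied from the tree above (including the hyphenated ortho thumbnail name as shown there); which products actually exist depends on the config, so treat the list as an assumption. expected_product_keys and missing_products are hypothetical names.

```python
from typing import Iterable, List

FULL_SUFFIXES = ["_cameras.xml", "_chm-ptcloud.tif", "_dsm-ptcloud.tif",
                 "_dtm-ptcloud.tif", "_log.txt", "_ortho-dtm-ptcloud.tif",
                 "_points.copc.laz", "_report.pdf"]
THUMBNAIL_SUFFIXES = ["_chm-ptcloud.png", "_dsm-ptcloud.png",
                      "_dtm-ptcloud.png", "-ortho-dtm-ptcloud.png"]


def expected_product_keys(output_dir: str, dataset: str, run: str) -> List[str]:
    """Object keys expected under <output_dir>/<dataset>/<run>/ per the tree above."""
    base = f"{output_dir.strip('/')}/{dataset}/{run}"
    keys = [f"{base}/full/{dataset}{s}" for s in FULL_SUFFIXES]
    keys += [f"{base}/thumbnails/{dataset}{s}" for s in THUMBNAIL_SUFFIXES]
    return keys


def missing_products(existing_keys: Iterable[str], output_dir: str,
                     dataset: str, run: str) -> List[str]:
    """Expected keys that do not appear in existing_keys."""
    present = set(existing_keys)
    return [k for k in expected_product_keys(output_dir, dataset, run) if k not in present]
```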