# Using the MLOps tool to run GeoWATCH

NOTE: THE PATH FORMAT IS IN HIGH FLUX. DOCS MAY BE OUTDATED

MLOps is a wrapper around the cmd_queue library that provides a single entrypoint for all steps of the GeoWATCH TA-2 pipeline, starting with a model checkpoint and the program data:

1. Predict BAS on low-res data (S2/L8)
2. Pixel evaluation metrics on BAS
3. Convert to polygons in IARPA site model geojson format
4. Score on IARPA metrics [using a heuristic for SC phases]
5. Visualize BAS
6. Create a cropped SC dataset from BAS predictions with high-res data (S2/WV/PD)
7. Predict SC on high-res data
8. Pixel evaluation metrics on SC
9. Convert to polygons in IARPA site model geojson format (again)
10. Score on IARPA metrics (again)
11. Visualize SC

You can invoke any one of these steps through `python -m geowatch ...` without MLOps, but MLOps handles all the plumbing of feeding jobs' inputs and outputs to each other and spawning them to use all your available compute resources.

## using DVC and geowatch_dvc

(Note: `geowatch_dvc` will be replaced by `sdvc` from the `simple_dvc` package.)

`geowatch_dvc` is an alias for `python -m geowatch.cli.find_dvc`. MLOps uses it to manage paths to your DVC repos. To get started:

`geowatch_dvc list`

Do you see a data and an expt repo with exists=True? If not, either:

a) You have an existing local DVC checkout, but it's not listed. Register it:

```bash
geowatch_dvc add --name=my_data_repo --path=path/to/smart_data_dvc --hardware=hdd --priority=100 --tags=phase2_data
geowatch_dvc add --name=my_expt_repo --path=path/to/smart_expt_dvc --hardware=hdd --priority=100 --tags=phase2_expt
```

b) You don't have an existing local DVC checkout. Clone the source repos:

- https://gitlab.kitware.com/smart/smart_data_dvc
- https://gitlab.kitware.com/smart/smart_expt_dvc

and put them in the recommended places. These paths can also be overridden by environment variables; do so at your own risk.

If this worked, you can:

```bash
cd $(geowatch_dvc --tags="phase2_expt")
```

Note that this `cd` fails when the terminal is too narrow to fit the path on one line, because `geowatch_dvc` will pretty-print it with linebreaks.

### sidebar: setting up a shared DVC cache on a shared machine

Multiple users can share a DVC cache so they don't have to duplicate the data!

```bash
dvc config cache.shared group
dvc config cache.type symlink
dvc cache dir /data/dvc-caches/smart_watch_dvc  # make sure everyone can read/write here
dvc checkout
```

## using mlops to check out a model

`geowatch mlops` is an alias for `python -m geowatch.cli.mlops_cli` or `python -m geowatch.mlops.expt_manager`. The entrypoint in code is `watch/mlops/expt_manager.py`.

Let's choose a model and do stuff with it!

```bash
geowatch mlops --help
MODEL="Drop4_BAS_Retrain_V002"
geowatch mlops "pull packages" --model_pattern="${MODEL}*"
geowatch mlops "pull evals" --model_pattern="${MODEL}*"
geowatch mlops "status"
python -m geowatch.mlops.expt_manager "list" --model_pattern="${MODEL}*"    # for more info
python -m geowatch.mlops.expt_manager "report" --model_pattern="${MODEL}*"  # for more info
```
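Before digging into the status tables (described next), a quick sanity check can confirm the pull actually materialized files. This is a minimal sketch: it assumes the pull succeeded and uses the Drop4 dataset code and the on-disk layout described later in this doc.

```bash
# Hypothetical sanity check: confirm the pulled model package exists on disk.
DVC_EXPT_DPATH=$(geowatch_dvc --tags="phase2_expt")
ls "$DVC_EXPT_DPATH/models/fusion/Aligned-Drop4-2022-08-08-TA1-S2-L8-ACC/packages/Drop4_BAS_Retrain_V002/"
```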
The "volatile" table shows model predictions, which are not versioned for space reasons, but are an intermediate product of running pixel and polygon evaluations. The "versioned" table shows models and evaluations, which should now exist in your local DVC repo.

The "versioned" table in more detail:

- type - "pkg_fpath" (model) or "eval_*"
- has_raw - the file exists
- has_dvc - the `.dvc` file exists
- needs_pull - the `.dvc` file is not backed by data on this machine
- needs_push - the `.dvc` file is not backed by data on the remote

TODO: does "staging" still exist?

Models on disk live in `$(geowatch_dvc --tags="phase2_expt")/models/fusion/${DATASET}/packages`.
Predictions on disk live in `$(geowatch_dvc --tags="phase2_expt")/models/fusion/${DATASET}/pred`.
Evals on disk live in `$(geowatch_dvc --tags="phase2_expt")/models/fusion/${DATASET}/eval`.

## using mlops to run an evaluation

`geowatch mlops "evaluate"` should work as an alias for this, but doesn't right now.

If you haven't used this TEST_DATASET yet, remember to unzip its splits.zip first.

```bash
DATASET_CODE=Aligned-Drop4-2022-08-08-TA1-S2-L8-ACC
DATA_DVC_DPATH=$(geowatch_dvc --tags="phase2_data")
DVC_EXPT_DPATH=$(geowatch_dvc --tags="phase2_expt")
TEST_DATASET=$DATA_DVC_DPATH/$DATASET_CODE/data.kwcoco.json
python -m geowatch.mlops.schedule_evaluation \
    --params="
        matrix:
            trk.pxl.model:
                - $DVC_EXPT_DPATH/models/fusion/Aligned-Drop4-2022-08-08-TA1-S2-L8-ACC/packages/Drop4_BAS_Retrain_V002/Drop4_BAS_Retrain_V002_epoch=31-step=16384.pt.pt
            trk.pxl.data.test_dataset:
                - $TEST_DATASET
            trk.pxl.data.window_scale_space: 15GSD
            trk.pxl.data.time_sampling:
                - "contiguous"
            trk.pxl.data.input_scale_space:
                - "15GSD"
            trk.poly.thresh:
                - 0.07
                - 0.1
                - 0.12
                - 0.175
            crop.src:
                - /home/joncrall/remote/toothbrush/data/dvc-repos/smart_data_dvc/online_v1/kwcoco_for_sc_fielded.json
            crop.regions:
                - trk.poly.output
            act.pxl.data.test_dataset:
                - /home/joncrall/remote/toothbrush/data/dvc-repos/smart_expt_dvc/models/fusion/Aligned-Drop4-2022-08-08-TA1-S2-L8-ACC/crop/online_v1_kwcoco_for_sc_fielded/trk_poly_id_0408400f/crop_f64d5b9a/crop_id_59ed6e1b/crop.kwcoco.json
            act.pxl.data.input_scale_space:
                - 3GSD
            act.pxl.data.time_steps:
                - 3
            act.pxl.data.chip_overlap:
                - 0.35
            act.poly.thresh:
                - 0.01
            act.poly.use_viterbi:
                - 0
            act.pxl.model:
                - $DVC_EXPT_DPATH/models/fusion/Aligned-Drop4-2022-08-08-TA1-S2-WV-PD-ACC/packages/Drop4_SC_RGB_scratch_V002/Drop4_SC_RGB_scratch_V002_epoch=99-step=50300-v1.pt.pt
        include:
            - act.pxl.data.chip_dims: 256,256
              act.pxl.data.window_scale_space: 3GSD
              act.pxl.data.input_scale_space: 3GSD
              act.pxl.data.output_scale_space: 3GSD
    " \
    --enable_pred_trk_pxl=1 \
    --enable_pred_trk_poly=1 \
    --enable_eval_trk_pxl=1 \
    --enable_eval_trk_poly=1 \
    --enable_crop=0 \
    --enable_pred_act_pxl=0 \
    --enable_pred_act_poly=0 \
    --enable_eval_act_pxl=0 \
    --enable_eval_act_poly=0 \
    --enable_viz_pred_trk_poly=1 \
    --enable_viz_pred_act_poly=0 \
    --enable_links=1 \
    --devices="0,1" --queue_size=2 \
    --queue_name='secondary-eval' \
    --backend=serial --skip_existing=0 \
    --run=0
```

The `--enable_*` flags for pred, eval, crop, and viz toggle the various stages of the GeoWATCH TA2 system; they are all enabled by default. Any params for a stage with `--enable_*=0` will be ignored, and params given multiple values will be grid-searched over (a small sketch follows this section).

The params are namespaced for convenience:

- trk.pxl - BAS predict and pixel evaluate
- trk.poly - BAS tracking and IARPA poly evaluate
- crop - create SC dataset
- act.pxl - SC predict and pixel evaluate
- act.poly - SC tracking and IARPA poly evaluate

First, dry-run with `--run=0`, and if it looks OK, set `--run=1`. `--backend=tmux` is highly recommended instead of serial; it makes the running queue much easier to debug (`--backend=slurm` is also available if your machine supports it). `--virtualenv_cmd="conda activate watch"`, or something to that effect, is also needed if your bashrc does not automatically drop you into the watch virtualenv.
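To make the grid-search behavior concrete, here is a minimal sketch (the model paths are hypothetical placeholders; `DVC_EXPT_DPATH` and `TEST_DATASET` are the variables defined above). Two models crossed with two `trk.poly.thresh` values should expand into four BAS tracking/evaluation pipelines, with each model's pixel prediction computed once and reused across both of its thresholds, since results are cached per stage.

```bash
# Minimal grid-search sketch (hypothetical model paths).
python -m geowatch.mlops.schedule_evaluation \
    --params="
        matrix:
            trk.pxl.model:
                - $DVC_EXPT_DPATH/models/fusion/.../model_a.pt
                - $DVC_EXPT_DPATH/models/fusion/.../model_b.pt
            trk.pxl.data.test_dataset:
                - $TEST_DATASET
            trk.poly.thresh:
                - 0.1
                - 0.12
    " \
    --enable_pred_trk_pxl=1 --enable_eval_trk_pxl=1 \
    --enable_pred_trk_poly=1 --enable_eval_trk_poly=1 \
    --backend=tmux --run=0  # dry-run first; set --run=1 when it looks right
```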
## debugging/examining a running mlops invocation

This assumes you're using `--backend=tmux`. Check on a running queue with:

```bash
tmux a
# then, after the tmux prefix (Ctrl-b by default):
#   s - switch between sessions
#   d - detach back to the main window
```

You'll see each session start with a command like `source /long/path/_cmd_queue_schedule/${QUEUE_NAME}_etc/etc/etc.sh`. The paths will also be printed to the main window for convenience, along with the commands to kill any leftover tmux sessions or drop into them. These sourced scripts define what is being run in each tmux session ("job"). In serial mode, they all run in the main window. When each job finishes, its result is cached and its dependencies start running.

### sidebar: help, my predict step is hanging!

This can happen when data workers are not cleaned up; you'll see "return dset.fpath = ..." but the predict job never finishes. Kill the python process, open the sourced script for that job, and manually run the steps under "# Signal this worker is complete". You may have to kill and rerun the whole mlops pipeline, but it will pick up where it left off.

## What does an mlops invocation run?

You can dig through the sourced scripts to piece together what a full GeoWATCH TA2 run looks like, minus the boilerplate and error handling. (There are plenty of examples in the mlops source code, but this is how to tell exactly what you were running.) For example, the pipeline above turns into:

```bash
# prediction (omitted)
# python -m geowatch.tasks.fusion.predict ...

# tracking
ROOT="/home/local/KHQ/matthew.bernstein/data/dvc-repos/smart_expt_dvc-ssd/models/fusion/Aligned-Drop4-2022-08-08-TA1-S2-L8-ACC/pred/trk/Drop4_BAS_Retrain_V002_epoch=31-step=16384.pt.pt/Aligned-Drop4-2022-08-08-TA1-S2-L8-ACC_data.kwcoco"
TRK_ROOT="${ROOT}/trk_pxl_b788335d"
TRK_EXPT="trk_poly_ca4372e1"
python -m geowatch.cli.run_tracker \
    "${TRK_ROOT}/pred.kwcoco.json" \
    --default_track_fn saliency_heatmaps \
    --track_kwargs '{"thresh": 0.12, "moving_window_size": null, "polygon_fn": "heatmaps_to_polys"}' \
    --clear_annots \
    --out_sites_dir "${TRK_ROOT}/${TRK_EXPT}/sites" \
    --out_site_summaries_dir "${TRK_ROOT}/${TRK_EXPT}/site-summaries" \
    --out_sites_fpath "${TRK_ROOT}/${TRK_EXPT}/site_tracks_manifest.json" \
    --out_site_summaries_fpath "${TRK_ROOT}/${TRK_EXPT}/site_summary_tracks_manifest.json" \
    --out_kwcoco "${TRK_ROOT}/${TRK_EXPT}/tracks.kwcoco.json"

# pxl eval
python -m geowatch.tasks.fusion.evaluate \
    --true_dataset=/home/local/KHQ/matthew.bernstein/data/dvc-repos/smart_data_dvc-ssd/Aligned-Drop4-2022-08-08-TA1-S2-L8-ACC/data.kwcoco.json \
    --pred_dataset="${TRK_ROOT}/pred.kwcoco.json" \
    --eval_dpath="${TRK_ROOT}" \
    --score_space=video \
    --draw_curves=True \
    --draw_heatmaps=True \
    --viz_thresh=0.2 \
    --workers=2

# iarpa poly eval
python -m geowatch.cli.run_metrics_framework \
    --merge=True \
    --name "Drop4_BAS_Retrain_V002_epoch=31-step=16384.pt.pt-trk_pxl_b788335d-${TRK_EXPT}" \
    --true_site_dpath "/home/local/KHQ/matthew.bernstein/data/dvc-repos/smart_data_dvc-ssd/annotations/site_models" \
    --true_region_dpath "/home/local/KHQ/matthew.bernstein/data/dvc-repos/smart_data_dvc-ssd/annotations/region_models" \
    --pred_sites "${TRK_ROOT}/${TRK_EXPT}/site_tracks_manifest.json" \
    --tmp_dir "${TRK_ROOT}/${TRK_EXPT}/_tmp" \
    --out_dir "${TRK_ROOT}/${TRK_EXPT}" \
    --merge_fpath "${TRK_ROOT}/${TRK_EXPT}/merged/summary2.json"
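# --- Added sketch, not part of the generated script: a quick way to skim the
# merged IARPA scores once the poly eval above has finished. Assumes only the
# python stdlib; the path is the --merge_fpath from the previous command.
python -m json.tool "${TRK_ROOT}/${TRK_EXPT}/merged/summary2.json" | head -n 40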
"${TRK_ROOT}/${TRK_EXPT}/tracks.kwcoco.json" \ --channels="red|green|blue,salient" \ --stack=only \ --workers=avail/2 \ --workers=avail/2 \ --extra_header="\ntrk_pxl_b788335d-${TRK_EXPT}" \ --animate=True ```