A Complete Process of Adding & Scaling a New Application ========================================================= This is another demo where a new container with a new R package is added to be scaled appropriately using scalable. The first step would be to add a new container with the new R package. This should be done by adding a new target to the Dockerfile which is located in the home directory where scalable_bootstrap is launched. In this case, the `HELPS `_ package is added as a target to make a new container. Since the HELPS package is a R package, and rpy2 is the primary method of interacting with R applications when using Scalable, it would be good practice to try adding HELPS through rpy2. This is the new target: .. code-block:: dockerfile FROM ubuntu:22.04 AS build_env ENV DEBIAN_FRONTEND=noninteractive RUN apt-get -y update && apt-get install -y \ wget unzip openjdk-11-jdk-headless build-essential libtbb-dev \ libboost-dev libboost-filesystem-dev libboost-system-dev \ python3 python3-pip libboost-python-dev libboost-numpy-dev \ ssh nano locate curl net-tools netcat-traditional git python3 python3-pip \ python3-dev gcc libboost-python1.74 libboost-numpy1.74 openjdk-11-jre-headless libtbb12 rsync RUN apt-get -y update && apt -y upgrade RUN pip3 install --upgrade dask[complete] dask-jobqueue dask_mpi pyyaml joblib versioneer tomli xarray RUN apt-get -y update && apt -y upgrade RUN mkdir -p /usr/lib/R/site-library ENV R_LIBS_USER /usr/lib/R/site-library RUN chmod a+w /usr/lib/R/site-library RUN apt-get install -y --no-install-recommends r-base r-base-dev RUN apt-get install -y --no-install-recommends python3-rpy2 RUN python3 -m pip install -U pip FROM build_env AS helps RUN apt-get -y update && apt -y upgrade && apt-get -y -f install RUN apt install --fix-missing -y software-properties-common RUN add-apt-repository ppa:ubuntugis/ppa RUN apt-get update ENV R_LIBS_USER /usr/lib/R/site-library RUN chmod a+w /usr/lib/R/site-library RUN apt-get install -y libcurl4-openssl-dev libssl-dev libxml2-dev libudunits2-dev libproj-dev libavfilter-dev \ libbz2-dev liblzma-dev libfontconfig1-dev libharfbuzz-dev libfribidi-dev libgeos-dev libgdal-dev RUN git clone https://github.com/JGCRI/HELPS.git /HELPS RUN echo "import rpy2.robjects.packages as rpackages" >> /install_script.py \ && echo "utils = rpackages.importr('utils')" >> /install_script.py \ && echo "utils.install_packages('devtools')" >> /install_script.py \ && echo "utils.install_packages('arrow')" >> /install_script.py \ && echo "utils.install_packages('assertthat')" >> /install_script.py RUN python3 /install_script.py RUN echo "import rpy2.robjects.packages as rpackages" >> /install_script.py \ && echo "devtools = rpackages.importr('devtools')" >> /install_script.py \ && echo "devtools.install_github('JGCRI/HELPS', dependencies=True)" >> /install_script.py RUN python3 /install_script.py RUN echo "import rpy2.robjects.packages as rpackages" > /install_script.py \ && echo "helps = rpackages.importr('HELPS')" >> /install_script.py RUN python3 /install_script.py COPY --from=scalable /scalable /scalable RUN pip3 install /scalable/. RUN pip3 install --force-reinstall xarray==2024.5.0 RUN pip3 install --force-reinstall numpy==1.26.4 The build_env target is included in the base Dockerfile which is automatically downloaded. It is recommended to use build_env as the base target for all new containers. The helps target is manually added. Note that rpy2 isn't needed to download or install the helps package, it's done here for the sake of consistency. To learn more about how to convert R code to Python code using rpy2, this guide :doc:`rpy2` can be referred to. Now, once the the target is added, it can be selected by running the scalable_bootstrap and selecting the new target. It will be built locally and uploaded automatically to the remote system. The next step would be to make functions to actually run the helps package. As mentioned in the guide linked above though, using rpy2 to run R code may cause some issues and the best way to avoid it is to import everything needed in a preload script. In this case, the following preload script should work: .. code-block:: python def dask_setup(worker): import rpy2.robjects.packages as rpackages import rpy2.robjects as r from rpy2.robjects import StrVector helps = rpackages.importr('HELPS') worker.imports = {} worker.imports['HELPS'] = helps worker.imports['country_raster'] = r.r['country_raster'] worker.imports['reg_WB_raster'] = r.r['reg_WB_raster'] After passing the path to the preload script above to the container, the worker object can be imported in the target functions to then use the needed R objects. Please note that all the R objects that are needed should be imported in the preload script beforehand. The following code is an example of how the HELPS package can be used with scalable, and includes the target functions which will be ran on the remote system: .. code-block:: python from scalable import * def run_heatstress(hurs_file_name, tas_file_name, rsds_file_name, target_year): worker = get_worker() helps = worker.imports['HELPS'] # generate a heat stress raster brick for the desired resolution heat_stress_raster = helps.cal_heat_stress( TempRes = "month", SECTOR = "SUNF_R", HS = helps.WBGT_ESI, YEAR_INPUT = target_year, a=hurs_file_name, b=tas_file_name, c=rsds_file_name ) return heat_stress_raster def run_physical_work_capacity(heat_stress_raster): worker = get_worker() helps = worker.imports['HELPS'] # generate physical work capacity raster brick physical_work_capacity_raster = helps.cal_pwc( WBGT = heat_stress_raster, LHR = helps.LHR_Hothaps, workload = "high" ) return physical_work_capacity_raster def run_annualized_physical_work_capacity(physical_work_capacity_raster): worker = get_worker() helps = worker.imports['HELPS'] # aggregate physical work capacity to annual values and reformat to a data frame annualized_physical_work_capacity_df = helps.monthly_to_annual( input_rack = physical_work_capacity_raster, SECTOR = "SUNF_R" ) return annualized_physical_work_capacity_df def run_country_physical_work_capacity(annualized_physical_work_capacity_df): worker = get_worker() helps = worker.imports['HELPS'] country_raster = worker.imports['country_raster'] # map annual physical work capacity to gridded countries country_physical_work_capacity_df = helps.grid_to_region( grid_annual_value = annualized_physical_work_capacity_df, SECTOR = "SUNF_R", rast_boundary = country_raster ) return country_physical_work_capacity_df def run_basin_physical_work_capacity(annualized_physical_work_capacity_df): worker = get_worker() helps = worker.imports['HELPS'] reg_WB_raster = worker.imports['reg_WB_raster'] # map annual physical work capacity to gridded water basins basin_physical_work_capacity_df = helps.grid_to_region( grid_annual_value = annualized_physical_work_capacity_df, SECTOR = "SUNF_R", rast_boundary = reg_WB_raster ) return basin_physical_work_capacity_df if __name__ == "__main__": hurs_file_name = "path/to/hurs_file" tas_file_name = "path/to/tas_file" rsds_file_name = "path/to/rsds_file" # all the functions need to be ran for all the target years. # in this case, target years would be 2015 - 2100 in 5 year increments target_years = list(range(2015, 2105, 5)) num_years = len(target_years) ## Creating a SlurmCluster object with the required parameters cluster = SlurmCluster(queue='short', walltime='02:00:00', account='GCIMS', silence_logs=False) ## Adding the helps container specifications. R is almost always single ## threaded so the number of cpus is set to 1. The memory is set to 8G. cluster.add_container(tag="helps", cpus=1, memory="8G", preload_script='/path/to/preload_script.py', dirs={"/path1":"/path1", "/path2":"/path2"}) ## Adding workers to the cluster cluster.add_workers(n=num_years, tag="helps") # Making a client to submit jobs sc_client = ScalableClient(cluster) # run helps # note that the map function is used here as multiple instances of the # same function is being ran with different inputs. This is essentially # parallelization. To make it possible, pass in multiple lists of # arguments to the map function. If the target function has 2 arguments # then 2 lists of the same size should be passed. The size of the list # will be the same as the number of instances of the target function # that will be ran. # n = 1 is used as the value for n because n specifies the number of # workers needed for a single instance of the target function. heatstress_futures = sc_client.map(run_heatstress, [hurs_file_name]*num_years, [tas_file_name]*num_years, [rsds_file_name]*num_years, target_years, n=1, tag="helps") pwc_futures = sc_client.map(run_physical_work_capacity, heatstress_futures, n=1, tag="helps") annualized_pwc_futures = sc_client.map(run_annualized_physical_work_capacity, pwc_futures, n=1, tag="helps") country_pwc_futures = sc_client.map(run_country_physical_work_capacity, annualized_pwc_futures, n=1, tag="helps") basin_pwc_futures = sc_client.map(run_basin_physical_work_capacity, annualized_pwc_futures, n=1, tag="helps") # now the results can be gathered and then printed or written to a file heatstress_results = sc_client.gather(heatstress_futures) pwc_results = sc_client.gather(pwc_futures) annualized_pwc_results = sc_client.gather(annualized_pwc_futures) country_pwc_results = sc_client.gather(country_pwc_futures) basin_pwc_results = sc_client.gather(basin_pwc_futures) This code will run the HELPS package on the remote HPC system. The entire guide highlights the process of adding a new container with a new R package and scaling it using Scalable. Please feel free to reach out for any more help regarding the same or open an issue on the Scalable GitHub repository.