Introduction

The calculation of probabilistic outputs in operational flood forecasting often requires that models be run multiple times with varied boundary conditions or model structures. To reduce the computation time, the ensemble members can be run in parallel.

Grid computing techniques can be used to perform these operations. Condor is a specialized workload management system for computationally intensive jobs. It provides the necessary tools, such as job queuing, scheduling, priority management, resource monitoring and resource management, to enable multiple model runs to be made on multiple nodes connected to a Delft-FEWS forecasting shell machine (acting as head node). Serial or parallel jobs submitted to Condor are placed in a queue and run according to how Condor is configured.

With this type of grid computing there is normally an overhead: running 50 ensemble members in parallel does not make the run 50 times faster. The computational overhead depends on the amount of static and dynamic data that must be transferred between the forecasting shell and the nodes. It can be minimized by preloading the nodes with static data and by using compressed or otherwise efficient formats for the dynamic data.

Setting up Condor in the general adapter

The Condor computations are run as part of a normal General Adapter run - no modification to the workflow is needed.

In the general section of the generalAdapterRun the number of ensemble members should be specified. (Please note that currently the general adapter ensemble run can only be executed from one initial state; ensembles of historical states are not yet supported.)

<ensembleMemberCount>17</ensembleMemberCount>

The tag %ENSEMBLE_MEMBER_ID% can then be used to loop through the ensemble output directories.

<?xml version="1.0" encoding="UTF-8"?>
<!--Delft FEWS PO-->
<generalAdapterRun xmlns="http://www.wldelft.nl/fews" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.wldelft.nl/fews http://fews.wldelft.nl/schemas/version1.0/generalAdapterRun.xsd">
	<!-- General information for General Adapter run -->
	<general>
		<description>Sobek Model run for the Como Lake</description>
		<rootDir>%REGION_HOME%/Modules/SbkParallel/como</rootDir>
		<workDir>%ROOT_DIR%</workDir>
		<exportDir>%ROOT_DIR%/%ENSEMBLE_MEMBER_ID%/fews_in</exportDir>
		<exportIdMap>IdSobekExportForecast</exportIdMap>
		<importDir>%ROOT_DIR%/%ENSEMBLE_MEMBER_ID%/fews_out</importDir>
		<importIdMap>IdSobekImportForecast</importIdMap>
		<dumpFileDir>$GA_DUMPFILEDIR$</dumpFileDir>
		<dumpDir>%ROOT_DIR%/%ENSEMBLE_MEMBER_ID%</dumpDir>
		<diagnosticFile>%ROOT_DIR%/%ENSEMBLE_MEMBER_ID%/fews_out/diagnostic.xml</diagnosticFile>
		<convertDatum>true</convertDatum>
		<ensembleMemberCount>17</ensembleMemberCount>
	</general>
	<activities>
		<startUpActivities>
			<purgeActivity>
				<filter>%ROOT_DIR%/%ENSEMBLE_MEMBER_ID%/fews_in/*.*</filter>
			</purgeActivity>
			<purgeActivity>
				<filter>%ROOT_DIR%/%ENSEMBLE_MEMBER_ID%/fews_out/*.*</filter>
			</purgeActivity>
			<purgeActivity>
				<filter>%ROOT_DIR%/%ENSEMBLE_MEMBER_ID%/model/*.*</filter>
			</purgeActivity>
			<purgeActivity>
				<filter>%ROOT_DIR%/%ENSEMBLE_MEMBER_ID%/log/*.*.*</filter>
			</purgeActivity>
		</startUpActivities>
		<!-- Export activities -->
		<exportActivities>
			<!-- Export state (warm state)-->
			<exportStateActivity>
				<moduleInstanceId>Sobek_Po_Como_Historical</moduleInstanceId>
				<stateExportDir>%ROOT_DIR%/%ENSEMBLE_MEMBER_ID%/model</stateExportDir>
				<stateConfigFile>%ROOT_DIR%/%ENSEMBLE_MEMBER_ID%/fews_in/states.xml</stateConfigFile>
				<stateLocations type="file">
					<stateLocation>
						<readLocation>sobek.rda</readLocation>
						<writeLocation>sobek.nda</writeLocation>
					</stateLocation>
					<stateLocation>
						<readLocation>sobek.rdf</readLocation>
						<writeLocation>sobek.ndf</writeLocation>
					</stateLocation>
				</stateLocations>
				<stateSelection>
					<warmState>
						<stateSearchPeriod unit="hour" start="-48" end="0"/>
					</warmState>
				</stateSelection>
			</exportStateActivity>
			<!-- Export time series -->
			<exportTimeSeriesActivity>
				<description>Export inflows for Sobek Como Lake model</description>
				<exportFile>Input_Como_Pi.xml</exportFile>
				<timeSeriesSets>
					<timeSeriesSet>
						<moduleInstanceId>Sobek_MergeInput_Forecast_COSMO</moduleInstanceId>
						<valueType>scalar</valueType>
						<parameterId>Q.simulated.forecast</parameterId>
						<locationId>Adda23635</locationId>
						<timeSeriesType>simulated forecasting</timeSeriesType>
						<timeStep unit="hour" multiplier="1"/>
						<relativeViewPeriod unit="hour" endOverrulable="true" end="72"/>
						<readWriteMode>read only</readWriteMode>
						<ensembleId>CosmoLeps</ensembleId>
					</timeSeriesSet>
					<timeSeriesSet>
						<moduleInstanceId>Interpolation_Sobek_Forecast</moduleInstanceId>
						<valueType>scalar</valueType>
						<parameterId>H.forecast.external</parameterId>
						<locationId>I-203000</locationId>
						<timeSeriesType>external forecasting</timeSeriesType>
						<timeStep unit="hour" multiplier="1"/>
						<relativeViewPeriod unit="hour" endOverrulable="true" end="72"/>
						<readWriteMode>read only</readWriteMode>
					</timeSeriesSet>
				</timeSeriesSets>
			</exportTimeSeriesActivity>
		</exportActivities>
		<!-- Export activities -->
		<!-- Execute activities:Run SOBEK Adapter, Batch tool -->
		<executeActivities>
			<executeActivity>
				<command>
					<className>nl.wldelft.fews.adapter.sobek.PreSobekModelAdapter</className>
				</command>
				<arguments>
					<argument>%ROOT_DIR%/%ENSEMBLE_MEMBER_ID%</argument>
					<argument>../Config/sobekConfig.xml</argument>
				</arguments>
				<timeOut>500000</timeOut>
				<overrulingDiagnosticFile>%ROOT_DIR%/%ENSEMBLE_MEMBER_ID%/diagnostics/presobekmodeladapter.xml</overrulingDiagnosticFile>
			</executeActivity>
			<!-- Run Condor -->
			<executeActivity>
				<description>Condor Batch script by Frederik</description>
				<command>
					<executable>%ROOT_DIR%\run_condor_sobek.bat</executable>
				</command>
				<arguments>
					<argument>-o</argument>
					<argument>\\$CONDOR_REMOTE_DIR$\SbkParallel</argument>
					<argument>-n</argument>
					<argument>17</argument>
					<argument>-t</argument>
					<argument>18000000</argument>
					<argument>-d</argument>
					<argument>\\$CONDOR_REMOTE_DIR$\SbkParallel</argument>
				</arguments>
				<!-- timeout in milliseconds -->
				<timeOut>19000000</timeOut>
				<ignoreDiagnostics>true</ignoreDiagnostics>
			</executeActivity>
			<executeActivity>
				<command>
					<className>nl.wldelft.fews.adapter.sobek.PostSobekModelAdapter</className>
				</command>
				<arguments>
					<argument>%ROOT_DIR%/%ENSEMBLE_MEMBER_ID%</argument>
					<argument>../Config/sobekConfig.xml</argument>
				</arguments>
				<timeOut>500000</timeOut>
				<overrulingDiagnosticFile>%ROOT_DIR%/%ENSEMBLE_MEMBER_ID%/diagnostics/postsobekmodeladapter.xml</overrulingDiagnosticFile>
			</executeActivity>
		</executeActivities>
		<importActivities>
			<!-- Import Sobek results-->
			<importTimeSeriesActivity>
				<description>Import XML file</description>
				<importFile>reachseg.xml</importFile>
				<timeSeriesSets>
					<timeSeriesSet>
						<moduleInstanceId>Sobek_Po_Como_Forecast_COSMO_Parallel</moduleInstanceId>
						<valueType>scalar</valueType>
						<parameterId>Q.simulated.forecast</parameterId>
						<locationId>Serbatoio_Como235</locationId>
						<timeSeriesType>simulated forecasting</timeSeriesType>
						<timeStep unit="hour" multiplier="1"/>
						<readWriteMode>add originals</readWriteMode>
						<expiryTime unit="day" multiplier="2"/>
						<ensembleId>CosmoLeps</ensembleId>
					</timeSeriesSet>
				</timeSeriesSets>
			</importTimeSeriesActivity>
		</importActivities>
	</activities>
</generalAdapterRun>

The output directories must be created in the modules directory to receive the correct data for the ensemble run, i.e. in this case data is exported to the following directories (see the <exportDir> element above):

Modules\SbkParallel\Como\0\fews_in\
Modules\SbkParallel\Como\1\fews_in\
Modules\SbkParallel\Como\2\fews_in\
...
Modules\SbkParallel\Como\16\fews_in\
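The per-member directory tree can be created by hand or scripted once when setting up the module. A minimal Python sketch (the subdirectory names are taken from the exportDir, importDir and purgeActivity filters in the configuration above; the helper itself is illustrative and not part of FEWS):

```python
import os

def make_member_dirs(root_dir, member_count,
                     subdirs=("fews_in", "fews_out", "model", "log")):
    """Create the directory tree the General Adapter expects,
    one numbered subtree per ensemble member (member IDs are 0-based)."""
    for member_id in range(member_count):
        for sub in subdirs:
            os.makedirs(os.path.join(root_dir, str(member_id), sub),
                        exist_ok=True)

# Example: 17 ensemble members -> directories 0 .. 16
make_member_dirs("Modules/SbkParallel/como", 17)
```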

Computational nodes can either read and write the data across the network, or the data can be copied to the local directories of each of the available computational nodes. Whether reading and writing across the network is practical depends on the network speed, the size of the files being read and written, and the I/O capacity of the machine.

Condor determines which node will perform which task. Specific tags can be defined within the Condor computation node pool to indicate which machines have certain characteristics, e.g. Windows/Linux, model licence available, etc.
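Such tags are expressed as custom attributes in the machine ClassAd, which jobs then select with a requirements expression. A minimal sketch (the attribute name HAS_SOBEK_LICENSE is an illustrative assumption, not a standard Condor attribute):

```
# On the execute node (condor_config.local): advertise a custom attribute
HAS_SOBEK_LICENSE = True
STARTD_ATTRS = $(STARTD_ATTRS) HAS_SOBEK_LICENSE

# In the job's submit description file: only match Windows nodes
# that advertise the licence attribute
requirements = (OpSys == "WINDOWS") && (HAS_SOBEK_LICENSE =?= True)
```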

When the computations are complete, the data is sent back to the forecasting shell for re-import into FEWS (see the importActivities of the general adapter).

Execution of the Condor batch script

From the example above we see that the general adapter executes a batch script with a number of arguments:

			<!-- Run Condor -->
			<executeActivity>
				<description>Condor Batch script by Frederik</description>
				<command>
					<executable>%ROOT_DIR%\run_condor_sobek.bat</executable>
				</command>
				<arguments>
					<argument>-o</argument>
					<argument>\\$CONDOR_REMOTE_DIR$\SbkParallel</argument>
					<argument>-n</argument>
					<argument>17</argument>
					<argument>-t</argument>
					<argument>18000000</argument>
					<argument>-d</argument>
					<argument>\\$CONDOR_REMOTE_DIR$\SbkParallel</argument>
				</arguments>
				<!-- timeout in milliseconds -->
				<timeOut>19000000</timeOut>
				<ignoreDiagnostics>true</ignoreDiagnostics>
			</executeActivity>

The batch script in turn executes a shell script; examples of both can be found attached. Please note that these scripts are not generic and are provided as examples only. Questions about modifying them should be directed to the FEWS product managers.
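At its core, such a script submits one Condor job per ensemble member via condor_submit. A minimal sketch of a submit description file under the following assumptions (the executable name, file transfer settings and requirements are illustrative, not taken from the attached scripts):

```
# sobek_ensemble.sub -- illustrative submit description file
universe                = vanilla
executable              = run_sobek_member.bat
# $(Process) takes the values 0 .. 16, one job per ensemble member
arguments               = $(Process)
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
requirements            = (OpSys == "WINDOWS")
log                     = condor_sobek.log
queue 17
```

Submitting this file with `condor_submit sobek_ensemble.sub` queues 17 jobs, which Condor then schedules across the matching nodes in the pool.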

How to set up a Condor service

More information on how to set up a Condor service and create a computational pool can be found on the Condor project web pages.