7.14. Distributed Processing

Overview #

SICS Distributed Processing is a form of parallel processing that allows multiple instances of SICS to execute a function in parallel, with the aim of completing that function in a shorter time than if it were executed on only one instance. The instances may run on the same machine or on separate machines, regardless of whether those machines are physical or virtual.

The first step in designing a process for parallel execution is process decomposition - breaking a process down into separate steps that can be executed in parallel.

There are 2 primary means of process decomposition:

Domain Decomposition

This is where the data that the process operates on is divided into pieces of approximately equal size with an independent operation defined on each of them.

Functional Decomposition

This is where the process is broken down into independent functional units which may or may not operate on the same data.

SICS Distributed Processing is primarily concerned with the former method, although it does not rule out the latter.
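
As a minimal sketch of domain decomposition (this is not SICS code), the following Python splits a list of items into pieces of approximately equal size, each of which could then be processed independently. The function name `split_into_pieces` and the sample business identifiers are hypothetical and exist only for illustration.

```python
def split_into_pieces(items, piece_count):
    """Split items into piece_count contiguous pieces whose sizes differ by at most one."""
    if piece_count < 1:
        raise ValueError("piece_count must be at least 1")
    base, extra = divmod(len(items), piece_count)
    pieces = []
    start = 0
    for index in range(piece_count):
        # The first `extra` pieces take one additional item each.
        size = base + (1 if index < extra else 0)
        pieces.append(items[start:start + size])
        start += size
    return pieces


# Hypothetical business identifiers, used only for illustration.
businesses = [f"B{n:03d}" for n in range(1, 11)]
for piece in split_into_pieces(businesses, 3):
    print(len(piece), piece)
```

Each piece is independent of the others, which is the property that lets separate engines work on them at the same time.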

Setting Up Engines #

A SICS Distributed Processing engine (or node) is a Batch Scheduler with a specific configuration for processing distributed tasks.

Throughout the user interface, Node is used when referring to a processing engine.

You can start as many engines as you need. Note that 2 engines will perform approximately twice as fast as 1 engine. However, 4 engines will not be twice as fast as 2 engines, because task distribution carries an overhead. The optimum number of engines depends on your environment and process type, and requires some testing to determine.

The Distributed Processing engines are, like SICS in general, single threaded. This means that running one engine on a machine with more than one CPU/Core will not provide any processing throughput advantage over running it on a machine with only one CPU/Core.

As a general rule of thumb DXC advocates running no more than N-1 engines on a single machine, where N is the number of CPU Cores available to the operating system.
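
The N-1 rule of thumb can be sketched in a few lines of Python. The function name `recommended_engine_count` is hypothetical, and the floor of one engine on a single-core machine is an assumption of this sketch, not something stated by the rule itself.

```python
import os


def recommended_engine_count(core_count=None):
    """Apply the N-1 rule of thumb, never returning fewer than one engine."""
    if core_count is None:
        # os.cpu_count() may return None when the count cannot be determined.
        core_count = os.cpu_count() or 1
    # Assumption of this sketch: always allow at least one engine.
    return max(1, core_count - 1)


print(recommended_engine_count())
```

Treat the result as a starting point for testing rather than a fixed limit.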

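You may start more than one engine on the same machine. This will improve throughput, but the node statistics are reported per machine rather than per engine, which limits how closely you can review the execution statistics (see the Name field on the Node Stats tab).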

Configuring Distributed Engines #

In the System Admin desktop under the Periodic Functions folder, open the Scheduler folder and open the Scheduler Setup function.

Create a new Scheduler. You can use any name or description, but it is advisable to choose a name that makes it obvious the scheduler is used for distributed processing.

On the Configuration tab of the window, create a new Configuration. Select the newly created scheduler from the dropdown and select a Job Step Type of Distributed Processor. Any Priority may be used.

You cannot mix the Distributed Processor job step configuration with any other configuration on the same scheduler. This ensures that a scheduler is dedicated to processing distributed tasks and so will not accidentally hold up other jobs.

Starting a Distributed Engine #

To start a distributed engine, you put SICS into scheduler mode and select a named scheduler configured for distributed processing as described above.

You can do this from the Scheduled Jobs function in the System Admin desktop by selecting the Process Scheduled Jobs menu item from the menu bar. You then select the Distributed Processor scheduler from the drop down.

You can also start an engine from the command line by providing the scheduler name in the start-up arguments. For example, “-bn:Distributed Engine”, where ‘Distributed Engine’ is the name of the scheduler created above.
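
When several engines have to be started on one machine, the command line can be scripted. The sketch below builds the argument list in Python; only the `-bn:` argument comes from this section. The executable path `C:\SICS\sics.exe` is a placeholder, the helper names are hypothetical, and any other start-up arguments your installation requires are omitted.

```python
import subprocess

# Placeholder path: the SICS executable name and location are not given in this section.
SICS_EXECUTABLE = r"C:\SICS\sics.exe"


def engine_command(scheduler_name, executable=SICS_EXECUTABLE):
    """Build the argument list for one engine; -bn:<name> selects the scheduler."""
    # Passing a list keeps a scheduler name containing spaces as one argument.
    return [executable, f"-bn:{scheduler_name}"]


def start_engines(scheduler_name, count):
    """Launch `count` engine processes and return their Popen handles."""
    return [subprocess.Popen(engine_command(scheduler_name)) for _ in range(count)]


# Only prints the command; call start_engines(...) to actually launch engines.
print(engine_command("Distributed Engine"))
```

Building the command as a list rather than a single string avoids quoting problems when the scheduler name contains a space, as it does in the example above.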

The Batch Scheduler server may also be set up with distributed schedulers.

Master Switch #

In System Parameters there is a master switch that will enable or disable distributed processing. The switch is called “Allow Distributed Processing” on the Database tab in System Parameters.

If you try to start a new engine with the switch unticked, the engine will log in but will not process any jobs.

If you untick the switch while an engine is running, it will finish processing its current task and then terminate.

Creating Processes #

You create a process from the Distributed Processes function in the System Administration desktop, located in the Periodic Functions folder.

This opens a find window, where there is a “Submit New” button.

The Submit Distributed Process window has a dropdown to select the kind of process to create. The text description tells you what kind of process it will create and how many tasks it will contain.

Once you select a Process from the drop down a description of the process will be shown in the text field. Once you are happy with the process, you can click the Create button to create the process.

This creates the process in a pending status, which means that it is available for processing by any active engines, and gives the following message:

Field description 4. Submit Distributed Processes

Field	Description
Process	Selects which type of distributed process to create. Values: (i) “Business Reporting Table Builder”, (ii) “Business Reporting Table Part Builder”. Applicable for: P&C, Cede, Life.
Selected businesses	Values: a string interpreted as a Java regular expression, used to pick businesses by identifier. Validation: must be a valid Java regular expression. Mandatory: yes, when “Business Reporting Table Part Builder” is chosen. Applicable for: P&C, Cede, Life.

Available Processes: #

Example Process

The example process is a simple process for testing the configuration, to confirm that the engines are set up correctly and that everything is functioning as expected.

Business Reporting Table Builder

The BRT builder will create a process to rebuild the business reporting table. The process will contain a task for every business in the database plus 2 control tasks.

The BRT builder may run while the system is in use. While the process is executing, only the data for the tasks that are active will be read and processed. There will be one task active for each running engine.

Business Reporting Table Part Builder

The BRT part builder is a refinement of the full BRT builder process that allows the submitter to filter the selected businesses. This allows you to rebuild the reporting table for a subset of the businesses in the database. This could be as granular as one specific business ID up to the entire set.

The Selected businesses field is a regular expression entry field. It specifies the criteria that will be used to filter the set of businesses. The Evaluate button lets you check your regular expression against the database to ensure that the filter will select the intended businesses.
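
The following sketch illustrates how a regular expression can narrow a set of business identifiers. It uses Python's `re` module, which is not the Java regular expression engine that the field uses, although simple patterns such as these behave the same way in both. It also assumes that the pattern must match the whole identifier; this section does not say whether SICS matches the whole identifier or only part of it, so use the Evaluate button to confirm the actual selection. The identifiers and the function name `select_businesses` are hypothetical.

```python
import re


def select_businesses(pattern, business_ids):
    """Return the identifiers that the whole pattern matches."""
    compiled = re.compile(pattern)  # raises re.error if the pattern is invalid
    return [b for b in business_ids if compiled.fullmatch(b)]


# Hypothetical identifiers, for illustration only.
ids = ["B1001", "B1002", "B2001", "C1001"]
print(select_businesses(r"B1\d{3}", ids))  # a range of identifiers
print(select_businesses(r"B2001", ids))    # one specific business
print(select_businesses(r".*", ids))       # the entire set
```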

Cancelling a Process #

From the find window, right-click a process in the result list and use the context menu to cancel it. You can cancel a process while it is pending or active.

Cancelling an active process will prevent any new tasks from starting. It will not cancel any currently executing tasks.

Monitoring Processes #

From the Find Distributed Process window, double click on a process in the result list to open the View Distributed Process window.

Task Summary Tab #

The Task Summary tab gives you a status for the process and some statistics on its progress and performance.

The task status table gives a numerical breakdown of the tasks by status, with a completion progress bar underneath. By default this list does not auto-refresh. Clicking the Refresh button refreshes the list manually. Ticking the “Auto” tickbox sets the list to refresh automatically at the interval, in seconds, selected in the accompanying dropdown.

The TPM field shows the Tasks Per Minute execution rate for the previous few minutes of execution. This rate is used to estimate how many minutes remain for the process to complete, which is shown in the Est. Remaining field.
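
The relationship between the two fields can be sketched as a simple division. This assumes that the estimate is the number of tasks not yet completed divided by the recent TPM; the exact calculation SICS uses is not documented in this section, and the function name is hypothetical.

```python
def estimated_minutes_remaining(tasks_remaining, tasks_per_minute):
    """Estimate remaining minutes from the recent execution rate."""
    if tasks_per_minute <= 0:
        return None  # no recent throughput, so no estimate is possible
    return tasks_remaining / tasks_per_minute


# 1200 tasks left at 40 tasks per minute.
print(estimated_minutes_remaining(tasks_remaining=1200, tasks_per_minute=40))  # 30.0
```

Because the rate covers only the previous few minutes, the estimate will move as engines are started or stopped.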

Execution Breakdown Tab #

The execution breakdown tab allows you to query the execution statistics for completed tasks to see a breakdown of performance.

The Number of Intervals field defines how many slices of time the complete execution duration will be broken down into for analysis. For example, if a process took 2 hours to complete, entering 12 would let you see breakdown statistics in 10-minute intervals.

The statistics table shows the start time, end time and duration of each interval, along with how many tasks were completed in that interval and how many nodes were used. From this data a Tasks Per Minute value is calculated and shown in the TPM column.
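
A breakdown of this kind can be sketched as follows. The code divides the execution window into equal intervals, counts the tasks completed in each, and derives a TPM figure as tasks divided by interval length in minutes. The function name and the sample timestamps are hypothetical, the node count column is left out, and the way SICS itself derives these figures may differ.

```python
from datetime import datetime, timedelta


def interval_breakdown(start, end, completion_times, interval_count):
    """Return (interval_start, interval_end, tasks_completed, tpm) per interval."""
    if interval_count < 1 or end <= start:
        raise ValueError("need at least one interval and end later than start")
    width = (end - start) / interval_count
    counts = [0] * interval_count
    for finished in completion_times:
        # Clamp so that a task finishing exactly at `end` lands in the last interval.
        index = int((finished - start) / width)
        counts[max(0, min(index, interval_count - 1))] += 1
    minutes = width.total_seconds() / 60
    return [
        (start + i * width, start + (i + 1) * width, count, count / minutes)
        for i, count in enumerate(counts)
    ]


# A 2-hour run split into 12 intervals gives 10-minute slices.
start = datetime(2024, 1, 1, 8, 0)
end = start + timedelta(hours=2)
finished = [start + timedelta(minutes=m) for m in (2, 5, 9, 15, 118)]
for row in interval_breakdown(start, end, finished, 12)[:2]:
    print(row[0].time(), row[1].time(), row[2], row[3])
```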

Node Stats Tab #

The Node Stats tab provides information about each of the nodes that were used to execute tasks for this process.

Name: This is the name of the computer that was executing the task. Note that if multiple engines are started on the same computer to take advantage of multiple CPUs, there is no distinction made here between engines. The statistics are for the entire machine.

Avg TPM: The average tasks per minute that the node was able to execute.

~Mhz: This is the CPU speed of the machine. Note that this is the stated maximum speed; any CPU throttling or performance boosting is not reflected in the number shown here.

Memory: The amount of RAM available to the OS.

Started Processing: The timestamp when the node first started processing a task.

Ended Processing: The timestamp when the last task was completed.

Active (mins): The number of minutes that the node was active for this process. This is the duration between the Started Processing and Ended Processing timestamps and represents the total amount of time that the node spent on this process.

Executing (mins): The total number of execution minutes for all the tasks that the node processed.

Idle (mins): The number of minutes that the node spent not executing tasks. This is the active minutes minus the executing minutes. If this number is more than a few percent of the executing minutes, it may indicate a performance issue that needs investigation.

# Tasks: The number of tasks that the node executed for this process.

Avg Assigns: When a node tries to allocate a task to itself for processing, it may find itself competing with another node for the task. In this case the first node is allocated the task and the second node has to pick a new task. This column shows the average number of assignment attempts made in order to allocate tasks to the node. This should be one or close to one. Anything significantly higher than one may indicate a performance issue that needs investigation.
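
The relationships between the timing columns above can be sketched in Python. The function name, the dictionary keys and the sample figures are hypothetical; the arithmetic follows the field descriptions (Idle is Active minus Executing, and Avg Assigns is the mean number of assignment attempts per task).

```python
from datetime import datetime


def node_timing(started, ended, task_minutes, assign_attempts):
    """Derive the Active, Executing, Idle and Avg Assigns figures for one node."""
    active = (ended - started).total_seconds() / 60
    executing = sum(task_minutes)
    idle = active - executing
    return {
        "active_mins": active,
        "executing_mins": executing,
        "idle_mins": idle,
        # Idle expressed as a percentage of executing time, for easy comparison.
        "idle_pct_of_executing": 100 * idle / executing if executing else None,
        "tasks": len(task_minutes),
        "avg_assigns": (
            sum(assign_attempts) / len(assign_attempts) if assign_attempts else None
        ),
    }


# Hypothetical node: active for one hour, four tasks, one contested assignment.
stats = node_timing(
    started=datetime(2024, 1, 1, 9, 0),
    ended=datetime(2024, 1, 1, 10, 0),
    task_minutes=[14.0, 15.0, 14.5, 15.0],
    assign_attempts=[1, 1, 2, 1],
)
print(stats)
```

In this example the node is idle for 1.5 of its 60 active minutes and averages 1.25 assignment attempts per task.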