Keylink Software Solution Specialists
Direct Data Flows for Performance
Posted by: Steve | Mon 28 July 2008

Today I'd like to talk about the Direct Data Flows feature of DMExpress, which significantly improves the performance of ETL jobs by both reducing disk input/output and allowing tasks to run in parallel with one another.

To achieve acceptable performance, many products that compete with DMExpress require developers to manually partition their data so that multiple tasks can process separate chunks at the same time (i.e. in parallel), reducing the time taken to complete the job.
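To make the manual approach concrete, here is a minimal Python sketch of hand-partitioned processing. It is not taken from any particular product: the record layout and the names `partition`, `total_sales` and `run_partitioned` are hypothetical, and a thread pool is used only to keep the example short (a real CPU-bound job in Python would more likely use a process pool).

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_parts):
    """Split records into n_parts chunks, dealing them out round-robin."""
    return [records[i::n_parts] for i in range(n_parts)]


def total_sales(chunk):
    """Hypothetical per-chunk task: sum the amount field of each record."""
    return sum(amount for _state, amount in chunk)


def run_partitioned(records, n_parts=4):
    """Process each chunk in its own worker, then combine the partial sums."""
    chunks = partition(records, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(total_sales, chunks))
    return sum(partials)


# Made-up sample data: (state, amount) pairs.
records = [("NSW", 10), ("VIC", 20), ("QLD", 5), ("NSW", 15), ("WA", 50)]
grand_total = run_partitioned(records)
```

Even in this toy version the developer has to decide how many partitions to use, how to split the data, and how to recombine the partial results, which is the tuning effort described above.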

By contrast, DMExpress is written from the ground up with parallel processing in mind. This means DMExpress will intelligently run jobs and tasks in parallel without user intervention as part of its normal operation.

Note that DMExpress does have a "Partition Data" function, but it exists mainly to meet customer requirements (for example, splitting the data into separate files based on state, postcode, or product ID) rather than as a necessity for reaching required performance levels.

For this example, we'll take a DMExpress Job which contains 3 Aggregate Tasks and a Merge Task:

We can activate Direct Data Flows via the Edit menu of the DMExpress Job Editor:

When we turn Direct Data Flows on we see:

You'll notice that 4 of the intermediate output files for the Aggregate Tasks are now "grayed-out". This means that as each task runs, it no longer takes the time to write an intermediate text file to disk, which would then have to be read back from disk as input for the next task. Instead, the data is "streamed" in memory between the 3 Aggregate Tasks, which now execute in parallel. This provides a significant speed boost, because reading from and writing to disk is comparatively slow and should be avoided where possible.
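The difference between the two modes can be sketched in a few lines of Python. This is only an illustration of the idea, not how DMExpress is implemented: the function names, the (state, city, amount) record layout and the assumption that the input is already sorted are all made up for the example, and Python generators hand rows along lazily rather than running the stages truly in parallel.

```python
import csv
import os
import tempfile
from itertools import groupby


def sum_by(rows, key_len):
    """Sum the last field over runs of rows sharing their first key_len fields.

    Assumes rows arrive sorted by those fields, so each total can be
    yielded as soon as its group ends instead of after all the input.
    """
    for key, group in groupby(rows, key=lambda r: tuple(r[:key_len])):
        yield (*key, sum(int(r[-1]) for r in group))


def run_via_disk(rows):
    """Task 1 writes an intermediate file; task 2 reads it back from disk."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)
    try:
        with open(path, "w", newline="") as f:
            csv.writer(f).writerows(sum_by(rows, 2))
        with open(path, newline="") as f:
            return list(sum_by(csv.reader(f), 1))
    finally:
        os.remove(path)


def run_streamed(rows):
    """Task 1 hands rows straight to task 2 in memory; nothing touches disk."""
    return list(sum_by(sum_by(rows, 2), 1))


# Made-up input, sorted by state and then city.
rows = [
    ("NSW", "Newcastle", 3),
    ("NSW", "Sydney", 5),
    ("NSW", "Sydney", 2),
    ("VIC", "Melbourne", 4),
]
```

Both functions return the same per-state totals; the streamed version simply skips the write and re-read of the intermediate file.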

You will also notice that not all of the output files are using Direct Data Flows (i.e. not all are grayed-out). This is because the final Merge Task depends on all 3 of its input files being available before it can run. This illustrates how DMExpress intelligently determines which tasks are suitable for Direct Data Flows and parallel processing, without user intervention.
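The dependency described here, where the merge has to wait for every input, looks roughly like this in Python. Again this is a sketch under assumptions: `aggregate` and `run_job` are hypothetical names, and the thread pool stands in for whatever scheduling DMExpress does internally.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor, wait


def aggregate(records):
    """Sum values per key and return the totals sorted by key."""
    totals = {}
    for key, value in records:
        totals[key] = totals.get(key, 0) + value
    return sorted(totals.items())


def run_job(inputs):
    """Run one aggregate per input concurrently, then merge the results."""
    with ThreadPoolExecutor(max_workers=len(inputs)) as pool:
        futures = [pool.submit(aggregate, recs) for recs in inputs]
        wait(futures)  # the merge cannot start until every input is complete
        outputs = [f.result() for f in futures]
    # Each output is already sorted, so heapq.merge yields one sorted sequence.
    return list(heapq.merge(*outputs))


# Made-up inputs for three aggregate tasks.
inputs = [
    [("b", 1), ("a", 2), ("b", 3)],
    [("c", 1)],
    [("a", 5)],
]
merged = run_job(inputs)
```

The three aggregates are free to overlap, but the `wait` call marks the point where the job has to pause until all of them have finished.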

In summary, DMExpress saves development time by reducing the time needed for testing and tuning jobs (i.e. the developer doesn't need to spend time manually partitioning the data and then manually specifying which tasks are allowed to run in parallel). It also significantly cuts job processing time through features such as Direct Data Flows, which means your jobs can run up to twice as fast as with competing products.

A final point to note is that Direct Data Flows only take effect at the DMExpress Job level, so the intermediate files are still available for debugging purposes when a Task is run independently of the Job. It's also worth mentioning that some applications may require you to keep the intermediate files for restart and recovery purposes, in which case you would leave Direct Data Flows turned off.

 
