What is ETL?

Extract. Transform. Load.

ETL Defined

Extract, Transform and Load  or ETL is a standard information management term used to describe a process for the movement and transformation of data. ETL is commonly used to populate data warehouses and datamarts, and for data migration, data integration and business intelligence initiatives.

ETL processes can be built by manually writing custom scripts or code, with the trade-off that as the complexity of the ETL operations increase, scripts become harder to maintain and update.

Alternatively, purpose built ETL software may offer a graphical user interface to build and run ETL processes, which typically reduces development costs and improves maintainability.

The Three Stages of ETL

Extract data from one or more source systems containing customer, financial, or product data.

Modern organizations have data stored in many disparate systems such as: customer relationship management (CRM), sales, accounting, and stock tracking to name just a few. Each system will typically store data in different mutually incompatible formats. To obtain business value from all this data means the ETL software you choose should have the ability to extract data from many different sources.

Connectors: Flat files, XML, Oracle, IBM DB2, SQL Server, Teradata, Sybase, Vertica, Netezza, Greenplum, IBM Websphere MQ, ODBC, JDBC, Hadoop Distributed File System (HDFS), Hive/HCatalog, JSON, Mainframe (IBM z/OS), Salesforce.com, SAP/R3

Transform the data by applying business rules, cleansing, and validating the data.

To combine and report on the data extracted in Stage 1, for example comparing orders from the order entry system with stock levels in the warehouse management system, may require multiple steps and many different operations. To meet the current and future requirements of the business, your ETL software should be able to perform all of the following types of operations:

  • Transforms: Aggregation, Copy, Join, Sort, Merge, Partition, Filter, Reformat, Lookup
  • Mathematical: +, -, x, /, Abs, IsValidNumber, Mod, Pow, Rand, Round, Sqrt, ToNumber, Truncate, Average, Min, Max
  • Logical: And, Or, Not, IfThenElse, RegEx, Variables
  • Text: Concatenate, CharacterLengthOf, LengthOf, Pad, Replace, ToLower, ToText, ToUpper, Translate, Trim, Hash
  • Date: DateAdd, DateDiff, DateLastDay, DatePart, IsValidDate
  • Format: ASCII, EBCDIC, Unicode

Load the results into one or more target systems such as a data warehouse, datamart, or business intelligence reporting system.

The data was extracted and transformed in the first two stages, now the resulting dataset must be loaded into a target system. This means your ETL software should be capable of connecting and loading data into a variety of targets.

Connectors: Flat files, XML, Oracle, IBM DB2, SQL Server, Teradata, Sybase, Vertica, Netezza, Greenplum, ODBC, JDBC, Hadoop Distributed File System (HDFS), Hive/HCatalog, Mainframe (IBM z/OS), Salesforce.com, Tableau, QlikView

The ETL Value Equation

A complete end-to-end ETL process may take a few seconds or many hours to complete depending on the amount of data and the capabilities of the hardware and software.

The cost-time-value equation for ETL is defined by three characteristics:

  • Volume – How much data needs to be processed? It may be hundreds of megabytes, terabytes, or even petabytes of data
  • Velocity – How fast and how frequently does the data need to be processed? Is there a service level agreement (SLA) that needs to be met?
  • Variety – What is the layout or format of the source data? How many different data sources need to be processed and combined? What is the format of the final output?

The selection of appropriate software, hardware and developer resources to meet these criteria will directly affect the overall cost and timeline of your ETL project.

Learn the 5 Must Have ETL Features

Get the Ultimate Checklist for High-Performance ETL

Whether you're dealing with petabytes of data, or just a few gigabytes, the sad truth is that most of it's going to waste if you don't have the right tools. High costs, endless tuning and performance limitations all stand in the way of realising real business insights.

But it doesn't have to be that way – whether or not you already have a set of ETL and Data Integration tools, this checklist will help you assess high-performance data integration solutions that stand up to today’s challenges.

Download this guide to learn how to evaluate solutions based on:

  • Developer productivity
  • Dynamic optimisation
  • Scalable architecture
  • Pervasive connectivity
  • High-speed compression

Thank you! Your download will begin now.

Oops! Something went wrong while submitting your details

You may also be interested in...

Need Help?

Struggling with data integration, ETL, Hadoop, dashboards? We can help.

Our ProductsContact us