Typical workloads running on a data warehouse system
Before we discuss what SQL workloads make sense to execute on hadoop system, let’s review and understand the typical workloads executing on data warehouses, data marts and operational data stores.
Reporting and Analysis Solutions: Many organizations started with their data warehousing and data mart system implementations by running batch-reporting types of workloads. As a second stage, they provided some limited ability to enable users to change report parameters and re-execute those reports semi-interactively. A major advance occurred when users were able to use a metadata layer to create their own reports to execute as needed. In this scenario, the IT organization typically defines some boundaries like the maximum number of rows allowed on a query result set but allows power users could to leverage BI and Reporting tools and to build their own reports and interact with data dynamically. BI and Reporting tools delivered significant improvements, empowering end-users to interact freely with data and build their own applications without depending on the IT organization. However, this freedom also made optimizing and guaranteeing SLAs on a system more challenging. DBAs and system administrators had to deal with this challenge on a daily basis to satisfy end-users.
Ad hoc and predictive analytics: To build predictive models, the data scientist or data mining analyst typically performs many complex data management tasks. In general, these processes consist of three major tasks: (1) data preparation, (2) model building and (3) scoring. During data preparation steps, there are many data-intensive analyses including statistical analysis, data sampling, creation of new data elements, aggregations, calculations, joins and others. To gain efficiency during the preparation process, many organizations have enabled their data scientists and analysts to use their data mining tools to connect directly to data warehouses or data marts and to push down some of those heavy activities to the databases via SQL pass-through commands. This type of workload certainly has a huge impact on performance of data warehouse systems as it typically deals with large tables and transactional data.
In-database data mining technology (aka in-database analytics): As analytics began to be adopted within the organization, data volumes increased and the typical data mining tools struggled to perform and scale using their traditional server architecture. Other factors such as data governance, data privacy, and data redundancy also contributed to forcing data mining providers to enable their tools to leverage the relational databases to perform deep analytics such as modeling, scoring, statistical analysis and other related activities. Data mining processes usually need to deal with large data sets, much longer data history, and preferably data closer to its natural form (raw transactional data). The ability to deploy data mining (analytics) on a data warehouse – bringing analytics to the data rather than the data to the analytics – was certainly an appealing solution that enabled many organizations to improve their productivity significantly. Data mining tools such as SAS, IBM SPSS and others had significant investments in integrating with relational databases to improve the performance and scalability of their solutions. In-database data mining provided a huge leap forward for many organizations, but it also contributed significantly to increasing workload complexity on the data warehousing and data mart systems, creating yet more work for DBAs and system administrators dealing with those very expensive queries in their systems.
Push down data transformations in relational databases: (aka running ETL on data warehousing systems, (aka ELT- Extract, Load and Transform). Similar to the in-database data mining, many organizations started to move their data preparation, cleansing and transformation workloads into data warehousing and data mart systems. Motivations for this approach include:
- Maximizing the use of the data warehousing and data mart infrastructure: Typically, these systems would only be fully utilized during the day to support end-user workloads, and the massive hardware infrastructure would be almost idle during off-user query time.
- Improving performance of data cleansing and transformations: Some ETL vendors had limited scalability with their ETL engines, so leveraging the MPP architecture and scalability of relational database systems was an excellent alternative to dealing with their limitations. Some other tools only offered SQL-based transformations, so they were completely dependent on the relational database engine to get the job done.
- Centralizing data management: Bringing ETL close to the data and avoiding data movement is appealing to any organization as it helps to reduce some data governance issues as well as overall performance.
There are other reasons why organizations moved ETL workloads into their data warehousing systems, but for the purpose of this blog, the most important considerations are the side effects of performing transformations inside vs. outside of the database. If customers were able to confine their heavy ETL workloads to nighttime (off-user access time), this approach would certainly have avoided some headaches for DBAs and system administrators. But, as business requirements changed due to reasons such as globalization (users querying the data warehousing systems around the clock) or because data ingestion was needed in near-real-time to support business needs, the ETL workloads started to be mixed with user query workloads.
Other types of workloads with completely different characteristics, such as operational analytics typically consisting of large number of small transactions (inserts, updates and queries with short durations), have also begun infiltrating data warehouse systems, further increasing complexity. Many organizations have segregated those workloads on different systems (ODS – Operational Data Store) to minimize the complexity for workload management and high availability reasons. The combination of these workloads creates a challenging task for administrators trying to optimize a data warehouse system.
Despite the advances in workload management tools, which have allowed administrators to segregate resources by grouping common applications to dynamically allocate resources and query priority, in many organizations the DBAs and system administrators struggle to meet the needs of their end-users. Data warehousing systems typically have finite and limited resources, and keeping the end-user community satisfied with their data warehousing and analytical systems is very difficult.
Optimizing a system with limited resources and complex workload is not an easy task, requiring a mix of science and art. The science consists of leveraging best practices and tools to use the right knobs to tune the system. The art consists of perception, feelings and guesses as administrators seek to predict how users will use the system on a daily basis. Query workloads may change at any given moment based on business dynamics. Database vendors provide new tools such as workload management and the ability to change query priority to enable DBAs to do a better job in tuning the system, but more complexity continues to arise.
DBAs and system administrators are forced to prevent certain types of queries from running on their systems to avoid compromising the health of the overall system and adversely impacting users. The most common approaches to keeping systems up and running are to: (a) reduce query priority for queries that consume too many resources, (b) block queries that consume too many resources from executing during certain times, (c) not allow certain queries to run at all, (d) migrate certain queries and processes to a different platform or technology, (e) limit the amount of data that a query or application can process even if the application requests all available data (e.g., limit to a single year instead of multiple years).
SQL-on-Hadoop: A good alternative to some of the common problems on existing data warehousing systems
Hadoop has emerged in recent years to address some of the common problems with data management platforms. The hadoop ecosystem has many different components, each having a particular purpose and to solve certain kinds of problems. In the SQL-on-Hadoop space, several tools available, including data types supported, file formats supported (e.g., Parquet, ORC, CSV), and SQL completeness (ANSI-92 vs SQL 2003/2013, etc.).
There are also differences in how SQL is enabled in the hadoop environment. Some SQL tools depend on external RDBMs to submit a SQL request to the Hadoop system and the result is returned to the relational RDBMs. Some tools deploy their entire RDBMS on the hadoop system (including storage layer and query engine), basically running their RDBMS on the hadoop cluster but using a proprietary metadata repository and data formats. Other vendors decided to build a MPP query engine that leverages the MPP architecture of hadoop and operates directly on the HDFS using native formats.
For the purpose of this article, I won’t cover any details and benefits among the current SQL hadoop offerings because it can be an extensive topic and probably better-suited for another article. Rather, I will discuss candidate SQL workloads that can be migrated from Data Warehouses or Data Marts into a hadoop system to best leverage the power of the SQL language.
As hadoop is a cost-effective and scalable platform to manage large volumes of data, many organizations are considering implementing hadoop as a data grid hub to serve multiple functions such as: (1) a data landing zone to store any kind of data (structured and unstructured) available to them, (2) as an on-line data archive repository to store data that has less frequent access and is too expensive to keep on traditional data warehouses/data marts but still needs to be available to some end users, (3) as a data exploration zone to enable power users to query any kind of data to uncover new insights and reveal new business opportunities, and (4) operational analytics workloads and others.
If we take the above scenarios into consideration, why not leverage SQL-based solutions and take the benefit of SQL skills that are out there along with many SQL tools for data manipulation? In the past two years, we have observed a huge improvement in SQL-on-hadoop offerings. Hive-13 has extended its SQL capabilities and also performance improvements. IBM recently released a new version of its BigSQL product that consists of an MPP SQL engine running natively over HDFS and leveraging the standard hadoop file formats. This engine has a robust query optimizer engine along with other enterprise-class features (workload management, security, etc.) and delivers extremely high performance for SQL-based workloads.
Data Integration and ETL-based Workloads: Many organizations are considering implementing hadoop to serve the functions of data landing, exploration, and archiving. The main goal is to create a central repository (also known as a data lake) and store any kind of data (structured, semi-structured and unstructured) on a commodity platform, with flexible architecture to enable easy consumption of data across the organization. This architecture also enables key data management functions such as data integration, cleansing and transformations to be executed in a very efficient manner.
ETL tool vendors are making a big effort to adapt their tools to leverage the hadoop ecosystem. This approach can be used to alleviate the data warehousing systems that are struggling to keep up with their workloads. Most ETL tools support SQL-based transformations, so there is great potential to reuse skills or even reuse code. This segment is evolving rapidly, and leveraging the hadoop platform for data integrations needs may soon become common practice.
Batch SQL-based Workloads: Many workloads on data warehouses are pure batch processes, such as data extraction, data exports, and batch reports. Those workloads do not have requirements for extremely high speed and performance. If a given user is not allowed to execute a complex query because the current infrastructure cannot support those extra workloads or those heavy and complex queries consume too many resources and thus impact SLAs of other important activities, then such workloads can be easily migrated or implemented on a SQL-on-Hadoop solution.
Data Preparation for Data Mining or Data Modeling: This kind of workload typically fits in the category of ETL or batch SQL-based workloads, but certain nuances are particular to this type of process. Data scientists or data miners typically need to work with large data volumes and create many new variables. This process, typically called data preparation for data mining, precedes the analytical modeling process (the process that creates a data mining model) and is an interactive, iterative process throughout model development. This kind of process may be very expensive when executed on a data warehouse system. In many cases, data scientists extract the data from the data warehouse and put it into a separate system (“sandbox”) where they can perform such activities without restrictions. They still need to extract the data from the data warehouse system, which might consist of joining multiples tables. These large data extractions are typically restricted by data warehousing workload management as it can impact the overall performance of the data warehouse system. SQL is the common language to perform such tasks (data extractions, data preparations), and most data mining tools leverage SQL language to perform such tasks, and thus they are also good candidates for leveraging SQL-on-hadoop solutions.
Data mining modeling and scoring workloads: The analytical modeling process consists of using a data mining algorithm (e.g., linear regression, decision tree classification, clustering) to build a data mining model. Data mining models sometimes are built using a subset (sample) of the data. The scoring process consists of applying the data mining model to another data set to predict an outcome or response for a new set of individuals or events (e.g., likelihood of responding to an offer) or to group them in certain ways (e.g., customer segmentation), depending on the type of model being used. Many organizations implement the scoring process by deploying data mining models within the relational database for high-volume, high-performance scoring. There is still some development to be done in this space, and existing data mining tools need to improve their solutions to run natively on hadoop systems and also fully to adopt native hadoop SQL engines. In the near future, this might became more popular, and we will see these workloads realize the benefits of SQL-on-hadoop engines. Open R has already made good progress in this direction, and it provides integration with Hive. Other data mining tools also provide integration with SQL-on-hadoop engines.
Ad hoc queries: It is probably too early to recommend that ad hoc workloads should move into a SQL-over-hadoop platform. If we take into consideration, however, that some historical analyses might not need the performance of a relational database, then we certainly can leverage the hadoop platform for such workloads. Especially in cases where the hadoop system is already storing historical archives, some workloads can easily fit into this context. As the SQL-on-Hadoop offerings start to improve query performance, we can expect to see more and more workloads moving in this direction.
Conclusion:The examples above are candidate workloads that can be considered for a SQL-on-Hadoop based solution. Leveraging SQL-on-Hadoop has obvious benefits, but of course other factors need to be considered. One important considerations is to define a strategy for the overall data architecture. As many applications will remain in the data warehousing systems for many years, there is a need to define a data flow and data strategy to minimize data movement and also to provide flexibility to users. While the hadoop ecosystem provides flexibility at scale with a very attractive cost for analytics and other data management needs, other important aspects such as data privacy, security, and data governance cannot be ignored, otherwise the benefits may be offset by the potential data privacy and security issues.