Evolution of the SQL language at Databricks: ANSI standard by default and easier migrations from data warehouses

Evolution of the SQL language at Databricks: ANSI standard by default and easier migrations from data warehouses

Information about Evolution of the SQL language at Databricks: ANSI standard by default and easier migrations from data warehouses

Kevin David

Today, we are excited to announce that Databricks SQL will use the ANSI standard SQL dialect by default. This follows the announcement earlier this month about Databricks SQL’s record-setting performance and marks a major milestone in our quest to support open standards. This blog post discusses how this update makes it easier to migrate your data warehousing workloads to Databricks lakehouse platform. Moreover, we are happy to announce improvements in our SQL support that make it easier to query JSON and perform common tasks more easily.

Migrate easily to Databricks SQL

We believe Databricks SQL is the best place for data warehousing workloads, and it should be easy to migrate to it. Practically, this means changing as little of your SQL code as possible. We do this by switching out the default SQL dialect from Spark SQL to Standard SQL, augmenting it to add compatibility with existing data warehouses, and adding quality control for your SQL queries.

Standard SQL we can all agree on

With the SQL standard, there are no surprises in behavior or unfamiliar syntax to look up and learn.

String concatenation is such a common operation that the SQL standard designers gave it its own operator. The double-pipe operator is simpler than having to perform a concat() function call:

SELECT
  o_orderstatus || ' ' || o_shippriority as order_info
FROM
  orders;

The FILTER clause, which has been in the SQL standard since 2003, limits rows that are evaluated during an aggregation. Most data warehouses require a complex CASE expression nested within the aggregation instead:

SELECT
  COUNT(DISTINCT o_orderkey) as order_volume,
  COUNT(DISTINCT o_orkerkey) FILTER (WHERE o_totalprice > 100.0) as big_orders -- using rows that pass the predicate
FROM orders;

SQL user-defined functions (UDFs) make it easy to extend and modularize business logic without having to learn a new programming language:

CREATE FUNCTION inch_to_cm(inches DOUBLE)
RETURNS DOUBLE RETURN 2.54 * inches;

SELECT inch_to_cm(5); -- returns 12.70

Compatibility with other data warehouses

During migrations, it is common to port hundreds or even thousands of queries to Databricks SQL. Most of the SQL you have in your existing data warehouse can be dropped in and will just work on Databricks SQL. To make this process simpler for customers, we continue to add SQL features that remove the need to rewrite queries.

For example, a new QUALIFY clause to simplify filtering window functions makes it easier to migrate from Teradata. The following query finds the five highest-spending customers in each day:

SELECT
  o_orderdate,
  o_custkey,
  RANK(SUM(o_totalprice)) OVER (PARTITION BY o_orderdate ORDER BY SUM(o_totalprice) DESC) AS rank
FROM orders
GROUP BY o_orderdate, o_custkey
QUALIFY rank <= 5; -- applies after the window function

We will continue to increase compatibility features in the coming months. If you want us to add a particular SQL feature, don’t hesitate to reach out.

Quality control for SQL

With the adoption of the ANSI SQL dialect, Databricks SQL now proactively alerts analysts to problematic queries. These queries are uncommon but they are best caught early so you can keep your lakehouse fresh and full of high-quality data. Below is a selection of such changes (see our documentation for a full list).

  • Invalid input values when casting a STRING to an INTEGER
  • Arithmetic operations that cause an overflow
  • Division by zero

Easily and efficiently query and transform JSON

If you are an analyst or data engineer, chances are you have worked with unstructured data in the form of JSON. Databricks SQL natively supports ingesting, storing and efficiently querying JSON. With this release, we are happy to announce improvements that make it easier than ever for analysts to query JSON.

Let’s take a look at an example of how easy it is to query JSON in a modern manner. In the query below, the raw column contains a blob of JSON. As demonstrated, we can query and easily extract nested fields and items from an array while performing a type conversion:

SELECT
  raw:customer.full_name,     -- nested field
  raw:customer.addresses[0],  -- array
  raw:customer.age::integer,  -- type cast
FROM customer_data;

With Databricks SQL you can easily run these queries without sacrificing performance or by having to extract the columns out of JSON into separate tables. This is just one way in which we are excited to make life easier for analysts.

Simple, elegant SQL for common tasks

We have also spent time doing spring cleaning on our SQL support to make other common tasks easier. There are too many new features to cover in a blog post, but here are some favorites.

Case-insensitive string comparisons are now easier:

SELECT
  *
FROM
  orders
WHERE
  o_orderpriority ILIKE '%urgent'; -- case insensitive string comparison

Shared WINDOW frames save you from having to repeat a WINDOW clause. Consider the following example where we reuse the win WINDOW frame to calculate statistics over a table:

SELECT
  round(avg(o_totalprice) OVER win, 1) AS price,
  round(avg(o_totalprice) OVER win, 1) AS avg_price,
  min(o_totalprice) OVER win           AS min_price,
  max(o_totalprice) OVER win           AS max_price,
  count(1) OVER win              AS order_count
FROM orders
-- this is a shared WINDOW frame
WINDOW win AS (ORDER BY o_orderdate ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING);

Multi-value INSERTs make it easy to insert multiple values into a table without having to use the UNION keyword, which is common most other data warehouses:

CREATE TABLE employees
(name STRING, dept STRING, salary INT, age INT);

-- this is a multi-valued INSERT
INSERT INTO employees
VALUES ('Lisa', 'Sales', 10000, 35),
       ('Evan', 'Sales', 32000, 38),
       ('Fred', 'Engineering', 21000, 28);

Lambda functions are parameterized expressions that can be passed to certain SQL functions to control their behavior. The example below passes a lambda to the transform function, concatenating together the index and values of an array (themselves an example of structured types in Databricks SQL).

-- this query returns ["0: a","1: b","2: c"]
SELECT
  transform(
    array('a','b','c'),
    (x, i) -> i::string || ': ' || x -- this is a lambda function
  );

Update data easily with standard SQL

Data is rarely static, and it is common to to update a table based on changes in another table. We are making it easy for users to deduplicate data in tables, create slowly-changing data and more with a modern, standard SQL syntax.

Let’s take a look at how easy it is to update a customers table, merging in new data as it arrives:

MERGE INTO customers    -- target table
USING customer_updates  -- source table with updates
ON customers.customer_id = customer_updates.customer_id
WHEN MATCHED THEN
  UPDATE SET customers.address = customer_updates.address

Needless to say, you do not sacrifice performance with this capability as table updates are blazing fast. You can find out more about the ability to update, merge and delete data in tables here.

Taking it for a spin

We understand language dialect changes can be disruptive. To facilitate the rollout, we are happy to announce a new feature, channels, to help customers safely preview upcoming changes.

When you create or edit a SQL endpoint, you can now choose a channel. The “current” channel contains generally available features while the preview channel contains upcoming features like the ANSI SQL dialect.

To test out the ANSI SQL dialect, click SQL Endpoints in the left navigation menu, click on an endpoint and change its channel. Changing the channel will restart the endpoint, and you can always revert this change later. You can now test your queries and dashboards on this endpoint.

You can also test the ANSI SQL dialect by using the SET command, which enables it just for the current session:

SET ANSI_MODE = true; -- only use this setting for testing

SELECT CAST('a' AS INTEGER);

Please note that we do NOT recommend setting ANSI_MODE to false in production. This parameter will be removed in the future, hence you should only set it to FALSE temporarily for testing purposes.

The future of SQL at Databricks is open, inclusive and fast

Databricks SQL already set the world record in performance, and with these changes, it is standards compliant. We are excited about this milestone, as it is key in dramatically improving usability and simplifying workload migration from data warehouses over to the lakehouse platform.

Please learn more about changes included in the ANSI SQL dialect. Note that the ANSI dialect is not enabled as default yet for existing or new clusters in the Databricks data science and engineering workspace. We are working on that next.

Breaking Story – Evolution of the SQL language at Databricks: ANSI standard by default and easier migrations from data warehouses

The Latest News on Evolution of the SQL language at Databricks: ANSI standard by default and easier migrations from data warehouses

Source link
Category – Big Data

Leave a Reply

Your email address will not be published. Required fields are marked *