Informational Constraints in Databricks
Improve data integrity in your data warehouse
Constraints
Constraints in databases are rules that ensure data integrity and consistency by enforcing certain conditions or restrictions on the data that can be stored in a table. The most common types of constraints in databases include the following (all five are illustrated together in a short SQL sketch after this list):
Primary key constraint
This constraint ensures that each record in a table is unique by identifying a column or set of columns that uniquely identifies each record. It prevents duplicate records from being entered into the table.
Foreign key constraint
This constraint establishes a relationship between two tables by ensuring that the value in one table's column matches the value in another table's column. It ensures referential integrity and helps maintain data consistency across tables.
Unique constraint
This constraint ensures that each value in a column is unique, which means that no two rows can have the same value in that column. It prevents duplicate values in a table.
Check constraint
This constraint validates that a column's value meets a specific condition or a range of conditions, such as ensuring that a date is within a certain range, or a number is greater than a specific value. It helps ensure data accuracy and consistency.
Not-null constraint
This constraint ensures that a column does not contain null values, which means that the column must have a value for each row in the table. It helps ensure data completeness and consistency.
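To make these concrete, here is a minimal sketch in generic ANSI SQL (the departments and employees tables are hypothetical, and this is not Databricks-specific syntax) that uses all five constraint types; Databricks support for several of these is covered below.

CREATE TABLE departments (
  dept_id INT PRIMARY KEY,        -- primary key: uniquely identifies each row
  dept_name VARCHAR(50) UNIQUE    -- unique: no two departments share a name
);

CREATE TABLE employees (
  emp_id INT PRIMARY KEY,
  email VARCHAR(100) NOT NULL,                    -- not null: a value is required
  salary INT CHECK (salary > 0),                  -- check: enforces a domain rule
  dept_id INT REFERENCES departments (dept_id)    -- foreign key: referential integrity
);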
Data Lakes without Constraints
Data lakes pose significant challenges from a constraints perspective due to their flexible, schemaless nature. Some of the key challenges include:
Lack of data consistency
Data lakes often store data in its raw, unstructured format, making it challenging to enforce constraints that ensure data consistency. Without proper constraints, it is possible to store data that is incomplete, inaccurate, or inconsistent.
Difficulty in defining constraints
With data lakes, the schema is not predefined, making it difficult to define and enforce constraints. The process of defining constraints can be complicated and requires a deep understanding of the data, its sources, and its intended use.
Updating and maintaining constraints
As data in a data lake evolves and changes, constraints must be updated and maintained regularly. This requires a rigorous process of monitoring and validation to ensure that constraints continue to meet the changing needs of the data.
Lack of support for transactions
Traditional data lakes do not support ACID transactions, which poses a challenge when trying to maintain data consistency. Transactions are essential for ensuring that related data changes are committed together and that the data remains consistent and accurate.
Data quality issues
Data lakes often contain data from a variety of sources, which can pose quality issues. Data must be cleaned, validated, and transformed to ensure that it is accurate and consistent. Without proper constraints, it can be difficult to identify data quality issues and correct them.
CHECK and NOT NULL
CHECK and NOT NULL constraints are features in Databricks that allow users to enforce data validation rules on data stored in Delta Lake.
The CHECK constraint ensures that data being inserted into a table meets a specified set of conditions. This helps prevent inconsistencies and inaccuracies by validating data before it is stored in the data lake. The NOT NULL constraint, on the other hand, ensures that a column in a table cannot contain null values, guaranteeing that every row has a value for that column.
Enforcing CHECK and NOT NULL constraints on Databricks helps improve data quality while simplifying data management and analysis. Users can define their own rules for CHECK constraints, based on the specific data being stored in their Delta tables.
Here is an example of an orders table in which order_id must not be null and quantity must always be greater than zero:
CREATE TABLE orders (
  order_id INT NOT NULL,
  customer_id INT,
  product_name VARCHAR(50),
  quantity INT,
  order_date DATE
);

ALTER TABLE orders ADD CONSTRAINT chk_quantity CHECK (quantity > 0);

Note that in Databricks, CHECK constraints are managed with ALTER TABLE: you add them with ALTER TABLE ADD CONSTRAINT and remove them with ALTER TABLE DROP CONSTRAINT.
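To see the constraints in action, here is a short sketch against the orders table above (the values are made up for illustration): the first insert satisfies both constraints, the second is rejected by the CHECK constraint, and the final statement drops the constraint.

-- Succeeds: order_id is non-null and quantity is positive
INSERT INTO orders VALUES (1, 100, 'Widget', 5, DATE'2023-01-15');

-- Rejected by chk_quantity: quantity must be greater than zero
INSERT INTO orders VALUES (2, 101, 'Gadget', 0, DATE'2023-01-16');

-- Drop the constraint when it is no longer needed
ALTER TABLE orders DROP CONSTRAINT chk_quantity;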
Finally, Primary Keys and Foreign Keys
Note: Primary key and foreign key constraints are supported in Databricks Runtime 11.1 and above. These constraints require both Unity Catalog and Delta Lake to be enabled in the Databricks environment in order to function properly.
In Unity Catalog tables, primary key and foreign key relationships can be declared on columns to document how tables relate to one another.
It is essential to note that these constraints are informational only; Unity Catalog does not enforce them.
When declaring a foreign key relationship, the foreign key must reference a primary key in another table. This helps ensure that the relationship between tables is valid and consistent.
Primary and foreign key declarations can be included in the table specification clause during table creation. However, it is important to note that this clause is not allowed in CTAS (Create Table As Select) statements, which create a new table from the results of a SELECT statement.
Constraints can also be added to existing tables to further document relationships and rules for data integrity. For example, a foreign key constraint can be added to record that a column in one table references the primary key of another, signaling the intended referential integrity to users and tools even though it is not enforced.
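As a sketch of what that can look like, assuming the orders table from earlier and a hypothetical customers table whose primary key is customer_id, you can add keys to a table after creation; this is also the workaround for the CTAS limitation above.

-- A CTAS-created table starts without any constraints
CREATE TABLE orders_2023 AS
SELECT * FROM orders WHERE order_date >= DATE'2023-01-01';

-- Primary key columns must be non-nullable before the key is added
ALTER TABLE orders_2023 ALTER COLUMN order_id SET NOT NULL;
ALTER TABLE orders_2023 ADD CONSTRAINT pk_orders_2023 PRIMARY KEY (order_id);

-- An informational foreign key referencing the customers table's primary key
ALTER TABLE orders_2023 ADD CONSTRAINT fk_orders_2023_customers
  FOREIGN KEY (customer_id) REFERENCES customers (customer_id);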
Here is an example that creates two tables with primary and foreign keys.
%sql
CREATE TABLE Customers (
  CustomerID INT NOT NULL,
  FirstName VARCHAR(50),
  LastName VARCHAR(50),
  CONSTRAINT pk_customers PRIMARY KEY (CustomerID)
);

CREATE TABLE Orders (
  OrderID INT NOT NULL,
  OrderDate DATE,
  CustomerID INT,
  CONSTRAINT pk_orders PRIMARY KEY (OrderID),
  CONSTRAINT fk_orders_customers
    FOREIGN KEY (CustomerID) REFERENCES Customers (CustomerID)
);
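Because these keys are informational only, Databricks will not reject rows that violate them. As a quick illustration (with made-up values), the following insert succeeds even though no matching customer exists:

-- Succeeds despite the foreign key: CustomerID 999 has no row in Customers
INSERT INTO Orders VALUES (1, DATE'2023-01-15', 999);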
Final Thoughts
Bringing the foundational building blocks of database design to data lakes makes it far easier to maintain data integrity in big data lakes. In addition, making primary and foreign keys informational only makes it easier to integrate these new features into existing implementations.
Thanks for reading!
🤝🏼 You can follow me on LinkedIn or Twitter.
💬 Always welcome your thoughts or a conversation below!
