What does NULL mean in an OUTER JOIN? ("unknown" vs. "not exists")
In a previous post, I briefly mentioned that I consider a NULL in SQL databases as an existing but unknown value instead of a not existing one. But there's an exception in outer joins where a non-existent row will have null values in the results. Let's clarify it with an example. It’s important to distinguish between columns with a null value and columns from inexisting rows where all values are null, including the key. Here is an example extracted from the traditional EMP/DEPT schema. I don't show the create table on purpose because the way it is traditionally built in SQL examples is incorrect normalization. yugabyte=# select * from emp order by 1 ; empno | ename | job | sal | comm -------+--------+-----------+------+------ 7499 | ALLEN | SALESMAN | 1600 | 300 7788 | SCOTT | ANALYST | 3000 | 7839 | KING | PRESIDENT | 5000 | 7844 | TURNER | SALESMAN | 1500 | 0 NULL in the commission column may have different meanings. ALLEN and TURNER are SALESMAN, and a commission is part of the package. The value exists and is known. The absence of a commission for TURNER is not a NULL but simply a zero amount, maybe because he didn't reach his objectives. KING is the PRESIDENT. He may have a commission, but he doesn't want to disclose it, so a NULL was inserted instead. The commission exists but is unknown in the system. SCOTT works as an ANALYST, and the package for this position doesn't include a commission. It's not a zero commission. The commission doesn't exist for this job. Some systems display that as N/A (not applicable). No commission has been inserted, but a NULL was stored because the table has this column and no default value. In SQL, a NULL signifies "unknown," but in this case, SCOTT has no commission, and this is known. This appears to contradict the definition of NULL. Nevertheless, the issue is not with SQL definitions but instead with this application's data model. The EMP table is not normalized. There's a dependency between the job and the existence of a commission. The commission should not belong to the employee as an attribute but a one-to-one association. Some jobs do not have a commission, so the association is more precisely a one-to-zero-or-one. If the commission is a row in another table, the absence of a row can represent its non-existence, and it is different from a row with a key but a null value that represents an existing but unknown value. Here is the normalized data model: create table employees ( primary key (empno) , empno int , job text , ename text , sal numeric ); create table commissions ( primary key (empno) , empno int references employees , comm numeric ); insert into employees values (7839,'PRESIDENT','KING' ,5000) ,(7788,'ANALYST', 'SCOTT' ,3000) ,(7499,'SALESMAN', 'ALLEN' ,1600) ,(7844,'SALESMAN', 'TURNER',1500) ; insert into commissions values (7839,null) -- KING commission is unknown (undisclosed) ,(7499,300) -- ALLEN package includes a commission, he has 300 ,(7844,0) -- TURNER package includes a commission but has none ; I still have four employees, but only three have a commission: yugabyte=# select * from employees order by 1 ; empno | job | ename | sal -------+-----------+--------+------ 7499 | SALESMAN | ALLEN | 1600 7788 | ANALYST | SCOTT | 3000 7839 | PRESIDENT | KING | 5000 7844 | SALESMAN | TURNER | 1500 (4 rows) yugabyte=# select * from commissions order by 1 ; empno | comm -------+------ 7499 | 300 7839 | 7844 | 0 (3 rows) The problem with SQL, compared to document databases, is that the result of a query must be flattened to one table. This is where an outer join appears to materialize the result. With my normalized model, when I want to list the employees with their commission I join them with an outer join: yugabyte=# select * from employees left outer join commissions using(empno) order by 1 ; empno | job | ename | sal | comm -------+-----------+--------+------+------ 7499 | SALESMAN | ALLEN | 1600 | 300 7788 | ANALYST | SCOTT | 3000 | 7839 | PRESIDENT | KING | 5000 | 7844 | SALESMAN | TURNER | 1500 | 0 (4 rows) Interestingly, that's the initial table I had at the beginning, but now it is a query result that doesn't have to be normalized. The joined tables were modeled correctly regarding nulls and inexisting rows, but the results have discarded this information. To get this information, I can add an expression to the projection that will differentiate an existing commission from a commission with an unknown value. The existing one has a null in all columns, including the key. The unknown one has a non-nullable key, like all rows in an SQL database: yugabyte=# select * , commissions.empno is not null as "commission?" from employees left outer join commissions using(empno) order by 1 ; empno | job | ename | sal | comm | commission? -------+-----------+--------+---
In a previous post, I briefly mentioned that I consider a NULL in SQL databases as an existing but unknown value instead of a not existing one. But there's an exception in outer joins where a non-existent row will have null values in the results. Let's clarify it with an example. It’s important to distinguish between columns with a null value and columns from inexisting rows where all values are null, including the key.
Here is an example extracted from the traditional EMP/DEPT schema. I don't show the create table on purpose because the way it is traditionally built in SQL examples is incorrect normalization.
yugabyte=# select * from emp order by 1
;
empno | ename | job | sal | comm
-------+--------+-----------+------+------
7499 | ALLEN | SALESMAN | 1600 | 300
7788 | SCOTT | ANALYST | 3000 |
7839 | KING | PRESIDENT | 5000 |
7844 | TURNER | SALESMAN | 1500 | 0
NULL in the commission column may have different meanings.
- ALLEN and TURNER are SALESMAN, and a commission is part of the package. The value exists and is known. The absence of a commission for TURNER is not a NULL but simply a zero amount, maybe because he didn't reach his objectives.
- KING is the PRESIDENT. He may have a commission, but he doesn't want to disclose it, so a NULL was inserted instead. The commission exists but is unknown in the system.
- SCOTT works as an ANALYST, and the package for this position doesn't include a commission. It's not a zero commission. The commission doesn't exist for this job. Some systems display that as N/A (not applicable). No commission has been inserted, but a NULL was stored because the table has this column and no default value. In SQL, a NULL signifies "unknown," but in this case, SCOTT has no commission, and this is known.
This appears to contradict the definition of NULL. Nevertheless, the issue is not with SQL definitions but instead with this application's data model.
The EMP table is not normalized. There's a dependency between the job and the existence of a commission. The commission should not belong to the employee as an attribute but a one-to-one association. Some jobs do not have a commission, so the association is more precisely a one-to-zero-or-one. If the commission is a row in another table, the absence of a row can represent its non-existence, and it is different from a row with a key but a null value that represents an existing but unknown value.
Here is the normalized data model:
create table employees (
primary key (empno)
, empno int
, job text
, ename text
, sal numeric
);
create table commissions (
primary key (empno)
, empno int references employees
, comm numeric
);
insert into employees values
(7839,'PRESIDENT','KING' ,5000)
,(7788,'ANALYST', 'SCOTT' ,3000)
,(7499,'SALESMAN', 'ALLEN' ,1600)
,(7844,'SALESMAN', 'TURNER',1500)
;
insert into commissions values
(7839,null) -- KING commission is unknown (undisclosed)
,(7499,300) -- ALLEN package includes a commission, he has 300
,(7844,0) -- TURNER package includes a commission but has none
;
I still have four employees, but only three have a commission:
yugabyte=# select * from employees order by 1
;
empno | job | ename | sal
-------+-----------+--------+------
7499 | SALESMAN | ALLEN | 1600
7788 | ANALYST | SCOTT | 3000
7839 | PRESIDENT | KING | 5000
7844 | SALESMAN | TURNER | 1500
(4 rows)
yugabyte=# select * from commissions order by 1
;
empno | comm
-------+------
7499 | 300
7839 |
7844 | 0
(3 rows)
The problem with SQL, compared to document databases, is that the result of a query must be flattened to one table. This is where an outer join appears to materialize the result.
With my normalized model, when I want to list the employees with their commission I join them with an outer join:
yugabyte=# select *
from employees
left outer join commissions
using(empno)
order by 1
;
empno | job | ename | sal | comm
-------+-----------+--------+------+------
7499 | SALESMAN | ALLEN | 1600 | 300
7788 | ANALYST | SCOTT | 3000 |
7839 | PRESIDENT | KING | 5000 |
7844 | SALESMAN | TURNER | 1500 | 0
(4 rows)
Interestingly, that's the initial table I had at the beginning, but now it is a query result that doesn't have to be normalized. The joined tables were modeled correctly regarding nulls and inexisting rows, but the results have discarded this information.
To get this information, I can add an expression to the projection that will differentiate an existing commission from a commission with an unknown value. The existing one has a null in all columns, including the key. The unknown one has a non-nullable key, like all rows in an SQL database:
yugabyte=# select *
, commissions.empno is not null as "commission?"
from employees
left outer join commissions
using(empno)
order by 1
;
empno | job | ename | sal | comm | commission?
-------+-----------+--------+------+------+-------------
7499 | SALESMAN | ALLEN | 1600 | 300 | true
7788 | ANALYST | SCOTT | 3000 | | false
7839 | PRESIDENT | KING | 5000 | | true
7844 | SALESMAN | TURNER | 1500 | 0 | true
(4 rows)
In an SQL table, the columns belonging to the primary key cannot be null because the key must be known to identify the row. When the key columns become null in a query result, they result from the row's absence, which denotes inexisting information rather than unknown information.
If you store employees and their commissions in one table, you denormalize it and lose important information. This is why it seems that a NULL can have meanings beyond just unknown information. With a proper data model, the meanings are differentiated: a null column value indicates an existing but unknown value and a non-existing row signifies the absence of information.