Lately I had to write a query that transforms each record of a table into a series of records. Specifically, from a table with two columns (the first is a key, the second is a list of car plates) I had to extract a result set with two columns: the key and a single plate. For every record that had more than one plate, I was asked to duplicate the record so that the second column holds a single car plate.
In other words:
    postgres=# create temporary table t (key varchar primary key, plates varchar);
    postgres=# insert into t values ('000000','AA888BB CC777DD GG333JJ'), ('111111','ZZ888KK');
    INSERT 0 2
    postgres=# select * from t;
      key   |         plates
    --------+-------------------------
     000000 | AA888BB CC777DD GG333JJ
     111111 | ZZ888KK

What I wanted was:

      key   |  plate
    --------+---------
     000000 | AA888BB
     000000 | CC777DD
     000000 | GG333JJ
     111111 | ZZ888KK
The solution I found is:
    postgres=# select key, unnest(regexp_split_to_array(plates, E'\\s+')) AS plate from t;
      key   |  plate
    --------+---------
     000000 | AA888BB
     000000 | CC777DD
     000000 | GG333JJ
     111111 | ZZ888KK
What does this query do? First, it converts the list (plates) into an array, splitting on whitespace; second, it converts the array into a relation of possibly many records.
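To make the two steps concrete, here is a minimal Python sketch that mimics them outside the database. It is only a model of what the query produces, not how PostgreSQL evaluates it; the `rows` list and the `split_plates` helper are hypothetical names introduced for this illustration.

```python
import re

# Hypothetical in-memory copy of table t: (key, plates) pairs.
rows = [
    ("000000", "AA888BB CC777DD GG333JJ"),
    ("111111", "ZZ888KK"),
]


def split_plates(rows):
    """Yield one (key, plate) pair per plate, like the query's result set."""
    for key, plates in rows:
        # Step 1, like regexp_split_to_array(plates, E'\\s+'):
        # split the list on runs of whitespace to get an array of plates.
        plate_array = re.split(r"\s+", plates)
        # Step 2, like unnest(...): emit one output record per array element,
        # repeating the key each time.
        for plate in plate_array:
            yield (key, plate)


result = list(split_plates(rows))
for key, plate in result:
    print(key, plate)
```

Seen this way, the "relation" built in step 2 is recomputed for every input record, which is exactly the part that puzzled me.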
But I didn’t like this solution and, moreover, I did not even understand it. So a few questions arose:
- How is it possible that PostgreSQL allows me to call a function that returns a set in the SELECT part instead of the FROM part? When I studied SQL, I learned that FROM is for specifying all my data sources (relations), and SELECT is for specifying what to display and possibly how to format it.
- How does PostgreSQL decide to build a cartesian product between a first element (the single value “key”) and a second one (the relation “plate”)?
- How does PostgreSQL define this second relation, which is not a fixed one since it depends on a field taken from the current record? That is, for each “key” there is a specific relation “plate”. Furthermore, if this really is a cartesian product, adding a second unnest should multiply the rows again (3×3 = 9 records for the first key alone). Let’s try:
    postgres=# select key, unnest(regexp_split_to_array(plates, E'\\s+')) AS plate1, unnest(regexp_split_to_array(plates, E'\\s+')) AS plate2 from t;
      key   | plate1  | plate2
    --------+---------+---------
     000000 | AA888BB | AA888BB
     000000 | CC777DD | CC777DD
     000000 | GG333JJ | GG333JJ
     111111 | ZZ888KK | ZZ888KK
- Why did this not lead to a new cartesian product? Could this be related to IMMUTABLE functions such as, probably, unnest? (As far as I remember, IMMUTABLE functions are functions that do not change their result when called with the same arguments. In that case PostgreSQL would avoid calling them many times and reuse the result directly. But why does it not build a new product?) Let’s test it differently, with another array:
    postgres=# select key, unnest(regexp_split_to_array(plates, E'\\s+')) AS plate1, unnest('{1,2}'::int[]) AS array2 from t;
      key   | plate1  | array2
    --------+---------+--------
     000000 | AA888BB |      1
     000000 | CC777DD |      2
     000000 | GG333JJ |      1
     000000 | AA888BB |      2
     000000 | CC777DD |      1
     000000 | GG333JJ |      2
     111111 | ZZ888KK |      1
     111111 | ZZ888KK |      2

Here, a cartesian product is made.
So, without much understanding of what was going on, I contacted a couple of mailing lists until I got an answer from Tom Lane (here).
The answer contains a few points:
- The best way to write this query is to use LATERAL, moving the function that splits the list from the SELECT part to the FROM part, and replacing the two functions unnest + regexp_split_to_array with the single function regexp_split_to_table:
    select key, plate from t, lateral regexp_split_to_table(plates, E'\\s+') as plate;
- Using functions that return a result set in the SELECT part is a hangover from Berkeley QUEL. This part of the code is not something Lane is happy with. Moreover, he notes that in PostgreSQL 10 the code that manages these functions returning multiple rows has been rewritten in order to better isolate it from the rest of the code (see the link in his email).
- The cartesian product is not made for every function called. The final number of records is based on the least common multiple of the periods of all the function results. This explains why two functions that each produce 3 records end up producing a 3-record result, while a function that produces 3 records combined with a function that produces 2 records ends up producing a 6-record result.
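The least common multiple rule is easier to see with a small model. The Python sketch below is my own illustration of the rule as described in the answer, not PostgreSQL code: `expand_row` is a hypothetical helper that cycles each function result until all of them finish at the same time. I use a model rather than a live query because, as far as I know, PostgreSQL 10 and later no longer cycle the results and instead pad the shorter ones with NULLs, so the outputs shown earlier come from an older server.

```python
from itertools import cycle, islice
from math import gcd


def lcm(a, b):
    """Least common multiple of two non-negative integers."""
    return a * b // gcd(a, b)


def expand_row(*srf_results):
    """Model of the old rule: repeat each result until all end together.

    Each argument is the list of values one set-returning function
    produced for the current input record (hypothetical representation).
    """
    n = 1
    for values in srf_results:
        n = lcm(n, len(values))
    # Cycle every result and cut each one after n values, then pair them up.
    columns = [islice(cycle(values), n) for values in srf_results]
    return list(zip(*columns))


plates = ["AA888BB", "CC777DD", "GG333JJ"]

# Two results of length 3: lcm(3, 3) = 3 rows, no product.
same = expand_row(plates, plates)

# Lengths 3 and 2: lcm(3, 2) = 6 rows, which only looks like a product.
mixed = expand_row(plates, [1, 2])

for row in mixed:
    print(row)
```

The rows in `mixed` come out in the same order as in the psql output for key 000000, with the plates cycling twice and the integers cycling three times. The apparent cartesian product in that test was therefore a coincidence of 3 and 2 having no common factor.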