The Grouparoo Blog
When preparing your data set for analysis, it is crucial to ensure that your data set is both complete and accurate. One step in this process is deciding how to handle null values. Depending on how your data is going to be used, you may not want null values at all!
Let's clean some data
We're going to take a look at calculating Lifetime Value (LTV) of a customer. We want a customer with no purchase history to have an LTV of 0. What happens if our aggregated column returns no rows, and thus is null
? Should we clean the data at the query level or afterwards using a script?
We’re going to walk through this LTV example from beginning to end looking at different options -- some that work well, some that work poorly, and some that won’t work at all.
To start, we have a users
table and a purchases
table.
users:
user_id |
---|
1 |
2 |
3 |
4 |
5 |
purchases:
user_id | price |
---|---|
1 | 79.36 |
3 | 42.97 |
1 | 33.14 |
2 | 16.48 |
4 | 96.41 |
As you can see, the users table has 5 unique users and the purchases table has 5 purchases from users 1, 2, 3, and 4. User 5 is not present in the purchases table. Now that we have our data, it’s time to create our calculated column. Our query is being called by a function that passes it a list of user_ids that can be used in the query.
Option 1: With a SQL Query
Using COALESCE
COALESCE
is a keyword used by Postgres and other SQL dialects to return the first non-null value in a set. Because of this, it’s a common keyword for data cleaning. Let’s look closer at the following query.
SELECT
COALESCE(SUM(purchases.price), 0) as LTV,
user_id
FROM purchases
WHERE purchases.user_id
IN (1, 2, 3, 4, 5)
GROUP BY user_id
ORDER BY user_id;
When we look at the results from that query we end up with:
user_id | LTV |
---|---|
1 | 112.50 |
2 | 16.48 |
3 | 42.97 |
4 | 96.41 |
Yikes! No user 5. That means wherever you’re using this data, when you try and access the row for user 5, you’ll get null
, which may throw off calculations. So why did COALESCE
do this?
A SQL query is like checking a post office box -- you might find a bunch of envelopes containing various amounts of mail, or even an empty envelope I suppose. But if there’s no mail at that post office box, you leave empty handed. If SQL can’t find any rows, it comes back empty handed as well. In fact, getting back a null vs a 0 vs an undefined may seem similar, but they tell us different things:
value | meaning |
---|---|
0 | We know something is there and its value is zero. |
null | We know something is there and its value is unspecified |
empty set | We know nothing is there |
So the issue with our query isn’t with the COALESCE
but with the order of execution - there are no rows to return with a user_id
of 5 in purchases
. Let's think about the logical order of how we get there:
The order that query gets executed in is:
- Find rows in
purchases
that match the given user ids (5 rows, matching 4 of the given user ids) - On each of the rows returned, return either the sum of the price column or 0 (same 5 rows, with aggregated data for LTV)
- Group by user_id and order by user_id (4 rows returned, in order by user_id, with aggregated data for LTV)
After the first line, there isn’t even a row for user 5 anymore. We can expect the same behavior if we try a CASE
switch for our LTV because, again, there is no row, no data for it to act on.
Had our purchases
table contained a row for user_id 5 with null
for price, it would have returned as expected. Knowing that, a second option is to break out a right outer join.
RIGHT OUTER JOIN
Knowing that our issue was with missing rows, we could go the route of a RIGHT OUTER JOIN
with our users table to achieve that. This way, we’ll get back a row for every user from users
, and, if it’s there, the sum of their data from the “price” column in purchases
, otherwise a 0.
SELECT
users.id as user_id,
COALESCE(SUM("price"), 0) as LTV
FROM purchases
RIGHT OUTER JOIN users
ON demo.purchases.user_id = users.id
GROUP BY user_id
ORDER BY user_id;
user_id | LTV |
---|---|
1 | 112.50 |
2 | 16.48 |
3 | 42.97 |
4 | 96.41 |
5 | 0 |
It works! Fantastic.
Your primary data may be spread across multiple tables -- for instance, you may have a setup like:
location_a_customers
Column | Type | Collation | Nullable | Default |
---|---|---|---|---|
user_id | string | not null | ||
string | not null | |||
last_name | string | not null | ||
user_id | string | not null |
location_b_customers
Column | Type | Collation | Nullable | Default |
---|---|---|---|---|
user_id | string | not null | ||
string | not null | |||
last_name | string | not null | ||
user_id | string | not null |
that you want to use with a purchases table:
Column | Type | Collation | Nullable | Default |
---|---|---|---|---|
user_id | string | not null | ||
purchase_amount | decimal | not null | ||
item_sku | string | not null | ||
item_category | string | not null |
Fear not, we can still make it happen. Though, for everyone’s sake, at this point I would recommend using a function over using SQL! Just for giggles, let's prove it's possible first.
UNION
with a list
One way around the issue of having multiple users tables would be to create a list of all the user_id
s for the query with a sublist of (user_id
, 0), then UNION
the original query with that list. If it sounds overkill, that’s because it is:
SELECT
DISTINCT on (user_id) user_id,
purchases FROM( SELECT user_id, COUNT(*) as LTV
FROM purchases
WHERE purchases.user_id in (1, 2, 3, 4, 5)
GROUP BY user_id
UNION
SELECT * FROM (
values (1, 0), (2, 0), (3, 0), (4, 0), (5, 0))
AS t (user_id, LTV)
ORDER BY user_id, LTV DESC
) AS data;
Which returns our beloved 0:
user_id | LTV |
---|---|
1 | 112.50 |
2 | 16.48 |
3 | 42.97 |
4 | 96.41 |
5 | 0 |
The query isn’t too out of hand for 5 users, but imagine having to write that out (or have a function write that out) for 500 users! What about 10,000! Just because something works, doesn’t make it necessarily the right idea. So let’s try it from the programming sde.
Option 2: After the SQL Query 🥇
We’re really asking a lot of SQL there. While it was possible, it was not very pretty, very concise, or very readable. It also muddies things up as it is technically returning data from our database that isn’t there. Thankfully, transforming the data we do get back using a function is a much more straightforward task!
I worked this one out in both TypeScript and Python. I love that we use TypeScript for building out our data pipeline because dynamic typing helps keep the guard rails on and make sure I’m paying attention to what I’m working with -- especially as our data structures grow in complexity, it helps to be completely sure what you’re working with. And Python for all the Python-loving data folks as well.
Here’s how we can use a function to transform our data to include 0s. We start by setting the LTV of every id as [0], then iterate through our results and replace the [0]s as we go:
TypeScript:
interface Profile {
id: string;
}
interface DataResonseRow {
id: string;
data: number[];
}
function calculateLTV(
queryResults: DataResponseRow[],
aggregationMethod: string,
profiles: Profile[]
) {
const responsesById: { [id: string]: any[] } = {};
for (const i in profiles) {
if (aggregationMethod === "count" || aggregationMethod === "sum") {
//default all entries for counted or summed aggregations to 0 to start... will be replaced later if there is a response
responsesById[profiles[i].id] = [0];
}
}
for (const row in queryResults) {
responsesById[queryResults[row].id] = queryResults[row].data;
}
return responsesById;
}
Python:
def calculate_ltv (query_results, aggregation_method, profiles):
responses_by_id = {}
for i in profiles:
if aggregation_method == "count" or aggregation_method == "sum":
# default all entries for counted or summed aggregations to 0 to start... will be replaced later if there is a response
responses_by_id[i["id"]] = [0]
for row in query_results:
responses_by_id[row["id"]] = row["data"]
return responses_by_id
mock_query_results = [{"id":1, "data":[112.50]}, {"id":2, "data":[16.48]}, {"id":3, "data":[42.97]}, {"id":4, "data":[96.41]}]
mock_aggregation_method = "sum"
mock_profiles = [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}, {"id": 5}]
print(calculate_ltv(mock_query_results, mock_aggregation_method, mock_profiles));
To see them in action, here’s a jsfiddle for the TypeScript solution and an IDEOne instance for the Python solution.
Learn More:
- To learn more about handling conditionals and aggregations in Postgresql, check out their docs
Tagged in Engineering
See all of Teal Larson's posts.
Teal is a Full Stack Engineer at Grouparoo, an open source data framework that easily connects your data to business tools. Teal loves building products that solve real-world needs. If she isn't coding, Teal is probably riding her bike or nose-deep in a book
Learn more about Teal @ https://www.tealjulia.com
Get Started with Grouparoo
Start syncing your data with Grouparoo Cloud
Start Free TrialOr download and try our open source Community edition.