How humans interact with large data sets

A smarter approach to filtering by using a Star Schema

Published in

Invoca Engineering Blog

7 min readMar 10, 2020

We live in a world that is bursting at the seams with data. According to sciencefocus.com, the largest online storage providers estimate their raw storage to be in the millions of terabytes. Humans aren’t able to effectively analyze large amounts of data all at once. That’s a job for AI. In order to see the bigger picture, we must focus our analysis down to no more than a few dimensions at a time.

In other words, we must sequentially analyze individual subsets of data. It follows then that our analysis is heavily reliant upon our ability to efficiently filter down to these subsets of data. Therefore, filtering plays an incredibly important role in our lives!

So let’s make sure we optimize it.

A Filtering Example

Something else that plays an important role in our lives is food. Not only do we need it, but we also need the right balance of it. We have to intelligently filter our food choices based on their nutritional values, and a diet is essentially a food filter.

Let’s say we want to start a balanced diet with healthy portions of macronutrients such as carbohydrates, fats, proteins, minerals, vitamins, and fiber. For the purposes of this illustration, we decide to order our food products online based on their nutritional categories.

We then find an online grocery store that supports filtering by category. Keep in mind that we are savvy shoppers, so we want to buy a modest portion of perishable food while still achieving proper nutritional balance. We essentially want to find foods that fall into multiple nutrient categories in order to be efficient with our purchases.

Not-So-Smart Filtering

Now that we understand this filtering example from an end-user perspective, let’s understand it from a developer perspective.

There are two modes of filtering which can be applied here:

Return results which match ANY of the selected categories
Return results which match ALL of the selected categories

Option #1: Results that match ANY selection

If we go with option #1, then we can only filter down as far as an entire category of food products. The filter below would return all high-protein foods (such as fish, chicken, bacon, etc).

SELECT * FROM food_products WHERE proteins=110 rows in set

However, we would not be able to achieve any granularity past a single category. For example, filtering by carbs and proteins would return all food products which either contain carbs OR proteins.

SELECT * FROM food_products WHERE carbs=1 OR proteins=125 rows in set

This filter would return high-carb foods (such as bread, rice, potatoes, etc) in addition to high-protein foods. The benefit is that we are almost certainly guaranteed to return a set of results no matter how many categories are selected.

Option #2: Results that match ALL selections

If we go with option #2, then we can filter down as far as a single food product and achieve maximum granularity. The filter below would return only high-protein foods that are also high in carbs (such as beans, peas, yogurt, etc).

SELECT * FROM food_products WHERE carbs=1 AND proteins=13 rows in set

The downside is that we might unknowingly apply an invalid combination of filters that may not return any results. For example, there are few, if any, foods that are high in ALL macronutrient categories. Therefore, selecting all categories might return zero results.

SELECT * FROM food_products WHERE carbs=1 AND proteins=1 AND fats=1 AND fiber=1 AND minerals=1 AND vitamins=1;0 rows in set

This leaves us with having to systematically “guess” which combination of filters work and which do not.

Experimenting with filter combinations is not a big deal unless the underlying dataset is massive. If the dataset is large, then each filtered query might take several seconds to report back. Again, waiting for queries is not the end of the world, but we would still like to optimize our filtering workflow as much as possible in order to be efficient with our time. We humans have better things to do with our lives!

A Smarter Approach: Conditional Filtering

Conditional filtering offers the best of both options. Results are guaranteed while also allowing for maximum granularity. However, it does require an extra query to be made.

After selecting a category, we update the list of remaining categories according to that selection. In other words, the remaining categories are guaranteed to occur alongside the initial selection.

If the high-protein food products available for purchase online are only high in either carbs or fats, then the remaining categories would not include any other categories besides carbs and fats because we know that all other combinations would be invalid and return zero results.

Conditional filter options for a single category selection

SELECT 
  SUM(food_products.carbs) AS carb_count,
  SUM(food_products.proteins) AS protein_count,
  SUM(food_products.fats) AS fat_count,
  SUM(food_products.fiber) AS fiber_count,
  SUM(food_products.minerals) AS mineral_count,
  SUM(food_products.vitamins) AS vitamin_count
FROM food_products
WHERE proteins=1;/* Macronutrient columns are boolean, so by summing them up we can see which columns have non-zero results and only include those columns as options in our conditional filter */
{ 
  protein_count: 10,
  carb_count: 3,
  fat_count: 4,
  fiber_count: 0,
  mineral_count: 0,
  vitamin_count: 0
}

Likewise, if we select both carbs and proteins but discover there are no food products available that contain carbs, proteins, AND fats, then we will no longer show the category option for fats. How neat is that!

Conditional filter options for multiple category selection

SELECT 
  SUM(food_products.carbs) AS carb_count,
  SUM(food_products.proteins) AS protein_count,
  SUM(food_products.fats) AS fat_count,
  SUM(food_products.fiber) AS fiber_count,
  SUM(food_products.minerals) AS mineral_count,
  SUM(food_products.vitamins) AS vitamin_count
FROM food_products
WHERE proteins=1 AND carbs=1;{ 
  protein_count: 3,
  carb_count: 3,
  fat_count: 0,
  fiber_count: 0,
  mineral_count: 0,
  vitamin_count: 0
}

Wait a minute…

If we’re guaranteeing results for certain filter combinations, wouldn’t this extra query take just as long as our original query? Wouldn’t we have to query across all of our food products?

Well. Yes… unless we thought ahead and decided to use a star schema.

Faster Queries with a Star Schema

What is a Star Schema?

Let’s say you have a table of food products and each row contains a name, a price, and various macronutrient information about each product. In this example, our macronutrient columns are booleans that indicate if a food product contains high amounts of that macronutrient.

If we look at the macronutrient columns (proteins-to-fibers), we can see that rows 2 and 3 have the exact same macronutrient information, and duplicated information means wasted space!

With a Star schema approach, there is one central table called a fact table and multiple dimension tables hanging off of it via foreign keys. The table association often looks like a star which is where the name comes from.

In our scenario, we will move the macronutrient information to its very own dimension table and create a foreign key association between our food products fact table and the macronutrient dimension table.

Notice how rows 2 and 3 from the food products fact table both reference the same macronutrient_dimension_id. Since the macronutrient values are the same, only one dimension row needs to be created. This row can then be referenced by any number of food products without duplicating any of the macronutrient information!

How does this help with filtering?

Whenever a food product is added, we either use an existing dimension row that has matching macronutrient information or we create a new dimension row. This guarantees two things:

The macronutrient dimension table will contain all macronutrient combinations across all food products.
The macronutrient dimension table will most likely have fewer rows than the food products table.

Therefore, instead of querying the food products table for filter combinations, we can instead query the smaller macronutrient dimension table which will result in faster queries for our conditional filter!

SELECT 
  SUM(macronutrient_dimensions.carbs) AS carb_count,
  SUM(macronutrient_dimensions.proteins) AS protein_count,
  SUM(macronutrient_dimensions.fats) AS fat_count,
  SUM(macronutrient_dimensions.fiber) AS fiber_count,
  SUM(macronutrient_dimensions.minerals) AS mineral_count,
  SUM(macronutrient_dimensions.vitamins) AS vitamin_count
FROM macronutrient_dimensions
WHERE proteins=1 AND carbs=1;{ 
  protein_count: 1,
  carb_count: 1,
  fat_count: 0,
  fiber_count: 0,
  mineral_count: 0,
  vitamin_count: 0
}

Conclusion

Faster queries allow for faster filter experimentation. We spend less time sifting through raw data like machines, and, instead, spend more time using our uniquely-human talents to piece together the bigger picture. Conditional filtering combined with a Star schema is just one way to achieve that.

Summary

Filtering is an important tool and worth optimizing.
Conditional filters vastly improve the user experience.
Querying smaller dimension tables allows for faster filter combination validation.

Thanks for reading! We hope that your understanding of database design and filtering optimization has increased.

If you have any questions, feedback, or further ideas on the subject, please leave us a comment!