SQL - Analysis of Iowa Liquor Sales dataset
This blog post is an introduction to the analysis of the Iowa Liquor Sales dataset using SQL which illustrates a right way and a wrong way for queries. The Iowa Liquor Sales dataset is a publicly available dataset. This extract used below is not the entire dataset but a sample of records stored in the General Assembly server(also available publicly). There are four tables I have used- counties, products, sales and stores. This data model has been borrowed from the General Assembly course content.
This post starts with one question and I will be using a similar approach to explain a right way and one common mistake in the way a query is approached.
Q1 - What is the category id and name with highest amount of sales?
-- Sales is defined as btl_price*bottle_qty
SELECT category, category_name, CAST(sum(btl_price*bottle_qty) as DECIMAL) as highest_salesfrom salesgroup by category, category_nameorder by category, category_name desclimit 1;
What is the order of operations in the above query(or any query)?
STEP 1: (FROM) The from clause is executed first and since we have only one table, the optimizer points to sales table.
STEP2 : (WHERE)The where statement is executed and since the query does not have a where clause, no filters are applied to the data in the database
STEP3: (GROUP BY)The group by statement is executed next and the tables is partitioned into a combination of category and category names (here the relationship between category and category names is 1:1)
STEP4: (HAVING) The having clause is applied next to filter rows after the group by. Now, one might ask what is the difference between a HAVING and a WHERE clause. WHERE filters records. before the group by is applied. So, for instance, if we wanted the highest amount of sales in each category considering only between certain dates, then WHERE is used. HAVING is not applicable here and will be explained later.
This also means that WHERE cannot be used with aggregate functions such as SUM,COUNT, AVG, MIN and MAX. More on this later.
STEP5 : (SELECT) SELECT statement is executed
STEP6: (ORDER BY) ORDER BY sorts the order of the result based on the keys category and category name
A common mistake in approach is highlighted in RED
SELECT category, category_name, max(btl_price*bottle_qty) as max_sales from sales
group by category, category_name
order by max_sales desc
limit 1;
The above code is a very common mistake for the given question because it selects the sales table, groups by the category and then selects the max(btl_price*bottle_qty) within that category, then orders by max_sales This gives the row with maximum sales within each category and picks the highest among them.