In this article I’ll present some approaches to SQL query format and coding that I use. Nothing radically different, just some rules that I follow that may improve your development and testing productivity.
During the testing phase, tweaking and commenting out slabs of code is my main activity, so having a structure that enables easy commenting out and modification is really helpful.
When coding, my 3 main rules are:
1. Ensure consistency – indentation and formatting consistency enables faster visual inspection of large files to identify the problematic section of code.
2. Add comments where needed – at the start, explain what you’re trying to achieve; for complex logic, give information that might help the unfamiliar reader; and at each join, summarise the attributes you need from the joined table.
3. Minimise – try to produce the least amount of code that will achieve the objective – the less code written means there is less chance of there being a bug in your code. For example, if you have a query that joins to a table multiple times, a better approach might be to create a sub-select for that table with a group by on the table identifier. I’ve often used this approach to reduce the lines of code in my scripts, with the added benefit of improved query performance.
Development Tool Selection
Finding the best development tool that works for you and your database platform is really important for productivity. I developed databases on Teradata for over 10 years and used Teradata SQL Assistant, mainly due to its syntax highlighting and more importantly, the inbuilt query history, which was a big benefit. I had years of query history stored and readily available to assist colleagues when they needed advice on using a table when I’d worked with it in the past.
I’ve been using Azure and AWS cloud databases as well as Snowflake for many years now, and my current development tool of choice is Microsoft Visual Studio Code. It’s a real pleasure to use VS Code when working with Azure databases as Microsoft’s SQL Server extension has excellent intelli-sense and ‘as-you-type’ error detection, which greatly increases debugging efficiency. Snowflake’s vendor-provided VS Code extension is also quite good, enabling seamless log-in and query history, but as of mid-2024, it doesn’t have ‘as-you-type’ error detection functionality, so it’s a lot more important to format your code cleanly in order to debug the syntax errors that occur. In addition, Snowflake’s syntax errors (again as of mid-2024 so this may improve in future) are quite often really unhelpful, and when working with huge Stored Procedures, I can’t impress enough how important it is to have well-structured code.
Most development tools will have a built-in SQL formatter. There are online tools and even the AI chatbots can do a serviceable job of formatting (provided your employer/client policy allows code to be shared online in this way). If these tools work for you, I’d encourage you to continue using them, as this will provide you with consistency of indenting and formatting. Personally, I’ve not yet found a formatting tool that produces output I like, so for now I’ll continue hand-crafting my SQL Code.
For anyone interested in VS Code SQL Formatters, you might like to try SQLTools, Prettier SQL VSCode and pgFormatter.
Example Query Structure
Select Clause samples:
For a simple select statement with no joins that outputs only a few columns, a query that you’d see in chapter 1 of any Introduction to SQL book or webpage is totally fine:
Here’s a larger query with joins, indenting, newlines, use of table aliases and use of comments.
Two separate approaches are presented in the query below for filtering on ‘Marketing Declined’ customers. I prefer the WHERE NOT EXISTS approach shown at the end as it results in a minor reduction in lines of code. The LEFT JOIN approach also has the potential to introduce multiple records into the output if there was an issue with the customer_attribute table having multiple ‘high dated’ records (records with effective_to_date = ‘9999-12-31’) per Customer, for the ‘MARKETING’ attribute name. Note though that the ROW_NUMBER filter at the end of the query would ultimately avoid multiple records per Customer being output by the query.
Formatting rules exhibited in the above query:
• Keywords capitalisation. Lower-case seems to be the common approach for many developers, however I feel that upper-case keywords help to quickly focus on the main constructs of the statement (SELECT, FROM, JOIN, WHERE).
• Indenting of column names in the select list after the SELECT statement. This helps to visually scan past the columns to find the From clause associated with the Select statement.
• Column/Table name capitalisation. Most databases ignore case except when the field is within delimiters such as double quotes. Personally, I prefer my SQL code to match the column name capitalisation defined within the database DDL, but lower-case seems to be the common approach for many development teams.
• Newline per column in the select list. For a small number of output columns, I don’t bother with newlines, but for large select lists, I always use newlines to enable the easy adding, removing and re-arranging of output columns.
• Leading comma before column name. This allows you to easily comment out columns during query testing by placing a double-hyphen before the comma.
• Case statement indenting. Indenting the When/Then/Else expressions of a Case statement enables easier commenting out and re-arranging. Having the End clause line up vertically with the Case clause can be handy to see the matching elements of the statement and identify where logic and structure errors have occurred.
•Use COALESCE instead of IFNULL, NVL as COALESCE is supported across a wider range of databases.
• Use CASE instead of IIF, ISNULL for derivation logic consistency and support across a wider range of databases.
• While I recommend you use COALESCE and CASE statements for wider support across databases, I strongly recommend that you familiarise yourself with your database’s vendor-specific SQL extensions. For example, Snowflake includes the ILIKE, EXCLUDE, REPLACE and RENAME clauses within the SELECT statement. Vendor-specific SQL extensions can be a huge productivity booster and are a necessity when working with JSON-formatted strings which is common when working with data ingested from SalesForce – For example: Redshift allows casting of JSON strings to Super datatype and Snowflake TRY_PARSE_JSON function both return objects which can be traversed using dot and bracket notation, a major time-saver and massive simplification code for JSON handling within SQL.
• Splitting long lines of code. For column derivations that are really long, break the line up and use indenting on the second and subsequent lines to make it clear where the column derivation ends.
• JOIN vs INNER JOIN, LEFT JOIN vs LEFT OUTER JOIN. I don’t bother with adding the INNER/OUTER keyword. If you need to change a JOIN to a LEFT JOIN or vice versa, it’s easier to modify when you don’t have the unnecessary INNER/OUTER keyword.
• Put Inner Joins first, and all other Join types (Left, Right, Full Outer) after that. Inner Joins that will reduce the records output by your query so they should be done as early as possible. Your database query optimiser may choose to apply filters and joins in a different order, but when you’re testing a query, you want to ensure that the smallest subset of records is produced as quickly as possible, and additional joins should continue to try to minimise those record counts. Likewise, if your query has filter conditions within a Where clause that can be implemented as part of an Inner Join earlier on in your query, then move the filter conditions to the Inner Join in order to reduce row counts earlier in your query.
• Newline for each Join condition. Many SQL formatters will place the ON clause on the same line as the JOIN clause in a query. I prefer the ON clause to be on a new line and indented, and additional join criteria on separate lines also indented, with each line starting with the AND/OR operator. This makes all filter criteria for the Join easy to identify and also easy to comment out the full line when the AND/OR is at the start of the line.
• Avoid unnecessary use of parentheses. Some SQL formatters will add parentheses around the conditions of a JOIN clause. Eg JOIN b ON (A = B and A = C). I prefer not to use parentheses in this way because if you intend to comment out the last filter criteria, you would have to move the closing parenthesis to a new line to avoid it also being commented out. Similarly, I would only use parentheses in expressions when they require the use of both the AND and OR operators to ensure the correct order of expression evaluation. For example:
• Some of my colleagues add “WHERE 1=1” in each of their query WHERE clauses, followed by “real” filter criteria. This enables faster commenting out of the first filter criteria. For example:
• Avoid use of DISTINCT. SELECT DISTINCT queries always make me wary that the previous developer hasn’t analysed their query output to determine whether and why there might be duplicate records in the output. If you are seeing multiple records for an identifier when you don’t expect it, analyse the data further to see where the multiple records are being introduced and whether you require further filter criteria. At some client sites, I’ve seen ETL issues resulting in multiple active records when there should only be one record. When the ETL fix can take months to implement, sometimes it’s most expedient to implement a filter using the window function ROW_NUMBER() OVER (PARTITION BY …. ORDER BY) to filter on row number 1 to avoid these ‘duplicates’. Identifying the correct ORDER BY expression within the ROW_NUMBER() window function means that you can ensure a deterministic output of the most appropriate single record from the original multiple records per identifier.
I hope this article has provided some useful ideas that you might implement in future, and if you have feedback on any of these suggestions or further suggestions of your own, please leave a comment.
While writing this article, I found the following top google hits for ‘SQL Best Practices’ to be good reading:
• SQL Best Practices – Brandon Southern – you’ll notice some differences between his best practices and my rules above. One additional item that I disagree on is for GROUP BY clauses – where the database platform supports it (MS SQL Server doesn’t), I will use column numbers in the group by clause rather than column names – for ease of query modification. Because I often use GROUP BY column numbering, I usually place aggregate functions as the last items in my select lists.
• Best practices for writing SQL queries – metabase.com
• 7 Bad Practices to Avoid When Writing SQL Queries for Better Performance – Abdelrahman Mohamed Allam
Zenon Skuza is a software engineer with many years of experience in the Telecommunications and Banking industries. He enjoys finding good restaurants and travelling with family & friends, and tries to stay fit by jogging weekly at his local ParkRun and the occasional paddle on a surfboard at beaches along Victoria’s Surf Coast.