CARVIEW |
7 Advanced SQL Techniques for Data Manipulation in Data Science
Can SQL be used for advanced data manipulation in data science? It sure can with these seven techniques.

Image by Author
SQL is one of the most used tools in data science for data manipulation. No wonder since SQL was created to query and manipulate data in databases. Because of its purpose, it offers a wide range of techniques for data manipulation.
Some belong to a more advanced spectrum and can be very useful in data science.
1. Subqueries and Correlated Subqueries
An SQL subquery is a query within another (main) query. They are typically used in the SELECT statement but can also be used in INSERT, UPDATE, and DELETE. Subqueries can be used in FROM, WHERE, HAVING, and JOIN clauses.
A correlated subquery is a special type of subquery that depends on the results of the main query.
When to Use Them:Â
- Filtering
- Calculations
- Restructuring data
- Row-by-row data evaluations (correlated subqueries)
- Dynamic lookups (correlated subqueries)
2. Common Table Expressions (CTE)
CTE is a temporary result set that can be referenced in SELECT, INSERT, UPDATE, or DELETE statements. In many cases, they are often nothing but neatly written subqueries. However, one significant difference between them is that CTEs are reusable in the main query, unlike subqueries.
When to Use Them:
- Recursive queries
- When the same ‘subquery’ result needs to be used across multiple steps
- Breaking down complex queries into smaller logical components
3. Recursive Queries
In SQL, recursive queries are written in recursive CTEs. A recursive query references itself, making it perfect for querying hierarchical and graph data structures.
When to Use Them:
- Finding descendants in a hierarchical structure (e.g., organizational chart or a product category tree)
- Calculating hierarchical paths (e.g., finding the reporting chain from an employee to the CEO)
- Generating sequential data
- Traversing graphs (e.g., finding all possible routes between locations in a transportation network)
- Nested totals (e.g., sales per product, product category, and grand total)
4. Window Functions
Window functions allow you to perform calculations across rows related to the current row, with the need to aggregate data.
When to Use Them:
- Ranking
- Moving averages
- Cumulative (or running) totals
- Period-to-period comparison (e.g., month-to-month sales)
- Percentile calculations
- Opening and closing prices
- Time series analysis
5. Set Operators
Set operators are used to combine the results of two or more SELECT queries into a single output. They are:
- UNION: Combines the queries’ outputs and removes duplicate rows.
- UNION ALL: Combines the queries’ outputs, including duplicates.
- INTERSECT: Returns only rows present in all queries’ outputs.
- EXCEPT (or MINUS): Returns only rows appearing in the first query’s output but not others.
When to Use Them:
- Comparing datasets
- Filtering results across multiple tables
6. GROUP BY Extensions
GROUP BY is a standard SQL clause for data aggregation. However, you can perform more complex groupings using these GROUP BY extensions:
- GROUPING SETS: For multiple groupings in a single GROUP BY.
- ROLLUP: For creating subtotals and grand totals in a single query.
- CUBE: For creating all possible combinations of aggregations for columns in GROUP BY, including subtotals for each level and a grand total.
When to Use Them:
- Hierarchical summaries
- Multi-dimensional analysis
- Generating various aggregate views
7. String Functions
Complex data often includes textual fields that require manipulation to extract insights or prepare for analysis. In SQL, many string functions help you with that, such as:
- TRIM(): Removes leading and/or trailing characters from a string.
- REPLACE(): Substitutes all occurrences of a substring within a string with a new substring.
- SUBSTRING(): Extracts a part of a string from a specified starting position and length.
- LIKE: Allows pattern matching within strings using wildcard characters, such as % and _.
- PATINDEX(): Returns the starting position of a pattern within a string or zero if the pattern is not found.
- RegEx: Provides a way to search, match, and manipulate strings based on complex patterns.
- SPLIT_PART(): Splits a string by a delimiter and returns a particular segment based on an index.
- STRING_AGG(): Concatenates values from multiple rows into a single string, separated by a specified delimiter.
When to Use Them:
- Data cleaning
- Pattern matching
- Text parsing and tokenization
- Text aggregation
Conclusion
Consider learning and incorporating these SQL techniques into your data manipulation techniques in data science projects. They will definitely satisfy many of your needs.
Nate Rosidi is a data scientist and in product strategy. He's also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.
- Top 5 Free Resources for Learning Advanced SQL Techniques
- Advanced Feature Selection Techniques for Machine Learning Models
- 10 Advanced Git Techniques
- 3 Research-Driven Advanced Prompting Techniques for LLM Efficiency…
- Top 10 Advanced Data Science SQL Interview Questions You Must Know…
- 10 Pandas One Liners for Data Access, Manipulation, and Management
Latest Posts
- We Benchmarked DuckDB, SQLite, and Pandas on 1M Rows: Here’s What Happened
- Prompt Engineering Templates That Work: 7 Copy-Paste Recipes for LLMs
- A Complete Guide to Seaborn
- 10 Command-Line Tools Every Data Scientist Should Know
- How I Actually Use Statistics as a Data Scientist
- The Lazy Data Scientist’s Guide to Exploratory Data Analysis
Top Posts |
---|
- 5 Fun AI Agent Projects for Absolute Beginners
- How I Actually Use Statistics as a Data Scientist
- The Lazy Data Scientist’s Guide to Exploratory Data Analysis
- Prompt Engineering Templates That Work: 7 Copy-Paste Recipes for LLMs
- We Benchmarked DuckDB, SQLite, and Pandas on 1M Rows: Here’s What Happened
- 10 Command-Line Tools Every Data Scientist Should Know
- A Gentle Introduction to TypeScript for Python Programmers
- A Complete Guide to Seaborn
- From Excel to Python: 7 Steps Analysts Can Take Today
- A Gentle Introduction to MCP Servers and Clients