A primary key is a constraint that uniquely identifies each record in a table. Primary keys must contain unique values and cannot contain NULL
values. A table can have only one primary key, which may consist of single or multiple fields.
Example:
CREATE TABLE Users ( UserID INT PRIMARY KEY, UserName VARCHAR(50) NOT NULL, Email VARCHAR(100) UNIQUE );
In this table, UserID
is the primary key.
A foreign key is a key used to link two tables together. It's a field (or collection of fields) in one table that refers to the PRIMARY KEY
in another table. The table with the foreign key is called the child table, and the table with the primary key is called the referenced or parent table.
Example:
Let's create an Orders
table with a foreign key that references the Users
table.
CREATE TABLE Orders ( OrderID INT PRIMARY KEY, OrderDate DATE, UserID INT, FOREIGN KEY (UserID) REFERENCES Users(UserID) );
Here, UserID
in the Orders
table is a foreign key that points to the UserID
primary key in the Users
table, establishing a link between the two tables.
Feature | PRIMARY KEY | UNIQUE KEY |
---|---|---|
Null Values | Does not allow null values | Allows one null value |
Number per Table | Only one per table | Can have multiple per table |
Clustered Index | Creates a clustered index by default | Creates a non-clustered index by default |
A database is a collection of structured data, stored and accessed electronically. It allows for efficient data management, retrieval, and updating.
A Database Management System (DBMS) is software that interacts with end-users, applications, and the database itself to capture and analyze the data. A DBMS allows users to create, read, update, and delete data in a database.
A table is a data structure that organizes information into rows and columns. Each column represents a specific attribute, and each row represents a record.
Example:
CREATE TABLE Users ( UserID INT PRIMARY KEY, UserName VARCHAR(50) NOT NULL, Email VARCHAR(100) UNIQUE );
A view is a virtual table based on the result-set of an SQL statement. A view contains rows and columns, just like a real table. The fields in a view are fields from one or more real tables in the database. Views can be used to simplify complex queries, provide a layer of security by restricting access to data, and present data in a different format.
Example:
CREATE VIEW UserOrders AS SELECT u.UserName, o.OrderDate FROM Users u JOIN Orders o ON u.UserID = o.UserID;
Now, you can query this view like a regular table:
SELECT * FROM UserOrders;
A database schema is the skeleton structure that represents the logical view of the entire database. It defines how the data is organized and how the relations among them are associated.
Feature | Table | View |
---|---|---|
Data Storage | Stores data physically in the database | Does not store data physically |
Nature | A real, physical entity | A virtual table based on a query |
Modification | Can be modified (INSERT, UPDATE, DELETE) | Limited modification capabilities |
Constraints are rules that are enforced on data columns on a table. These are used to limit the type of data that can go into a table. This ensures the accuracy and reliability of the data in the database.
Some common constraints are:
NOT NULL
: Ensures that a column cannot have a NULL
value.UNIQUE
: Ensures that all values in a column are different.PRIMARY KEY
: A combination of a NOT NULL
and UNIQUE
. Uniquely identifies each row in a table.FOREIGN KEY
: Uniquely identifies a row/record in another table.CHECK
: Ensures that all values in a column satisfy a specific condition.DEFAULT
: Sets a default value for a column when no value is specified.Example:
CREATE TABLE Products ( ProductID INT PRIMARY KEY, ProductName VARCHAR(100) NOT NULL, SKU VARCHAR(50) UNIQUE, Price DECIMAL(10, 2) CHECK (Price > 0), Status VARCHAR(20) DEFAULT 'In Stock' );
This table includes:
PRIMARY KEY
on ProductID
.NOT NULL
constraint on ProductName
.UNIQUE
constraint on SKU
.CHECK
constraint to ensure Price
is positive.DEFAULT
value for the Status
column.Data integrity is the maintenance of, and the assurance of the accuracy and consistency of, data over its entire life-cycle. It is a critical aspect to the design, implementation and usage of any system which stores, processes, or retrieves data.
SQL commands are divided into four main categories:
CREATE
: Creates database objects like tables, views, or indexes.
CREATE TABLE Employees (ID INT, Name VARCHAR(100));
ALTER
: Modifies the structure of an existing database object.
ALTER TABLE Employees ADD Email VARCHAR(100);
DROP
: Deletes an entire database object.
DROP TABLE Employees;
TRUNCATE
: Removes all records from a table, but not the table itself. This is faster than DELETE
and cannot be rolled back.
TRUNCATE TABLE Employees;
SELECT
: Retrieves data from a table.
SELECT Name, Email FROM Employees WHERE ID = 1;
INSERT
: Adds new rows of data into a table.
INSERT INTO Employees (ID, Name, Email) VALUES (1, 'John Doe', 'john.doe@example.com');
UPDATE
: Modifies existing records in a table.
UPDATE Employees SET Email = 'new.email@example.com' WHERE ID = 1;
DELETE
: Removes existing records from a table.
DELETE FROM Employees WHERE ID = 1;
GRANT
: Gives a specific user permissions to perform certain tasks.
GRANT SELECT, UPDATE ON Employees TO 'some_user'@'localhost';
REVOKE
: Removes permissions from a user.
REVOKE UPDATE ON Employees FROM 'some_user'@'localhost';
COMMIT
: Saves all the changes made in the current transaction.
COMMIT;
ROLLBACK
: Undoes all the changes made in the current transaction.
ROLLBACK;
SAVEPOINT
: Sets a point within a transaction to which you can later roll back.
SAVEPOINT my_savepoint; ROLLBACK TO my_savepoint;
Command | DELETE | TRUNCATE | DROP |
---|---|---|---|
Type | DML | DDL | DDL |
Scope | Removes one or more rows from a table | Removes all rows from a table | Removes the entire table |
WHERE clause | Can be used | Cannot be used | Cannot be used |
Rollback | Can be rolled back | Cannot be rolled back | Cannot be rolled back |
Speed | Slower | Faster | Fastest |
WHERE
clause: Used to filter records before any groupings are made. It operates on a row-by-row basis.SELECT * FROM Employees WHERE DepartmentID = 101;
HAVING
clause: Used to filter groups after the GROUP BY
clause has been applied. It operates on aggregated data.Suppose you want to find departments with more than 2 employees.
SELECT DepartmentName, COUNT(*) AS EmployeeCount FROM Employees GROUP BY DepartmentName HAVING COUNT(*) > 2;
The GROUP BY
clause is used with aggregate functions (like COUNT
, MAX
, MIN
, SUM
, AVG
) to group the result-set by one or more columns.
SELECT DepartmentName, COUNT(*) AS EmployeeCount FROM Employees GROUP BY DepartmentName;
The ORDER BY
clause is used to sort the result-set in ascending (ASC
) or descending (DESC
) order. By default, it sorts the records in ascending order.
SELECT * FROM Employees ORDER BY Name ASC;
The MERGE
statement (also known as UPSERT
) is used to perform INSERT
, UPDATE
, or DELETE
operations on a target table based on the results of a join with a source table.
For example, if you want to insert a new employee into the Employees
table, but if the employee already exists, you want to update the employee's information, you can use the MERGE
statement.
MERGE INTO Employees e USING (SELECT 1 AS ID, 'John Doe' AS Name, 101 AS DepartmentID FROM DUAL) s ON (e.ID = s.ID) WHEN MATCHED THEN UPDATE SET e.Name = s.Name, e.DepartmentID = s.DepartmentID WHEN NOT MATCHED THEN INSERT (ID, Name, DepartmentID) VALUES (s.ID, s.Name, s.DepartmentID);
A SAVEPOINT
is a point in a transaction that you can roll back to without rolling back the entire transaction. This is useful for implementing complex error handling logic.
GRANT
: Gives a user specific permissions on database objects.REVOKE
: Removes user permissions on database objects.The ALTER TABLE
statement is used to add, delete, or modify columns in an existing table. It can also be used to add and drop various constraints on an existing table.
ALTER TABLE Employees ADD Email VARCHAR(100);
UNION
: Combines the result sets of two or more SELECT
statements and removes duplicate rows.INTERSECT
: Returns the common rows from two or more SELECT
statements.SELECT * FROM Employees WHERE DepartmentID = 101 UNION SELECT * FROM Employees WHERE DepartmentID = 102;
SELECT * FROM Employees WHERE DepartmentID = 101 INTERSECT SELECT * FROM Employees WHERE DepartmentID = 102;
The LIKE
operator is used in a WHERE
clause to search for a specified pattern in a column.
%
: Represents zero, one, or multiple characters._
: Represents a single character.SELECT * FROM Employees WHERE Name LIKE 'A%';
The CASE
statement goes through conditions and returns a value when the first condition is met (like an if-then-else statement). Once a condition is true, it will stop reading and return the result. If no conditions are true, it returns the value in the ELSE
clause.
SELECT EmployeeName, CASE WHEN DepartmentID = 101 THEN 'HR' WHEN DepartmentID = 102 THEN 'Engineering' ELSE 'Marketing' END AS DepartmentName FROM Employees;
SQL JOIN
clauses are used to combine rows from two or more tables based on a related column between them. It's the primary way to query data that has been normalized across multiple tables.
Let's use two sample tables to illustrate the different types of joins: Employees
and Departments
.
Employees
Table:
| EmployeeID | EmployeeName | DepartmentID |
|------------|--------------|--------------|
| 1 | Alice | 101 |
| 2 | Bob | 102 |
| 3 | Charlie | 101 |
| 4 | David | NULL |
Departments
Table:
| DepartmentID | DepartmentName |
|--------------|----------------|
| 101 | HR |
| 102 | Engineering |
| 103 | Marketing |
INNER JOIN
Returns records that have matching values in both tables. This is the most common type of join.
Query:
SELECT e.EmployeeName, d.DepartmentName FROM Employees e INNER JOIN Departments d ON e.DepartmentID = d.DepartmentID;
Result:
| EmployeeName | DepartmentName |
|--------------|----------------|
| Alice | HR |
| Bob | Engineering |
| Charlie | HR |
Notice that David (no department) and the Marketing department (no employees) are not included.
LEFT JOIN
(or LEFT OUTER JOIN
)Returns all records from the left table (Employees
), and the matched records from the right table (Departments
). The result is NULL
from the right side if there is no match.
Query:
SELECT e.EmployeeName, d.DepartmentName FROM Employees e LEFT JOIN Departments d ON e.DepartmentID = d.DepartmentID;
Result:
| EmployeeName | DepartmentName |
|--------------|----------------|
| Alice | HR |
| Bob | Engineering |
| Charlie | HR |
| David | NULL |
Notice that David is included, as he is in the left table.
RIGHT JOIN
(or RIGHT OUTER JOIN
)Returns all records from the right table (Departments
), and the matched records from the left table (Employees
). The result is NULL
from the left side if there is no match.
Query:
SELECT e.EmployeeName, d.DepartmentName FROM Employees e RIGHT JOIN Departments d ON e.DepartmentID = d.DepartmentID;
Result:
| EmployeeName | DepartmentName |
|--------------|----------------|
| Alice | HR |
| Bob | Engineering |
| Charlie | HR |
| NULL | Marketing |
Notice that the Marketing department is included, as it is in the right table.
FULL OUTER JOIN
Returns all records when there is a match in either the left or the right table. It's a combination of LEFT JOIN
and RIGHT JOIN
.
Query:
SELECT e.EmployeeName, d.DepartmentName FROM Employees e FULL OUTER JOIN Departments d ON e.DepartmentID = d.DepartmentID;
Result:
| EmployeeName | DepartmentName |
|--------------|----------------|
| Alice | HR |
| Bob | Engineering |
| Charlie | HR |
| David | NULL |
| NULL | Marketing |
Notice that both David and the Marketing department are included.
CROSS JOIN
Returns the Cartesian product of the two tables, meaning it returns all possible combinations of rows from both tables.
Query:
SELECT e.EmployeeName, d.DepartmentName FROM Employees e CROSS JOIN Departments d;
Result: (A few example rows)
| EmployeeName | DepartmentName |
|--------------|----------------|
| Alice | HR |
| Alice | Engineering |
| Alice | Marketing |
| Bob | HR |
| Bob | Engineering |
| ... | ... |
This join type is used less frequently and does not require an ON
clause.
UNION
: Combines the result sets of two or more SELECT
statements and removes duplicate rows.UNION ALL
: Combines the result sets of two or more SELECT
statements and includes all rows, including duplicates.UNION ALL
is faster than UNION
because it doesn't have to check for duplicates.
SELECT * FROM Employees WHERE DepartmentID = 101 UNION SELECT * FROM Employees WHERE DepartmentID = 102;
SELECT * FROM Employees WHERE DepartmentID = 101 UNION ALL SELECT * FROM Employees WHERE DepartmentID = 102;
A self-join is a regular join, but the table is joined with itself. This is useful for querying hierarchical data or comparing rows within the same table.
SELECT e1.EmployeeName, e2.EmployeeName FROM Employees e1 JOIN Employees e2 ON e1.ManagerID = e2.EmployeeID;
A natural join is a type of join that creates an implicit join clause for you based on the common columns in the two tables being joined. Common columns are columns that have the same name in both tables.
INNER JOIN
: Returns only the rows that have matching values in both tables.LEFT JOIN
: Returns all rows from the left table, and the matched rows from the right table. If there is no match, the result is NULL
on the right side.A CROSS JOIN
returns the Cartesian product of two tables. This means it returns all possible combinations of rows from both tables. It is used when you want to create a combination of every row from two tables.
The INTERSECT
operator is used to return the results of two or more SELECT
statements. However, it only returns the rows that are common to all SELECT
statements.
The EXCEPT
operator is used to return the results of the first SELECT
statement that are not present in the result of the second SELECT
statement.
Yes, you can join multiple tables in a single query. You can do this by chaining JOIN
clauses. For example, you can join TableA
with TableB
, and then join the result with TableC
.
SELECT * FROM TableA JOIN TableB ON TableA.ID = TableB.ID JOIN TableC ON TableB.ID = TableC.ID;
SELECT * FROM TableA WHERE ID IN (SELECT ID FROM TableB);
INNER JOIN
whenever possible, as it is the most efficient type of join.ON
clause to specify the join condition.A FULL OUTER JOIN
returns all rows from both tables, with NULL
values in columns where there is no match. It is a combination of LEFT JOIN
and RIGHT JOIN
.
Database normalization is the process of organizing the columns and tables of a relational database to minimize data redundancy. It involves dividing a database into two or more tables and defining relationships between the tables. The goal is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database using defined relationships.
Normal Forms:
Denormalization is the process of intentionally introducing redundancy into a table to improve query performance. This is often done in data warehousing for reporting purposes.
Let's walk through the normalization process with a practical example. Imagine we have a single table to store order information for a small shop.
Unnormalized Table (OrderData
)
This table contains repeating groups and redundant data.
OrderID | CustomerDetails | ProductDetails |
---|---|---|
1 | 101, John Doe | (P01, Apple, 10), (P02, Banana, 5) |
2 | 102, Jane Smith | (P03, Cherry, 20) |
This design has several issues:
CustomerDetails
and ProductDetails
columns contain multiple values.CustomerName
is repeated for every order the customer makes.Rule: Each cell must hold a single, atomic value, and each record must be unique.
To achieve 1NF, we break the repeating groups into separate rows. This gives us a single record for each product in an order.
Orders_1NF
Table (Primary Key: (OrderID, ProductID)
)
OrderID | CustomerID | CustomerName | ProductID | ProductName | Quantity |
---|---|---|---|---|---|
1 | 101 | John Doe | P01 | Apple | 10 |
1 | 101 | John Doe | P02 | Banana | 5 |
2 | 102 | Jane Smith | P03 | Cherry | 20 |
The table is now in 1NF, but it still has issues:
CustomerName
is repeated for each product in an order.CustomerName
depends only on CustomerID
(which depends on OrderID
), not the full primary key (OrderID, ProductID)
. ProductName
depends only on ProductID
.Rule: The table must be in 1NF, and every non-key attribute must be fully dependent on the entire primary key. This rule applies to tables with composite primary keys.
We split the Orders_1NF
table to remove these partial dependencies.
OrderItems
Table: Contains information specific to each product in an order.
(Primary Key: (OrderID, ProductID)
)
OrderID | ProductID | Quantity |
---|---|---|
1 | P01 | 10 |
1 | P02 | 5 |
2 | P03 | 20 |
Orders
Table: Contains information about the order itself.
(Primary Key: OrderID
)
OrderID | CustomerID | CustomerName |
---|---|---|
1 | 101 | John Doe |
2 | 102 | Jane Smith |
Products
Table: Contains product information.
(Primary Key: ProductID
)
ProductID | ProductName |
---|---|
P01 | Apple |
P02 | Banana |
P03 | Cherry |
Now we are in 2NF, but there's still a problem in the Orders
table:
CustomerName
depends on CustomerID
, which is not the primary key. The dependency is OrderID
-> CustomerID
-> CustomerName
.Rule: The table must be in 2NF, and there should be no transitive dependencies (where a non-key attribute depends on another non-key attribute).
We split the Orders
table to remove the transitive dependency.
Customers
Table:
(Primary Key: CustomerID
)
CustomerID | CustomerName |
---|---|
101 | John Doe |
102 | Jane Smith |
Orders_Final
Table:
(Primary Key: OrderID
, Foreign Key: CustomerID
)
OrderID | CustomerID |
---|---|
1 | 101 |
2 | 102 |
After normalization, we have four separate, well-structured tables.
Customers
CREATE TABLE Customers ( CustomerID INT PRIMARY KEY, CustomerName VARCHAR(100) );
Products
CREATE TABLE Products ( ProductID VARCHAR(10) PRIMARY KEY, ProductName VARCHAR(100) );
Orders
CREATE TABLE Orders ( OrderID INT PRIMARY KEY, CustomerID INT, FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID) );
OrderItems
CREATE TABLE OrderItems ( OrderID INT, ProductID VARCHAR(10), Quantity INT, PRIMARY KEY (OrderID, ProductID), FOREIGN KEY (OrderID) REFERENCES Orders(OrderID), FOREIGN KEY (ProductID) REFERENCES Products(ProductID) );
This design minimizes redundancy and prevents data anomalies.
Boyce-Codd Normal Form (BCNF) is a higher version of 3NF. A table is in BCNF if for every non-trivial functional dependency X -> Y
, X
is a superkey.
There are three types of relationships in a database:
A data model is a conceptual representation of data objects, the associations between different data objects, and the rules. Data models are used to define how data is stored, connected, accessed, and processed within a database management system.
An Entity-Relationship (ER) diagram is a flowchart that illustrates how "entities" such as people, objects, or concepts relate to each other within a system. ER diagrams are most often used to design or debug relational databases.
The main purpose of normalization is to:
A composite key is a primary key that consists of two or more columns. Each column in the composite key may not be unique by itself, but the combination of the columns is unique.
An attribute is a property or characteristic of an entity. In a relational database, attributes are represented by columns in a table.
A relationship is an association between two or more entities. In a relational database, relationships are established by using foreign keys.
An index is a special lookup table that the database search engine can use to speed up data retrieval. Simply put, an index is a pointer to data in a table. An index in a database is very similar to an index in the back of a book.
Creating an index on a column creates another data structure which holds the column value, and a pointer to the record it relates to. This index structure is then sorted, allowing for fast searches.
Types of Indexes:
PRIMARY KEY
, a clustered index is often created by default on that column. A table can have only one clustered index.-- Creates a simple non-clustered index on the LastName column CREATE INDEX IX_Employees_LastName ON Employees(LastName);
-- Ensures that no two employees have the same email CREATE UNIQUE INDEX UQ_Employees_Email ON Employees(Email);
-- Creates an index on both LastName and FirstName CREATE INDEX IX_Employees_LastName_FirstName ON Employees(LastName, FirstName);
Query optimization is the process of choosing the most efficient way to execute a SQL statement. The database's query optimizer generates several execution plans for a query and selects the one with the lowest estimated cost.
An execution plan is a sequence of steps that the database's query optimizer generates to access data in the database. It shows how the database will execute a query, including the order in which tables are accessed, the join methods used, and the access methods (e.g., index scan, index seek).
Index Scan (or Table Scan): If you run a query like SELECT * FROM Employees WHERE LastName = 'Smith';
and there is no index on the LastName
column, the database has no choice but to check every single row in the table to see if the last name is 'Smith'. This is called a full table scan. It's the slowest way to find data, like reading a book from cover to cover to find a single word. If there is an index, but it's not selective enough, the optimizer might still choose to scan the entire index, which is an index scan.
Index Seek: Now, let's say you create an index on the LastName
column.
CREATE INDEX IX_Employees_LastName ON Employees(LastName);
When you run the same query again, the database's query optimizer will likely use the index. It will perform an index seek, which uses the sorted B-tree structure of the index to navigate directly to the rows where LastName
is 'Smith'. This is incredibly fast and efficient, like using the index at the back of a book.
Operation | Analogy | Performance |
---|---|---|
Table Scan | Reading a book from the first page to the last | Very Slow |
Index Seek | Using the index at the back of a book | Very Fast |
A covering index is a special type of index that contains all the columns needed to satisfy a query's SELECT
, JOIN
, and WHERE
clauses. Because the index itself contains all the necessary data, the database doesn't need to perform an extra lookup into the actual table data.
Example: Consider this query:
SELECT EmployeeID, LastName, FirstName FROM Employees WHERE LastName = 'Jones';
Standard Index: If you only have an index on LastName
, the database will perform an index seek to find all the 'Jones' records. However, it then needs to perform an additional step (a "key lookup" or "bookmark lookup") to go back to the table to retrieve the EmployeeID
and FirstName
for each of those rows.
Covering Index: To optimize this, you can create a covering index that includes all the required columns.
-- The INCLUDE clause adds non-key columns to the index CREATE INDEX IX_Covering_LastName ON Employees(LastName) INCLUDE (EmployeeID, FirstName);
Now, when you run the query, the database can get everything it needs (EmployeeID
, LastName
, FirstName
) directly from the index pages. It never has to touch the main table data, which can lead to a significant performance boost.
Indexes can slow down INSERT
, UPDATE
, and DELETE
operations because the database has to update the indexes as well as the table data. Therefore, it's important to have a balance between the number of indexes on a table and the performance of data modification operations.
Statistics are objects that contain statistical information about the distribution of values in one or more columns of a table or indexed view. The query optimizer uses these statistics to estimate the cardinality, or number of rows, in the query result.
Parameter sniffing is a feature of SQL Server where the query optimizer "sniffs" the initial values of parameters in a stored procedure and creates an execution plan based on those values. This can sometimes lead to performance problems if the initial parameter values are not representative of the typical values.
WHERE
clauses and join conditions.SELECT *
. Instead, specify the columns you need.JOIN
s instead of subqueries where possible.EXISTS
instead of IN
for subqueries.WHERE
clauses.A filtered index is a type of non-clustered index that is created on a subset of rows in a table. This can improve query performance, reduce index maintenance costs, and reduce index storage costs compared with full-table indexes.
A heap is a table without a clustered index. Data rows are not stored in any particular order.
ACID is an acronym for Atomicity, Consistency, Isolation, and Durability. These are the four properties of a transaction that ensure data integrity.
Transaction isolation levels define the degree to which one transaction must be isolated from the data modifications made by other transactions. The SQL standard defines four levels:
Read Uncommitted
: The lowest level. A transaction can read data that is not yet committed by other transactions ("dirty reads").Read Committed
: Prevents dirty reads. A transaction can only read data that has been committed. This is the default level for many databases (e.g., PostgreSQL, SQL Server).Repeatable Read
: Prevents dirty reads and non-repeatable reads. It ensures that if a transaction reads a row once, it will read the same data if it reads that row again.Serializable
: The highest level. Prevents dirty reads, non-repeatable reads, and phantom reads. It does this by ensuring that transactions execute as if they were running one after another (serially), not concurrently. This level has the highest performance overhead.Here's a summary of which isolation level prevents which concurrency problem:
Isolation Level | Dirty Read | Non-Repeatable Read | Phantom Read |
---|---|---|---|
Read Uncommitted | Possible | Possible | Possible |
Read Committed | No | Possible | Possible |
Repeatable Read | No | No | Possible |
Serializable | No | No | No |
Setting the isolation level (example for SQL Server):
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ; BEGIN TRANSACTION; -- Your SQL statements here COMMIT;
A transaction is a single logical unit of work that consists of one or more SQL statements. It's treated as an "all or nothing" operation. If any part of the transaction fails, the entire transaction is rolled back, and the database is left in the state it was in before the transaction started.
Think of transferring money from a savings account to a checking account. This requires two operations:
If the debit succeeds but the credit fails, you've lost money. A transaction wraps these two operations into a single atomic unit, ensuring that either both succeed or neither does.
Example:
-- Let's assume we have an Accounts table with AccountID and Balance BEGIN TRANSACTION; -- 1. Debit $100 from savings (AccountID 1) UPDATE Accounts SET Balance = Balance - 100 WHERE AccountID = 1; -- 2. Credit $100 to checking (AccountID 2) UPDATE Accounts SET Balance = Balance + 100 WHERE AccountID = 2; -- Check for any errors. If everything is okay, commit the transaction. -- If there was an error, we would issue a ROLLBACK. COMMIT; -- To undo the changes before committing: -- ROLLBACK;
A deadlock is a situation where two or more transactions are waiting for each other to release locks, causing a stalemate. Deadlocks can be prevented by:
Locking is a mechanism used by database management systems to protect the integrity of data. When a transaction accesses a piece of data, it can acquire a lock on that data to prevent other transactions from modifying it.
A dirty read occurs when a transaction reads data that has been modified by another transaction but not yet committed. This is problematic because the first transaction might roll back its changes, meaning the data that was read is now invalid.
Scenario:
This is prevented by the Read Committed
isolation level and higher.
A non-repeatable read occurs when a transaction reads the same row twice but gets different data each time because another transaction committed an update in between the two reads.
Scenario:
This is prevented by the Repeatable Read
isolation level and higher.
A phantom read occurs when a transaction runs a query twice and the second result includes new rows that weren't there the first time. This happens when another transaction inserts new rows that match the query's WHERE
clause and commits.
Scenario:
SELECT COUNT(*) FROM Employees WHERE Department = 'Sales';
It gets a result of 15.Sales
department and commits.COUNT(*)
query again. It now gets a result of 16.This is the most difficult concurrency issue to prevent and is handled by the Serializable
isolation level.
A SAVEPOINT
is a mechanism that allows you to roll back a transaction to a specific point without rolling back the entire transaction. This is useful for implementing complex logic where you might need to undo parts of the transaction based on certain conditions.
Example: Imagine you are inserting records for a new order. You insert the main order record, and then several order items. If one of the items is invalid, you only want to undo the insertion of that specific item, not the whole order.
BEGIN TRANSACTION; -- Insert the main order record INSERT INTO Orders (OrderID, CustomerID) VALUES (101, 77); -- Create a savepoint before adding items SAVEPOINT BeforeItems; -- Try to insert the first item INSERT INTO OrderItems (OrderID, ProductID, Quantity) VALUES (101, 'PROD-A', 10); -- Let's say the second item is invalid (e.g., 'PROD-X' does not exist) -- An error would occur here. Let's simulate catching it. -- If an error is detected, roll back to the savepoint ROLLBACK TO BeforeItems; -- You can now continue the transaction, perhaps trying a different item -- or just committing the main order record. -- Commit the valid parts of the transaction COMMIT;
A two-phase commit is a protocol used to ensure that all participating databases in a distributed transaction either all commit or all roll back. It consists of two phases:
BEGIN TRANSACTION
statement and ended with a COMMIT
or ROLLBACK
statement.A subquery, also known as an inner query or nested query, is a query within another SQL query and embedded inside the WHERE
clause.
A Common Table Expression (CTE) is a temporary, named result set that you can reference within a SELECT
, INSERT
, UPDATE
, or DELETE
statement. CTEs are defined using the WITH
clause and are used to break down complex queries into simpler, more readable logical blocks. A CTE is only valid in the query that it is defined for.
Example: Let's find all employees who earn more than the average salary of their respective departments.
-- Sample Data: -- Employees (ID, Name, Department, Salary) -- Departments (ID, Name) WITH DepartmentAvgSalary AS ( -- First, calculate the average salary for each department SELECT Department, AVG(Salary) as AvgSalary FROM Employees GROUP BY Department ) -- Then, join the CTE back to the Employees table SELECT e.Name, e.Salary, d.AvgSalary FROM Employees e JOIN DepartmentAvgSalary d ON e.Department = d.Department WHERE e.Salary > d.AvgSalary;
This makes the logic much clearer than nesting subqueries.
Window functions perform a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row.
A recursive CTE is a common table expression (CTE) that references itself. It is used to query hierarchical data, such as organizational charts, bill of materials, or file system trees.
A recursive CTE has three parts:
UNION ALL
: The operator that combines the anchor with the recursive member.Example:
Let's find the entire reporting chain for an employee named 'Alice'.
Assume an Employees
table: (EmployeeID, Name, ManagerID)
WITH RECURSIVE EmployeeHierarchy AS ( -- 1. Anchor Member: Select the starting employee (Alice) SELECT EmployeeID, Name, ManagerID, 0 as Level FROM Employees WHERE Name = 'Alice' UNION ALL -- 2. Recursive Member: Join Employees to the CTE SELECT e.EmployeeID, e.Name, e.ManagerID, eh.Level + 1 FROM Employees e INNER JOIN EmployeeHierarchy eh ON e.ManagerID = eh.EmployeeID ) SELECT * FROM EmployeeHierarchy;
Note: Some SQL dialects use WITH RECURSIVE
, while others (like SQL Server) do not require the RECURSIVE
keyword.
These are all window functions used for ranking rows, but they handle ties differently. Let's see them in action.
Scenario: Rank employees by salary within each department.
Assume an Employees
table with Name
, Department
, Salary
.
Query:
SELECT Name, Department, Salary, RANK() OVER (PARTITION BY Department ORDER BY Salary DESC) as RankNum, DENSE_RANK() OVER (PARTITION BY Department ORDER BY Salary DESC) as DenseRankNum, ROW_NUMBER() OVER (PARTITION BY Department ORDER BY Salary DESC) as RowNum FROM Employees;
Result: Let's look at the 'Sales' department from the results.
| Name | Department | Salary | RankNum | DenseRankNum | RowNum |
|---------|------------|--------|---------|--------------|--------|
| Alice | Sales | 90000 | 1 | 1 | 1 |
| Bob | Sales | 85000 | 2 | 2 | 2 |
| Charlie | Sales | 85000 | 2 | 2 | 3 |
| David | Sales | 70000 | 4 | 3 | 4 |
Explanation:
ROW_NUMBER()
: Assigns a unique, sequential number to each row. It doesn't care about ties. (1, 2, 3, 4)RANK()
: Assigns the same rank to tied rows. It then skips the next rank(s). Notice how it goes from rank 2 to rank 4 because two employees are tied for 2nd place. (1, 2, 2, 4)DENSE_RANK()
: Also assigns the same rank to tied rows, but it does not skip any ranks. The next rank after the tie is 3. (1, 2, 2, 3)LEAD()
and LAG()
are window functions that give you access to a row at a specific physical offset which comes before or after the current row.
LAG()
: "Looks back" at the previous row.LEAD()
: "Looks forward" to the next row.Scenario: For each sale, show the sale amount from the previous and next sale for that same customer.
Assume a Sales
table with Customer
, SaleDate
, SaleAmount
.
Query:
SELECT Customer, SaleDate, SaleAmount, LAG(SaleAmount, 1, 0) OVER (PARTITION BY Customer ORDER BY SaleDate) as PreviousSale, LEAD(SaleAmount, 1, 0) OVER (PARTITION BY Customer ORDER BY SaleDate) as NextSale FROM Sales;
The PARTITION BY
clause treats each customer's sales history independently. The 1
is the offset (how many rows to look back/forward), and the 0
is the default value if no previous/next row exists.
Result:
| Customer | SaleDate | SaleAmount | PreviousSale | NextSale |
|----------|------------|------------|--------------|----------|
| Alice | 2023-01-15 | 100 | 0 | 150 |
| Alice | 2023-02-10 | 150 | 100 | 120 |
| Alice | 2023-03-05 | 120 | 150 | 0 |
| Bob | 2023-01-20 | 200 | 0 | 250 |
| Bob | 2023-02-25 | 250 | 200 | 0 |
This is very useful for calculating period-over-period changes, like month-over-month growth.
Pivoting is an operation that transforms data from a row-level representation to a columnar representation. It "rotates" rows into columns. For example, you might want to turn a list of monthly sales into a single row with a column for each month.
While some databases have a PIVOT
keyword (e.g., SQL Server), a more standard and portable way to do this is with conditional aggregation using CASE
statements.
Scenario: Transform a list of quarterly product sales into a summary table.
Assume a QuarterlySales
table: (Product, Quarter, Sales)
Original Data:
| Product | Quarter | Sales |
|----------|---------|-------|
| Apple | Q1 | 1000 |
| Apple | Q2 | 1200 |
| Banana | Q1 | 800 |
| Banana | Q2 | 900 |
Query using Conditional Aggregation:
SELECT Product, SUM(CASE WHEN Quarter = 'Q1' THEN Sales ELSE 0 END) AS Q1_Sales, SUM(CASE WHEN Quarter = 'Q2' THEN Sales ELSE 0 END) AS Q2_Sales, SUM(CASE WHEN Quarter = 'Q3' THEN Sales ELSE 0 END) AS Q3_Sales, SUM(CASE WHEN Quarter = 'Q4' THEN Sales ELSE 0 END) AS Q4_Sales FROM QuarterlySales GROUP BY Product;
Pivoted Result:
| Product | Q1_Sales | Q2_Sales | Q3_Sales | Q4_Sales |
|----------|----------|----------|----------|----------|
| Apple | 1000 | 1200 | 0 | 0 |
| Banana | 800 | 900 | 0 | 0 |
An unpivot is the reverse operation of a pivot. It transforms data from a columnar representation back to a row-level representation.
Scenario: Let's unpivot the result from the previous example back to its original format.
Query using UNION ALL
:
This is one of the most common ways to unpivot data in a standard way.
SELECT Product, 'Q1' as Quarter, Q1_Sales as Sales FROM PivotedSales UNION ALL SELECT Product, 'Q2' as Quarter, Q2_Sales as Sales FROM PivotedSales UNION ALL SELECT Product, 'Q3' as Quarter, Q3_Sales as Sales FROM PivotedSales UNION ALL SELECT Product, 'Q4' as Quarter, Q4_Sales as Sales FROM PivotedSales WHERE Sales > 0; -- Optional: to remove zero-value rows
This would return the data to its original, normalized, row-based format.
A user-defined function (UDF) is a routine that you can define and then call from within SQL queries, just like a built-in function. UDFs are typically used for encapsulating custom calculations or formatting logic.
Example (T-SQL): This function takes a price and a discount percentage and returns the final discounted price.
CREATE FUNCTION dbo.GetDiscountedPrice(@price DECIMAL(10, 2), @discount DECIMAL(3, 2)) RETURNS DECIMAL(10, 2) AS BEGIN RETURN @price * (1 - @discount); END;
Usage:
SELECT ProductName, Price, dbo.GetDiscountedPrice(Price, 0.10) AS DiscountedPrice FROM Products;
A stored procedure is a set of SQL statements that are stored in the database as an object. It can be called by applications or users to perform a specific task. They are used to encapsulate and centralize business logic, improve performance, and enhance security.
Example (T-SQL): This stored procedure retrieves all employees from a specific department.
CREATE PROCEDURE dbo.GetEmployeesByDepartment @DepartmentName VARCHAR(100) AS BEGIN SET NOCOUNT ON; SELECT EmployeeID, Name, Salary FROM Employees WHERE Department = @DepartmentName; END;
Usage:
EXEC dbo.GetEmployeesByDepartment @DepartmentName = 'Sales';
Feature | Stored Procedure | User-Defined Function (UDF) |
---|---|---|
Invocation | Can be called independently (EXEC ) | Must be used in an SQL statement (SELECT ) |
Return Value | Can return multiple result sets, or none | Must return a single value (scalar or table) |
Transaction | Can contain its own transaction (BEGIN /COMMIT ) | Cannot contain its own transaction |
Modification | Can modify the database state (INSERT , UPDATE ) | Generally cannot modify the database state |
Primary Use | Executing business logic | Reusing calculations or formatting |
SQL injection is a web security vulnerability that allows an attacker to interfere with the queries that an application makes to its database. It generally allows an attacker to view data that they are not normally able to retrieve.
The Vulnerability: It happens when application code constructs SQL queries by concatenating user input directly into the query string.
Vulnerable Code Example (in Python):
# User input is directly inserted into the query string user_id = "105 OR 1=1" query = "SELECT * FROM Users WHERE UserID = " + user_id # The resulting SQL is: # SELECT * FROM Users WHERE UserID = 105 OR 1=1 # This will return ALL users, not just one!
Prevention: Use Parameterized Queries (Prepared Statements) This is the most effective way to prevent SQL injection. The database treats the user input as data, not as executable code.
Secure Code Example (in Python with psycopg2):
# The '?' or '%s' is a placeholder. The value is sent separately. user_id = "105 OR 1=1" query = "SELECT * FROM Users WHERE UserID = %s" # The database driver safely handles the parameter. cursor.execute(query, (user_id,)) # The database sees the query and the data separately. # It will correctly (and safely) look for a user with the literal ID "105 OR 1=1" and find nothing.
Other prevention methods include using stored procedures and validating/sanitizing user input, but parameterized queries are the primary defense.
The NTILE(n)
function divides the rows in a partition into n
groups, as equal in size as possible, and assigns an integer from 1 to n
to each group.
The CAP theorem, also known as Brewer's theorem, states that it is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees:
BASE (Basically Available, Soft state, Eventual consistency) is a consistency model used by many NoSQL databases.
Eventual consistency is a consistency model used in distributed systems that guarantees that if no new updates are made to a given data item, all accesses to that item will eventually return the last updated value.
A document store is a type of NoSQL database that stores data in documents, similar to JSON objects. Each document can have a different structure. (e.g., MongoDB)
A key-value store is a type of NoSQL database that stores data as a collection of key-value pairs. It is the simplest type of NoSQL database. (e.g., Redis)
A column-family store is a type of NoSQL database that stores data in columns rather than rows. This is efficient for queries that only need to access a subset of columns. (e.g., Cassandra)
A graph database is a type of NoSQL database that stores data in nodes and edges, representing relationships between entities. It is well-suited for applications that involve complex relationships, such as social networks and recommendation engines. (e.g., Neo4j)
Triggers are special stored procedures that automatically execute (or "fire") in response to specific events in a database. They cannot be directly called or executed - they only run when their triggering event occurs.
Types of Triggers:
INSERT
, UPDATE
, or DELETE
statementsCREATE
, ALTER
, or DROP
statementsBEFORE
Triggers: Execute before the triggering event. Can modify the data before it's committedAFTER
Triggers: Execute after the triggering event. Often used for logging and auditingCursors are database objects that allow you to process rows individually in a result set. They provide a way to iterate through query results row by row.
DECLARE cursor_name CURSOR FOR SELECT column1, column2 FROM table_name; OPEN cursor_name; FETCH NEXT FROM cursor_name INTO @variable1, @variable2; CLOSE cursor_name; DEALLOCATE cursor_name;
Feature | OLTP (Online Transaction Processing) | OLAP (Online Analytical Processing) |
---|---|---|
Purpose | Handle day-to-day transactions | Support analytical queries and reporting |
Data Volume | Large number of short transactions | Fewer, complex queries on large datasets |
Schema | Normalized (3NF) | Denormalized (star/snowflake schema) |
Response Time | Milliseconds | Seconds to minutes |
Example | Banking transactions, order processing | Data warehousing, business intelligence |
Partitioning is the process of dividing a large table into smaller, more manageable pieces while maintaining the logical structure of the table.
Types of Partitioning:
Replication is the process of copying and maintaining database objects in multiple databases. It ensures data availability and improves performance.
Types of Replication:
Connection pooling is a technique used to maintain a cache of database connections that can be reused across multiple requests. This improves performance by avoiding the overhead of establishing new connections.
A data warehouse is a centralized repository that stores integrated data from multiple sources for analytical and reporting purposes. It's optimized for query and analysis rather than transaction processing.
Temporary tables are tables that exist temporarily during a database session. They are automatically dropped when the session ends.
-- Local temporary table (SQL Server) CREATE TABLE #TempTable ( ID INT, Name VARCHAR(50) ); -- Global temporary table (SQL Server) CREATE TABLE ##GlobalTempTable ( ID INT, Name VARCHAR(50) );
Feature | Temporary Tables | Table Variables |
---|---|---|
Scope | Session-wide | Batch/procedure scope |
Indexes | Can create indexes | Limited indexing |
Statistics | Auto-generated | Not generated |
Memory | Can spill to disk | Memory-optimized |
Syntax | #TableName | @TableName |
INT
, BIGINT
, DECIMAL
, FLOAT
, NUMERIC
CHAR
, VARCHAR
, NCHAR
, NVARCHAR
, TEXT
DATE
, TIME
, DATETIME
, TIMESTAMP
BINARY
, VARBINARY
, BLOB
BIT
, BOOLEAN
JSON
(in newer database systems)RBAC is a security model where permissions are assigned to roles, and users are assigned to roles. This simplifies permission management and follows the principle of least privilege.
-- Create role CREATE ROLE sales_team; -- Grant permissions to role GRANT SELECT, INSERT ON customers TO sales_team; -- Assign user to role GRANT sales_team TO john_doe;
Database encryption protects data by converting it into a coded format that can only be read with the appropriate decryption key.
Types:
Database auditing is the process of monitoring and recording database activities to ensure compliance and detect security breaches.
Audit Events:
Views can be used to:
This is a classic problem that can be solved efficiently using a CTE with a window function like DENSE_RANK()
. DENSE_RANK()
is a good choice because it handles ties correctly without skipping ranks.
-- The 'N' should be replaced with the desired rank, e.g., 3 for the 3rd highest. DECLARE @N INT = 3; WITH SalaryRank AS ( SELECT Salary, DENSE_RANK() OVER (ORDER BY Salary DESC) as SalaryRank FROM Employees ) SELECT Salary FROM SalaryRank WHERE SalaryRank = @N;
Note: The syntax for declaring a variable (DECLARE @N
) is for T-SQL (SQL Server). In PostgreSQL or MySQL, you would typically substitute the value directly into the query or use a session variable.
The most straightforward way is to use GROUP BY
on the email column and then use a HAVING
clause to filter for groups with a count greater than one.
SELECT Email, COUNT(Email) as NumOccurrences FROM Customers GROUP BY Email HAVING COUNT(Email) > 1;
The best approach is to use a LEFT JOIN
from Customers
to Orders
and then filter for rows where the Orders
side is NULL
. This indicates the customer had no matching orders.
SELECT c.CustomerName FROM Customers c LEFT JOIN Orders o ON c.CustomerID = o.CustomerID WHERE o.OrderID IS NULL;
An alternative using NOT EXISTS
can also be efficient:
SELECT c.CustomerName FROM Customers c WHERE NOT EXISTS ( SELECT 1 FROM Orders o WHERE o.CustomerID = c.CustomerID );
While you can solve this by modifying the Nth salary query, a common method is to use OFFSET
and FETCH
.
Standard SQL (PostgreSQL, Oracle, SQL Server 2012+):
SELECT Salary FROM Employees ORDER BY Salary DESC OFFSET 1 ROW FETCH NEXT 1 ROW ONLY;
MySQL / PostgreSQL LIMIT
clause:
SELECT Salary FROM Employees ORDER BY Salary DESC LIMIT 1 OFFSET 1;
Using a Subquery (works on most systems):
SELECT MAX(Salary) FROM Employees WHERE Salary < (SELECT MAX(Salary) FROM Employees);
A self-join is the classic approach, where you join the table to itself on the Salary
column, making sure you don't match an employee with themselves.
SELECT DISTINCT e1.EmployeeName, e1.Salary FROM Employees e1 JOIN Employees e2 ON e1.Salary = e2.Salary AND e1.EmployeeID <> e2.EmployeeID;
A more modern approach uses a window function, which can be more efficient as it avoids a join.
WITH SalaryCounts AS ( SELECT EmployeeName, Salary, COUNT(*) OVER (PARTITION BY Salary) as Cnt FROM Employees ) SELECT EmployeeName, Salary FROM SalaryCounts WHERE Cnt > 1;
You can use GROUP BY
with COUNT()
, ORDER BY
, and then select the top result. The syntax for "top result" varies.
PostgreSQL / MySQL:
SELECT Department, COUNT(*) as NumberOfEmployees FROM Employees GROUP BY Department ORDER BY NumberOfEmployees DESC LIMIT 1;
SQL Server:
SELECT TOP 1 Department, COUNT(*) as NumberOfEmployees FROM Employees GROUP BY Department ORDER BY NumberOfEmployees DESC;
This is another "top-N" query. Again, the syntax varies by database.
PostgreSQL / MySQL LIMIT
:
SELECT EmployeeName, Salary FROM Employees ORDER BY Salary DESC LIMIT 5;
SQL Server TOP
:
SELECT TOP 5 EmployeeName, Salary FROM Employees ORDER BY Salary DESC;
Standard SQL FETCH
:
SELECT EmployeeName, Salary FROM Employees ORDER BY Salary DESC FETCH FIRST 5 ROWS ONLY;
This usually means finding rows where the ManagerID
column is NULL
. This often corresponds to the highest-level person in the hierarchy (e.g., the CEO).
SELECT EmployeeName FROM Employees WHERE ManagerID IS NULL;
You can use a window function like COUNT()
to count occurrences of a value within its partition, and then select rows where that count is greater than 1.
WITH DuplicateCheck AS ( SELECT ColumnName, -- The window function counts how many rows share the same ColumnName value COUNT(*) OVER (PARTITION BY ColumnName) as DuplicateCount FROM TableName ) SELECT * FROM DuplicateCheck WHERE DuplicateCount > 1;
The BETWEEN
operator is perfect for this and is inclusive of the start and end dates.
SELECT * FROM TableName WHERE EventDate BETWEEN '2023-01-01' AND '2023-01-31';
This is equivalent to, but more readable than:
SELECT * FROM TableName WHERE EventDate >= '2023-01-01' AND EventDate <= '2023-01-31';
This is a classic use case for a self-join. You join the Employees
table to itself, aliasing one as the "employee" and the other as the "manager".
SELECT e.EmployeeName, e.Salary, m.EmployeeName as ManagerName, m.Salary as ManagerSalary FROM Employees e -- Join the table to itself JOIN Employees m ON e.ManagerID = m.EmployeeID WHERE e.Salary > m.Salary;
A window function with SUM()
is the ideal tool for this. It allows you to sum up values over a specified window of rows.
SELECT Month, Sales, -- Sum the sales from the beginning of the partition up to the current row SUM(Sales) OVER (ORDER BY Month ROWS UNBOUNDED PRECEDING) as RunningTotal FROM MonthlySales ORDER BY Month;
The ROWS UNBOUNDED PRECEDING
part defines the window frame: from the very first row to the current row.
The most portable way is to use conditional aggregation with the CASE
statement. While some databases have a proprietary PIVOT
keyword, this method works on almost any SQL database.
-- Using conditional aggregation SELECT Month, SUM(CASE WHEN Product = 'ProductA' THEN Sales ELSE 0 END) as ProductA, SUM(CASE WHEN Product = 'ProductB' THEN Sales ELSE 0 END) as ProductB, SUM(CASE WHEN Product = 'ProductC' THEN Sales ELSE 0 END) as ProductC FROM Sales GROUP BY Month;
Feature | SQL (Relational) | NoSQL (Non-relational) |
---|---|---|
Data Model | Structured (tables with rows and columns) | Unstructured (documents, key-value, graph) |
Schema | Predefined schema (static) | Dynamic schema |
Scalability | Vertical scaling (scale-up) | Horizontal scaling (scale-out) |
Consistency | ACID (Atomicity, Consistency, Isolation, Durability) | BASE (Basically Available, Soft state, Eventual consistency) |
Examples | MySQL, PostgreSQL, Oracle | MongoDB, Cassandra, Redis |
Get instant AI-powered summaries of YouTube videos and websites. Save time while enhancing your learning experience.