Table of Contents
Joins
This query retrieves counts of content scanned, flagged, and deleted for each active business user.
Highlights:
- Joins: Combines
BusinessCustomer
,Business
, andContent
tables. - Filtering: Focuses only on businesses with an active status.
- Aggregations: Counts distinct content scanned, flagged, and deleted for each business customer.
- Grouping: Groups results by business customer ID and business name for clarity.
SELECT
bc."userId" AS business_customer_id,
b."name" AS business_name,
COUNT(DISTINCT c."id") AS num_scanned_content,
COUNT(DISTINCT CASE WHEN c."isFlagged" = true THEN c."id" END) AS num_flagged_content,
COUNT(DISTINCT CASE WHEN c."isDeleted" = true THEN c."id" END) AS num_deleted_content
FROM
public."BusinessCustomer" AS bc
LEFT JOIN
public."Business" AS b ON bc."businessId" = b."id"
LEFT JOIN
public."Content" AS c ON bc."userId" = c."userId"
WHERE
b."businessStatus" = 'Active'
GROUP BY
bc."userId", b."name"
ORDER BY
business_customer_id;
Aggregates
This query calculates the number of distinct users connected to each social media platform.
Highlights:
- Conditional Aggregation: Uses
SUM(CASE...)
to count users per platform. - Deduplication: Ensures only unique users per platform are considered.
- Scalability: Can be extended easily for additional platforms.
SELECT
SUM(CASE WHEN "origin" = 'Facebook' THEN 1 ELSE 0 END) AS "FacebookUsers",
SUM(CASE WHEN "origin" = 'Twitter' THEN 1 ELSE 0 END) AS "TwitterUsers",
SUM(CASE WHEN "origin" = 'Instagram' THEN 1 ELSE 0 END) AS "InstagramUsers",
SUM(CASE WHEN "origin" = 'Tiktok' THEN 1 ELSE 0 END) AS "TiktokUsers"
FROM
(SELECT DISTINCT "userId", "origin" FROM "Content") AS distinct_origins;
Common Table Expressions (CTEs)
This query uses a CTE to clean duplicate rows in the dataset.
Highlights:
- CTE: Creates a temporary result set with row numbers for duplicate rows.
- Partitioning: Groups rows by
contentId
,userId
, andtype
to identify duplicates. - Row Numbering: Keeps the earliest entry based on the
createdAt
timestamp. - Deletion: Removes rows where the row number exceeds 1, effectively deduplicating the dataset.
WITH cte AS (
SELECT
"id",
ROW_NUMBER() OVER (
PARTITION BY "contentId", "userId", "type"
ORDER BY "createdAt"
) AS row_num
FROM public."ContentAnalysis"
)
DELETE FROM public."ContentAnalysis"
WHERE "id" IN (
SELECT "id"
FROM cte
WHERE row_num > 1
);
Partitions
This query demonstrates how to use partitions for row-level operations, such as keeping only the earliest created record.
Highlights:
- CTE: Temporarily calculates row numbers for duplicate rows.
- Partitioning: Organizes duplicates by specific columns like
contentId
,userId
, andtype
. - Row Numbering: Identifies the earliest entry for each duplicate set.
- Deletion: Removes all rows except the first occurrence in each partition.
WITH cte AS (
SELECT
"id",
ROW_NUMBER() OVER (
PARTITION BY "contentId", "userId", "type"
ORDER BY "createdAt"
) AS row_num
FROM public."ContentAnalysis"
)
DELETE FROM public."ContentAnalysis"
WHERE "id" IN (
SELECT "id"
FROM cte
WHERE row_num > 1
);
Data Cleaning Techniques
This series of SQL commands demonstrates how to clean and standardize data.
Step 1: Remove Rows with Missing Critical Columns
Highlights:
- Filtering: Deletes rows where critical columns like
contentId
,userId
, ortype
areNULL
.
DELETE FROM public."ContentAnalysis"
WHERE "contentId" IS NULL
OR "userId" IS NULL
OR "type" IS NULL;
Step 2: Handle Duplicate Rows
Highlights:
- CTE: Identifies duplicate rows using row numbers.
- Partitioning: Groups rows by
contentId
,userId
, andtype
to detect duplicates. - Row Numbering: Keeps only the earliest row within each group.
- Deletion: Removes duplicate rows based on the row number.
WITH cte AS (
SELECT
"id",
ROW_NUMBER() OVER (
PARTITION BY "contentId", "userId", "type"
ORDER BY "createdAt"
) AS row_num
FROM public."ContentAnalysis"
)
DELETE FROM public."ContentAnalysis"
WHERE "id" IN (
SELECT "id"
FROM cte
WHERE row_num > 1
);
Step 3: Handle Inconsistent Data in the "type" Column
Highlights:
- Standardization: Converts all
type
values to uppercase for consistency. - Filtering: Ensures only non-
NULL
type
values are updated.
UPDATE public."ContentAnalysis"
SET "type" = UPPER("type")
WHERE "type" IS NOT NULL;
Step 4: Verify Data Consistency
Highlights:
- Validation: Checks if
updatedAt
is earlier thancreatedAt
. - Anomaly Detection: Identifies records with logical inconsistencies in date fields.
SELECT *
FROM public."ContentAnalysis"
WHERE "updatedAt" < "createdAt";