SQL Queries

Joins
Aggregates
Common Table Expressions (CTEs)
Partitions
Data Cleaning Techniques

Joins

This query retrieves counts of content scanned, flagged, and deleted for each active business user.

Highlights:

Joins: Combines BusinessCustomer, Business, and Content tables.
Filtering: Focuses only on businesses with an active status.
Aggregations: Counts distinct content scanned, flagged, and deleted for each business customer.
Grouping: Groups results by business customer ID and business name for clarity.


SELECT
    bc."userId" AS business_customer_id,
    b."name" AS business_name,
    COUNT(DISTINCT c."id") AS num_scanned_content,
    COUNT(DISTINCT CASE WHEN c."isFlagged" = true THEN c."id" END) AS num_flagged_content,
    COUNT(DISTINCT CASE WHEN c."isDeleted" = true THEN c."id" END) AS num_deleted_content
FROM
    public."BusinessCustomer" AS bc
LEFT JOIN
    public."Business" AS b ON bc."businessId" = b."id"
LEFT JOIN
    public."Content" AS c ON bc."userId" = c."userId"
WHERE
    b."businessStatus" = 'Active'
GROUP BY
    bc."userId", b."name"
ORDER BY
    business_customer_id;

Aggregates

This query calculates the number of distinct users connected to each social media platform.

Highlights:

Conditional Aggregation: Uses SUM(CASE...) to count users per platform.
Deduplication: Ensures only unique users per platform are considered.
Scalability: Can be extended easily for additional platforms.


SELECT 
    SUM(CASE WHEN "origin" = 'Facebook' THEN 1 ELSE 0 END) AS "FacebookUsers",
    SUM(CASE WHEN "origin" = 'Twitter' THEN 1 ELSE 0 END) AS "TwitterUsers",
    SUM(CASE WHEN "origin" = 'Instagram' THEN 1 ELSE 0 END) AS "InstagramUsers",
    SUM(CASE WHEN "origin" = 'Tiktok' THEN 1 ELSE 0 END) AS "TiktokUsers"
FROM 
    (SELECT DISTINCT "userId", "origin" FROM "Content") AS distinct_origins;

Common Table Expressions (CTEs)

This query uses a CTE to clean duplicate rows in the dataset.

Highlights:

CTE: Creates a temporary result set with row numbers for duplicate rows.
Partitioning: Groups rows by contentId, userId, and type to identify duplicates.
Row Numbering: Keeps the earliest entry based on the createdAt timestamp.
Deletion: Removes rows where the row number exceeds 1, effectively deduplicating the dataset.


WITH cte AS (
    SELECT 
        "id",
        ROW_NUMBER() OVER (
            PARTITION BY "contentId", "userId", "type"
            ORDER BY "createdAt"
        ) AS row_num
    FROM public."ContentAnalysis"
)
DELETE FROM public."ContentAnalysis"
WHERE "id" IN (
    SELECT "id" 
    FROM cte 
    WHERE row_num > 1
);

Partitions

This query demonstrates how to use partitions for row-level operations, such as keeping only the earliest created record.

Highlights:

CTE: Temporarily calculates row numbers for duplicate rows.
Partitioning: Organizes duplicates by specific columns like contentId, userId, and type.
Row Numbering: Identifies the earliest entry for each duplicate set.
Deletion: Removes all rows except the first occurrence in each partition.


WITH cte AS (
    SELECT 
        "id",
        ROW_NUMBER() OVER (
            PARTITION BY "contentId", "userId", "type"
            ORDER BY "createdAt"
        ) AS row_num
    FROM public."ContentAnalysis"
)
DELETE FROM public."ContentAnalysis"
WHERE "id" IN (
    SELECT "id" 
    FROM cte 
    WHERE row_num > 1
);

Data Cleaning Techniques

This series of SQL commands demonstrates how to clean and standardize data.

Step 1: Remove Rows with Missing Critical Columns

Highlights:

Filtering: Deletes rows where critical columns like contentId, userId, or type are NULL.


DELETE FROM public."ContentAnalysis"
WHERE "contentId" IS NULL
   OR "userId" IS NULL
   OR "type" IS NULL;

Step 2: Handle Duplicate Rows

Highlights:

CTE: Identifies duplicate rows using row numbers.
Partitioning: Groups rows by contentId, userId, and type to detect duplicates.
Row Numbering: Keeps only the earliest row within each group.
Deletion: Removes duplicate rows based on the row number.


WITH cte AS (
    SELECT 
        "id",
        ROW_NUMBER() OVER (
            PARTITION BY "contentId", "userId", "type"
            ORDER BY "createdAt"
        ) AS row_num
    FROM public."ContentAnalysis"
)
DELETE FROM public."ContentAnalysis"
WHERE "id" IN (
    SELECT "id" 
    FROM cte 
    WHERE row_num > 1
);

Step 3: Handle Inconsistent Data in the "type" Column

Highlights:

Standardization: Converts all type values to uppercase for consistency.
Filtering: Ensures only non-NULL type values are updated.


UPDATE public."ContentAnalysis"
SET "type" = UPPER("type")
WHERE "type" IS NOT NULL;

Step 4: Verify Data Consistency

Highlights:

Validation: Checks if updatedAt is earlier than createdAt.
Anomaly Detection: Identifies records with logical inconsistencies in date fields.


SELECT * 
FROM public."ContentAnalysis"
WHERE "updatedAt" < "createdAt";

SQL Code Samples

Table of Contents

Joins

Highlights:

Aggregates

Highlights:

Common Table Expressions (CTEs)

Highlights:

Partitions

Highlights:

Data Cleaning Techniques

Step 1: Remove Rows with Missing Critical Columns

Highlights:

Step 2: Handle Duplicate Rows

Highlights:

Step 3: Handle Inconsistent Data in the "type" Column

Highlights:

Step 4: Verify Data Consistency

Highlights: