SQL Code Samples

Table of Contents

Joins

This query retrieves counts of content scanned, flagged, and deleted for each active business user.

Highlights:

  • Joins: Combines BusinessCustomer, Business, and Content tables.
  • Filtering: Focuses only on businesses with an active status.
  • Aggregations: Counts distinct content scanned, flagged, and deleted for each business customer.
  • Grouping: Groups results by business customer ID and business name for clarity.

SELECT
    bc."userId" AS business_customer_id,
    b."name" AS business_name,
    COUNT(DISTINCT c."id") AS num_scanned_content,
    COUNT(DISTINCT CASE WHEN c."isFlagged" = true THEN c."id" END) AS num_flagged_content,
    COUNT(DISTINCT CASE WHEN c."isDeleted" = true THEN c."id" END) AS num_deleted_content
FROM
    public."BusinessCustomer" AS bc
LEFT JOIN
    public."Business" AS b ON bc."businessId" = b."id"
LEFT JOIN
    public."Content" AS c ON bc."userId" = c."userId"
WHERE
    b."businessStatus" = 'Active'
GROUP BY
    bc."userId", b."name"
ORDER BY
    business_customer_id;
        

Aggregates

This query calculates the number of distinct users connected to each social media platform.

Highlights:

  • Conditional Aggregation: Uses SUM(CASE...) to count users per platform.
  • Deduplication: Ensures only unique users per platform are considered.
  • Scalability: Can be extended easily for additional platforms.

SELECT 
    SUM(CASE WHEN "origin" = 'Facebook' THEN 1 ELSE 0 END) AS "FacebookUsers",
    SUM(CASE WHEN "origin" = 'Twitter' THEN 1 ELSE 0 END) AS "TwitterUsers",
    SUM(CASE WHEN "origin" = 'Instagram' THEN 1 ELSE 0 END) AS "InstagramUsers",
    SUM(CASE WHEN "origin" = 'Tiktok' THEN 1 ELSE 0 END) AS "TiktokUsers"
FROM 
    (SELECT DISTINCT "userId", "origin" FROM "Content") AS distinct_origins;
        

Common Table Expressions (CTEs)

This query uses a CTE to clean duplicate rows in the dataset.

Highlights:

  • CTE: Creates a temporary result set with row numbers for duplicate rows.
  • Partitioning: Groups rows by contentId, userId, and type to identify duplicates.
  • Row Numbering: Keeps the earliest entry based on the createdAt timestamp.
  • Deletion: Removes rows where the row number exceeds 1, effectively deduplicating the dataset.

WITH cte AS (
    SELECT 
        "id",
        ROW_NUMBER() OVER (
            PARTITION BY "contentId", "userId", "type"
            ORDER BY "createdAt"
        ) AS row_num
    FROM public."ContentAnalysis"
)
DELETE FROM public."ContentAnalysis"
WHERE "id" IN (
    SELECT "id" 
    FROM cte 
    WHERE row_num > 1
);
        

Partitions

This query demonstrates how to use partitions for row-level operations, such as keeping only the earliest created record.

Highlights:

  • CTE: Temporarily calculates row numbers for duplicate rows.
  • Partitioning: Organizes duplicates by specific columns like contentId, userId, and type.
  • Row Numbering: Identifies the earliest entry for each duplicate set.
  • Deletion: Removes all rows except the first occurrence in each partition.

WITH cte AS (
    SELECT 
        "id",
        ROW_NUMBER() OVER (
            PARTITION BY "contentId", "userId", "type"
            ORDER BY "createdAt"
        ) AS row_num
    FROM public."ContentAnalysis"
)
DELETE FROM public."ContentAnalysis"
WHERE "id" IN (
    SELECT "id" 
    FROM cte 
    WHERE row_num > 1
);
        

Data Cleaning Techniques

This series of SQL commands demonstrates how to clean and standardize data.

Step 1: Remove Rows with Missing Critical Columns

Highlights:

  • Filtering: Deletes rows where critical columns like contentId, userId, or type are NULL.

DELETE FROM public."ContentAnalysis"
WHERE "contentId" IS NULL
   OR "userId" IS NULL
   OR "type" IS NULL;
        

Step 2: Handle Duplicate Rows

Highlights:

  • CTE: Identifies duplicate rows using row numbers.
  • Partitioning: Groups rows by contentId, userId, and type to detect duplicates.
  • Row Numbering: Keeps only the earliest row within each group.
  • Deletion: Removes duplicate rows based on the row number.

WITH cte AS (
    SELECT 
        "id",
        ROW_NUMBER() OVER (
            PARTITION BY "contentId", "userId", "type"
            ORDER BY "createdAt"
        ) AS row_num
    FROM public."ContentAnalysis"
)
DELETE FROM public."ContentAnalysis"
WHERE "id" IN (
    SELECT "id" 
    FROM cte 
    WHERE row_num > 1
);
        

Step 3: Handle Inconsistent Data in the "type" Column

Highlights:

  • Standardization: Converts all type values to uppercase for consistency.
  • Filtering: Ensures only non-NULL type values are updated.

UPDATE public."ContentAnalysis"
SET "type" = UPPER("type")
WHERE "type" IS NOT NULL;
        

Step 4: Verify Data Consistency

Highlights:

  • Validation: Checks if updatedAt is earlier than createdAt.
  • Anomaly Detection: Identifies records with logical inconsistencies in date fields.

SELECT * 
FROM public."ContentAnalysis"
WHERE "updatedAt" < "createdAt";