
Data Review & Cleaning Summary

  • ORDER_ID not unique (1,091 duplicates)
  • Extreme merchant concentration: 87% of merchants have fewer than 10 orders
  • 5,229 records with date logic violations

Dataset Overview

  • Orders dataset: 311,645 records
  • Line items dataset: 411,581 records

Data Quality Issues Identified

Order ID Not Unique

  • 1,091 orders share an ORDER_ID with another record but represent separate, legitimate transactions
  • The repeated records differ in timestamps, addresses, and costs, so they are not data entry errors
  • Solution: Use DISTINCT on the order_id field in analysis queries while preserving the underlying data
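
The check behind this finding can be sketched in a few lines of Python. This is a minimal illustration, not the query used in the review: the sample rows, the tuple layout, and the variable names are all assumptions made up for the example.

```python
from collections import Counter

# Hypothetical sample rows: (order_id, order_timestamp, total_cost).
# The real orders dataset has more columns and 311,645 records.
orders = [
    ("A100", "2024-01-03 10:15", 25.00),
    ("A100", "2024-02-11 16:40", 61.50),  # same ID, different transaction
    ("A101", "2024-01-04 09:00", 12.75),
    ("A102", "2024-01-05 13:30", 40.00),
]

# Count how many records carry each order_id.
id_counts = Counter(order_id for order_id, _, _ in orders)

# IDs that appear on more than one record.
repeated_ids = sorted(oid for oid, n in id_counts.items() if n > 1)

# Equivalent of COUNT(DISTINCT order_id): one per unique ID.
distinct_order_count = len(id_counts)

print(repeated_ids)          # IDs that need a closer look
print(distinct_order_count)  # unique IDs
print(len(orders))           # raw record count, left untouched
```

Comparing the distinct count with the raw record count shows how much the two views diverge, which is worth keeping in mind because a distinct count treats records that share an ID as a single order.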

Date Logic Violations

  • 5,229 orders with a fulfilled date before the order date
  • 34 orders with registration issues
  • Solution: Retain with appropriate filtering in analysis queries
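
As a rough sketch of what "retain with filtering" could look like, the snippet below flags records whose fulfilled date precedes the order date and builds a filtered view without deleting anything. The sample rows and the helper name has_valid_dates are hypothetical, and treating a missing fulfilled date as valid is an assumption, not something the review states.

```python
from datetime import date

# Hypothetical sample rows: (order_id, order_dt, fulfilled_dt).
orders = [
    ("A100", date(2024, 1, 3), date(2024, 1, 5)),
    ("A101", date(2024, 1, 4), date(2024, 1, 2)),  # fulfilled before ordered
    ("A102", date(2024, 1, 5), None),              # not yet fulfilled
]

def has_valid_dates(order_dt, fulfilled_dt):
    """Return True when fulfilment is missing or not earlier than the order.

    Assumption: an unfulfilled order (None) is not a violation.
    """
    return fulfilled_dt is None or fulfilled_dt >= order_dt

# Records to exclude from date-sensitive analysis.
violations = [oid for oid, o_dt, f_dt in orders if not has_valid_dates(o_dt, f_dt)]

# Filtered view used by analysis; the source list is left as is.
clean = [oid for oid, o_dt, f_dt in orders if has_valid_dates(o_dt, f_dt)]

print(violations)
print(clean)
```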

Data Integrity

  • Zero null values in primary keys (order_id, merchant_id, shop_id, order_dt)
  • 1 order without line items (ORDER_ID: 719886.143) - kept as non-impactful
  • No orphaned line items
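
The three integrity checks above can be expressed with simple set operations. The following is an illustrative sketch only: the sample records and the line_item_id field are invented for the example, while the four key field names come from the list above.

```python
# Hypothetical sample records for the two datasets.
orders = [
    {"order_id": "A100", "merchant_id": "M1", "shop_id": "S1", "order_dt": "2024-01-03"},
    {"order_id": "A101", "merchant_id": "M2", "shop_id": "S2", "order_dt": "2024-01-04"},
]
line_items = [
    {"line_item_id": 1, "order_id": "A100"},
    {"line_item_id": 2, "order_id": "A100"},
]

KEY_FIELDS = ("order_id", "merchant_id", "shop_id", "order_dt")

# 1. Null check on the key fields of the orders dataset.
null_counts = {
    field: sum(1 for row in orders if row.get(field) is None)
    for field in KEY_FIELDS
}

order_ids = {row["order_id"] for row in orders}
item_order_ids = {row["order_id"] for row in line_items}

# 2. Orders that have no line items at all.
orders_without_items = sorted(order_ids - item_order_ids)

# 3. Line items pointing at an order that does not exist (orphans).
orphaned_items = [
    row["line_item_id"] for row in line_items if row["order_id"] not in order_ids
]

print(null_counts)
print(orders_without_items)
print(orphaned_items)
```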

Merchants Distribution: Extreme Concentration

87% of merchants have fewer than 10 orders over 6 months. This is a classic marketplace power-user distribution and calls for a segmented analysis approach.
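
One way to reproduce a concentration figure like this is to count orders per merchant and measure the share of merchants below the threshold. The sketch below uses made-up merchant and order IDs; only the threshold of 10 orders comes from the finding above, and the resulting share applies to the toy data, not to the real dataset.

```python
from collections import Counter

# Hypothetical distinct (merchant_id, order_id) pairs.
merchant_orders = (
    [("M1", i) for i in range(25)]  # one high-volume merchant
    + [("M2", 100), ("M2", 101), ("M3", 200), ("M4", 300)]
)

LOW_VOLUME_THRESHOLD = 10  # "fewer than 10 orders"

# Number of orders per merchant.
orders_per_merchant = Counter(merchant_id for merchant_id, _ in merchant_orders)

# Merchants under the threshold, and their share of all merchants.
low_volume = sorted(
    m for m, n in orders_per_merchant.items() if n < LOW_VOLUME_THRESHOLD
)
share_low_volume = len(low_volume) / len(orders_per_merchant)

# Share of all orders that those low-volume merchants account for.
low_volume_orders = sum(orders_per_merchant[m] for m in low_volume)
share_of_orders = low_volume_orders / len(merchant_orders)

print(low_volume)
print(f"{share_low_volume:.0%} of merchants are low volume")
print(f"{share_of_orders:.0%} of orders come from them")
```

Reporting both shares side by side makes the case for segmenting: a large majority of merchants can account for a small minority of orders.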