Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Earn the coveted Fabric Analytics Engineer certification. 100% off your exam for a limited time only!

Reply
lemarcfj
Helper I
Helper I

DAX/Query to Dedupe Data From Append Only BigQuery Warehouse

Hi There,

 

I'm desperately in need of a DAX formula or query that will effectively dedupe the Facebook Ads data coming from my BigQuery warehouse. I use an ETL tool to autoreplicate data from FB Ads to the warehouse. Since we have an append-only warehouse, re-replicated rows from the 28 day attribution window from Facebook are appended and results in duplicated values for different dimensions (campaigns). The ETL tool (called Stitch) provided documentation here gives this sample query:  

 

SELECT DISTINCT o.*
FROM [stitch-analytics-bigquery-123:ecommerce.orders] o
INNER JOIN (
SELECT id,
MAX(_sdc_sequence) AS seq,
MAX(_sdc_batched_at) AS batch
FROM [stitch-analytics-bigquery-123:ecommerce.orders]
GROUP BY id) oo
ON o.id = oo.id
AND o._sdc_sequence = oo.seq
AND o._sdc_batched_at = oo.batch

 

This approach uses a subquery to get a single list of every row’s Primary Key, maximum sequence number(_sdc_sequence), and maximum batched at (_sdc_batched_at) timestamp. Since it’s possible to have duplicate records in your warehouse, the query also selects only distinct records of the latest version of the row. It then joins the original table to the Primary Key, maximum sequence, and maximum batched at, which makes all other column values available for querying.

 

Can someone help with a DAX formula/query that is adapted to what I need? I'm attaching an Excel file with a sample data and _sdc columns from the example above. It also has what an ideal deduped table would look like. Thanks in advance!

1 ACCEPTED SOLUTION

@lemarcfj you are on the wrong path. Avoid dedup'ing on the client side as you would need to import all before doing so - unless PowerQuery is smart enough to unfold correctly (i.e. translate to native SQL).

 

You are better off dedup'ing on BQ side and import the filtered dataset instead.

1. Don't use a BigQuery View to dedup on-the-fly as this will likely be slow.

2. Have instad a separate table that contains the unique 'campaign' records - you can schedule such query in BQ to run daily for instance

3. once table is created/refreshed, query that one in PowerBI. You would avoid ODBC driver and can use the built-in BigQuery connector as-is (and leverage DirectQuery).

 

However, your depup logic is unclear and need rework - for instance I thought you would be interested in summing the spend instead of the last. Even so, you will probably end up using a window fuction like ROW_NUM() to sort by whatever and keep the first record of each window (e.g. campaign).

You can search on Stackoverflow as many have asked how to dedup already - or you can ask a new question if you are specific enough.

View solution in original post

17 REPLIES 17

Helpful resources

Announcements
April AMA free

Microsoft Fabric AMA Livestream

Join us Tuesday, April 09, 9:00 – 10:00 AM PST for a live, expert-led Q&A session on all things Microsoft Fabric!

March Fabric Community Update

Fabric Community Update - March 2024

Find out what's new and trending in the Fabric Community.