cancel
Showing results for 
Search instead for 
Did you mean: 
Reply
Helper I
Helper I

DAX/Query to Dedupe Data From Append Only BigQuery Warehouse

Hi There,

 

I'm desperately in need of a DAX formula or query that will effectively dedupe the Facebook Ads data coming from my BigQuery warehouse. I use an ETL tool to autoreplicate data from FB Ads to the warehouse. Since we have an append-only warehouse, re-replicated rows from the 28 day attribution window from Facebook are appended and results in duplicated values for different dimensions (campaigns). The ETL tool (called Stitch) provided documentation here gives this sample query:  

 

SELECT DISTINCT o.*
FROM [stitch-analytics-bigquery-123:ecommerce.orders] o
INNER JOIN (
SELECT id,
MAX(_sdc_sequence) AS seq,
MAX(_sdc_batched_at) AS batch
FROM [stitch-analytics-bigquery-123:ecommerce.orders]
GROUP BY id) oo
ON o.id = oo.id
AND o._sdc_sequence = oo.seq
AND o._sdc_batched_at = oo.batch

 

This approach uses a subquery to get a single list of every row’s Primary Key, maximum sequence number(_sdc_sequence), and maximum batched at (_sdc_batched_at) timestamp. Since it’s possible to have duplicate records in your warehouse, the query also selects only distinct records of the latest version of the row. It then joins the original table to the Primary Key, maximum sequence, and maximum batched at, which makes all other column values available for querying.

 

Can someone help with a DAX formula/query that is adapted to what I need? I'm attaching an Excel file with a sample data and _sdc columns from the example above. It also has what an ideal deduped table would look like. Thanks in advance!

1 ACCEPTED SOLUTION

@lemarcfj you are on the wrong path. Avoid dedup'ing on the client side as you would need to import all before doing so - unless PowerQuery is smart enough to unfold correctly (i.e. translate to native SQL).

 

You are better off dedup'ing on BQ side and import the filtered dataset instead.

1. Don't use a BigQuery View to dedup on-the-fly as this will likely be slow.

2. Have instad a separate table that contains the unique 'campaign' records - you can schedule such query in BQ to run daily for instance

3. once table is created/refreshed, query that one in PowerBI. You would avoid ODBC driver and can use the built-in BigQuery connector as-is (and leverage DirectQuery).

 

However, your depup logic is unclear and need rework - for instance I thought you would be interested in summing the spend instead of the last. Even so, you will probably end up using a window fuction like ROW_NUM() to sort by whatever and keep the first record of each window (e.g. campaign).

You can search on Stackoverflow as many have asked how to dedup already - or you can ask a new question if you are specific enough.

View solution in original post

17 REPLIES 17
Microsoft
Microsoft

@lemarcfj,

Create the following column in your table.

RANK = RANKX(FILTER('Raw','Raw'[Facebook Campaign]=EARLIER('Raw'[Facebook Campaign])&&'Raw'[_sdc_sequence]=EARLIER('Raw'[_sdc_sequence])),'Raw'[_sdc_batched_at],,DESC)

1.PNG

Create the following measures in your table.

MAXSDC_SEQUENCE = CALCULATE(MAX('Raw'[_sdc_sequence]),ALLEXCEPT('Raw','Raw'[Facebook Campaign]))
chkmeasure = IF(MAX('Raw'[_sdc_sequence])=[MAXSDC_SEQUENCE]&&MAX('Raw'[RANK])=1,1,0)

2.PNG

Create a table visual as shown in the following screenshot, drag chkmeasure into visual level filter and set its value to 1.
3.PNG

Regards,
Lydia

Community Support Team _ Lydia Zhang
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

I think I need this entire query completed upon entry into the data model so that the data that I'm presenting in dashboards is already pulling from the most "recent" table. I don't have the internal resources to write the query directly into BigQuery. How can this be done so that the data pulled into the model is using this query? 

 

Another example of where this is a problem is if a Campaign_Name doesn't have a 1 = true, it'll pull in blank values which would be the case any time I don't include the most recent batched date included in the filter. Does that make sense? 

Apologies you can actually disregard the last message. I had the DAX missing a symbol. Your solution seems to be partially working as it did indeed pull in a closer number to a deduped value, but som of the campain_name's are missing. The logic that is identifying campaign names with a 1 or 0 isn't considering ALL campaign_name variants. 

@lemarcfj,

Could you please post all the data in your table and post expected result based on the sample data?

In addition, when using Google BigQuery connector in Power BI Desktop, it is not possible to write SQL statement. A method is to create a ODBC data source for bigQuery, then connect to the ODBC data source along with SQL statement.

Regards,
Lydia

Community Support Team _ Lydia Zhang
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

@lemarcfj you are on the wrong path. Avoid dedup'ing on the client side as you would need to import all before doing so - unless PowerQuery is smart enough to unfold correctly (i.e. translate to native SQL).

 

You are better off dedup'ing on BQ side and import the filtered dataset instead.

1. Don't use a BigQuery View to dedup on-the-fly as this will likely be slow.

2. Have instad a separate table that contains the unique 'campaign' records - you can schedule such query in BQ to run daily for instance

3. once table is created/refreshed, query that one in PowerBI. You would avoid ODBC driver and can use the built-in BigQuery connector as-is (and leverage DirectQuery).

 

However, your depup logic is unclear and need rework - for instance I thought you would be interested in summing the spend instead of the last. Even so, you will probably end up using a window fuction like ROW_NUM() to sort by whatever and keep the first record of each window (e.g. campaign).

You can search on Stackoverflow as many have asked how to dedup already - or you can ask a new question if you are specific enough.

View solution in original post

Thanks for your reply. I suspected that it would make mores sense to dedupe on the BQ side and you just confirmed that. I was able to get my hands on the query that I need to make this happen, but unfortunately I'm more of a hybrid strategy/technical analyst and wouldn't know what to do with the query within BQ interface as it relates to scheduling and what not. 

Not for this PowerBI forum anyhow.

 

Research online or reach out to Stackoverflow. On the latter, do show signs that you tried something as the community is there to help, not to provide free consulting. Btw, many of us have hybrid hats - I am a tax consultant - so see this as a learning opportunity 😉

Good luck !

Hi, 

 

I was able to get my query to work on the BQ side, thank you for your suggestion. I didn't personally write the query but it worked nonetheless. 

 

I'm hoping I can figure out how the primary key is defined/selected so that I can repurpose the query for other tables that I'm having this issue for. Thanks again. 

 

LeMarc

 

Thanks. I tried to add the calculated column using:

RANK = RANKX(FILTER('FB_ads_insights','FB_ads_insights'[campaign_name]=EARLIER('FB_ads_insights'[campaign_name])&'FB_ads_insights'[_sdc_sequence]=EARLIER('FB_ads_insights'[_sdc_sequence])),'FB_ads_insights'[_sdc_batched_at],,DESC)

 

And got the following error: DAX comparison operations do not support comparing values of type True/False with values of type Text. Consider using the VALUE or FORMAT function to convert one of the values.

 

Notice that I used the true table/column names here. Which columns need to be what formatting? The sdc_sequence is coming in as a decimal number but is being converted to scientific notation in Power BI (134343 + N) or whatever its called. Should I convert that to a different format? 

Super User IV
Super User IV

Sample data and expected output would help tremendously. Does it matter if the solution is in DAX or Power Query? Please see this post regarding How to Get Your Question Answered Quickly: https://community.powerbi.com/t5/Community-Blog/How-to-Get-Your-Question-Answered-Quickly/ba-p/38490


---------------------------------------

@ me in replies or I'll lose your thread!!!

I have a NEW book! 
DAX Cookbook from Packt
Over 120 DAX Recipes!




Did I answer your question? Mark my post as a solution!

Proud to be a Super User!




Apologies for that. I don't care what it is as long as I get what I need. I'd imagine this would be more of a Power Query. 

 

Sample data can be found here.

@lemarcfj,

Could you please describe more details about the logic you use to get Deduped data from Raw table? I try to execute the SQL query you provide, but I found that there is no id column in the sample excel file.

Regards,

Lydia

Community Support Team _ Lydia Zhang
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

So that SQL query was just a sample query that I found from the Slack blog post here. In my sample file I believe the dimension would we'd want to use instead of "id" would be campaign name as that is the dimension that I would like to view the data by and then a deduped spend metric. 

Hi,

 

Yes I do! As an FYI, that sample query I provided has no relation to my sample data set. The query needs to be adapted for my dataset and what I'm hoping to achieve. So while the query wil achieve deduplication, it isn't a query that was built specifically for me. That said, yes [Facebook Campaign] is my primary dimension. 

 

Thanks a bunch for your assistance with this. 

 

 

@lemarcfj


Could you please describe the logic you use to get Dedupe date based on sample data? 

Regards,
Lydia

Community Support Team _ Lydia Zhang
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

Hi Lydia - let me know if this helps: 

 

Grab the latest version of every row

In each Stitch-generated integration table, you’ll see a few columns prepended with _sdc. The column we’ll focus on here is the _sdc_sequence column. This column is a Unix epoch (down to the millisecond) attached to the record during replication and can help determine the order of all the versions of a row.

Stitch uses these sequence values in a few places to correctly order rows for loading, but it can be also used to grab the latest version of a record in an append-only table.

Let’s take a look at an example. Assume we have an orders table that contains:

  • A Primary Key of id,
  • The system _sdc columns added by Stitch, and
  • Other order attribute columns

If you wanted to create a snapshot of the latest version of this table, you could run a query like this:

    SELECT DISTINCT o.*
      FROM [stitch-analytics-bigquery-123:ecommerce.orders] o
INNER JOIN (
     SELECT id,
            MAX(_sdc_sequence) AS seq,
            MAX(_sdc_batched_at) AS batch
    FROM [stitch-analytics-bigquery-123:ecommerce.orders]
    GROUP BY id) oo
ON o.id = oo.id
AND o._sdc_sequence = oo.seq
AND o._sdc_batched_at = oo.batch

This approach uses a subquery to get a single list of every row’s Primary Key, maximum sequence number, and maximum batched at timestamp. Since it’s possible to have duplicate records in your warehouse, the query also selects only distinct records of the latest version of the row. It then joins the original table to the Primary Key, maximum sequence, and maximum batched at, which makes all other column values available for querying

@lemarcfj,

Do you mean [Facebook Campaign]? When I run the SQL query below based on your sample table, I didn't get any result in 

   SELECT DISTINCT o.*
      FROM [stitch-analytics-bigquery-123:ecommerce.orders] o
INNER JOIN (
     SELECT [Facebook Campaign],
            MAX(_sdc_sequence) AS seq,
            MAX(_sdc_batched_at) AS batch
    FROM [stitch-analytics-bigquery-123:ecommerce.orders]
    GROUP BY [Facebook Campaign]) oo
ON o.[Facebook Campaign] = oo.[Facebook Campaign]
AND o._sdc_sequence = oo.seq
AND o._sdc_batched_at = oo.batch

1.PNG

Regards,
Lydia

Community Support Team _ Lydia Zhang
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

Helpful resources

Announcements
secondImage

Congratulations!

We are excited to announce the Power BI Super Users!

Wave Release 2

Check out the updates in Power BI.

Overview of Power BI 2020 release wave 2!

Microsoft Ignite

Microsoft Ignite

Join digitally, March 2–4, 2021 to explore new tech that's ready to implement. Experience the keynote in mixed reality through AltspaceVR!

secondImage

The largest Power BI virtual conference

100+ sessions, 100+ speakers, Product managers, MVPs, and experts. All about Power BI. Attend online or watch the recordings.

Top Solution Authors
Top Kudoed Authors