Dear Community,
Thank you for adding me. I work for a flight consolidator in Germany and have recently started using Power BI. I am very excited about the possibilities.
Now I am a little lost when it comes to identifying duplicates. I know there have already been several posts on this topic, and I have tried many of their suggestions, but so far no solution fits my problem.
I have a daily error report that shows faulty booking attempts for flights. When an error occurs, some agencies try to book the same flight again and again. In my visualizations I want to use a slicer to distinguish between first and all booking attempts. So far I have solved this in Excel with two additional columns: one concatenates all crucial cells to generate a unique ID, and the second identifies whether there is a repeated ID within the same day. For the latter I used MATCH and COUNTIF formulas.
Since this worked very well, I am now trying to add these calculated columns in Power BI as well, but I am failing to do so. The first one (F), concatenating columns, is easy. But I do not know how to build the second one (G).
What formula do I have to use to tag the first and also the repeated attempts/IDs within one day? Is an additional sort order necessary, or does the 'booking date' take over that role here?
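To make the Excel approach concrete, here is a minimal sketch of the same two-column logic in Python. The field names (agency, flight, booking_date) and the sample rows are illustrative assumptions, not taken from the actual report:

```python
from collections import defaultdict

# Hypothetical sample rows; column names are placeholders for the
# "crucial cells" that the Excel sheet concatenates into a unique ID.
rows = [
    {"agency": "A1", "flight": "LH400", "booking_date": "2020-03-01"},
    {"agency": "A1", "flight": "LH400", "booking_date": "2020-03-01"},
    {"agency": "A2", "flight": "LH400", "booking_date": "2020-03-01"},
    {"agency": "A1", "flight": "LH400", "booking_date": "2020-03-02"},
]

seen = defaultdict(int)  # running COUNTIF per (concatenated ID, day)
for row in rows:
    key = (row["agency"] + "|" + row["flight"], row["booking_date"])
    seen[key] += 1
    # first occurrence of the ID within the day -> "First", later -> "Repeated"
    row["attempt"] = "First" if seen[key] == 1 else "Repeated"

print([r["attempt"] for r in rows])  # ['First', 'Repeated', 'First', 'First']
```

A slicer on the resulting "attempt" column then separates first from repeated booking attempts.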
I'm very grateful for any feedback.
Thank you very much in advance.
Alex
Solved!
Hi @agross,
The Power BI data model does not contain column or row indexes, so a DAX formula cannot recognize the first record and would mark all similar records as duplicates.
In my opinion, you can add an index column in the Query Editor, then use the formula below to loop over the concatenated text and mark duplicate records.
Duplicate =
VAR uniqueCount =
    CALCULATE (
        COUNT ( 'Sample'[CONCATENATE] ),
        FILTER (
            ALL ( 'Sample' ),
            [Index] <= EARLIER ( [Index] )
                && [CONCATENATE] = EARLIER ( 'Sample'[CONCATENATE] )
                && [Booking Date] = EARLIER ( [Booking Date] )
        )
    )
RETURN
    IF ( uniqueCount >= 2, "Yes", "No" )
Regards,
Xiaoxin Sheng
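To see what the calculated column above computes, here is a Python trace of its logic. For every row (the outer row context that EARLIER refers back to), it counts the rows of the whole table with the same concatenated ID and booking date and an Index less than or equal to the current row's. The sample values and the names "key"/"date" (standing in for CONCATENATE and Booking Date) are illustrative assumptions:

```python
# Hypothetical rows with an Index column, as added in the Query Editor.
rows = [
    {"index": 1, "key": "A1|LH400", "date": "2020-03-01"},
    {"index": 2, "key": "A1|LH400", "date": "2020-03-01"},
    {"index": 3, "key": "A2|LH400", "date": "2020-03-01"},
    {"index": 4, "key": "A1|LH400", "date": "2020-03-02"},
]

# Per row: scan the full table, count earlier-or-equal matches.
for r in rows:
    unique_count = sum(
        1 for o in rows
        if o["index"] <= r["index"]
        and o["key"] == r["key"]
        and o["date"] == r["date"]
    )
    r["duplicate"] = "Yes" if unique_count >= 2 else "No"

print([r["duplicate"] for r in rows])  # ['No', 'Yes', 'No', 'No']
```

Note that only the first attempt per ID and day gets "No"; every later attempt sees a count of 2 or more. The inner scan runs once per row, so the work grows quadratically with the table size.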
Hi Xiaoxin Sheng (@v-shex-msft),
Great! Thank you very much for your assistance. Your proposal seems to work, and this way I am learning the correct use of the formulas.
The problem I am now facing is probably the amount of data: Power BI interrupts the calculation with the message (translated from German) "There is not enough memory to perform this operation. Please try again later when more memory may be available." At the moment I have 230,000 rows. Any suggestions how this could be solved? Or are there other ways to achieve the above goal?
Apart from that, I wonder what happens to the index column when I add new data to the master XLSX file. Does it stay and update itself in Power BI?
Thanks again.
Best regards
Alex
Hi @agross,
I'd suggest using the 64-bit Desktop instead of the 32-bit Power BI; it performs better when handling a huge number of records.
Differences between 32 & 64 bit versions of Power BI Desktop
By the way, I'd also suggest closing unused applications to free up memory for the calculation.
>> Does it stay and update itself in PBI?
Yes, the index column will be regenerated after the records are updated.
Regards,
Xiaoxin Sheng
Hi @v-shex-msft,
Thanks a lot for your help again.
I already use the 64-bit Desktop version and have also closed other applications. My computer has 8 GB of RAM, but Power BI still aborts the above calculation for my 250,000 rows.
Is there maybe a smarter way to build a slicer that distinguishes between unique and all values for my data?
Thanks for any idea on the above.
Best regards
Alex
Hi @agross,
I think the issue may be related to my formula: it calculates over the full table for every row, so it costs a huge amount of memory (due to the number of records).
Maybe you can try the measure below instead; drag it into a table visual to display the tag. (Please add filters on the visual so the formula does not calculate over all records.)
Duplicate =
VAR currIndex = MAX ( 'Sample'[Index] )
VAR currID = SELECTEDVALUE ( 'Sample'[CONCATENATE] )
VAR currDate = MAX ( 'Sample'[Booking Date] )
VAR uniqueCount =
    CALCULATE (
        COUNT ( 'Sample'[CONCATENATE] ),
        FILTER (
            ALL ( 'Sample' ),
            [Index] <= currIndex
                && [CONCATENATE] = currID
                && [Booking Date] = currDate
        )
    )
RETURN
    IF ( uniqueCount >= 2, "Yes", "No" )
Regards,
Xiaoxin Sheng
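An alternative worth considering for the memory problem: since the calculated column re-scans the whole table once per row (roughly quadratic work on 250,000 rows), the tag can instead be pre-computed in a single linear pass before the data reaches the model, e.g. during data preparation in Power Query. A minimal sketch of that idea in Python, with illustrative column names and sample values (not the actual report data):

```python
from itertools import groupby

# Hypothetical rows with the load-order Index column.
rows = [
    {"index": 1, "key": "A1|LH400", "date": "2020-03-01"},
    {"index": 2, "key": "A1|LH400", "date": "2020-03-01"},
    {"index": 3, "key": "A2|LH400", "date": "2020-03-01"},
    {"index": 4, "key": "A1|LH400", "date": "2020-03-02"},
]

# Sort once by (key, date, index), then number the attempts inside each
# (key, date) group -- one pass instead of a full-table scan per row.
rows.sort(key=lambda r: (r["key"], r["date"], r["index"]))
for _, group in groupby(rows, key=lambda r: (r["key"], r["date"])):
    for attempt_no, r in enumerate(group, start=1):
        r["duplicate"] = "No" if attempt_no == 1 else "Yes"

rows.sort(key=lambda r: r["index"])  # restore the original load order
print([r["duplicate"] for r in rows])  # ['No', 'Yes', 'No', 'No']
```

The output matches the calculated column above, but the cost is one sort plus one pass, which scales to hundreds of thousands of rows without per-row table scans.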