Element115
Power Participant

BUG::New table in data warehouse not detected

STEPS

0__create a new data warehouse

1__create a new table in schema [dbo] (with column data types matching or compatible with on-prem SQL Server 2017 source)

2__wait, then refresh until table appears

3__create new Dataflow Gen2

4__connect to on-prem DB and select view that will be used as the source to populate the data warehouse table

5__select newly created data warehouse as destination

6__select newly created table as Existing table

 

-->  ERROR: the table never shows up after selecting Existing table, and the data warehouse is referred to as a table in the message displayed; see screenshot below

 

Screenshot 2024-01-15 133114.jpg

 

Somehow I don't think that is the expected behavior.  Please fix.
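For anyone trying to reproduce this, a quick way to rule out the warehouse itself is to query its metadata directly through the SQL endpoint. This is only a sanity-check sketch; TS60 is the table name from the DDL posted further down the thread, so substitute your own:

-- Sanity check: is the new table visible to the warehouse SQL endpoint at all?
SELECT TABLE_SCHEMA, TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'dbo'
  AND TABLE_NAME = 'TS60';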

16 REPLIES
frlunett
Employee

Thanks for the quick reply.

 

I was referring to the refresh button highlighted in this UI:

frlunett_0-1705362100688.png

 

For the slow refresh issue, would you be able to share the refresh session or request ID? You can find them in the Refresh history dialog:

frlunett_1-1705362184084.png


I'm assuming both dataflows belong to the same workspace. In a scenario like this, it might be worth disabling staging on the query that pulls the data from the first dataflow and writes it into the output destination, as the data is already in the staging Lakehouse (Data Factory Spotlight: Dataflow Gen2 | Microsoft Fabric Blog | Microsoft Fabric).

 



EDIT 5 mins after writing what's below:

 

This is a WTF moment!  I left the forum and navigated back to Fabric and opened DF2 without changing anything, closed it, navigated to DataflowsStagingLakehouse and BOOM!  The error message you see in the screenshot was gone, the parquet files were gone, and now under Table, I finally had tables!  

 

So we need to let a few minutes go by in order for the tables to show up, is that it?

 

One issue remains: the tables staged from DF1 and DF2 have both lost the source column names. However, the final table in the staging lakehouse, that is, the table output by DF2, has all the source column names. This is very strange and makes no sense. Shouldn't the staged tables from DF1 and DF2 also carry the same column names?
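One possible stopgap, purely a sketch: if the staged table is reachable through the SQL analytics endpoint and its generic columns line up positionally with the DDL posted later in this thread, a view can map the names back. The staged table name below is made up.

-- Hypothetical workaround: restore readable column names over a staged table
-- that only exposes Column1, Column2, ...  dbo.TS60_staged is a placeholder
-- name and the positional mapping is an assumption.
CREATE VIEW dbo.TS60_named AS
SELECT
    Column1 AS ID_TS
,   Column2 AS NSF
,   Column3 AS [_DATE]
,   Column4 AS [_TIME]
,   Column5 AS INTENSITY
,   Column6 AS OCCUPANCY
,   Column7 AS VELOCITY
,   Column8 AS VOLUME
FROM dbo.TS60_staged;
GO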

 

Finally, this is a nasty workaround as you end up staging the same data thrice.  Talk about duplication.  So... what's the ETA on a proper data gateway that will allow Fabric pipelines to directly access on-prem DBs?

 

 

##################################################################################

I restarted the whole process. Both DF1 and DF2 have Enable staging set because, as I said before, the Publish button disappears otherwise, so you have to enable staging. Interestingly, this time both DFs, including their respective refreshes, completed in only 8 minutes!

 

The main difference between this test and the previous one is that in the latter the destination of DF2 was set to a warehouse I had explicitly created, whereas in this test the DF2 destination was set to the staging lakehouse that gets provisioned automatically by the dataflow.

 

A table got created in the lakehouse with the name I had chosen. Unfortunately, it appears under the Table hierarchy as a folder containing parquet files instead of tabular data, along with the error message shown in this screenshot. I don't know what's going on, but after days of trying, it's fair to say there is a bug that needs fixing, because otherwise it will be impossible to ingest data from an on-prem SQL Server 2017 DB.

 

Screenshot 2024-01-15 213205.jpg

So... DF2 eventually failed with a WriteToDataDestination error. Here are the two screenshots of the run result:

 

p1_Screenshot 2024-01-15 205108.jpg
p2_Screenshot 2024-01-15 205045.jpg

 

Still haven't been able to properly ingest data from an on-prem SQL Server DB using a Dataflow Gen2.  How is one supposed to do this?  Is it even possible???

 

Also, I read the blogs that say to open port 1433 on the machine that runs the Power BI data gateway locally.  So we did update the firewall rule according to the Microsoft help page on this topic, just before this last attempt with the 2 chained Dataflows.  And yet, it still ends in failure.

Sorry, my bad.  I didn't realize that the datetime was an active link.  Here are the IDs.... Request and Session IDs are empty though. 

 

Screenshot 2024-01-15 202222.jpg

By the way, Francesco, is the refresh going to run forever?  I'd like to stop it, seeing that it's been running for more than 5 hours now, but there is no Stop button or menu option anywhere and most options are greyed out.

Re the refresh ID, it is still refreshing so everything is greyed out for now.

 

Yes, both DFs belong to the same workspace.  I did not disable Staging for DF1 when I started this test, but I did disable Staging for DF2 because I didn't want the data stored in the default lakehouse, rather in a lakehouse I created.  So I just opened DF1 and disabled Staging now, BUT then the Publish button disappears!  Dunno if it is because the DF2 refresh is still running and depends on DF1, or whether the button would disappear regardless when first creating the DF.

 

The DataflowStagingLakehouse/DataflowStagingWarehouse are reserved artifacts used by the dataflow engine internally to stage data and should not be used as output destinations (we are working on improving the UI to prevent this scenario).

 

The way I see it, you should be able to create a single dataflow that pulls your data from the on-prem SQL server and pushes it into a WH you created in the same workspace where your dataflow is running. In this case, the source query, the one that pulls the data from the on-prem SQL server and has the destination set up, can be configured with staging enabled or disabled (and the Publish button should be enabled since you have at least one output destination). Depending on whether you are performing transformations on top of the data being moved, you can expect perf gains by enabling staging (but you would need to structure your query effectively). Chances are that in your scenario staging doesn't have any positive impact - and it could actually make sense to disable it.

 

I'm following up internally on the 5-hour-long running dataflow. It seems to be a known issue and we'll work on a fix.

 

Thanks for continuing to report feedback!

By the way, I am trying to repurpose a third dataflow and chain it to a fourth one, so as to do the same thing I did with DF1 and DF2 but against a lakehouse I created instead of the staging lakehouse, and I have come across multiple issues... one of which is the following:

 

Since DF3 already had a destination set, I deleted it, published, and then re-opened DF3 only to see that the destination was set again EVEN THOUGH I deleted it!  So to be clear, after choosing Delete in the menu you see in the screenshot and publishing the DF, the delete did not have any effect, even though the UI showed that the destination had been cleared.



Screenshot 2024-01-16 133745.jpg

EDIT::one needs to set a destination first and only then disable staging, in order for the Publish button not to disappear.  There is no way to publish if staging is disabled and no destination is set.

 

Interesting... the Publish button does not disappear if I set the destination before disabling Enable staging.  So I tried again, but this time the destination is a lakehouse I created just for this.  And it does not work: no data, no table created by the DF in the lakehouse.  This is with one DF, not the workaround with 2 DFs.  Whereas with Enable staging, at least you get the data, but not the source column names, which get replaced by generic names such as Column1, Column2, etc.
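To tell whether the run never created the table or created it but wrote no rows, a quick check against the destination's SQL analytics endpoint is enough. Sketch only; TS60 is the table name used elsewhere in this thread:

-- Does the destination table exist, and did any rows land?
IF OBJECT_ID(N'dbo.TS60', N'U') IS NOT NULL
    SELECT COUNT_BIG(*) AS rows_loaded FROM dbo.TS60;
ELSE
    SELECT 'table was never created' AS result;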

 

My network admin did open port 1433 and set the firewall rule according to the Microsoft doc for the on-prem Power BI data gateway, so...

 

Here is the screenshot:

 

Screenshot 2024-01-16 130904.jpg

From our telemetry it does seem like a gateway connectivity issue. Can you check the error message that's shown when you click on the Sector_TS_* activity from the refresh history dialog?

 

We do have a blog post that describes limitations and considerations when using the Data Gateway with an output destination in the context of Dataflow Gen2: On-premises data gateway considerations for data destinations in Dataflow Gen2 - Microsoft Fabric | ...

 

Finally, it is by design that a dataflow cannot be published when its queries are neither marked as staged nor have an output destination set.

 

Thanks,
Frank.

Thanks Frank, sorry for the late reply; I got distracted by all the other support tickets that were opened.  The data gateway works fine, at least with the Power BI Service; the right ports are open in the firewall and so on.  Let's pause the data warehouse issue for now, as all the issues concerning data ingestion into the lakehouse are being worked on.

Hello @Element115,

We haven't heard from you since the last response by @frlunett and wanted to check whether you have found a resolution yet.
If you have, please share it with the community, as it can be helpful to others.
Otherwise, please reply with more details and we will try to help.


Apologies, I got sidetracked by all the support tickets opened for the other bugs that popped up when using a lakehouse to ingest data from the on-prem DB.  Let's put the data warehouse on hold for now, as our scenario benefits more from a lakehouse anyway, and let's see if support manages to find a solution to the issues blocking not only ingestion from on-prem, but also from a lakehouse source.

 

Here's hoping that the new OPGW will solve all these issues and allow bulk data copy from an on-prem DB to a lakehouse using a simple pipeline.

frlunett
Employee

Hello @Element115,

 

would you be able to share the SQL statement you used to create the table? What is the name of the table in the Warehouse?

 

Does it help to refresh the navigator view by pressing the refresh button next to "Display options" on the left pane in the above UI?

 

Thanks,

Francesco.

Sure thing. Here's the DDL:

 

CREATE TABLE [dbo].[TS60]
(
    ID_TS       varchar(32) NOT NULL
,   NSF         int 
,   _DATE       date
,   _TIME       time(0)
,   INTENSITY   numeric(10, 6)
,   OCCUPANCY   numeric(8, 6)
,   VELOCITY    numeric(9, 6)
,   VOLUME      int
)
GO

 

Not sure what you are referring to by 'Display options.'  If you are talking about refreshing the table: yes, and the table shows up when we open the warehouse in the Data Engineering persona.  I also did a regular browser refresh.  None of which mattered.

 

I am currently running a workaround which consists of creating 2 Dataflows Gen2.  The first ingests from the on-prem DB but without a destination inside Fabric.  The second DF ingests from the previous DF and this time defines a Fabric data warehouse as the destination.  So far, no errors, but the refresh has been running for more than 3 hours now.  The total dataset, which comes from a SQL view, is about 3 million records.  It's extremely slow, apparently.
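For scale context, the source volume can be gauged with a plain count against the view on the on-prem side; the view name below is a placeholder, not the real one:

-- Run against the on-prem SQL Server 2017 source.
SELECT COUNT_BIG(*) AS source_rows
FROM dbo.vw_Sector_TS;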
