demothy
Frequent Visitor

Delta Tables in Lakehouse missing in SQL Endpoint

I have a number of tables loaded from a large set of incremental JSON files. Once my load completed, I noticed that some of the tables I can see in the Lakehouse are not showing up in the SQL Endpoint. I've tried refreshing and restarting my browser, but still no luck.

I also noted that SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES does not include the missing Delta tables, but I can clearly see that they exist in the Lakehouse, and checking their properties there indicates the missing tables are all both "Managed" and "Delta" tables.
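For anyone hitting the same thing, one way to compare the two lists is something like this (a sketch only; the endpoint side has to be queried separately, and the table names below are placeholders):

# Sketch: compare the tables Spark sees with what the SQL Endpoint reports.
# Run in a notebook attached to the Lakehouse.
spark_tables = {t.name for t in spark.catalog.listTables()}
print(sorted(spark_tables))

# Result of SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES, run against
# the SQL Endpoint and pasted in here (placeholder names):
endpoint_tables = {"table_a", "table_b"}
print("Missing from SQL Endpoint:", sorted(spark_tables - endpoint_tables))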

 

I've been running initial loads repeatedly, and my code drops the Delta table before each initial load. It seems the tables that are missing are the ones that took the longest to load.

I have a Data Pipeline that uses a Lookup activity to return metadata to a ForEach (which runs all iterations in parallel).

In the ForEach I call a "controller" notebook, passing the metadata to it.
The controller notebook calls a "metadata parser" notebook, and the result of the parser is passed to a call of my "data notebook". I was experiencing timeouts on longer-running jobs, so I set the Timeout property on the msu.notebook.run call to the "data notebook" to 2400 seconds, and that issue was resolved (a sketch of the call is below).
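Roughly like this (a sketch; msu is assumed here to alias the built-in mssparkutils, and the notebook name and parameter are placeholders, not my real ones):

# Sketch of the nested notebook call with an explicit timeout.
msu = mssparkutils   # assumption: msu aliases the built-in mssparkutils
result = msu.notebook.run(
    "data_notebook",                                       # placeholder notebook name
    2400,                                                  # timeout in seconds
    {"table_metadata": "<output of the metadata parser>"}  # placeholder parameter
)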

 

I am running this inside a trial Fabric capacity.

 

Any help appreciated.

@v-nikhilan-msft 


3 REPLIES
demothy
Frequent Visitor

Thanks @v-nikhilan-msft 

I note that after the weekend the missing tables have now shown up in the SQL Endpoint. However, I'm not sure the business is prepared to wait several days for data to be available after ingestion.

How are you loading these tables into the Lakehouse?
I'm creating the tables using:

(df
 .write
 .mode(mode)                                         # mode = "overwrite" for initial loads
 .partitionBy("file_name", "year", "month", "day")   # partition columns
 .format("delta")
 .save(deltaTableName))                              # save() writes the Delta files to a path
 
note: mode = "overwrite"
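One thing I'm wondering about: .save() writes the Delta files to a path and relies on the Lakehouse to discover the table afterwards, whereas .saveAsTable() registers the table in the catalog as part of the write. If metadata sync lag is the issue, would a variant like this behave differently? (A sketch, not a confirmed fix; tableName would be a catalog table name like "my_table", not a path.)

# Sketch: register the table as part of the write instead of writing to a
# path and relying on later metadata discovery (tableName is a catalog
# table name, e.g. "my_table", not a path).
(df
 .write
 .mode("overwrite")
 .partitionBy("file_name", "year", "month", "day")
 .format("delta")
 .saveAsTable(tableName))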

How large is the volume of the tables?
I'm new to notebooks/PySpark/Delta tables, having come from a SQL-based background, so I'm not sure how to calculate that in the terms you're asking for.

2) The issue can be due to a schema mismatch. You might need to update the schema of your Delta tables or fix any schema mismatches to ensure that all tables are included in the INFORMATION_SCHEMA.TABLES.
In terms of the schema: in my data load code, the initial load reads two input folders into two DataFrames, one containing "current files" and the other "archived files". The code unions the complete list of columns from both DataFrames to create a common column list, and then ensures both DataFrames have the same columns in the same order. The DataFrames are then unioned, and the code applies rules to force data types on specific columns, since the schema inferred for each DataFrame can vary. For example: if column_name ends with "_ID" then IntegerType(), or if column_name = "LAST_MODIFIED_DATE" then TimestampType(), and so on. (A rough sketch of this pattern is below.)
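A rough sketch of that pattern (the DataFrame names and cast rules are illustrative, not my exact code):

from pyspark.sql.functions import col, lit
from pyspark.sql.types import IntegerType, TimestampType

# Build the common column list from both input DataFrames.
common_cols = sorted(set(df_current.columns) | set(df_archived.columns))

def align(df):
    # Add any missing columns as nulls, then fix the column order.
    for c in common_cols:
        if c not in df.columns:
            df = df.withColumn(c, lit(None))
    return df.select(common_cols)

df_all = align(df_current).unionByName(align(df_archived))

# Force data types by rule, since the inferred schemas can differ per folder.
for c in df_all.columns:
    if c.endswith("_ID"):
        df_all = df_all.withColumn(c, col(c).cast(IntegerType()))
    elif c == "LAST_MODIFIED_DATE":
        df_all = df_all.withColumn(c, col(c).cast(TimestampType()))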
 
Kind Regards
Tim

v-nikhilan-msft
Community Support

Hi @demothy
Thanks for providing the details. As you are able to see the tables now, this might be a temporary glitch. Please let me know if you face the issue again.
Thanks

v-nikhilan-msft
Community Support

Hi @demothy 
Thanks for using Fabric Community.
As per your description, the reason for the tables not showing in the SQL endpoint might be one of the following:

1) The visibility of tables in the SQL endpoint is more likely related to how the tables were created. If the tables were created with Spark code, they won’t be visible in the SQL endpoint. (This is not valid in your case as you have clearly specified the table properties are managed and delta)
2) The issue can be due to a schema mismatch. You might need to update the schema of your Delta tables or fix any schema mismatches to ensure that all tables are included in the INFORMATION_SCHEMA.TABLES.

3) If the initial loads for some tables took longer due to resource limitations, the Delta tables might not have been created successfully. This could explain the missing tables, even though they exist in the Lakehouse (but not fully formed).
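On point 3, one way to verify that a table's Delta log was written completely is to inspect it from a notebook, for example (a sketch; the path is illustrative):

from delta.tables import DeltaTable

# Sketch: check whether the folder holds a valid Delta table and look at
# its most recent commits (the path is illustrative).
path = "Tables/my_table"
if DeltaTable.isDeltaTable(spark, path):
    DeltaTable.forPath(spark, path).history(5).show(truncate=False)
else:
    print("No valid _delta_log found at", path)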


Limitations of the SQL endpoint:
Limitations - Microsoft Fabric | Microsoft Learn
Try checking the data types of the columns: Data types - Microsoft Fabric | Microsoft Learn
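To list the column types of a suspect table for comparison against that article, something like this works from a notebook (the table name is illustrative):

# Sketch: print the column names and types of a suspect table so they can
# be compared against the SQL endpoint's supported data types.
for name, dtype in spark.table("my_table").dtypes:
    print(name, dtype)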

How are you loading these tables into the Lakehouse? How large is the volume of the tables?

Please let me know if you have any further questions.
