I generated partitioned Azure Data Lake Storage Gen2 Parquet files using a Spark notebook and can query them with no issues in Azure Synapse Analytics using serverless SQL.
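For reference, the kind of ad-hoc serverless SQL query that works here looks roughly like this (the account and container names are the placeholders from this post, not real values):

-- ad-hoc serverless SQL query over the partitioned Parquet output
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://<myaccount>.dfs.core.windows.net/mothership/Transactions/**',
    FORMAT = 'PARQUET'
) AS t;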
When I attempt to create a DataFlow using the Azure Data Lake Storage Gen2 connector with the URL https://<my account>.dfs.core.windows.net/mothership/Transactions/SellingOrganizationId=3141, I see the 4 expected files. However, when I press Combine, I get an error.
Am I doing something wrong, or is this not supported?
Here is the POC Spark notebook (C#, .NET for Apache Spark) that generates the Parquet files:
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Types;
using static Microsoft.Spark.Sql.Functions; // needed for Col()

// read the raw JSON transaction files (spark is the session the notebook provides)
var df = spark.Read().Json("abfss://bdmtest@<myaccount>.dfs.core.windows.net/FLT/Trans/2011/04/30/**");

// get the schema of the data frame
var dfSchema = df.Schema();

// flatten each struct column into top-level "Parent.Child" columns, then drop the struct
foreach (var parentSchemaField in dfSchema.Fields)
{
    if (parentSchemaField.DataType is StructType)
    {
        var childFrame = df.Select($"{parentSchemaField.Name}.*");
        foreach (var childSchemaField in childFrame.Schema().Fields)
        {
            df = df.WithColumn($"{parentSchemaField.Name}.{childSchemaField.Name}",
                Col($"{parentSchemaField.Name}.{childSchemaField.Name}"));
        }
        df = df.Drop(parentSchemaField.Name);
    }
}

// create the computed Year, Month, and Day columns for partitioning
df.CreateOrReplaceTempView("tmpTable");
var dfparquet = spark.Sql("SELECT *, SUBSTRING(CreateDate,1,4) AS Year, SUBSTRING(CreateDate,6,2) AS Month, SUBSTRING(CreateDate,9,2) AS Day FROM tmpTable");

// write the partitioned Parquet files to the target container
dfparquet.Write()
    .PartitionBy("SellingOrganizationId", "Year", "Month", "Day")
    .Mode(SaveMode.Overwrite)
    .Parquet("abfss://mothership@<myaccount>.dfs.core.windows.net/Transactions");
This works just fine if you use Power BI Desktop to load the data, but you can only use the top-most partition, in this case SellingOrganizationId=value.
I am going to reply to my own question and hopefully it will close this issue and help someone else.
Anyway, the issue was that I could not get it to work in a DataFlow, and even if I had, I doubt that Query Folding would have worked.
So the alternative was to use serverless SQL in Azure Synapse.
This all worked fine until I discovered that you cannot refresh CETAS (CREATE EXTERNAL TABLE AS SELECT) tables.
But if you do everything else the same and, rather than creating a CETAS table, you create a view using OPENROWSET, then Bob's your uncle.
Appending additional data to the underlying ADLS Gen2 container as Year=xxxx/Month=xx/Day=xx works as expected. I can see the refreshed data when I am attached using SSMS or when I query in Azure Synapse. I still have to refresh the DataFlow, and for now, until I can test using the enhanced compute engine, I also have to refresh the DataSet, but it does work in a DataFlow as expected.
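For example, spot-checking a newly appended day from SSMS or Synapse Studio is just a normal query against the view sketched above (the date here is the sample one from my notebook):

-- verify that an appended partition is visible without any refresh
SELECT COUNT(*) AS RowsLoaded
FROM dbo.vwTransactions
WHERE [Year] = '2011' AND [Month] = '04' AND [Day] = '30';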
Now I need to test using Lake Database tables to determine if they see appended data automatically.