puneetvijwani
Resolver IV

V-Order Functionality in Fabric Notebook

I've been having trouble optimizing my scripts in Fabric Notebooks. I am trying to control V-Order during data loading. According to the documentation and its examples, it should be possible to control Parquet V-Order at the DataFrame level using the parquet.vorder.enabled option, and if V-Order is disabled at the session level, I should still be able to turn it on for an individual write with this additional option. I am following this article:

https://learn.microsoft.com/en-us/fabric/data-engineering/delta-optimization-and-v-order?tabs=sparks...

epunvij_0-1691859175838.png

However, my attempts have been unsuccessful. I wrote a script that writes data in Delta format with V-Order alternately enabled and disabled. Despite this, both writes in the script appear to run with V-Order disabled within a single session.
Here are the relevant sections of the script:

epunvij_1-1691859210488.png

epunvij_2-1691859221880.png
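
The script in the screenshots is not recoverable here. As a rough sketch of the kind of test described above, assuming the session-level setting is `spark.sql.parquet.vorder.enabled` (taken from the linked article, not verified against the current runtime) and using hypothetical table names, the two writes might look like this:

```python
def write_delta(df, table_name, vorder_enabled):
    """Overwrite a Delta table, requesting V-Order on or off for this write only."""
    flag = "true" if vorder_enabled else "false"
    (
        df.write.format("delta")
        .mode("overwrite")
        .option("parquet.vorder.enabled", flag)  # write-level option from the article
        .saveAsTable(table_name)
    )
    return flag


def run_comparison(spark, df):
    """Disable V-Order for the session, then write one table with it on and one with it off."""
    # Assumed session-level switch; the name is an assumption based on the linked article.
    spark.conf.set("spark.sql.parquet.vorder.enabled", "false")
    write_delta(df, "vorder_on_test", vorder_enabled=True)    # hypothetical table name
    write_delta(df, "vorder_off_test", vorder_enabled=False)  # hypothetical table name
```

In a notebook this would be called as `run_comparison(spark, df_source)`, after which the two tables can be inspected separately.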

 

We've also looked through the delta logs for both outputs, in the Tables folder and in the Files folder, but there is no trace of the V-Order tag in either, which is puzzling and contrary to what the documentation describes.

This behaviour contradicts the following example from the official documentation, which suggests that a write-level setting should work when V-Order is set to false at the session level.


df_source.write\
  .format("delta")\
  .mode("overwrite")\
  .option("replaceWhere","start_date >= '2017-01-01' AND end_date <= '2017-01-31'")\
  .option("parquet.vorder.enabled","true")\
  .saveAsTable("myschema.mytable")
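
Writer options are passed as plain string keys. Whether Spark or Fabric trims whitespace from an option name is not confirmed anywhere in this thread, so it may be worth ruling out stray spaces in the key before drawing conclusions. A minimal sketch of such a check (the helper names are hypothetical):

```python
def keys_with_stray_whitespace(options):
    """Return the option keys that carry leading or trailing whitespace."""
    return [key for key in options if key != key.strip()]


def clean_options(options):
    """Return a copy of the options with whitespace stripped from every key."""
    return {key.strip(): value for key, value in options.items()}
```

The cleaned dictionary could then be handed to the writer with `df.write.options(**clean_options(opts))`.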

 

Given the confusion, I thought I could use some community guidance. Are there any additional configurations or prerequisites for disabling V-Order at the session level and then enabling it manually when writing a file, or is there anything else I may have missed?

10 REPLIES
puneetvijwani
Resolver IV

@DennesTorres Thanks for confirming.
I am talking to someone from Microsoft who is now looking into this.

puneetvijwani
Resolver IV

@DennesTorres To me this looks like a bug. When V-Order is turned off at the session level but requested at the DataFrame write level (by adding .option("parquet.vorder.enabled", "true")), there is still no sign in the metadata or logs that the table was optimized, not even at the time of its first creation.
I checked the _delta_log files, the Parquet metadata, and the table properties.

None of these three methods shows any V-Order marker. I am not sure whether any other method is available; if not, this sounds like a bug to me.

Hi,

 

I tried your script. It will help me a lot for other purposes, and I can confirm your result: if the optimization doesn't happen at the session level, there is no metadata pointing to the optimization.

To be absolutely sure about the result, I ran a test at the file level as well. The original article mentions that only the Parquet files affected by the write operation would be optimized.

But even at the file level, there is no metadata pointing to any optimization.

The script I tried, a variation of yours, is below. What's the link to the issue you registered?

import pyarrow.dataset as ds

def print_metadata(delta_file_path):
    """Print the schema-level key/value metadata of a Parquet file or dataset."""
    print("\nSchema Metadata:")
    print("--------------------")
    schema_metadata = ds.dataset(delta_file_path).schema.metadata
    if schema_metadata:
        for key, value in schema_metadata.items():
            print(f"{key.decode('utf-8')}: {value.decode('utf-8')}")
    else:
        print("No schema metadata found.")

# Parquet files that make up the table
parquet_files = [
    'part-00000-cea68c9b-4b96-4955-a1b4-1147b89e8a15-c000.snappy.parquet',
    'part-00001-9d86a807-c4b6-4c33-897e-eeedcfed4ca9-c000.snappy.parquet',
    'part-00002-694dfef3-bea1-4c8e-99bb-8c45a926e413-c000.snappy.parquet',
    'part-00003-dcbf684b-a2b0-4b20-8a63-c962f234e674-c000.snappy.parquet',
    'part-00004-ecb43c9b-1ffb-46a9-9289-50ac5d02683f-c000.snappy.parquet',
    'part-00005-dd4030dc-00be-43bd-873f-374393545d8a-c000.snappy.parquet'
    ]

# Check each file individually
for file_name in parquet_files:
    print_metadata('/lakehouse/default/Tables/dimension_city/' + file_name)
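
The schema metadata read by the script above is a dictionary of byte strings, which can be long. As a small sketch, the entries of interest could be filtered rather than printed in full. The substring `vorder` is only a guess at what a V-Order key might contain; the actual key name is not confirmed in this thread.

```python
def find_metadata_entries(metadata, needle="vorder"):
    """Return decoded key/value pairs whose key contains `needle` (case-insensitive).

    `metadata` is expected to be a dict of bytes -> bytes, or None.
    """
    if not metadata:
        return {}
    found = {}
    for key, value in metadata.items():
        key_text = key.decode("utf-8", errors="replace")
        if needle.lower() in key_text.lower():
            found[key_text] = value.decode("utf-8", errors="replace")
    return found
```

It would be called with the same object the script prints, for example `find_metadata_entries(ds.dataset(path).schema.metadata)`. An empty result means only that no key matched the guessed substring.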
 
Kind Regards,
 
Dennes

Hi,

Bug or missing feature, I don't know. When the optimization is set at the session level, it seems strange to me that 'V-ORDER' appears only as a TAG, a property that I don't think is intended to carry something as important as the optimization format of the file (what are your thoughts?).

When optimizing at the write level, the optimization is designed to affect specific Parquet files, so it is not included in the TAG of the table, and since it was not included in the metadata either, we are left with no way to confirm the optimization. There is some logic to this, but it is still a gap.

We end up with many questions: Why does it only appear in the TAG?

How can we identify the optimization on individual Parquet files?

What about OptimizeWrite, for example, which appears nowhere?

Kind Regards,

Dennes

@DennesTorres Based on this official doc, I can say it is indeed an optimization at the level of the Parquet files:
https://learn.microsoft.com/en-us/fabric/data-engineering/delta-optimization-and-v-order?tabs=sparks...

 

When I turn it on at the session level, I see it applied in the Parquet files' metadata using pyarrow (code above).

epunvij_0-1691922829527.png

 

We also see the tag at the delta log level, so my guess is that the tag is in the delta log for a reason.

As I understand it, the accurate way is to check at the level of the Parquet metadata. However, in the issue I described, when I turn V-Order off at the session level and try to control it at the DataFrame write level, I do not see the expected behavior in the Parquet metadata.

 

puneetvijwani
Resolver IV

@DennesTorres I checked the metadata of the Parquet files using pyarrow, printing the dataset-level schema metadata for the table with the function below. I still don't see any metadata related to V-Order, and I am not sure what I am missing. Can you test this at your end and see whether you get the same behaviour?

import pyarrow.dataset as ds

def print_metadata(delta_file_path):
    """Print the schema-level key/value metadata of a Parquet file or dataset."""
    print("\nSchema Metadata:")
    print("--------------------")
    schema_metadata = ds.dataset(delta_file_path).schema.metadata
    if schema_metadata:
        for key, value in schema_metadata.items():
            print(f"{key.decode('utf-8')}: {value.decode('utf-8')}")
    else:
        print("No schema metadata found.")

# Test the function with a path to your delta table
# print_metadata('/lakehouse/default/Tables/table_name')



Hi,

I will check.

But this may be related to my previous comment: on tables optimized through the session-level configuration, the V-ORDER optimization appears only as a TAG. It doesn't appear as metadata, and I don't know exactly why. Would this mean the table is not optimized, or do we need a different way to identify that it is optimized?

DennesTorres_0-1691881457333.png

 

When trying the optimization on the write, instead of at the session level, the TAG doesn't appear either.

Kind Regards,

 

Dennes

 

DennesTorres
Post Prodigy

Hi,

About this:

"But despite this, both instances in the script seem to be executing with V-Order disabled for both of them in a single session"

How do you know the v-order was disabled?

"We've also looked through our delta logs for both files in Tables and in Files folder, but there are no traces of the Vorder tag to be found, which is puzzling and contrary."

Could you give more details about this?

 

I have been working on a related challenge: How to identify if an existing table was created with v-order enabled or not?

Here is a thread about my investigation: https://community.fabric.microsoft.com/t5/General-Discussion/How-to-list-Table-Properties/m-p/337130...

Another one which may or may not be related: https://community.fabric.microsoft.com/t5/Issues/Workspace-level-boolean-spark-configurations-appear...

Kind Regards,

 

Dennes

@DennesTorres
I am manually checking the _delta_log files in OneLake explorer to see whether the VOrder tag is present and set to true.

You can also write Spark code to check the delta log details from a notebook.
Here is my code to read the metadata of the table:

 

table_base_path = "your path to the table"

log_path = f"{table_base_path}/_delta_log"
# List all JSON commit files in the _delta_log directory, oldest first
# (mssparkutils is available by default in Fabric notebooks)
log_files = sorted(f.path for f in mssparkutils.fs.ls(log_path) if f.name.endswith(".json"))
# Check if there are any log files
if log_files:
    # Read the first log file (head returns only the beginning of a large file)
    data = mssparkutils.fs.head(log_files[0])

    # Print the contents of the file
    print(data)
else:
    print("No log files found.")
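
Each commit file in `_delta_log` is newline-delimited JSON, one action per line, so the printed text can be searched for tags programmatically instead of by eye. A minimal sketch, assuming the tag sits in a `tags` map on an action (the Delta `add` action has an optional `tags` field; the exact tag name Fabric writes, for example `VORDER`, is an assumption here):

```python
import json


def collect_tags(log_text):
    """Return (action, path, tags) for every action in a commit file that carries tags."""
    results = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # A truncated last line (e.g. from a partial read) is skipped.
            continue
        for action, body in record.items():
            if isinstance(body, dict) and isinstance(body.get("tags"), dict):
                results.append((action, body.get("path"), body["tags"]))
    return results
```

With the snippet above this would be `collect_tags(data)`. An empty list means no action in that commit had a `tags` map at all.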

However, thanks for pointing me to the Issues page. I should post this as an issue instead.

Hi,

 

I tested your script. All the tables I have were marked with V-ORDER, but what caught my attention was that V-ORDER was only a TAG; it was not in the metadata. Is this correct?

I managed to turn off V-ORDER optimization at the session level and the TAG disappeared. After that I reproduced your results: trying to enable V-Order for one specific write operation doesn't bring the TAG back.

One detail about this and the article you are using: it mentions that the write configuration affects only the Parquet files involved in the operation, not the entire table. Could this result in a table with mixed Parquet files, some with the optimization and some without? Could this explain why V-Order doesn't appear at the table level?
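
If a per-file indicator can be obtained (from file-level tags in the delta log or from Parquet metadata, neither of which is confirmed in this thread), the mixed-table question could be answered mechanically. A small sketch with a hypothetical helper that takes a mapping of file path to a boolean flag:

```python
def classify_table(file_flags):
    """Summarize per-file flags as 'all', 'none', 'mixed', or 'empty'.

    `file_flags` maps each Parquet file path to True (optimized) or False.
    """
    if not file_flags:
        return "empty"
    values = set(file_flags.values())
    if values == {True}:
        return "all"
    if values == {False}:
        return "none"
    return "mixed"
```

A result of `"mixed"` would correspond to the situation asked about above, where only some files of the table were written with the optimization.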
