Anonymous
Not applicable

Inconsistent CSV Update/Overwrite with Python

Context: my goal is to have two types of Excel files/CSVs:

  • Current: Always have updated information (many files)
  • Ledger: A log of past information from the above files (using date modified as date) (one file)

I got this to happen once by doing the following:

  1. Checked out this site: https://www.thebiccountant.com/2018/12/29/export-data-from-power-bi-to-csv-using-python/
  2. Pulled in Current files and combined in PowerBI
  3. Pulled in Ledger
  4. In python, checked if Current files rows were in Ledger already
    1. If they weren't, added to Ledger in PowerBI (not yet on my computer)
  5. Wrote Ledger to the same filename and filepath as the original Ledger (overwrite, now on my computer), using Power Query and Python code similar to the link above (a rough sketch of that write step is below).
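For reference, the write step follows roughly the pattern from that article. This is only a sketch: "dataset" is the name Power BI gives the incoming table inside a Run Python script step (you can see it in the error traceback further down), and the output path here is just a placeholder for the real OneDrive location.

import csv

# "dataset" is the pandas DataFrame that Power BI passes into the
# Run Python script step; the path below is only a placeholder.
out_path = r"C:\OneDrive\Placeholder\Provider_Log.csv"

# Overwrite the Ledger CSV on disk
dataset.to_csv(out_path, mode="w", index=False, header=True, quoting=csv.QUOTE_ALL)

# Any DataFrame left in scope (here, dataset itself) is returned to
# Power Query as a table, so the query still loads normally.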

 

I know I got it to happen because my Ledger originally only had one Test row, and it now additionally has every Current row.

 

HOWEVER, it will not happen fully again. Here is what's up:

  1. I change one line of one file in Current files
  2. This DOES change that line in my direct PowerBI pull; the combined Current files query is updated
  3. This DOES NOT change the output of my python code, the Ledger
  4. Therefore, the Ledger on my computer gets overwritten with an identical copy of itself

 

Here is my python code for reference, with some extra notes:

 

import pandas as pd
import numpy as np
import datetime

log = dataset

# Returns the Unique ID of a row
def un_id(series):
    return ((str(series["Therapist"]) + "_" + str(series["Agency"]) + "_" + str(series["Provider Office"])).replace(" - ", "-").replace(" ", "_"))

# Gives both the log (Ledger) table and the prov (Current combined) table an ID column
log.loc[:, "ID"] = log.apply(un_id, 1)
prov.loc[:, "ID"] = prov.apply(un_id, 1)

# Dates were taken in as strings; change them to timestamps
log.loc[:, "Date"] = log["Date"].map(pd.to_datetime)
prov.loc[:, "Date"] = log["Date"].map(pd.to_datetime)

# Checks whether a row [in prov/Current combined] is already in Ledger: finds Ledger rows
# with the same Unique ID and, if none of them has the same Date, returns the series
def not_in_log(series):
    matches = log[log["ID"] == series["ID"]]
    if series["Date"] not in list(matches["Date"]):
        return series
# Goes through all prov/Current combined rows and keeps the ones not in Ledger
new_rows = prov.apply(not_in_log, 1)
new_rows = new_rows.dropna(how="all")
new_rows = pd.DataFrame(new_rows)

# Adds the new rows to the Ledger and keeps the Ledger's column order
log = log.append(new_rows, ignore_index=True)
log = log.loc[:, list(prov.columns)]
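
To help spot why nothing new shows up, a small diagnostic table like the one below could be tacked onto the end of the script. The names follow the code above; the diagnostics frame itself is just an extra debugging aid, not part of my original script.

# Optional debugging aid: summarizes what the check above actually found.
# Power BI exposes this DataFrame as an extra table in the step output.
diagnostics = pd.DataFrame({
    "current_rows": [len(prov)],
    "new_rows_found": [len(new_rows)],
    "ledger_rows_after": [len(log)],
})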

Thanks in advance for the help!

v-juanli-msft
Community Support

Hi @Anonymous 

It seems the first goal is achieved with your code.

  • Current: Always have updated information (many files)
  • Ledger: A log of past information from the above files (using date modified as date) (one file)

The second goal means storing historic data in files, and those historic files should not be overwritten.

If so, please refer to a brilliant article: Storing and using information from a dynamic data source using PBI desktop.

 

Best Regards
Maggie
Community Support Team _ Maggie Li
If this post helps, then please consider accepting it as the solution to help other members find it more quickly.

 

Anonymous
Not applicable

Thanks for the pointer!

 

I could certainly run my Python in a standalone script that I schedule with Task Scheduler; in this scenario, I wouldn't even need to go through PowerBI for the updates (a rough sketch of that is below).
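
A minimal sketch of what that standalone script might look like, assuming the same columns as in my code above. The file locations (LEDGER_PATH, CURRENT_GLOB) are placeholders, and the merge-based check is an alternative to my apply-based one, not what I currently run.

import glob
import pandas as pd

LEDGER_PATH = r"C:\OneDrive\Placeholder\Provider_Log.csv"    # placeholder path
CURRENT_GLOB = r"C:\OneDrive\Placeholder\Current\*.csv"      # placeholder path

# Same Unique ID as in the report script
def un_id(series):
    return ((str(series["Therapist"]) + "_" + str(series["Agency"]) + "_" + str(series["Provider Office"])).replace(" - ", "-").replace(" ", "_"))

log = pd.read_csv(LEDGER_PATH, parse_dates=["Date"])
prov = pd.concat((pd.read_csv(p, parse_dates=["Date"]) for p in glob.glob(CURRENT_GLOB)), ignore_index=True)

log["ID"] = log.apply(un_id, axis=1)
prov["ID"] = prov.apply(un_id, axis=1)

# Keep only Current rows whose (ID, Date) pair is not already in the Ledger
known = log[["ID", "Date"]].drop_duplicates()
flags = prov.merge(known, on=["ID", "Date"], how="left", indicator=True)
new_rows = prov[flags["_merge"].eq("left_only").to_numpy()]

# Append the new rows and overwrite the Ledger file
updated = pd.concat([log, new_rows], ignore_index=True)[list(prov.columns)]
updated.to_csv(LEDGER_PATH, index=False)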

 

However, because of the data hosting/security context I'm in, it is preferable that everything stays in PowerBI (and OneDrive, where the files are stored). 

 

I'm confused because I was able to get the process to happen once (the initial ledger "fill"), but it won't happen a second time to update.

 

I've also gotten this message, which I don't know how to fix; I think it has something to do with security (filepath removed for privacy):

 

DataSource.Error: ADO.NET: Python script error.
Traceback (most recent call last):
  File "PythonScriptWrapper.PY", line 17, in <module>
    dataset.to_csv(path, mode = 'w', index = None, header = True, quoting = csv.QUOTE_ALL, chunksize = None, decimal= '.')
  File "C:\Users\kwilliams\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\generic.py", line 3228, in to_csv
    formatter.save()
  File "C:\Users\kwilliams\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\formats\csvs.py", line 183, in save
    compression=self.compression,
  File "C:\Users\kwilliams\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\common.py", line 399, in _get_handle
    f = open(path_or_buf, mode, encoding=encoding, newline="")
PermissionError: [Errno 13] Permission denied: 'FILEPATH/Provider_Log.csv'

Details:
    DataSourceKind=Python
    DataSourcePath=Python
    Message=Python script error.
Traceback (most recent call last):
  File "PythonScriptWrapper.PY", line 17, in <module>
    dataset.to_csv(path, mode = 'w', index = None, header = True, quoting = csv.QUOTE_ALL, chunksize = None, decimal= '.')
  File "C:\Users\kwilliams\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\generic.py", line 3228, in to_csv
    formatter.save()
  File "C:\Users\kwilliams\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\formats\csvs.py", line 183, in save
    compression=self.compression,
  File "C:\Users\kwilliams\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\common.py", line 399, in _get_handle
    f = open(path_or_buf, mode, encoding=encoding, newline="")
PermissionError: [Errno 13] Permission denied: 'FILEPATH/Provider_Log.csv'

    ErrorCode=-2147467259
    ExceptionType=Microsoft.PowerBI.Scripting.Python.Exceptions.PythonScriptRuntimeException
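
The failing call is the plain to_csv overwrite shown in the traceback. One thing that might be worth trying (just an idea on my part, not something from the article) is writing to a temporary file and retrying on PermissionError, in case OneDrive sync or Power BI itself briefly holds a lock on the Ledger. The safe_write helper below is hypothetical:

import csv
import os
import time

def safe_write(df, path, retries=3, delay=2):
    # Write next to the target, then swap it into place with os.replace;
    # retry a few times in case the file is temporarily locked.
    tmp_path = path + ".tmp"
    for attempt in range(retries):
        try:
            df.to_csv(tmp_path, index=False, quoting=csv.QUOTE_ALL)
            os.replace(tmp_path, path)
            return
        except PermissionError:
            time.sleep(delay)
    raise PermissionError("Could not write " + path)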
