Register now to learn Fabric in free live sessions led by the best Microsoft experts. From Apr 16 to May 9, in English and Spanish.
Hello ,
Previously i posted this question : Solved: Re: Moving values from columns - Microsoft Power BI Community which was kindly replied from @Jimmy801 .
It was fine until i've faced new challenges from my PDF tables sources, now i'm stuck again.
This time i've put a fraction of my original pdf archives here: https://1drv.ms/u/s!BJPuttHgFDRRgY1myLNkrZm5G__Rlg?e=MIiQ9K
I did some small cleaning with this code:
let
Fonte = Folder.Files(path to onedrive folder),
#"FiltraPDF, APRM e MULTAS OUTORG" = Table.SelectRows(Fonte, each Text.EndsWith([Name], "pdf") and not Text.Contains([Name], "APRM") and not Text.Contains([Name], "MULT")),
#"Personalização Adicionada" = Table.AddColumn(#"FiltraPDF, APRM e MULTAS OUTORG", "Personalizar", each Pdf.Tables([Content])),
#"Personalizar Expandido" = Table.ExpandTableColumn(#"Personalização Adicionada", "Personalizar", {"Id", "Name", "Kind", "Data"}, {"Id", "Name.1", "Kind", "Data"}),
#"Linhas Filtradas3" = Table.SelectRows(#"Personalizar Expandido", each ([Kind] = "Table")),
#"Personalizar.Data Expandido" = Table.ExpandTableColumn(#"Linhas Filtradas3", "Data", {"Column1", "Column2", "Column3", "Column4", "Column5"}, {"Data.Column1", "Data.Column2", "Data.Column3", "Data.Column4", "Data.Column5"}),
#"Removed Other Columns" = Table.SelectColumns(#"Personalizar.Data Expandido",{"Name", "Data.Column1", "Data.Column2", "Data.Column3", "Data.Column4", "Data.Column5"}),
#"Filtered Rows" = Table.SelectRows(#"Removed Other Columns", each ([Data.Column3] <> "SALDO ANTERIOR" and [Data.Column3] <> "Taxa de Administração - Agente Financeiro Cobrança ref. Mês Anterior" and [Data.Column3] <> "TOTAL") and ([Data.Column2] <> "SALDO ANTERIOR" and [Data.Column2] <> "Taxa de Administração - Agente Financeiro Cobrança ref. Mês Anterior" and [Data.Column2] <> "TOTAL" and [Data.Column2] <> "^{( - )} Taxa de Administração - Agente Financeiro Cobrança ref. Mês Anterior") and ([Data.Column1] <> "TOTAL"))
in
#"Filtered Rows"
As i said in the original topic, although those pdf are table structured, they dont follow the exactly same structure.
i.e. Data.Colum3, Data.Colum4 e Data.Colum5 have financial values, although only one column of this type is desirable.
Also some columns were split in, but should have been merged:
But i cant just merge them because there are values in Data.Column3, so merging them would result in a undesirable results.
OUTPUT DESIRED (just a small fraction as example!😞
CBH | Ano | Mes | Atribute | Data Taxa Agente Financeiro | Data Saldo | R$ |
ALPA | 2021 | 2 | null | 28/02/2021 | 136613632 | |
ALPA | 2021 | 3 | null | 28/02/2021 | 136613632 | |
AP | 2021 | 2 | null | 28/02/2021 | 552702984 | |
AP | 2021 | 3 | null | 28/02/2021 | 552702984 | |
ALPA | 2021 | 1 | 01/02/2021 | null | 214775 | |
AP | 2021 | 1 | 01/02/2021 | null | 875524 | |
ALPA | 2020 | 3 | Lançamentos à Crédito | null | null | null |
ALPA | 2020 | 3 | ( + )Rendas de Aplicações Financeiras - Cobrança | null | null | 45935 |
ALPA | 2020 | 3 | ( + )Aportes Secretaria | null | null | 0 |
ALPA | 2020 | 3 | ( + )Rendimento Repassado pelo Tomador | null | null | 0 |
ALPA | 2020 | 3 | ( + )Devolução de parcelas - Contratos Não Reembolsáveis | null | null | 0 |
ALPA | 2020 | 3 | ( + )Pagamento de parcelas - Contratos com Retorno | null | null | 0 |
ALPA | 2020 | 3 | ( + )Crédito de Cobrança | null | null | 2428 |
ALPA | 2020 | 3 | Lançamentos à Débito | null | null | null |
ALPA | 2020 | 3 | Liberação de parcelas para contratos não reembolsáveis | null | null | 0 |
ALPA | 2020 | 3 | Liberação de parcelas para contratos reembolsáveis | null | null | 0 |
ALPA | 2020 | 3 | Repasses para Cobertura de Custos Operacionais | null | null | 0 |
ALPA | 2020 | 3 | Repasses sobre Valores Arrecadados | null | null | 0 |
ALPA | 2020 | 3 | Restituição de valores cobrados pelo uso da água | null | null | 0 |
ALPA | 2020 | 3 | Resgate para transferência ao DAEE ref.Pagto. de tarifas de cobrança | null | null | 0 |
The first 3 columnns are base on the PDF archive name :
RELATORIO COBRANCA ALPA_2020_03.pdf |
ALPA = CBH
2020 = ANO (YEAR)
03 = MES (MONTH)
Data Taxa Agente Financeiro
Based on rows with "Taxa de Administração do Agente Financeiro cobrada em". Note that are rows without the space between "em" and the date itself, like in the last examples below.
Data Saldo
Based on rows with "SALDO EM 'dd/mm/yyyy'". Those values are also presented in more than one column
Atribute
Are the main atributes, which i'll do the calculation with the "R$" column
R$
The values itself, expressed in brazilian reais (BRL)
_____________
I thought i had completed this task within this code:
let
Fonte = Folder.Files("one drive link"),
#"FiltraPDF, APRM e MULTAS OUTORG" = Table.SelectRows(Fonte, each Text.EndsWith([Name], "pdf") and not Text.Contains([Name], "APRM") and not Text.Contains([Name], "MULT")),
#"Personalização Adicionada" = Table.AddColumn(#"FiltraPDF, APRM e MULTAS OUTORG", "Personalizar", each Pdf.Tables([Content])),
#"Personalizar Expandido" = Table.ExpandTableColumn(#"Personalização Adicionada", "Personalizar", {"Id", "Name", "Kind", "Data"}, {"Personalizar.Id", "Personalizar.Name", "Personalizar.Kind", "Personalizar.Data"}),
#"Linhas Filtradas3" = Table.SelectRows(#"Personalizar Expandido", each ([Personalizar.Kind] = "Table")),
#"Personalizar.Data Expandido" = Table.ExpandTableColumn(#"Linhas Filtradas3", "Personalizar.Data", {"Column1", "Column2", "Column3", "Column4", "Column5"}, {"Personalizar.Data.Column1", "Personalizar.Data.Column2", "Personalizar.Data.Column3", "Personalizar.Data.Column4", "Personalizar.Data.Column5"}),
#"PADRAO ""COBRANCA"" NO NOME" = Table.ReplaceValue(#"Personalizar.Data Expandido","COBR_","COBRANCA ",Replacer.ReplaceText,{"Name"}),
#"PADRAO ""COBRANCA"" NO NOME2" = Table.ReplaceValue(#"PADRAO ""COBRANCA"" NO NOME","COBRANÇA","COBRANCA",Replacer.ReplaceText,{"Name"}),
#"Removed Other Columns" = Table.SelectColumns(#"PADRAO ""COBRANCA"" NO NOME2",{"Name", "Personalizar.Data.Column1", "Personalizar.Data.Column2", "Personalizar.Data.Column3", "Personalizar.Data.Column4", "Personalizar.Data.Column5"}),
#"Filtra linhas inúteis" = Table.SelectRows(#"Removed Other Columns", each ([Personalizar.Data.Column2] <> "SALDO ANTERIOR" and [Personalizar.Data.Column2] <> "TOTAL") and ([Personalizar.Data.Column1] <> "AG. 1897-X - SETOR PÚBLICO SÃO PAULO / MSE" and [Personalizar.Data.Column1] <> "FEHIDRO COBRANÇA" and [Personalizar.Data.Column1] <> "RESUMO DA MOVIMENTAÇÃO" and [Personalizar.Data.Column1] <> "TOTAL" and [Personalizar.Data.Column1] <> "[image]")),
Custom3 = #"Filtra linhas inúteis",
Custom2 = Table.ColumnNames(Custom3),
#"Troca Null por Blank" = Table.ReplaceValue(Custom3,null,"",Replacer.ReplaceValue,Custom2),
Custom1 = Table.AddColumn(#"Troca Null por Blank", "CreateNewTable", each [1 = if Text.StartsWith([Personalizar.Data.Column1],"Taxa") then [Personalizar.Data.Column1] else [Personalizar.Data.Column2],
2 = if Text.StartsWith([Personalizar.Data.Column1],"Taxa") then [Personalizar.Data.Column2] else [Personalizar.Data.Column3],
3= if Text.StartsWith([Personalizar.Data.Column1],"Taxa") then [Personalizar.Data.Column3] else [Personalizar.Data.Column4],
4 = if Text.StartsWith([Personalizar.Data.Column1],"Taxa") then [Personalizar.Data.Column4] else [Personalizar.Data.Column5]
]),
#"Expanded CreateNewTable" = Table.ExpandRecordColumn(Custom1, "CreateNewTable", {"1", "2", "3", "4"}, {"CreateNewTable.1", "CreateNewTable.2", "CreateNewTable.3", "CreateNewTable.4"}),
#"Removed Other Columns2" = Table.SelectColumns(#"Expanded CreateNewTable",{"Name", "CreateNewTable.1", "CreateNewTable.2", "CreateNewTable.3", "CreateNewTable.4"}),
#"Merged Columns1" = Table.CombineColumns(#"Removed Other Columns2",{"CreateNewTable.3", "CreateNewTable.4"},Combiner.CombineTextByDelimiter("", QuoteStyle.None),"Merged")
//= Table.AddColumn(Source, "CreateNewTable", each [FirstColumn = if Text.StartsWith([A],"Taxa de Administração do Agente Financeiro cobrada em") then [A] else [B],
//SecondColumns = if Text.StartsWith([A],"Taxa de Administração do Agente Financeiro cobrada em") then [B] else [C]])
,
#"É numero?" = Table.AddColumn(#"Merged Columns1", "isnumber?", each try Number.From([Merged]) otherwise "", type number),
#"eh texto?" = Table.AddColumn(#"É numero?", "isText", each try Text.Contains([CreateNewTable.2], "a") otherwise ""),
#"eh texto 2" = Table.AddColumn(#"eh texto?", "isText2", each if ([isText]) = true then [CreateNewTable.1] & "" & "" & [CreateNewTable.2] else [CreateNewTable.1], type text),
#"Removed Other Columns1" = Table.SelectColumns(#"eh texto 2",{"Name", "isnumber?", "isText2"}),
#"Reordered Columns" = Table.ReorderColumns(#"Removed Other Columns1",{"isText2", "isnumber?"}),
#"Renamed Columns" = Table.RenameColumns(#"Reordered Columns",{ {"isText2", "Atributo"}, {"isnumber?", "R$"}}),
#"Tira o .PDF" = Table.TransformColumns(#"Renamed Columns", {{"Name", each Text.BeforeDelimiter(_, ".pdf"), type text}}),
#"Duplicated Column" = Table.DuplicateColumn(#"Tira o .PDF", "Name", "Name - Copy"),
#"Extracted Last Characters" = Table.TransformColumns(#"Duplicated Column", {{"Name - Copy", each Text.End(_, 7), type text}}),
#"Filtered Rows" = Table.SelectRows(#"Extracted Last Characters", each true),
#"Split Column by Delimiter" = Table.SplitColumn(#"Filtered Rows", "Name - Copy", Splitter.SplitTextByDelimiter("_", QuoteStyle.Csv), {"Ano", "Mes"}),
#"Extracted Text Between Delimiters" = Table.TransformColumns(#"Split Column by Delimiter", {{"Name", each Text.BetweenDelimiters(_, " ", "_", 1, 0), type text}}),
#"Data Saldo" = Table.SplitColumn(#"Extracted Text Between Delimiters", "Atributo", Splitter.SplitTextByDelimiter("SALDO EM ", QuoteStyle.Csv), {"Atributo", "Data Saldo"}),
#"Data Taxa AF" = Table.SplitColumn(#"Data Saldo", "Atributo", Splitter.SplitTextByDelimiter("Taxa de Administração do Agente Financeiro cobrada em ", QuoteStyle.None), {"Atributo", "Data Taxa AF"}),
#"Renamed Columns1" = Table.RenameColumns(#"Data Taxa AF",{{"Name", "CBH"}}),
#"Changed Type" = Table.TransformColumnTypes(#"Renamed Columns1",{{"CBH", type text}, {"Atributo", type text}, {"Data Taxa AF", type date}, {"Data Saldo", type date}, {"R$", type number}, {"Ano", Int64.Type}, {"Mes", Int64.Type}}),
#"Filtra Tx Adm Mes Anterior" = Table.SelectRows(#"Changed Type", each ([Atributo] <> "( - )Taxa de Administração - Agente Financeiro Cobrança ref. Mês Anterior" and [Atributo] <> "Taxa de Administração - Agente Financeiro Cobrança ref. Mês Anterior" and [Atributo] <> "^{( - )} Taxa de Administração - Agente Financeiro Cobrança ref. Mês Anterior"))
in
#"Filtra Tx Adm Mes Anterior"
but i still got values (in BRL) in my Atributo Column:
I glad appreciate any help provided.
Cheers
Sorry, but I'm too busy right now and cannot pick this up.
Imke Feldmann (The BIccountant)
If you liked my solution, please give it a thumbs up. And if I did answer your question, please mark this post as a solution. Thanks!
How to integrate M-code into your solution -- How to get your questions answered quickly -- How to provide sample data -- Check out more PBI- learning resources here -- Performance Tipps for M-queries
Thanks anyhow. I've learned a lot from posts/blog.
Here you are. This will work for all of the pdfs in your folder:
let
Source = Pdf.Tables(File.Contents("xxxxxxxxxxxxxxxxxxxxx.pdf"), [Implementation="1.2"]),
#"Filtered Rows" = Table.SelectRows(Source, each [Kind] = "Table"),
#"Added Custom1" = Table.AddColumn(#"Filtered Rows", "Columns", each Table.ColumnCount([Data])),
#"Sorted Rows" = Table.Sort(#"Added Custom1",{{"Columns", Order.Descending}}),
#"Removed Other Columns1" = Table.SelectColumns(#"Sorted Rows",{"Id", "Data"}),
#"Expanded Data" = Table.ExpandTableColumn(#"Removed Other Columns1", "Data", {"Column1", "Column2", "Column3", "Column4"}, {"Column1", "Column2", "Column3", "Column4"}),
#"Sorted Rows1" = Table.Sort(#"Expanded Data",{{"Id", Order.Ascending}}),
#"Removed Columns" = Table.RemoveColumns(#"Sorted Rows1",{"Id"})
in
#"Removed Columns"
--Nate
Hi Nate,
Still not working.
I need something to deal with a folder path, not a file, hence 'File.Contents' it's not the best alternative since there will be many pdf files.
Anyhow, i've tested with your query, and the problems are still appearing.
Check for instance with this file: RELATÓRIO COBR_BPG_2020_01.pdf
Red line circle: can't have values in the same colum as my atributes
Blue circle: SALDO EM with the date
Green line: "Taxa de administração..." is an atributte. Should be in the correct column, as i said earlier..
and this happened checking only one file... probably there would be many more when considering the other files, since the pdfs are unstructured.
Glad appreciatte your efforts, though
Anyone ? This is driving me nuts!
Thanks...
I would actually improve upon the first solution to make it more dynamic:
NewColumn1= Table.AddColumn(PriorStep, "Fixed", each if [A] = null then [B] else [A]),
NewColumn2 = Table.AddColumn(NewColumn1, "Fixed2", each if [C] = null then [B] else [C])
Then remove your original columns.
As far as your table splitting issue, I would add a space after each em by using replace values "em", "em ". That way you can split by the last space instead of by that whole text string.
--Nate
Hi there.
I'm afraid this solution did not work as expected. There are still "values" fields (R$) where it doesnt suppose to be....
Covering the world! 9:00-10:30 AM Sydney, 4:00-5:30 PM CET (Paris/Berlin), 7:00-8:30 PM Mexico City
Check out the April 2024 Power BI update to learn about new features.