Earn the coveted Fabric Analytics Engineer certification. 100% off your exam for a limited time only!
I need to import all rows of a tablepress (wordpress plugin) page, about 200 rows. The page shows 30 rows each time, but the URL is the same, so, can´t merge multiple datasources. Power BI Desktop gets only the first 30 rows.
The table is on the web source, all rows, but the presentation is 30 rows each time.
Tried to use powershell to get a .html file, but the parsing become wrong (some <br /> inside the columns).
Any hint?
Solved! Go to Solution.
No, it didn't work to us. The point is that with this Wordpress Plugin, the URL doesn't change when you change the page. Our solution (after a lot of search on the web) was to adapt a Powershell script we found, in order to filter only the 'table ... /table" section, cleaning its fields, transforming it on a .CSV file. Unfortunately I can't find the original URL, so here is our script, after adaptations:
$url ="https://some_url/some_page/"
$r=Invoke-WebRequest $url
$r.Content |Out-File C:\temp\some_page.html
$Path = 'C:\temp\some_page.html'
[regex]$regex = "(?s)<table id=.*?</table>"
$tables = $regex.matches((GC $Path -raw)).groups.value
ForEach($String in $tables){
$String = $String|ForEach{$_ -replace '(?s)<br.*?</td>','/<td>'}
$table = $String.split("`n")
$CurTable = @()
#$CurTableName = ([regex]'table id="([^"]*)"').matches($table[0]).groups[1].value
$CurTable += ($table[1] -replace "<.th><th class=.*?>",";") -replace "</?(th.*?|b.*?)>"
$CurTable += $table[2..($table.count-2)]|ForEach{$_ -replace '(?s)<br.*?<br' -replace '(?s)<br.*?<','<' -replace "(`r|`n|`t)" -replace "( )+", " " -replace " " -replace "</t(h|d)><t(h|d) class=.*?>",";" -replace "<\tr>","`n" -replace "<.?t(d.*?|b.*?|r.*?|h.*?)>" }
$CurTable | convertfrom-csv -Delimiter ";" | export-csv "C:\temp\some_page.csv" -notype -Encoding UTF8 -Force
}
Of course, for each different page, probably you will need to change this code.
@Jrussi1301 , refer if this can help
https://www.youtube.com/watch?v=6PZSZ53iSos
https://www.youtube.com/watch?v=s_5Jk8_fHjA
https://www.myonlinetraininghub.com/scrape-data-multiple-web-pages-power-query
No, it didn't work to us. The point is that with this Wordpress Plugin, the URL doesn't change when you change the page. Our solution (after a lot of search on the web) was to adapt a Powershell script we found, in order to filter only the 'table ... /table" section, cleaning its fields, transforming it on a .CSV file. Unfortunately I can't find the original URL, so here is our script, after adaptations:
$url ="https://some_url/some_page/"
$r=Invoke-WebRequest $url
$r.Content |Out-File C:\temp\some_page.html
$Path = 'C:\temp\some_page.html'
[regex]$regex = "(?s)<table id=.*?</table>"
$tables = $regex.matches((GC $Path -raw)).groups.value
ForEach($String in $tables){
$String = $String|ForEach{$_ -replace '(?s)<br.*?</td>','/<td>'}
$table = $String.split("`n")
$CurTable = @()
#$CurTableName = ([regex]'table id="([^"]*)"').matches($table[0]).groups[1].value
$CurTable += ($table[1] -replace "<.th><th class=.*?>",";") -replace "</?(th.*?|b.*?)>"
$CurTable += $table[2..($table.count-2)]|ForEach{$_ -replace '(?s)<br.*?<br' -replace '(?s)<br.*?<','<' -replace "(`r|`n|`t)" -replace "( )+", " " -replace " " -replace "</t(h|d)><t(h|d) class=.*?>",";" -replace "<\tr>","`n" -replace "<.?t(d.*?|b.*?|r.*?|h.*?)>" }
$CurTable | convertfrom-csv -Delimiter ";" | export-csv "C:\temp\some_page.csv" -notype -Encoding UTF8 -Force
}
Of course, for each different page, probably you will need to change this code.
User | Count |
---|---|
140 | |
113 | |
104 | |
77 | |
65 |
User | Count |
---|---|
135 | |
116 | |
101 | |
71 | |
61 |