Power BI gateways serve in an important role - they bridge the data gap between the Azure service that drives Power BI online offerings like browser based reports and dashboards, or user access to XMLA endpoints for example through "Analyze in Excel".
Without gateways the Azure service would not know how to talk to your on-premise data sources (like SQL server in your company network). Gateways receive requests from the Azure service, process them, and return the results to the service.
If you have a personal gateway, or a single instance of an Enterprise gateway then live dangerously - your personal gateway may be installed on a PC that is not always on, or your single instance Enterprise gateway might fall offline for a variety of reasons.
If you are concerned about business continuity (and you should be) then you want a gateway cluster with multiple gateway instances (ideally in different geographical locations) combined together into a gateway cluster. That way if one of the instances falls over, the other instances can continue to serve the data requests without much impact.
Setting a cluster up is one thing, measuring its efficiency is something else entirely. How do you know if a gateway is part of your slow refresh problem or not? Can the gateway handle all the requests for refresh during a regular week? If so, can it do so barely, or with plenty or capacity to spare?
One way to look at this is to examine the gateway logs. In the default configuration these logs are collected automatically both for general information and for errors. The info logs tend to be rather verbose so the gateway only keeps them around for a day or so. Errors are (hopefully) less frequent and you may see two weeks or more of logs.
Logs are stored in the "appdata" folder of the account running the gateway service.
You can also enable advanced logging (you need to restart the gateway service for that) for even more verbose logging and even shorter retention periods. You only want to do that for short periods of time when asked by Microsoft to provide gateway logs as part of the investigation of a ticket you raised. Clicking on the "Export Logs" link will basically pack up all the text files into a ZIP archive.
Giving the log data to Microsoft is not optimal though. They are investigating the issues with one of your dataset refreshes but you may have hundreds of other refreshes going on that are unrelated (and potentially contain senistive information). Collecting the gateway logs manually also gets old very quickly if you have more than just one or two instances (as you should, for business continuity).
What if we could examine these log files manually? Can't be that hard, right. These are just text files after all.
This is where it gets ugly, and quick. Log files are more like a bunch of stream of consciousness flow of random thoughts scribbled onto a napkin on a wet table in a stiff breeze.
Well, maybe not exactly, but as you can see these log files certainly lack one thing - structure. And they have a lot of dead weight, rows and rows of text that seemingly have no information.
There is no documentation (that I know of) on the structure of the log files. From my observations they can have as few as a single column and as much as eight columns of data, with a lot of these columns filled with seemingly meaningless identifiers for processes and subprocesses.
There are some identifiable pieces - timestamps, message types, and the occasional clear text of an error.
The best I could come up with is a Power Query franken-script to try and tame the data and only let the information through. I am pretty sure that I am misinterpreting some of the data, and missing other important items.
Here is the script ("GetLog") to process an individual text file. Note that I chose to add an index to get some idea of the sequence of events as often there are many events with the exact same timestamp. Also note that the actual useful information can either be in column 1 or column 8 so I add an "action" column that pulls the data accordingly.
Now we can run reports about gateway cluster utilization over time, and we can validate that all gateway cluster members do their part in processing the service requests.
Here is an example of the on-ramp of requests against the four cluster members of one gateway on Monday. Note that the primary (anchor) cluster member failed to restart on Saturday. The other three cluster member continued on, and on Monday I swapped in a new primary member.
Gateway error logs help to identify connections where either the credentials have expired, or where the data source is offline.
I have found that gateways distribute data source refresh requests over multiple cluster members, even for the same dataset. That's actually pretty impressive. After lots of searching I also finally found how to connect the gateway logs back to our tenant audit logs - the dataset Id is included in the "objectId" rows of the log files. Adding a calculated column
Dataset = right(left(GatewayInfo[Action],51),36)
then allows me to tie the logs into the data model.
Now I can directly identify the developer I need to work with to correct a connection issue, or any excessive refresh requests that consume all the gateway resources.
None of the above is simple. I wish that in the future Microsoft would provide more information on the structure of the gateway logs, more controls on what to log, and ideally some real time gateway performance monitoring tools that would make the described process obsolete.