Readership in Seattle by Decisive Data
08-02-2017 15:11 PM
08-02-2017 03:11 PM
Getting access to the data that matters to you, when you need it, can be of the upmost importance. There can be information that is too crucial to not have on-the-fly access. Power BI gives you the capability of retrieving mission critical information, at your finger-tips, with DirectQuery.
My goal is to demonstrate the Power within Power BI via:
- A multi-million row dataset, demonstrated in a simple, infographic design.
- An interactive and live connection to a hosted SQL Server data solution, utilizing book checkout activity within the King County Library System.
- Power BI and Backend SQL Optimization techniques to improve performance for these kinds of analytics implementations.
I had the desire to make a non-parameter driven report, unlike my previous creation, which was built on pushing dynamic parameters to new heights.
Thus, the idea of building an interactive report inspired by modest infographics appealed to me. I leveraged KPIs and iconography to present all up trending for new book additions and book activity. The design is crafted by five primary genres, Fiction, Other, Fantasy, Nonfiction, Media and their respective subgenres.
Selecting the Data
The City of Seattle’s Open Data Program offers a wide range of interesting datasets. I decided to leverage the Checkouts by Title, which consists checkouts by title for all physical and digital items from 2005 to present. So, the raw data set is rather hefty with more than 25 million rows of data.
Some Issues I encountered:
- The most interesting field was the Subjects field, which had a comma delimited format. This meant that books which were associated with multiple subjects, were combined into a single string. This meant that when I first pulled in the data into Power BI there were over 200,000 distinct Subjects, which was far too granular for me to make any inferences on.
- There were NULLs across the board, which meant that portions of the data didn’t provide the context I wished to derive.
- The raw file was over 30GB, which meant that optimization was going to be of critical importance.
- There was no value for net new books within the King County Library system, despite such a value being an interesting metric to observe.
- The King County Library System offers much more than books for recorded checkouts. I needed to filter my data to only include audiobooks, books, and Ebooks to properly represent readership.
Here are some advanced capabilities within Power BI and SQL Server that I utilized:
- Since the data had over 200,000 Subjects, I used SQL and logic to build my own, column friendly hierarchy. This new hierarchy would make the data much easier to slice in Power BI. I used a CTE and Union to separate my comma separated values in the Subjects field into rows, to give a row of data for every possible associated subject associated with a title. I then found the top 200 subjects within the entire dataset by keep the top 200 counted subjects. I then performed a CASE Statement to categorize the NewSubjects into any of the Top 200 categories ELSE bucket into a ‘Other’ category. The CASE statement captured around 80% of all possible NewSubjects for all 30GB of books as most books were associated with at least one of the top 200 categories. Having a single NewSubjects for each title got me much closer to the groupings I wanted. Then I grouped the NewSubjects into 19 Subgenres, which I then further grouped into 5 major Genres. The 5 Genres gave me the macro look I wanted. All of this development was performed in the SQL Server this data is connected to.
- For performance, I used a columnstore index on my SQL table, to vastly improve queries on the dataset. It provided the speed to insight I needed for proper inference within Power BI. The column store index improved my query optimization substantially, with the SQL querying time decreasing by 4,200% after implementation, which certainly rolled into queries within Power BI.
- This data is connected live to our SQL Server via DirectQuery, meaning I could build visualizations over this very large dataset, where it would otherwise be unfeasible to import all of the data with pre-aggregation. Moreover, the 1 GB dataset limitation does notapply to DirectQuery, making it the clear winner for this massive amount of data.
- I leveraged CTEs again to find the MIN date of a title, and leveraged that logic to perform a LEFT JOIN on the table to flag Net New Titles so I could have a value of 1 ELSE 0 for new books.
- I leveraged the histogram custom visual to find my two checkout distributions.
- Lastly, you cannot naturally link images in Power BI, so I made hyperlinks to the Decisive Data website and social media by leveraging text links within the negative space of the images.
Interesting Findings and Other Thoughts
A great dashboard leads to interesting discoveries within the data. As a an avid reader I found facts that stood out to me, as well as some technical musings.
- The most read genre within the King County Library System is Fiction, by a very large margin.
- The most read book within the King County Library System between 2005 and 2016 is Gone Girl by Gillian Flynn.
- Fantasy Novels have accelerated past the Other genre, this may be due to the popularity of Hollywood’s gravitation towards fantasy genres of all types in recent years, bolstering book checkouts as movie goers read the associated books. That is merely my own supposition.
- E. L James’s Fifty Shades of Grey is the third most read book between 2005 and 2016.
- 99% of all books are checked out less than 100 times in a given month.
- Lastly, another approach to the hierarchy creation within SQL would be return to this dataset and leverage some form of Document Clustering analysis on the Subjects field in R or Python to for unsupervised clustering based on text frequency. I am pursuing Data Science, so this will likely be a subject I touch on in my spare time.
What does readership look like for your favorite author?
11-20-2017 10:34 PM
I can't say for sure but I have achieved a similar outcome by using the Sparkline custom visual available in the store. You can overlay this onto other visuals or it can stand up on it's own.