Categories
Data Microsoft Power Platform Power BI

Stars, dimensions, and facts

Despite the title, this is not a sci-fi post. However, if you are new to data modeling and Power BI, it may seem futuristic to you.

In this post, I will explain the concept of star schema data modeling and it’s related concepts dimension- and fact tables. Setting up a smart star schema design will make your data exploration and dashboard building so much easier. In order to do that, we need to know what fact- and dimension tables are and how they are intended to be used.

What do we mean by fact- and dimension tables?

When I started out with data modeling in Power BI, the hint to make a distinction between dimension and fact tables was exactly the type of hint I discarded as too technical and complicated and not necessary for what I wanted to build. It’s neither of those things. Using a star schema data model is easier from a user perspective, and almost always applicable.

Instead, if I would have invested a few minutes into understanding why a data model is different and (sometimes) better than my Excel spreadsheets, it would have saved me a lot of time down the line.

So let’s start by having common terminology.

A dimension table is a table about something that we want to explore. As an example, a dimension table can tell you about an entity’s place in a hierarchy, group together a handful of entities, or relate a date in a longer timeline.

A fact table contains occurances of the things we want to explore. Each row contains information of a single event or transaction, that is further described in the dimension tables.

The fact table contains information that is unique to the transaction, while the dimension table holds information that is shared and compared.

Take a financial model where we want to track actuals on a P&L based on data from our accounting system. The fact table would contain each transaction made in the accounting system, maybe one row per journal entry. We could perhaps store information on the amount, the currency, the cost/revenue center, the date, the counterpart, and the account of the transaction. All things unique to the transaction.

What’s not unique to the transaction is how an account relates to the CoA hierarchy – which accounts sums up to which P&L-row. Or which counterparts that are preferred suppliers and which that are secondary or worse choices. This type of information is suitable in a dimension table.

Why should I build a star schema?

Increase performance – reduce data

If you only have one fact table, your model could just be one wide (with many columns) table. But this may be a bad idea, despite being a fully functional solution. That’s because using dimensions may drastically reduce the data in your model.

Let’s expand on our P&L-example. Let’s say we have 1000 transactions on 10 different accounts. We can either put the sum-level information in the fact table, or in a related accounts dimension table. If we put the information in the fact table, we need an additional column with information for all transactions. That’s 1000 additional cells of information. Instead, if we have a table with the sum level for each account and relate the fact- and the dimension table on the account key, we only get 10 new rows – one for each account. Consider that your data sets may grow to millions of rows, and you’ll start to understand why smart data modeling is key for dashboard performance.

Avoid many-to-many relationships

If you have multiple fact tables that you somehow want to compare, you need to use dimension tables to avoid many-to-many relations. Power BI allows for many-to-many relations, but it’s clearly stated that if you don’t fully understand how these relationships works, you risk getting erroneous output.

In our finance example above, one may argue that actuals without context is useless. Perhaps you want to compare the actuals to a plan or a budget. This plan is also suited for a fact table, as it contains information across multiple dimensions – maybe a forecasted outcome for an account on a cost center a given month.

If we want to compare actuals to the plan for an account on a cost center, we need to tell Power BI how these relate. If we make a direct relationship between the cost center column in the plan-table and the actual-table, we will get a many-to-many relationship, and likely get outputs in the dashboard that are wrong. But the cost centers in both tables relate to the same cost center hierarchy, where each cost center only occur once. So we can put this hierarchy as a dimension table, and let both fact tables have many-to-one relationships to this dimension table. Then, Power BI knows how the two fact tables relate to each other, and we can add additional information to the dimension table like sum-levels or cost center owners with little additional data.

Suite yourself

Now, the applications and underlying data always varies. These best practice rules may very well be too generalized for your specific problem, so the star schema may not work for you. But you should know the rules and know when you deviate from them.

How do I build a star schema?

If you want some practical tips on how to transform your data to a star schema data model, I’d recommend the video below from Guy In a Cube. It also shows hands-on on how to use the Power Query functionality in Power BI which I always recommend to use.

Power BI Tutorial – From Flat File to Data Model

Summary

  • Smart data modeling will always make your dashboards faster and easier to work with. Moreover, if you are using multiple fact tables, dimension tables almost becomes a necessity.
  • When building a star schema, you have fact and dimension tables. A fact table contains transaction-specific information, and a dimension table contains information common for a certain aspect of multiple transactions.
  • Even though star schema modeling is best practice, your specific needs may require something else. However, you need to know the rules to break the rules.
  • One can easily build a star schema data model from a single flat file using the Power Query Editor functionality in Power BI. A great way to improve efficiency of your dashboards by reducing data, and simplify development by having clearer tables.

Some additional reading

Categories
AI Azure Power BI

Text analytics in Power BI

Microsoft is continously adding new AI functionality to the Power Platform. Here is a brief look on how to use the text analytics function to detect the language and overall sentiment in a body of text.

This can be very useful when analysing interactions with customers or suppliers, the mood among team members and company perception among financial analysts or in the media.

This function is part of Azure Cognitive Services so you will need to create an Azure account to try this yourself.

In the example below I am using a table with data about three newspaper articles. Two are in Swedish, one is in Danish. One of the Swedish articles is regarding a win in a horse racing event, the other is about the Coronavirus. The Danish article is about recent acts of terrorism in France.

I have opened the Power Query Editor and marked out the ‘AI insights’ group under the Home tab. Today I am going to use the ‘Text Analytics” and the functions ‘Detect language’ and ‘Score sentiment’.

I am then prompted to sign in to my AI functions Azure account.

As the Score sentiment function is helped by pre-defining the language of a text, we will start with detect language. Simply select the column of the text you want do detect the language for.

After a brief moment (larger tables will take longer time) two new columns are returned. The first shows the detected language and the second the corresponding ISO code for that language code for that language (this column will be used next in the ‘Score sentiment’ function.

Now let’s try the ‘Score sentiment’ function. Choose the text column to analyse. As you can see there is an option to assist the model by providing a pre-defined language classification either as single two letter ISO code or as a column from the table. The second option is of course more useful if there are multiple languages in the table.

After pressing OK the function will return a new column with a score of the sentinment in respective texts. The score range is between 0 and 1, with 1 being the highest (most positive) sentiment. Usually most texts score between 0.4 and 0.6.

The results gave the article about horse racing the highest score (0.5366) and the article about the coronavirus the lowest score (0.4877).

Please note that the Danish article on terrorism is almost as high as the Swedish article on horse racing, whcih could seem counter-intuitive given the differing subject of the articles. I have also seen this pattern when using this function on other texts of different languages, where it seems different languages end up on their own language specific range, which differs from other languages. This it leads me to conclude that the function is not yet suited for comparison across different languages. However, it is very useful for comparison within languages and we can expect it will improve quickly over time.

For a detailed tutorial on the third text analytics function ‘Detect Key Phrases, check out this tutorial from Microsoft.