The OVER function is a very useful function in Spotfire that calculates expressions and aggregations across different groups of data. I’ve previously written about the OVER function, which you can check out here. This post has additional examples using the OVER function, focusing on multiple columns and expressions requiring the Intersect function.
[ninja-popup ID=1465] Click here to download the Spotfire file shown in these examples!![/ninja-popup]
OVER Statements with Multiple columns
Often times you’ll want to evaluate an aggregation or expression over just a subset of your data, and sometimes that subset is defined by multiple columns. For example: “Show me average production during the last 3 months for each region we operate.” This is where the OVER function comes in.
When writing custom a custom expression with the OVER function, in most cases you can list the columns in parentheses behind the OVER function (eg: Average([Production]) OVER [Region]). However, in some cases you’re required to use the Intersect function, which I describe a little further down the page.
I’ve listed several examples below to demonstrate these different methods.
Compare well performance based on completion parameters
Let’s say we’re looking at completion data for a group of oil wells. These wells are geographically located in different Fields (North, South, and West), and each well is completed using either Completion Type A or Completion Type B (Note: “Completion” in this context refers to the processes and operations required after drilling an oil / gas well to start producing the well. It’s the 2nd of 3 main phases: 1. Drilling > 2. Completion > 3. Production). A common performance metric is the 90 Day IP. This is the average oil production during the first 90 days of production history.
Here’s what the data looks like:
We want to know: Does the Completion Type and geographic location (Field) affect the 90 Day IP?
To help answer this question, I’ll average the 90 day IP rate for each combination of Field and Completion Type. In this data, there are 3 Fields and 2 Completion Types, therefore we should end up with 6 unique values of Avg IP:
- Avg IP for Completion Type A in the North Field
- Avg IP for Completion Type A in the South Field
- Avg IP for Completion Type A in the West Field
- Avg IP for Completion Type B in the North Field
- Avg IP for Completion Type B in the South Field
- Avg IP for Completion Type B in the West Field
Here’s the custom expression to calculate this, inserted as a calculated column (Insert > Calculated Column):
Avg([90 Day IP]) OVER ([Completion Type], [Field])
Make sure to include the parentheses around the column names after the word OVER – if not you’ll get a strange error message.
Here’s the data after adding the new calculated column:
Let’s take a closer look at this and manually calculate a few of the values to double-check the accuracy. To calculate Avg IP for Completion Type A in the North Field, let’s find well records (Well ID) that used Completion Type A and are located in the North Field (highlighted below):
Next, just take the average of the IP rate for these 3 values:
The next group to evaluate would include the rest of the wells in the North Field – those using Completion Type B:
Of course, I could keep going with Completion Type A in the South Field, but we’ll let Spotfire do the math from here!
Summarizing all of the data into the 6 unique values in the calculated column:
- Avg IP for Completion Type A in the North Field = 1,643
- Avg IP for Completion Type A in the South Field = 2,157
- Avg IP for Completion Type A in the West Field = 2,853
- Avg IP for Completion Type B in the North Field = 1,992
- Avg IP for Completion Type B in the South Field = 2,451
- Avg IP for Completion Type B in the West Field = 3,174
We could get the same result (without inserting a calculated column) by using a Cross Table or Bar Chart (or some other type of visualization) as shown below. Thanks to the Marking feature in Spotfire, it’s really quick to check the calculated column by simply clicking on rows in the Cross Table or Bars in the chart:
Comparing LeBron James’ Regular and Post Season Performance
If you’re not in oil and gas or that data doesn’t make sense to you, let’s look at a sports example. Since the NBA playoffs are going on right now, I’ll choose basketball.
LeBron James has played 13 NBA seasons with 2 different teams (Cleveland Cavaliers & Miami Heat) and has made 11 playoff appearances. Let’s say we want to compare James’ average scoring per game for each part of the Season (Regular or Playoffs).
The average scoring per game for each season has already been calculated for us, so we don’t have to worry about that. That leaves 2 groups of data that we want to compare (Team and Season), and each group has 2 unique values, for a total of 4 combinations. We should end up with 4 unique values in our calculated column:
- Avg scoring for Cleveland during Regular Season
- Avg scoring for Cleveland during Playoffs
- Avg scoring for Miami during Regular Season
- Avg scoring for Miami during Playoffs
Heres’ the data:
Because we want to compare James’ performance for each Team and Type of season, we need to use both of these columns in the calculated column. Here’s the expression:
Avg([Avg Points per Game]) OVER ([Team], [Season Type])
Summarizing this into the 4 unique values mentioned above:
- Avg scoring for Cleveland during Regular Season = 27.3 pts
- Avg scoring for Cleveland during Playoffs = 28.9 pts
- Avg scoring for Miami during Regular Season = 26.9 pts
- Avg scoring for Miami during Playoffs = 26.8 pts
We could get the same result (without inserting a calculated column) by using a Cross Table:
And once again, it’s really quick to check the calculated column by simply Marking rows in the Cross Table:
The Intersect Function – Example 1
Now I’ll add a layer of complexity to our problem by adding the element of time. For example, what is LeBron’s regular season average points per game to date (including the current year)? In other words, what’s his historical running average?
Let’s say we’re evaluating this historical-to-date average for the 2008-09 season. To calculate that by hand, we’d average values during the regular season between for seasons between 2003-2008:
Thankfully Spotfire has a built-in function that does this for us. It’s called AllPrevious. To calculate the average points per game to date, use the following expression:
Avg([Avg Points per Game]) OVER Intersect([Season Type], AllPrevious([Season]))
You’ll notice that writing this expression is very similar to the previous one, the only catch is that we must use the Intersect function with AllPrevious.
This is because for a given row’s calculation, AllPrevious is not using all unique values from the Season column – only values that are less than or equal to the current season. So to calculate the average to date, Spotfire takes the average only when Season type (regular or playoffs) intersects with all previous seasons for a given row.
Without Intersect, you’ll get the error message: “This expression is not valid”.
Here’s a list of functions that require Intersect when combined with the OVER function. Note that Intersect is only required when there are 2 or more columns after the OVER function.
The Intersect Function – Example 2
Here’s another example of something we might want to evaluate: Was LeBron’s 28.4 PPG average scoring for the 2008-09 regular season better or worse than the average of all previous seasons (not including the 2008-09)?
AllPrevious does include the current row in the calculation (in our example above, the Avg PPG for the 2008-09 season). It gets a little trickier if we want to exclude this value from the calculation:
(Sum([Avg Points per Game]) OVER Intersect([Season Type], AllPrevious([Season])) – [Avg Points per Game]) / (Count([Season]) OVER Intersect([Season Type], AllPrevious([Season])) – 1)
We can check the result by comparing it to the previous season’s average to date:
Another way to write that complex expression, since we already calculated Last Seasons Avg Points per Game to Date, is to just use the Previous function:
Avg([Avg Points per Game to Date]) OVER Intersect([Season Type], Previous([Season]))
This will take the value from the Previous Season in the Avg Points per Game to Date column. This also demonstrates that sometimes its easier to do a calculation in 2 steps with 2 calculated columns, instead of 1 column with a complex expression.
Here’s the result, which we can verify matches the other calculated column with the more complex expression:
With this new column, Last Seasons Avg Points per Game to Date, now we can see for a given season whether his performance improved or not compared to the average of all previous seasons:
[Avg Points per Game] – [Last Seasons Avg Points per Game to Date]
Looking at the data, we can see that LeBron’s scoring average started declining in the 2010-11 regular season when he moved to the Miami Heat. His playoff performance compared to previous years is up and down over time. This is more easily visualized with a Line Chart or Bar Chart, trellised by Season Type, as below:
Questions / Comments
Hopefully these examples have helped you understand how to use the OVER function with multiple columns and the Intersect function. If you have any questions or comments, or if there’s something you’d like featured on this blog, feel free to email me at email@example.com or leave a comment below.