
A scientist's take on the Game of Kings

Wednesday, March 18, 2015

Finding Difference Makers



Using JavaScript tools I recently created, you can determine the square utilization or occupancy for any particular piece, for any given square, across all moves and all games in a PGN database. One potentially useful application is to apply these tools separately to the wins and losses (1-0 versus 0-1) in the database of your choice. As featured in a recent ChessBase article, I have done this to find which piece movements or placements are featured more often in wins versus losses. These are potential difference makers in a chess game.

In this post, I will describe in more detail how I performed these calculations. Although this can be done by hand, I created an Excel spreadsheet to facilitate processing of the data you can get from the aforementioned tools. This spreadsheet is available for download; the rest of this article explains how to use the Excel file and gives tips on how to represent the results using the heatmap tool. Stay tuned for more differential data and analysis in the coming weeks.




Using the 'Differential' Excel Workbook

The Excel workbook available for download is fairly simple to use. You first need to generate your data of interest by running databases of wins or losses through the square utilization or occupancy programs. The output data is tab-delimited and can be copied and pasted directly into the Excel workbook. In fact, the Excel file can handle output data that includes multiple pieces (generated, for example, by using the "all pieces" option in the program).

In the Excel workbook, there is a sheet labelled INPUT. This sheet takes input data for wins and losses (or whatever two sets or databases you wish to compare) for both square utilization (traffic) and occupancy (parking). Paste the traffic data from the wins into cell A4 and the traffic data from the losses into cell J4. Likewise, paste the parking data from the wins into cell R4 and the parking data from the losses into cell AA4. If your Excel workbooks do not recalculate automatically, you can refresh the workbook by saving the file.

You are basically done copying and calculating! Wasn't that easy? Now the results will be displayed in the sheet labelled OUTPUT. You can select this data and paste it into the chessboard heatmap tool to generate a heatmap showing the 'difference makers' between the two sets. There are several features of the heatmaps which can be customized for this type of dataset, described below.

Before moving on to the heatmaps, one point is worth noting. What do the resulting values actually represent? Well, the calculations are done as follows: the data for each piece in each set is normalized to a base of 1000 (similar to the way in which baseball batting averages are reported; a square that is used 36.5% of the time will get a value of 365). Then, for each square, the normalized value from the losses is subtracted from the normalized value from the wins, so that positive results point to wins and negative results point to losses.

What does this mean for you? Basically, if you end up with a value of 10, this means that the move is found 1 percentage point more often in wins compared to losses (or whatever the two sets being compared are). 
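The calculation described above can be sketched in a few lines of Python. This is only an illustration, not the formulas inside the workbook: the function names and the sample counts are hypothetical, and it assumes that "normalized to a base of 1000" means each square's share of the piece's total count, multiplied by 1000.

```python
def normalize_per_mille(counts):
    """Scale raw per-square counts so that they sum to 1000 (36.5% -> 365).

    Assumption: the base-1000 value is the square's share of the piece's total.
    """
    total = sum(counts.values())
    if total == 0:
        return {square: 0.0 for square in counts}
    return {square: 1000.0 * n / total for square, n in counts.items()}


def differential(win_counts, loss_counts):
    """Normalized wins minus normalized losses, square by square."""
    wins = normalize_per_mille(win_counts)
    losses = normalize_per_mille(loss_counts)
    squares = sorted(set(wins) | set(losses))
    # A square missing from one set counts as 0 in that set.
    return {sq: wins.get(sq, 0.0) - losses.get(sq, 0.0) for sq in squares}


# Hypothetical counts for a single piece (not real data).
win_counts = {"e2": 30, "d1": 50, "d2": 20}
loss_counts = {"e2": 20, "d1": 60, "d2": 20}

diff = differential(win_counts, loss_counts)
for square, value in diff.items():
    print(square, round(value, 1))
```

In this made-up example, e2 comes out at 100, meaning the piece sits on e2 ten percentage points more often in wins than in losses.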

Scaling a heatmap for 'Differential' data

The heatmap tool I have created can take tab-delimited data, either the raw square utilization data or the differential data mentioned above, and display the squares with colors shaded proportionally to their value. This is a nice and easy way to visualize what squares or parts of the board have the highest or lowest values in the dataset. I have described elsewhere how to customize the heatmaps to your preference. But what values and settings should you choose if you want to represent this 'differential' data?


Let's start with a simple example: below is the Square Occupancy (Parking) data for Fischer's pieces as White; positive numbers indicate moves more often seen during wins, negative numbers indicate moves more often seen during losses. Here we will look at this data for Fischer's King and Queen. This is the output if the default settings are used:

[Image: heatmaps of Fischer's King and Queen occupancy differentials, default settings]


The problem with comparing the two heatmaps above is that each is set to a different scale. With the default settings, the maximum and minimum are calculated from each dataset and used to shade the squares. In other words, the maximum square for Queen occupancy, e2, has a value of 28, while the maximum square for King occupancy, g1, has a larger value. These two squares are shaded the same, even though their values are not that close.

This can be overcome if we manually set the maximum and minimum values to the same value for both maps. Which values should we choose? I would recommend the rather extreme values of -1000 and 1000 (minimum and maximum, respectively). The moves are normalized to a base of 1000, and it is possible (although extremely unlikely) for a dataset to have a piece move to a single square only in wins (which would give a value of 1000) and to another only in losses (which would give a value of -1000).

In other words, using these values allows you to compare any two differential heatmaps. However, we now encounter another problem. If we choose these extreme values as the limits, most of the actual data will fall very close to the middle of that range, and will be shaded the same color:


[Image: the same two heatmaps with the minimum and maximum set to -1000 and 1000]
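To see why everything ends up the same color, here is a rough Python sketch of linear shading between fixed limits. It assumes the heatmap tool places each value proportionally between the minimum and maximum, which I have not shown here; the function name `shade_fraction` and all sample values other than the 28 for e2 are hypothetical.

```python
def shade_fraction(value, minimum, maximum):
    """Position of value between minimum and maximum, clamped to 0..1.

    0.0 would be the deepest blue, 1.0 the darkest orange, 0.5 the neutral middle.
    """
    if maximum <= minimum:
        raise ValueError("maximum must be greater than minimum")
    fraction = (value - minimum) / (maximum - minimum)
    return min(1.0, max(0.0, fraction))


# 28 is the Queen's e2 value from the text; the other values are made up.
values = [-15, 0, 28]
fractions = [shade_fraction(v, -1000, 1000) for v in values]
for v, f in zip(values, fractions):
    print(v, round(f, 4))
```

With limits of -1000 and 1000, all three values land within a couple of percent of 0.5, so the squares would be nearly indistinguishable.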


Fortunately, I was able to engineer a feature to fix this. You can 'scale' the shading of a dataset, so that only a small range around the midpoint is shaded in different colors. The default scale shades across the full range, from 0 to 100; if we select another scale, for example the option labelled "2 (48-52)", the shades distinguish only values within 2% of the midpoint. In other words, anything that falls between 52% and 100% of the way from the minimum to the maximum (which, remember, are -1000 and 1000) is shaded the darkest orange. Likewise, anything below 48% is shaded the deepest blue.
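A possible way to picture the scaling is sketched below in Python. This is my reading of the "2 (48-52)" label rather than the tool's actual code: it assumes the percentages refer to a value's position between the chosen minimum and maximum, and that everything outside the window is clamped to the extreme colors. The name `scaled_shade` and the sample values other than 28 are hypothetical.

```python
def scaled_shade(value, minimum, maximum, scale_pct):
    """Shade fraction (0..1) using only a window of +/- scale_pct around 50%.

    Assumption: positions below (50 - scale_pct)% get the deepest blue (0.0)
    and positions above (50 + scale_pct)% get the darkest orange (1.0).
    """
    position = 100.0 * (value - minimum) / (maximum - minimum)
    low, high = 50.0 - scale_pct, 50.0 + scale_pct
    clamped = min(high, max(low, position))
    return (clamped - low) / (high - low)


# Limits of -1000 and 1000 with the "2 (48-52)" scale.
for value in (-40, 0, 28, 40, 100):
    print(value, round(scaled_shade(value, -1000, 1000, 2), 3))
```

Under these assumptions, the 48-52 window corresponds to differential values between -40 and 40, so a value of 28 is shaded strongly but not at the maximum, while anything at 40 or above gets the darkest orange.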


[Image: the same two heatmaps with limits of -1000 and 1000 and the "2 (48-52)" scale]


With these settings, we can once again see the trends that were visible in the beginning. If you have a sharp eye, you will notice changes between these two heatmaps and the ones we started with. The most noticeable difference, and indeed the entire point of taking these steps, is that the maximum value in the Queen Occupancy data is shaded lighter than the maximum value from the King Occupancy data. This allows you to compare not only the square values within a dataset, but across datasets, and all by an easy glance at a heatmap!


By the way, for those that are curious, the differential data featured in the ChessBase article posted online has a maximum of 1000, minimum of -1000, and was set to the scale of 1. If you generate your own heatmaps and use these settings, the colors are directly comparable to the ones I made!





