trueskill parameter tuning results

Re: trueskill parameter tuning results

Postby Axle » 24 Feb 2016, 01:30

Hi Softly! I do believe that the L metric shown above is a direct measure of how well trueskill predicts results.

Success rate is only part of the story. Recall that trueskill doesn't really predict an outcome; it only gives likelihoods of 3 outcomes, eg 5% probability of a draw, 47% probability player 1 wins, 48% probability player 1 loses. You could say in this case that it's predicted a loss for player 1, but the level of confidence isn't very high, and that single statement doesn't capture what trueskill is really telling you.

The L metric is just the sum, over all games, of the log of the probability of the actual outcome. eg with Pwin=0.9 and Ploss=0.1, a WIN outcome would contribute ln(0.9) ≈ -0.1 to L whereas a LOSS outcome would contribute ln(0.1) ≈ -2.3.

In order to get a larger L, our trueskill must not only make a greater number of correct predictions, but also make those predictions with higher confidence. ie a WIN outcome counts for much more if it called Pwin=0.9, Ploss=0.1 than if it had called Pwin=0.6, Ploss=0.4.
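As a rough illustration of the L metric described above (a minimal sketch, not Axle's actual code; the function name and the toy predictions are made up for this example), here are two predictors with the same 90% success rate but different confidence:

```python
import math

def log_likelihood(predictions, outcomes):
    """The L metric: sum of ln(P(actual outcome)) over all games."""
    return sum(math.log(p[x]) for p, x in zip(predictions, outcomes))

# Toy data (hypothetical): player 1 wins 9 games out of 10.
outcomes = ["win"] * 9 + ["loss"]

# Both predictors "call" a win every time, so both are right 9 times out of 10,
# but one of them is far more confident about it.
confident = [{"win": 0.9, "loss": 0.1}] * 10
cautious = [{"win": 0.6, "loss": 0.4}] * 10

L_confident = log_likelihood(confident, outcomes)
L_cautious = log_likelihood(cautious, outcomes)

print("L (confident):", round(L_confident, 3))
print("L (cautious): ", round(L_cautious, 3))
```

The confident predictor ends up with the larger (less negative) L even though the success rates are identical. Note that the confidence has to be justified by the results: each wrong call at Pwin=0.9 costs ln(0.1), so an overconfident predictor is penalised heavily.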

That said, I can appreciate that the average FAF player doesn't care about probabilities, he just wants to see something he can relate to. When I get time I'll look at % success rate as a function of these tuning parameters and post results.
Axle
Avatar-of-War
 
Posts: 79
Joined: 02 Apr 2013, 10:14
Has liked: 0 time
Been liked: 3 times
FAF User Name: Axle

Re: trueskill parameter tuning results

Postby Softly » 24 Feb 2016, 02:39

I'd say that assigning probabilities to each possible outcome is precisely a prediction of the outcome.

What's the advantage of the L metric over other possibilities?

For example if the true outcome was x, then we could use the value (P(x) - 1) as a measure of success.
Softly
Supreme Commander
 
Posts: 1005
Joined: 26 Feb 2012, 15:23
Location: United Kingdom
Has liked: 150 times
Been liked: 241 times
FAF User Name: Softles

Re: trueskill parameter tuning results

Postby Axle » 24 Feb 2016, 03:29

Hi Softly, let's examine your example: use (P(x)-1) as a measure of success.

The objective would be to maximise P(x)-1 over all games. Let's say that we're going to sum P(x)-1 over all games. Since the -1 terms just add up to a constant, this is the same as maximising sum(P(x)) over all games.

The L metric is in fact sum(ln(P(x))), so pretty similar. But why the log? Well, consider the probability of the outcome of a sequence of N independent games with outcomes x1,x2,x3,...xN. The probability of that sequence of outcomes is Ptotal = prod(P(xi)), ie log(Ptotal) = sum(log(P(xi))) = L. By maximising L we maximise the probability of the sequence of outcomes given a specific set of trueskill parameters (beta, tau, pdraw) - ie the likelihood of the parameters given the outcomes - ie we have the maximum likelihood estimate of the parameters.

Ok, but still.... why the log? Why not maximise Ptotal directly? If you plot Ptotal versus your trueskill parameters you get a very narrow spike at the optimum Ptotal. This makes it a difficult function to search because basically everything is nearly zero except for the single data point right at the optimum. log(Ptotal) has a much broader shape, which makes it much easier to use a hill climbing optimisation algorithm on.

Also, depending on the details, sum(log(P)) might be more efficient and actually more accurate to calculate. You could get your trueskill functions to tell you log(Pdraw), log(Pwin) and log(Plose) directly. For many probability distributions (eg Gaussian) this saves a call to the exp function.
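A small sketch of the numerical side of this argument (the probabilities are invented for illustration; math.prod needs Python 3.8 or later). For a handful of games the product and the sum of logs carry the same information, but with thousands of games the product underflows to zero in double precision while the sum of logs stays perfectly usable:

```python
import math

def total_probability(probs):
    """Ptotal = prod(P(xi)) for a sequence of independent games."""
    return math.prod(probs)

def log_total_probability(probs):
    """L = log(Ptotal) = sum(log(P(xi)))."""
    return sum(math.log(p) for p in probs)

# A few games: both forms agree.
few = [0.5, 0.6, 0.4]
P_few = total_probability(few)
L_few = log_total_probability(few)

# Thousands of games (hypothetical, every outcome predicted at 50%):
# the product is far too small for a float, the log form is fine.
many = [0.5] * 5000
P_many = total_probability(many)
L_many = log_total_probability(many)

print(P_few, math.exp(L_few))
print(P_many, L_many)
```

With the product collapsed to 0.0, an optimiser has nothing to climb, which is the "everything is nearly zero" problem taken to its limit.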

Re: trueskill parameter tuning results

Postby Softly » 24 Feb 2016, 14:44

Sounds good!

Do you have some stats for the typical contribution to L for a single game (how are these distributed)? Reading roughly from the previous graphs it seems that the mean is in the region -0.6 to -0.9, suggesting the confidence in the actual result was ~0.4 to ~0.55. Of course this doesn't take into account that a draw will add a significant above average contribution (a confidence of 0.1 in the draw would contribute ~ -2.3).

It seems to perform better than random guessing (which I work out to have a typical confidence of ~0.38 for pdraw 0.1).
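The conversions in this post can be reproduced with a few lines of Python. This is only a sketch of one way to arrive at the ~0.38 figure: it assumes "random guessing" means always assigning pdraw to a draw and splitting the rest evenly between win and loss, and that draws really do occur with frequency pdraw. The function names are made up for the example.

```python
import math

def typical_confidence(mean_log_contribution):
    """Geometric-mean probability implied by a mean per-game contribution to L."""
    return math.exp(mean_log_contribution)

def random_guess_mean_log(pdraw):
    """Expected per-game contribution to L for an uninformed predictor.

    Assumption: the predictor always says P(draw) = pdraw and
    P(win) = P(loss) = (1 - pdraw) / 2, and draws occur with frequency pdraw.
    """
    p_decisive = (1.0 - pdraw) / 2.0
    return (1.0 - pdraw) * math.log(p_decisive) + pdraw * math.log(pdraw)

conf_low = typical_confidence(-0.9)    # lower end of the range read from the graphs
conf_high = typical_confidence(-0.6)   # upper end of the range read from the graphs
conf_random = typical_confidence(random_guess_mean_log(0.1))

print(round(conf_low, 3), round(conf_high, 3), round(conf_random, 3))
```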

Re: trueskill parameter tuning results

Postby Axle » 25 Feb 2016, 04:52

Thanks Softly, that's a good idea, a plot of the distribution of P(x) would be very informative. I'll get onto it when I have time.

I wouldn't expect P(x)'s to be all that much better than random chance, considering that the match maker purposely chooses closely matched players. This means we need lots of data to be able to see the patterns through the noise. Which reminds me:

I notice that I only have ~1600 games for TA4Life. I know that he had played over 6500 games at the time I did the capture. Such downsampling of data might (or might not) influence the optimal tau. @Sheeo, @Aulex, @anyone, what would be really good is if we could get a database dump so we can have ALL game results to play with.

Re: trueskill parameter tuning results

Postby Axle » 25 Feb 2016, 15:02

I made some histograms of the predicted probability of outcome, P(x), for three cases:
- optimal parameters (RED, beta=240, tau=18, pdraw=0.045)
- near optimal parameters (GREEN, beta=240, tau=10, pdraw=0.045)
- existing parameters (BLUE, beta=250, tau=5, pdraw=0.1)

The spikes at the lower probability end are due to draws.

For everything else, there's a general increase in prediction probability when changing to the optimal parameters.

If you want to see QQ plots, they're in github in the data/learning_results directory, but they pretty much tell the same story.
Attachments
distribution_predicted_probability_of_outcome.png (52.08 KiB)

Re: trueskill parameter tuning results

Postby Softly » 25 Feb 2016, 15:23

Where did you get your stats for this? I'd quite like to have a go at tweaking things myself :)

Re: trueskill parameter tuning results

Postby Axle » 25 Feb 2016, 22:46

Data is on github https://github.com/Axle1975/pytrueskill . See data/replaydumper/*.csv

Re: trueskill parameter tuning results

Postby Axle » 12 May 2016, 13:49

A while ago Yorick wondered what happens if you model draw probability as a function of map size. Well lo and behold, here are results from a modified trueskill that models a different draw probability as a function of map size: small (eg theta), medium (eg Loki) or large (eg Seraphim Glaciers).

It turns out the "optimal" draw probabilities are 7.6%, 2.9% and 2.6% for small, medium and large maps respectively.
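For context, standard trueskill does not use the draw probability directly; it converts it into a draw margin on the performance difference. The sketch below uses the usual conversion for evenly matched players, margin = Phi^-1((pdraw + 1) / 2) * sqrt(n_players) * beta. How the modified trueskill in this post applies the per-map-size values is not shown in the thread, and beta=240 is simply borrowed from the earlier optimum, so treat both as assumptions.

```python
import math
from statistics import NormalDist

def draw_margin(p_draw, beta, n_players=2):
    """Standard trueskill draw margin for a given draw probability."""
    return NormalDist().inv_cdf((p_draw + 1.0) / 2.0) * math.sqrt(n_players) * beta

def draw_probability(margin, beta, n_players=2):
    """Inverse of draw_margin: P(|performance difference| < margin) for equal skills."""
    return 2.0 * NormalDist().cdf(margin / (math.sqrt(n_players) * beta)) - 1.0

BETA = 240.0  # assumption: the optimum beta quoted earlier in the thread

# Draw probabilities per map size as reported in this post.
P_DRAW_BY_MAP_SIZE = {"small": 0.076, "medium": 0.029, "large": 0.026}

margins = {size: draw_margin(p, BETA) for size, p in P_DRAW_BY_MAP_SIZE.items()}

for size, margin in margins.items():
    print(size, round(margin, 1))
```

A larger draw probability gives a wider margin, so on small maps a wider band of near-equal performances is treated as a draw.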

I have two attachments:
- distribution of outcome probabilities for the 3-class trueskill versus canonical trueskill.
- rating progression for my unit standard player (TA4Life)

Firstly, there is an improvement in overall NLML (negative log marginal likelihood, ie the negative of the L from the previous posts, so lower is better).
Secondly, the only significant difference in distribution of outcome probabilities is in the spikes on the left hand side (towards the lower probabilities).
Thirdly, there is no practical difference in actual player rating.

So it looks to me that the map-size specific draw probabilities do not affect player rating significantly, but would help the auto-matcher more accurately predict a draw (rather than just call a close game) - whether or not this is useful is debatable.

As an aside I'll note that I did implement Ralf Herbrich's newer modifications to trueskill, where he attempts to track each individual player's draw probability [ http://research.microsoft.com/apps/pubs ... x?id=74417 ]. However the results (in terms of NLML) weren't as good as just setting a fixed draw probability for each map size so I dropped that line of inquiry - I guess there aren't enough Lames to make it worthwhile.

Next to come: What happens if we track the skill of each player as a function of map size?
Attachments
Trueskill-3ClassDrawMargin-progression-TA4Life.png (54.6 KiB): TA4Life rating progression (three-class draw margin [green] versus canonical trueskill [blue])
Trueskill-3ClassDrawMargin-distribution.png (42.63 KiB): distribution of outcome probabilities (three-class draw margin [green] versus canonical trueskill [blue])
Last edited by Axle on 13 May 2016, 00:41, edited 2 times in total.

Re: trueskill parameter tuning results

Postby Axle » 12 May 2016, 13:51

So if it's obvious that draw probability depends on map size, it's pretty reasonable to wonder how much individual players' skill depends on map size too. So I added some stuff to examine exactly that.

Instead of keeping track of a single scalar skill (with scalar mean and variance), we track a vector of skill parameters (with a mean vector and covariance matrix). For any given map size, the applicable player skill is some (map-size dependent) linear combination of the skill vector. Upon win, loss or draw, we propagate that information backwards through the linear reduction back into the skill vector resulting in an updated mean vector and covariance.
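One way such an update could look, for a 1v1 game with a decisive result, is sketched below. This is not the code from the repository: it is the textbook trueskill win/loss update (truncated-Gaussian moment matching) pushed back through a linear projection, and every name, weight and prior in it is a made-up assumption. Draws and the tau dynamics are left out.

```python
import math
from statistics import NormalDist

_STD = NormalDist()

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mat_vec(S, a):
    return [dot(row, a) for row in S]

def win_update(winner, loser, weights, beta, draw_margin=0.0):
    """Update (mean vector, covariance matrix) pairs after winner beats loser.

    The skill used on this map is the projection weights . skill_vector.
    """
    (m1, S1), (m2, S2) = winner, loser
    g1, g2 = mat_vec(S1, weights), mat_vec(S2, weights)  # covariance of vector with projected skill
    c2 = 2 * beta ** 2 + dot(weights, g1) + dot(weights, g2)  # variance of performance difference
    c = math.sqrt(c2)
    t = (dot(weights, m1) - dot(weights, m2) - draw_margin) / c
    v = _STD.pdf(t) / _STD.cdf(t)  # mean shift of the truncated Gaussian
    w = v * (v + t)                # variance reduction of the truncated Gaussian

    def apply(m, S, g, sign):
        n = len(m)
        new_m = [m[i] + sign * g[i] * v / c for i in range(n)]
        new_S = [[S[i][j] - g[i] * g[j] * w / c2 for j in range(n)] for i in range(n)]
        return new_m, new_S

    return apply(m1, S1, g1, +1), apply(m2, S2, g2, -1)

# Hypothetical setup: 3-component skill vector, identical priors for both players.
prior_mean = [1500.0, 1500.0, 1500.0]
prior_cov = [[200.0 ** 2 if i == j else 0.0 for j in range(3)] for i in range(3)]
small_map_weights = [0.6, 0.3, 0.1]  # made-up linear combination for "small" maps

(new_m1, new_S1), (new_m2, new_S2) = win_update(
    (prior_mean, prior_cov), (prior_mean, prior_cov), small_map_weights, beta=150.0
)
print([round(x, 1) for x in new_m1])
print([round(x, 1) for x in new_m2])
```

The components with the largest weight on this map move the most, and the update introduces off-diagonal covariance between components, which is what ties the different map sizes together.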

This arrangement should be superior to simply maintaining independent skill ratings for each map size because it recognises that there should be a fair degree of correlation between how skillful a player is on small maps and how skillful he is on medium and large maps. It's also conceivable to model a variety of other game specific factors this same way - eg whether or not players are divided by water, abundance of mexes and reclaim, etc - but we'll just stick to the 3-map-size factors for now.

And what happens then? Well firstly, after finding the optimum weights for the linear combination, we find that the NLML drops significantly. This is good, it indicates more accurate predictions. But also, the optimal 'beta' parameter drops from 250 down to 150. This is very interesting because the beta parameter models how much uncertainty there is in a player's *performance* given his skill level. ie if I know that a player's actual skill is precisely 1800 (with zero standard deviation), the beta parameter tells me that in any given game he is likely to *perform* at a level anywhere between 1800-3beta and 1800+3beta. The factors that influence his performance are probably numerous and many of them will be unpredictable ... time of day, level of caffeination, how much his girlfriend distracts him, but also .... size of the map! Having controlled for map size, the degree of uncertainty in performance is reduced, and not just by a little bit. 250 down to 150 is a big deal.
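To put numbers on what the change in beta means, here is a small sketch. It assumes both players' skills are known exactly (zero variance) and ignores draws; the 1600-rated opponent is an invented example.

```python
import math
from statistics import NormalDist

def performance_range(skill, beta, n_sigma=3):
    """Range in which a player of exactly known skill is likely to perform."""
    return skill - n_sigma * beta, skill + n_sigma * beta

def win_probability(skill_a, skill_b, beta):
    """P(a outperforms b), each performance drawn from N(skill, beta^2).

    Assumption: skills are known exactly and there is no draw margin.
    """
    return NormalDist().cdf((skill_a - skill_b) / (math.sqrt(2) * beta))

for beta in (250, 150):
    low, high = performance_range(1800, beta)
    p = win_probability(1800, 1600, beta)
    print(f"beta={beta}: performs between {low} and {high}, "
          f"P(beats a 1600 player)={p:.2f}")
```

With beta=250 the 1800 player's likely performance spans 1050 to 2550; with beta=150 it narrows to 1350 to 2250, and the same 200 point rating gap translates into a noticeably more confident prediction.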

The attached distribution shows a slight move to the right. Specifically a reduction in the number of games with outcome probability in the 50-60% region and a slight increase in the number of games with probability >70%. From a player centric point of view, take TA4Life's rating progression (this plot is actually the average of the 3-vector skill). It is much more stable and his final rating is much different than before.

Next to come: What happens if we incorporate a model of rust?
Attachments
Axeskill-3ClassRating-progression-TA4Life.png (64.63 KiB): TA4Life rating progression (3-class ratings for each player [blue] versus single-class ratings [green])
Axeskill-3ClassRating-distribution.png (45.98 KiB): distribution of outcome probabilities (3-class ratings for each player [blue] versus single-class ratings [green])
Last edited by Axle on 12 May 2016, 14:13, edited 1 time in total.
