Forged Alliance Forever

Re: trueskill parameter tuning results

2017-01-07T04:29:14+02:00

Hi Jagg! Sorry for the late reply, not getting email notifications and such.

Yes I think beta=240, tau=10 is a good change. Apart from being closer to "optimal", I think players will be happier with it - their rating will continue to be at least a little responsive to changes in their skill (rust, training, practice) after they've played many games. Lack of responsiveness is something I've seen them complain about often.

In terms of predictive power of the rating system, I think changing the draw probably from 0.1 to 0.05 would be a step in the right direction too.

Axle

Statistics: Posted by Axle — 07 Jan 2017, 04:29

Re: trueskill parameter tuning results

2016-10-09T18:17:21+02:00

Hi Axle, I was wondering what figures you would suggest to be used in FAF as of now. This change was made in August: https://github.com/FAForever/server/com ... 2c16ce7eca
Tau was increased from 5 to 10. Beta was decreased from 250 to 240.
Do you think this is a good change? Should we be using numbers from your later posts, for example having a beta value of 150? Hope you will return and respond soon.

Statistics: Posted by JaggedAppliance — 09 Oct 2016, 18:17

Re: trueskill parameter tuning results

2016-05-12T14:05:21+02:00

So now I enable all the enhancements described above: draw probabilities that depend on map size; player skill as a function of map size; a rudimentary model of player rust; compensation for factional imbalance.

See below the resulting distribution of outcome probabilities that compare all the enhancements versus canonical trueskill (augmented 3 draw probabilities).

This time the difference is more pronounced than for the individual enhancements by themselves - a reduction in the number of games in with 40-60% probabilities and an increase in the number of games >70%. The overall NLML is greatly reduced from 89728 down to 89115. The optimum 'beta' parameter is down from 250 to 150, and 'tau' is down from 17 to 10. All good things indicating a more predictive rating system.

From a player-centric point of view, see TA4Life's rating progression. It is more stable, but he can expect it to respond more rapidly to his absences. And hopefully he'd be able to switch between UEF and cybran whenever he wants and have a slightly better chance of getting a fair matchup.

Thats it for now, its been a funky adventure working out all the details of factor graphs and putting trueskill on a course of steroids.

Statistics: Posted by Axle — 12 May 2016, 14:05

Re: trueskill parameter tuning results

2016-05-13T00:37:27+02:00

And the next thing you might wonder about is factional balance. The balance team sometimes makes changes to the game that turns the balance on its head. eg a long long time ago cybran used to suck real bad, they made some changes, and now UEF suck real bad instead. So I figured I want to dynamically track factional imbalance rather than model it with static parameters.

To do this I introduced "virtual-factional-allies" to the rating. Something similar was actually suggested by someone on the forum a few years ago but I can't find a good set of keywords to find the post (if the originator is reading this, shouts out to you!). The basic idea is that for each game, each player gets a virtual ally depending on the faction matchup. I'm UEF facing cybran? Then I get the UEF-v-Cybran ally, and my opponent gets the Cybran-v-UEF ally. At the end of game the spoils (or losses) in terms of rating update are distributed between me and my virtual ally as per canonical trueskill (ie bayesian inference).

Now these virtual-allies get different beta and tau parameters (and are exempt from rust / experience if used in conjunction with my rust model) in recognition that their underlying skill changes very infrequently - only when balance team makes a change. Also the mirror matchup virtual-allies (eg UEF-v-UEF) never get any rating updates, locked at mean=1500,sigma=0 in recognition that mirror matchups should always cancel as a given.

So that all sounds pretty exciting, but what do the results show?. The rating progression for the virtual-factional-allies does indeed show that cybran is OP and UEF sucks balls. I was hoping to see step changes in the progression that I could point to and say "aha! balance change" but its difficult to point to any definitive step change. Overall the NLML did drop, but not as much as the skill-as-a-function-of-mapsize and the rust models. Also disappointingly, no change in optimal beta or tau Also the distribution of outcome probabilities shows no discernable difference (apart from the aggregate quantitiative NLML), and TA4Life's rating progression shows no significant difference either.

It is interesting to note that certain faction matchups do appear to have a significant penalty. eg using uef against cybran may disadvantage you by up to 50 pts of skill! ( uef-v-cybran's ally has skill of about 1475, and therefore cybran-v-uef's ally has skill about 1525, for a total of 50 pts difference). If this is infact a reliable causitive measure of imbalance, it is conceivably worthwhile incorporating into the auto-match system for better player experience. For the player that always chooses UEF, theres no difference because his own personal rating will soon enough adjust accordinly and he'll be appropriately matched. However if he (or his opponent) likes to play random, or frequently changes faction, auto-matcher would now be able to instantly compensate.

I can't upload all the faction vs faction rating progressions, so I'll just summarise the results here:

aeon vs cybran: 16pts in cybran favour
aeon vs seraphim: 8pts in seraphim favour
cybran vs seraphim: 8pts in cybran favour
uef vs aeon: 24pts in aeon favour
uef vs cybran: 60pts in cybran favour
uef vs seraphim: 50pts in seraphim favour

Next to come: What if we put it all together?

Statistics: Posted by Axle — 12 May 2016, 14:03

Re: trueskill parameter tuning results

2016-05-12T14:15:31+02:00

The next part of trueskill that begs attention is the model of what happens to a player's skill *between* games. Trueskill has this 'tau' parameter, which basically says that if we reckon a player has skill 1800±50 based on the results of game N, then maybe by the time he enters his (N+1)th game he has skill 1800±60 because we know that human skill doesn't stay the same, it keeps changing. Probably something happened in between game N and game N+1 and his skill changed a bit in some unpredictable way.

Well if we imagine reasons why it might have changed, we might be able to go some way towards reducing the unpredictability of it. Two obvious factors that come to mind are rust and experience.

Rust: the longer a player abstains from playing, the more his technical abilities diminish and the more he forgets the nuances of his coveted and polished build orders.

Experience: Having lost one too many games by forgetting to build energy, maybe now, after this game, he can finally remember to build energy. Or maybe he just watched a few replays and learned some tricks. That all ought to be worth a few points of skill at least.

So in order to model rust and experience, I added some stuff to reduce the mean and increase the variance of player's skill depending on the time since last game, and also increased the mean just a little bit just for playing on the assumption that players will generally learn something by playing.

As before there was a great reduction in NLML, and also an interesting change in the optimal 'tau' parameter. This reduced from 18 down to 10. Now the situation isn't quite as simple as previously with the 'beta' parameter because while the 'tau' parameter is lower, we are adding variance elsewhere through the rust model. But what we can say is that we're not indiscriminately blanketing all games with the same amount of 'tau'. Only games where the player has had significant amounts of down-time do we add any significant amount of variance beyond tau. And that tau is now much lower so we can say that we've reduced the amount of residual uncertainty in our model - another win!

Again with the distribution of outcome probabilities we see a general, if slight, shift to the right. A reduction in number of games in the 50-60% region and an increase in the number of games in the >70% region. Player rating progressions show significant differences again. I've included Photon's progression this time because he has an interesting step change after a haitus just after the 200th game that I can point to. Without the rust model, his loss of points is kind of gradual, whereas with the rust model the loss of points is very rapid. But after that, the subsequent recovery of points is quite similar. And generally the ratings are much less eratic.

btw, it looks like the average player might initially lose points at a rate of 8pts/month of inactivity with uncertainty 30pts(stdev)/month. And he learns maybe 0.6 points just by playing a game.

Next to come: What happens if we control for factional imbalance?

Statistics: Posted by Axle — 12 May 2016, 13:54

Re: trueskill parameter tuning results

2016-05-12T14:13:58+02:00

So if its obvious that draw probability depends on map size, its pretty reasonable to wonder how much individual players' skill depends on map size too. So I added some stuff to examine exactly that.

Instead of keeping track of a single scalar skill (with scalar mean and variance), we track a vector of skill parameters (with a mean vector and covariance matrix). For any given map size, the applicable player skill is some (map-size dependent) linear combination of the skill vector. Upon win, loss or draw, we propagate that information backwards through the linear reduction back into the skill vector resulting in an updated mean vector and covariance.

This arrangement should be superior to just simply maintaining independent skill ratings for each map size because it recognises that there should be a fair degree of correlation between how skillful a player is on small maps and how skillful he is on medium and large maps. Its also conceivably possible to model a variety of other game specific factors this same way - eg whether or not players are divided by water, abundance of mexes and reclaim, etc but we'll just stick to the 3-map-size factors for now.

And what happens when? Well firstly, after finding the optimum weights for the linear combination, we find that the NLML drops significantly. This is good, it indicates more accurate predictions. But also, the optimal 'beta' parameter drops from 250 down to 150. This is very interesting because the beta parameter models how much uncertainty there is in a player's *performance* given his skill level. ie if I know that a player's actual skill is precicely 1800 (with zero standard deviation), the beta parameter tells me that in any given game he is likely *perform* at a level anywhere between 1800-3beta to 1800+3beta. The factors that influence his performance are probably numerous and many of them will be unpredictable ... time of day, level of caffeination, how much his girlfriend distracts him, but also .... size of the map! Having controlled for map size, the degree of uncertainty in performance is reduced, and not just by a little bit. 250 down to 150 is a big deal.

The attached distribution shows a slight move to the right. Specifically a reduction in the number of games with outcome probability in the 50-60% region and a slight increase in the number of games with probability >70%. From a player centric point of view, take TA4Life's rating progression (this plot is actually the average of the 3-vector skill). It is much more stable and his final rating is much different than before.

Next to come: What happens if we incorporate a model of rust?

Statistics: Posted by Axle — 12 May 2016, 13:51

Re: trueskill parameter tuning results

2016-05-13T00:41:39+02:00

A while ago Yorick wondered what happens if you model draw probability as a function of map size. Well lo and behold, here are results from a modified trueskill that models a different draw probability as a function of map size: small (eg theta), medium (eg Loki) or large (eg Seraphim Glaciers).

It turns out the "optimal" draw probabilities are 7.6%, 2.9% and 2.6% for small, medium and large maps respectively.

I have two attachments:
- distribution of outcome probabilities for the 3-class trueskill versus canonical trueskill.
- progression rating for my unit standard player (TA4Life)

Firstly, there is an improvement in overall NLML (neg-log-marginal likelihood, referred to just as L in the previous posts).
Secondly, the only significant difference in distribution of outcome probabilities is in the spikes on the left hand side (towards the lower probabilities).
Thirdly, there is no practical difference in actual player rating.

So it looks to me that the map-size specific draw probabilities do not affect player rating significantly, but would help the auto-matcher more accurately predict a draw (rather than just call a close game) - whether or not this is useful is debatable.

As an aside I'll note that I did implement Ralf Herbrich's newer modifications to trueskill, where he attempts to track each individual player's draw probability [ http://research.microsoft.com/apps/pubs ... x?id=74417 ]. However the results (in terms of NLML) weren't as good as just setting a fixed draw probability for each map size so I dropped that line of inquiry - I guess there aren't enough Lames to make it worthwhile.

Next to come: What happens if we track the skill of each player as a function of map size?

Statistics: Posted by Axle — 12 May 2016, 13:49

Re: trueskill parameter tuning results

2016-02-25T22:46:54+02:00

Data is on github https://github.com/Axle1975/pytrueskill . See data/replaydumper/*.csv

Statistics: Posted by Axle — 25 Feb 2016, 22:46

Re: trueskill parameter tuning results

2016-02-25T15:23:51+02:00

Where did you get your stats for this? I'd quite like to have a go at tweaking things myself

Statistics: Posted by Softly — 25 Feb 2016, 15:23

Re: trueskill parameter tuning results

2016-02-25T15:02:41+02:00

I made some histograms of the predicted probability of outcome, P(x), for three cases:
- optimal parameters (RED, beta=240, tau=18, pdraw=0.045)
- near optimal parameters (GREEN, beta=240, tau=10, pdraw=0.045)
- existing parameters (BLUE, beta=250, tau=5, pdraw=0.1)

The spikes at the lower probability end are due to draws.

For everything else, there's a general increase in prediction probability when changing to the optimal parameters.

If you want to see QQ plots, they're in github in the data/learning_results directory, but they pretty much tell the same story.

Statistics: Posted by Axle — 25 Feb 2016, 15:02

Re: trueskill parameter tuning results

2016-02-25T04:52:11+02:00

Thanks Softly, thats a good idea, a plot of the distribution of P(x) would be very informative. I'll get onto it when I have time.

I wouldn't expect P(x)'s to be all that much better than random chance, considering that the match maker purposely chooses closely matched players. This means we need lots of data to be able to see the patterns through the noise. Which reminds me:

I notice that I only have ~1600 games for TA4Life. I know that he played over 6500 games at the time did the capture. Such downsampling of data might (or might not) influence the optimal tau. @Sheeo, @Aulex, @anyone, what would be really good is if we could get a database dump so we can have ALL game results to play with.

Statistics: Posted by Axle — 25 Feb 2016, 04:52

Re: trueskill parameter tuning results

2016-02-24T14:44:55+02:00

Sounds good!

Do you have some stats for the typical contribution to L for a single game (how are these distributed)? Reading roughly from the previous graphs it seems that the mean is in the region -0.6 to -0.9, suggesting the confidence in the actual result was ~0.4 to ~0.55. Of course this doesn't take into account that a draw will add a significant above average contribution (a confidence of 0.1 in the draw would contribute ~ -2.3).

It seems to perform better than random guessing (which I work out to have a typical confidence of ~0.38 for pdraw 0.1).

Statistics: Posted by Softly — 24 Feb 2016, 14:44

Re: trueskill parameter tuning results

2016-02-24T03:29:37+02:00

Hi Softly, lets examine your example: use (P(x)-1) as a measure of success.

The objective would be to maximise P(x)-1 over all games. Lets say that we're going to sum P(x)-1 over all games. This is the same as maximising sum(P(x)) over all games.

The L metric is infact sum(ln(P(x))) so pretty similar. But why the log? well consider the probability of the outcome of a sequence of N independent games with outcomes x1,x2,x3,...xN. The probability of that sequence of outcomes is Ptotal = prod(P(xi)). ie log(Ptotal) = sum(log(P(xi)) = L. By maximising L we maximise the probability of the sequence of outcomes given a specific set of trueskill parameters (beta, tau, pdraw) - ie the likelihood of the parameters given the outcomes - ie we have the maximum likelihood estimate of parameters.

Ok, but still.... why the log? why not maximise Ptotal directly? If you plot Ptotal versus your trueskill parameters you get a very short narrow spike at the optimum Ptotal. This make a difficult function to search because basically everything is nearly zero except for the single data point right at the optimum. log(Ptotal) has a much broader shape, much easier to use a hill climbing optimisation algorithim on.

Also, depending on the details sum(log(P)) might be more efficient, and actually more accurate to calculate. You could get your trueskill functions to tell you directly log(Pdraw), log(Pwin) and log(Plose). For many probability distributions (eg Gaussian) this saves a call to the exp function.

Statistics: Posted by Axle — 24 Feb 2016, 03:29

Re: trueskill parameter tuning results

2016-02-24T02:39:37+02:00

I'd say that assigning probabilities to to each possible is precisely a prediction of the outcome.

What's the advantage of the L metric over other possibilities?

For example if the true outcome was x, then we could use the value (P(x) - 1) as a measure of success.

Statistics: Posted by Softly — 24 Feb 2016, 02:39

Re: trueskill parameter tuning results

2016-02-24T01:30:09+02:00

Hi Softly! I do believe that the L metric shown above is a direct measure of how well trueskill predicts results.

Success rate is only part of the story. Recall that trueskill doesn't really predict an outcome, it only gives liklihoods of 3 outcomes. eg 5% probability of draw; 47% probablity player 1 wins; 48% probability player 1 loses. You could say in this case that its predicted a loss for player 1, but the level of confidence isn't very high and that single statement doesn't really capture what trueskill is really telling you.

The L metric is just the sum of the log of the probability of the actual outcome. eg with Pwin=0.9 and Ploss=0.1, a WIN outcome would contribute ln(0.9) = -0.1 to L whereas a LOSS outcome would contribute ln(0.1) = -2.3

In order to get a larger L, our trueskill must not only make a greater number of correct predictions, but make those predictions with higher confidence. ie if it calls Pwin=0.9, Ploss=0.1, a WIN outcome is much stronger in this case than if it had called Pwin=0.6, Ploss=0.4.

That said, I can appreciate that the average FAF player doesn't care about probabilities, he just wants to see something he can relate to. When I get time I'll look at % success rate as a function of these tuning parameters and post results.

Statistics: Posted by Axle — 24 Feb 2016, 01:30