I have a question about the simulation step in MCTS: how should a node be rewarded when the simulation ends in a win, a loss, or a draw? According to some blog posts about AlphaZero, the selection score is
Q + U = Wi/Ni + C * Pi * sqrt(N) / (1 + Ni)
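Option 1: reward Wi + 1 for a win, Wi - 1 for a loss, and Wi + 0 for a draw.
When Wi becomes negative, e.g. -4, actions with Pi == 0 also get searched, presumably because their score of 0 is higher than the negative Wi/Ni of the visited node. That does not seem right: Pi == 0 means the policy assigns no probability to this action, so it should not be added to the search.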
Option 2: reward Wi + 1 for a win, Wi + 0.5 for a draw, and Wi + 0 for a loss. Then Wi/Ni is about 0.5 for a node that tends to end in a draw, which is much larger than the value of a node that has not been searched yet (where Wi == 0).
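
To make the two cases concrete, here is a minimal Python sketch of the score above. It is only an illustration under some assumptions, not code from any particular AlphaZero implementation: the function name puct_score, the parameter unvisited_q, and all numbers are made up for this example, the exploration term uses sqrt(N) as in the AlphaZero paper, and an unvisited node is assumed to have Wi/Ni treated as 0.

```python
import math


def puct_score(w, n, prior, parent_visits, c=1.0, unvisited_q=0.0):
    """Return Q + U for one child node.

    w, n          -- accumulated reward Wi and visit count Ni of the child
    prior         -- policy prior Pi of the child
    parent_visits -- visit count N of the parent
    unvisited_q   -- value used for Q when n == 0 (assumption: 0 by default)
    """
    q = w / n if n > 0 else unvisited_q
    u = c * prior * math.sqrt(parent_visits) / (1 + n)
    return q + u


# Option 1: win = +1, draw = 0, loss = -1.
# A child visited 5 times that mostly lost (Wi = -4) with prior 0.6 ...
visited_1 = puct_score(w=-4, n=5, prior=0.6, parent_visits=5)
# ... versus an unvisited child whose prior is exactly 0.
zero_prior = puct_score(w=0, n=0, prior=0.0, parent_visits=5)
print("option 1:", visited_1, "vs zero-prior child:", zero_prior)

# Option 2: win = +1, draw = +0.5, loss = 0.
# A child visited 5 times that always drew (Wi = 2.5) with prior 0.2 ...
drawish_2 = puct_score(w=2.5, n=5, prior=0.2, parent_visits=5)
# ... versus an unvisited child with a small prior of 0.1.
fresh_2 = puct_score(w=0, n=0, prior=0.1, parent_visits=5)
print("option 2:", drawish_2, "vs unvisited child:", fresh_2)
```

With these numbers, the zero-prior child scores higher than the losing child under option 1, and the drawish child scores higher than the unvisited child under option 2, which are exactly the two effects described above. Both depend on what Q is assumed to be for an unvisited node: passing a different unvisited_q (for example the draw value of the chosen reward scale) changes the comparison, so that default looks like the design choice to examine.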