Note that the important thing is to give classifiers with different errors different accuracies, and thus different fitnesses. Those with the least error will then win out. It is not important that the classifiers be "accurate" in the sense of having errors less than epsilon0; my use of the term "accurate" was merely technical, denoting classifiers with errors below the threshold.
(In the following I use p to represent the classifier's prediction; Eo to represent Epsilon sub 0, the minimum prediction error divided by the maximum payoff; Ej to represent Epsilon sub j, the classifier's current prediction error normalised in the same way; and F to represent the current fitness.)
I have been playing about with your updates of p, Eo and F. It would appear that, if you are in an environment where the payoffs received are even slightly non-deterministic with respect to a classifier's condition, it is difficult to obtain an accuracy value of 1.0.
This accuracy value is obtained when the cut-off in the accuracy function is triggered, i.e. when the rule's prediction error (Ej) falls below the (user-specified) minimum error. However, with a slightly oscillating reward sequence such as 9.0, 9.5, 9.3 (out of a maximum reward of 10), Ej does not fall far enough to trigger the cut-off and produce 1.0 when Eo is set to 0.01.
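To make this concrete, here is a minimal sketch of the kind of simulation I mean, assuming the usual Widrow-Hoff updates for the prediction and its error; the learning rate beta = 0.2 and the update order are my assumptions, not necessarily the exact update rules in question:

```python
beta = 0.2          # learning rate (assumed value)
max_payoff = 10.0   # reward range used for normalisation
Eo = 0.01           # user-specified minimum error threshold

p, err = 9.0, 0.0   # prediction and raw (unnormalised) error estimate
rewards = [9.0, 9.5, 9.3] * 200   # slightly oscillating payoff sequence

for R in rewards:
    err += beta * (abs(R - p) - err)   # move error estimate toward |R - p|
    p += beta * (R - p)                # move prediction toward R

Ej = err / max_payoff                  # normalised error, as defined above
print(f"p = {p:.3f}, Ej = {Ej:.4f}, below Eo? {Ej < Eo}")
```

Even though the prediction settles near the mean payoff, Ej converges to roughly 0.017, comfortably above Eo = 0.01, so the cut-off is never triggered.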
At the same time, the steep logarithmic curve of the accuracy calculation allows (in this example) the computed accuracy values to hover around 0.1, clearly a relatively poor accuracy result even though the classifier predicts the reward fairly well. Setting alpha to a higher figure (in this example it was 0.1) will, of course, improve things, since the logarithmic curve will be less steep and the accuracy values returned will rapidly move towards this higher alpha. However, the profile of the curve then provides less distinction between classifiers which have not yet reached the accuracy cut-off point.
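For reference, the accuracy function I am assuming here is the exponential form with a cut-off: accuracy is 1 when Ej is below Eo, and otherwise decays so that it equals alpha at Ej = 2*Eo. If a different functional form is in use the exact numbers will differ, but the sketch shows how quickly the curve falls:

```python
import math

def accuracy(Ej, Eo=0.01, alpha=0.1):
    """Cut-off accuracy: 1.0 below the Eo threshold, otherwise an
    exponential fall-off that passes through alpha at Ej = 2*Eo."""
    if Ej < Eo:
        return 1.0
    return math.exp(math.log(alpha) * (Ej - Eo) / Eo)

for Ej in (0.005, 0.017, 0.02, 0.05):
    print(f"Ej = {Ej:<6} accuracy = {accuracy(Ej):.4f}")
```

With alpha = 0.1, an error of Ej = 0.02 already yields an accuracy of exactly 0.1, and Ej = 0.05 drops it to 0.0001; raising alpha to 0.25 lifts the Ej = 0.02 case to 0.25 but flattens the distinctions beyond the cut-off.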
Variable payoffs could easily occur in a robotics control problem where the classifier inputs are reduced from the actual inputs that the robot receives, or where the robot's sensors are not accurate enough to distinguish all environmental states as separate states, in spite of the fact that the payoffs for those states are close.
My simulation of the calculations could, of course, be in error, and the use of the discounted Max as the payoff would tend to smooth such instances somewhat, but I remain concerned that the steep logarithmic curve might cause some classifiers to be classed as inaccurate when in fact they are predicting a slightly non-deterministic environmental signal well.