Next, you suggest that XCS learns accurate specific rules first, then uses these to evolve accurate generalizations. That is what one sees in experiments that begin either with a random population of fairly specific initial classifiers, or with a null initial population and a low # probability in created classifiers. However, it is also possible to start an experiment from the full generality end: the initial population consists of one classifier for each action, each with a fully general condition. In this case the system begins with highly inaccurate overgeneral classifiers and proceeds toward accurate specifics. Such an initialization works well in the multiplexer case (possibly not so well in Woods2; I haven't tested this much); the performance is nearly as good as shown in Figure 3. So I would say that the search can arrive at accurate general classifiers from either the overgeneral or the overspecific direction.
Finally, you asked about using a less severe accuracy criterion. Initially, when I first had the idea of basing fitness on accuracy alone (in November '93), I made it proportional to the inverse of a classifier's absolute error. The system worked well enough (on the 6-multiplexer with the "ramp" payoff environment) to tell me there was something important going on. That is, good mapping occurred and maximally general classifiers were emphasized. However, other things were nowhere near as good as the results shown in Figure 3. The (macroclassifier) population was more than twice as long, and the average error was much greater.
Between that experiment and those of Figures 3 and 4, there were many, many further experiments, with numerous changes in the way the system operated, not just the accuracy function. The details of the present system result from a chain of modifications--a hill-climbing--so it is hard to be sure just which are essential.
However, in the accuracy function itself, the error threshold (below which the accuracy is defined to be constant) came in through experiments, not yet reported, on mapping of continuous functions instead of Boolean ones. The exponential fall-off also came in there. In continuous functions, the threshold made sense because otherwise classifiers compete for pin-point accuracy and the generalization mechanism doesn't work well. With the threshold, it does. The exponential fall-off came in because I wanted a way to "lead" classifiers having above-threshold error toward lower error, or toward higher accuracy. I made the fall-off exponential because it was simplest to think about and implement, given the existence of the threshold.
Then I brought the threshold-plus-exponential-falloff function back to the discrete setting, where it worked well. I kept it because I then had a mechanism that appeared to work in both domains.
As a speculation, the exponential fall-off may be optimal for opposing the tendency toward overgeneralization. If you are a classifier, then (in the multiplexer) by adding a don't care you double your reproductive chances. Thus "effective fitness" is exponential in number of hashes. To prevent overgeneralization in such circumstances, you may need an accuracy function that is itself exponential.
Thus I am inclined to stick for the present with the exponential. The whole area of the proper relation of fitness to accuracy is certainly still very much open, though.
Your accuracy measure seems to give essentially the same (low) fitness to all classifiers that make more than a small number of errors. Its hard to see how the GA can use these poor performers to help discover more accurate classifiers. My guess is that XCS learns accurate specific rules first, then uses those to discover accurate generalizations. Is that right? How would XCS performance be affected if you used a less severe accuracy criterion (e.g. something proportional to the squared error)?