Active Online Learning in the Binary Perceptron Problem
Zhou Hai-Jun1, 2, †
1CAS Key Laboratory for Theoretical Physics, Institute of Theoretical Physics, Chinese Academy of Sciences, Beijing 100190, China
2School of Physical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China

 

† Corresponding author. E-mail: zhouhj@itp.ac.cn

Supported by the National Natural Science Foundation of China under Grant Nos. 11421063 and 11747601 and the Chinese Academy of Sciences under Grant No. QYZDJ-SSW-SYS018

Abstract

The binary perceptron is the simplest artificial neural network, formed by N input units and one output unit, with the neural states and the synaptic weights all restricted to ±1 values. The task in the teacher-student scenario is to infer the hidden weight vector by training on a set of labeled patterns. Previous efforts on the passive learning mode have shown that learning from independent random patterns is quite inefficient. Here we consider the active online learning mode, in which the student designs every new Ising training pattern. We demonstrate that it is mathematically possible to achieve perfect (error-free) inference using only N designed training patterns, but this is computationally infeasible for large systems. We then investigate two Bayesian statistical designing protocols, which require 2.3N and 1.9N training patterns, respectively, to achieve error-free inference. If the training patterns are instead designed through deductive reasoning, perfect inference is achieved using N + log2N samples. The performance gap between the Bayesian and deductive designing strategies may be narrowed in future work by taking into account the possibility of ergodicity breaking in the version space of the binary perceptron.

1 Introduction

The perceptron invented by Frank Rosenblatt in 1957 is probably the simplest artificial neural network.[1] It has N inputs and one output, with each input neuron i affecting the output neuron through a synapse of weight Ti.[2–3] Given an N-dimensional input vector ξ ≡ (ξ1, ξ2,…, ξN), the binary output state σ is determined according to a (highly nonlinear) sign function
σ = sign(∑i Ti ξi) ,
where T ≡ (T1, T2,…,TN) denotes the synaptic weight vector. The output σ = 1 if the overlap between ξ and T, defined as ∑i Ti ξi, is positive and σ = −1 if this overlap is negative (throughout we take N odd, so the overlap never vanishes). We consider the binary (or Ising) perceptron problem, so all the synaptic weights and neural states are restricted to be Ising-valued (i.e., Ti, ξi ∈ {−1, +1} for every input neuron i of the system). These constraints make the binary perceptron much more challenging to study than its continuous counterpart.[2–3]
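As a concrete illustration, the sign rule of Eq. (1) can be written in a few lines of Python; the function name is ours, and the sketch assumes ±1-valued inputs with N odd so that the overlap never vanishes:

```python
def perceptron_output(weights, pattern):
    """Binary perceptron output rule (Eq. (1)): sigma = sign(sum_i T_i * xi_i).

    `weights` plays the role of T and `pattern` the role of xi; both are
    sequences of +1/-1 entries.  With N odd the overlap cannot be zero,
    so the sign is always well defined."""
    overlap = sum(t * x for t, x in zip(weights, pattern))
    return 1 if overlap > 0 else -1
```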

The perceptron can serve as a linear classifier. In this scenario, given P patterns with indices μ = 1, 2,…,P and their binary labels σμ, the task is to find a vector T such that these P patterns are correctly classified, that is, σμ = sign(ξμ · T) holds for each and every one of these P patterns.[2–3] The set of all such compatible weight vectors forms the version space (or solution space) of the perceptron. If the input binary patterns ξμ are random and independent, it was predicted using the replica method of statistical physics that there exist binary solutions T to this classification problem as long as P < 0.83N.[4] Several message-passing algorithms[5–6] inspired by the cavity method of statistical physics have been implemented to solve single instances of the perceptron classification problem. More recently it was revealed that, as the density α ≡ P/N of random input binary patterns increases, the typical (equilibrium) solutions of this classification problem become widely separated from each other and are extremely hard to reach,[7–8] while there are also sub-dominant dense regions in the version space, which can be reached through entropy-weighted sampling strategies.[9–10]

The perceptron can also be studied from the teacher–student perspective, with T understood as the teacher’s weight vector, which is hidden from the student. For each query ξ to the teacher, the correct label σ(ξ) as computed through Eq. (1) is revealed to the student, and the task for the student is to infer the hidden vector T by learning from the (ξ, σ) associations.[2–3] This inference task may be carried out in an offline manner, meaning that the training samples are repeatedly examined by the student during the learning process. It may also be carried out in an online manner, meaning that each training sample is used only once by the student to update his/her belief about T. We study online learning in the present work.

The learning performance of the teacher–student perceptron system has been investigated by many authors. In the passive learning mode, for which the training patterns are independent and random, it was predicted that perfect (error-free) inference of a binary vector T is theoretically possible with P ≈ 1.25N binary samples.[11–12] But this theoretical limit has never been achieved by actual heuristic algorithms. Theoretical and numerical studies of various algorithms have found that the generalization error ε of the passive learning process decreases with the pattern density α only algebraically, e.g., ε ∝ α−1. This means that in the thermodynamic limit of N → ∞, perfect inference cannot be achieved at any finite value of the pattern density α.[13–19]

In this work we address the issue of active learning, which aims at accelerating online inference by carefully designing the training patterns. After the student has encountered P samples and has already gained some knowledge about the truth vector T, how should she/he design the (P + 1)-th query so that the answer to this new query will be most informative for inference? This interesting question was explored by many authors in the early 1990s (see, e.g., Refs. [20–27]), but the focus was on minimization of the generalization error rather than on error-free inference. In Sec. 2 of the present paper we prove that error-free learning of T can be achieved using at most N + log2N designed training patterns through deductive reasoning, which is only slightly beyond the theoretical lower bound, N. If the optimal Bayesian inference strategy is used instead of deductive logic, we find that error-free inference using exactly N designed samples is indeed possible, but it is computationally feasible only for small systems (Sec. 3). We then implement two heuristic designing algorithms in Secs. 4 and 5 for large systems based on this optimal Bayesian principle. Our simulation results demonstrate that these two heuristic algorithms need 2.3N and 1.9N training samples, respectively, to achieve error-free inference.

Although the deductive-logic algorithm certainly outperforms the data-driven Bayesian algorithms, the observation that the Bayesian statistical approach achieves perfect inference of N bits with fewer than 2N one-bit measurements is still quite encouraging. We expect that the performance of the Bayesian active inference algorithms will be further improved after taking into account the possibility of ergodicity breaking in the version space of the perceptron. If the version space divides into a large number of well-separated clusters, the Gaussian-distribution assumption of the mean field theory will no longer be valid (Sec. 6), and more advanced mean field theories, such as the first-step replica-symmetry-breaking cavity method, will be needed to better describe the complicated statistical correlations of the version space.

The concept of active (or adaptive) learning has been widely discussed in the fields of education science[28] and optimal experimental design.[29–30] Science itself may also be considered as an active learning process,[29] for which data-inspired intuitive insights, controlled experiments, and deductive reasoning are all indispensable. Artificial deep neural networks are becoming powerful tools for extracting the most important features from huge amounts of data,[31–32] facilitating hypothesis formation and experimental design. Recently there has been great enthusiasm in this direction, and efficient active learning algorithms for deep neural networks are starting to be explored.[33–36] A lot of work remains to be done on this important issue. On the theoretical side, the binary perceptron may serve as a simple model system for pushing the limits of active Bayesian learning. Another basic inference problem closely related to the perceptron model is the so-called one-bit compressed sensing problem.[37–38] At the moment only the passive mode of one-bit compressed sensing has been considered in the literature. Beyond the single-layered perceptron and one-bit compressed sensing, the next and more challenging model is the multi-layered binary perceptron system.

2 Inference by Deductive Reasoning

We first show that if the student employs deductive reasoning, online perceptron learning can be made very efficient. At the start of the learning process, the student can simply choose an arbitrary pattern ξ = (ξ1,ξ2,…,ξN) ∈ {−1, +1}N as the query. For convenience of discussion, let us define a particular overlap function q(n) on the integer domain n ∈ {0, 1,…,N} as
q(n) ≡ ∑i ξi(n) Ti .
This function is simply the overlap (or the scalar product) of the teacher’s weight vector T and the modified pattern ξ(n) ≡ (−ξ1,…, −ξn, ξn+1, …,ξN) obtained by flipping the first n entries of ξ. An example of the function q(n) is illustrated in Fig. 1. Because N is odd, q(n) takes only odd values. And because q(N) = −q(0) and |q(n) − q(n + 1)| = 2 for any n < N (quasi-continuity), there must exist at least one n* ∈ {0, 1,…,N − 1} for which |q(n*)| = |q(n* + 1)| = 1 and q(n* + 1) = −q(n*).

Fig. 1 An example of the overlap function q(n) for a small perceptron of size N = 33. The random initial binary pattern ξ happens to have overlap q(0) = 7 with the teacher’s binary weight vector T, and q(n) is the new overlap with T after the first n entries of ξ are all flipped. In this example q(n) changes sign seven times.

The value of such an integer n* can be determined through at most log2N queries. Starting from nl = 0 and nr = N, the query sequence goes as follows: (i) feed pattern ξ(nl) to the perceptron to get the sign of q(nl); (ii) set nm ≡ ⌈(nl+nr)/2⌉ and feed pattern ξ(nm) to the perceptron to get the sign of q(nm); (iii) if q(nm) and q(nl) have the same sign, set nl = nm, otherwise set nr = nm; (iv) if nr = nl +1, set n* = nl and quit, otherwise repeat steps (ii)–(iv).
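This bisection can be sketched in Python as follows; `sign_query` stands in for the teacher oracle, and all function names are illustrative, not from the paper:

```python
def sign_query(teacher, pattern):
    """Oracle call: the teacher reveals the sign of the overlap with `pattern`."""
    return 1 if sum(t * x for t, x in zip(teacher, pattern)) > 0 else -1

def flipped(xi, n):
    """The modified pattern xi(n): the first n entries of xi flipped."""
    return [-x for x in xi[:n]] + list(xi[n:])

def find_sign_change(teacher, xi):
    """Bisection for n* with q(n*) and q(n*+1) of opposite sign.

    Invariant: sign(q(n_l)) == s_l and sign(q(n_r)) == -s_l, so a sign
    change is trapped in [n_l, n_r]; at most ~log2(N) oracle queries."""
    n_l, n_r = 0, len(xi)
    s_l = sign_query(teacher, flipped(xi, n_l))   # sign of q(n_l)
    while n_r > n_l + 1:
        n_m = (n_l + n_r + 1) // 2                # ceil((n_l + n_r) / 2)
        if sign_query(teacher, flipped(xi, n_m)) == s_l:
            n_l = n_m
        else:
            n_r = n_m
    return n_l                                    # n* = n_l
```

Since q(n) moves in steps of ±2 through odd values, a sign change between consecutive n forces |q(n*)| = |q(n* + 1)| = 1, as stated in the text.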

The value of q(n*), i.e., the overlap between ξ(n*) and T, must be +1 or −1. Because of this fact, if one flips the i-th entry of ξ(n*) and feeds the resulting slightly modified pattern to the perceptron, the value of Ti can be deduced from the output: the flip changes the overlap from q(n*) to q(n*) − 2ξi(n*)Ti, so the sign of the output flips if and only if ξi(n*)Ti has the same sign as q(n*). That is, if the output is the same as the sign of q(n*), then Ti must be equal to −sign(q(n*)) ξi(n*), otherwise Ti must be equal to sign(q(n*)) ξi(n*).

By the above-mentioned deductive reasoning method, the binary weight vector T can be exactly determined after at most N + log2N queries. But it is clear from the above discussions that the success of this logical approach requires a deep understanding of the perceptron system. We continue to explore other active learning strategies in the next three sections.

3 Version Space Minimization

After P training samples (ξμ, σμ) have been experienced, for a weight vector J = (J1, J2, …,JN) ∈ {−1, +1}N to be compatible with all these P samples, it must satisfy
σμ ∑i Ji ξiμ > 0
for every one of these samples. We denote by ΣP the set of all the Ising weight vectors J satisfying these P constraints, and refer to this set as the version space at stage P. Notice that the teacher’s vector T is always a member of ΣP, so the volume |ΣP| of the version space must be positive. To accelerate online learning, a simple idea is to reduce the volume of the version space as much as possible with each new training pattern. When the version space finally shrinks to a single point, this surviving element must be T. Given the version space ΣP at stage P, how should the student construct the next, (P + 1)-th, Ising training pattern ξP+1?

Consider two vectors J, J′ of ΣP. Each of them has equal probability to be the truth vector T (the uniform Bayesian prior distribution is assumed). If J happens to be the truth, then the other vector J′ can be refuted by a test pattern ξ if sign(ξ; J′) = −sign(ξ; J). The probability of a randomly chosen vector J′ ∈ ΣP being refuted by the test pattern ξ is then
On the other hand, every J ∈ ΣP is equally likely to be the truth T, so we need to maximize the mean value of the above expression over all the different choices of J,
which leads to the following constraint
In other words, ξP+1 should be designed to divide the old version space ΣP into two parts of (almost) equal size. This designed pattern ξP+1 must not be completely random, since the corresponding sum (6) for a completely random pattern will typically be nonzero. After ξP+1 has been examined, one half of the members of ΣP will be discarded and the surviving vectors form the new version space ΣP+1. Let us remark that this idea of version space bisection began to be discussed in the mathematics community in the early 1970s,[39] and it underlies the widely appreciated optimal (minimum-error) Bayesian classification algorithm for the passive perceptron problem.[14,40–41]

We test the performance of this conceptually simple designing principle on small perceptrons of size N ≤ 25, for which the whole version space can be stored in the memory of a desktop computer. In Fig. 2, we show how the entropy density s and the generalization error ε change with the density α of training patterns. The entropy density measures the volume of the version space, s ≡ (1/N) log2 |ΣP|. The generalization error is computed as the probability that a randomly sampled binary test pattern will be mis-classified by a randomly chosen member J of the version space.[2–3] If the training patterns are randomly and independently drawn from the configuration space {−1, +1}N (i.e., the passive learning mode), the mean entropy density s and the mean generalization error ε both decrease gradually with α and remain positive even when α exceeds unity. On the other hand, if the training patterns are required to satisfy Eq. (6), we find that the mean entropy density s decreases linearly from s = 1 to s = 0 as α increases from 0 to 1, and at α = 1 the generalization error ε becomes exactly zero. In other words, perfect inference of the N-dimensional Ising vector T is achieved by the active strategy (6) with only N one-bit queries (which return only ±1 values). Compared with the deductive reasoning approach, log2N training patterns are saved by following the designing principle (6). No algorithm can do better than this Bayesian strategy, because at least N one-bit measurements are needed to exactly fix an N-dimensional Ising vector.
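For tiny systems the bisection strategy can be sketched by brute force. The greedy pattern choice below (picking the pattern whose labels split the current version space most evenly) is our illustrative reading of principle (6), not the paper's code:

```python
from itertools import product

def _sgn(x):
    return 1 if x > 0 else -1

def bisection_learning(teacher):
    """Active learning by exhaustive version-space bisection (tiny odd N only).

    Each designed query is the pattern that splits the current version space
    as evenly as possible; the half inconsistent with the teacher's label is
    discarded, until only the teacher's vector T survives."""
    N = len(teacher)
    patterns = list(product((-1, 1), repeat=N))
    version = list(patterns)                  # all 2^N candidate vectors J
    n_queries = 0
    while len(version) > 1:
        # most even split = minimal |sum over J of sign(xi . J)|
        xi = min(patterns,
                 key=lambda p: abs(sum(_sgn(sum(j * x for j, x in zip(J, p)))
                                       for J in version)))
        label = _sgn(sum(t * x for t, x in zip(teacher, xi)))
        version = [J for J in version
                   if _sgn(sum(j * x for j, x in zip(J, xi))) == label]
        n_queries += 1
    return list(version[0]), n_queries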

Fig. 2 The active learning strategy (6) outperforms passive learning on small Ising perceptrons: (a) entropy density s (in units of bit), (b) generalization error ε. Training patterns are added one after another, α = P/N is the instantaneous density of patterns (N is the number of input neurons and P is the number of training patterns). Each data point is obtained by averaging over 1,000 independent runs of the passive (Random) or the active (Design) learning algorithm.

These simulation results indicate that, in principle, perfect learning using only N training patterns is possible. But directly employing Eq. (6) to construct new training patterns is practically feasible only for small systems. When the dimension N becomes large the version space ΣP (for P small) will be too huge to enumerate. We must convert Eq. (6) into a form suitable for implementation in large systems. This issue is addressed in the remaining part of this section.

With respect to all the accumulated P training patterns at the end of the P-th learning stage, the volume of the version space ΣP (the partition function) is expressed as
|ΣP| = ∑J ∏μ Θ(σμ ∑i ξiμ Ji) ,
where the outer sum runs over all the 2N vectors J ∈ {−1, +1}N, σμ is the true label of pattern ξμ, and Θ(x) is the Heaviside step function such that Θ(x) = 1 for x > 0 and Θ(x) = 0 for x ≤ 0. The probability of weight vectors J in ΣP is
P(J) = (1/|ΣP|) ∏μ Θ(σμ ∑i ξiμ Ji) .
Notice that this probability distribution depends on the details of the P training patterns. If J ∈ ΣP we have P(J) = 1/|ΣP|, otherwise P(J) = 0. Besides this joint distribution of all the N entries Ji of J, we are also interested in single-weight marginals. The mean value of weight Ji among all the vectors of ΣP is
〈Ji〉P = ∑J Ji P(J) = (1/|ΣP|) ∑J∈ΣP Ji .
Similarly, the mean value of JiJj (pair correlation) is
〈JiJj〉P = (1/|ΣP|) ∑J∈ΣP Ji Jj .
When i = j we have 〈JiJi〉P = 1 due to the Ising nature of the weights. The training patterns induce correlations among the different weight variables. A consequence of these complicated correlations is that 〈JiJj〉P ≠ 〈Ji〉P〈Jj〉P for i ≠ j.

Now consider adding a new training pattern ξ to the perceptron. The distribution of the overlap q between ξ and the weight vectors J of ΣP is defined as
where δ(x) is the Kronecker delta symbol such that δ(x) = 1 for x = 0 and δ(x) = 0 for x ≠ 0. From this definition we see that the mean overlap is
q̄(ξ) = ∑i ξi 〈Ji〉P .
The variance of the overlap q is defined as Δ(ξ) ≡ 〈q²〉 − 〈q〉², where the averages are taken over the distribution (11). By a simple derivation we find that
Δ(ξ) = ∑i (1 − 〈Ji〉P²) + ∑i≠j ξi ξj (〈JiJj〉P − 〈Ji〉P 〈Jj〉P) .
The overlap q is the sum of N random terms ξiJi (the randomness coming from Ji). As the lowest-order approximation we assume that the central limit theorem is valid for q when N is large, even though the weights Ji are not independent. In other words, we approximate the probability distribution (11) by a Gaussian distribution:
P(q|ξ) ≈ (1/√(2πΔ(ξ))) exp( −(q − q̄(ξ))²/(2Δ(ξ)) ) ,
where q̄(ξ) and Δ(ξ) are the mean and the variance of the overlap.
(The possible breakdown of this Gaussian assumption will be discussed in Sec. 6.)

According to the designing principle (6), the (P + 1)-th training pattern should refute half of the weight vectors in ΣP. This means that the overlap between ξP+1 and the weight vectors of ΣP should be positive for half of the elements J ∈ ΣP and negative for the remaining half. According to the Gaussian approximation (14), the new pattern ξP+1 should then have zero mean overlap with the weight vectors of ΣP, that is
∑i ξiP+1 〈Ji〉P = 0 .
In comparison with Eq. (6), the advantage of Eq. (15) is that the student does not need to memorize all the candidate truth vectors J but only needs to evaluate the N mean synaptic weights 〈Ji〉P. Equation (15) may be regarded as a linearized version of Eq. (6), with the highly nonlinear sign function replaced by a linear function. In the next section we discuss how to efficiently update the mean weights during the online learning process. We notice that the constraint (15) is very similar in form to the designing constraints discussed in some of the early papers.[20,25] A significant difference is that our proposed constraint (15) involves the mean synaptic weights 〈Ji〉P, instead of the synaptic weights Ji of a single weight vector stored in memory.

We employ simulated annealing[42] to sample a maximally random pattern ξP+1 under constraint (15). An energy penalty is defined for each Ising pattern ξ as
and the corresponding probability distribution of ξ is the Boltzmann form P(ξ) ∝ e−βE(ξ),
where the parameter β is the inverse temperature. The pattern ξ evolves by single spin flips at slowly increasing β values. At each elementary update step, (i) a randomly chosen entry ξi of ξ is flipped to the opposite value; (ii) if the energy difference ΔE ≡ E(ξ′) − E(ξ) between the modified pattern ξ′ and the old pattern ξ is non-positive, this flip ξi → −ξi is surely accepted, otherwise it is accepted only with probability e−βΔE. After N such elementary flip trials the value of β is elevated by a constant factor rβ (β ← rβ β). Finally, the spin configuration after a total number tN of spin flip trials (β has been recursively elevated t times) is picked as the new training pattern ξP+1. In this work we set rβ = 1.1 and t = 100, and set the initial value of β to be 0.01. We have checked that the numerical results of the next two sections are insensitive to the particular values of these parameters.
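The annealing procedure can be sketched as follows. We assume here that the energy penalty of Eq. (16) is the squared violation of constraint (15), which is our reading of the text; the function name and signature are ours:

```python
import math
import random

def anneal_pattern(m, r_beta=1.1, t=100, beta0=0.01, rng=None):
    """Simulated-annealing sampler for a near-random Ising pattern obeying the
    designing constraint (15): sum_i xi_i * m_i ~ 0, with m_i = <J_i>_P.

    Assumed energy penalty: E(xi) = (sum_i xi_i * m_i)**2."""
    rng = rng or random.Random()
    N = len(m)
    xi = [rng.choice((-1, 1)) for _ in range(N)]
    overlap = sum(x * mi for x, mi in zip(xi, m))
    beta = beta0
    for _ in range(t):                       # t annealing stages
        for _ in range(N):                   # N single-spin-flip trials per stage
            i = rng.randrange(N)
            new_overlap = overlap - 2.0 * xi[i] * m[i]
            dE = new_overlap ** 2 - overlap ** 2
            # Metropolis rule: downhill moves always accepted,
            # uphill moves accepted with probability exp(-beta * dE)
            if dE <= 0 or rng.random() < math.exp(-beta * dE):
                xi[i] = -xi[i]
                overlap = new_overlap
        beta *= r_beta                       # slowly raise the inverse temperature
    return xi
```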

4 Experience Accumulation

To exploit the designing principle (15) we must first compute 〈Ji〉P for all the weight indices i. At P = 0 we know that 〈Ji〉0 = 0 for all the synaptic weights Ji. But the task for P ≥ 1 is quite non-trivial and cannot be carried out exactly. In the online learning paradigm we compute 〈Ji〉P approximately by iteration.

After the training pattern ξP+1 has been added to the system, the new distribution of the version space is related to the old through
where σP+1 is the label of ξP+1. The mean value of Ji under the new distribution is
Here we need the probability of Ji = +1 among the vectors of the version space ΣP, together with the complementary probability of Ji = −1, as well as the conditional probabilities of ξP+1 being correctly classified by a weight vector J ∈ ΣP given that Ji = +1 and given that Ji = −1, respectively. These conditional probabilities are defined as averages over the version sub-spaces of ΣP which contain all the vectors J ∈ ΣP with Ji = +1 (respectively, Ji = −1). Similar to the derivation of Eq. (14), we can approximate the two conditional probabilities by two slightly different Gaussian integrals (details in Appendix A). Then we get the following convenient iterative equation for the mean weights
Here RP is a magnitude factor determined by
with the function f(x) being
The value of f(x) equals unity at x = 0, and it then rapidly decays to zero as x increases. When x is negative, f(x) rapidly approaches the asymptotic form −2x.

The iterative expression (21) agrees with the belief-propagation equation reported in Ref. [6] (see also Refs. [5,43]). Notice that if ξiP+1 has the same (respectively, opposite) sign as σP+1, the value of 〈Ji〉P+1 increases (respectively, decreases) relative to 〈Ji〉P. Equation (21) therefore implements a specific Hebbian rule of experience accumulation. Notice also that 〈Ji〉P = ±1 are two fixed points of Eq. (21), so if 〈Ji〉P is wrongly estimated to be (say) −1 while the truth value is Ti = +1, there is no chance to correct this mistake by further learning. To ensure that the iteration process is able to escape from a wrong fixed point, we therefore slightly modify Eq. (21) as follows:
where the weighting function W(〈Ji〉P) by default is equal to , but W(〈Ji〉P) = W0 if the sign of is opposite to that of 〈Ji〉P and at the same time . The precise value of the cutoff parameter has a weak effect on the learning performance. After some preliminary experiments we finally set W0 = 8 × 10−3.

To determine the numerical value of the magnitude factor RP, we need to compute the numerical value of the overlap variance Δ(ξP+1). The first summation of Eq. (13) is of order N and is easy to compute. On the other hand, it is quite complicated to compute the second summation of Eq. (13). In this work we make the simplest approximation of independence among the different weight variables, that is, 〈JiJj〉P ≈ 〈Ji〉P〈Jj〉P. Under this additional approximation then
Δ(ξP+1) ≈ ∑i (1 − 〈Ji〉P²) = N − ∑i 〈Ji〉P² ,
and it is completely independent of ξP+1. This independence approximation is commonly adopted in the literature (see, e.g., Refs. [6,44]); it amounts to approximating the probability distribution of J by a factorized form. To improve the numerical accuracy of computing the mean weight values, the next step is to include at least partially the correlations between the weight elements (see, e.g., discussions in Refs. [6,44] concerning the continuous perceptron). We leave this demanding issue for future work. (More discussion of weight correlations is given in Sec. 6.)

We now test the performance of the simple designing principle (15). The following straightforward inference rule is adopted in the computer simulations: at the end of the P-th learning stage, the inferred truth vector is simply the sign vector of the mean weights 〈Ji〉P, with the i-th inferred weight being sign(〈Ji〉P).
The relative inference error is defined as the relative Hamming distance between the inferred weight vector and the teacher’s weight vector T, that is, the fraction of indices i at which sign(〈Ji〉P) ≠ Ti.
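The inference rule and the error measure can be sketched as follows; the tie-breaking convention at exactly zero mean weight is ours (with the iterative update the mean weights are generically nonzero):

```python
def inferred_vector(mean_weights):
    """Point estimate (Eq. (26)): the sign vector of the mean weights <J_i>_P.

    A tie at exactly zero is broken toward +1 here; this convention is ours."""
    return [1 if m >= 0 else -1 for m in mean_weights]

def inference_error(mean_weights, teacher):
    """Relative inference error (Eq. (27)): the fraction of weight indices at
    which the inferred sign disagrees with the teacher's weight."""
    guess = inferred_vector(mean_weights)
    return sum(g != t for g, t in zip(guess, teacher)) / len(teacher)
```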

Before testing the active learning mode, we first consider the random passive learning mode, in which every newly introduced training pattern is drawn independently and uniformly at random from the set of 2N Ising patterns. The mean value of the relative inference error of this passive mode, averaged over simulated independent online learning trajectories, is shown in Fig. 3(a) as a function of pattern density α. We find that when α < 5, the curves of mean inference error for different system sizes N are well superimposed onto each other, while for α ≥ 5 the mean inference error saturates to a small positive level whose height slightly increases with system size N (see inset of Fig. 3(a)).

Fig. 3 The performance of passive online learning. The P training patterns are fed to the student sequentially and they are independent random N-dimensional Ising vectors. The pattern density is α = P/N. The total number of simulated independent online learning trajectories is . (a) The mean inference error, i.e., the mean fraction of incorrectly inferred teacher weights. The inset shows the tail part of the numerical data in semi-logarithmic scale. (b) The success fraction, i.e., the fraction of simulation trajectories in which the inferred weight vector is identical to the teacher’s weight vector.

As another measure of performance we consider the success fraction, which is defined as the probability that the inferred weight vector is identical to the truth vector T over repeated independent runs of the whole online learning process. For example, if the inferred vector coincides with T in 4000 out of 10,000 independent runs, then the success fraction is 0.4 at this particular pattern density α = P/N. Notice that the success fraction is not necessarily a monotonic function of α, because it is possible for the inferred vector to be correct at stage P but incorrect at stage P + 1. The success fraction of passive online learning is shown in Fig. 3(b) as a function of pattern density α. We find that this fraction is zero for small α values; it then increases quickly when α exceeds 4 and finally fluctuates around a plateau level. Since the height of this plateau decreases considerably with system size N, error-free inference will be impossible in the thermodynamic limit of N → ∞.

When constraint (15) is imposed in designing each new training pattern, we find that the learning performance is greatly enhanced. As shown in Fig. 4(a), the mean inference error drops below 10−3 as the pattern density α is increased up to 2.2, and it then quickly drops to zero as α increases slightly further. The dramatic effect of active learning is most clearly demonstrated by the difference between Fig. 4(b) and Fig. 3(b). We see from Fig. 4(b) that the curve of success fraction becomes sharper and sharper as the system size N increases, and all these different curves intersect at approximately the same value of α. Similar size-dependent behaviors are commonly observed in finite-size scaling studies of continuous phase transitions.[45] We conjecture that a well-defined dynamical phase transition to perfect inference will occur at α ≈ 2.23 in the thermodynamic limit of N → ∞. A more thorough theoretical investigation of the large-N limit of this learning dynamics will be carried out in a follow-up paper.

Fig. 4 Same as Fig. 3, but for active online learning under the designing principle (15). Error-free inference is achieved at pattern density α ≈ 2.23.
5 Additional Orthogonality Considerations

When a new training pattern ξP+1, constrained by Eq. (15) but otherwise maximally random, is sampled by simulated annealing (Eq. (17)), it is conditionally independent of all the earlier training patterns given the values of 〈Jj〉P. But since the mean weights 〈Jj〉P are determined through the accumulative mechanism (24), Eq. (15) in fact induces complicated correlations between ξP+1 and all its predecessors. If ξP+1 happens to be rather similar to some of the old training patterns, its power in promoting active inference will be compromised.[23,46] According to the geometric picture underlying the exact designing principle (6), it should be beneficial to explicitly require (at least approximate) orthogonality between ξP+1 and the training patterns introduced during the last M steps.

To implement these additional orthogonality constraints, we modify the energy function of the simulated annealing process as follows:
The second energy term is equal to the average squared overlap between ξ and an old pattern ξμ. The parameter λ controls the relative importance of the additional orthogonality constraints. In the present work we set λ = 1, and considering that there are N mutually orthogonal vectors in an N-dimensional space, we set M = N − 1. (We have not yet tried to optimize the values of λ and M.) Each new training pattern is then sampled by simulated annealing starting from an initial completely random pattern. The only difference is that the energy function in Eq. (17) now takes the form of Eq. (28).
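A sketch of the modified energy, under our assumed normalization for Eq. (28) (the paper's exact prefactors are not reproduced in this text), could read:

```python
def design_energy(xi, m, old_patterns, lam=1.0):
    """Assumed form of the modified energy (28): the squared violation of the
    designing constraint (15) plus `lam` times the average squared overlap
    (normalized by N) with the last M stored training patterns.

    `m` holds the mean weights <J_i>_P; the normalization here is our guess."""
    N = len(xi)
    e = sum(x * mi for x, mi in zip(xi, m)) ** 2
    if old_patterns:
        e += lam * sum(sum(x * y for x, y in zip(xi, p)) ** 2
                       for p in old_patterns) / (len(old_patterns) * N)
    return e
```

In the simulated annealing of Sec. 4, this energy would simply replace the single-term penalty when proposing single spin flips.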

The performance of the modified online learning algorithm is shown in Fig. 5. We find that error-free inference of the teacher’s weight vector can be achieved with high probability after encountering P ≈ 1.9N training patterns. Compared with the results of Fig. 4, the additional orthogonality considerations indeed lead to a remarkable boost to the learning efficiency.

Fig. 5 Same as Fig. 3 and Fig. 4, but for active online learning under the designing constraint (15) and the additional orthogonality constraints, see Eq. (28). The number of stored training patterns is set to be M = N−1.

It may be possible to further improve the learning performance by optimizing the parameters λ and M of Eq. (28). Furthermore, the energy function (28) may not necessarily be the best way to incorporate both the designing constraint (15) and the additional orthogonality constraints. For example, it may be even better to consider all the P old patterns (instead of only the last M ones) with non-uniform weighting factors.

6 Discussion

In this work we considered the Bayesian active learning principle (6) to infer the teacher’s weight vector of an N-dimensional Ising perceptron. Each new Ising training pattern is not drawn at random as in passive learning but is designed with the aim of splitting the current version space into two equal sub-spaces. This designing principle was implemented exactly for small systems to achieve error-free inference using only N training samples (Fig. 2). When exhaustive enumeration becomes infeasible for large systems, we derived a simple constraint (15) based on this principle and demonstrated that error-free inference is achievable with P ≈ 2.3N training samples (Fig. 4). The number of training samples was further reduced to P ≈ 1.9N after imposing additional orthogonality constraints on the training patterns (Fig. 5). On the other hand, the deductive reasoning algorithm discussed in this paper is guaranteed to achieve error-free inference with at most N + log2N queries and is therefore far superior to the Bayesian strategies discussed in Secs. 4 and 5.

In deriving the constraint Eq. (15) from the Bayesian principle (6), we approximated the overlap probability profile (Eq. (11)) by a Gaussian distribution. This Gaussian assumption may be valid only for sufficiently small values of the pattern density α. With the addition of training patterns, the volume of the version space becomes smaller and smaller. At the same time the shape of the version space may become more and more irregular. One highly plausible scenario is that, when α exceeds a certain threshold value, the version space breaks up into many well-separated sub-spaces, each having a different set of mean weight values {〈Ji〉P}. As a consequence, the overlap probability profile should be described as a weighted sum of many distinct Gaussian distributions (one for each version sub-space), and Eq. (15) should then be modified accordingly.

From the academic point of view, active learning in the presence of ergodicity breaking is a very interesting challenge. With an accurate approximation to the overlap probability profile, the efficiency of the active learning process may closely approach the theoretical limiting value of α = 1. We hope that significant theoretical and algorithmic progress will be made in the near future on this important research issue.

References
[1] Rosenblatt F. Psychological Review 65 1958 386
[2] Watkin T. L. H. Rau A. Biehl M. Rev. Mod. Phys. 65 1993 499
[3] Engel A. Van den Broeck C. Statistical Mechanics of Learning Cambridge University Press Cambridge, UK 2001
[4] Krauth W. Mezard M. J. Phys. France 50 1989 3057
[5] Kabashima Y. Uda S. Lect. Notes Artif. Intellig. 3244 2004 479
[6] Braunstein A. Zecchina R. Phys. Rev. Lett. 96 2006 030201
[7] Huang H. Kabashima Y. Phys. Rev. E 90 2014 052813
[8] Huang H. Wong K. Y. M. Kabashima Y. J. Phys. A: Math. Theor. 46 2013 375002
[9] Baldassi C. Ingrosso A. Lucibello C. Saglietti L. Zecchina R. Phys. Rev. Lett. 115 2015 128101
[10] Obuchi T. Kabashima Y. J. Stat. Mech.: Theor. Exp. 2009 2009 12014
[11] Gyorgyi G. Phys. Rev. A 41 1990 7097
[12] Sompolinsky H. Tishby N. Seung H. S. Phys. Rev. Lett. 65 1990 1683
[13] Littlestone N. Warmuth M. K. The Weighted Majority Algorithm Proceedings of the 30th Annual Symposium on Foundations of Computer Science IEEE New York 1989 256
[14] Opper M. Haussler D. Phys. Rev. Lett. 66 1991 2677
[15] Seung H. S. Opper M. Sompolinsky H. Query by Committee Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory ACM New York 1992 287
[16] Opper M. Phys. Rev. Lett. 77 1996 4671
[17] Feng J. J. Phys. A: Math. Gen. 31 1998 4037
[18] Rosen-Zvi M. J. Phys. A: Math. Gen. 33 2000 7277
[19] Baldassi C. J. Stat. Phys. 136 2009 902
[20] Kinzel W. Rujan P. Europhys. Lett. 13 1990 473
[21] Baum E. B. IEEE Trans. Neural Networks 2 1991 5
[22] Hwang J. N. Choi J. J. Oh S. Marks R. J. II IEEE Trans. Neural Networks 2 1991 131
[23] Kinouchi O. Caticha N. J. Phys. A: Math. Gen. 25 1992 6243
[24] Seung H. S. Sompolinsky H. Tishby N. Phys. Rev. A 45 1992 6056
[25] Watkin T. L. H. Rau A. J. Phys. A: Math. Gen. 25 1992 113
[26] Kabashima Y. Shinomoto S. Acceleration of Learning in Binary Choice Problems Proceedings of the Sixth Annual Conference on Computational Learning Theory ACM New York 1993 446
[27] Sollich P. Saad D. Learning from Queries for Maximum Information Gain in Imperfectly Learnable Problems Tesauro G. Touretzky D. S. Leen T. K. Advances in Neural Information Processing Systems 7 1995 287 MIT Press Cambridge, MA
[28] Chen Y. Li X. Liu J. Ying Z. Applied Psychological Measurement 42 2018 24
[29] Box G. E. P. J. Amer. Stat. Assoc. 71 1976 791
[30] Robbins H. Bullet. Amer. Math. Soc. 58 1952 527
[31] LeCun Y. Bengio Y. Hinton G. E. Nature (London) 521 2015 436
[32] Goodfellow I. Bengio Y. Courville A. Deep Learning MIT Press Cambridge, MA 2016
[33] Huang A. Sheldan B. Sivak D. A. Thomson M. arXiv:cond-mat.stat-mech/1805.07512 2018
[34] Rupprecht N. Vural D. C. arXiv:cond-mat.stat-mech/1810.06620 2018
[35] Ueltzhoffer K. arXiv:q-bio.NC/1709.02341 2017
[36] Friston K. J. Lin M. Frith C. D. et al. Neural Computation 29 2017 2633
[37] Boufounos P. Baraniuk R. 1-Bit Compressive Sensing Proc. 42nd Annual Conference on Information Sciences and Systems IEEE 2008 16
[38] Xu Y. Kabashima Y. J. Stat. Mech.: Theor. Exp. 2013 2013 02041
[39] Barzdin J. M. Freivald R. V. Soviet Mathematics Doklady 13 1972 1224
[40] Angluin D. Machine Learning 2 1988 319
[41] Littlestone N. Machine Learning 2 1988 285
[42] Kirkpatrick S. Gelatt C. D. Jr. Vecchi M. P. Science 220 1983 671
[43] Mezard M. J. Phys. A: Math. Gen. 22 1989 2181
[44] Solla S. A. Winther O. Optimal Perceptron Learning: An Online Bayesian Approach Saad D. OnLine Learning in Neural Networks Cambridge University Press Cambridge, UK 1998 379
[45] Privman V. Finite Size Scaling and Numerical Simulation of Statistical Systems World Scientific Singapore 1990
[46] Shinzato T. Kabashima Y. J. Phys. A: Math. Theor. 42 2009 015005