Counting achievable labelings: canonical shattering coefficients
Verbally defining shattering coefficients seems simple at first glance:
Given a hypothesis class H, the nᵗʰ shattering coefficient, denoted Sₙ(H), represents the largest number of labelings that classifiers in H can achieve on a sample of n feature vectors.
But what is a “labeling”? And what makes one “realizable”? Answering those questions will help us lay the groundwork for a more formal definition.
In the context of binary classification, a labeling of a sample of feature vectors is simply any of the ways we can assign values from the set {-1, 1} to those vectors. As a very simple example, consider two one-dimensional feature vectors (i.e., points on a number line), x₁ = 1 and x₂ = 2.
The possible labelings are any combination of the classification values we can assign to the individual feature vectors, independently of each other. We can represent each labeling as a vector, where the first and second coordinates represent the values assigned to x₁ and x₂, respectively. The set of possible labelings is therefore {(-1, -1), (-1, 1), (1, -1), (1, 1)}. Note that a sample of size 2 yields 2² = 4 possible labelings; we will soon see how this generalizes to samples of arbitrary size.
We say that a labeling is realizable by a hypothesis class H if there is a classifier h ∈ H from which that labeling can result. Continuing with our simple example, suppose we are limited to classifiers of the form x ≥ k, k ∈ ℝ, that is, one-dimensional thresholds such that anything to the right of the threshold is classified positively. This hypothesis class cannot achieve the labeling (1, -1): since x₂ is greater than x₁, any threshold that classifies x₁ positively must do the same for x₂. The set of realizable labelings is therefore {(-1, -1), (-1, 1), (1, 1)}.
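If you'd like to see this concretely, here is a minimal Python sketch (the helper names and the coarse threshold grid are illustrative choices of mine, not anything from the paper) that enumerates the labelings realizable by threshold classifiers on our two-point sample:

```python
from itertools import product

def threshold_classifier(k):
    """One-dimensional threshold: classify x as positive iff x >= k."""
    return lambda x: 1 if x >= k else -1

def realizable_labelings(sample, thresholds):
    """Collect the labelings of `sample` achievable by the given thresholds."""
    labelings = set()
    for k in thresholds:
        h = threshold_classifier(k)
        labelings.add(tuple(h(x) for x in sample))
    return labelings

sample = [1, 2]
# A coarse grid of thresholds is enough to exhibit every distinct behavior here.
thresholds = [0.5, 1.5, 2.5]

print(len(set(product([-1, 1], repeat=len(sample)))))   # 4 possible labelings in total
print(sorted(realizable_labelings(sample, thresholds)))
# [(-1, -1), (-1, 1), (1, 1)] -- (1, -1) is indeed unreachable
```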
Having understood the basic terminology, we can begin to develop some notation to formally express elements of the verbal definition we started with.
We simply represent the labelings as vectors, as we did in our simple example, where each coordinate represents the classification value assigned to the corresponding feature vector. There are 2ⁿ possible labelings in total: there are two possible options for each feature vector, and we can think of a labeling as a collection of n such choices, each made independently of the rest. If a hypothesis class H can achieve all possible labelings of a sample 𝒞ₙ, that is, if the number of realizable labelings of 𝒞ₙ is equal to 2ⁿ, we say that H shatters 𝒞ₙ.
Finally, using the notation above, we converge on a more rigorous definition of Sₙ(H):
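Putting the pieces together, one natural way to write this down (with 𝒞ₙ = (x₁, …, xₙ) denoting a sample of n feature vectors) is:

Sₙ(H) = max over all samples 𝒞ₙ of |{ (h(x₁), …, h(xₙ)) : h ∈ H }|

In words: pick the size-n sample on which H realizes the most labelings, and count those labelings.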
Based on our explanation of shattering, Sₙ(H) = 2ⁿ implies that there exists a sample of size n that is shattered by H.
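To make the definition more tangible, here is a small brute-force sketch along the same lines, again using an illustrative finite grid of points and thresholds as a stand-in for the full one-dimensional threshold class:

```python
from itertools import combinations

def realizable_labelings(sample, thresholds):
    """Labelings of `sample` achievable by one-dimensional thresholds x >= k."""
    return {tuple(1 if x >= k else -1 for x in sample) for k in thresholds}

def shattering_coefficient(n, points, thresholds):
    """Brute-force estimate of S_n(H): maximize the number of realizable
    labelings over all size-n samples drawn from `points`."""
    return max(
        len(realizable_labelings(sample, thresholds))
        for sample in combinations(points, n)
    )

points = [1, 2, 3, 4, 5]                      # candidate feature vectors
thresholds = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]   # finite stand-in for all k in R

for n in (1, 2, 3):
    print(n, shattering_coefficient(n, points, thresholds), 2 ** n)
# 1 2 2  -> some sample of size 1 achieves all 2^1 labelings
# 2 3 4  -> no sample of size 2 achieves all 2^2 labelings
# 3 4 8
```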
Estimating the expressiveness of a hypothesis class: the canonical VC dimension
The Vapnik-Chervonenkis (VC) dimension is a way to measure the expressive power of a hypothesis class. It builds on the idea of shattering that we just defined and plays an important role in helping us determine which hypothesis classes are PAC learnable and which are not.
Let's start by trying to define the canonical VC dimension intuitively:
Given a hypothesis class H, its VC dimension, denoted VCdim(H), is defined as the largest natural number n for which there exists a sample of size n that is shattered by H.
Using Sₙ(H) allows us to express this much more cleanly and succinctly:
VCdim(H) = max{ n ∈ ℕ : Sₙ(H) = 2ⁿ }
However, this definition is not quite precise. Note that the set of numbers n for which the shattering coefficient equals 2ⁿ can be infinite. (Consequently, it is possible that VCdim(H) = ∞.) If that is the case, the set has no well-defined maximum. We address this by taking the supremum instead:
VCdim(H) = sup{ n ∈ ℕ : Sₙ(H) = 2ⁿ }
This rigorous and concise definition is the one we will use in the future.
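Sticking with the same toy setup, a brute-force search (capped at a small n_max, and again over a finite grid, so this is only a sketch rather than a proof) recovers the fact that the largest shattered sample size for one-dimensional thresholds is 1:

```python
from itertools import combinations

def realizable_labelings(sample, thresholds):
    """Labelings of `sample` achievable by one-dimensional thresholds x >= k."""
    return {tuple(1 if x >= k else -1 for x in sample) for k in thresholds}

def vc_dimension_estimate(points, thresholds, n_max=4):
    """Largest n (up to n_max) for which some size-n sample is shattered."""
    vcdim = 0
    for n in range(1, n_max + 1):
        if any(
            len(realizable_labelings(sample, thresholds)) == 2 ** n
            for sample in combinations(points, n)
        ):
            vcdim = n
    return vcdim

points = [1, 2, 3, 4, 5]
thresholds = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]
print(vc_dimension_estimate(points, thresholds))  # 1
```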
Adding preferences to the mix: strategic shattering coefficients
Generalizing the canonical notions we just reviewed to work in a strategic setting is fairly straightforward. Redefining the shattering coefficients in terms of the data point best response we defined in the previous article is practically all we have to do.
Given a hypothesis class H, a preference set R, and a cost function c, the nᵗʰ strategic shattering coefficient of Sᴛʀᴀᴄ⟨H, R, c⟩, denoted σₙ(H, R, c), represents the largest number of labelings that classifiers in H can achieve on a set of n potentially manipulated feature vectors, i.e., n data point best responses.
As a reminder, here's how we defined the data point best response:
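In symbols, using the realized reward and cost terms that appear in the sanity check below, the best response can be written roughly as:

D(x, r; h) = argmax over z of [ 𝕀(h(z) = 1) ⋅ r − c(z, x) ]

Here z ranges over the feature space, 𝕀(h(z) = 1) ⋅ r is the reward the data point actually realizes, and c(z, x) is the cost of manipulating x into z.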
We can modify the notation we used in our discussion of canonical shattering coefficients to formalize this further:
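One way to spell this out, mirroring the canonical definition of Sₙ(H) above, is:

σₙ(H, R, c) = max over samples ((x₁, r₁), …, (xₙ, rₙ)) of |{ (h(D(x₁, r₁; h)), …, h(D(xₙ, rₙ; h))) : h ∈ H }|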
The main difference is that each x in the sample must have a corresponding r. Other than that, substituting the data point best response wherever we had x in the canonical case works without a problem.
As a quick sanity check, let's consider what happens if R = { 0 }. The realized reward term 𝕀(h(z) = 1) ⋅ r will be 0 for all data points. Maximizing payoff thus becomes synonymous with minimizing cost. The best way to minimize the cost incurred by a data point is trivial: never manipulate its feature vector.
D(x, r; h) always ends up being just x, placing us firmly within the territory of canonical classification. It follows that σₙ(H, { 0 }, c) = Sₙ(H) for all H, c. This is consistent with our observation that the unbiased preference class represented by R = { 0 } is equivalent to canonical binary classification.
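Here is that sanity check in code; the reward values, the linear cost, and the discretized feature space are all illustrative choices of mine:

```python
def threshold_classifier(k):
    """One-dimensional threshold: classify z as positive iff z >= k."""
    return lambda z: 1 if z >= k else -1

def best_response(x, r, h, candidates, cost):
    """Data point best response: pick the z that maximizes realized reward
    minus manipulation cost, preferring the original x in case of ties."""
    def payoff(z):
        return (r if h(z) == 1 else 0) - cost(z, x)
    return max(candidates, key=lambda z: (payoff(z), z == x))

h = threshold_classifier(2.0)
cost = lambda z, x: abs(z - x)          # linear manipulation cost
candidates = [0.0, 0.5, 1.0, 1.5, 2.0]  # discretized feature space

x = 1.0
print(best_response(x, r=0.0, h=h, candidates=candidates, cost=cost))  # 1.0: no manipulation
print(best_response(x, r=5.0, h=h, candidates=candidates, cost=cost))  # 2.0: crosses the threshold
```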
Expressiveness with Preferences: Strategic VC Dimension (SVC)
Having defined the nᵗʰ strategic shattering coefficient, we can simply swap out Sₙ(H) in the canonical definition of the VC dimension for σₙ(H, R, c).
SVC(H, R, c) = sup{ n ∈ ℕ : σₙ(H, R, c) = 2ⁿ }
Following the example we considered above, we find that SVC(H, { 0 }, c) = VCdim(H) for any H, c. Indeed, SVC is to VCdim what the strategic shattering coefficient is to its canonical equivalent: both are elegant generalizations of non-strategic concepts.
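Concretely, for our running example of one-dimensional thresholds: VCdim(H) = 1 (a single point can be shattered, but no pair can, since the labeling (1, -1) is never realizable), so SVC(H, { 0 }, c) = 1 as well, regardless of the cost function c.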
From SVC to strategic PAC learnability: the Fundamental Theorem of Strategic Learning
Now we can use SVC to state the Fundamental Theorem of Strategic Learning, which relates the complexity of a strategic classification problem to its (agnostic PAC) learnability.
An instance of strategic classification Sᴛʀᴀᴄ⟨H, R, c⟩ is agnostic PAC learnable if and only if SVC(H, R, c) is finite. The sample complexity of agnostic PAC strategic learning is m(δ, ε) ≤ Cε⁻² ⋅ (SVC(H, R, c) + log(1/δ)), where C is a constant.
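To get a feel for the bound, suppose SVC(H, R, c) = 3, ε = 0.1, and δ = 0.05. Then m(δ, ε) ≤ C ⋅ 0.1⁻² ⋅ (3 + log(1/0.05)) ≈ 600 ⋅ C (taking the logarithm as natural). The exact figure depends on the unspecified constant C, but the point is that it is finite, growing only polynomially in 1/ε and logarithmically in 1/δ.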
We won't go into much detail about how this can be proven. Suffice it to say that it all comes down to a clever reduction to the (well-documented) Fundamental Theorem of Statistical Learning, which is essentially the non-strategic version of the theorem. If you are mathematically inclined and interested in the details of the proof, you can find them in Appendix B of the paper.
This theorem essentially completes our generalization of classical PAC learning to the strategic classification setting. It shows that the way we defined SVC doesn't just make sense in our heads; it actually works as a generalization of VCdim where it matters most. Armed with the fundamental theorem, we are well equipped to analyze strategic classification problems as we would any old binary classification problem. Having the ability to determine whether a strategic problem is theoretically learnable or not is pretty incredible, in my opinion.