Prove that every head-tail sequence is 50% equal to every other head-tail sequence of the same length (on average)
Definitions and assumptions
- A head-tail sequence of length n is a sequence of outcomes of tossing a normal coin n times
- Two ht-sequences of the length n are x percentage the same when \( \frac{\sum{no-of-indices-where-values-match}} {n} \times 100 \) = x %
The way I want to go about showing this is
- code that led to this chase
- Simulation via clojure code to show observations that further confirm the 50% match stands true for various sequence lengths from a combinatoric's view
- Theoretical proof
Initial chase
(defn gen-ht-sequence []
(repeatedly 10000 #(rand-nth [:h :t])))
(get (frequencies (map = (gen-ht-sequence) (gen-ht-sequence))) true) ;; always returns about 5000
The above code returned about half regardless of how large the length of the sequence was increased to be, so I thought it might have something to do with Probability theory's linked other: Combinatorics
Simulation
For this snippet you might want to put this library https://github.com/clojure/math.combinatorics in your project.clj
(defn match-percentage
"Takes two sequences and outputs the match percentage"
[x y]
(* 100
(/ (get (frequencies (map = x y)) true 0)
(float (count x)))))
(defn avg [l]
(/ (apply + l) (float (count l))))
;; main code, output is a list of percentages
;; take in a sequence length
;; generate all head-tail sequences of input length
;; compare each sequence to every other sequence for match percentage and aggregates this via an average
;; this average is calculated for every sequence
(defn experiment3 [n]
(let [s (combo/selections [:h :t] n)]
(pmap (fn [x]
(->> s
(map #(match-percentage x %))
avg)) s)))
Theoretical Proof
Lets's start by making the whole x % equivalent to a more simplified and formal definition:
Given X and Y are two sequences of equal length that have k indices where they match in value, let the function f be one that tells us the number of indices where they match in value i.e f(X,Y) = k
Picking a sequence from a list of sequences of length n (we call this list XS) of the same length, let's say this sequence is X
Now we want the set size of this expression \( \forall Y \in XS, f(X,Y)=k\)
this boils down a problem of combinatorics where you have a list of indices and you want to know how many combinations of length k can you obtain
The answer: \(C^n_k\)
Now because X and Y are of length n, the range of f is [0..n]
Hopefully, it comes as no surprise that the \(count(Y), \forall Y \in XS \) such that \(f(X,Y) = k, \forall k \in [0..n] \) is equal to \(2^n\)
If the reason isn't clear: let's explicitly look at the sum
\( C^{n}_{0} + C^{n}_{1} + C^{n}_{2} + C^{n}_{3} + ...\)
These are the coefficients of the general expansion of \((a+b)^n\) and to get the sum here we take a = 1 and b =1 and sence the sum is \((1 + 1)^n\) = \(2^n\) -- A
Now we want to weigh all these coefficients to calculate the average as the end goal, we know these coefficients signify the various ways X could be equivalent to a arbitary sequence Y in k ways, so we multiply each coefficient with the associated k and divide by n
\( C^{n}_{0}*(\frac{0}{n}) + C^{n}_{1}*(\frac{1}{n}) + C^{n}_{2}*(\frac{2}{n}) + C^{n}_{3}*(\frac{3}{n}) + C^{n}_{4}*(\frac{4}{n}) + ...\) -- B
\(\frac{B}{A}*100\) will give me the average
i.e \(\frac{ C^{n}_{0}*(\frac{0}{n}) + C^{n}_{1}*(\frac{1}{n}) + C^{n}_{2}*(\frac{2}{n}) + C^{n}_{3}*(\frac{3}{n}) + C^{n}_{4}*(\frac{4}{n}) + ...} {C^{n}_{0} + C^{n}_{1} + C^{n}_{2} + C^{n}_{3} + C^{n}_{4}...} * 100 \)
\( \frac{\sum\limits_{k=0}^n{C^{n}_k*(\frac{k}{n})}}{\sum\limits_{k=0}^n{C^{n}_k}}*100\)
I want to take a slight detour to make the numerator of the above expression easier to deal with
Lemma: \( C^{n}_k*(\frac{k}{n})\) = \( C^{n-1}_{k-1}\)
\( = C^{n}_k*(\frac{k}{n}) = \frac{n!}{k! (n - k)!}*\frac{k}{n} = \frac{n(n-1)!}{k(k-1)!(n-1-(k-1))!}* \frac{k}{n} = \frac{(n-1)!}{(k-1)!(n-1-(k-1))!} = C^{n-1}_{k-1} \)
Now that we're done with the lemma, I do want to make a slight adjustment to the equation before the lemma, so that we aren't left with doubts about the limits on the summation sign
\( = \frac{0 + \sum\limits_{k=1}^n{C^{n}_k*(\frac{k}{n})}}{\sum\limits_{k=0}^n{C^{n}_k}}*100\)
\( = \frac{0 + \sum\limits_{k=1}^n{C^{n-1}_{k-1}}}{\sum\limits_{k=0}^n{C^{n}_k}}*100 \)
\(= \frac{0 + \sum\limits_{k=0}^{n-1}{C^{n-1}_{k}}}{\sum\limits_{k=0}^n{C^{n}_k}}*100 = \frac{2^{n-1}}{2^n}*100 \)
= 50%