Prove that every head-tail sequence is 50% equal to every other head-tail sequence of the same length (on average)

Posted on October 10, 2021

Definitions and assumptions

A head-tail sequence of length n is a sequence of outcomes of tossing a normal coin n times
Two ht-sequences of the length n are x percentage the same when \( \frac{\sum{no-of-indices-where-values-match}} {n} \times 100 \) = x %

The way I want to go about showing this is

code that led to this chase
Simulation via clojure code to show observations that further confirm the 50% match stands true for various sequence lengths from a combinatoric's view
Theoretical proof

Initial chase


(defn gen-ht-sequence []
  (repeatedly 10000 #(rand-nth [:h :t])))

(get (frequencies (map = (gen-ht-sequence) (gen-ht-sequence))) true) ;; always returns about 5000

The above code returned about half regardless of how large the length of the sequence was increased to be, so I thought it might have something to do with Probability theory's linked other: Combinatorics

Simulation

For this snippet you might want to put this library https://github.com/clojure/math.combinatorics in your project.clj


(defn match-percentage
  "Takes two sequences and outputs the match percentage"
  [x y]
   (* 100
    (/ (get  (frequencies (map = x y)) true 0)
      (float (count x)))))

(defn avg [l]
  (/ (apply + l) (float (count l))))

;; main code, output is a list of percentages
;; take in a sequence length
;; generate all head-tail sequences of input length
;; compare each sequence to every other sequence for match percentage and aggregates this via an average
;; this average is calculated for every sequence
(defn experiment3 [n]
  (let [s (combo/selections [:h :t] n)]
    (pmap (fn [x]
            (->> s
                 (map #(match-percentage x %))
                 avg)) s)))

Theoretical Proof

Lets's start by making the whole x % equivalent to a more simplified and formal definition:

Given X and Y are two sequences of equal length that have k indices where they match in value, let the function f be one that tells us the number of indices where they match in value i.e f(X,Y) = k

Picking a sequence from a list of sequences of length n (we call this list XS) of the same length, let's say this sequence is X

Now we want the set size of this expression \( \forall Y \in XS, f(X,Y)=k\)

this boils down a problem of combinatorics where you have a list of indices and you want to know how many combinations of length k can you obtain

The answer: \(C^n_k\)

Now because X and Y are of length n, the range of f is [0..n]

Hopefully, it comes as no surprise that the \(count(Y), \forall Y \in XS \) such that \(f(X,Y) = k, \forall k \in [0..n] \) is equal to \(2^n\)

If the reason isn't clear: let's explicitly look at the sum

\( C^{n}_{0} + C^{n}_{1} + C^{n}_{2} + C^{n}_{3} + ...\)

These are the coefficients of the general expansion of \((a+b)^n\) and to get the sum here we take a = 1 and b =1 and sence the sum is \((1 + 1)^n\) = \(2^n\) -- A

Now we want to weigh all these coefficients to calculate the average as the end goal, we know these coefficients signify the various ways X could be equivalent to a arbitary sequence Y in k ways, so we multiply each coefficient with the associated k and divide by n

\( C^{n}_{0}*(\frac{0}{n}) + C^{n}_{1}*(\frac{1}{n}) + C^{n}_{2}*(\frac{2}{n}) + C^{n}_{3}*(\frac{3}{n}) + C^{n}_{4}*(\frac{4}{n}) + ...\) -- B

\(\frac{B}{A}*100\) will give me the average

i.e \(\frac{ C^{n}_{0}*(\frac{0}{n}) + C^{n}_{1}*(\frac{1}{n}) + C^{n}_{2}*(\frac{2}{n}) + C^{n}_{3}*(\frac{3}{n}) + C^{n}_{4}*(\frac{4}{n}) + ...} {C^{n}_{0} + C^{n}_{1} + C^{n}_{2} + C^{n}_{3} + C^{n}_{4}...} * 100 \)

\( \frac{\sum\limits_{k=0}^n{C^{n}_k*(\frac{k}{n})}}{\sum\limits_{k=0}^n{C^{n}_k}}*100\)

I want to take a slight detour to make the numerator of the above expression easier to deal with

Lemma: \( C^{n}_k*(\frac{k}{n})\) = \( C^{n-1}_{k-1}\)

\( = C^{n}_k*(\frac{k}{n}) = \frac{n!}{k! (n - k)!}*\frac{k}{n} = \frac{n(n-1)!}{k(k-1)!(n-1-(k-1))!}* \frac{k}{n} = \frac{(n-1)!}{(k-1)!(n-1-(k-1))!} = C^{n-1}_{k-1} \)

Now that we're done with the lemma, I do want to make a slight adjustment to the equation before the lemma, so that we aren't left with doubts about the limits on the summation sign

\( = \frac{0 + \sum\limits_{k=1}^n{C^{n}_k*(\frac{k}{n})}}{\sum\limits_{k=0}^n{C^{n}_k}}*100\)

\( = \frac{0 + \sum\limits_{k=1}^n{C^{n-1}_{k-1}}}{\sum\limits_{k=0}^n{C^{n}_k}}*100 \)

\(= \frac{0 + \sum\limits_{k=0}^{n-1}{C^{n-1}_{k}}}{\sum\limits_{k=0}^n{C^{n}_k}}*100 = \frac{2^{n-1}}{2^n}*100 \)