4  Estimate Distribution Parameters

Given \(S\), and a biomarker’s measurements, how can we estimate \(\mathcal N(\theta_{\mu}, \theta_{\sigma})\) and \(\mathcal N(\phi_{\mu}, \phi_{\sigma})\)?

import pandas as pd
import numpy as np
import altair as alt 
import math 
from scipy.stats import norm
from sklearn.cluster import AgglomerativeClustering
from typing import Dict
import json 
from sklearn.cluster import KMeans
from collections import defaultdict
from scipy.stats import mode
output_dir = 'data'
df = pd.read_csv(f"{output_dir}/150|200_3.csv")
biomarkers = df.biomarker.unique()
idx = 1
biomarker_df = df[df.biomarker==biomarkers[idx]]
biomarker_df.sample(10)
participant biomarker measurement k_j S_n affected_or_not diseased
307 107 PCC-FCI 2.875619 10 2 affected True
360 160 PCC-FCI 14.765535 0 2 not_affected False
365 165 PCC-FCI 8.913005 0 2 not_affected False
354 154 PCC-FCI 11.380722 0 2 not_affected False
200 0 PCC-FCI 6.567299 0 2 not_affected False
285 85 PCC-FCI 11.371021 0 2 not_affected False
375 175 PCC-FCI 3.771693 9 2 affected True
203 3 PCC-FCI 14.266635 0 2 not_affected False
295 95 PCC-FCI 4.949612 0 2 not_affected False
299 99 PCC-FCI 17.490539 0 2 not_affected False
biomarker_df.shape
(200, 7)

4.1 Hard K-Means

Tip

To use this algorithm, we only need to know (1) whether this participant is diseased; and (2) each biomarker measurement.

The first method we can use is hard K-Means. We clustering a certain biomarker’s measurements into two clusters. A clustering is successful if:

  • There are two, and only two clusters.
  • Each clustes has more than one element (This is to make sure that the standard deviation of this biomarker’s theta or phi is non-zero)

Ideally, we wanted all healthy participants to be grouped into a single cluster, which is why we initially tried using the constrained K-Means algorithm implemented by Babaki (2017). However, the algorithm did not work as intended.

We therefore designed a hard k-means algorithm to satisfy our needs:

  • We try hard K-Means multiple times; If the two above mentioned requirements are not met, then
  • We group the measurements into two random clusters; If the two above mentioned requirements are still not met, then raise an error and stop.
Code
def compute_theta_phi_for_biomarker(biomarker_df, max_attempt = 50, seed = None):
    """get theta and phi parameters for this biomarker
    input: 
        - biomarker_df: a pd.dataframe of a specific biomarker
    output: 
        - a tuple: theta_mean, theta_std, phi_mean, phi_std
    """
    if seed is not None:
        # Set the seed for numpy's random number generator
        rng = np.random.default_rng(seed)
    else:
        rng = np.random

    n_clusters = 2
    measurements = np.array(biomarker_df['measurement']).reshape(-1, 1)
    healthy_df = biomarker_df[biomarker_df['diseased'] == False]

    # clustering = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward')
    # predictions = clustering.fit_predict(measurements)
    
    # # Verify that AgglomerativeClustering produced exactly 2 clusters with more than 1 member each
    # cluster_counts = np.bincount(predictions) # array([3, 2])
    # if len(cluster_counts) != n_clusters or any(c <= 1 for c in cluster_counts):
    #     print("AgglomerativeClustering did not yield the required clusters, switching to KMeans.")
    # # If AgglomerativeClustering fails, attempt KMeans with a max_attempt limit
    
    curr_attempt = 0
    n_init_value = 10
    clustering_setup = KMeans(n_clusters=n_clusters, n_init=n_init_value)
    
    while curr_attempt < max_attempt:
        clustering_result = clustering_setup.fit(measurements)
        predictions = clustering_result.labels_
        cluster_counts = np.bincount(predictions) # array([3, 2])
        
        if len(cluster_counts) == n_clusters and all(c > 1 for c in cluster_counts):
            break 
        curr_attempt += 1
    else:
        print(f"KMeans failed. Try randomizing the predictions")
        predictions = rng.choice([0, 1], size=len(measurements))
        cluster_counts = np.bincount(predictions)
        if len(cluster_counts) != n_clusters or not all(c > 1 for c in cluster_counts):
            raise ValueError(f"KMeans clustering failed to find valid clusters within max_attempt.")
    
    healthy_predictions = predictions[healthy_df.index]
    mode_result = mode(healthy_predictions, keepdims=False).mode
    phi_cluster_idx = mode_result[0] if isinstance(mode_result, np.ndarray) else mode_result
    theta_cluster_idx = 1 - phi_cluster_idx

    # two empty clusters to strore measurements
    clustered_measurements = [[] for _ in range(2)]
    # Store measurements into their cluster
    for i, prediction in enumerate(predictions):
        clustered_measurements[prediction].append(measurements[i][0])
    
     # Calculate means and standard deviations
    theta_mean, theta_std = np.mean(
        clustered_measurements[theta_cluster_idx]), np.std(
            clustered_measurements[theta_cluster_idx])
    phi_mean, phi_std = np.mean(
        clustered_measurements[phi_cluster_idx]), np.std(
            clustered_measurements[phi_cluster_idx])
    
    # Check for invalid values
    if any(np.isnan(v) or v == 0 for v in [theta_std, phi_std, theta_mean, phi_mean]):
        raise ValueError("One of the calculated values is invalid (0 or NaN).")

    return theta_mean, theta_std, phi_mean, phi_std

def get_theta_phi_estimates(
    data: pd.DataFrame,
) -> Dict[str, Dict[str, float]]:
    """
    Obtain theta and phi estimates (mean and standard deviation) for each biomarker.

    Args:
    data (pd.DataFrame): DataFrame containing participant data with columns 'participant', 
        'biomarker', 'measurement', and 'diseased'.
    # biomarkers (List[str]): A list of biomarker names.

    Returns:
    Dict[str, Dict[str, float]]: A dictionary where each key is a biomarker name,
        and each value is another dictionary containing the means and standard deviations 
        for theta and phi of that biomarker, with keys 'theta_mean', 'theta_std', 'phi_mean', 
        and 'phi_std'.
    """
    # empty hashmap of dictionaries to store the estimates
    estimates = {}
    biomarkers = data.biomarker.unique()
    for biomarker in biomarkers:
        # Filter data for the current biomarker
        # reset_index is necessary here because we will use healthy_df.index later
        biomarker_df = data[data['biomarker']
                            == biomarker].reset_index(drop=True)
        theta_mean, theta_std, phi_mean, phi_std = compute_theta_phi_for_biomarker(
            biomarker_df)
        estimates[biomarker] = {
            'theta_mean': theta_mean,
            'theta_std': theta_std,
            'phi_mean': phi_mean,
            'phi_std': phi_std
        }
    return estimates
hard_kmeans_estimates = get_theta_phi_estimates(data = df)
hard_kmeans_estimates_df = pd.DataFrame.from_dict(
    hard_kmeans_estimates, orient='index')
hard_kmeans_estimates_df.reset_index(names = 'biomarker', inplace=True)
hard_kmeans_estimates_df
biomarker theta_mean theta_std phi_mean phi_std
0 HIP-FCI -8.587833 5.365053 4.903437 2.008974
1 PCC-FCI 7.288870 2.797150 14.569768 2.274322
2 AB 173.199732 33.778905 277.009573 34.503492
3 P-Tau -49.329924 15.988080 -15.568346 12.688703
4 MMSE 22.272803 1.624078 28.011392 0.805532
5 ADAS -20.582091 3.853234 -6.002048 1.487443
6 HIP-GMI 0.576991 0.157259 0.192231 0.128850
7 AVLT-Sum 27.808002 8.561387 52.524815 7.989840
8 FUS-GMI 0.519974 0.047474 0.628440 0.038051
9 FUS-FCI -7.061708 1.928477 -12.647345 2.927590
with open('files/real_theta_phi.json', 'r') as f:
    truth = json.load(f)
truth_df = pd.DataFrame.from_dict(truth, orient='index')
truth_df.reset_index(names = 'biomarker', inplace=True)
truth_df
biomarker theta_mean theta_std phi_mean phi_std
0 MMSE 22.0 2.666667 28.0 0.666667
1 ADAS -20.0 4.000000 -6.0 1.333333
2 AB 150.0 16.666667 250.0 50.000000
3 P-Tau -50.0 33.333333 -25.0 16.666667
4 HIP-FCI -5.0 6.666667 5.0 1.666667
5 HIP-GMI 0.3 0.333333 0.4 0.233333
6 AVLT-Sum 20.0 6.666667 40.0 15.000000
7 PCC-FCI 5.0 3.333333 12.0 4.000000
8 FUS-GMI 0.5 0.066667 0.6 0.066667
9 FUS-FCI -20.0 6.000000 -10.0 3.333333

Now let’s compare the results using plots:

Code
def obtain_theta_phi_params(biomarker, estimate_df, truth):
    '''This is to obtain both true and estimated theta and phi params for each biomarker '''
    biomarker_data_est = estimate_df[estimate_df.biomarker == biomarker].reset_index()
    biomarker_data = truth[truth.biomarker == biomarker].reset_index()
    # theta for affected
    theta_mean_est = biomarker_data_est.theta_mean[0]
    theta_std_est = biomarker_data_est.theta_std[0]

    theta_mean = biomarker_data.theta_mean[0]
    theta_std = biomarker_data.theta_std[0]

    # phi for not affected
    phi_mean_est = biomarker_data_est.phi_mean[0]
    phi_std_est = biomarker_data_est.phi_std[0]

    phi_mean = biomarker_data.phi_mean[0]
    phi_std = biomarker_data.phi_std[0]

    return theta_mean, theta_std, theta_mean_est, theta_std_est, phi_mean, phi_std, phi_mean_est, phi_std_est

def make_chart(biomarkers, estimate_df, truth, title):
    alt.renderers.enable('png')
    charts = []
    for biomarker in biomarkers: 
        theta_mean, theta_std, theta_mean_est, theta_std_est, phi_mean, phi_std, phi_mean_est, phi_std_est = obtain_theta_phi_params(
        biomarker, estimate_df, truth)
        mean1, std1 = theta_mean, theta_std
        mean2, std2 = theta_mean_est, theta_std_est

        # Generating points on the x axis
        x_thetas = np.linspace(min(mean1 - 3*std1, mean2 - 3*std2), 
                        max(mean1 + 3*std1, mean2 + 3*std2), 1000)

        # Creating DataFrames for each distribution
        df1 = pd.DataFrame({'x': x_thetas, 'pdf': norm.pdf(x_thetas, mean1, std1), 'Distribution': 'Actual'})
        df2 = pd.DataFrame({'x': x_thetas, 'pdf': norm.pdf(x_thetas, mean2, std2), 'Distribution': 'Estimated'})

        # Combining the DataFrames
        df3 = pd.concat([df1, df2])

        # Altair plot
        chart_theta = alt.Chart(df3).mark_line().encode(
            x='x',
            y='pdf',
            color=alt.Color('Distribution:N', legend=alt.Legend(title="Theta"))
        ).properties(
            title=f'{biomarker}, Theta',
            width=100,
            height=100
            )

        mean1, std1 = phi_mean, phi_std
        mean2, std2 = phi_mean_est, phi_std_est

        # Generating points on the x axis
        x_phis = np.linspace(min(mean1 - 3*std1, mean2 - 3*std2), 
                        max(mean1 + 3*std1, mean2 + 3*std2), 1000)

        # Creating DataFrames for each distribution
        df1 = pd.DataFrame({'x': x_phis, 'pdf': norm.pdf(x_phis, mean1, std1), 'Distribution': 'Actual'})
        df2 = pd.DataFrame({'x': x_phis, 'pdf': norm.pdf(x_phis, mean2, std2), 'Distribution': 'Estimated'})

        # Combining the DataFrames
        df3 = pd.concat([df1, df2])

        # Altair plot
        chart_phi = alt.Chart(df3).mark_line().encode(
            x='x',
            y='pdf',
            color=alt.Color('Distribution:N', legend=alt.Legend(title="Phi"))
        ).properties(
            title=f'{biomarker}, Phi',
            width=100,
            height=100
            )
        
        # Concatenate theta and phi charts horizontally
        hconcat_chart = alt.hconcat(chart_theta, chart_phi).resolve_scale(color="independent")

        # Append the concatenated chart to the list of charts
        charts.append(hconcat_chart)
    # Concatenate all the charts vertically
    final_chart = alt.vconcat(*charts).properties(title = title)

    # Display the final chart
    final_chart.display()
make_chart(
    biomarkers[0:5], 
    hard_kmeans_estimates_df, 
    truth_df, 
    title = "Comparing Theta and Phi Distributions Using hard k-means"
)
Figure 4.1: Comparing Theta and Phi Distributions Using Simple Clusering

It turns out the result is not very desriable.

4.2 Conjugate Priors

The second method we may utilize is conjugate priors. Conjugacy occurs when the posterior distribution is in the same family of distribution as the prior distribution, but with new parameter values.

Why conjugacy is important? Because without it, one has to do the integration, which oftentimes is hard.

Three major conjugate families:

  • Beta-Binomial
  • Gamma-Poisson
  • Normal-Normal

In our example, we assume that the measurement data for each biomarker follows a normal distribution; however, we do not know the exact \(\mu\) and \(\sigma\). Our job is to estimate the two parameters for each biomarker based on the data we have.

According to An Introduction to Bayesian Thinking by Clyde et al. (2022), if the data comes from a normal distribution with unknown \(\mu\) and \(\sigma\), the conjugate prior for \(\mu\) has a normal distribution with mean \(m_0\) and variance \(\frac{\sigma^2}{n_0}\). The conjugate prior for \(\frac{1}{\sigma^2}\) has a Gamma distribution with shape \(\frac{v_0}{2}\) and rate \(\frac{v_0 s_0^{2}}{2}\) where

  • \(m_0\): prior estimate of \(\mu\).
  • \(n_0\): how strongly is the prior belief in \(m_0\) is held.
  • \(s_0^2\): prior estimate of \(\sigma^2\).
  • \(v_0\): prior degress of freedome, influencing the certainty of \(s_0^2\).

That is to say:

\[\mu | \sigma^2 \sim \mathcal{N}(m_0, \sigma^2/n_0)\]

\[1/\sigma^2 \sim Gamma\left(\frac{v_0}{2}, \frac{v_0 s_0^2}{2} \right)\]

Combined, we have:

\[(\mu, 1/\sigma^2) \sim NormalGamma(m_0, n_0, s_0^2, v_0)\]

The posterior also follows a Normal-Gamma distribution:

\[(\mu, 1/\sigma^2) | data \sim NormalGamma(m_n, n_n, s_n^2, v_n)\]

More specifically

\[1/\sigma^2 | data \sim Gamma(v_n/2, s_n^2 v_n/2)\]

\[\mu | data, \sigma^2 \sim \mathcal{N}(m_n, \sigma^2/n_n)\]

Based on the above two equations, we know that the mean of posterior mean is \(m_n\) and the mean of the posterior variance is \((s_n^2 v_n/2)/(v_n/2)\). This is beceause the expected value of \(Gamma(\alpha, \beta)\) is \(\frac{\alpha}{\beta}\).

where

  • \(m_n\): posterior mean, mode, and median for \(\mu\)
  • \(n_n\): posterior sample size
  • \(s_n^2\): posterior variance
  • \(v_n\): posterior degrees of freedome

The updating rules to get the new hyper-parameters:

\[m_n = \frac{n}{n+n_0} \bar{y} + \frac{n_0}{n+n_0}m_0\]

\[n_n = n_0 + n\]

\[v_n = v_0 + n\]

\[s_n^2 = \frac{1}{v_n}\left[s^2(n-1) + s_0^2v_0 + \frac{n_0n}{n_n}(\bar{y}-m_0)^2\right]\]

where

  • \(n\): sample size
  • \(\bar{y}\): sample mean
  • \(s^2\): sample variance
Tip

To apply the algorithm of conjugate priors, we assume we already know \(S\) and \(k_j\), alongside biomarker measurement (\(X_{nj}\)). Based on \(S\) and \(k_j\), we can infer whether a biomarker is affected by the disease or not.

Code
def estimate_params_exact(m0, n0, s0_sq, v0, data):
    '''This is to estimate means and vars based on conjugate priors
    Inputs:
        - data: a vector of measurements 
        - m0: prior estimate of $\mu$.
        - n0: how strongly is the prior belief in $m_0$ is held.
        - s0_sq: prior estimate of $\sigma^2$.
        - v0: prior degress of freedome, influencing the certainty of $s_0^2$.

    Outputs:
        - mu estiate, std estimate
    '''
    # Data summary
    sample_mean = np.mean(data)
    sample_size = len(data)
    sample_var = np.var(data, ddof=1)  # ddof=1 for unbiased estimator

    # Update hyperparameters for the Normal-Inverse Gamma posterior
    updated_m0 = (n0 * m0 + sample_size * sample_mean) / (n0 + sample_size)
    updated_n0 = n0 + sample_size
    updated_v0 = v0 + sample_size
    updated_s0_sq = (1 / updated_v0) * ((sample_size - 1) * sample_var + v0 * s0_sq +
                                        (n0 * sample_size / updated_n0) * (sample_mean - m0)**2)
    updated_alpha = updated_v0/2
    updated_beta = updated_v0*updated_s0_sq/2

    # Posterior estimates
    mu_posterior_mean = updated_m0
    sigma_squared_posterior_mean = updated_beta/updated_alpha

    mu_estimation = mu_posterior_mean
    std_estimation = np.sqrt(sigma_squared_posterior_mean)

    return mu_estimation, std_estimation

def get_theta_phi_conjugate_priors(biomarkers, data_we_have, theta_phi_kmeans):
    '''To get estimated parameters, returns a hashmap
    Input:
    - biomarkers: biomarkers 
    - data_we_have: participants data filled with initial or updated participant_stages
    - theta_phi_kmeans: a hashmap of dicts, which are the prior theta and phi values
        obtained from the initial hard k-means algorithm

    Output: 
    - a hashmap of dictionaries. Key is biomarker name and value is a dictionary.
    Each dictionary contains the theta and phi mean/std values for a specific biomarker. 
    '''
    # empty list of dictionaries to store the estimates
    hashmap_of_means_stds_estimate_dicts = {}

    for biomarker in biomarkers:
        # Initialize dictionary outside the inner loop
        dic = {'biomarker': biomarker}
        for affected in ['affected', 'not_affected']:
            data_full = data_we_have[(data_we_have.biomarker == biomarker) & (
                data_we_have.affected_or_not == affected)]
            if len(data_full) > 1:
                measurements = data_full.measurement
                s0_sq = np.var(measurements, ddof=1)
                m0 = np.mean(measurements)
                mu_estimate, std_estimate = estimate_params_exact(
                    m0=m0, n0=1, s0_sq=s0_sq, v0=1, data=measurements)
                if affected == 'affected':
                    dic['theta_mean'] = mu_estimate
                    dic['theta_std'] = std_estimate
                else:
                    dic['phi_mean'] = mu_estimate
                    dic['phi_std'] = std_estimate
            # If there is only one observation or not observation at all, resort to theta_phi_kmeans
            # YES, IT IS POSSIBLE THAT DATA_FULL HERE IS NULL
            # For example, if a biomarker indicates stage of (num_biomarkers), but all participants' stages
            # are smaller than that stage; so that for all participants, this biomarker is not affected
            else:
                print('not enough data here, so we have to use theta phi estimates from hard k-means')
                # print(theta_phi_kmeans)
                if affected == 'affected':
                    dic['theta_mean'] = theta_phi_kmeans[biomarker]['theta_mean']
                    dic['theta_std'] = theta_phi_kmeans[biomarker]['theta_std']
                else:
                    dic['phi_mean'] = theta_phi_kmeans[biomarker]['phi_mean']
                    dic['phi_std'] = theta_phi_kmeans[biomarker]['phi_std']
        # print(f"biomarker {biomarker} done!")
        hashmap_of_means_stds_estimate_dicts[biomarker] = dic
    return hashmap_of_means_stds_estimate_dicts
conjugate_prior_theta_phi = get_theta_phi_conjugate_priors(
    biomarkers = biomarkers, 
    data_we_have = df, 
    theta_phi_kmeans = hard_kmeans_estimates
)
cp_df = pd.DataFrame.from_dict(conjugate_prior_theta_phi, orient='index')
cp_df.reset_index(drop=True, inplace=True)
cp_df
biomarker theta_mean theta_std phi_mean phi_std
0 HIP-FCI -5.378366 7.233991 5.092800 1.514402
1 PCC-FCI 5.521792 2.777207 12.071769 3.671679
2 AB 151.143708 14.806694 251.973564 51.382188
3 P-Tau -41.768257 34.857945 -24.739527 14.928907
4 MMSE 23.122406 2.446874 28.049683 0.718493
5 ADAS -19.633304 4.582900 -5.902198 1.278311
6 HIP-GMI 0.425625 0.272876 0.379542 0.235348
7 AVLT-Sum 21.664360 3.755735 40.700638 14.480463
8 FUS-GMI 0.482745 0.055585 0.590434 0.063730
9 FUS-FCI -18.566905 5.781937 -9.648705 3.099195
Note

When we estimate \(\theta\) and \(\phi\) using conjugate priors, we need to use the result from hard k-means as a fall back because it is possible that for a specific biomarker, either the affected or the not_affected group is empty. If that is the case, we are not able to estimate relevant parameters and have to resort to the fallback result.

make_chart(
    biomarkers[0:5], 
    cp_df, 
    truth_df, 
    title = "Comparing Theta and Phi Distributions Using Conjugate Priors"
)
Figure 4.2: Comparing Theta and Phi Distributions Using Conjugate Prior

4.3 Soft K-Means

Conjugate Priors assumes we know \(k_j\), which often times is not already known. Our hard k-means algorithm is only taking advantage of \(X_{nj}\) and whether participants are diseased or not, leaving \(S\), which is known to us, unexploited.

Soft K-Means is a good alternative to these two because it utilizes \(S\) while at the same time do not assume we know \(k_j\).

The logic of soft-kmeans is this;

  1. If a participant is diseased, we iterate through all possible disease stages, and calculate the associated likelihood using Equation 2.1. We then normalize these likelihoods to obtain the estimated probability of this participant being at each stage. For example, if there are three possible stages, and the associated likelihoods are [1, 3, 6], then the normalized likelihoods would be [0.1, 0.3, 0.6].
Tip

You may wonder how we can use Equation 2.1 when we do not know \(\theta\) and \(\phi\) yet (which is exactly what we are trying to do!). If you notice this, it is a very keen observation!.

If fact, we are going to use the estimated \(\theta\) and \(\phi\) we obtained above using hard k-means.

  1. For each biomarker \(n\), we obtain \(S_n\) based on \(S\). Then we iterate through all participants. If this participant is healthy, we include their biomarker measurement in cluster_phi. If this participant is diseased, we compare between \(P_{\theta}\) and \(P_{\phi}\). If \(S_n = 2\), then \(P_{\theta} = 0.1 + 0.3 = 0.4\) and \(P_{\phi} = 0.6\). Because \(P_{\phi}\) is larger, we include this participant’s biomarker measurement in cluster_phi. When the iteration through participants is done, we can calculate the mean and standard deviation of each cluster.
Tip

If \(P_{\theta} = P_{\phi}\), we randomly assign this participant’s biomarker measurement to a cluster.

Code
def compute_single_measurement_likelihood(theta_phi, biomarker, affected, measurement):
    '''Computes the likelihood of the measurement value of a single biomarker

    We know the normal distribution defined by either theta or phi
    and we know the measurement. This will give us the probability
    of this given measurement value. 

    input:
    - theta_phi: the dictionary containing theta and phi values for each biomarker
    - biomarker: an integer between 0 and 9 
    - affected: boolean 
    - measurement: the observed value for a biomarker in a specific participant

    output: a scalar
    '''
    biomarker_dict = theta_phi[biomarker]
    mu = biomarker_dict['theta_mean'] if affected else biomarker_dict['phi_mean']
    std = biomarker_dict['theta_std'] if affected else biomarker_dict['phi_std']
    var = std**2
    if var <= int(0) or np.isnan(measurement) or np.isnan(mu):
        print(f"Invalid values: measurement: {measurement}, mu: {mu}, var: {var}")
        likelihood = np.exp(-(measurement - mu)**2 /
                            (2 * var)) / np.sqrt(2 * np.pi * var)
    else:
        likelihood = np.exp(-(measurement - mu)**2 /
                            (2 * var)) / np.sqrt(2 * np.pi * var)
    return likelihood

def fill_up_kj_and_affected(pdata, k_j):
    '''Fill up a single participant's data using k_j; basically add two columns: 
    k_j and affected
    Note that this function assumes that pdata already has the S_n column

    Input:
    - pdata: a dataframe of ten biomarker values for a specific participant 
    - k_j: a scalar
    '''
    data = pdata.copy()
    data['k_j'] = k_j
    data['affected'] = data.apply(lambda row: row.k_j >= row.S_n, axis=1)
    return data

def compute_likelihood(pdata, k_j, theta_phi):
    '''
    This function computes the likelihood of seeing this sequence of biomarker values 
    for a specific participant, assuming that this participant is at stage k_j
    '''
    data = fill_up_kj_and_affected(pdata, k_j)
    likelihood = 1
    for i, row in data.iterrows():
        biomarker = row['biomarker']
        measurement = row['measurement']
        affected = row['affected']
        likelihood *= compute_single_measurement_likelihood(
            theta_phi, biomarker, affected, measurement)
    return likelihood

def obtain_participants_hashmap(
        data, 
        prior_theta_phi_estimates,
):
    """
    Input:
        - data: a pd.dataframe. For exrample, 150|200_3.csv
        - prior_theta_phi_estimates, a hashmap of dicts. 
            This is the result from hard k-means 
    
    Output: 
        - hashmap: a dictionary whose key is participant id
            and value value is a dict whose key is stage 
            and value is normalized likelihood
    """
    # initialize hashmap_of_normalized_stage_likelihood_dicts
    participants_hashmap = {}
    non_diseased_participants = data[
        data.diseased == False]['participant'].unique()
    disease_stages = data.S_n.unique()
    for p in data.participant.unique():
        dic = defaultdict(int)
        pdata = data[data.participant == p].reset_index(drop = True)
        if p in non_diseased_participants:
            dic[0] = 1
        else:
            for k_j in disease_stages:
                kj_ll = compute_likelihood(pdata, k_j, prior_theta_phi_estimates)
                dic[k_j] = kj_ll
            # likelihood sum
            sum_ll = sum(dic.values())
            epsilon = 1e-10
            if sum_ll == 0:
                sum_ll = epsilon
            normalized_lls = [l/sum_ll for l in dic.values()]
            normalized_ll_dict = dict(zip(disease_stages, normalized_lls))
            participants_hashmap[p] = normalized_ll_dict
    return participants_hashmap 

def calc_soft_kmeans_for_biomarker(
        data,
        biomarker,
        participants_hashmap

):
    """obtain theta, phi estimates using soft kmeans for a single biomarker
    Inputs:
        - data: a pd.dataframe. For example, 150|200_3.csv
        - biomarker: a str, a certain biomarker name
        - hashmap: a dict, returned result of obtain_hashmap()
    Outputs:
        - theta_mean, theta_std, phi_mean, phi_std, a tuple of floats
    """
    non_diseased_participants = data[
        data.diseased == False]['participant'].unique()
    disease_stages = data.S_n.unique()
     # DataFrame for this biomarker
    biomarker_df = data[
        data['biomarker'] == biomarker].reset_index(
            drop=True).sort_values(
                by = 'participant', ascending = True)
    # Extract measurements
    measurements = np.array(biomarker_df['measurement'])

    this_biomarker_order = biomarker_df.S_n[0]

    affected_cluster = []
    non_affected_cluster = []

    for p in data.participant.unique():
        if p in non_diseased_participants:
            non_affected_cluster.append(measurements[p])
        else:
            normalized_ll_dict = participants_hashmap[p]
            affected_prob = sum(
                normalized_ll_dict[
                    kj] for kj in disease_stages if kj >= this_biomarker_order)
            non_affected_prob = sum(
                normalized_ll_dict[
                    kj] for kj in disease_stages if kj < this_biomarker_order)
            if affected_prob > non_affected_prob:
                    affected_cluster.append(measurements[p])
            elif affected_prob < non_affected_prob:
                non_affected_cluster.append(measurements[p])
            else:
                # Assign to either cluster randomly if probabilities are equal
                if np.random.random() > 0.5:
                    affected_cluster.append(measurements[p])
                else:
                    non_affected_cluster.append(measurements[p])
    # Compute means and standard deviations
    theta_mean = np.mean(affected_cluster) if affected_cluster else np.nan
    theta_std = np.std(affected_cluster) if affected_cluster else np.nan
    phi_mean = np.mean(
        non_affected_cluster) if non_affected_cluster else np.nan
    phi_std = np.std(non_affected_cluster) if non_affected_cluster else np.nan
    return theta_mean, theta_std, phi_mean, phi_std

def cal_soft_kmeans_for_biomarkers(
        data,
        participants_hashmap,
        prior_theta_phi_estimates,
):
    soft_kmeans_estimates = {}
    biomarkers = data.biomarker.unique()
    for biomarker in biomarkers:
        dic = {'biomarker': biomarker}
        prior = prior_theta_phi_estimates[biomarker]
        theta_mean, theta_std, phi_mean, phi_std = calc_soft_kmeans_for_biomarker(
            data, biomarker, participants_hashmap
        )
        if theta_std == 0 or math.isnan(theta_std):
            theta_mean = prior['theta_mean']
            theta_std = prior['theta_std']
        if phi_std == 0 or math.isnan(phi_std):
            phi_mean = prior['phi_mean']
            phi_std = prior['phi_std']
        dic['theta_mean'] = theta_mean
        dic['theta_std'] = theta_std
        dic['phi_mean'] = phi_mean
        dic['phi_std'] = phi_std
        soft_kmeans_estimates[biomarker] = dic
    return soft_kmeans_estimates
participants_hashmap = obtain_participants_hashmap(
    data = df, 
    prior_theta_phi_estimates = hard_kmeans_estimates,
)

soft_kmeans_estimates = cal_soft_kmeans_for_biomarkers(
        data = df,
        participants_hashmap = participants_hashmap,
        prior_theta_phi_estimates = hard_kmeans_estimates,
)
soft_kmeans_estimates_df = pd.DataFrame.from_dict(
    soft_kmeans_estimates, orient='index')
soft_kmeans_estimates_df.reset_index(drop=True, inplace=True)
soft_kmeans_estimates_df
biomarker theta_mean theta_std phi_mean phi_std
0 HIP-FCI -5.378366 7.232544 5.092800 1.514369
1 PCC-FCI 5.710289 2.864833 12.098428 3.688180
2 AB 153.648672 18.041801 253.189546 51.038112
3 P-Tau -42.235505 34.734178 -24.626343 14.860556
4 MMSE 23.122406 2.445751 28.049683 0.718480
5 ADAS -18.374956 5.810630 -5.889272 1.283121
6 HIP-GMI 0.442551 0.261505 0.373983 0.234485
7 AVLT-Sum 23.819301 6.942910 41.371798 14.381565
8 FUS-GMI 0.496420 0.052929 0.595958 0.060867
9 FUS-FCI -6.326653 1.695438 -10.205164 3.733813
make_chart(
    biomarkers[0:5], 
    soft_kmeans_estimates_df, 
    truth_df, 
    title = "Comparing Theta and Phi Distributions Using Soft K-Means"
)
Figure 4.3: Comparing Theta and Phi Distributions Using Soft K-Mean

4.4 Conclusion

We compare the above three methods. Hard k-means has the least number of prerequisites: it only needs to know whether participants are healthy or not and biomarker measurements. However, the drawback is that it might not be very accurate. Conjugate priors are extremely accurate; however, it requires knowledge of almost everything: besides what is required by hard k-means, it also requires \(S\) and \(k_j\). Soft k-kmeans does not require the knowledge of \(k_j\) and is an improvement over hard k-means.

We also noticed that both conjugate priors and soft k-means need to use the result from hard k-means as a fallback.