Independency Test_v1

Mann-Whitney U Test for Homogeneity

The Mann-Whitney U test (also known as the Wilcoxon rank-sum test) is a non-parametric test used to determine whether two independent samples come from the same distribution.

Mann-Whitney U Statistic Formula:

For two samples with sizes $n_1$ and $n_2$:

$$U_1 = n_1 \cdot n_2 + \frac{n_1(n_1 + 1)}{2} - R_1$$

$$U_2 = n_1 \cdot n_2 + \frac{n_2(n_2 + 1)}{2} - R_2$$

Where:

  • $n_1$ = number of observations in first sample (first half)
  • $n_2$ = number of observations in second sample (second half)
  • $R_1$ = sum of ranks for first sample
  • $R_2$ = sum of ranks for second sample

The test statistic is: $U = \min(U_1, U_2)$

Ranking Process:

  1. Combine both samples and rank all observations from smallest to largest
  2. Assign ranks: smallest value gets rank 1, largest gets rank $(n_1 + n_2)$
  3. For tied values, assign the average of the ranks they would occupy
  4. Sum the ranks for each sample separately
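
The ranking steps and the formulas for $U_1$ and $U_2$ can be sketched in a few lines. This is a minimal illustration, not part of the original script: the function name `mann_whitney_u` and the sample values are hypothetical, and `scipy.stats.rankdata` is assumed for the ranking because its default method assigns average ranks to ties.

```python
import numpy as np
from scipy.stats import rankdata


def mann_whitney_u(sample_1, sample_2):
    """Return (U1, U2, U) from the rank-sum formulas in this section.

    Hypothetical helper for illustration only.
    """
    x = np.asarray(sample_1, dtype=float)
    y = np.asarray(sample_2, dtype=float)
    n1, n2 = len(x), len(y)

    # Steps 1-3: rank the pooled observations; ties receive average ranks
    ranks = rankdata(np.concatenate([x, y]))

    # Step 4: sum the ranks of each sample separately
    r1 = ranks[:n1].sum()
    r2 = ranks[n1:].sum()

    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    return u1, u2, min(u1, u2)


# Illustrative values only: every value in the first sample is smaller
u1, u2, u = mann_whitney_u([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(u1, u2, u)  # 9.0 0.0 0.0
```

Note that $U_1 + U_2 = n_1 \cdot n_2$ always holds, which is a quick sanity check. Also, `scipy.stats.mannwhitneyu` reports the U statistic associated with its first argument rather than $\min(U_1, U_2)$, so its `statistic` value may differ from $U$ as defined here while the two-sided p-value is unaffected.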

Test Statistic (for large samples):

$$Z = \frac{U - \mu_U}{\sigma_U}$$

Where: $$\mu_U = \frac{n_1 \cdot n_2}{2}$$

$$\sigma_U = \sqrt{\frac{n_1 \cdot n_2 \cdot (n_1 + n_2 + 1)}{12}}$$

P-value Calculation:

The p-value is calculated from the Z-statistic using the standard normal distribution: $$p\text{-value} = 2 \times P(Z \geq |z|)$$

(The factor of 2 is for the two-tailed test)
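
As a rough sketch of the large-sample approximation, the snippet below evaluates $\mu_U$, $\sigma_U$, $Z$ and the two-tailed p-value exactly as written above. The function name `normal_approx_p_value` and the input values are hypothetical. The formulas here include neither a tie correction nor a continuity correction, whereas `scipy.stats.mannwhitneyu` applies both by default in its asymptotic method, so small differences from SciPy's p-values are expected.

```python
import math

from scipy.stats import norm


def normal_approx_p_value(u, n1, n2):
    """Two-tailed p-value from the normal approximation of U.

    Hypothetical helper; no tie or continuity correction is applied.
    """
    mu_u = n1 * n2 / 2
    sigma_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu_u) / sigma_u
    # norm.sf(|z|) is the upper-tail probability P(Z >= |z|)
    p_value = 2 * norm.sf(abs(z))
    return z, p_value


# Illustrative values only: U = 300 with 30 observations per half
z, p = normal_approx_p_value(u=300, n1=30, n2=30)
print(f"z = {z:.3f}, p-value = {p:.4f}")
```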

Hypothesis Testing:

  • $H_0$: The two samples come from the same distribution (homogeneous)
  • $H_1$: The two samples come from different distributions (non-homogeneous)

Decision Rule (α = 0.05):

  • If $p\text{-value} > 0.05$:

    • Fail to reject $H_0$ → Data is treated as homogeneous
    • Station PASSES (no significant difference between periods)
  • If $p\text{-value} \leq 0.05$:

    • Reject $H_0$ → Data is non-homogeneous
    • Station FAILS (significant difference between periods)
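
The decision rule can be written as a small helper. This is a sketch under assumptions: the name `classify_station` and the "UNDETERMINED" label for a missing p-value are illustrative additions and do not appear in the original script.

```python
import math


def classify_station(p_value, alpha=0.05):
    """Apply the decision rule: PASS if p > alpha, FAIL otherwise.

    Hypothetical helper. A NaN p-value (e.g. from missing data) is reported
    as UNDETERMINED instead of being silently counted as a failure.
    """
    if math.isnan(p_value):
        return "UNDETERMINED"
    return "PASS" if p_value > alpha else "FAIL"


print(classify_station(0.20))  # PASS
print(classify_station(0.05))  # FAIL (p equal to alpha rejects H0)
```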

Application in This Test:

The data is split at its midpoint into two halves (equal in size when the number of observations is even):

  • First half: Early period observations
  • Second half: Later period observations

A p-value above 0.05 means the test found no significant difference between the two periods, so the data is treated as consistent over time (homogeneous).

Data Homogeneity Testing Using Mann-Whitney U Test

Purpose

This script performs a homogeneity test on precipitation data from multiple stations to determine if the data is consistent over time. The test checks whether the statistical properties of the data remain stable throughout the observation period.

Methodology

Step 1: Data Preparation

  • Load precipitation data from Excel file containing multiple station records
  • Each column represents a different monitoring station
  • Rows represent time series observations (precipitation measurements)

Step 2: Data Splitting

  • Divide the time series at its midpoint into two periods:
    • First half: Early observation period
    • Second half: Later observation period
  • This temporal split allows us to compare statistical properties across time
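
A minimal sketch of the split, assuming integer division for the midpoint. The helper `split_in_halves`, the column name `station_A` and the values are made up for illustration. With an odd number of rows, the later period receives the extra observation.

```python
import pandas as pd


def split_in_halves(df):
    """Split a time-ordered DataFrame at its midpoint (hypothetical helper)."""
    mid_point = len(df) // 2
    return df.iloc[:mid_point], df.iloc[mid_point:]


# Illustrative data only: five daily observations for one station
demo = pd.DataFrame(
    {
        "Date": pd.date_range("2000-01-01", periods=5, freq="D"),
        "station_A": [0.0, 1.2, 0.0, 3.4, 0.5],
    }
)
first, second = split_in_halves(demo)
print(len(first), len(second))  # 2 3
```

The rows are assumed to be sorted chronologically before splitting; otherwise the two halves do not correspond to an early and a later period.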

Step 3: Statistical Testing

  • Apply the Mann-Whitney U test to each station independently with stats.mannwhitneyu
  • Compare the distributions of the two time periods
  • Use a two-tailed test (α = 0.05 significance level)

Step 4: Results Interpretation

  • Calculate U-statistic and p-value for each station
  • Classify stations as:
    • Homogeneous (p > 0.05): Data is consistent over time ✓
    • Non-homogeneous (p ≤ 0.05): Data shows significant temporal variation ✗

Step 5: Export Results

  • Save test statistics, p-values, and homogeneity status to Excel
  • Print a summary showing how many stations pass or fail the test

Why Mann-Whitney U Test?

  • Non-parametric: Does not assume the data follows a particular distribution (such as the normal distribution)
  • Robust: Less sensitive to outliers than parametric tests (suitable for precipitation data)
  • Appropriate for temporal analysis: Effectively compares two independent periods
Python
import pandas as pd
import numpy as np
from scipy import stats
Python
file_source = r"\Eligibility Test\Data\grids_precipitation_data.xlsx"
df = pd.read_excel(file_source)
Python
# We'll split the data into two periods and test for homogeneity

# Split data into two halves for homogeneity testing
mid_point = len(df) // 2
first_half = df.iloc[:mid_point]
second_half = df.iloc[mid_point:]

# Store results
mw_results = {}

# Perform Mann-Whitney U test for each station
for col in df.columns[1:]:  # Skip 'Date' column
    # Drop missing values so NaNs do not propagate into the statistic and p-value
    early = first_half[col].dropna()
    late = second_half[col].dropna()
    statistic, p_value = stats.mannwhitneyu(early, late, alternative='two-sided')
    mw_results[col] = {'statistic': statistic, 'p_value': p_value}

# Create results dataframe
mw_df = pd.DataFrame(mw_results).T
mw_df['is_homogeneous'] = mw_df['p_value'] > 0.05  # Fail to reject H0 (homogeneous) if p > 0.05

print("Mann-Whitney U Test Results for Homogeneity (α = 0.05)")
print("=" * 70)
print(mw_df)
print("\n" + "=" * 70)
print(f"Homogeneous stations: {mw_df['is_homogeneous'].sum()} out of {len(mw_df)}")
print(f"Non-homogeneous stations: {(~mw_df['is_homogeneous']).sum()} out of {len(mw_df)}")
Python
# Save Mann-Whitney U test results to Excel
output_file = r"C:\OneDrive\#FL\Research\PhD\Rainfall Data\Processed\Eligibility Test\Data\mann_whitney_results.xlsx"
mw_df.to_excel(output_file, index=True)
print(f"Results saved to: {output_file}")