Independency Test_v1

Mann-Whitney U Test for Homogeneity

The Mann-Whitney U test (also known as the Wilcoxon rank-sum test) is a non-parametric test used to assess whether two independent samples come from the same distribution.

Mann-Whitney U Statistic Formula:

For two samples with sizes $n_1$ and $n_2$:

$$U_1 = n_1 \cdot n_2 + \frac{n_1(n_1 + 1)}{2} - R_1$$

$$U_2 = n_1 \cdot n_2 + \frac{n_2(n_2 + 1)}{2} - R_2$$

Where:

$n_1$ = number of observations in first sample (first half)
$n_2$ = number of observations in second sample (second half)
$R_1$ = sum of ranks for first sample
$R_2$ = sum of ranks for second sample

The test statistic is: $U = \min(U_1, U_2)$

Ranking Process:

Combine both samples and rank all observations from smallest to largest
Assign ranks: smallest value gets rank 1, largest gets rank $(n_1 + n_2)$
For tied values, assign the average of the ranks they would occupy
Sum the ranks for each sample separately
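
The ranking steps and the U formulas above can be checked on a small example. The following is a minimal sketch, not part of the original script: the function name `mann_whitney_u_by_hand` and the two sample lists are made up for illustration. It relies on `scipy.stats.rankdata`, whose default `method='average'` gives tied values the average of the ranks they would occupy.

```python
import numpy as np
from scipy.stats import rankdata


def mann_whitney_u_by_hand(sample_1, sample_2):
    """Return (R1, R2, U1, U2, U) using the formulas in this section."""
    n1, n2 = len(sample_1), len(sample_2)
    # Rank the pooled observations; ties receive the average rank.
    ranks = rankdata(np.concatenate([sample_1, sample_2]))
    r1 = ranks[:n1].sum()  # sum of ranks for the first sample
    r2 = ranks[n1:].sum()  # sum of ranks for the second sample
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    return r1, r2, u1, u2, min(u1, u2)


# Hypothetical values, chosen only to illustrate the arithmetic.
early = [1.1, 2.3, 3.8, 4.0, 5.5]
late = [2.9, 6.1, 7.4, 8.2, 9.0]
r1, r2, u1, u2, u = mann_whitney_u_by_hand(early, late)
print(r1, r2, u1, u2, u)
```

For these values the rank sums are 18 and 37, so $U_1 = 22$, $U_2 = 3$ and $U = 3$, with $U_1 + U_2 = n_1 \cdot n_2 = 25$. In recent SciPy versions, `stats.mannwhitneyu` reports the U statistic associated with its first argument rather than the minimum, so its `statistic` may differ from $U$ as defined here.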

Test Statistic (for large samples):

$$Z = \frac{U - \mu_U}{\sigma_U}$$

Where: $$\mu_U = \frac{n_1 \cdot n_2}{2}$$

$$\sigma_U = \sqrt{\frac{n_1 \cdot n_2 \cdot (n_1 + n_2 + 1)}{12}}$$

P-value Calculation:

The p-value is calculated from the Z-statistic using the standard normal distribution: $$p\text{-value} = 2 \times P(Z \geq |z|)$$

(The factor of 2 is for the two-tailed test)
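
The large-sample formulas above translate directly into a few lines of code. This is a sketch under stated assumptions: `normal_approx_p_value` is a hypothetical helper, the inputs are illustrative, and no tie or continuity correction is applied. The asymptotic method of `stats.mannwhitneyu` does correct for ties and, by default, for continuity, so its p-value can differ slightly from this one.

```python
import math

from scipy.stats import norm


def normal_approx_p_value(u, n1, n2):
    """Two-tailed p-value from the normal approximation (no tie or continuity correction)."""
    mu_u = n1 * n2 / 2
    sigma_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu_u) / sigma_u
    # norm.sf(x) is the upper-tail probability P(Z >= x).
    p_value = 2 * norm.sf(abs(z))
    return z, p_value


# Hypothetical inputs: U = 3 with five observations per sample.
z, p_value = normal_approx_p_value(u=3, n1=5, n2=5)
print(f"z = {z:.3f}, p-value = {p_value:.3f}")
```

Samples this small are used only to keep the arithmetic easy to follow; the normal approximation is intended for larger samples.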

Hypothesis Testing:

$H_0$: The two samples come from the same distribution (homogeneous)
$H_1$: The two samples come from different distributions (non-homogeneous)

Decision Rule (α = 0.05):

If $p\text{-value} > 0.05$:
- Fail to reject $H_0$ → No significant evidence that the data is non-homogeneous
- Station PASSES (no significant difference between periods)
If $p\text{-value} \leq 0.05$:
- Reject $H_0$ → Data is non-homogeneous
- Station FAILS (significant difference between periods)

Application in This Test:

The data for each station is split at the midpoint into two halves (equal in length when the number of observations is even):

First half: Early period observations
Second half: Later period observations

If the p-value > 0.05, the test finds no significant difference between the two periods, which is consistent with the data being stable over time (homogeneous).
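
To illustrate how the split-and-compare idea behaves, the sketch below applies it to a synthetic series whose later period is deliberately shifted upward. The gamma-distributed values, the seed and the size of the shift are all assumptions made for the illustration; they are not taken from the precipitation data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)

# Synthetic "station" with 120 observations; the later period is shifted upward on purpose.
series = rng.gamma(shape=2.0, scale=5.0, size=120)
series[60:] += 30.0

# Split at the midpoint, as described above.
mid_point = len(series) // 2
first_half, second_half = series[:mid_point], series[mid_point:]

statistic, p_value = stats.mannwhitneyu(first_half, second_half, alternative="two-sided")
is_homogeneous = p_value > 0.05
print(f"U = {statistic:.1f}, p-value = {p_value:.3g}, homogeneous: {is_homogeneous}")
```

Because the shift is large relative to the spread of the series, the test should flag this synthetic station as non-homogeneous.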

Data Homogeneity Testing Using Mann-Whitney U Test

Purpose

This script performs a homogeneity test on precipitation data from multiple stations to determine if the data is consistent over time. The test checks whether the statistical properties of the data remain stable throughout the observation period.

Methodology

Step 1: Data Preparation

Load precipitation data from an Excel file containing multiple station records
Each column represents a different monitoring station
Rows represent time series observations (precipitation measurements)
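
The layout described above can be pictured with a tiny stand-in table. Everything in this sketch is hypothetical (the station names, the daily dates and the values); it only shows the assumed shape of the input, with a date column first and one column per station after it.

```python
import pandas as pd

# Hypothetical layout: a 'Date' column followed by one column per station.
example_df = pd.DataFrame(
    {
        "Date": pd.date_range("2000-01-01", periods=6, freq="D"),
        "Station_A": [0.0, 1.2, 0.0, 5.4, 3.1, 0.0],
        "Station_B": [0.3, 0.0, 2.2, 4.8, 0.0, 1.0],
    }
)

# Station columns are everything after the first ('Date') column.
station_columns = example_df.columns[1:]
print(list(station_columns))
print(example_df.shape)
```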

Step 2: Data Splitting

Divide each time series at its midpoint into two periods:
- First half: Early observation period
- Second half: Later observation period
This temporal split allows us to compare statistical properties across time

Step 3: Statistical Testing

Apply the Mann-Whitney U test to each station independently with stats.mannwhitneyu
Compare the distributions of the two time periods
Use a two-tailed test (α = 0.05 significance level)

Step 4: Results Interpretation

Calculate U-statistic and p-value for each station
Classify stations as:
- Homogeneous (p > 0.05): Data is consistent over time ✓
- Non-homogeneous (p ≤ 0.05): Data shows significant temporal variation ✗

Step 5: Export Results

Save test statistics, p-values, and homogeneity status to Excel
Print summary statistics showing how many stations pass or fail the test

Why Mann-Whitney U Test?

Non-parametric: Does not assume that the data follow a normal (or any other specific) distribution
Robust: Less sensitive to outliers compared to parametric tests (suitable for precipitation data)
Appropriate for temporal analysis: Effectively compares two independent periods

Python

import pandas as pd
import numpy as np
from scipy import stats

Python

file_source = r"\Eligibility Test\Data\grids_precipitation_data.xlsx"
df = pd.read_excel(file_source)

Python

# Split the time series at the midpoint: first half = early period, second half = later period.
# With an odd number of rows, the second half gets the extra observation.
mid_point = len(df) // 2
first_half = df.iloc[:mid_point]
second_half = df.iloc[mid_point:]

# Store results
mw_results = {}

# Perform a two-sided Mann-Whitney U test for each station
for col in df.columns[1:]:  # Skip the first column ('Date')
    # Drop missing values so NaNs do not propagate into the statistic and p-value
    early = first_half[col].dropna()
    late = second_half[col].dropna()
    statistic, p_value = stats.mannwhitneyu(early, late, alternative='two-sided')
    mw_results[col] = {'statistic': statistic, 'p_value': p_value}

# Create results dataframe (one row per station)
mw_df = pd.DataFrame(mw_results).T
mw_df['is_homogeneous'] = mw_df['p_value'] > 0.05  # Fail to reject H0 (homogeneous) if p > 0.05

print("Mann-Whitney U Test Results for Homogeneity (α = 0.05)")
print("=" * 70)
print(mw_df)
print("\n" + "=" * 70)
print(f"Homogeneous stations: {mw_df['is_homogeneous'].sum()} out of {len(mw_df)}")
print(f"Non-homogeneous stations: {(~mw_df['is_homogeneous']).sum()} out of {len(mw_df)}")

Python

# Save Mann-Whitney U test results to Excel
output_file = r"C:\OneDrive\#FL\Research\PhD\Rainfall Data\Processed\Eligibility Test\Data\mann_whitney_results.xlsx"
mw_df.to_excel(output_file, index=True)
print(f"Results saved to: {output_file}")