Beyond Metrics: Evaluating ML Weather Models Physically - A Literature Review

Introduction#

Machine learning is increasingly being used in weather prediction, offering the promise of faster and more detailed forecasts. Models like FourCastNet, FuXi, and NeuralGCM have shown impressive abilities to predict complex weather events, which is crucial for disaster preparedness and resource management. Their speed and efficiency make them attractive alternatives to traditional forecasting methods.

However, these machine learning models often face challenges in capturing the physical realities of the atmosphere. While they perform well on standard metrics like RMSE and ACC, they may not accurately represent important physical phenomena such as small-scale atmospheric processes or the chaotic nature of weather systems. This highlights the need for evaluations that go beyond traditional statistical measures, incorporating physical diagnostics to ensure that the models are both accurate and realistic.

Review of Key Papers#

Paper 1: FourCastNet: A Global Data-driven High-resolution Weather Model using Adaptive Fourier Neural Operators#

Authors#

Jaideep Pathak et al. (NVIDIA)

Key highlights#

FourCastNet generates week-long forecasts in under 2 seconds. It operates at 0.25° resolution, capturing fine details like small-scale winds and precipitation.
Accurately predicts hurricanes, atmospheric rivers, and extreme precipitation.
Supports robust probabilistic forecasting by generating large ensembles quickly.
Competes with advanced NWP systems for many key variables.

Data#

ERA5 Reanalysis Data: Global atmospheric data at a resolution of 0.25°, spanning decades (1979–2018).

Methodology#

Model Architecture:

Adaptive Fourier Neural Operator (AFNO):
- Combines Fourier Neural Operators for efficient spatial representation and Vision Transformers (ViT) for modeling long-range dependencies.
Separate diagnostic model for precipitation, addressing its sparse and skewed distribution.

Training Process:

Pretraining: One-step forecasts optimized for initial learning.
Fine-Tuning: Multi-step predictions refined for longer time horizons.
Tools: Cosine learning rate schedules and training across 64 NVIDIA A100 GPUs.
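
As a rough illustration of what a cosine learning rate schedule does, the sketch below anneals the learning rate from a peak value to a floor over a fixed number of steps. The peak value `5e-4`, the step count, and the function name `cosine_lr` are assumptions chosen for illustration; they are not taken from the FourCastNet paper.

```python
import math

def cosine_lr(step: int, total_steps: int, lr_max: float, lr_min: float = 0.0) -> float:
    """Cosine-annealed learning rate at a given step (no warm-up, no restarts)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# Hypothetical settings: 100 steps, peak learning rate 5e-4.
schedule = [cosine_lr(s, total_steps=100, lr_max=5e-4) for s in range(101)]
print(schedule[0], schedule[50], schedule[100])
```

The rate starts at the peak, passes through half the peak at the midpoint, and reaches the floor at the final step.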

Inference:

Autoregressive Mode: Model predicts sequential time steps iteratively.
Large ensembles generated by perturbing initial conditions with Gaussian noise for probabilistic forecasting.
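
The two inference ideas above (autoregressive rollout and Gaussian perturbation of the initial condition) can be sketched as follows. This is a minimal sketch, not the FourCastNet code: `toy_step` is a hypothetical stand-in for a trained model, and the noise amplitude is an arbitrary illustrative value.

```python
import numpy as np

def toy_step(state: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for one model step (a damped shift along longitude)."""
    return 0.99 * np.roll(state, shift=1, axis=-1)

def rollout(step_fn, initial_state, n_steps):
    """Autoregressive inference: each prediction is fed back as the next input."""
    states = [initial_state]
    for _ in range(n_steps):
        states.append(step_fn(states[-1]))
    return np.stack(states)  # shape: (n_steps + 1, lat, lon)

def ensemble_rollout(step_fn, initial_state, n_members, n_steps, noise_std, seed=0):
    """Perturb the initial condition with Gaussian noise and roll out each member."""
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_members):
        perturbed = initial_state + noise_std * rng.standard_normal(initial_state.shape)
        members.append(rollout(step_fn, perturbed, n_steps))
    return np.stack(members)  # shape: (n_members, n_steps + 1, lat, lon)

# Toy 8 x 16 initial field.
x0 = np.tile(np.sin(np.linspace(0.0, 2.0 * np.pi, 16, endpoint=False)), (8, 1))
ens = ensemble_rollout(toy_step, x0, n_members=10, n_steps=4, noise_std=0.1)
ens_mean = ens.mean(axis=0)
ens_spread = ens.std(axis=0, ddof=1)
```

The ensemble mean and spread at each lead time are the usual starting points for probabilistic scores.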

Evaluation Metrics:

Metrics include Anomaly Correlation Coefficient (ACC) and Root Mean Squared Error (RMSE).
Comparison with IFS forecasts and other deep learning models.
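
For reference, a common way to compute the two scores on a latitude-longitude grid is sketched below. It assumes cos(latitude) area weighting and a precomputed climatology; the exact weighting and climatology used in the paper may differ in detail, and the synthetic fields are for illustration only.

```python
import numpy as np

def lat_weights(lat_deg):
    """cos(latitude) weights, normalised to mean 1, shaped for (lat, lon) grids."""
    w = np.cos(np.deg2rad(lat_deg))
    return (w / w.mean())[:, None]

def rmse(forecast, truth, lat_deg):
    """Latitude-weighted root mean squared error."""
    w = lat_weights(lat_deg)
    return float(np.sqrt(np.mean(w * (forecast - truth) ** 2)))

def acc(forecast, truth, climatology, lat_deg):
    """Latitude-weighted anomaly correlation coefficient."""
    w = lat_weights(lat_deg)
    fa = forecast - climatology   # forecast anomaly
    ta = truth - climatology      # verifying anomaly
    return float(np.sum(w * fa * ta) / np.sqrt(np.sum(w * fa**2) * np.sum(w * ta**2)))

# Synthetic example on a 36 x 72 grid.
rng = np.random.default_rng(0)
lat = np.linspace(-87.5, 87.5, 36)
climatology = 280.0 + 20.0 * np.cos(np.deg2rad(lat))[:, None] * np.ones((1, 72))
truth = climatology + rng.standard_normal((36, 72))
forecast = truth + 0.5 * rng.standard_normal((36, 72))
print(rmse(forecast, truth, lat), acc(forecast, truth, climatology, lat))
```

Both scores are grid-point averages, which is why a smooth forecast can score well while still lacking small-scale structure.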

Key Results#

Qualitative Analysis:

FourCastNet demonstrated high accuracy in forecasting small-scale, short-term weather phenomena, such as hurricanes, atmospheric rivers, and extreme precipitation.
Example Highlight: The model successfully tracked Hurricane Michael (2018), accurately predicting its formation, rapid intensification, and trajectory.

Precipitation Diagnosis:

Despite the challenges of predicting precipitation due to its intermittent and stochastic nature, FourCastNet achieved remarkable skill in capturing high-resolution features in short-term forecasts.

Comparison with the IFS Model (ECMWF):

FourCastNet proved competitive with IFS in metrics like the Anomaly Correlation Coefficient (ACC) and Root Mean Squared Error (RMSE).
It outperformed IFS in short-term (up to 48 hours) predictions for key variables such as wind speed and temperature.

Specific Case Studies:

Hurricanes: The model successfully tracked hurricanes like Michael, capturing rapid intensification and accurate trajectories over a 72-hour period.
Atmospheric Rivers: Effectively forecasted water vapor columns, including the “Pineapple Express,” demonstrating potential for flood warning systems.

Overland Forecasting Capabilities:

The model delivered precise near-surface wind speed predictions over land, vital for wind energy development and disaster management.

Extreme Weather Predictions:

FourCastNet performed well in predicting extreme values (e.g., heavy rainfall, strong winds) but showed a slight underestimation of the most extreme cases.

Computational Advantages:

Capable of generating ensemble forecasts with 1,000 members in seconds, enabling robust probabilistic models and improving early extreme event warnings.
Runs about 45,000 times faster than traditional NWP models such as IFS on a node-hour basis and uses significantly less energy.

Machine Learning Tools#

Adaptive Fourier Neural Operator (AFNO): The AFNO architecture is the core of FourCastNet, designed to handle high-resolution spatial data efficiently. It leverages Fourier transforms to perform global spatial token mixing, significantly reducing computational complexity compared to traditional methods. By operating in the Fourier domain, it achieves $O(N \log N)$ complexity, enabling scalability for high-resolution grids.
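
To make the idea of Fourier-domain token mixing concrete, here is a deliberately simplified sketch. It is not the AFNO layer itself: it only multiplies each Fourier mode by a weight and transforms back, whereas the real architecture applies learned channel mixing in the Fourier domain. The function name and the weight layout are assumptions for illustration.

```python
import numpy as np

def fourier_token_mixing(tokens, weights):
    """Mix tokens globally via the 2D FFT.

    tokens:  real array of shape (H, W, C), a grid of C-channel tokens.
    weights: complex array of shape (H, W // 2 + 1, C), one weight per mode.
    """
    h, w, _ = tokens.shape
    spectrum = np.fft.rfft2(tokens, axes=(0, 1))          # FFT cost: O(N log N)
    mixed = spectrum * weights                            # per-mode multiplication
    return np.fft.irfft2(mixed, s=(h, w), axes=(0, 1)) + tokens  # residual connection

rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 16, 4))
identity = np.ones((8, 16 // 2 + 1, 4), dtype=complex)   # passes every mode unchanged
dc_only = np.zeros_like(identity)
dc_only[0, 0, :] = 1.0                                    # keeps only the global mean
out_identity = fourier_token_mixing(tokens, identity)
out_dc = fourier_token_mixing(tokens, dc_only)
```

With `dc_only`, every output token receives the mean over all input tokens, which shows that a single multiplication in Fourier space couples every grid point to every other one.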

Vision Transformer (ViT) Backbone: The architecture incorporates a Vision Transformer to capture long-range dependencies in data. Tokens (spatial patches of the input grid) are processed with multi-head self-attention mechanisms, allowing the model to identify intricate relationships across the globe, such as the interaction of distant atmospheric patterns.

Diagnostic Precipitation Model: Precipitation forecasting is treated separately due to its sparse and non-linear nature. A dedicated AFNO-based module predicts accumulated precipitation by post-processing outputs from the main model. It uses a log-transformed representation to handle the skewed data distribution.
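
A log transform of the kind described above can be sketched as follows. The form log(1 + tp / eps) and the value of `EPS` are assumptions for illustration; the point is only that such a transform compresses the long tail of the precipitation distribution and is exactly invertible.

```python
import numpy as np

EPS = 1e-5  # assumed small constant, in the same units as the precipitation values

def to_log_space(tp):
    """Compress skewed precipitation values: log(1 + tp / EPS)."""
    return np.log1p(tp / EPS)

def from_log_space(tp_log):
    """Invert the transform to recover precipitation in physical units."""
    return EPS * np.expm1(tp_log)

tp = np.array([0.0, 1e-5, 1e-3, 5e-2])   # toy accumulated precipitation values
tp_log = to_log_space(tp)
tp_back = from_log_space(tp_log)
print(tp_log)
```

Dry points stay at zero while the largest values are pulled much closer to the bulk of the distribution.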

Multi-Step Autoregressive Forecasting: The model is fine-tuned for autoregressive inference, where predictions from one time step feed into the next. This approach ensures temporal consistency over extended forecasts while maintaining high spatial accuracy.

Efficient Design for Scalability: The architecture minimizes memory footprint and computational load by processing high-resolution grids (720×1440 pixels) with efficient token mixing. This makes FourCastNet capable of handling global-scale weather data with unprecedented speed.

Strengths#

FourCastNet generates week-long global forecasts in under 2 seconds, significantly faster than traditional Numerical Weather Prediction (NWP) models like IFS.
Operates at a 0.25° resolution (roughly 30 km near the equator), capturing small-scale atmospheric features such as cyclones and localized precipitation, surpassing many prior deep learning and NWP models.
Accurately predicts complex phenomena like hurricanes, atmospheric rivers, and extreme precipitation events, including their formation, trajectory, and intensification.
Capable of generating large ensemble forecasts with thousands of members, enabling robust uncertainty quantification and improved reliability for extreme event prediction.

Limitations#

Unlike traditional Numerical Weather Prediction (NWP) models, FourCastNet does not incorporate explicit physics-based equations. This may limit its ability to ensure physical consistency in extreme and long-term forecasts.
The model operates with fewer vertical levels (5) compared to NWP models like ECMWF IFS, which utilize more than 50 levels. This restricts its ability to capture detailed vertical atmospheric dynamics.
Despite high resolution, the diagnostic precipitation model underestimates extremes and struggles with the sparse and skewed distribution of precipitation data.

Relevance to my investigation#

The model is tested on critical use cases, including hurricane tracking, atmospheric river forecasting, and precipitation prediction.
Given the inherent chaos of the atmosphere, the paper emphasizes the role of ensemble forecasts in capturing uncertainty and improving prediction reliability.
The paper uses both deterministic metrics like RMSE and ACC, and probabilistic evaluations through ensemble forecasting to assess model performance comprehensively.

Paper 2: FuXi: A cascade machine learning forecasting system for 15-day global weather forecast#

Authors#

Lei Chen et al.

Key highlights#

It employs a cascade architecture, using pre-trained models fine-tuned for specific forecast windows: short-term (0–5 days), medium-term (5–10 days), and long-term (10–15 days).

Through ensemble forecasting, FuXi creates diverse prediction scenarios by introducing noise-based perturbations and model variability.

Data#

ERA5: Produced by ECMWF, this is a reanalysis dataset with historical weather data offering high spatial and temporal resolution. It is used as the ground truth for training and validating the FuXi model.
HRES-fc0: ECMWF’s high-resolution deterministic forecast data (first time step). It serves as a benchmark to compare the accuracy of FuXi’s deterministic weather predictions.
ENS-fc0: ECMWF’s ensemble mean forecast data, representing probabilistic predictions (first time step). It is used to evaluate FuXi’s ensemble forecasts and their ability to quantify uncertainty.

Methodology#

Model Architecture: FuXi employs a three-stage cascade architecture, each optimized for specific time windows (short, medium, and long-term forecasts). It integrates cube embedding for dimensionality reduction, a U-Transformer for data processing, and a fully connected layer for predictions.

Inputs and Outputs: FuXi uses weather data from two previous time steps as inputs, predicting global weather conditions in 6-hour intervals for up to 15 days with high spatial resolution.
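
The cascade and the two-step input described above can be sketched as a simple loop. This is an illustrative sketch only: `make_toy_model` returns hypothetical stand-ins for the three trained stages, and the only figures taken from the text are the 6-hour step, the two input time steps, and the three 5-day windows.

```python
import numpy as np

STEP_HOURS = 6
STEPS_PER_STAGE = 20  # 5 days x 4 steps per day

def make_toy_model(gain):
    """Hypothetical stand-in for a trained stage: maps (x_{t-1}, x_t) to x_{t+1}."""
    def model(prev, curr):
        return curr + gain * (curr - prev)
    return model

def cascade_forecast(stages, x_prev, x_curr, steps_per_stage=STEPS_PER_STAGE):
    """Run the stages back to back; each starts from the last two states of the previous one."""
    outputs, stage_ids = [], []
    for stage_id, model in enumerate(stages):
        for _ in range(steps_per_stage):
            x_next = model(x_prev, x_curr)
            outputs.append(x_next)
            stage_ids.append(stage_id)
            x_prev, x_curr = x_curr, x_next
    return np.stack(outputs), stage_ids

stages = [make_toy_model(g) for g in (0.5, 0.3, 0.1)]  # short, medium, long
forecast, stage_ids = cascade_forecast(stages, np.zeros((4, 8)), np.ones((4, 8)))
lead_hours = STEP_HOURS * np.arange(1, len(stage_ids) + 1)
print(forecast.shape, lead_hours[-1])
```

Sixty 6-hour steps cover the full 15 days, with the hand-over between stages at step 20 and step 40.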

Loss Function: Latitude-weighted L1 loss is employed to minimize prediction errors with adjustments for geographic variability, enhancing precision across latitudes.
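
A latitude-weighted L1 loss of this kind can be written in a few lines. The sketch below assumes cos(latitude) weights normalised to mean 1, which is the usual choice on regular latitude-longitude grids; the exact normalisation in the FuXi paper may differ.

```python
import numpy as np

def latitude_weighted_l1(pred, target, lat_deg):
    """Mean absolute error with cos(latitude) weights.

    pred, target: arrays of shape (channels, lat, lon).
    """
    w = np.cos(np.deg2rad(lat_deg))
    w = w / w.mean()
    return float(np.mean(w[None, :, None] * np.abs(pred - target)))

lat = np.linspace(-87.5, 87.5, 36)
target = np.zeros((2, 36, 72))
pole_error = target.copy()
pole_error[:, 0, :] = 1.0       # unit error in the row nearest the south pole
equator_error = target.copy()
equator_error[:, 17, :] = 1.0   # unit error in a row near the equator
print(latitude_weighted_l1(pole_error, target, lat),
      latitude_weighted_l1(equator_error, target, lat))
```

The same absolute error costs more near the equator than near the poles, reflecting the larger area each equatorial grid cell represents.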

Training: Training consists of pre-training the short-term model, followed by targeted fine-tuning of all three models (short-term, medium-term, and long-term), leveraging advanced GPUs for large-scale data processing.

Key Results#

FuXi surpasses ECMWF’s high-resolution forecast (HRES) in terms of accuracy for medium- and long-term predictions.
Demonstrates performance on par with GraphCast and other advanced ML models, excelling particularly in longer lead times.

FuXi’s ensemble provides probabilistic forecasts, offering comparable CRPS (Continuous Ranked Probability Score) to ECMWF ensembles for up to 9 days.

The cascaded architecture of FuXi reduces cumulative forecast errors and ensures optimal performance for short (0–5 days), medium (5–10 days), and long-term (10–15 days) forecasting windows.

FuXi demonstrates a significant computational advantage over traditional numerical weather prediction (NWP) systems, achieving comparable results with lower resource demands.

Machine Learning Tools#

Architecture:

U-Transformer with Swin Transformer V2 for handling spatial-temporal weather data efficiently.
Includes residual post-normalization and scaled cosine attention for stability and precision.

Embedding:

Space-Time Cube Embedding reduces high-dimensional input using 3D convolutions.

Encoder-Decoder:

Encoder: Downsampling with residual blocks, GN normalization, and SiLU activation.
Decoder: Upsampling with transposed convolutions and skip connections.

Enhancements:

Improve ensemble spread for long-term forecasts.
Extend the model to handle 14–28 day forecasts (sub-seasonal predictions).
Develop fully ML-based forecasts without relying on numerical models.

Strengths#

The cascade model architecture minimizes cumulative errors by optimizing for short, medium, and long-term forecasts.
FuXi provides a computationally efficient alternative to traditional numerical weather prediction (NWP) models, requiring fewer resources for similar or better performance.
Generates forecasts at a spatial resolution of 0.25° and temporal resolution of 6 hours, matching or exceeding the detail of existing systems.
Incorporates ensemble forecasting for uncertainty quantification, producing results comparable to ECMWF ensembles for short and medium lead times.
Designed with flexibility for future enhancements, such as sub-seasonal forecasts (14–28 days) and fully end-to-end ML-based systems.

Limitations#

While FuXi performs well for up to 15 days, its ensemble forecasts show decreasing accuracy beyond 9 days compared to the ECMWF ensemble mean (EM).
FuXi relies on data derived from numerical models for its initial states and lacks a self-sufficient data assimilation method.
Increasing autoregressive steps to improve long-term forecasts results in higher memory and computational requirements, which could limit scalability.

Relevance to my investigation#

The paper evaluates FuXi’s performance using both deterministic (RMSE, ACC) and probabilistic (CRPS, SSR) metrics, ensuring a comprehensive assessment of forecast accuracy and uncertainty.
FuXi addresses the unpredictability of the atmosphere by employing ensemble methods to quantify and manage uncertainty.
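
The spread-skill ratio (SSR) mentioned above compares the ensemble spread with the error of the ensemble mean. The sketch below uses the plain definition without any finite-ensemble correction factor, and the synthetic data are purely illustrative.

```python
import numpy as np

def spread_skill_ratio(ensemble, truth):
    """Spread / RMSE of the ensemble mean.

    ensemble: array of shape (members, ...); truth: array of shape (...).
    """
    spread = np.sqrt(np.mean(np.var(ensemble, axis=0, ddof=1)))
    skill = np.sqrt(np.mean((ensemble.mean(axis=0) - truth) ** 2))
    return float(spread / skill)

rng = np.random.default_rng(0)
truth = rng.standard_normal(20000)
members = rng.standard_normal((20, 20000))            # same distribution as the truth
ssr_reliable = spread_skill_ratio(members, truth)
ssr_underdispersive = spread_skill_ratio(0.5 * members, truth)
print(ssr_reliable, ssr_underdispersive)
```

A ratio near 1 indicates that the spread is a fair estimate of the forecast error, while values well below 1 indicate an overconfident (underdispersive) ensemble.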

Paper 3: Neural general circulation models for weather and climate#

Authors#

Dmitrii Kochkov et al. (Google, MIT, ECMWF)

Key highlights#

NeuralGCM combines physics-based general circulation models (GCMs) with machine learning.
NeuralGCM outperforms traditional models in computational efficiency, running up to 5 orders of magnitude faster, while maintaining comparable or better forecasting precision.
While effective in current climates, NeuralGCM struggles with extrapolating to extreme future scenarios. It shows promise for integration into broader Earth-system models, advancing forecasting and climate science.

Data#

ECMWF-HRES and ECMWF-ENS: These are high-resolution and ensemble forecasting systems from the European Centre for Medium-Range Weather Forecasts. Used as benchmarks for comparing NeuralGCM’s weather forecasting accuracy over different lead times.
GraphCast and Pangu: State-of-the-art machine learning models for weather forecasting. Served as baselines to evaluate NeuralGCM’s performance in short- and medium-range forecasts.
AMIP (Atmospheric Model Intercomparison Project): A dataset of climate simulations using prescribed sea surface temperatures. NeuralGCM was evaluated against AMIP runs to test its ability to simulate long-term climate patterns and trends.

Methodology#

Model Architecture: NeuralGCM combines a physics-based atmospheric solver with neural networks to integrate large-scale dynamics and smaller-scale processes into a unified forecasting model.

Inputs and Outputs: Inputs are atmospheric profiles and external forcings; outputs are forecast tendencies, capturing the evolution of climate metrics and weather states.

Loss Function: Deterministic models use mean squared error (MSE) to ensure accuracy, with additional terms to penalize bias and control high-frequency spatial errors. Stochastic models use the Continuous Ranked Probability Score (CRPS) to balance prediction accuracy with uncertainty representation.
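
For readers unfamiliar with CRPS, the standard ensemble estimator is sketched below: the mean absolute error of the members minus half the mean absolute difference between member pairs. This is the textbook estimator, not necessarily the exact variant used to train NeuralGCM.

```python
import numpy as np

def crps_ensemble(ensemble, obs):
    """CRPS per grid point: E|X - y| - 0.5 * E|X - X'|.

    ensemble: array of shape (members, ...); obs: array of shape (...).
    """
    abs_err = np.mean(np.abs(ensemble - obs), axis=0)
    pair_diff = np.abs(ensemble[:, None] - ensemble[None, :])
    spread_term = 0.5 * np.mean(pair_diff, axis=(0, 1))
    return abs_err - spread_term

single = crps_ensemble(np.array([[2.0]]), np.array([0.5]))        # one member
pair = crps_ensemble(np.array([[0.0], [2.0]]), np.array([1.0]))   # two members
print(single, pair)
```

With a single member the score reduces to the absolute error, so CRPS can be read as a probabilistic generalisation of MAE that rewards a well-calibrated spread.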

Training: Training involves fine-tuning on reanalysis data like ERA5, gradually increasing time horizons to improve stability and accuracy.

Key Results#

NeuralGCM matches or outperforms traditional physics-based and machine-learning models in short- to medium-range weather forecasts (1–15 days), with lower error metrics like RMSE and CRPS.
NeuralGCM delivers forecasts with 3 to 5 orders of magnitude fewer computational resources, enabling faster simulations at coarser resolutions without compromising accuracy.
With realistic outputs for long-term climate features, NeuralGCM captures essential patterns like tropical cyclone trajectories and seasonal variations, bridging the gap between short-term and extended forecasts.

Machine Learning Tools#

Architecture: The architecture features fully connected networks optimized for capturing localized atmospheric dynamics. Shared weights ensure consistent predictions across spatial grids, while residual layers address gradient vanishing issues.

Embedding: The neural networks incorporate input embeddings that represent atmospheric states, gradients, and external forcings, standardized for uniformity.

Encoder-Decoder: The encoder maps pressure-level data into the sigma-coordinate system, while the decoder converts forecasts back into standard pressure levels.
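
The coordinate change behind this encoder-decoder can be illustrated for a single column. Sigma coordinates are pressure normalised by surface pressure (sigma = p / p_surface). The sketch below covers only linear interpolation between the two coordinate systems, not any learned component of NeuralGCM, and the levels and values are hypothetical.

```python
import numpy as np

def pressure_to_sigma_column(values_p, p_levels, p_surface, sigma_levels):
    """Interpolate one column from pressure levels (hPa, ascending) onto sigma levels."""
    target_p = sigma_levels * p_surface
    return np.interp(target_p, p_levels, values_p)

def sigma_to_pressure_column(values_s, sigma_levels, p_surface, p_levels):
    """Interpolate one column from sigma levels back onto pressure levels."""
    return np.interp(p_levels, sigma_levels * p_surface, values_s)

p_levels = np.array([100.0, 200.0, 300.0, 500.0, 700.0, 850.0, 1000.0])
sigma_levels = np.array([0.15, 0.4, 0.6, 0.9])
p_surface = 1000.0
temp_p = 200.0 + 0.09 * p_levels                      # toy profile, linear in pressure
temp_sigma = pressure_to_sigma_column(temp_p, p_levels, p_surface, sigma_levels)
temp_back = sigma_to_pressure_column(temp_sigma, sigma_levels, p_surface, p_levels)
print(temp_sigma)
```

Note that `np.interp` holds the end values constant outside the source range, so levels above the top or below the bottom sigma level are not extrapolated.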

Enhancements:

Improving the neural network structure or numerical core could further enhance model precision and performance.
Suggests leveraging real-world observational data to improve the relevance and accuracy of weather predictions.
Suggests flexibility to adapt the model with richer physical modeling or enhanced ML techniques based on specific needs.

Strengths#

This hybrid design enables NeuralGCM to handle complex atmospheric phenomena, from large-scale dynamics to subgrid processes.
The approach supports enhancements such as coupling with other Earth-system components or incorporating more observational data, making it highly adaptable to evolving research needs.
NeuralGCM demonstrates exceptional stability in simulating multi-decade climate patterns, accurately reproducing phenomena such as monsoons, tropical cyclones, and seasonal cycles.

Limitations#

The model’s ability to generalize is limited when faced with climates that deviate substantially from the training data, posing challenges for extreme future scenarios.
The model struggles with learning complex processes that have small but significant impacts on climate timescales, such as feedback mechanisms.
NeuralGCM exhibits occasional drifts in its predictions during long-term simulations, emphasizing the importance of addressing stability and consistency.
The model’s dependence on past data constrains its ability to predict emergent or unobserved climatic phenomena.

Relevance to my investigation#

This paper highlights how hybrid models like NeuralGCM integrate physics-based methods and machine learning, setting a new benchmark for evaluating machine learning in weather prediction (MLWP).
Metrics such as RMSE, RMSB, and CRPS allow for detailed evaluation of NeuralGCM’s performance in deterministic forecasts and its ability to quantify probabilistic uncertainties.
Evaluation spans multiple use cases, including medium-range forecasts, ensemble scenarios, and emergent climatic phenomena like tropical cyclones.

Paper 4: WeatherBench 2: A benchmark for the next generation of data-driven global weather models#

Authors#

Stephan Rasp et al. (Google, ECMWF)

Key highlights#

WeatherBench 2 aims to set a new standard for evaluating AI-based weather forecasting systems. Includes tools and data for public use to assess weather forecasting models.
Includes deterministic (RMSE, SEEPS) and probabilistic metrics (CRPS, spread-skill ratio).
Ground truth is based on ERA5 reanalysis datasets.
Incorporates the complexities of weather forecasting, such as chaotic error growth.
Plans for improvements, such as incorporating direct observations and better evaluation methods for extreme events.

Data#

ERA5 Dataset: High-resolution reanalysis dataset (0.25°) used as ground truth and training data, covering 1979 to the present.
IFS HRES: ECMWF’s operational deterministic high-resolution forecast model with 0.1° resolution.
IFS ENS: Ensemble forecast model with 50 members.
Climatology: Averages from 30 years of ERA5 data used to calculate baseline metrics like anomaly correlation coefficient (ACC).
Keisler (2022): Graph Neural Network trained on ERA5 with autoregressive predictions at 1° resolution.
Pangu-Weather: Transformer-based AI model trained on ERA5 data with 0.25° resolution.
GraphCast: Multi-mesh Graph Neural Network trained with ERA5, using 0.25° resolution.
FuXi: Cascaded Transformer model trained on ERA5 for short, medium, and long-range forecasts at 0.25° resolution.
NeuralGCM: Hybrid AI-physics model combining machine learning with dynamical cores at 0.7° and 1.4° resolutions.
SphericalCNN: Convolutional model for spherical data trained on ERA5 at 1.4° x 0.7° resolution.

Methodology#

The benchmark uses an open-source framework to evaluate the performance of AI and traditional weather models on global, medium-range forecasts. Metrics are based on operational weather center standards to ensure robust comparisons. The metrics used are:

Root Mean Squared Error (RMSE): Measures average forecast error for deterministic models.
Anomaly Correlation Coefficient (ACC): Assesses how well forecasts capture variability from climatology.
Bias: Examines systematic over- or under-predictions.
Stable Equitable Error in Probability Space (SEEPS): Evaluates categorical precipitation forecasts.
Continuous Ranked Probability Score (CRPS): Quantifies the accuracy and reliability of probabilistic forecasts.
Spread-Skill Ratio: Analyzes the balance between forecast uncertainty and error in ensemble models.
Power Spectra: Compares the energy distribution across spatial scales, emphasizing high-frequency variability. Zonal spectral energy is calculated to assess the ability of models to preserve small- and large-scale weather patterns over time. High-frequency energy loss indicates model smoothing at longer lead times, which is critical for evaluating AI-generated forecasts.
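
The spectral diagnostic can be sketched with a one-dimensional FFT along each latitude circle. This is a simplified sketch: it works in zonal wavenumber only and ignores the change in circle length with latitude, so it should not be read as the WeatherBench 2 implementation. The smoothing filter stands in for a blurry forecast.

```python
import numpy as np

def zonal_power_spectrum(field):
    """Power per zonal wavenumber, averaged over latitude. field: (lat, lon)."""
    n_lon = field.shape[-1]
    coeffs = np.fft.rfft(field, axis=-1) / n_lon
    power = np.abs(coeffs) ** 2
    power[..., 1:] *= 2.0            # fold in the negative wavenumbers
    if n_lon % 2 == 0:
        power[..., -1] /= 2.0        # the Nyquist mode has no negative partner
    return power.mean(axis=0)

lon = np.linspace(0.0, 2.0 * np.pi, 128, endpoint=False)
field = np.tile(np.cos(3 * lon) + 0.5 * np.cos(20 * lon), (4, 1))
# A 1-2-1 running mean along longitude mimics an over-smoothed forecast.
smooth = 0.25 * np.roll(field, 1, axis=-1) + 0.5 * field + 0.25 * np.roll(field, -1, axis=-1)
spec = zonal_power_spectrum(field)
spec_smooth = zonal_power_spectrum(smooth)
print(spec[3], spec[20], spec_smooth[3], spec_smooth[20])
```

The smoothed field keeps almost all of its power at wavenumber 3 but loses a large share at wavenumber 20, which is the signature that spectral diagnostics pick up and that RMSE alone does not show.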

Key Results#

AI models such as GraphCast and Pangu-Weather show comparable performance to traditional high-resolution systems (e.g., IFS HRES) for deterministic metrics up to 3–6 days.
AI models show excessive smoothing over time, leading to reduced small-scale variability, as revealed by spectral energy metrics.
NeuralGCM ENS matches the accuracy of operational ensemble forecasts for some variables in probabilistic evaluations.
AI models perform variably on extreme events, capturing cyclone tracks but often underestimating intensity.

Relevance to my investigation#

Introduces physical metrics (e.g., spectrum analysis) alongside traditional skill scores to evaluate model reliability in representing weather systems.
Establishes standardized benchmarks (WeatherBench 2) for fair and reproducible comparisons of ML weather prediction models.
Identifies issues like excessive smoothing and loss of small-scale energy in AI forecasts, stressing the need for better physical alignment.

Paper 5: On Some Limitations of Current Machine Learning Weather Prediction Models#

Authors#

Massimo Bonavita (ECMWF)

Key highlights#

“Forecasts from Machine Learning (ML) models have energy spectra notably different from those of their training reanalysis fields and Numerical Weather Prediction models”
“This results in overly smooth predictions and weather phenomena at spatial scales shorter than 300–400 km are not properly represented”
“Fundamental physical balances and derived quantities are not realistically represented in the forecasts of the ML models”
“The effective resolution of the ML models’ forecasts is closer to 500–700 km than to the nominal 0.25° and is gradually decreasing with forecast lead time”

Data#

ERA5: ECMWF’s ERA5 reanalysis. Used as training data for ML models and as a benchmark for evaluating forecast accuracy and physical consistency.
ECMWF IFS Forecasts: Generated by the ECMWF’s Integrated Forecasting System (IFS). Served as a reference for comparing ML model forecasts in terms of spectral resolution and dynamic consistency.
Pangu-Weather Forecast Outputs: Produced by the Pangu-Weather ML model trained on ERA5 reanalysis data. Analyzed for spectral diagnostics, wind balance, and vertical motion to assess forecast fidelity and physical realism.

Methodology#

The author analyzed ML weather models like Pangu-Weather by comparing their forecasts to ERA5 reanalysis data and ECMWF forecasts. The analysis used tools such as spectral energy analysis, geostrophic balance checks, and vertical velocity diagnostics to identify gaps in physical consistency and forecast accuracy.
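
A geostrophic balance check starts from the geostrophic wind implied by the geopotential field, u_g = -(1/f) ∂Φ/∂y and v_g = (1/f) ∂Φ/∂x, and compares it with the forecast wind. The finite-difference sketch below is one possible way to do this on a latitude-longitude grid away from the equator; it is not the paper's diagnostic code, and the test field is synthetic.

```python
import numpy as np

OMEGA = 7.292e-5    # Earth's rotation rate, rad/s
R_EARTH = 6.371e6   # Earth's radius, m

def geostrophic_wind(phi, lat_deg, lon_deg):
    """Geostrophic wind (m/s) from geopotential phi (m^2/s^2) on a (lat, lon) grid."""
    lat = np.deg2rad(lat_deg)
    lon = np.deg2rad(lon_deg)
    f = 2.0 * OMEGA * np.sin(lat)[:, None]            # Coriolis parameter
    dphi_dlat = np.gradient(phi, lat, axis=0)
    dphi_dlon = np.gradient(phi, lon, axis=1)
    u_g = -dphi_dlat / (f * R_EARTH)
    v_g = dphi_dlon / (f * R_EARTH * np.cos(lat)[:, None])
    return u_g, v_g

# Synthetic field whose geostrophic wind is U * cos(lat), purely zonal.
U = 10.0
lat_deg = np.arange(20.0, 71.0, 1.0)
lon_deg = np.arange(0.0, 360.0, 5.0)
sin_lat = np.sin(np.deg2rad(lat_deg))[:, None]
phi = 5.0e4 - U * R_EARTH * OMEGA * sin_lat**2 * np.ones((1, lon_deg.size))
u_g, v_g = geostrophic_wind(phi, lat_deg, lon_deg)
u_expected = U * np.cos(np.deg2rad(lat_deg))[:, None] * np.ones((1, lon_deg.size))
u_ageostrophic = u_expected - u_g                     # residual for a balanced wind
```

Applied to a real forecast, the residual between the forecast wind and `u_g`, `v_g` is the ageostrophic component that the diagnostics examine.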

Key Results#

ML models like Pangu-Weather show reduced spectral energy at higher spatial frequencies, resulting in overly smooth forecasts that fail to capture fine-scale weather phenomena effectively.
The forecasts from ML models demonstrate inconsistencies in geostrophic balance, with weaker ageostrophic wind components, leading to less realistic interactions between wind and pressure systems.
ML models underpredict vertical motions, producing weaker and more diffuse vertical velocity fields compared to traditional physics-based models, which impacts the accuracy of weather event predictions like storms and cyclones.
While fast and efficient, these models are better suited for specific tasks than for replacing traditional systems.

Relevance to my investigation#

RMSE and similar metrics miss important details about physical model consistency.
The paper highlights the importance of using diagnostics like spectral energy and geostrophic balance to evaluate MLWP models.
Studying physical realism reveals weaknesses in how MLWP models simulate weather dynamics.

Paper 6: Can Artificial Intelligence-Based Weather Prediction Models Simulate the Butterfly Effect?#

Authors#

T. Selz and G. C. Craig

Key highlights#

“Current artificial-intelligence-based models cannot simulate the butterfly effect and incorrectly suggest unlimited atmospheric predictability”
“Their error growth rate and structure remain similar to synoptic-scale error growth regardless of the amplitude of the initial perturbation”
“Synoptic-scale error growth from current levels of initial condition uncertainty appears mostly realistic, except for a short initial decay”

Data#

Pangu Model Outputs: Generated forecasts with different perturbation levels.
ICON Model Data: High-resolution physical model outputs, including convection-permitting simulations.

Methodology#

The study evaluates the performance of the AI-based Pangu-Weather model and the physics-based ICON model. Researchers conducted experiments with varying levels of initial condition uncertainties (100% for realistic scenarios and 0.1% to simulate the butterfly effect). They analyzed error growth by introducing perturbations into the models, running simulations for 72 hours, and comparing outputs using metrics such as Difference Kinetic Energy (DKE) and spatial/spectral error growth diagnostics.
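
Difference Kinetic Energy measures how far two forecasts have drifted apart in their wind fields: DKE = 0.5 * ((u_a - u_b)^2 + (v_a - v_b)^2). The sketch below computes an area-weighted mean DKE and shows how scaling a perturbation to 0.1% of its amplitude changes the initial DKE. The wind fields are random placeholders, not model output.

```python
import numpy as np

def difference_kinetic_energy(u_a, v_a, u_b, v_b, lat_deg):
    """Area-weighted mean DKE (m^2/s^2) between two forecasts on a (lat, lon) grid."""
    dke = 0.5 * ((u_a - u_b) ** 2 + (v_a - v_b) ** 2)
    w = np.cos(np.deg2rad(lat_deg))[:, None]
    return float(np.sum(w * dke) / (np.sum(w) * dke.shape[1]))

rng = np.random.default_rng(0)
lat = np.linspace(-87.5, 87.5, 36)
u = 10.0 * rng.standard_normal((36, 72))      # placeholder reference winds
v = 5.0 * rng.standard_normal((36, 72))
perturbation = rng.standard_normal((36, 72))  # placeholder initial-condition error

dke_full = difference_kinetic_energy(u + perturbation, v, u, v, lat)           # 100%
dke_small = difference_kinetic_energy(u + 0.001 * perturbation, v, u, v, lat)  # 0.1%
print(dke_full, dke_small)
```

Because DKE is quadratic in the wind difference, a 0.1% perturbation starts at one millionth of the full-amplitude DKE; how quickly a model amplifies it from there is what the experiments examine.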

Key Results#

AI-based models, like Pangu-Weather, cannot replicate the butterfly effect, suggesting an unrealistic unlimited atmospheric predictability.
Pangu-Weather underestimates error growth at small scales, diverging significantly from the nonlinear dynamics seen in traditional models.
When initial condition uncertainties are large (100%), Pangu-Weather closely mirrors the performance of physical models like ICON.
Using large ensembles can lower sampling uncertainty and enhance the accuracy of extreme event predictions, but this is only effective if the model accurately reflects realistic error growth dynamics.

Relevance to my investigation#

The paper highlights the need for machine learning weather prediction (MLWP) models to maintain physical consistency, especially in chaotic atmospheric features like the butterfly effect.
The research reveals that MLWP models struggle with simulating the intrinsic chaos of atmospheric systems, impacting their physical reliability.
It applies metrics like Difference Kinetic Energy (DKE) to measure error growth, emphasizing their value in testing the physical consistency of MLWP models.
The study underscores the limitations of MLWP models in simulating key physical processes, urging the inclusion of evaluations based on physical principles.

Comparison of Approaches#

Method	Strengths	Limitations	Example Papers
FourCastNet: Adaptive Fourier Neural Operator (AFNO); combines Fourier Neural Operators with Vision Transformers; separate diagnostic model for precipitation	Generates forecasts in under 2 seconds; high resolution (0.25°), captures fine details; accurate extreme event predictions; quickly generates large ensembles	Lacks explicit physics equations; fewer vertical levels (5); underestimates extreme precipitation	Pathak et al. (2022), “FourCastNet: A Global Data-driven High-resolution Weather Model using Adaptive Fourier Neural Operators”
FuXi: Cascade architecture for different forecast windows; U-Transformer with Swin Transformer V2; Space-Time Cube Embedding	Reduces cumulative errors; computationally efficient; high spatial (0.25°) and temporal (6h) resolution; incorporates ensemble forecasting	Accuracy drops beyond 9 days; no self-sufficient data assimilation; higher computational needs for long-term forecasts	Chen et al. (2023), “FuXi: A cascade machine learning forecasting system for 15-day global weather forecast”
NeuralGCM: Hybrid of physics-based GCMs and neural networks; fully connected networks with shared weights; encoder-decoder with sigma-coordinate system	Handles complex atmospheric phenomena; stable in multi-decade simulations; extremely computationally efficient	Limited generalization to extreme climates; occasional prediction drifts; struggles with emergent phenomena	Kochkov et al. (2024), “Neural general circulation models for weather and climate”

Challenges and Future Directions#

Improve Physical Realism: Enhance ML models to better represent small-scale processes and atmospheric chaos.
Preserve Fine Details: Develop methods to prevent overly smooth forecasts and maintain small-scale variability.
Simulate Atmospheric Chaos: Advance techniques to capture chaotic behaviors like the butterfly effect.
Generalize to Extremes: Enable models to predict unprecedented events beyond training data.
Increase Vertical Resolution: Add more vertical levels to better capture atmospheric dynamics.
Enhance Precipitation Forecasting: Improve models to handle sparse, skewed precipitation data and predict extremes.
Optimize Computational Efficiency: Design efficient architectures for scalable, long-term forecasting.
Improve Ensemble Methods: Develop better ensemble techniques to capture uncertainty and chaotic behaviors.
Use Physical Diagnostics: Incorporate physical metrics for comprehensive model evaluation.
Encourage Collaboration: Foster interdisciplinary work to integrate domain expertise with ML advancements.

Conclusion#

While machine learning models like FourCastNet, FuXi, and NeuralGCM have shown promise in weather prediction, traditional metrics like RMSE (Root Mean Squared Error) and ACC (Anomaly Correlation Coefficient) are not sufficient to fully evaluate their performance. These metrics measure statistical accuracy but do not assess whether the models accurately represent the physical realities of the atmosphere, such as chaotic behavior and small-scale weather phenomena.

Studies have found that these models often produce overly smooth forecasts, losing important small-scale details and failing to capture the chaotic nature of weather systems—the so-called butterfly effect. Physical evaluations using tools like power spectra analysis and assessments of geostrophic balance have revealed that machine learning models may not preserve the energy distribution across different spatial scales or maintain essential physical balances, issues that traditional metrics fail to highlight.

Therefore, it’s essential to go beyond standard statistical measures and incorporate physical diagnostics when evaluating machine learning weather models. By doing so, we can better identify their limitations and guide improvements, ensuring that these models are not only statistically accurate but also physically realistic and reliable for weather forecasting.

Questions About Methodologies and ML Approaches#

FourCastNet’s Architecture and Training:
- How does FourCastNet utilize Adaptive Fourier Neural Operators (AFNO) in its model architecture, and what advantages does this offer for high-resolution weather forecasting?
- Why does FourCastNet use a separate diagnostic model for precipitation, and how does this approach address the challenges of predicting sparse and skewed precipitation data?
- What role do the cosine learning rate schedules and multi-GPU training play in optimizing FourCastNet’s performance?
FuXi’s Cascade Model Approach:
- How does FuXi incorporate ensemble forecasting, and what methods are used to introduce diversity in its prediction scenarios?
- In what ways does FuXi’s use of the U-Transformer and Swin Transformer V2 contribute to its handling of spatial-temporal weather data?
- What are the specific techniques FuXi uses in fine-tuning its models for different forecasting windows, and how do they impact model stability?
NeuralGCM’s Hybrid Design:
- What challenges does NeuralGCM face when extrapolating to extreme future climate scenarios, and how might its methodology be adapted to address these limitations?
- How does NeuralGCM’s use of fully connected networks and shared weights enhance its ability to capture localized atmospheric dynamics?
- How does NeuralGCM’s training process ensure stability and accuracy over multi-decade climate simulations?
WeatherBench 2’s Benchmarking Framework:
- What are the key features of WeatherBench 2, and how does it aim to standardize the evaluation of AI-based weather forecasting systems?
- How does WeatherBench 2 facilitate fair and reproducible comparisons among different ML weather prediction models?
- In what ways can WeatherBench 2 be improved to better evaluate extreme weather events and their representation in ML models?
Limitations and Challenges of Current ML Approaches:
- What are some limitations of ML weather models in representing small-scale atmospheric processes, and how do these limitations affect forecast accuracy?
- How does the inability of some ML models to simulate chaotic atmospheric behavior impact their reliability for long-term forecasting?
- What strategies can be employed to improve the physical realism of ML weather models and overcome the identified limitations?
- How does the over-smoothing observed in ML model forecasts impact the representation of fine-scale weather phenomena?
- How critical is it for ML models to accurately simulate the butterfly effect, and what approaches can help achieve this?
Comparative Analysis of ML Models and Traditional NWP:
- In what ways do ML models like FourCastNet and FuXi outperform traditional Numerical Weather Prediction (NWP) models, and where do they fall short?
- How do computational efficiency and scalability factor into the adoption of ML approaches in operational weather forecasting?
- What are the implications of ML models operating without explicit physics-based equations on their long-term forecast reliability?
Ensemble Forecasting and Uncertainty Quantification:
- What are the methods used by ML models to introduce perturbations for ensemble forecasting, and how effective are they in practice?
- How do probabilistic metrics like CRPS and spread-skill ratio contribute to the evaluation of ensemble forecasts in ML models?
- How do ML models handle initial condition uncertainties in ensemble forecasting compared to traditional models?
- What are the limitations of current ML ensemble methods in representing the full spectrum of possible weather outcomes?
- How can ensemble forecasting in ML models be improved to better capture chaotic atmospheric behaviors?
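For reference when working through the ensemble questions above, the two probabilistic scores they mention can be sketched as follows. This is a minimal NumPy version under stated assumptions: it uses the standard ensemble estimator of CRPS (not the finite-ensemble "fair" variant), a spread-skill ratio without any ensemble-size correction, and synthetic Gaussian data. The function names are hypothetical.

```python
import numpy as np

def ensemble_crps(members, obs):
    """CRPS per point from an ensemble of shape (M, ...) and observations of shape (...).

    Standard estimator: mean |x_i - y| minus half the mean |x_i - x_j|.
    """
    members = np.asarray(members, dtype=float)
    skill = np.mean(np.abs(members - obs), axis=0)
    pairwise = np.abs(members[:, None] - members[None, :])
    spread = 0.5 * np.mean(pairwise, axis=(0, 1))
    return skill - spread

def spread_skill_ratio(members, obs):
    """Root-mean ensemble variance divided by the RMSE of the ensemble mean."""
    members = np.asarray(members, dtype=float)
    spread = np.sqrt(np.mean(np.var(members, axis=0, ddof=1)))
    rmse = np.sqrt(np.mean((members.mean(axis=0) - obs) ** 2))
    return float(spread / rmse)

# Synthetic example (assumed data): members and observations drawn from
# the same distribution, i.e. a statistically consistent ensemble.
rng = np.random.default_rng(0)
members = rng.standard_normal((20, 1000))
obs = rng.standard_normal(1000)

print("mean CRPS:", ensemble_crps(members, obs).mean())
print("spread-skill ratio:", spread_skill_ratio(members, obs))
```

A spread-skill ratio well below 1 indicates an under-dispersive ensemble, meaning the members do not diverge enough to cover the actual forecast error. That is the behavior one would expect from a model that does not reproduce chaotic error growth.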

Bibliography#

Bonavita, M. (2024). On Some Limitations of Current Machine Learning Weather Prediction Models. Geophysical Research Letters, 51(12), e2023GL107377. https://doi.org/10.1029/2023GL107377

Chen, L., Zhong, X., Zhang, F., Cheng, Y., Xu, Y., Qi, Y., & Li, H. (2023). FuXi: A cascade machine learning forecasting system for 15-day global weather forecast. npj Climate and Atmospheric Science, 6(1), 1–11. https://doi.org/10.1038/s41612-023-00512-1

Kochkov, D., Yuval, J., Langmore, I., Norgaard, P., Smith, J., Mooers, G., Klöwer, M., Lottes, J., Rasp, S., Düben, P., Hatfield, S., Battaglia, P., Sanchez-Gonzalez, A., Willson, M., Brenner, M. P., & Hoyer, S. (2024). Neural general circulation models for weather and climate. Nature, 632(8027), 1060–1066. https://doi.org/10.1038/s41586-024-07744-y

Pathak, J., Subramanian, S., Harrington, P., Raja, S., Chattopadhyay, A., Kurth, T., Hall, D., Li, Z., Azizzadenesheli, K., Hassanzadeh, P., Kashinath, K., & Anandkumar, A. (2022). FourCastNet: A Global Data-driven High-resolution Weather Model using Adaptive Fourier Neural Operators (arXiv:2202.11214). arXiv. https://arxiv.org/abs/2202.11214v1

Rasp, S., Hoyer, S., Merose, A., Langmore, I., Battaglia, P., Russell, T., Sanchez-Gonzalez, A., Yang, V., Carver, R., Agrawal, S., Chantry, M., Bouallegue, Z. B., Dueben, P., Bromberg, C., Sisk, J., Barrington, L., Bell, A., & Sha, F. (2024). WeatherBench 2: A benchmark for the next generation of data-driven global weather models (arXiv:2308.15560). arXiv. https://doi.org/10.48550/arXiv.2308.15560

Selz, T., & Craig, G. C. (2023). Can Artificial Intelligence-Based Weather Prediction Models Simulate the Butterfly Effect? Geophysical Research Letters, 50(20), e2023GL105747. https://doi.org/10.1029/2023GL105747
