Seeing the Sun through the Clouds: Accelerating the SunPy Data Analysis Ecosystem with Dask

SciPy / Austin, TX / 12 July 2023

Will Barnes

AU/NASA GSFC

Nabil Freij

BAERI/LMSAL

Jack Ireland

NASA GSFC

Stuart Mumford

Aperio Software

What is solar physics?

Solar Physics in Python with sunpy!

“The community-developed, free and open-source solar data analysis environment for Python.”

The SunPy Ecosystem

Searching for Data with sunpy

from sunpy.net import Fido, attrs as a
import astropy.units as u

1query = Fido.search(
  a.Time('2018-05-29 18:00', '2018-05-29 18:00:10'),
  a.Wavelength(171*u.angstrom),
  a.Instrument.aia
)
print(query)
1
Search remote repository for AIA 171 Å images between 18:00 and 18:10 on May 29 2018.
Results from 1 Provider:

1 Results from the VSOClient:
Source: http://vso.stanford.edu/cgi-bin/search
Total estimated size: 67.789 Mbyte

       Start Time               End Time        Source ... Extent Type   Size  
                                                       ...              Mibyte 
----------------------- ----------------------- ------ ... ----------- --------
2018-05-29 18:00:09.000 2018-05-29 18:00:10.000    SDO ...    FULLDISK 64.64844

2files = Fido.fetch(query, path='data/{instrument}')
print(files)
2
Download all FITS files corresponding to this search result to data/AIA.
['data/AIA/aia_lev1_171a_2018_05_29t18_00_09_35z_image_lev1.fits']

Loading Data with sunpy

import sunpy.map

m = sunpy.map.Map(files)
m.peek()

Inspecting Data with sunpy

print(m.detector)
print(m.wavelength)
print(m.date)
print(m.observer_coordinate)
AIA
171.0 Angstrom
2018-05-29T18:00:09.350
<SkyCoord (HeliographicStonyhurst: obstime=2018-05-29T18:00:09.350, rsun=696000.0 km): (lon, lat, radius) in (deg, deg, m)
    (0.00562552, -0.97774891, 1.51600359e+11)>
from astropy.coordinates import SkyCoord

corner = SkyCoord(Tx=-375*u.arcsec, Ty=0*u.arcsec,
                  frame=m.coordinate_frame)
print(m.world_to_pixel(corner))
m_cutout = m.submap(corner,
                    width=500*u.arcsec,
                    height=500*u.arcsec)
m_cutout.peek()
PixelPair(x=<Quantity 1425.73662106 pix>, y=<Quantity 2049.01122848 pix>)

Transforming Data with sunpy

from sunpy.coordinates import propagate_with_solar_surface
m_seq = sunpy.map.Map('data/sequence/*.fits', sequence=True)
fig = plt.figure(figsize=(16, 4))
for i, m in enumerate(m_seq):
  ax = fig.add_subplot(1, len(m_seq), i+1, projection=m)
  m.plot(axes=ax)
  with propagate_with_solar_surface():
    blc = m_cutout.bottom_left_coord.transform_to(m.coordinate_frame)
    trc = m_cutout.top_right_coord.transform_to(m.coordinate_frame)
  m.draw_quadrangle(blc, top_right=trc)

Transforming Data with sunpy

with propagate_with_solar_surface():
  m_seq_aligned = sunpy.map.Map([m.reproject_to(m_cutout.wcs) for m in m_seq], sequence=True)
fig = plt.figure(figsize=(16, 4))
for i, m in enumerate(m_seq_aligned):
  ax = fig.add_subplot(1,len(m_seq_aligned), i+1, projection=m)
  m.plot(axes=ax, cmap='sdoaia171', title=m_seq[i].date)

  • Align all images to the World Coordinate System (WCS) of the cutout
  • Removes motion due to observer shift, rotatation of the Sun

Challenging at Scale

  • Analyzing time-dynamics of plasma requires many files
    • Delivering and downloading many individual files to users is difficult
    • \(12\,\mathrm{h}/12\,\mathrm{s}\times6=21,600\) files
  • Instantiating many rich Map objects is difficult
    • \(\approx130\) MB per file, \(\approx3\) TB per dataset
    • Even analyzing a subset of the data requires loading the whole image
  • Lots of processing required to align and stack images into “data cubes”
    • Large computing burden placed on user
    • Physical constraints make “analysis-ready” data products challenging
  • Trade-off between exploratory data analysis and “batch processing”

HelioCloud–Science Platform for Heliophysics

  • NASA-funded common data analysis environment for heliophysics on AWS
  • Aligns with NASA Transform to Open Science (TOPS) initiative
  • Jupyter Lab interface + custom VM instances
  • ~1 Pb of heliophysics data in S3 storage–data and compute are colocated
  • Open to NASA scientists + collaborators–no NASA credentials needed
  • Dask Gateway for easy cluster provisioning

Live Demo 🤞

Some painpoints remain…

  • Directly instantiating Map objects backed by Dask arrays is awkward
    • Creating Dask arrays from FITS files is awkward
    • Map does not have “first-class” support for S3 URLs
  • Some Map methods still require “eager” execution
    • Requires writing custom @dask.delayed functions that reimplement existing methods
    • Friction when moving from local/in-memory environment to parallel/distributed environment
  • Incompatibility between xarray and WCS approaches to coordinates systems
  • Friction between astropy.units and Dask
    • Cannot have a Quantity backed by a Dask array
    • Dask does not correctly serialize Quantity objects

Summary

  • Analyzing solar physics data requires large volumes (\(\sim\) TB), many files (\(>10^4\))
  • SunPy Ecosystem provide Python tools for searching for, loading, transforming, and analyzing solar data
  • HelioCloud platform eliminates need for data download, brings data to compute, provides scalable compute via AWS and Dask
  • HelioCloud + sunpy + dask = Science result from L1 data in minutes
  • Challenges remain in reducing friction between Astronomy-in-Python ecosystem and Dask, but progress being made!
  • Resources
  • Built with