Seeing the Sun through the Clouds: Accelerating the SunPy Data Analysis Ecosystem with Dask

SciPy / Austin, TX / 12 July 2023

Will Barnes

AU/NASA GSFC

Nabil Freij

BAERI/LMSAL

Jack Ireland

NASA GSFC

Stuart Mumford

Aperio Software

What is solar physics?

Solar Physics in Python with `sunpy`!

“The community-developed, free and open-source solar data analysis environment for Python.”

The SunPy Ecosystem

Searching for Data with `sunpy`

from sunpy.net import Fido, attrs as a
import astropy.units as u

1query = Fido.search(
  a.Time('2018-05-29 18:00', '2018-05-29 18:00:10'),
  a.Wavelength(171*u.angstrom),
  a.Instrument.aia
)
print(query)

1: Search remote repository for AIA 171 Å images between 18:00 and 18:10 on May 29 2018.

Results from 1 Provider:

1 Results from the VSOClient:
Source: http://vso.stanford.edu/cgi-bin/search
Total estimated size: 67.789 Mbyte

       Start Time               End Time        Source ... Extent Type   Size  
                                                       ...              Mibyte 
----------------------- ----------------------- ------ ... ----------- --------
2018-05-29 18:00:09.000 2018-05-29 18:00:10.000    SDO ...    FULLDISK 64.64844

2files = Fido.fetch(query, path='data/{instrument}')
print(files)

2: Download all FITS files corresponding to this search result to data/AIA.

['data/AIA/aia_lev1_171a_2018_05_29t18_00_09_35z_image_lev1.fits']

Loading Data with `sunpy`

import sunpy.map

m = sunpy.map.Map(files)
m.peek()

Inspecting Data with `sunpy`

print(m.detector)
print(m.wavelength)
print(m.date)
print(m.observer_coordinate)

AIA
171.0 Angstrom
2018-05-29T18:00:09.350
<SkyCoord (HeliographicStonyhurst: obstime=2018-05-29T18:00:09.350, rsun=696000.0 km): (lon, lat, radius) in (deg, deg, m)
    (0.00562552, -0.97774891, 1.51600359e+11)>

from astropy.coordinates import SkyCoord

corner = SkyCoord(Tx=-375*u.arcsec, Ty=0*u.arcsec,
                  frame=m.coordinate_frame)
print(m.world_to_pixel(corner))
m_cutout = m.submap(corner,
                    width=500*u.arcsec,
                    height=500*u.arcsec)
m_cutout.peek()

PixelPair(x=<Quantity 1425.73662106 pix>, y=<Quantity 2049.01122848 pix>)

Transforming Data with `sunpy`

from sunpy.coordinates import propagate_with_solar_surface
m_seq = sunpy.map.Map('data/sequence/*.fits', sequence=True)
fig = plt.figure(figsize=(16, 4))
for i, m in enumerate(m_seq):
  ax = fig.add_subplot(1, len(m_seq), i+1, projection=m)
  m.plot(axes=ax)
  with propagate_with_solar_surface():
    blc = m_cutout.bottom_left_coord.transform_to(m.coordinate_frame)
    trc = m_cutout.top_right_coord.transform_to(m.coordinate_frame)
  m.draw_quadrangle(blc, top_right=trc)

Transforming Data with `sunpy`

with propagate_with_solar_surface():
  m_seq_aligned = sunpy.map.Map([m.reproject_to(m_cutout.wcs) for m in m_seq], sequence=True)
fig = plt.figure(figsize=(16, 4))
for i, m in enumerate(m_seq_aligned):
  ax = fig.add_subplot(1,len(m_seq_aligned), i+1, projection=m)
  m.plot(axes=ax, cmap='sdoaia171', title=m_seq[i].date)

Align all images to the World Coordinate System (WCS) of the cutout
Removes motion due to observer shift, rotatation of the Sun

Challenging at Scale

Analyzing time-dynamics of plasma requires many files
- Delivering and downloading many individual files to users is difficult
- \(12\,\mathrm{h}/12\,\mathrm{s}\times6=21,600\) files
Instantiating many rich Map objects is difficult
- \(\approx130\) MB per file, \(\approx3\) TB per dataset
- Even analyzing a subset of the data requires loading the whole image
Lots of processing required to align and stack images into “data cubes”
- Large computing burden placed on user
- Physical constraints make “analysis-ready” data products challenging
Trade-off between exploratory data analysis and “batch processing”

HelioCloud–Science Platform for Heliophysics

NASA-funded common data analysis environment for heliophysics on AWS
Aligns with NASA Transform to Open Science (TOPS) initiative
Jupyter Lab interface + custom VM instances
~1 Pb of heliophysics data in S3 storage–data and compute are colocated
Open to NASA scientists + collaborators–no NASA credentials needed
Dask Gateway for easy cluster provisioning

Live Demo 🤞

Some painpoints remain…

Directly instantiating Map objects backed by Dask arrays is awkward
- Creating Dask arrays from FITS files is awkward
- Map does not have “first-class” support for S3 URLs
Some Map methods still require “eager” execution
- Requires writing custom @dask.delayed functions that reimplement existing methods
- Friction when moving from local/in-memory environment to parallel/distributed environment
Incompatibility between xarray and WCS approaches to coordinates systems
Friction between astropy.units and Dask
- Cannot have a Quantity backed by a Dask array
- Dask does not correctly serialize Quantity objects

Summary

Analyzing solar physics data requires large volumes (\(\sim\) TB), many files (\(>10^4\))
SunPy Ecosystem provide Python tools for searching for, loading, transforming, and analyzing solar data
HelioCloud platform eliminates need for data download, brings data to compute, provides scalable compute via AWS and Dask
HelioCloud + sunpy + dask = Science result from L1 data in minutes
Challenges remain in reducing friction between Astronomy-in-Python ecosystem and Dask, but progress being made!
Resources
- Website: sunpy.org
- GitHub: github.com/sunpy/sunpy
Built with

Seeing the Sun through the Clouds: Accelerating the SunPy Data Analysis Ecosystem with Dask

What is solar physics?

Solar Physics in Python with sunpy!

The SunPy Ecosystem

Searching for Data with sunpy

Loading Data with sunpy

Inspecting Data with sunpy

Transforming Data with sunpy

Transforming Data with sunpy

Challenging at Scale

HelioCloud–Science Platform for Heliophysics

Live Demo 🤞

Some painpoints remain…

Summary

Solar Physics in Python with `sunpy`!

Searching for Data with `sunpy`

Loading Data with `sunpy`

Inspecting Data with `sunpy`

Transforming Data with `sunpy`

Transforming Data with `sunpy`