How Interactive Oceans Displays Large Datasets

This notebook will walk you through our process in displaying some of the large datasets from OOI

import cmocean
from dask.utils import memory_repr
import matplotlib.pyplot as plt
import hvplot.xarray

from ooi_harvester.models import OOIDataset

Get Data

We will be requesting Axial Base Shallow Profiler CTD Data

desired_parameters = ['time', 'seawater_pressure', 'seawater_temperature']
ctd = OOIDataset("RS03AXPS-SF03A-2A-CTDPFA302-streamed-ctdpf_sbe43_sample")[desired_parameters]
ctd
<RS03AXPS-SF03A-2A-CTDPFA302-streamed-ctdpf_sbe43_sample: 59.9 GB>
Dimensions: (time)
Data variables: 
    seawater_pressure
    seawater_temperature
    time

This dataset has a total size of 52.2GB

start_dt, end_dt = "2020-01-01", "2021-01-01"
%%time
ctd_ds = ctd.sel(time=slice(start_dt, end_dt)).dataset
CPU times: user 13.7 s, sys: 6.25 s, total: 19.9 s
Wall time: 23.1 s
ctd_ds
<xarray.Dataset>
Dimensions:               (time: 29403699)
Coordinates:
  * time                  (time) datetime64[ns] 2020-01-01T00:00:00.235197952...
Data variables:
    seawater_pressure     (time) float64 dask.array<chunksize=(11102469,), meta=np.ndarray>
    seawater_temperature  (time) float64 dask.array<chunksize=(11102469,), meta=np.ndarray>

There are about 29 million data points within that time range. This is huge for visualization!

We can check the size of 1 year of this dataset

print(f"This dataset size is {memory_repr(ctd_ds.nbytes)}")
This dataset size is 673.0 MB

Plotting

Now let’s try to create a depth plot (time, pressure, and temperature). We use hvPlot to perform the plotting. Using a plotting tool like matplotlib would take a really long time to plot.

hvPlot Process Diagram

To learn more click here.

Using matplotlib

fig, ax = plt.subplots()
ctd_ds.plot.scatter(x='time', y='seawater_pressure', hue='seawater_temperature', cmap=cmocean.cm.thermal)
ax.invert_yaxis()
ax.set_title('Axial Base Shallow Profiler CTD')

plt.tight_layout()
plt.savefig('ctd-profile.png', dpi=300, bbox_inches='tight', transparent=True)

For purpose of comparison, the plot above was created with matplotlib pyplot using the builtin xarray plotting function.

Using hvPlot

plot_size = (888, 450)
%%time
plot = ctd_ds.hvplot.scatter(
    x='time',
    y='seawater_pressure',
    color='seawater_temperature',
    rasterize=True,
    cmap=cmocean.cm.thermal,
    width=plot_size[0],
    height=plot_size[1],
).options(
    invert_yaxis=True,
    title='Axial Base Shallow Profiler CTD'
)
plot
CPU times: user 6.09 s, sys: 2.27 s, total: 8.36 s
Wall time: 12.8 s

The hvPlot python library is part of the HoloViz Python Visualization Tools Ecosystem. Underneath, hvPlot utilizes HoloViews and Datashader in order to create the plot. We take the resulting data from the hvPlot plot and serialize that to JSON format for our frontend visualization engine plotly to render.

You can see that the resulting datashaded plot has exactly the same pattern seen in the matplotlib plot. For example, around 9/2020 there is a warmer water at the surface. This shows the accuracy of datashading.

Datashading Pipeline

To learn more click here.

Extracting underlying dataset

plot_data = plot[()].data
plot_data
<xarray.Dataset>
Dimensions:                                      (time: 888,
                                                  seawater_pressure: 450)
Coordinates:
  * time                                         (time) datetime64[ns] 2020-0...
  * seawater_pressure                            (seawater_pressure) float64 ...
Data variables:
    time_seawater_pressure seawater_temperature  (seawater_pressure, time) float64 ...

The xarray dataset shown above is the resulting aggregated data from the datashading process that we push to the frontend application.

That’s all. That process happens with all of the datasets that we have, when the data request is large enough.