1. Standardisation (Internally-consistent format)

This document describes the standardisation step in the oceanarray processing workflow. It defines how instrument data, regardless of original format or naming, are converted to a consistent internal structure using xarray.Dataset. This enables all subsequent processing steps (e.g. calibration, filtering, transport calculations) to run on semantically uniform data.

It corresponds to Stage 1 of RAPID data processing and management, which used the RDB format (see Section 6 for historical context).

1. Overview

Standardisation occurs immediately after raw data are loaded and before any scientific transformations. Its goal is to normalise:

  • Variable names

  • Dimensions

  • Coordinate names

  • Minimal core attributes (e.g., instrument, mooring, location)

This creates a uniform structure suitable for later trimming, filtering, and conversion.
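
As a concrete illustration, the sketch below performs this kind of normalisation on a toy dataset: vendor-style names (the SBE-flavoured tv290C and prdM here are purely illustrative) are renamed to the internal convention, and the minimal core attributes are attached. This is a hand-rolled example, not the code path used by oceanarray.stage1.

import numpy as np
import pandas as pd
import xarray as xr

# A toy "raw" dataset using vendor-style names (illustrative only).
raw = xr.Dataset(
    {
        "tv290C": ("scan", np.array([3.92, 3.94], dtype="float32")),    # temperature
        "prdM": ("scan", np.array([1726.4, 1725.4], dtype="float32")),  # pressure
    },
    coords={"scan": pd.date_range("2018-08-12T08:00:00", periods=2, freq="h")},
)

# Rename the dimension/coordinate and variables to the internal convention.
ds = raw.rename({"scan": "time", "tv290C": "temperature", "prdM": "pressure"})

# Attach the minimal core attributes used for downstream tracking.
ds.attrs.update(
    {"mooring_name": "dsE_1_2018", "waterdepth": 1000,
     "longitude": -30.0, "latitude": 60.0}
)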

2. Purpose

  • Remove variability in legacy file formats and naming conventions

  • Enable consistent handling across deployments and years

  • Attach minimal metadata needed for downstream tracking (e.g., serial number, water depth, instrument depth, start and end times, location and mooring names)

  • Provide a clean, consistent internal structure while preserving the raw, uncalibrated data values

3. Current Implementation (Stage 1)

The standardisation process is implemented in the oceanarray.stage1 module, which provides automated conversion from native instrument formats to standardised NetCDF files.

3.1. Input Formats Supported

The oceanarray.stage1.MooringProcessor class supports multiple instrument formats:

  • Sea-Bird CNV files (sbe-cnv): Standard SBE 37 MicroCAT output

  • Sea-Bird ASCII files (sbe-asc): Alternative SBE ASCII format

  • Nortek AquaDopp (nortek-aqd): Current meter data with header files

  • RBR RSK files (rbr-rsk): RBR logger binary format

  • RBR ASCII files (rbr-dat): RBR text output

Each format is handled by specialised readers from the ctd_tools library that parse instrument-specific headers and data structures.
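
Conceptually, this dispatch is a lookup from the file_type string to a reader. The sketch below shows that pattern with placeholder reader functions; the actual ctd_tools class names and signatures may differ.

from typing import Callable
import xarray as xr

def read_sbe_cnv(path: str) -> xr.Dataset: ...  # placeholder, not the ctd_tools API
def read_rbr_rsk(path: str) -> xr.Dataset: ...  # placeholder, not the ctd_tools API

# Map each supported file_type string to its reader callable.
READERS: dict[str, Callable[[str], xr.Dataset]] = {
    "sbe-cnv": read_sbe_cnv,
    "rbr-rsk": read_rbr_rsk,
    # "sbe-asc", "nortek-aqd", and "rbr-dat" would be registered the same way.
}

def read_instrument(path: str, file_type: str) -> xr.Dataset:
    try:
        reader = READERS[file_type]
    except KeyError:
        raise ValueError(f"Unsupported file_type: {file_type!r}")
    return reader(path)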

3.2. Processing Workflow

The standardisation process follows these steps:

  1. Configuration Loading: Read YAML configuration files containing:

    • Mooring metadata (location, water depth, deployment times)

    • Instrument specifications (serial numbers, depths, file locations)

    • Clock offsets and timing corrections (not applied in Stage 1)

  2. Data Reading: Use appropriate readers to parse native instrument files:

    • Extract time series data

    • Parse instrument metadata from headers

    • Handle instrument-specific data structures

  3. Variable Standardisation: Convert to consistent naming:

    • Remove derived variables (e.g., potential temperature, density)

    • Standardise coordinate names

    • Apply consistent units and metadata

  4. Metadata Integration: Add standardised attributes:

    • Global mooring information

    • Instrument-specific metadata

    • Deployment and recovery information

    • Quality control flags and processing history

  5. NetCDF Output: Write to compressed, chunked NetCDF files (sketched after this list) with:

    • Optimised data types (float32, uint8 for flags)

    • Time-based chunking for efficient access

    • CF-compliant metadata structure
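
Of these steps, the NetCDF output is the easiest to make concrete. The sketch below applies the data types, compression, and time-based chunking described above using xarray's standard encoding options; the function name, chunk size, and compression level are illustrative assumptions rather than oceanarray's exact settings.

import xarray as xr

def write_standardised(ds: xr.Dataset, path: str, time_chunk: int = 10000) -> None:
    """Write a standardised dataset to compressed, time-chunked NetCDF.

    time_chunk is an illustrative default, not oceanarray's actual choice.
    """
    encoding = {}
    for name, var in ds.data_vars.items():
        if var.dims == ("time",):
            encoding[name] = {
                "dtype": "float32",            # compact storage for measurements
                "zlib": True, "complevel": 4,  # lossless compression
                "chunksizes": (min(time_chunk, var.sizes["time"]),),
            }
            # QC flag variables would get "dtype": "uint8" instead.
    ds.to_netcdf(path, encoding=encoding)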

3.3. Configuration Format

Mooring configurations are defined in YAML files with the following structure:

name: mooring_name
waterdepth: 1000
longitude: -30.0
latitude: 60.0
deployment_time: '2018-05-01T12:00:00'
recovery_time: '2019-05-01T12:00:00'
directory: 'moor/raw/deployment_name/'
instruments:
  - instrument: microcat
    serial: 12345
    depth: 100
    filename: 'data_file.cnv'
    file_type: 'sbe-cnv'
    clock_offset: 0

Here, name is the unique mooring identifier; for example, the first deployment of mooring “DS E” in 2018 is named “dsE_1_2018”.

The waterdepth is the nominal water depth at the mooring site in metres.

The longitude and latitude are the mooring coordinates in decimal degrees.

The deployment_time and recovery_time are the UTC timestamps for the mooring deployment and recovery, with format YYYY-MM-DDTHH:MM:SS.

The directory is the path to the raw data files for this deployment. According to our convention, this is in a directory named moor/raw/deployment_name/, where deployment_name is a unique identifier for the cruise on which the mooring was recovered.

The instruments list contains one entry per instrument on the mooring, with:

  • instrument: The type of instrument (e.g., microcat, aquadopp, rbr)

  • serial: The instrument’s serial number

  • depth: The instrument’s nominal depth in metres

  • filename: The name of the raw data file

  • file_type: The type of the raw data file (e.g., sbe-cnv, rbr-rsk)

  • clock_offset (optional, defaults to zero): The clock offset for the instrument in seconds.
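
A minimal sketch of reading and checking such a configuration with PyYAML is shown below; the required-field sets mirror the description above, and the validation logic is illustrative rather than oceanarray's actual checks.

import yaml

REQUIRED_MOORING_KEYS = {"name", "waterdepth", "longitude", "latitude",
                         "deployment_time", "recovery_time", "directory",
                         "instruments"}
REQUIRED_INSTRUMENT_KEYS = {"instrument", "serial", "depth", "filename",
                            "file_type"}

def load_mooring_config(path: str) -> dict:
    """Load a mooring YAML file and verify the fields described above."""
    with open(path) as f:
        config = yaml.safe_load(f)

    missing = REQUIRED_MOORING_KEYS - config.keys()
    if missing:
        raise ValueError(f"Missing mooring fields: {sorted(missing)}")

    for inst in config["instruments"]:
        missing = REQUIRED_INSTRUMENT_KEYS - inst.keys()
        if missing:
            raise ValueError(f"Missing instrument fields: {sorted(missing)}")
        inst.setdefault("clock_offset", 0)  # optional, defaults to zero

    return config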

3.4. Usage Example

from oceanarray.stage1 import MooringProcessor, process_multiple_moorings

# Process a single mooring
processor = MooringProcessor('/path/to/data/')
success = processor.process_mooring('mooring_name')

# Process multiple moorings
moorings = ['mooring1', 'mooring2', 'mooring3']
results = process_multiple_moorings(moorings, '/path/to/data/')

4. Output Format

The standardised output is a raw-equivalent xarray.Dataset with:

  • Dimensions: time (primary coordinate)

  • Variables: Standardised names (e.g., temperature, salinity, pressure)

  • Coordinates: time, plus instrument metadata variables

  • Attributes: Comprehensive metadata including instrument, mooring, and deployment information

Example output structure:

<xarray.Dataset>
Dimensions:        (time: 124619)
Coordinates:
  * time           (time) datetime64[ns] 2018-08-12T08:00:01 ... 2018-08-26T20:47:24
Data variables:
    temperature    (time) float32 ...
    salinity       (time) float32 ...
    pressure       (time) float32 ...
    serial_number  int64 7518
    InstrDepth     int64 100
    instrument     <U8 'microcat'
    clock_offset   int64 0
    start_time     <U19 '2018-08-12T08:00:00'
    end_time       <U19 '2018-08-26T20:47:24'
Attributes:
    mooring_name:           test_mooring
    waterdepth:             1000
    longitude:              -30.0
    latitude:               60.0
    deployment_time:        2018-08-12T08:00:00
    recovery_time:          2018-08-26T20:47:24
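
Because the output is plain NetCDF, it can be reopened directly with xarray; the filename below is hypothetical.

import xarray as xr

# Open a Stage 1 output file; Stage 1's actual naming scheme may differ.
ds = xr.open_dataset("dsE_1_2018_microcat_7518.nc")
print(float(ds["temperature"].mean()))  # standardised names work for any instrument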

5. Quality Control and Error Handling

The standardisation process includes robust error handling:

  • Missing Files: Graceful handling with detailed logging

  • Format Errors: Reader-specific error catching and reporting

  • Metadata Validation: Checks for required configuration fields

  • Output Verification: Ensures NetCDF files are created successfully

All processing activities are logged to timestamped log files for debugging and audit trails.
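
The overall shape of this error handling is a per-instrument try/except around the reader plus a timestamped log file. The sketch below illustrates that pattern with Python's standard logging module; the log-file naming and logger configuration are assumptions, not oceanarray's exact setup.

import logging
from datetime import datetime, timezone

# Timestamped log file for audit trails (naming scheme is illustrative).
stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
logging.basicConfig(filename=f"stage1_{stamp}.log", level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("stage1")

def process_instrument(path: str, file_type: str):
    try:
        ds = read_instrument(path, file_type)  # see the dispatch sketch in 3.1
    except FileNotFoundError:
        log.error("Missing file: %s", path)    # graceful handling, keep going
        return None
    except Exception:
        log.exception("Reader failed for %s (%s)", path, file_type)
        return None
    log.info("Read %s OK", path)
    return ds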

6. Historical Context: RAPID RDB Format

This standardisation step evolved from the RAPID programme’s RDB format, which provided a similar function for historical mooring data. An example of the original RDB format structure:

Mooring              = wb2_16_2020
SerialNumber         = 5768
WaterDepth           = 3916
InstrDepth           = 1700
Start_Date           = 2021/01/05
Start_Time           = 20:00
End_Date             = 2023/02/25
End_Time             = 17:00
Latitude             = 26 31.000 N
Longitude            = 076 44.460 W
Columns              = YY:MM:DD:HH:T:C:P
2021   01   05   20.00056   3.9239   33.2233  1726.4
2021   01   05   21.00056   3.9389   33.2389  1725.4
2021   01   05   22.00056   3.9389   33.2405  1725.3

The modern NetCDF-based approach provides several advantages over the historical RDB format:

  • Self-describing metadata: CF-compliant attributes and coordinate information

  • Efficient storage: Compression and chunking for large datasets

  • Software compatibility: Wide support across analysis tools and languages

  • Extensibility: Easy addition of new variables and metadata

7. Legacy Processing Scripts

The original RAPID processing chain used MATLAB scripts to convert instrument data to RDB format. The key script, microcat2rodb_3.m, performed similar functions to the current Python implementation:

Excerpt from microcat2rodb_3.m
% function microcat2rodb_3('infile','outfile','infofile',fidlog,[graphics],[toffset])
%
% reads ACSII output from SBE-37 MicroCAT and converts
% writes it to RODB file
% So far the input format contain (temp,cond,day,month,year,time)
% or .cnv format (temp,cond,press,seconds,data_flag)
% where seconds is seconds since Jan 1st 2000.
%
% input:
%   outfile  : path and name of RODB outputfile
%   infile   : path and name of MicroCAT ASCII input file
%   infofile : path and name of mooring info.dat file
%   fidlog   : file identifier for log file
%   graphics : 'y' = display some graphics (mooring)
%              'w' = display some graphics with whole range (rosette)
%   toffset  : offset of recorded instrument time rel to GMT (decimal days),
%              if omitted, default toffset = 0 is set
%
%  uses: microcat_month2.m, hms2h.m  (all by T.Kanzow), rodbsave.m
%
% kanzow   26.12.2000 Xmas edition
%          28.03.2001 pressure option added
%          20.04.2005 CSIRO seawater routine for computation of salinity
%          24.04.2005 recorded time checked against mooring deployment time
%          25.04.2005 input variable 'toffset' added
%          27/10/09 - DR added functionality for .cnv format files for
%             microcat firmware 3.0 and above. and created microcat2rodb_3
%             from microcat2rodb_2_002
%          03.04.2010 - ZBS changed the method for setting 'ylim'
%             for graphics=='y' to use prctile.
%
function microcat2rodb_3(infile,outfile,infofile,fidlog,graphics,toffset)

if nargin < 4
  disp('not enough input arguments')
elseif nargin == 4
  graphics = 'n'
  disp('graphics will not be  displayed')
end

The modern Python implementation in oceanarray.stage1 provides equivalent functionality with improved:

  • Error handling: Comprehensive logging and graceful failure modes

  • Format support: Multiple instrument types through pluggable readers

  • Metadata management: YAML-based configuration with validation

  • Output formats: NetCDF with CF conventions for interoperability

8. Integration with Processing Chain

Standardised files from Stage 1 serve as input to subsequent processing steps.

The consistent structure created during standardisation ensures that all downstream processing tools can operate on any instrument dataset without format-specific modifications.

See also: oceanarray API, 2. Trimming to Deployed period, 3. Calibration (Instrument-level Corrections)