1. Standardisation (Internally-consistent format)
This document describes the standardisation step in the oceanarray processing workflow. It defines how instrument data—regardless of its original format or naming—are converted to a consistent internal structure using xarray.Dataset. This enables all subsequent processing steps (e.g. calibration, filtering, transport calculations) to run on semantically uniform data.
It corresponds to Stage 1 from RAPID data processing and management, which uses the RDB format.
1. Overview
Standardisation occurs immediately after raw data are loaded and before any scientific transformations. Its goal is to normalise:
Variable names
Dimensions
Coordinate names
Minimal core attributes (e.g., instrument, mooring, location)
This creates a uniform structure suitable for later trimming, filtering, and conversion.
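The renaming part of this step can be sketched as a lookup from native names to standard names. This is a minimal illustration only: the alias table and the build_rename_map helper are hypothetical and are not taken from oceanarray.

```python
# Hypothetical alias table: standard name -> native names seen in raw files.
# These aliases are illustrative assumptions, not the oceanarray mapping.
STANDARD_NAMES = {
    "temperature": {"temp", "t", "temperature"},
    "salinity": {"sal", "psal", "salinity"},
    "pressure": {"pres", "press", "pressure"},
    "time": {"time", "datetime"},
}


def build_rename_map(native_names):
    """Return {native_name: standard_name} for names that need renaming."""
    # Invert the alias table so each alias points at its standard name.
    lookup = {
        alias: standard
        for standard, aliases in STANDARD_NAMES.items()
        for alias in aliases
    }
    rename_map = {}
    for name in native_names:
        standard = lookup.get(name.lower())
        # Skip unknown names and names that are already standard.
        if standard is not None and standard != name:
            rename_map[name] = standard
    return rename_map


rename_map = build_rename_map(["TEMP", "PSAL", "pressure", "flag"])
print(rename_map)
```

With xarray, a mapping of this shape can be passed to Dataset.rename, which renames variables, coordinates and dimensions in one call.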
2. Purpose
Remove variability in legacy file formats and naming conventions
Enable consistent handling across deployments and years
Attach minimal metadata needed for downstream tracking (e.g., serial number, water depth, instrument depth, start and end times, location and mooring names)
Provide a clean, consistent internal structure for the raw data
3. Current Implementation (Stage 1)
The standardisation process is implemented in the oceanarray.stage1 module, which provides automated conversion from native instrument formats to standardised NetCDF files.
3.1. Input Formats Supported
The oceanarray.stage1.MooringProcessor class supports multiple instrument formats:
Sea-Bird CNV files (sbe-cnv): Standard SBE 37 MicroCAT output
Sea-Bird ASCII files (sbe-asc): Alternative SBE ASCII format
Nortek AquaDopp (nortek-aqd): Current meter data with header files
RBR RSK files (rbr-rsk): RBR logger binary format
RBR ASCII files (rbr-dat): RBR text output
Each format is handled by specialized readers from the ctd_tools library that parse instrument-specific headers and data structures.
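One common way to select a reader from the file_type string is a small registry. The sketch below only shows that dispatch pattern: register_reader, read_instrument_file and the placeholder reader are hypothetical names, and the real ctd_tools readers are not called here.

```python
# Hypothetical registry mapping file_type strings to reader functions.
_READERS = {}


def register_reader(file_type):
    """Decorator that registers a reader function under a file_type key."""
    def decorator(func):
        _READERS[file_type] = func
        return func
    return decorator


@register_reader("sbe-cnv")
def read_sbe_cnv(path):
    # Placeholder: a real reader would parse the header and data records.
    return {"file_type": "sbe-cnv", "path": path}


def read_instrument_file(path, file_type):
    """Look up the reader for file_type and apply it to path."""
    try:
        reader = _READERS[file_type]
    except KeyError:
        raise ValueError(f"unsupported file_type: {file_type!r}") from None
    return reader(path)


result = read_instrument_file("data_file.cnv", "sbe-cnv")
print(result)
```

An unknown file_type fails early with a clear message, before any file is opened.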
3.2. Processing Workflow
The standardisation process follows these steps:
Configuration Loading: Read YAML configuration files containing:
Mooring metadata (location, water depth, deployment times)
Instrument specifications (serial numbers, depths, file locations)
Clock offsets and timing corrections (not applied in stage 1)
Data Reading: Use appropriate readers to parse native instrument files:
Extract time series data
Parse instrument metadata from headers
Handle instrument-specific data structures
Variable Standardisation: Convert to consistent naming:
Remove derived variables (e.g., potential temperature, density)
Standardise coordinate names
Apply consistent units and metadata
Metadata Integration: Add standardised attributes:
Global mooring information
Instrument-specific metadata
Deployment and recovery information
Quality control flags and processing history
NetCDF Output: Write to compressed, chunked NetCDF files with:
Optimised data types (float32, uint8 for flags)
Time-based chunking for efficient access
CF-compliant metadata structure
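The output settings in the last step can be expressed as a per-variable encoding dictionary. The sketch below assumes one-dimensional time series; build_encoding, the chunk length and the compression level are illustrative choices rather than the values used by oceanarray.

```python
def build_encoding(n_time, data_vars, flag_vars=(), chunk_len=100_000, complevel=4):
    """Build a per-variable encoding dict for 1-D variables along time."""
    # A chunk cannot be longer than the dimension it spans.
    chunk = (max(1, min(n_time, chunk_len)),)
    encoding = {}
    for name in data_vars:
        encoding[name] = {
            "dtype": "float32",  # compact storage for measurements
            "zlib": True,
            "complevel": complevel,
            "chunksizes": chunk,
        }
    for name in flag_vars:
        encoding[name] = {
            "dtype": "uint8",  # flags only need small integers
            "zlib": True,
            "complevel": complevel,
            "chunksizes": chunk,
        }
    return encoding


encoding = build_encoding(
    124619,
    data_vars=["temperature", "salinity", "pressure"],
    flag_vars=["temperature_flag"],  # hypothetical flag variable name
)
print(encoding["temperature"])
```

A dictionary of this form can be passed as the encoding argument of xarray's Dataset.to_netcdf when writing with the netCDF4 engine.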
3.3. Configuration Format
Mooring configurations are defined in YAML files with the following structure:
name: mooring_name
waterdepth: 1000
longitude: -30.0
latitude: 60.0
deployment_time: '2018-05-01T12:00:00'
recovery_time: '2019-05-01T12:00:00'
directory: 'moor/raw/deployment_name/'
instruments:
  - instrument: microcat
    serial: 12345
    depth: 100
    filename: 'data_file.cnv'
    file_type: 'sbe-cnv'
    clock_offset: 0
Here the name is the unique mooring identifier (for example, mooring “DS E”, deployed for the first time in 2018, has the name “dsE_1_2018”).
The waterdepth is the nominal water depth at the mooring site in metres.
The longitude and latitude are the mooring coordinates in decimal degrees.
The deployment_time and recovery_time are the UTC timestamps for the mooring deployment and recovery, with format YYYY-MM-DDTHH:MM:SS.
The directory is the path to the raw data files for this deployment. According to our convention, this is in a directory named moor/raw/deployment_name/, where deployment_name is a unique identifier for the cruise on which the mooring was recovered.
The instruments list contains one entry per instrument on the mooring, with:
instrument: The type of instrument (e.g., microcat, aquadopp, rbr)
serial: The instrument’s serial number
depth: The instrument’s nominal depth in metres
filename: The name of the raw data file
file_type: The type of the raw data file (e.g., sbe-cnv, rbr-rsk)
clock_offset (optional, defaults to zero): The clock offset for the instrument in seconds.
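Once the YAML file has been loaded into a dictionary, the fields described above can be given explicit types. The following is a sketch under that assumption: InstrumentConfig and parse_mooring_config are hypothetical names, and the input dictionary simply mirrors the example configuration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class InstrumentConfig:
    instrument: str
    serial: int
    depth: float
    filename: str
    file_type: str
    clock_offset: float = 0.0  # seconds; optional in the YAML


def parse_mooring_config(raw):
    """Convert a loaded configuration dict into typed values."""
    parsed = dict(raw)
    # Timestamps use the YYYY-MM-DDTHH:MM:SS format described above.
    for key in ("deployment_time", "recovery_time"):
        parsed[key] = datetime.fromisoformat(raw[key])
    parsed["instruments"] = [InstrumentConfig(**entry) for entry in raw["instruments"]]
    return parsed


raw_config = {
    "name": "mooring_name",
    "waterdepth": 1000,
    "longitude": -30.0,
    "latitude": 60.0,
    "deployment_time": "2018-05-01T12:00:00",
    "recovery_time": "2019-05-01T12:00:00",
    "directory": "moor/raw/deployment_name/",
    "instruments": [
        {
            "instrument": "microcat",
            "serial": 12345,
            "depth": 100,
            "filename": "data_file.cnv",
            "file_type": "sbe-cnv",
            # clock_offset omitted on purpose: the default of zero applies
        }
    ],
}

config = parse_mooring_config(raw_config)
print(config["instruments"][0])
```

Because clock_offset has a default, entries that omit it still parse, which matches the optional status described above.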
3.4. Usage Example
from oceanarray.stage1 import MooringProcessor, process_multiple_moorings
# Process a single mooring
processor = MooringProcessor('/path/to/data/')
success = processor.process_mooring('mooring_name')
# Process multiple moorings
moorings = ['mooring1', 'mooring2', 'mooring3']
results = process_multiple_moorings(moorings, '/path/to/data/')
4. Output Format
The standardised output is a raw-equivalent xarray.Dataset with:
Dimensions: time (primary coordinate)
Variables: Standardised names (e.g., temperature, salinity, pressure)
Coordinates: time, plus instrument metadata variables
Attributes: Comprehensive metadata including instrument, mooring, and deployment information
Example output structure:
<xarray.Dataset>
Dimensions:        (time: 124619)
Coordinates:
  * time           (time) datetime64[ns] 2018-08-12T08:00:01 ... 2018-08-26T20:47:24
Data variables:
    temperature    (time) float32 ...
    salinity       (time) float32 ...
    pressure       (time) float32 ...
    serial_number  int64 7518
    InstrDepth     int64 100
    instrument     <U8 'microcat'
    clock_offset   int64 0
    start_time     <U19 '2018-08-12T08:00:00'
    end_time       <U19 '2018-08-26T20:47:24'
Attributes:
    mooring_name:     test_mooring
    waterdepth:       1000
    longitude:        -30.0
    latitude:         60.0
    deployment_time:  2018-08-12T08:00:00
    recovery_time:    2018-08-26T20:47:24
5. Quality Control and Error Handling
The standardisation process includes robust error handling:
Missing Files: Graceful handling with detailed logging
Format Errors: Reader-specific error catching and reporting
Metadata Validation: Checks for required configuration fields
Output Verification: Ensures NetCDF files are created successfully
All processing activities are logged to timestamped log files for debugging and audit trails.
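The metadata validation check can be sketched as a function that collects every missing field instead of stopping at the first one. The required-field lists below are assumptions based on the configuration format described earlier, and find_config_problems is a hypothetical helper.

```python
import logging

# Assumed required fields, taken from the configuration format above.
REQUIRED_MOORING_FIELDS = (
    "name", "waterdepth", "longitude", "latitude",
    "deployment_time", "recovery_time", "directory", "instruments",
)
REQUIRED_INSTRUMENT_FIELDS = ("instrument", "serial", "depth", "filename", "file_type")

logger = logging.getLogger("stage1_sketch")


def find_config_problems(config):
    """Return a list of messages describing missing configuration fields."""
    problems = [
        f"missing mooring field: {key}"
        for key in REQUIRED_MOORING_FIELDS
        if key not in config
    ]
    for index, entry in enumerate(config.get("instruments", [])):
        for key in REQUIRED_INSTRUMENT_FIELDS:
            if key not in entry:
                problems.append(f"instrument {index}: missing field: {key}")
    # Log each problem so it appears in the processing log.
    for message in problems:
        logger.warning(message)
    return problems


problems = find_config_problems(
    {
        "name": "mooring_name",
        "waterdepth": 1000,
        "longitude": -30.0,
        "deployment_time": "2018-05-01T12:00:00",
        "recovery_time": "2019-05-01T12:00:00",
        "directory": "moor/raw/deployment_name/",
        "instruments": [
            {"instrument": "microcat", "serial": 12345, "depth": 100, "filename": "data_file.cnv"}
        ],
    }
)
print(problems)
```

Returning the full list lets a caller report all configuration gaps for a mooring in a single pass.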
6. Historical Context: RAPID RDB Format
This standardisation step evolved from the RAPID programme’s RDB format, which provided a similar function for historical mooring data. An example of the original RDB format structure:
Mooring = wb2_16_2020
SerialNumber = 5768
WaterDepth = 3916
InstrDepth = 1700
Start_Date = 2021/01/05
Start_Time = 20:00
End_Date = 2023/02/25
End_Time = 17:00
Latitude = 26 31.000 N
Longitude = 076 44.460 W
Columns = YY:MM:DD:HH:T:C:P
2021 01 05 20.00056 3.9239 33.2233 1726.4
2021 01 05 21.00056 3.9389 33.2389 1725.4
2021 01 05 22.00056 3.9389 33.2405 1725.3
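For readers who need to inspect such files, the layout above can be parsed with a few lines of Python. This is a sketch based only on the example shown: it treats lines containing "=" as header entries, reads the hour column as decimal hours, and assumes positions are written as degrees, decimal minutes and a hemisphere letter. parse_rdb and parse_position are hypothetical helpers, not part of oceanarray.

```python
from datetime import datetime, timedelta

RDB_TEXT = """\
Mooring = wb2_16_2020
SerialNumber = 5768
WaterDepth = 3916
InstrDepth = 1700
Latitude = 26 31.000 N
Longitude = 076 44.460 W
Columns = YY:MM:DD:HH:T:C:P
2021 01 05 20.00056 3.9239 33.2233 1726.4
2021 01 05 21.00056 3.9389 33.2389 1725.4
2021 01 05 22.00056 3.9389 33.2405 1725.3
"""


def parse_position(text):
    """Convert 'DD MM.mmm H' into signed decimal degrees."""
    degrees, minutes, hemisphere = text.split()
    value = float(degrees) + float(minutes) / 60.0
    return -value if hemisphere in ("S", "W") else value


def parse_rdb(text):
    """Split RDB text into a header dict and a list of data rows."""
    header, rows = {}, []
    for line in text.splitlines():
        if not line.strip():
            continue
        if "=" in line:
            key, value = line.split("=", 1)
            header[key.strip()] = value.strip()
        else:
            fields = line.split()
            year, month, day = (int(field) for field in fields[:3])
            # The fourth column holds decimal hours.
            when = datetime(year, month, day) + timedelta(hours=float(fields[3]))
            rows.append((when, *(float(field) for field in fields[4:])))
    return header, rows


header, rows = parse_rdb(RDB_TEXT)
print(header["Mooring"], len(rows), parse_position(header["Longitude"]))
```

Each row becomes a tuple of a datetime followed by the remaining columns, in the order given by the Columns header entry.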
The modern NetCDF-based approach provides several advantages over the historical RDB format:
Self-describing metadata: CF-compliant attributes and coordinate information
Efficient storage: Compression and chunking for large datasets
Software compatibility: Wide support across analysis tools and languages
Extensibility: Easy addition of new variables and metadata
7. Legacy Processing Scripts
The original RAPID processing chain used MATLAB scripts to convert instrument data to RDB format. The key script was microcat2rodb_3, whose header is reproduced here. This script performed similar functions to the current Python implementation:
% function microcat2rodb_3('infile','outfile','infofile',fidlog,[graphics],[toffset])
%
% reads ACSII output from SBE-37 MicroCAT and converts
% writes it to RODB file
% So far the input format contain (temp,cond,day,month,year,time)
% or .cnv format (temp,cond,press,seconds,data_flag)
% where seconds is seconds since Jan 1st 2000.
%
% input:
%   outfile  : path and name of RODB outputfile
%   infile   : path and name of MicroCAT ASCII input file
%   infofile : path and name of mooring info.dat file
%   fidlog   : file identifier for log file
%   graphics : 'y' = display some graphics (mooring)
%              'w' = display some graphics with whole range (rosette)
%   toffset  : offset of recorded instrument time rel to GMT (decimal days),
%              if omitted, default toffset = 0 is set
%
% uses: microcat_month2.m, hms2h.m (all by T.Kanzow), rodbsave.m
%
% kanzow 26.12.2000 Xmas edition
%        28.03.2001 pressure option added
%        20.04.2005 CSIRO seawater routine for computation of salinity
%        24.04.2005 recorded time checked against mooring deployment time
%        25.04.2005 input variable 'toffset' added
% 27/10/09 - DR added functionality for .cnv format files for
%            microcat firmware 3.0 and above. and created microcat2rodb_3
%            from microcat2rodb_2_002
% 03.04.2010 - ZBS changed the method for setting 'ylim'
%            for graphics=='y' to use prctile.
%
function microcat2rodb_3(infile,outfile,infofile,fidlog,graphics,toffset)

if nargin < 4
    disp('not enough input arguments')
elseif nargin == 4
    graphics = 'n'
    disp('graphics will not be displayed')
end
The modern Python implementation in oceanarray.stage1 provides equivalent functionality with improved:
Error handling: Comprehensive logging and graceful failure modes
Format support: Multiple instrument types through pluggable readers
Metadata management: YAML-based configuration with validation
Output formats: NetCDF with CF conventions for interoperability
8. Integration with Processing Chain
Standardised files from Stage 1 serve as input to subsequent processing steps:
Stage 2: 2. Trimming to Deployed period - Remove pre/post deployment data
Stage 3: 3. Calibration (Instrument-level Corrections) - Apply post-cruise calibration offsets
Later stages: filtering, gridding, stitching for array products
The consistent structure created during standardisation ensures that all downstream processing tools can operate on any instrument dataset without format-specific modifications.
See also: oceanarray API, 2. Trimming to Deployed period, 3. Calibration (Instrument-level Corrections)