1. Standardisation (Internally-consistent format)
==================================================

This document describes the **standardisation** step in the oceanarray processing workflow. It defines how instrument data—regardless of their original format or naming—are converted to a consistent internal structure using `xarray.Dataset`. This enables all subsequent processing steps (e.g. calibration, filtering, transport calculations) to run on semantically uniform data.

It corresponds to **Stage 1** of RAPID data processing and management, which uses the RDB format.

1. Overview
-----------

Standardisation occurs **immediately after raw data are loaded** and before any scientific transformations. Its goal is to normalise:

- Variable names
- Dimensions
- Coordinate names
- Minimal core attributes (e.g., instrument, mooring, location)

This creates a uniform structure suitable for later trimming, filtering, and conversion.

2. Purpose
----------

- Remove variability in legacy file formats and naming conventions
- Enable consistent handling across deployments and years
- Attach the minimal metadata needed for downstream tracking (e.g., serial number, water depth, instrument depth, start and end times, location and mooring names)
- Provide a clean, consistent internal structure containing the raw data

3. Current Implementation (Stage 1)
-----------------------------------

The standardisation process is implemented in the :mod:`oceanarray.stage1` module, which provides automated conversion from native instrument formats to standardised NetCDF files.

3.1. Input Formats Supported
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :class:`oceanarray.stage1.MooringProcessor` class supports multiple instrument formats:

- **Sea-Bird CNV files** (`sbe-cnv`): Standard SBE 37 MicroCAT output
- **Sea-Bird ASCII files** (`sbe-asc`): Alternative SBE ASCII format
- **Nortek AquaDopp** (`nortek-aqd`): Current meter data with header files
- **RBR RSK files** (`rbr-rsk`): RBR logger binary format
- **RBR ASCII files** (`rbr-dat`): RBR text output

Each format is handled by specialized readers from the `ctd_tools` library that parse instrument-specific headers and data structures.

3.2. Processing Workflow
^^^^^^^^^^^^^^^^^^^^^^^^

The standardisation process follows these steps (a code sketch of steps 3 to 5 follows this list):

1. **Configuration Loading**: Read YAML configuration files containing:

   - Mooring metadata (location, water depth, deployment times)
   - Instrument specifications (serial numbers, depths, file locations)
   - Clock offsets and timing corrections (not applied in Stage 1)

2. **Data Reading**: Use the appropriate readers to parse native instrument files:

   - Extract time series data
   - Parse instrument metadata from headers
   - Handle instrument-specific data structures

3. **Variable Standardisation**: Convert to consistent naming:

   - Remove derived variables (e.g., potential temperature, density)
   - Standardise coordinate names
   - Apply consistent units and metadata

4. **Metadata Integration**: Add standardised attributes:

   - Global mooring information
   - Instrument-specific metadata
   - Deployment and recovery information
   - Quality control flags and processing history

5. **NetCDF Output**: Write to compressed, chunked NetCDF files with:

   - Optimised data types (float32, uint8 for flags)
   - Time-based chunking for efficient access
   - CF-compliant metadata structure

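The sketch below illustrates what steps 3 to 5 amount to for a single instrument record, written with plain `xarray` operations. The rename table, the list of derived variables, and the function names are illustrative assumptions for this example, not the actual internals of :mod:`oceanarray.stage1`.

.. code-block:: python

   import xarray as xr

   # Hypothetical name mappings; the real tables live inside
   # oceanarray.stage1 and its ctd_tools readers.
   RENAME_MAP = {"tv290C": "temperature", "sal00": "salinity", "prM": "pressure"}
   DERIVED_VARS = ["potemperature", "density"]  # dropped rather than carried through


   def standardise_dataset(ds, mooring_meta, instrument_meta):
       """Steps 3-4: rename variables, drop derived quantities, attach metadata."""
       ds = ds.rename({k: v for k, v in RENAME_MAP.items() if k in ds})
       ds = ds.drop_vars([v for v in DERIVED_VARS if v in ds])
       ds.attrs.update(mooring_meta)       # global mooring information
       ds.attrs.update(instrument_meta)    # serial number, nominal depth, ...
       return ds


   def write_standardised(ds, path):
       """Step 5: compressed, time-chunked NetCDF with reduced-precision encoding."""
       encoding = {
           var: {
               "dtype": "float32",
               "zlib": True,
               "chunksizes": (min(ds.sizes["time"], 86400),),
           }
           for var in ("temperature", "salinity", "pressure")
           if var in ds
       }
       ds.to_netcdf(path, encoding=encoding)
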
3.3. Configuration Format
^^^^^^^^^^^^^^^^^^^^^^^^^

Mooring configurations are defined in YAML files with the following structure:

.. code-block:: yaml

   name: mooring_name
   waterdepth: 1000
   longitude: -30.0
   latitude: 60.0
   deployment_time: '2018-05-01T12:00:00'
   recovery_time: '2019-05-01T12:00:00'
   directory: 'moor/raw/deployment_name/'
   instruments:
     - instrument: microcat
       serial: 12345
       depth: 100
       filename: 'data_file.cnv'
       file_type: 'sbe-cnv'
       clock_offset: 0

Here the ``name`` is the unique mooring identifier (for example, mooring "DS E" deployed for the first time in 2018 becomes "dsE_1_2018"). The ``waterdepth`` is the nominal water depth at the mooring site in metres. The ``longitude`` and ``latitude`` are the mooring coordinates in decimal degrees. The ``deployment_time`` and ``recovery_time`` are the UTC timestamps for the mooring deployment and recovery, in the format `YYYY-MM-DDTHH:MM:SS`. The ``directory`` is the path to the raw data files for this deployment. According to our convention, this is a directory named `moor/raw/deployment_name/`, where `deployment_name` is a unique identifier for the cruise on which the mooring was recovered.

The ``instruments`` list contains one entry per instrument on the mooring, with:

- ``instrument``: The type of instrument (e.g., ``microcat``, ``aquadopp``, ``rbr``)
- ``serial``: The instrument's serial number
- ``depth``: The instrument's nominal depth in metres
- ``filename``: The name of the raw data file
- ``file_type``: The type of the raw data file (e.g., ``sbe-cnv``, ``rbr-rsk``)
- ``clock_offset`` (optional, defaults to zero): The clock offset for the instrument in seconds

3.4. Usage Example
^^^^^^^^^^^^^^^^^^

.. code-block:: python

   from oceanarray.stage1 import MooringProcessor, process_multiple_moorings

   # Process a single mooring
   processor = MooringProcessor('/path/to/data/')
   success = processor.process_mooring('mooring_name')

   # Process multiple moorings
   moorings = ['mooring1', 'mooring2', 'mooring3']
   results = process_multiple_moorings(moorings, '/path/to/data/')

4. Output Format
----------------

The standardised output is a **raw-equivalent** `xarray.Dataset` with:

- **Dimensions**: `time` (primary coordinate)
- **Variables**: Standardised names (e.g., `temperature`, `salinity`, `pressure`)
- **Coordinates**: `time`, plus instrument metadata variables
- **Attributes**: Comprehensive metadata including instrument, mooring, and deployment information

Example output structure:

.. code-block:: python

   Dimensions:        (time: 124619)
   Coordinates:
     * time           (time) datetime64[ns] 2018-08-12T08:00:01 ... 2018-08-26T20:47:24
   Data variables:
       temperature    (time) float32 ...
       salinity       (time) float32 ...
       pressure       (time) float32 ...
       serial_number  int64 7518
       InstrDepth     int64 100
       instrument     ...

The legacy RAPID processing used the MATLAB script `microcat2rodb_3.m` for this step. This script performed similar functions to the current Python implementation:

.. literalinclude:: ../_static/code/microcat2rodb_3.m
   :language: matlab
   :lines: 1-40
   :linenos:
   :caption: Excerpt from `microcat2rodb_3.m`

The modern Python implementation in :mod:`oceanarray.stage1` provides equivalent functionality with improved:

- **Error handling**: Comprehensive logging and graceful failure modes
- **Format support**: Multiple instrument types through pluggable readers
- **Metadata management**: YAML-based configuration with validation (see the sketch below)
- **Output formats**: NetCDF with CF conventions for interoperability

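As an illustration of the YAML-based configuration described in section 3.3, the sketch below loads a mooring file and checks that the minimal keys are present. The required-key sets and the function name are assumptions made for this example; the actual validation performed by :mod:`oceanarray.stage1` may differ.

.. code-block:: python

   import yaml

   # Keys assumed required by the Stage 1 configuration described above;
   # the exact rules applied by oceanarray.stage1 may differ.
   REQUIRED_MOORING_KEYS = {"name", "waterdepth", "longitude", "latitude",
                            "deployment_time", "recovery_time", "directory",
                            "instruments"}
   REQUIRED_INSTRUMENT_KEYS = {"instrument", "serial", "depth", "filename",
                               "file_type"}


   def load_mooring_config(path):
       """Load a mooring YAML file and check that the minimal keys are present."""
       with open(path) as f:
           config = yaml.safe_load(f)

       missing = REQUIRED_MOORING_KEYS - config.keys()
       if missing:
           raise ValueError(f"Missing mooring keys: {sorted(missing)}")

       for inst in config["instruments"]:
           missing = REQUIRED_INSTRUMENT_KEYS - inst.keys()
           if missing:
               raise ValueError(
                   f"Instrument {inst.get('serial', '?')} missing: {sorted(missing)}"
               )
           inst.setdefault("clock_offset", 0)  # optional, defaults to zero

       return config
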
8. Integration with Processing Chain
------------------------------------

Standardised files from Stage 1 serve as input to subsequent processing steps:

- **Stage 2**: :doc:`trimming` - Remove pre/post-deployment data
- **Stage 3**: :doc:`calibration` - Apply post-cruise calibration offsets
- **Later stages**: :doc:`filtering`, :doc:`gridding`, :doc:`stitching` for array products

The consistent structure created during standardisation ensures that all downstream processing tools can operate on any instrument dataset without format-specific modifications (illustrated in the sketch below).

See also: :doc:`../oceanarray`, :doc:`trimming`, :doc:`calibration`
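As a minimal illustration of this point (and not the actual trimming implementation), the sketch below opens a Stage 1 file and restricts it to the deployment period. The filename and the attribute names holding the deployment and recovery times are assumptions for this example.

.. code-block:: python

   import xarray as xr

   # Hypothetical Stage 1 output file; any standardised file exposes the same
   # structure: a 'time' coordinate plus variables such as 'temperature'.
   ds = xr.open_dataset("standardised_instrument_file.nc")

   # Restrict to the deployment window; these attribute names are assumed here.
   deployed = ds.sel(time=slice(ds.attrs.get("deployment_time"),
                                ds.attrs.get("recovery_time")))

   print(deployed["temperature"].mean().item())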