2. Trimming to Deployment Period

This document outlines the trimming step in the oceanarray processing workflow. Trimming refers to the process of isolating the valid deployment period from a time series—typically corresponding to the interval between instrument deployment and recovery. This step also applies clock corrections and produces standardized NetCDF files ready for further processing.

This step may need to be run several times, adjusting the start and end of the deployment periods based on data inspection.

It corresponds to Stage 2 of RAPID data processing and management, which takes RDB-format files as input and writes *.use files (also in RDB format). Here, we start instead from the Stage 1 output *_raw.nc files and produce *_use.nc files.

1. Overview

Raw mooring records often contain extraneous data before deployment or after recovery (e.g., deck recording, values during ascent/descent, post-recovery handling). These segments must be trimmed to retain only the time interval when the instrument was collecting valid in-situ measurements at the nominal depth during deployment. In this stage:

  • Clock offsets are applied to correct instrument timing

  • Data is trimmed to deployment/recovery time windows

  • Unnecessary variables are removed

  • Metadata is enriched and standardized

  • Files are converted from *_raw.nc to *_use.nc format

2. Purpose

  • Apply clock corrections to synchronize instrument times

  • Remove non-oceanographic data before deployment and after recovery

  • Ensure temporal consistency across instruments on the same mooring

  • Define the usable deployment interval for each instrument

  • Standardize file naming and metadata structure

3. Current Implementation (Stage 2)

The trimming process is implemented in the oceanarray.stage2 module, which provides automated clock correction and temporal trimming for mooring datasets.

3.1. Input Requirements

The oceanarray.stage2.Stage2Processor class processes Stage 1 output files:

  • Raw NetCDF files (*_raw.nc): Standardized files from Stage 1 processing

  • YAML configuration: Mooring metadata including deployment times and clock offsets

  • Deployment windows: Start and end times for valid data periods

3.2. Processing Workflow

The Stage 2 processing follows these steps:

  1. Configuration Loading: Read YAML files containing:

     - Deployment and recovery timestamps
     - Instrument-specific clock offsets
     - Mooring location and metadata

  2. Clock Offset Application: Correct instrument timing by (see the sketch after this list):

     - Adding clock offset variables to datasets
     - Shifting time coordinates by specified offset amounts
     - Preserving original datasets (immutable processing)

  3. Temporal Trimming: Remove invalid data by:

     - Trimming to deployment time window (if specified)
     - Trimming to recovery time window (if specified)
     - Handling missing or invalid time bounds gracefully

  4. Metadata Enrichment: Ensure complete metadata by:

     - Adding missing instrument depth, serial number, and type
     - Preserving existing metadata without overwriting
     - Maintaining provenance information

  5. Variable Cleanup: Remove unnecessary variables:

     - Legacy time variables (e.g., timeS)
     - Derived variables not needed for further processing

  6. NetCDF Output: Write processed files with:

     - Updated filename suffix (*_use.nc)
     - Optimized compression and chunking
     - CF-compliant metadata structure
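
The snippet below is a minimal sketch of the clock-offset and trimming steps using xarray; the helper names, filenames, and example values are illustrative and do not reflect the actual internals of oceanarray.stage2.

import numpy as np
import xarray as xr

def apply_clock_offset(ds: xr.Dataset, offset_seconds: int) -> xr.Dataset:
    """Return a new dataset with the time coordinate shifted by the clock offset."""
    shifted = ds.assign_coords(time=ds['time'] + np.timedelta64(offset_seconds, 's'))
    shifted['clock_offset'] = offset_seconds  # record the applied correction
    return shifted

def trim_to_deployment(ds: xr.Dataset, deployment_time=None, recovery_time=None) -> xr.Dataset:
    """Return a new dataset restricted to the deployment window; either bound is optional."""
    start = np.datetime64(deployment_time) if deployment_time else None
    end = np.datetime64(recovery_time) if recovery_time else None
    return ds.sel(time=slice(start, end))

# Illustrative use on a Stage 1 file (hypothetical filename):
# ds_raw = xr.open_dataset('instrument_raw.nc')
# ds_use = trim_to_deployment(apply_clock_offset(ds_raw, 300),
#                             deployment_time='2018-08-12T08:00:00',
#                             recovery_time='2018-08-26T20:47:24')
# ds_use = ds_use.drop_vars(['timeS'], errors='ignore')  # variable cleanup (step 5)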

3.3. Configuration Format

Clock offsets and deployment times are specified in YAML configuration:

name: mooring_name
deployment_time: '2018-08-12T08:00:00'
recovery_time: '2018-08-26T20:47:24'
instruments:
  - instrument: microcat
    serial: 7518
    depth: 100
    clock_offset: 300  # seconds
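
As a minimal sketch (assuming the YAML above is saved as mooring_name.yaml; the filename convention is not specified here), the configuration can be read with PyYAML:

import yaml

with open('mooring_name.yaml') as f:
    config = yaml.safe_load(f)

deployment_time = config['deployment_time']   # '2018-08-12T08:00:00'
recovery_time = config['recovery_time']       # '2018-08-26T20:47:24'
for inst in config['instruments']:
    # e.g. 'microcat', serial 7518, clock offset of 300 s
    print(inst['instrument'], inst['serial'], inst.get('clock_offset', 0))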

3.4. Usage Example

from oceanarray.stage2 import Stage2Processor, process_multiple_moorings_stage2

# Process a single mooring
processor = Stage2Processor('/path/to/data/')
success = processor.process_mooring('mooring_name')

# Process multiple moorings
moorings = ['mooring1', 'mooring2', 'mooring3']
results = process_multiple_moorings_stage2(moorings, '/path/to/data/')

4. Output Format

The processed output includes:

  • Trimmed NetCDF files (*_use.nc) with:

    - Time coordinates corrected for clock offsets
    - Data restricted to the deployment period
    - Standardized metadata and variable structure
    - Optimized storage format

  • Processing logs with detailed information about:

    - Clock offsets applied
    - Trimming operations performed
    - Data volume changes
    - Error conditions encountered

Example output structure:

<xarray.Dataset>
Dimensions:        (time: 8640)  # Trimmed from original 124619
Coordinates:
  * time           (time) datetime64[ns] 2018-08-12T08:05:00 ... 2018-08-26T19:55:00
Data variables:
    temperature    (time) float32 ...
    salinity       (time) float32 ...
    pressure       (time) float32 ...
    serial_number  int64 7518
    InstrDepth     int64 100
    instrument     <U8 'microcat'
    clock_offset   int64 300
Attributes:
    mooring_name:           test_mooring
    deployment_time:        2018-08-12T08:00:00
    recovery_time:          2018-08-26T20:47:24
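
A trimmed file can be spot-checked against the deployment window recorded in its attributes. The check below is illustrative only (the filename is a placeholder), not part of the module itself:

import numpy as np
import xarray as xr

ds = xr.open_dataset('mooring_name_microcat_7518_use.nc')
assert ds['time'].values[0] >= np.datetime64(ds.attrs['deployment_time'])
assert ds['time'].values[-1] <= np.datetime64(ds.attrs['recovery_time'])
print(ds.sizes['time'], 'samples retained within the deployment period')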

5. Quality Control and Error Handling

Stage 2 processing includes comprehensive quality control:

  • Time Range Validation: Ensures deployment/recovery times are reasonable

  • Clock Offset Verification: Logs all timing corrections applied

  • Data Completeness Checks: Warns when trimming results in empty datasets

  • File Integrity: Validates input files before processing

  • Graceful Degradation: Continues processing other instruments if one fails

All processing activities are logged with timestamps for audit trails and debugging.
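
The data completeness check, for example, can be as simple as the following sketch (illustrative only; the function and logger names are not taken from the module):

import logging

logger = logging.getLogger('stage2')

def check_not_empty(ds, instrument_name):
    """Warn, rather than fail, when trimming leaves no time samples."""
    if ds.sizes.get('time', 0) == 0:
        logger.warning('Trimming produced an empty dataset for %s', instrument_name)
        return False
    return True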

6. Integration with Processing Chain

Stage 2 processed files serve as input to subsequent processing steps:

  • Stage 3: Further quality control and calibration applications

  • Later stages: filtering, gridding, stitching for array products

The consistent structure and timing corrections applied during Stage 2 ensure that downstream processing tools can operate reliably across different instrument types and deployments.

7. Historical Context: RAPID Processing

This trimming step evolved from the RAPID programme’s Stage 2 processing, which converted RDB format files from *.raw to *.use format. The modern implementation provides equivalent functionality with several improvements:

  • Automated Processing: Batch processing of multiple instruments and moorings

  • Enhanced Metadata: Comprehensive provenance and quality control information

  • Flexible Configuration: YAML-based configuration for easy modification

  • Error Recovery: Robust handling of missing files and invalid configurations

  • Modern Formats: NetCDF output with CF conventions for interoperability

8. Legacy Processing Scripts

The original RAPID processing chain used MATLAB scripts for trimming operations. The excerpt below performed similar functions to the current Python implementation:

Excerpt from microcat_raw2use_003.m
clear
% basic preprocessing for microcat data
%
% features
%      1. eliminate lauching and recovery period
%      2. save data to rodb file
%      3. create data overview sheet
%
% uses timeaxis.m, auto_filt.m, julian.m

% 11.01.01 Kanzow
% 13.08.02 Kanzow : debugged
%
%

% --- get mooring information from infofile


moor          = 'ebm1_2_200736';    % Mooring name

operator      = 'wallace';

plot_interval = [2007 11 07 0;   % start time of time axis on plot
                 2008 11 20 0];  % end time of time axis on plot

mc_id    = [333 337] ;         % microcat id

%-- path

% -- set path for data input and output

% inpath  = ['/data/rapid/cd170/moorings/',moor,'/microcat/'];
% outpath = ['/data/rapid/cd170/moorings/',moor,'/microcat/'];
% infofile =['/data/rapid/cd170/moorings/',moor,'/',moor,'info.dat'];

inpath  = ['~/Data/rpdmoc/rapid/data/moor/proc/',moor,'/microcat/']; % C Wallace changed paths for cruise d334
outpath  = ['~/Data/rpdmoc/rapid/data/moor/proc/',moor,'/microcat/'];
infofile = ['~/Data/rpdmoc/rapid/data/moor/proc/',moor,'/',moor,'info.dat'];

The modern Python implementation in oceanarray.stage2 provides equivalent functionality with improved:

  • Scalability: Process multiple moorings and years of data efficiently

  • Reproducibility: Automated processing with comprehensive logging

  • Flexibility: Easy modification of deployment windows and clock offsets

  • Integration: Seamless connection to subsequent processing stages

9. Implementation Notes

  • Time Handling: All times are processed as UTC datetime64 objects for consistency

  • Clock Corrections: Applied as integer second offsets to maintain precision

  • Immutable Processing: Original datasets are preserved; all operations return new objects

  • Flexible Trimming: Supports trimming by deployment time, recovery time, or both

  • Batch Processing: Efficient processing of multiple instruments and moorings
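
The immutability and integer-second points can be seen in a small example (values are illustrative): xarray's assign_coords returns a new object, so the original time axis is left untouched.

import numpy as np
import xarray as xr

original = xr.Dataset(coords={'time': np.array(['2018-08-12T08:00:00'], dtype='datetime64[ns]')})
corrected = original.assign_coords(time=original['time'] + np.timedelta64(300, 's'))

print(original['time'].values[0])   # original timestamp, unchanged
print(corrected['time'].values[0])  # shifted forward by 300 seconds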

10. FAIR Considerations

  • Findable: Standardized file naming and metadata structure

  • Accessible: NetCDF format with CF conventions for broad compatibility

  • Interoperable: Consistent data structure across instruments and deployments

  • Reusable: Comprehensive metadata and processing provenance

Trimmed intervals and clock corrections are documented transparently in dataset attributes and processing logs to maintain full provenance.

See also: oceanarray API, 1. Standardisation (Internally-consistent format), filtering