4/27/23 CELR Data Lake Migration Meeting notes

Date

Apr 27, 2023

Attendees:

APHL

  • Dari Shirazi: X

  • Vanessa Holley: x

  • Brooke Beaulieu: x

  • Mel Kourbage: x

  • Gretl Glick: x

CDC

  • Megan Light: --

  • Teresa Jue: --

  • Cheri Gatland-Lightener: x

  • Norris Kpamegan: x

Peraton

  • Erroll Rosser: x

  • Kristin Peterson:

  • Tom Russell:

  • Marcelo Caldas: X

  • Don Lindsay: x

Goals

  • Update on Data Lake Migration Status

  • DEX Overview and Status Update

Discussion topics

Item

Notes

Welcome & Introduction

 

  • Project Status:

    • Data Lake Migration Background:

    • 2 main pieces to migration:

      • 1- Capability to process data in Azure at CDC

      • 2-Integration with DEX at CDC Front Door

      • 1-CELR Pipeline: validation of HL7 and CSV CV19 data, merged into a single Lab Record from both formats

        • 4 major areas: Focus on Pipeline--Key Activities

        • 1-Re-engineer CELR CSV and HL7 Pipelines from Amazon Web Services (AWS) to CDC Azure Cloud.

        • 2-Splitting pieces: Migrate Ingest, Validation, and Redact functionality for COVID-19 Lab Records submitted to CDC from CELR to Data Exchange (DEX).

        • 3-Develop transformation and lab result algorithms for COVID-19 Lab Results submitted to CDC

        • 4-Re-engineer CELR Portal within CDC Azure Cloud

        • CDC EOY: June 30, 2023 (target end of fiscal year)

      • Dependencies (as of 3/6/23):

        • DEX must be able to ingest data from AIMS, and to receive/store it in a format consistent with how CV19 data is currently ingested

        • Pilot/test timeframes within DEX Team will hopefully be shared (quarterly review week of 3/16)

        • DEX focus has been on HL7 pipeline development; it has not yet focused on the CSV pipeline (the bulk of CELR data is currently submitted via CSV)

    • Dependencies long-term:

      • AIMS to DEX connectivity, what data will need to be received in which environment, transport protocols, pilot testing

Overall Status Updates

  • DEX Development, Status Updates, and Timeline:

  • DEX & AIMS team to discuss provisioning S3 Bucket

    • DEX: Handling ingestion and validation services, receiving Immunization data from IZ Gateway (prod data)--sending via API

    • Ask from AIMS: API ingestion of HL7 and CSV. HL7 v2 pipeline in progress--Sept 23, testing; testing with HL7 (consuming lab and case data) should be ready for Q4

      • CSV Pipeline just starting to be built, CV, GENV2; Timeline TBD for ingesting prod data

    • AIMS: What about existing data (historic data)?

      • ER: Redshift data for CELR? OR Kafka? Ingress S3 bucket for CELR, S3 bucket for parsed data for Redshift

      • DS: S3 buckets are parsed (both CSV and HL7)--how are we sending existing data to CDC? 1 HL7 message at a time? Connecting to existing S3?

      • ER: Need to discuss and determine historic data migration plans; the intent is for existing raw HL7 data to be ingested into DEX; will need to discuss transmission of data into DEX; egress data which has already been validated and translated/transformed should not come into DEX--need to determine if that is migrated to EDAV; different S3 buckets will be migrated into different systems

      • DS: Future data: will go into DEX (correct); historic data (4.5 billion data for CELR)--will need to determine a different migration plan (this is probably out of scope for DEX)

      • LM: Need raw data for DEX, would need to be able to test with data, how AIMS will onboard with data api

      • DS: Test data: could try to send through pipeline, but very few test cases/low volume--few/limited test data set to send (CELR production data was very quickly spun up)

        • If CDC testing data is secure, could try to send prod data to endpoint

        • LM: Prod data can be sent to staging environment (ATO in place in CDC)

      • GM/AIMS: No comments yet

      • LM/CDC: What does test file look like? Could potentially phase testing if small volume--would be good to have both CSV and HL7 files to be able to test

        • Erroll: Files in CELR, raw HL7 or raw batch HL7, or CSV file

          • Encryption--do not think it is applied to data for CELR

      • AIMS: CSV processing has improved over the past 3 years--do you want to re-process older files? How clean should this data be? Consideration for data migration plans/processes

        • ER: Not certain about file validation which is being built in CSV pipeline, lessons learned over the past few years--impacting performance and data quality; processing has been refined; Erroll had previously provided validation requirements to CDC for review

        • CDC: Ongoing discussion as to how to handle, still trying to determine enterprise level CSV Validation services versus programmatic validation (content/data quality validation)--need to determine/tease out before we decide on re-processing of data

        • Peraton: Balance b/w CSV structure validation and data quality/content validation was important for CELR, and was handled case by case

      • AIMS: Testing with small data streams; will need to test size of batch HL7 files--how large can data api accept? Performance/speed considerations (CA File sizes are VERY LARGE)--would want to test prior to prod cutover

      • CDC: Data APIs: No hard limit on data APIs, and uploads are resumable; unless files are larger than 100 GB, would not expect to run into a size limit; soft guidance--anything under 10 GB should be ok; 10-100 GB--should work, but would monitor; more than 100 GB--would want to consider other transport mechanisms; terabyte size…

        • Assumption is that individual files are under 10 GB

        • Should not be hitting firewall/timeout limit

      • AIMS: File size limit has been challenge in past, so would want to test; some of the challenges would be around firewalls, timeouts

        • CA: 700-800MB files, CDC: should not be issue for data api

      • CDC: In migration plan, is there a preferred transit mechanism b/w AIMS & DEX? Going forward, does AIMS have preference for sending HL7 to DEX?

        • ER: Not yet determined, data currently sitting in S3 buckets, would need to discuss best avenue to transport data

        • AIMS:

        • CDC: Just discussing this as a first use case for CV19 ELR data, but need to consider other use cases and data streams, and whether a data API would be sufficient or other solutions are needed

        • AIMS: Preference would be S3--secure,

        • CDC: From AIMS…to CDC?

        • AIMS: Does CDC want us to use the data API, or are we considering other transit mechanisms? Makes sense to concentrate on 1 technical solution for all data streams

        • CDC: Uncertain as to number of potential data streams coming from AIMS to CDC; since APHL is a huge source of data streams, does it make more sense to set up a bucket system (would usually not do this for lower volume systems)? CDC cannot maintain bucket-to-bucket connections for all small-volume data senders--data API was identified as a potential service for lower volume; open to exploring other options for large-volume senders (such as APHL)

        • Options:

          • Upload Data API (advantage is that it is integrated with SAMS for identification/verification; minimum metadata enforced)

          • S3 bucket to bucket

          • VERY LARGE VOLUME

        • AIMS: Preference would be S3 bucket API, would need to review documentation

      • CDC: How many transport streams go into APHL?

        • APHL: Quite a few--PHINMS, Mirth, S3 EIP+, ELIMS

        • Long-term--would like to fully retire PHINMS

      • CDC: Would you be willing to try the upload API? AIMS: Sure, we can try--with any API, would need to ensure that we retry submissions if they fail; an advantage of S3 is that this is already working, so would need to identify how to do so with other API services; can review documentation if available

      • DS: Let’s pause on a final decision--would prefer S3 to S3 connection if this is an option (known commodity); ER: Currently doing this for EIP+

      • MC: Would be S3 to Azure connection

      • CDC: How is metadata communicated over an S3 connection? Is this durable? Do you have a schema? We would be referencing S3 to Azure

        • AIMS: S3 object metadata; not sure if this is durable (would it carry from S3 to Azure blob store?)

        • AIMS: may vary across projects, but believe there are common S3 schema/metadata

      • CDC:

      • CDC: Every transmission has destination ID and event type (required); can require additional metadata based on programs

      • Need way to ensure metadata is transferred from S3 to Azure blob

      • AIMS: Azure blob store has S3 schema

      • ER: Only doing this in test environment, not certain if we receive metadata; if API is used, may not get jurisdiction metadata, which we would need--will need to consider these details

    • AIMS: Need to do further research on preferred

    • ER: Has test messages available--can share with DEX Team (previously

    • Next steps:

      • Dari/AIMS to research S3 to Azure mechanism, determine options as to what is possible

        • GM: Validate both Azure and data api upload documentation

        • RH: Will share Data API documentation

      • Erroll has previously shared test files with Marion

      • Gretl to schedule meeting for same timeslot, 2 weeks--5/11
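
As a rough illustration of the CDC guidance recorded above (soft size tiers for the data API, plus destination ID and event type required on every transmission), here is a minimal Python sketch of a pre-submission check. The key names `destination_id` and `event_type`, the function names, the tier labels, and the use of decimal gigabytes are all assumptions made for illustration; none of them come from DEX documentation.

```python
GB = 1000 ** 3  # assumption: decimal gigabytes; the notes do not say GB vs GiB

# Hypothetical key names for the two required metadata fields
# (destination ID and event type) mentioned in the discussion.
REQUIRED_METADATA = ("destination_id", "event_type")


def size_tier(size_bytes: int) -> str:
    """Map a file size onto the soft guidance tiers from the discussion."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes < 10 * GB:
        return "ok"  # under 10 GB: should be ok
    if size_bytes <= 100 * GB:
        return "monitor"  # 10-100 GB: should work, but monitor
    return "consider_other_transport"  # over 100 GB: look at other mechanisms


def missing_metadata(metadata: dict) -> list:
    """Return the required metadata keys that are absent or empty."""
    return [key for key in REQUIRED_METADATA if not metadata.get(key)]
```

For example, the 700-800 MB CA files mentioned above fall in the first tier: `size_tier(800 * 10**6)` returns `"ok"`. A transmission carrying only a destination ID would be flagged by `missing_metadata` as lacking `event_type`.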

Action items

  • Dari/AIMS to research S3 to Azure mechanism, determine options as to what is possible

    • GM: Validate both Azure and data api upload documentation

    • RH: Will share Data API documentation

  • Gretl to schedule meeting for same timeslot, 2 weeks--5/11
