5/11/23 CELR Data Lake Migration Meeting notes

Date

May 11, 2023

Attendees:

APHL

CDC

Peraton

Dari Shirazi: X

Megan Light: X

Erroll Rosser: X

Vanessa Holley :

Teresa Jue: --

Kristin Peterson :

Brooke Beaulieu: X

Cheri Gatland-Lightener: --

Tom Russell: --

Mel Kourbage:

Norris Kpamegan : --

Marcelo Caldas: X

Gretl Glick: X

Ryan Harrison: X

Don Lindsay: X

Geo Miller: X

Marion ? : X

Alissa McShane: X

Goals

  • Update on Data Lake Migration Status

  • DEX Overview and Status Update

Discussion topics

Item

Notes

Overall Status Updates

  • DEX Development, Status Updates, and Timeline:

    • Upload API: Working on v1.1 (in progress), status of upload

      • v1.1.2: Will work on authenticating via SAMS SSO

      • Milestones on progress

  • Reviewing AIMS connection to DEX options: Research/Analysis: In Progress

    • Upload API - AIMS has reviewed the docs, but has not yet requested to be added to the DEX activity.

      • Pro: resumable upload technology looks like it solves the problem of supporting large file uploads to web APIs.

      • Con: it would be a client agent within AIMS that would need continual maintenance, the ability to track success/failure (queue/dead letter), and a team prepared to debug if certificates, credentials, etc. change.

    • AWS DataSync to Azure Blob Storage -

      • Background: the integration for Azure Blob Storage is notably a “preview” feature on AWS’s side.

      • Pro: job would be configured on each side with no code to maintain.

      • Con: this is a job that has to run on some cron frequency, so data will flow in chunks as opposed to a “real time” stream.

    • Alternative research: look at another project’s integration into Azure, which uses Azure Connectors for SQS and S3, to see if it would apply to this problem. This would match how AIMS already handles event-driven processing internally and would build on what the team already knows how to use.

      • Geo is working on prototyping this option and will have an update next time we meet

      • Need to investigate the effort on the Azure side to make configurations and/or services to use these connectors.

    • Research in progress, AIMS has not requested access

    • Other Options: Will continue to research

Mtg Notes:

  • CDC: Any needs from DEX team?

    • AIMS: Need to request access to evaluate options; Azure connections, could test; AWS DataSync--concerns re: chunking data and setting up EC2 instance; have not yet proceeded to testing this option, but can do so

    • Timeline: Staging data streams set up: ER: Currently in dev, working on QA env [AKA ONB env] by end of Q3 (Late June/July 2023) --[Peraton working on contract extension post 7/23/23]

    • CDC: So not quite ready to receive messages currently, for next ~6 weeks

    • Rough timeline: July 2023--but would like to test with dummy data

    • CDC: 2 week Sprint: Decide on transport option; Next sprint: test dummy data; 3rd sprint: prod data in Staging: ~July 2023

Questions

  1. CDC: Azure Connectors for SQS and S3 to Azure Blob: How is meta data communicated to CDC?

    1. Inside of SQS, would provide pointer of object being created, would download from S3--would need to research if meta data would be copied as part of step--need to determine if viable option

    2. CDC: Flagging this--need enough meta data to infer meta data for routing to appropriate program (long-term transport needs to factor in meta data communicated/retention for routing)--is meta data in SQS?

      1. AIMS: Could enhance SQS meta data; meta data is currently in S3, so would need to copy it to blob storage or have 2 different sections of object

      2. AIMS: Would need to run on DEX side, correct? SQS Event from AIMS, DEX would be consuming queue and downloading from S3 (portion which is running is on DEX Side)--would be using Azure connectors on DEX side; would configure Azure connector on DEX side, consume queue, and download data from S3--meta data--can confirm

  2. ER: Vision of frequency of data ingestion from DEX? CDC: Near real-time pull from SQS/S3 in this option (and in other options); latency of less than 5 minutes

    1. AIMS: Con for data sync is that it would rely on a cron job, so less than real-time, but will continue to research if it can be near real-time; Timeliness--Azure connector would be near real-time

      1. Other project has used this on AIMS, so will research (EIP+?); ER: Currently using test environment--upload to EIP, goes through S3 into Azure blob, uncertain of transport; MC: EIP+ had to create logic app, and ?, some challenges; Saxion? brings data from AWS into CDC data hub

  3. Is the solution being provided a long-term, broad solution?

    1. CDC: Yes, would hope to use this solution for other data streams, would need to conduct performance test to confirm

    2. AIMS: Would like to leverage use of meta data to migrate older data streams to CDC (e.g. PHINMS reporting streams); or other Data Lakes

    3. ER: Expand Meta Data requirements for other program streams migrations

      1. Current Data Lakes on AIMS:

        1. CELR: Lab Data: Close integration with DEX & CELR Migration teams

        2. EIP+M: Case surveillance Data (Includes FDD MMG (Case V2) & HAI MDRO MMG--Case V2) [Data comes to CDC/MVPS, then is routed to AIMS for storage]

          1. Working with programs

        3. DAART: Antimicrobial Resistance Lab Network Lab Data: Will need to enhance working relationships

    4. CDC: Intent is to identify meta data for transport mechanism, not the actual meta data keys needed for migrating other programs

  4. CDC: Do any of these use PHINMS? And do we have a list of PHINMS Connections?

    1. None of the projects use PHINMS

    2. AIMS: Yes, have a list of PHINMS Connections, but limited progress for retiring PHINMS

      1. CDC: PHINMS retirement strategy--unaware of specific CDC point of contact; AIMS: Have been working with CDC DMB group (MVPS/Joseph Mai xmk0@cdc.gov)--Flu data sent by PHINMS, AMD (4 Mirth channels from AIMS > CDC); fairly large volume of data flowing from AIMS to CDC via PHINMS

      2. DEX: Will need to identify CDC groups to create migration plan for each route --(Potentially Janie Williams is PM for PHINMS?)

  5. DS: Does Geo need additional Azure resources for prototyping? GM: Yes, would be good to have additional Azure resources, but long-term subscription would be NTH; RH: If GM has a CDC account (badged, CDC email), he would be able to request an SU account and Azure subscriptions, so easy to add; more challenging to add if he does not have a CDC account [Dari has CDC account, but needs to renew it]

    1. DS: Can have APHL provide paid Azure subscription

  6. DEX Dev/Migration Phases:

    1. Target: CELR Migration--run messages from pipeline --more immediate: July 2023

    2. Target: DEX transport: Pipeline creation: (dependent on routing mechanisms/meta data inclusion within CDC to programs being created)
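
Sketch for Question 1 (SQS message as a pointer to the S3 object): a minimal Python example, assuming the queue carries S3 event notifications as JSON with a "Records" list. The function name extract_s3_pointers and the sample bucket and key are hypothetical, not names used by AIMS or DEX. It only shows how a consumer on the DEX side could read the bucket and key from a message before downloading; it makes no AWS or Azure calls.

```python
import json
from urllib.parse import unquote_plus


def extract_s3_pointers(sqs_body: str) -> list[dict]:
    """Return the bucket/key pointers carried in one SQS message body.

    Assumes the body is an S3 event notification in JSON form.
    Records without an "s3" section are skipped.
    """
    event = json.loads(sqs_body)
    pointers = []
    for record in event.get("Records", []):
        s3 = record.get("s3")
        if s3 is None:
            continue
        pointers.append({
            "bucket": s3["bucket"]["name"],
            # Object keys arrive URL-encoded in the notification.
            "key": unquote_plus(s3["object"]["key"]),
            "size": s3["object"].get("size"),
        })
    return pointers


if __name__ == "__main__":
    # Hypothetical sample message, for illustration only.
    sample = json.dumps({
        "Records": [{
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "example-celr-bucket"},
                "object": {"key": "celr/2023/05/sample+file.hl7", "size": 2048},
            },
        }]
    })
    for pointer in extract_s3_pointers(sample):
        print(pointer)
```

The pointer in this sketch carries only bucket, key, and size. Any meta data needed for routing to a program would have to come from an enhanced SQS message or from the S3 object itself, which is the open question flagged above.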

Next Steps & Action Items

  • Next steps:

    1. @Geoffery Miller Continue to research options, decide on option, AIMS will create data flow diagram

    2. @dari.shirazi@aphl.org @Alissa McShane Provide Geo with paid Azure subscription for prototype/testing

    3. @Gretl Glick Next Meeting: Schedule: June 1, 2023 at 1pm ET

  • Previous Action Items-4/27/23:

    • Dari/AIMS to research S3 to Azure mechanism, determine options as to what is possible: In progress

      • GM: Validate both Azure and Data API upload documentation: In progress

      • RH: Will share Data API documentation: Complete

    • Gretl to schedule meeting for same timeslot, 2 weeks--5/11: Complete

Action items

Continue to research options, decide on option, AIMS will create data flow diagram @Geoffery Miller
Provide Geo with paid Azure subscription for prototype/testing @dari.shirazi@aphl.org or @Alissa McShane
@Gretl Glick Next Meeting: Schedule: June 1, 2023 at 1pm ET
