Marquez Monthly Community Meeting

The Marquez Community Meeting occurs on the fourth Thursday of each month. Meetings are held on Zoom.

August 25, 2022

July 28, 2022

Attendees:

  • TSC:
    • Willy Lulciuc, Co-creator of Marquez
    • Michael Collado, Staff Software Engineer, Astronomer
  • And:
    • Michael Robinson, Software Engineer, Dev. Rel., Astronomer
    • Minkyu Park, Senior Engineer, Astronomer
    • John Thomas, Software Engineer, Dev. Rel., Astronomer
    • Ross Turk, Senior Director of Community, Astronomer
    • Ryan Hatter, Customer Reliability Engineer, Astronomer
    • Howard Yoo, Staff Product Manager, Astronomer

Agenda:

  1. Announcements
  2. Introducing the Marquez blog
  3. Architecture review: the lineage graph
  4. Discussion
    1. Marquez issue #2048

Meeting:

Notes:

  1. Announcements [Willy]
  2. Introducing the Marquez Blog [Michael R. and Ross]
    1. new blog can be found at marquezproject.ai/blog
    2. designed and built by Ross
    3. to contribute a blog post on GitHub:
      1. write post in Markdown, place it in a new directory in OpenLineage/website/contents/blog
      2. OR: open an issue first to suggest a topic or get feedback on your idea
      3. artwork: Ross happy to make the images; tag him
      4. Ross also happy to document the artwork creation process for others
  3. Architecture review: the lineage graph [Willy]
    1. What is Marquez doing in the background to surface lineage metadata at the run level during execution?
    2. What is a current lineage graph?
      1. bipartite graph with nodes for jobs and datasets
      2. run-level lineage is collected from OpenLineage events (see the event sketch after these notes)
      3. a job is represented by the datasets it consumes and produces (its inputs and outputs)
      4. datasets are stitched together using the OpenLineage `ID` (globally unique)
      5. versioning of jobs enabled by OpenLineage `JobVersion`
        1. Marquez keeps track of changes to code and datasets behind the scenes
    3. Marquez data model
      1. Marquez keeps track of:
        1. job versions
        2. runs of each version
        3. sources
      2. each node represents the latest, or current, version of the job's lineage
      3. a `Job` consists of an `ID` and arrays representing its input and output datasets
    4. Demo
      1. UI defaults to latest/current graph
      2. prior versions accessible via `version history` tab
      3. selecting a version makes another job node/datasets visible
      4. makes "time travel" possible in your pipeline
      5. all of this possible thanks to the OpenLineage spec
    5. Q & A
      1. If a job has not completed, will you not see metadata? [Howard]
        • a job has to complete in order for the versioning logic to be applied
      2. Is a job version associated with the code that produced it? [Ryan]
        • yes – if the code is provided as a source location facet
        • Marquez will determine if the code has changed
        • changes to the schema are also monitored using dataset versioning; this is tied to the job version
  4. Discussion
    1. Howard: issue 2048
      1. There is an edge case (using a custom extractor) where the TaskMetadata's given input or output dataset would NOT have the fields populated (`dataset.fields = []`).
      2. metadata like this makes Marquez overwrite the existing version of the dataset with empty fields (see the schema-facet sketch after these notes)
      3. Proposal: Marquez should try to reuse the existing dataset version instead of overwriting it
    2. Agreed; question remains about how to do it [Willy]
      1. behavior reflects versioning logic
      2. possible solution: use `null` value in OL spec rather than empty array
      3. challenge: we want to avoid making assumptions
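
For reference, here is a minimal sketch of how run-level lineage metadata like this reaches Marquez: an OpenLineage run event naming a job, a run, and its input and output datasets, posted to Marquez's lineage endpoint. The host, namespace, job, and dataset names are placeholders, and the field set follows the OpenLineage spec as described above (a real event may also require a top-level `schemaURL`); this is illustrative, not the exact demo shown in the meeting.

```python
# A sketch of an OpenLineage run event posted to Marquez's lineage endpoint.
# Host, namespace, job, and dataset names are placeholders; field names follow
# the OpenLineage spec (a real event may also require a top-level schemaURL).
import uuid
from datetime import datetime, timezone

import requests

MARQUEZ_URL = "http://localhost:5000"  # assumed local Marquez instance

event = {
    "eventType": "COMPLETE",  # versioning logic is applied once a run completes
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "my-namespace", "name": "etl.orders_daily"},  # placeholder job
    "inputs": [{"namespace": "postgres://db.example.com", "name": "public.orders"}],
    "outputs": [{"namespace": "postgres://db.example.com", "name": "public.orders_summary"}],
    "producer": "https://example.com/my-pipeline",  # placeholder producer URI
}

# Marquez stitches the job and datasets named here into its lineage graph.
resp = requests.post(f"{MARQUEZ_URL}/api/v1/lineage", json=event)
resp.raise_for_status()
```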
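
And a sketch of the edge case behind issue #2048 as discussed above: the same dataset sent with a populated schema facet versus an explicitly empty field list, which currently causes Marquez to overwrite the schema it already knows. Names are placeholders, and the facet envelope fields required by the spec are omitted for brevity.

```python
# Two variants of the same dataset, as discussed for issue #2048.
# Names are placeholders; the facet envelope fields (_producer, _schemaURL)
# required by the spec are omitted for brevity.

# Normal case: the schema facet carries the dataset's fields.
dataset_with_schema = {
    "namespace": "postgres://db.example.com",
    "name": "public.orders",
    "facets": {
        "schema": {
            "fields": [
                {"name": "id", "type": "INTEGER"},
                {"name": "amount", "type": "DECIMAL"},
            ]
        }
    },
}

# Edge case: a custom extractor that cannot resolve columns emits an empty
# list, and Marquez currently versions the dataset with no fields,
# overwriting the schema it already knows about.
dataset_with_empty_fields = {
    "namespace": "postgres://db.example.com",
    "name": "public.orders",
    "facets": {"schema": {"fields": []}},
}
```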

June 23, 2022

Attendees:

  • TSC
    • Willy Lulciuc, Co-creator of Marquez
    • Julien Le Dem, Chief Architect, Astronomer
  • And
    • Martin Fiser, Head of Professional Services, Keboola
    • Michael Robinson, Software Engineer, Dev. Rel., Astronomer
    • Minkyu Park, Senior Engineer, Astronomer
    • John Thomas, Support Engineer, Astronomer
    • Naga Raghavarapu, Principal Software Engineer, Oracle
    • Ross Turk, Senior Director of Community, Astronomer

Agenda:

  • Announcements
  • Recent release: 0.23.0
  • User story by Martin Fiser (Keboola)
  • Open discussion

Meeting:

Notes:

  • Announcements [Willy]

    • Mqz/OL swag is still available!
    • Willy gave a talk on Mqz at Open Source Summit (LinuxCon)
  • Recent Release 0.23.0 [Michael R.]

  • Keboola Use Case [Martin]

    • Topic: OL integration with the Keboola platform
      • Overview of platform
        • modern data experience: data stack as a service
        • all-in-one service
        • writers/reverse ETL through component framework
        • enables version control, governance, etc., in workspaces
        • much metadata produced and collected, permitting visibility across entire pipeline
          • pipeline jobs
          • storage events
          • data loads/unloads
          • user-generated metadata
      • Purpose of OL integration
        • data governance to support users' feeding data to external tools
        • OL a "language" for speaking to various tools
        • offer API for OL information
        • native Keboola component
          • feeds OL information to an endpoint (e.g., Marquez)
          • can be orchestrated on customizable interval 
          • supports SSH
          • exports full job information to the endpoint
      • Demo
        • users have multiple projects on the platform
        • a few hundred components are offered to users out of the box (e.g., Google Drive, SQL, Python, Google Sheets)
        • metadata manually pushable to OpenLineage endpoint
        • orchestrator could benefit from parent/job support
      • Challenges
        • need: richer metadata 
          • component config
          • info about tables
        • lighter UI
          • reflects feedback about legibility
          • icon customizability
        • namespaces
          • connectivity between projects
        • more integrations
        • rounded logo
      • Q & A
        • Are you interested in contributing? [Julien]
          • would like to; possibly in the future
        • Would you like to open issues? (custom facets, UI) [Willy]
          • not currently able to
        • Are you using any integrations? java or python [Willy]
          • component can be anything in the docker container
          • multiple languages used in development
        • Customers using it already? [Conor]
          • some testing is going on
          • not in production yet
          • no plans to offer Marquez to customers
        • Does it work for every connector? [Conor]
          • each will produce at least a job
        • Auth model [Willy]
          • problem: slippery slope [Martin]
          • recommended at ingress level [Willy]
          • not a focus at the moment
          • contributions to related issues welcome
        • Is data discovery offered? [Naga]
          • built in with API 
          • additional tools can be added if the integration is seamless

May 26, 2022

Attendees:

TSC:

  • Willy Lulciuc, Co-creator of Marquez
  • Peter Hicks, Senior Engineer, Astronomer

And:

  • Ross Turk, Senior Director of Community, Astronomer
  • Minkyu Park, Senior Engineer, Astronomer
  • John Thomas, Support Engineer, Astronomer
  • Michael Robinson, Developer Relations Engineer, Astronomer
  • Joshua Wankowski, Associate Data Engineer, Northwestern Mutual
  • Sam Holmberg, Software Engineer, Astronomer
  • Dako Dakov, R&D Manager, VMware
  • Agita Jaunzeme, Community Manager, VMware
  • Radmila Radovanvic, Senior Data Engineer, Northwestern Mutual
  • Gage Russell, Data Engineer, Q2
  • Rae Green, Developer, Q2ebanking
  • Dimira Petrova, Supervisor of Data Analytics, VMware
  • Martin Fiser, Head of Professional Services, Keboola
  • Naga Raghavarapu, Principal Software Engineer, Oracle
  • Antoni Ivanov, Staff Engineer, VMware

Agenda:

  • Announcements
  • Use cases from Northwestern Mutual and VMware
  • New feature: linking job runs and datasets

Meeting:

Notes:

  • Announcements [Willy]

  • Northwestern Mutual Use Case [Joshua]

    • Big-picture role of Mqz at NWM
      • Mqz used to track data usage as a whole
      • Mqz is critical to data ops at NWM and has a special future there
    • Company background
      • Massive insurance co. with investment management arm
      • 150+ year history with many customer touch points
      • Massive data with lots of users
    • Rationale for adoption
      • OL is where I spend most of my time
      • These tools will be the industry standards for dataset usage going forward
      • We desired one data standard, not random internal standards
    • Breakdown of use case
      • We track the HOW of usage from initial consumption to end usage
      • We record data product usage over time
      • Bonus: improved security
        • can see how/which users are actually using data
        • allows comparison to security frameworks, double-checking of work
      • Visualization is key
        • helps in building reports and modeling huge data systems
        • we can check the entire platform stack from ingest to updates, normalization, end-usage
    • Personal perspective
      • Mqz is data ops for data processing
      • Will we have a data ops center in the future like we have currently with NOCs?
      • The visual language is the key strength of the tool
      • This is the future of data
    • Q & A
      • Are screenshots available? Do you use Spark? [Naga]
        • Can't share due to proprietary concerns
      • How much data? [Naga]
        • Can't be specific, but it's a lot!
      • It's exciting to see others excited about the project. Are you using any custom integrations? [Willy]
        • Yes, custom integrations support streaming and ingestions across the platform
  • VMware use case [Antoni]

    • Demo of VDK
    • Our motivation
      • Verification problems
      • OL/Mqz was the solution
      • The common standard provided by OL is essential
    • Why Mqz?
      • It's helpful in debugging complex jobs, troubleshooting
      • It's key to understanding usage for maintenance – e.g., enabling removal of irrelevant datasets, jobs
      • The shared metadata is useful
    • Diagram of architecture
    • Code demo
    • Suggestions
      • Add visualization of parent/child relationships [note: see PR 1935]
      • Make output searchable by metadata (e.g., make it possible to find all late jobs)
    • Our stack
      • Postgres, Presto, Snowflake, Greenplum db, Trino
    • Q & A
      • How many integrations in use? [Gage]
        • 100 teams, 1000s of tables
      • Are you using the Python client? [Willy]
        • Yes
      • It's amazing to get this feedback [Willy]
      • The grouping of jobs is hard, but we're addressing this
      • Feel free to open issues and contribute
  • New feature: linking job runs to datasets [Peter]
    • Recently added to jobs: `created_by` available on dataset views
    • Dataset versions also now available on the version history tab (see the API sketch after these notes)
      • Allows for historical introspection in case of an issue
        • Allows for seeing if the code changed, for example
  • Open discussion
    • Is anyone using the Python client for OL? [Gage]
      • Based on today's discussion, the answer is yes
    • Projects, docs are coming [Willy]
      • You can also use the Airflow integration for insight into the Python client
    • Column-level lineage has been added to OL [Willy]
      • We worked with Microsoft on the spec
      • Look for this in the API in the next few months
      • Feedback on this appreciated
    • What's in the roadmap for multi-tenancy? How can this be used in Mqz? [Naga]
      • For every event, route it through Kafka – we're working with a company to help us document this a bit more [Willy]
      • Alternate approach: use a namespace to add metadata
      • Issue with this: access control (see the project roadmap for more info) 
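
For reference, a minimal sketch of reading a dataset's version history through the Marquez REST API, which is what the version history tab surfaces. The endpoint path follows the Marquez API docs; the host, namespace, dataset name, and response field names below are assumptions used to illustrate the idea.

```python
# A sketch of reading a dataset's version history from the Marquez REST API.
# Host, namespace, and dataset name are placeholders; response field names
# are assumptions.
import requests

MARQUEZ_URL = "http://localhost:5000"   # assumed local Marquez instance
NAMESPACE = "my-namespace"              # hypothetical namespace
DATASET = "public.orders_summary"       # hypothetical dataset name

resp = requests.get(
    f"{MARQUEZ_URL}/api/v1/namespaces/{NAMESPACE}/datasets/{DATASET}/versions"
)
resp.raise_for_status()

# Each entry describes one historical version of the dataset; the run that
# produced it ties the version back to a specific job run.
for version in resp.json().get("versions", []):
    print(version.get("version"), version.get("createdAt"))
```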

April 28, 2022

Attendees:

TSC:

  • Willy Lulciuc, Co-creator of Marquez
  • Michael Collado, Staff Software Engineer, Astronomer
  • Julien Le Dem, Chief Architect, Astronomer

And:

  • Ross Turk, Senior Director of Community, Astronomer
  • Minkyu Park, Senior Engineer, Astronomer
  • John Thomas, Support Engineer, Astronomer
  • Michael Robinson, Developer Relations Engineer, Astronomer
  • Gage Russell, Data Engineer, Q2
  • Paweł Leszczyński, Data Engineer, GetInData
  • Joshua Wankowski, Associate Data Engineer, Northwestern Mutual
  • Dillon Stadther

Agenda:

  • 0.22.0 preview [Willy]
  • lifecycleStateChange support [Pawel]
  • Updates to job renaming and symlinking [Michael C.]

Meeting:

Notes:

  • Announcements [Willy]:

  • 0.22.0 Preview [Willy]:

    • lifecycleStateChange support will offer visibility into dataset lifecycle changes, including the deletion of tables (see the facet sketch after these notes)
    • Pawel:
      • change motivated by desire for more information about datasets
      • approach started out with the Spark integration
      • still more information about lifecycle changes is possible/desirable
      • additional feature idea: notification console friendly to backend developers
    • Additional possibility: grayed-out nodes on the graph for deleted datasets, plus logging to show lifecycle history
    • Pawel: panel on website could display changes to dataset over X days
      • Agreed. Create an issue and we can build on that idea.
    • Helm chart addition
      • allows annotations, e.g. Prometheus metrics
    • Support for renaming and redirection
      • introducing job hierarchy
      • symlink will permit visibility into name changes to datasets
  • Updates to job renaming and symlinking [Michael C.]

    • stemmed from a desire to tie related jobs together, e.g., jobs called by DAGs, even in cases where identical code is part of different chains
    • challenge: linking old jobs to the fully qualified version
    • motivating factor: changes to job names result in junk nodes on the graph
    • there was no way to remove the old job names from the graph
    • but there is frequently a need to keep track of old job names
    • hence the idea of symlinking a job
    • there is currently no API to do this
    • updating must be done manually:
      • add the UUID of the new job to the db
      • from that point on, the job history will redirect to the new job (with a 301)
    • future: API will make this possible programmatically
    • Willy: is documentation needed for this?
      • Yes, I will post a change to the README
      • We want to do the same thing for datasets
  • Open discussion
    • Gage: is a Helm repo coming?
      • Willy: Minkyu has looked into this
      • Willy: we want to add the Helm chart to the new website
      • Willy: this is on our radar
    • New release coming soon!
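
For reference, a sketch of the kind of OpenLineage output-dataset fragment the lifecycleStateChange support consumes, recording for example that a table was dropped. The facet and field names follow the OpenLineage spec as I understand it; the dataset names are placeholders, and the facet envelope fields required by the spec are omitted for brevity.

```python
# A sketch of an OpenLineage output-dataset fragment carrying the
# lifecycleStateChange facet. Dataset names are placeholders; the facet
# envelope fields (_producer, _schemaURL) required by the spec are omitted.
output_dataset = {
    "namespace": "postgres://db.example.com",   # placeholder source
    "name": "public.orders_staging",            # placeholder table
    "facets": {
        "lifecycleStateChange": {
            # Other values defined by the spec include CREATE, ALTER,
            # OVERWRITE, RENAME, and TRUNCATE.
            "lifecycleStateChange": "DROP",
        }
    },
}
```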

March 31, 2022

Attendees:

TSC:

  • Willy Lulciuc, Co-creator of Marquez
  • Michael Collado, Staff Engineer, Astronomer
  • Julien Le Dem, Chief Architect, Astronomer
  • Peter Hicks, Senior Engineer, Astronomer

And:

  • Ross Turk, Sr. Director of Community, Astronomer
  • Minkyu Park, Senior Engineer, Astronomer
  • John Thomas, Support Engineer, Astronomer
  • Michael Robinson, Developer Relations Engineer, Astronomer
  • Howard Yoo, Staff Product Manager, Astronomer

Agenda:

  • Website update
  • Backlog and roadmap discussion
  • Open discussion

Meeting:

Slides

Notes:

Announcements [Michael R.]

  • Marquez stickers are now available: https://www.astronomer.io/datakin-swag
  • Willy and Julien gave a talk on OpenLineage, Airflow and Marquez at Data Council Austin on March 23
  • The project's GitHub star count stands at 983. Have you starred the project yet?
  • 1k stars is a requirement for graduation status from the LF AI Foundation. The project is nearing completion of all requirements, so a formal application will be possible soon.

Website [Ross]

  • The project now has a new website.
  • Appropriately, it's an open-source project; PRs are welcome.
  • Tech: Gatsby, GitHub Projects
  • Dev: run `yarn deploy` to work on it
  • Plans: blog page. Proposals for posts welcome – post them in Slack or open a PR if you prefer.

Backlog and roadmap [Willy]

  • Issue: currently, PRs are driven by a small team (e.g., Peter's view for dataset versions, Pawel's lifecycle PR)
  • How to get the broader community involved? Want people to have more input/control over the issues we take up.
  • Solution: GitHub's Roadmap feature. Milestones and releases visible there. Choose Marquez on the Projects tab.
  • Process: review issues on monthly basis, move to roadmap, then release.
  • Question from Howard about how to propose new features
  • Follow-up work: discussion of how to prioritize issues; documentation needed about how to label new issues (e.g., as "features")
  • Comment from Michael C.: it's possible to add new columns to the roadmap, in addition to new issues.

Open discussion

  • Michael C.: please note issue #1928: supporting job grouping and hierarchy.
    • Problem: the project does not track parent/child job relationships, despite this nomenclature being used in OpenLineage to describe related jobs.
    • Proposal: a `parent_job_id` column should be added to the `jobs` table and to the `runs` table, both being UUIDs.
  • Michael R.: please note that the meeting typically takes place on the 4th Thursday of each month.

February 24, 2022

Attendees:

TSC:

  • Willy Lulciuc, Co-creator of Marquez
  • Michael Collado, Staff Engineer, Datakin

And:

  • Minkyu Park, Senior Engineer, Datakin
  • Michael Robinson, Developer Relations Engineer, Datakin
  • Ross Turk, VP of Marketing, Datakin

Agenda:

  • Review of integrations to create runs and associate metadata with runs (replaced with OpenLineage)
  • Demo: How to collect OpenLineage events with the lineage API to send metadata to Marquez
  • Demo: OL Java client
  • Dataset lifecycle management
  • Open discussion

Meeting:

Slides

Notes:

  • Announcements [Willy]

    • Release date of 0.21.0 is now 2/28
    • Confusion in the community about which Java client to use is being addressed in OpenLineage PR #480
      • We hope to have this merged for the next OL release
  • Integrations and OL demo [Willy]

    • OL integration
      • Available at openlineage.io/integration/, where you can also find instructions for installing and configuring it
      • `requirements.txt` needs to include `airflow`
      • Set the OpenLineage URL to your local instance of Marquez
      • Marquez is moving towards using a task listener to pull metadata in real time 
      • For now use the OL Airflow DAG
      • You can still use the OL backend; there are limitations there, however
    • Spark integration
      • When running the `spark-submit` command, you need to provide configuration specifying the extra listener (thanks to Michael C. for his work on this); see the configuration sketch after these notes

      • Point the host to your deployment

      • See the OL website for more details (openlineage.io/integration/spark)
    • Upcoming: Flink and Kafka
    • Your feedback on these integrations appreciated
    • There are many connections you can use in your platform by switching over to OL to collect metadata
  • OL Java client demo [Willy]

  • Dataset lifecycle management [Willy]

    • Marquez can now capture changes to dataset names
    • Community voiced desire for this feature
    • Marquez now supports soft deletes of datasets
    • See PR #1847
    • Support of lifecycle now more concrete: can see the phases datasets go through
  • Open discussion

    • Julien and Willy will be speaking in-person at the Data Council conference in Austin next month (March 23-24)
    • Michael C. will be presenting virtually at the Subsurface LIVE conference (March 2-3); topic: Spark 
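
For reference, a minimal sketch of the Spark configuration described above, expressed with PySpark rather than `spark-submit` flags. The listener class and the `spark.openlineage.*` keys follow the OpenLineage Spark integration docs of the time and may have since changed; the package version, Marquez URL, and namespace are placeholders.

```python
# A sketch of wiring the OpenLineage Spark listener into a PySpark session.
# The spark.openlineage.* keys reflect the docs of the time and may have
# changed; the package version, URL, and namespace are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("openlineage_example")
    # Pull in the OpenLineage Spark agent (version is a placeholder).
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.6.1")
    # Register the extra listener that emits OpenLineage events.
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    # Point the integration at your Marquez (or other OpenLineage) deployment.
    .config("spark.openlineage.host", "http://localhost:5000")
    .config("spark.openlineage.namespace", "spark-jobs")
    .getOrCreate()
)
```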

January 27, 2022

Attendees:

TSC:

  • Willy Lulciuc, Co-creator of Marquez
  • Julien Le Dem, CTO of Datakin
  • Michael Collado, Staff Engineer, Datakin
  • Peter Hicks, Senior Engineer, Datakin
  • Kevin Mellott, Assistant Director of Data Engineering, Northwestern Mutual

And:

  • Ross Turk, VP of Marketing, Datakin
  • Minkyu Park, Senior Engineer, Datakin
  • John Thomas, Support Engineer, Datakin
  • Michael Robinson, Developer Relations Engineer, Datakin

Agenda:

  • Marquez recent releases overview [Willy] 
    • Marquez release 0.21.0 overview
      • Upgrade to Java 17
  • Migrating integrations to OpenLineage [Willy]
  • Cloud-based development instance of Marquez via Gitpod [Peter]
  • Open discussion

Meeting:

Slides

Notes:

  • 0.21.0 overview [Willy]

    • Features:
      • Bug fixes
      • Removal of excess code
      • Upgrade to Java 17
        • API image migrated
        • Eclipse Temurin integrated
        • All CI deployments updated to support Java 17
    • Discussion [Kevin, Willy, Michael C.]:
      • Support for the Java client on a lower Java version is possible
      • Proposed: schedule a separate meeting about this
  • Migrating integrations to OpenLineage [Willy]

    • Spark library in Marquez now deprecated
    • Use of OpenLineage Spark integration recommended going forward
      • review the docs about how to configure your instance
      • remember to add the underscore to `marquez_airflow`
    • OpenLineage integration allows task listener
      • workaround: import DAG from OpenLineage (see the DAG sketch after these notes)
    • See the changelog: environment variables for the Airflow instance have changed
  • Cloud-based development instance of Marquez [Peter]

    • Enabled by integration of Gitpod
    • Docker image in the cloud with Marquez and UI
    • Ideal for those not ready to install everything locally or who are having issues with their OS
    • Fast (30 seconds), eliminates risk
    • API also available
    • Can be made private or public
    • Big advantage: shareable within organizations via URL
    • Supports everything one could do locally in VS Code or similar IDE
    • Discussion [Willy, Peter, Kevin, Julien]:
      • common use case: potential users want to see metadata from their org and share the tool
      • potential side-effect: increase in Docker pulls
      • availability of metrics unknown
      • email address required
  • Open Discussion

    • Advantages of a possible move from CircleCI to GitHub Actions
      • CircleCI downsides: outages, billing issues [Willy]
      • Julien proposed: moving to GitHub Actions eventually after running both in parallel
      • Kevin asked to experiment with GitHub Actions and report back
    • Issue #1800: add support for table operations reported from OpenLineage
      • Formal solution needed [Willy]
      • Willy proposed: deploy in two modes and use flags (Julien agreed)
    • NodeID
      • An easy win: add a field that returns a nodeID [Willy]
      • Willy proposed: prioritize in next release
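
For reference, a minimal sketch of the DAG-import workaround mentioned above, assuming Airflow 2.x and the `openlineage-airflow` package of the time. The environment variable names follow the OpenLineage docs as I recall them; the DAG id and namespace are placeholders, and the changelog should be checked for current values.

```python
# A sketch of the DAG-import workaround: replace Airflow's DAG import with
# the OpenLineage-provided DAG so task metadata is emitted to Marquez.
# Assumes Airflow 2.x; the DAG id and namespace are placeholders.
from datetime import datetime

from airflow.operators.bash import BashOperator
from openlineage.airflow import DAG  # instead of: from airflow import DAG

# The integration reads its target from environment variables, e.g.:
#   OPENLINEAGE_URL=http://localhost:5000   (your Marquez instance)
#   OPENLINEAGE_NAMESPACE=my-namespace      (placeholder namespace)

with DAG(
    dag_id="example_openlineage_dag",   # placeholder DAG id
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
) as dag:
    BashOperator(task_id="hello", bash_command="echo hello")
```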

Marquez Workflow Group Calendar Overview

Effective March 22, 2019, group calendars are managed within LF AI Foundation Groups.io subgroups (mail lists), with each subgroup (mail list) having a unique group calendar. Meeting invites from these group calendars are sent to the applicable subgroup (mail list). To see the various group calendars, you must:

View Instructions on How to Subscribe to LF AI Group Calendars

For detailed information on LF AI meeting management processes, view this page: LF AI Foundation - Community Meetings and Calendars



Marquez Meetings List

  • Schedule: Day of Week (frequency) 00:00 AM/PM - 00:00 AM/PM (timezone)
  • Title: Meeting Title (Zoom Account Used)
  • Owner: Meeting Owner/Moderator
  • Subgroup (mail list): marquez-mail-list@lists.lfai.foundation
  • Purpose: Meeting Purpose
  • Dial In Link: Zoom Name: https://zoom.us/...
