


Marquez Monthly Community Meeting

The Marquez Community Meeting occurs on the fourth Thursday of each month. Meetings are held on Zoom.

June 23, 2022

Agenda:

  • Announcements
  • Recent release: 0.23.0
  • User story by Martin Fiser (Keboola)
  • Open discussion

Meeting:

Notes:

May 26, 2022

Attendees:

TSC:

  • Willy Lulciuc, Co-creator of Marquez
  • Peter Hicks, Senior Engineer, Astronomer

And:

  • Ross Turk, Senior Director of Community, Astronomer
  • Minkyu Park, Senior Engineer, Astronomer
  • John Thomas, Support Engineer, Astronomer
  • Michael Robinson, Developer Relations Engineer, Astronomer
  • Joshua Wankowski, Associate Data Engineer, Northwestern Mutual
  • Sam Holmberg, Software Engineer, Astronomer
  • Dako Dakov, R&D Manager, VMware
  • Agita Jaunzeme, Community Manager, VMware
  • Radmila Radovanvic, Senior Data Engineer, Northwestern Mutual
  • Gage Russell, Data Engineer, Q2
  • Rae Green, Developer, Q2ebanking
  • Dimira Petrova, Supervisor of Data Analytics, VMware
  • Martin Fiser, Head of Professional Services, Keboola
  • Naga Raghavarapu, Principal Software Engineer, Oracle
  • Antoni Ivanov, Staff Engineer, VMware

Agenda:

  • Announcements
  • Use cases from Northwestern Mutual and VMware
  • New feature: linking job runs and datasets

Meeting:

Notes:

  • Announcements [Willy]

  • Northwestern Mutual Use Case [Joshua]

    • Big-picture role of Mqz at NWM
      • Mqz used to track data usage as a whole
      • Mqz critical at NWM to data ops, has special future here
    • Company background
      • Massive insurance co. with investment management arm
      • 150+ year history with many customer touch points
      • Massive data with lots of users
    • Rationale for adoption
      • OL is where I spend most of my time
      • These tools will be the industry standards for dataset usage going forward
      • We desired one data standard, not random internal standards
    • Breakdown of use case
      • We track the HOW of usage from initial consumption to end usage
      • We record data product usage over time
      • Bonus: improved security
        • can see how/which users are actually using data
        • allows comparison to security frameworks, double-checking of work
      • Visualization is key
        • helps in building reports and modeling huge data systems
        • we can check the entire platform stack from ingest to updates, normalization, end-usage
    • Personal perspective
      • Mqz is data ops for data processing
      • Will we have a data ops center in the future like we have currently with NOCs?
      • The visual language is the key strength of the tool
      • This is the future of data
    • Q & A
      • Are screenshots available? Do you use Spark? [Naga]
        • Can't share due to proprietary concerns
      • How much data? [Naga]
        • Can't be specific, but it's a lot!
      • It's exciting to see others excited about the project. Are you using any custom integrations? [Willy]
        • Yes, custom integrations support streaming and ingestions across the platform
  • VMware use case [Antoni]

    • Demo of VDK
    • Our motivation
      • Verification problems
      • OL + Mqz was the solution
      • The common standard provided by OL is essential
    • Why Mqz?
      • It's helpful in debugging complex jobs, troubleshooting
      • It's key to understanding usage for maintenance – e.g., enabling removal of irrelevant datasets, jobs
      • The shared metadata is useful
    • Diagram of architecture
    • Code demo
    • Suggestions
      • Add visualization of parent/child relationships [note: see PR 1935]
      • Make output searchable by metadata (e.g., make it possible to find all late jobs)
    • Our stack
      • Postgres, Presto, Snowflake, Greenplum db, Trino
    • Q & A
      • How many integrations in use? [Gage]
        • 100 teams, 1000s of tables
      • Are you using the Python client? [Willy]
        • Yes
      • It's amazing to get this feedback [Willy]
      • The grouping of jobs is hard, but we're addressing this
      • Feel free to open issues and contribute
  • New feature linking job runs to datasets [Peter]
    • Recently added to jobs: created_by available on dataset views
    • Dataset versions also now available on version history tab
      • Allows for historical introspection in case of an issue
        • Allows for seeing if the code changed, for example
  • Open discussion
    • Is anyone using the Python client for OL? [Gage]
      • Based on today's discussion, the answer is yes
    • Projects, docs are coming [Willy]
      • You can also use the Airflow integration for insight into the Python client
    • Column-level lineage has been added to OL [Willy]
      • We worked with Microsoft on the spec
      • Look for this in the API in the next few months
      • Feedback on this appreciated
    • What's in the roadmap for multi-tenancy? How can this be used in Mqz? [Naga]
      • For every event, route it through Kafka; we're working with a company to help us document this a bit more [Willy]
      • Alternate approach: use a namespace to add metadata
      • Issue with this: access control (see the project roadmap for more info) 

April 28, 2022

Attendees:

TSC:

  • Willy Lulciuc, Co-creator of Marquez
  • Michael Collado, Staff Software Engineer, Astronomer
  • Julien Le Dem, Chief Architect, Astronomer

And:

  • Ross Turk, Senior Director of Community, Astronomer
  • Minkyu Park, Senior Engineer, Astronomer
  • John Thomas, Support Engineer, Astronomer
  • Michael Robinson, Developer Relations Engineer, Astronomer
  • Gage Russell, Data Engineer, Q2
  • Paweł Leszczyński, Data Engineer, GetInData
  • Joshua Wankowski, Associate Data Engineer, Northwestern Mutual
  • Dillon Stadther

Agenda:

  • 0.22.0 preview [Willy]
  • lifecycleStateChange support [Pawel]
  • Updates to job renaming and symlinking [Michael C.]

Meeting:

Notes:

  • Announcements [Willy]:

  • 0.22.0 Preview [Willy]:

    • lifecycleStateChange support will offer visibility into dataset lifecycle changes, including deleting of tables
    • Pawel:
      • change motivated by desire for more information about datasets
      • approach started out with the Spark integration
      • still more information about lifecycle changes is possible/desirable
      • additional feature idea: notification console friendly to backend developers
    • Additional possibility: grayed out nodes on graph for deleted datasets, logging to show lifecycle history
    • Pawel: panel on website could display changes to dataset over X days
      • Agreed. Create an issue and we can build on that idea.
    • Helm chart addition
      • allows annotations, e.g. Prometheus metrics
    • Support for renaming and redirection
      • introducing job hierarchy
      • symlink will permit visibility into name changes to datasets
  • Updates to job renaming and symlinking [Michael C.]

    • stemmed from desire to tie linked jobs together, e.g., jobs called by DAGs, even in cases where identical code is part of different chains
    • challenge: linking old jobs to fully qualified version
    • motivating factor: changes to job names result in junk nodes on the graph
    • there was no way to remove the old job names from the graph
    • but there is frequently a need to keep track of old job names
    • hence the idea of symlinking a job
    • currently there's no API to do this
    • updating must be done manually currently
      • add the UUID of the new job to the db
      • from that point on, the job history will redirect to the new job (with a 301)
    • future: API will make this possible programmatically
    • Willy: is documentation needed for this?
      • Yes, I will post a change to the README
      • We want to do the same thing for datasets
  • Open discussion

    • Gage: is a Helm repo coming?
      • Willy: Minkyu has looked into this
      • Willy: we want to add the Helm chart to the new website
      • Willy: this is on our radar
    • New release coming soon!
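
The manual job-symlinking step from Michael C.'s update (add the UUID of the new job to the db, after which lookups of the old name redirect with a 301) could look roughly like the sketch below. It is not Marquez's actual schema or code: the jobs table layout, the symlink_target_uuid column name, the job names, and the resolve_job helper are all assumptions for illustration, and SQLite stands in for the real database.

```python
import sqlite3
import uuid


def resolve_job(conn, name):
    """Look up a job by name and return (status, job_name).

    200 means the job was found directly, 301 means the name is an old
    one that points at a newer job, 404 means no such job.
    """
    row = conn.execute(
        "SELECT symlink_target_uuid FROM jobs WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        return 404, None
    target_uuid = row[0]
    if target_uuid is None:
        return 200, name
    # Follow the symlink to the job that replaced this one.
    target = conn.execute(
        "SELECT name FROM jobs WHERE uuid = ?", (target_uuid,)
    ).fetchone()
    return 301, target[0]


# Hypothetical, simplified jobs table (an assumption, not the real schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs ("
    "uuid TEXT PRIMARY KEY, name TEXT UNIQUE, symlink_target_uuid TEXT)"
)
old_uuid, new_uuid = str(uuid.uuid4()), str(uuid.uuid4())
conn.execute("INSERT INTO jobs VALUES (?, ?, NULL)", (old_uuid, "etl_orders"))
conn.execute(
    "INSERT INTO jobs VALUES (?, ?, NULL)", (new_uuid, "daily_dag.etl_orders")
)

# The manual step from the notes: record the new job's UUID on the old job.
conn.execute(
    "UPDATE jobs SET symlink_target_uuid = ? WHERE uuid = ?",
    (new_uuid, old_uuid),
)

print(resolve_job(conn, "etl_orders"))
```

Keeping the old row and pointing it at the new one, rather than deleting it, matches the need stated in the notes to keep track of old job names while removing junk nodes from the graph.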

March 31, 2022

Attendees:

TSC:

  • Willy Lulciuc, Co-creator of Marquez
  • Michael Collado, Staff Engineer, Astronomer
  • Julien Le Dem, Chief Architect, Astronomer
  • Peter Hicks, Senior Engineer, Astronomer

And:

  • Ross Turk, Sr. Director of Community, Astronomer
  • Minkyu Park, Senior Engineer, Astronomer
  • John Thomas, Support Engineer, Astronomer
  • Michael Robinson, Developer Relations Engineer, Astronomer
  • Howard Yoo, Staff Product Manager, Astronomer

Agenda:

  • Website update
  • Backlog and roadmap discussion
  • Open discussion

Meeting:

Notes:

Announcements [Michael R.]

  • Marquez stickers are now available: https://www.astronomer.io/datakin-swag
  • Willy and Julien gave a talk on OpenLineage, Airflow and Marquez at Data Council Austin on March 23
  • The project's GitHub star count stands at 983. Have you starred the project yet?
  • 1k stars are a requirement for graduation status from the LF AI. The project is nearing completion of all requirements, so formal application will be possible soon.

Website [Ross]

  • The project now has a new website.
  • Appropriately, it's an open-source project; PRs are welcome.
  • Tech: Gatsby, GitHub Projects
  • Dev: run yarn deploy to work on it
  • Plans: blog page. Proposals for posts welcome – post them in Slack or open a PR if you prefer.

Backlog and roadmap [Willy]

  • Issue: currently, PRs are driven by a small team (e.g., Peter's view for dataset versions, Pawel's lifecycle PR)
  • How to get the broader community involved? Want people to have more input/control over the issues we take up.
  • Solution: GitHub's Roadmap feature. Milestones and releases are visible there. Choose Marquez on the Projects tab.
  • Process: review issues on monthly basis, move to roadmap, then release.
  • Question from Howard about how to propose new features
  • Follow-up work: discussion of how to prioritize issues; documentation needed about how to label new issues (e.g., as "features")
  • Comment from Michael C.: it's possible to add new columns to the roadmap, in addition to new issues.

Open discussion

  • Michael C.: please note issue #1928: supporting job grouping and hierarchy.
    • Problem: the project does not track parent/child job relationships, despite this nomenclature being used in OpenLineage to describe related jobs.
    • Proposal: a parent_job_id column should be added to the jobs table and to the runs table, both being uuids. 
  • Michael R.: please note that the meeting typically takes place on the 4th Thursday of each month.
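
The proposal in issue #1928 (a parent_job_id column on both the jobs table and the runs table, each a UUID) can be sketched as follows. This is only an illustration under stated assumptions: the table layouts are heavily simplified, the job names and the add_job and child_jobs helpers are hypothetical, and SQLite is used in place of the project's real database.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
# Simplified, hypothetical tables; only parent_job_id comes from the proposal.
conn.executescript(
    """
    CREATE TABLE jobs (
        uuid TEXT PRIMARY KEY,
        name TEXT NOT NULL,
        parent_job_id TEXT REFERENCES jobs(uuid)
    );
    CREATE TABLE runs (
        uuid TEXT PRIMARY KEY,
        job_uuid TEXT NOT NULL REFERENCES jobs(uuid),
        parent_job_id TEXT REFERENCES jobs(uuid)
    );
    """
)


def add_job(name, parent_job_id=None):
    """Insert a job, optionally linked to a parent job, and return its UUID."""
    job_uuid = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO jobs VALUES (?, ?, ?)", (job_uuid, name, parent_job_id)
    )
    return job_uuid


def child_jobs(parent_uuid):
    """Return the names of all jobs whose parent is the given job."""
    rows = conn.execute(
        "SELECT name FROM jobs WHERE parent_job_id = ? ORDER BY name",
        (parent_uuid,),
    )
    return [name for (name,) in rows]


dag = add_job("daily_dag")
extract = add_job("daily_dag.extract", parent_job_id=dag)
add_job("daily_dag.load", parent_job_id=dag)

# A run of a child job carries the same parent reference.
conn.execute(
    "INSERT INTO runs VALUES (?, ?, ?)", (str(uuid.uuid4()), extract, dag)
)

print(child_jobs(dag))
```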

February 24, 2022

Attendees:

TSC:

  • Willy Lulciuc, Co-creator of Marquez
  • Michael Collado, Staff Engineer, Datakin

And:

  • Minkyu Park, Senior Engineer, Datakin
  • Michael Robinson, Developer Relations Engineer, Datakin
  • Ross Turk, VP of Marketing, Datakin

Agenda:

  • Review of integrations to create runs and associate metadata with runs (replaced with OpenLineage)
  • Demo: How to collect OpenLineage events with the lineage API to send metadata to Marquez
  • Demo: OL Java client
  • Dataset lifecycle management
  • Open discussion

Meeting:

Notes:

Announcements [Willy]

  • Release date of 0.21.0 is now 2/28
  • Confusion in the community about which Java client to use is being addressed in OpenLineage PR #480
    • We hope to have this merged for the next OL release

Integrations and OL demo [Willy]

  • OL integration
    • Available at openlineage.io/integration/, where you can also find instructions for installing and configuring it
    • requirements.txt needs to install Airflow
    • Set OpenLineage URL to local instance of Marquez
    • Marquez is moving towards using a task listener to pull metadata in real time 
    • For now use the OL Airflow DAG
    • You can still use the OL backend; there are limitations there, however
  • Spark integration
    • When doing the Spark submit command you need to provide configuration - specify the extra listener (thanks to Michael C for his work on this)

    • Point the host to your deployment

    • See the OL website for more details (openlineage.io/integration/spark-spark)
  • Upcoming: Flink and Kafka
  • Your feedback on these integrations appreciated
  • There are many connections you can use in your platform by switching over to OL to collect metadata
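
As a rough illustration of pointing an OpenLineage producer at a local Marquez instance, the sketch below builds a minimal OpenLineage run event and prepares a POST to Marquez's lineage endpoint. It assumes Marquez is reachable at http://localhost:5000 unless OPENLINEAGE_URL says otherwise; the namespace, job name, producer URL, and both helper functions are made up for the example. It is not the demo shown in the meeting, and real integrations normally use an OpenLineage client rather than hand-built JSON.

```python
import json
import os
import uuid
from datetime import datetime, timezone
from urllib import request

# Assumed default: a local Marquez API listening on port 5000.
MARQUEZ_URL = os.environ.get("OPENLINEAGE_URL", "http://localhost:5000")


def build_run_event(event_type, namespace, job_name, run_id):
    """Build a minimal OpenLineage run event as a plain dict."""
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [],
        "outputs": [],
        # Hypothetical producer identifier for this example.
        "producer": "https://example.com/hypothetical-producer",
    }


def build_request(event, base_url=MARQUEZ_URL):
    """Prepare (but do not send) a POST to the Marquez lineage endpoint."""
    return request.Request(
        url=f"{base_url}/api/v1/lineage",
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


event = build_run_event(
    "START", "example-namespace", "example_job", str(uuid.uuid4())
)
req = build_request(event)
print(req.get_method(), req.full_url)

# To actually send the event to a running Marquez instance:
# request.urlopen(req)
```

A matching event with eventType "COMPLETE" and the same runId would close out the run.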

OL Java client demo [Willy]

Dataset lifecycle management [Willy]

  • Marquez can now capture changes to dataset names
  • Community voiced desire for this feature
  • Marquez now supports soft deletes of datasets
  • See PR #1847
  • Support of lifecycle now more concrete: can see the phases datasets go through

Open discussion

  • Julien and Willy will be speaking in-person at the Data Council conference in Austin next month (March 23-24)
  • Michael C. will be presenting virtually at the Subsurface LIVE conference (March 2-3); topic: Spark 

January 27, 2022

Attendees:

TSC:

  • Willy Lulciuc, Co-creator of Marquez
  • Julien Le Dem, CTO of Datakin
  • Michael Collado, Staff Engineer, Datakin
  • Peter Hicks, Senior Engineer, Datakin
  • Kevin Mellott, Assistant Director of Data Engineering, Northwestern Mutual

And:

  • Ross Turk, VP of Marketing, Datakin
  • Minkyu Park, Senior Engineer, Datakin
  • John Thomas, Support Engineer, Datakin
  • Michael Robinson, Developer Relations Engineer, Datakin

Agenda:

  • Marquez recent releases overview [Willy] 
    • Marquez release 0.21.0 overview
      • Upgrade to Java 17
  • Migrating integrations to OpenLineage [Willy]
  • Cloud-based development instance of Marquez via Gitpod [Peter]
  • Open discussion

Meeting:

Notes:

0.21.0 overview [Willy]

  • Features:
    • Bug fixes
    • Removal of excess code
    • Upgrade to Java 17
      • API image migrated
      • Eclipse Temurin integrated
      • All CI deployment updated to support Java 17
  • Discussion [Kevin, Willy, Michael C.]:
    • Support for Java client possible in lower version
    • Proposed: schedule separate meeting about this

Migrating integrations to OpenLineage [Willy]

  • Spark library in Marquez now deprecated
  • Use of OpenLineage Spark integration recommended going forward
    • review the docs about how to configure your instance
    • remember to add underscore to marquez_airflow
  • OpenLineage integration allows task listener
    • workaround: import DAG from OpenLineage
  • See the changelog: environment variables for the Airflow instance have changed

Cloud-based development instance of Marquez [Peter]

  • Enabled by integration of Gitpod
  • Docker image in the cloud with Marquez and UI
  • Ideal for those not ready to install everything locally or who are having issues with their OS
  • Fast (30 seconds), eliminates risk
  • API also available
  • Can be made private or public
  • Big advantage: shareable within organizations via URL
  • Supports everything one could do locally in VS Code or similar IDE
  • Discussion [Willy, Peter, Kevin, Julien]:
    • common use case: potential users want to see metadata from their org and share the tool
    • potential side-effect: increase in Docker pulls
    • availability of metrics unknown
    • email address required

Open Discussion

  • Advantages of possible move from CircleCI to GitHub Actions
    • CircleCI downsides: outages, billing issues [Willy]
    • Julien proposed: moving to GitHub Actions eventually after running both in parallel
    • Kevin asked to experiment with GitHub Actions and report back
  • Issue #1800: add support for table operations reported from OpenLineage
    • Formal solution needed [Willy]
    • Willy proposed: deploy in two modes and use flags (Julien agreed)
  • NodeID
    • An easy win: add a field that returns a nodeID [Willy]
    • Willy proposed: prioritize in next release

Marquez Workflow Group Calendar Overview

Effective March 22, 2019: Group calendars are managed within LF AI Foundation Groups.io subgroups (mail lists); with each sub-group (mail list) having a unique group calendar. Meeting invites from these group calendars are sent to the applicable sub-group (mail list). In order to see the various group calendars you must:

View Instructions on How to Subscribe to LF AI Group Calendars

For detailed information on LF AI meeting management processes view this page: LF AI Foundation - Community Meetings and Calendars



Marquez Meetings List

  • Schedule: Day of Week (frequency) 00:00 AM/PM - 00:00 AM/PM (timezone)
  • Title: Meeting Title (Zoom Account Used)
  • Owner: Meeting Owner/Moderator
  • Subgroup (mail list): marquez-mail-list@lists.lfai.foundation
  • Purpose: Meeting Purpose
  • Dial In Link: Zoom Name: https://zoom.us/...















