Marquez Monthly Community Meeting

The Marquez Community Meeting occurs on the fourth Thursday of each month. Meetings are held on Zoom.

June 23, 2022

Agenda:

Announcements
Recent release: 0.23.0
User story by Martin Fiser (Keboola)
Open discussion

Meeting:

Notes:

Announcements
Recent Release 0.23.0
- Added
  - Update docker-compose.yml: Randomly map postgres db port (#2000, @RNHTTR)
  - Job parent hierarchy (#1935 #1980 #1992, @collado-mike)
- Changed
  - Set default limit for listing datasets and jobs in UI from 2000 to 25 (#2018, @wslulciuc)
- Fixed
  - Return the tag for postgresql to 12.1.0 (#2015, @rossturk)

May 26, 2022

Attendees:

TSC:

Willy Lulciuc, Co-creator of Marquez
Peter Hicks, Senior Engineer, Astronomer

And:

Ross Turk, Senior Director of Community, Astronomer
Minkyu Park, Senior Engineer, Astronomer
John Thomas, Support Engineer, Astronomer
Michael Robinson, Developer Relations Engineer, Astronomer
Joshua Wankowski, Associate Data Engineer, Northwestern Mutual
Sam Holmberg, Software Engineer, Astronomer
Dako Dakov, R&D Manager, VMware
Agita Jaunzeme, Community Manager, VMware
Radmila Radovanvic, Senior Data Engineer, Northwestern Mutual
Gage Russell, Data Engineer, Q2
Rae Green, Developer, Q2ebanking
Dimira Petrova, Supervisor of Data Analytics, VMware
Martin Fiser, Head of Professional Services, Keboola
Naga Raghavarapu, Principal Software Engineer, Oracle
Antoni Ivanov, Staff Engineer, VMware

Agenda:

Announcements
Use cases from Northwestern Mutual and VMware
New feature: linking job runs and datasets

Meeting:

Recording
Password: WMz0&@Gm

Notes:

Announcements [Willy]
- Marquez stickers are now available: https://www.astronomer.io/datakin-swag
- Michael C. is presenting today at Airflow Summit @ 7 pm PT: https://airflowsummit.org/program/
- Willy will be talking Mqz at Open Source Summit in June: https://sched.co/11NgS
Northwestern Mutual Use Case [Joshua]
- Big-picture role of Mqz at NWM
  - Mqz used to track data usage as a whole
  - Mqz critical at NWM to data ops, has special future here
- Company background
  - Massive insurance co. with investment management arm
  - 150+ year history with many customer touch points
  - Massive data with lots of users
- Rationale for adoption
  - OL is where I spend most of my time
  - These tools will be the industry standards for dataset usage going forward
  - We desired one data standard, not random internal standards
- Breakdown of use case
  - We track the HOW of usage from initial consumption to end usage
  - We record data product usage over time
  - Bonus: improved security
    - can see how/which users are actually using data
    - allows comparison to security frameworks, double-checking of work
  - Visualization is key
    - helps in building reports and modeling huge data systems
    - we can check the entire platform stack from ingest to updates, normalization, end-usage
- Personal perspective
  - Mqz is data ops for data processing
  - Will we have a data ops center in the future like we have currently with NOCs?
  - The visual language is the key strength of the tool
  - This is the future of data
- Q & A
  - Are screenshots available? Do you use Spark? [Naga]
    - Can't share due to proprietary concerns
  - How much data? [Naga]
    - Can't be specific, but it's a lot!
  - It's exciting to see others excited about the project. Are you using any custom integrations? [Willy]
    - Yes, custom integrations support streaming and ingestions across the platform
VMware use case [Antoni]
- Demo of VDK
- Our motivation
  - Verification problems
  - OL + Mqz was the solution
  - The common standard provided by OL is essential
- Why Mqz?
  - It's helpful in debugging complex jobs, troubleshooting
  - It's key to understanding usage for maintenance – e.g., enabling removal of irrelevant datasets, jobs
  - The shared metadata is useful
- Diagram of architecture
- Code demo
- Suggestions
  - Add visualization of parent/child relationships [note: see PR 1935]
  - Make output searchable by metadata (e.g., make it possible to find all late jobs)
- Our stack
  - Postgres, Presto, Snowflake, Greenplum db, Trino
- Q & A
  - How many integrations in use? [Gage]
    - 100 teams, 1000s of tables
  - Are you using the Python client? [Willy]
    - Yes
  - It's amazing to get this feedback [Willy]
  - The grouping of jobs is hard, but we're addressing this
  - Feel free to open issues and contribute
New feature linking job runs to datasets [Peter]
- Recently added to jobs: created_by available on dataset views
- Dataset versions also now available on version history tab
  - Allows for historical introspection in case of an issue
    - Allows for seeing if the code changed, for example
Open discussion
- Is anyone using the Python client for OL? [Gage]
  - Based on today's discussion, the answer is yes
- Projects, docs are coming [Willy]
  - You can also use the Airflow integration for insight into the Python client
- Column-level lineage has been added to OL [Willy]
  - We worked with Microsoft on the spec
  - Look for this in the API in the next few months
  - Feedback on this appreciated
- What's in the roadmap for multi-tenancy? How can this be used in Mqz? [Naga]
  - For every event, route it through Kafka – we're working with a company to help us document this a bit more [Willy]
  - Alternate approach: use a namespace to add metadata
  - Issue with this: access control (see the project roadmap for more info)

April 28, 2022

Attendees:

TSC:

Willy Lulciuc, Co-creator of Marquez
Michael Collado, Staff Software Engineer, Astronomer
Julien Le Dem, Chief Architect, Astronomer

And:

Ross Turk, Senior Director of Community, Astronomer
Minkyu Park, Senior Engineer, Astronomer
John Thomas, Support Engineer, Astronomer
Michael Robinson, Developer Relations Engineer, Astronomer
Gage Russell, Data Engineer, Q2
Paweł Leszczyński, Data Engineer, GetInData
Joshua Wankowski, Associate Data Engineer, Northwestern Mutual
Dillon Stadther

Agenda:

0.22.0 preview [Willy]
lifecycleStateChange support [Pawel]
Updates to job renaming and symlinking [Michael C.]

Meeting:

Zoom meeting
Passcode: !G!&h=E7

Notes:

Announcements [Willy]:
- Cool swag is available! https://www.astronomer.io/datakin-swag
- Willy has two talks about Marquez upcoming:
  - Airflow Summit: https://airflowsummit.org/program/
  - Open Source Summit: https://sched.co/11NgS
0.22.0 Preview [Willy]:
- lifecycleStateChange support will offer visibility into dataset lifecycle changes, including deleting of tables
- Pawel:
  - change motivated by desire for more information about datasets
  - approach started out with the Spark integration
  - still more information about lifecycle changes is possible/desirable
  - additional feature idea: notification console friendly to backend developers
- Additional possibility: grayed out nodes on graph for deleted datasets, logging to show lifecycle history
- Pawel: panel on website could display changes to dataset over X days
  - Agreed. Create an issue and we can build on that idea.
- Helm chart addition
  - allows annotations, e.g. Prometheus metrics
- Support for renaming and redirection
  - introducing job hierarchy
  - symlink will permit visibility into name changes to datasets
Updates to job renaming and symlinking [Michael C.]
- stemmed from desire to tie linked jobs together, e.g., jobs called by DAGs, even in cases where identical code is part of different chains
- challenge: linking old jobs to fully qualified version
- motivating factor: changes to job names result in junk nodes on graph
- there was no way to remove the old job names from the graph
- but there is frequently a need to keep track of old job names
- hence the idea of symlinking a job
- currently there's no API to do this
- updating must be done manually currently
  - add the UUID of the new job to the db
  - from that point on, the job history will redirect to the new job (with a 301)
- future: API will make this possible programmatically
- Willy: is documentation needed for this?
  - Yes, I will post a change to the README
  - We want to do the same thing for datasets
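
The symlinking behavior described above can be sketched roughly as follows. This is an illustrative in-memory model, not Marquez's implementation: the `JobRow` class, the `symlink_target_uuid` field name, and the `resolve_job` helper are all hypothetical, chosen only to show how recording the new job's UUID on the old job could turn a lookup into a 301 redirect.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple
from uuid import UUID, uuid4


@dataclass
class JobRow:
    """Hypothetical stand-in for a row in the jobs table."""
    uuid: UUID
    name: str
    symlink_target_uuid: Optional[UUID] = None  # assumed column name


def resolve_job(jobs: Dict[UUID, JobRow], requested: UUID) -> Tuple[int, JobRow]:
    """Return (HTTP status, job): 200 for a direct hit, 301 if a symlink was followed."""
    job = jobs[requested]
    status = 200
    seen = {job.uuid}
    while job.symlink_target_uuid is not None:
        job = jobs[job.symlink_target_uuid]
        if job.uuid in seen:
            raise ValueError("symlink cycle detected")
        seen.add(job.uuid)
        status = 301
    return status, job


old = JobRow(uuid4(), "etl_orders")
new = JobRow(uuid4(), "etl.orders_daily")
jobs = {old.uuid: old, new.uuid: new}

# The manual step from the notes: record the new job's UUID on the old job.
old.symlink_target_uuid = new.uuid

status, job = resolve_job(jobs, old.uuid)
print(status, job.name)
```

Keeping the old row and pointing it at the new one, rather than deleting it, matches the stated need to keep track of old job names while removing junk nodes from the graph.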
Open discussion
- Gage: is a Helm repo coming?
  - Willy: Minkyu has looked into this
  - Willy: we want to add the Helm chart to the new website
  - Willy: this is on our radar
- New release coming soon!

March 31, 2022

Attendees:

TSC:

Willy Lulciuc, Co-creator of Marquez
Michael Collado, Staff Engineer, Astronomer
Julien Le Dem, Chief Architect, Astronomer
Peter Hicks, Senior Engineer, Astronomer

And:

Ross Turk, Sr. Director of Community, Astronomer
Minkyu Park, Senior Engineer, Astronomer
John Thomas, Support Engineer, Astronomer
Michael Robinson, Developer Relations Engineer, Astronomer
Howard Yoo, Staff Product Manager, Astronomer

Agenda:

Website update
Backlog and roadmap discussion
Open discussion

Meeting:

Slides
Zoom recording
Passcode: !4gq*v01

Notes:

Announcements [Michael R.]

Marquez stickers are now available: https://www.astronomer.io/datakin-swag
Willy and Julien gave a talk on OpenLineage, Airflow and Marquez at Data Council Austin on March 23
The project's GitHub star count stands at 983. Have you starred the project yet?
1k stars are a requirement for graduation status from the LFAI. The project is nearing completion of all requirements, so formal application will be possible soon.

Website [Ross]

The project now has a new website.
Appropriately, it's an open-source project; PRs are welcome.
Tech: Gatsby, GitHub Projects
Dev: run yarn deploy to work on it
Plans: blog page. Proposals for posts welcome – post them in Slack or open a PR if you prefer.

Backlog and roadmap [Willy]

Issue: currently, PRs are driven by a small team (e.g., Peter's view for dataset versions, Pawel's lifecycle PR)
How to get the broader community involved? Want people to have more input/control over the issues we take up.
Solution: GitHub's Roadmap feature. Milestones and releases are visible there. Choose Marquez on the Projects tab.
Process: review issues on monthly basis, move to roadmap, then release.
Question from Howard about how to propose new features
Follow-up work: discussion of how to prioritize issues; documentation needed about how to label new issues (e.g., as "features")
Comment from Michael C.: it's possible to add new columns to the roadmap, in addition to new issues.

Open discussion

Michael C.: please note issue #1928: supporting job grouping and hierarchy.
- Problem: the project does not track parent/child job relationships, despite this nomenclature being used in OpenLineage to describe related jobs.
- Proposal: a parent_job_id column should be added to the jobs table and to the runs table, both being uuids.
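
A minimal sketch of what the proposed parent/child link could look like, assuming only what the proposal states (a nullable `parent_job_id` UUID on each job). The `Job` class and the `children_of` and `ancestry` helpers are hypothetical and exist only to illustrate how such a column lets a hierarchy be walked; they are not part of Marquez.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional
from uuid import UUID, uuid4


@dataclass
class Job:
    """Hypothetical in-memory model of a jobs row with the proposed column."""
    uuid: UUID
    name: str
    parent_job_id: Optional[UUID] = None  # None means a top-level job


def children_of(jobs: Dict[UUID, Job], parent_id: UUID) -> List[Job]:
    """All jobs whose parent_job_id points at the given job."""
    return [j for j in jobs.values() if j.parent_job_id == parent_id]


def ancestry(jobs: Dict[UUID, Job], job: Job) -> List[str]:
    """Job names from the root of the hierarchy down to the given job."""
    names = [job.name]
    seen = {job.uuid}
    while job.parent_job_id is not None:
        job = jobs[job.parent_job_id]
        if job.uuid in seen:
            raise ValueError("cycle in parent_job_id chain")
        seen.add(job.uuid)
        names.append(job.name)
    return list(reversed(names))


dag = Job(uuid4(), "daily_dag")
task = Job(uuid4(), "daily_dag.load_orders", parent_job_id=dag.uuid)
jobs = {dag.uuid: dag, task.uuid: task}

print(ancestry(jobs, task))
print([j.name for j in children_of(jobs, dag.uuid)])
```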
Michael R.: please note that the meeting typically takes place on the 4th Thursday of each month.

February 24, 2022

Attendees:

TSC:

Willy Lulciuc, Co-creator of Marquez
Michael Collado, Staff Engineer, Datakin

And:

Minkyu Park, Senior Engineer, Datakin
Michael Robinson, Developer Relations Engineer, Datakin
Ross Turk, VP of Marketing, Datakin

Agenda:

Review of integrations to create runs and associate metadata with runs (replaced with OpenLineage)
Demo: How to collect OpenLineage events with the lineage API to send metadata to Marquez
Demo: OL Java client
Dataset lifecycle management
Open discussion

Meeting:

Slides
Zoom recording
Passcode: Q89sjr.c

Notes:

Announcements [Willy]

Release date of 0.21.0 is now 2/28
Confusion in the community about which Java client to use is being addressed in OpenLineage PR #480
- We hope to have this merged for the next OL release

Integrations and OL demo [Willy]

OL integration
- Available at openlineage.io/integration/, where you can also find instructions for installing and configuring it
- requirements.txt needs to install Airflow
- Set OpenLineage URL to local instance of Marquez
- Marquez is moving towards using a task listener to pull metadata in real time
- For now use the OL Airflow DAG
- You can still use the OL backend; there are limitations there, however
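
To make "set the OpenLineage URL to a local instance of Marquez" more concrete, here is a rough sketch of building a minimal OpenLineage run event and posting it to Marquez's lineage endpoint. It assumes Marquez is listening on `http://localhost:5000` and accepts events at `/api/v1/lineage`; the namespace, job name, and producer URL are made-up placeholders, and the integrations normally build and send these events for you.

```python
import json
import uuid
from datetime import datetime, timezone
from urllib import request

MARQUEZ_URL = "http://localhost:5000"  # assumption: local Marquez instance


def build_run_event(event_type: str, run_id: str, namespace: str, job_name: str) -> dict:
    """Build a minimal OpenLineage run event as a plain dict."""
    return {
        "eventType": event_type,  # e.g. "START" or "COMPLETE"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [],
        "outputs": [],
        "producer": "https://example.com/my-producer",  # placeholder
    }


def send_event(event: dict, base_url: str = MARQUEZ_URL) -> int:
    """POST the event to the lineage endpoint and return the HTTP status."""
    req = request.Request(
        f"{base_url}/api/v1/lineage",
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return resp.status


run_id = str(uuid.uuid4())
start = build_run_event("START", run_id, "my-namespace", "my-job")
complete = build_run_event("COMPLETE", run_id, "my-namespace", "my-job")
# send_event(start); send_event(complete)  # requires a running Marquez instance
```

Both events share one `runId`, which is how the START and COMPLETE of the same run are tied together.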
Spark integration
- When doing the Spark submit command you need to provide configuration - specify the extra listener (thanks to Michael C for his work on this)
- Point the host to your deployment
- See the OL website for more details (openlineage.io/integration/spark-spark)
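
As a hedged illustration of the Spark submit configuration mentioned above, the sketch below assembles the `--conf` flags for a `spark-submit` call. The listener class and the `spark.openlineage.host` and `spark.openlineage.namespace` keys reflect the OpenLineage Spark integration as documented around the time of this meeting and may have changed since; the host, namespace, and script name are placeholders, and the OpenLineage Spark package must also be available on the Spark classpath.

```python
from typing import Dict, List


def openlineage_spark_conf(host: str, namespace: str) -> Dict[str, str]:
    """Spark settings that register the OpenLineage listener (assumed key names)."""
    return {
        "spark.extraListeners": "io.openlineage.spark.agent.OpenLineageSparkListener",
        "spark.openlineage.host": host,            # point the host to your deployment
        "spark.openlineage.namespace": namespace,  # placeholder namespace
    }


def spark_submit_args(app: str, conf: Dict[str, str]) -> List[str]:
    """Build a spark-submit argument list with one --conf flag per setting."""
    args = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


conf = openlineage_spark_conf("http://localhost:5000", "spark_integration")
args = spark_submit_args("my_job.py", conf)
print(" ".join(args))
```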
Upcoming: Flink and Kafka
Your feedback on these integrations appreciated
There are many connections you can use in your platform by switching over to OL to collect metadata

OL Java client demo [Willy]

The Java client employs a workflow with interface
Definition of run method required
Instance of database required
This ex: simpleworkflow with database via newDatabase method
Relies on a Job class
In Marquez you can see the calls
For the code see https://github.com/DatakinHQ/demo/tree/main/custom/java/simple

Dataset lifecycle management [Willy]

Marquez can now capture changes to dataset names
Community voiced desire for this feature
Marquez now supports soft deletes of datasets
See PR #1847
Support of lifecycle now more concrete: can see the phases datasets go through

Open discussion

Julien and Willy will be speaking in-person at the Data Council conference in Austin next month (March 23-24)
Michael C. will be presenting virtually at the Subsurface LIVE conference (March 2-3); topic: Spark

January 27, 2022

Attendees:

TSC:

Willy Lulciuc, Co-creator of Marquez
Julien Le Dem, CTO of Datakin
Michael Collado, Staff Engineer, Datakin
Peter Hicks, Senior Engineer, Datakin
Kevin Mellott, Assistant Director of Data Engineering, Northwestern Mutual

And:

Ross Turk, VP of Marketing, Datakin
Minkyu Park, Senior Engineer, Datakin
John Thomas, Support Engineer, Datakin
Michael Robinson, Developer Relations Engineer, Datakin

Agenda:

Marquez recent releases overview [Willy]
- Marquez release 0.21.0 overview
  - Upgrade to Java 17
Migrating integrations to OpenLineage [Willy]
Cloud-based development instance of Marquez via Gitpod [Peter]
Open discussion

Meeting:

Slides
Zoom recording
Passcode: Ef2^5Wwg

Notes:

0.21.0 overview [Willy]

Features:
- Bug fixes
- Removal of excess code
- Upgrade to Java 17
  - API image migrated
  - Eclipse Temurin integrated
  - All CI deployment updated to support Java 17
Discussion [Kevin, Willy, Michael C.]:
- Support for Java client possible in lower version
- Proposed: schedule separate meeting about this

Migrating integrations to OpenLineage [Willy]

Spark library in Marquez now deprecated
Use of OpenLineage Spark integration recommended going forward
- review the docs about how to configure your instance
- remember to add underscore to marquez_airflow
OpenLineage integration allows task listener
- workaround: import DAG from OpenLineage
See the changelog: environment variables for the Airflow instance have changed

Cloud-based development instance of Marquez [Peter]

Enabled by integration of Gitpod
Docker image in the cloud with Marquez and UI
Ideal for those not ready to install everything locally or who are having issues with their OS
Fast (30 seconds), eliminates risk
API also available
Can be made private or public
Big advantage: shareable within organizations via URL
Supports everything one could do locally in VS Code or similar IDE
Discussion [Willy, Peter, Kevin, Julien]:
- common use case: potential users want to see metadata from their org and share the tool
- potential side-effect: increase in Docker pulls
- availability of metrics unknown
- email address required

Open Discussion

Advantages of possible move from CircleCI to GitHub Actions
- CircleCI downsides: outages, billing issues [Willy]
- Julien proposed: moving to GitHub Actions eventually after running both in parallel
- Kevin asked to experiment with GitHub Actions and report back
Issue #1800: add support for table operations reported from OpenLineage
- Formal solution needed [Willy]
- Willy proposed: deploy in two modes and use flags (Julien agreed)
NodeID
- An easy win: add a field that returns a nodeID [Willy]
- Willy proposed: prioritize in next release

Marquez Workflow Group Calendar Overview

Effective March 22, 2019: Group calendars are managed within LF AI Foundation Groups.io subgroups (mail lists); with each sub-group (mail list) having a unique group calendar. Meeting invites from these group calendars are sent to the applicable sub-group (mail list). In order to see the various group calendars you must:

Be logged into LF AI Foundation Groups.io
Be subscribed to the sub-group(mail-list) you're interested in
Thereafter, you will see all the calendars for the sub-groups you subscribe to under your LF AI Foundation Group Calendar via Groups.io OR
You can also view a specific group calendar via the Wiki (if the group has created a Wiki group calendar) whether you are a member of the sub-group (mail list) or not
- Example: LF AI TAC Group Calendar (tac-general@lists...) via Wiki

View Instructions on How to Subscribe to LF AI Group Calendars

For detailed information on LF AI meeting management processes view this page: LF AI Foundation - Community Meetings and Calendars

Marquez Meetings List

Schedule: Day of Week (frequency) 00:00 AM/PM - 00:00 AM/PM (timezone)
Title: Meeting Title (Zoom Account Used)
Owner: Meeting Owner/Moderator
Subgroup (mail list): marquez-mail-list@lists.lfai.foundation
Purpose: Meeting Purpose
Dial In Link: Zoom Name: https://zoom.us/...

Marquez Group Calendar

