Marquez Monthly Community Meeting

The Marquez Community Meeting occurs on the fourth Thursday of each month. Meetings are held on Zoom.

August 25, 2022

July 28, 2022

Attendees:

  • TSC:
    • Willy Lulciuc, Co-creator of Marquez
    • Michael Collado, Staff Software Engineer, Astronomer
  • And:
    • Michael Robinson, Software Engineer, Dev. Rel., Astronomer
    • Minkyu Park, Senior Engineer, Astronomer
    • John Thomas, Software Engineer, Dev. Rel., Astronomer
    • Ross Turk, Senior Director of Community, Astronomer
    • Ryan Hatter, Customer Reliability Engineer, Astronomer
    • Howard Yoo, Staff Product Manager, Astronomer

Agenda:

  1. Announcements
  2. Introducing the Marquez blog
  3. Architecture review: the lineage graph
  4. Discussion
    1. Marquez issue #2048

Meeting:

Notes:

  1. Announcements [Willy]
  2. Introducing the Marquez Blog [Michael R. and Ross]
    1. new blog can be found at marquezproject.ai/blog
    2. designed and built by Ross
    3. to contribute a blog post on GitHub:
      1. write post in Markdown, place it in a new directory in OpenLineage/website/contents/blog
      2. OR: open an issue first to suggest a topic or get feedback on your idea
      3. artwork: Ross happy to make the images; tag him
      4. Ross also happy to document the artwork creation process for others
  3. Architecture review: the lineage graph [Willy]
    1. What is Marquez doing in the background to surface lineage metadata at the run level during execution?
    2. What is a current lineage graph?
      1. bipartite graph with nodes for jobs and datasets
      2. run-level lineage is collected from OpenLineage events (see the event sketch after these notes)
      3. a job is represented by the datasets it consumes and produces (its inputs and outputs)
      4. datasets are stitched together using the OpenLineage `ID` (globally unique)
      5. versioning of jobs enabled by OpenLineage `JobVersion`
        1. Marquez keeps track of changes to code and datasets behind the scenes
    3. Marquez data model
      1. Marquez keeps track of:
        1. job versions
        2. runs of each version
        3. sources
      2. each node represents the latest, or current, version of the job's lineage
      3. a `Job` consists of an `ID` and arrays representing its input and output datasets
    4. Demo
      1. UI defaults to latest/current graph
      2. prior versions accessible via `version history` tab
      3. selecting a version makes another job node/datasets visible
      4. makes "time travel" possible in your pipeline
      5. all of this possible thanks to the OpenLineage spec
    5. Q & A
      1. If a job has not completed, will you not see metadata? [Howard]
        • a job has to complete in order for the versioning logic to be applied
      2. Is a job version associated with the code that produced it? [Ryan]
        • yes – if the code is provided as a source location facet
        • Marquez will determine if the code has changed
        • changes to the schema are also monitored using dataset versioning; this is tied to the job version
  4. Discussion
    1. Howard: issue 2048
      1. There is an edge case (using a custom extractor) where the TaskMetadata's given input or output dataset would NOT have the fields populated (`dataset.fields = []`).
      2. metadata like this makes Marquez overwrite the existing version of the dataset with empty fields (see the schema-facet sketch after these notes)
      3. Proposal: Marquez should try to reuse the existing dataset version instead of overwriting it
    2. Agreed; question remains about how to do it [Willy]
      1. behavior reflects versioning logic
      2. possible solution: use `null` value in OL spec rather than empty array
      3. challenge: we want to avoid making assumptions
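
For reference, here is a minimal sketch of how run-level lineage metadata like this reaches Marquez: an OpenLineage run event naming a job, a run, and its input and output datasets, posted to Marquez's lineage endpoint. The host, namespace, job, and dataset names are placeholders, and the field set follows the OpenLineage spec as described above (a real event may also require a top-level `schemaURL`); this is illustrative, not the exact demo shown in the meeting.

```python
# A sketch of an OpenLineage run event posted to Marquez's lineage endpoint.
# Host, namespace, job, and dataset names are placeholders; field names follow
# the OpenLineage spec (a real event may also require a top-level schemaURL).
import uuid
from datetime import datetime, timezone

import requests

MARQUEZ_URL = "http://localhost:5000"  # assumed local Marquez instance

event = {
    "eventType": "COMPLETE",  # versioning logic is applied once a run completes
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "my-namespace", "name": "etl.orders_daily"},  # placeholder job
    "inputs": [{"namespace": "postgres://db.example.com", "name": "public.orders"}],
    "outputs": [{"namespace": "postgres://db.example.com", "name": "public.orders_summary"}],
    "producer": "https://example.com/my-pipeline",  # placeholder producer URI
}

# Marquez stitches the job and datasets named here into its lineage graph.
resp = requests.post(f"{MARQUEZ_URL}/api/v1/lineage", json=event)
resp.raise_for_status()
```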
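
And a sketch of the edge case behind issue #2048 as discussed above: the same dataset sent with a populated schema facet versus an explicitly empty field list, which currently causes Marquez to overwrite the schema it already knows. Names are placeholders, and the facet envelope fields required by the spec are omitted for brevity.

```python
# Two variants of the same dataset, as discussed for issue #2048.
# Names are placeholders; the facet envelope fields (_producer, _schemaURL)
# required by the spec are omitted for brevity.

# Normal case: the schema facet carries the dataset's fields.
dataset_with_schema = {
    "namespace": "postgres://db.example.com",
    "name": "public.orders",
    "facets": {
        "schema": {
            "fields": [
                {"name": "id", "type": "INTEGER"},
                {"name": "amount", "type": "DECIMAL"},
            ]
        }
    },
}

# Edge case: a custom extractor that cannot resolve columns emits an empty
# list, and Marquez currently versions the dataset with no fields,
# overwriting the schema it already knows about.
dataset_with_empty_fields = {
    "namespace": "postgres://db.example.com",
    "name": "public.orders",
    "facets": {"schema": {"fields": []}},
}
```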

June 23, 2022

Attendees:

  • TSC
    • Willy Lulciuc, Co-creator of Marquez
    • Julien Le Dem, Chief Architect, Astronomer
  • And
    • Martin Fiser, Head of Professional Services, Keboola
    • Michael Robinson, Software Engineer, Dev. Rel., Astronomer
    • Minkyu Park, Senior Engineer, Astronomer
    • John Thomas, Support Engineer, Astronomer
    • Naga Raghavarapu, Principal Software Engineer, Oracle
    • Ross Turk, Senior Director of Community, Astronomer

Agenda:

  • Announcements
  • Recent release: 0.23.0
  • User story by Martin Fiser (Keboola)
  • Open discussion

Meeting:

Notes:

  • Announcements [Willy]

    • Mqz/OL swag is still available!
    • Willy gave a talk on Mqz at Open Source Summit (LinuxCon)
  • Recent Release 0.23.0 [Michael R.]

  • Keboola Use Case [Martin]

    • Topic: OL integration with the Keboola platform
      • Overview of platform
        • modern data experience: data stack as a service
        • all-in-one service
        • writers/reverse ETL through component framework
        • enables version control, governance, etc., in workspaces
        • much metadata produced and collected, permitting visibility across entire pipeline
          • pipeline jobs
          • storage events
          • data loads/unloads
          • user-generated metadata
      • Purpose of OL integration
        • data governance to support users' feeding data to external tools
        • OL a "language" for speaking to various tools
        • offer API for OL information
        • native Keboola component
          • feeds OL information to an endpoint (e.g., Marquez)
          • can be orchestrated on customizable interval 
          • supports SSH
          • exports full job information to the endpoint
      • Demo
        • users have multiple projects on the platform
        • a few hundred components are offered to users out of the box (e.g., Google Drive, SQL, Python, Google Sheets)
        • metadata manually pushable to OpenLineage endpoint
        • orchestrator could benefit from parent/job support
      • Challenges
        • need: richer metadata 
          • component config
          • info about tables
        • lighter UI
          • reflects feedback about legibility
          • icon customizability
        • namespaces
          • connectivity between projects
        • more integrations
        • rounded logo
      • Q & A
        • Are you interested in contributing? [Julien]
          • would like to; possibly in the future
        • Would you like to open issues? (custom facets, UI) [Willy]
          • not currently able to
        • Are you using any integrations? java or python [Willy]
          • component can be anything in the docker container
          • multiple languages used in development
        • Customers using it already? [Conor]
          • some testing is going on
          • not in production yet
          • no plans to offer Marquez to customers
        • Does it work for every connector? [Conor]
          • each will produce at least a job
        • Auth model [Willy]
          • problem: slippery slope [Martin]
          • recommended at ingress level [Willy]
          • not a focus at the moment
          • contributions to related issues welcome
        • Is data discovery offered? [Naga]
          • built in with API 
          • additional tools can be added if the integration is seamless

May 26, 2022

Attendees:

TSC:

  • Willy Lulciuc, Co-creator of Marquez
  • Peter Hicks, Senior Engineer, Astronomer

And:

  • Ross Turk, Senior Director of Community, Astronomer
  • Minkyu Park, Senior Engineer, Astronomer
  • John Thomas, Support Engineer, Astronomer
  • Michael Robinson, Developer Relations Engineer, Astronomer
  • Joshua Wankowski, Associate Data Engineer, Northwestern Mutual
  • Sam Holmberg, Software Engineer, Astronomer
  • Dako Dakov, R&D Manager, VMware
  • Agita Jaunzeme, Community Manager, VMware
  • Radmila Radovanvic, Senior Data Engineer, Northwestern Mutual
  • Gage Russell, Data Engineer, Q2
  • Rae Green, Developer, Q2ebanking
  • Dimira Petrova, Supervisor of Data Analytics, VMware
  • Martin Fiser, Head of Professional Services, Keboola
  • Naga Raghavarapu, Principal Software Engineer, Oracle
  • Antoni Ivanov, Staff Engineer, VMware

Agenda:

  • Announcements
  • Use cases from Northwestern Mutual and VMware
  • New feature: linking job runs and datasets

Meeting:

Notes:

  • Announcements [Willy]

  • Northwestern Mutual Use Case [Joshua]

    • Big-picture role of Mqz at NWM
      • Mqz used to track data usage as a whole
      • Mqz is critical to data ops at NWM and has a special future there
    • Company background
      • Massive insurance co. with investment management arm
      • 150+ year history with many customer touch points
      • Massive data with lots of users
    • Rationale for adoption
      • OL is where I spend most of my time
      • These tools will be the industry standards for dataset usage going forward
      • We desired one data standard, not random internal standards
    • Breakdown of use case
      • We track the HOW of usage from initial consumption to end usage
      • We record data product usage over time
      • Bonus: improved security
        • can see how/which users are actually using data
        • allows comparison to security frameworks, double-checking of work
      • Visualization is key
        • helps in building reports and modeling huge data systems
        • we can check the entire platform stack from ingest to updates, normalization, end-usage
    • Personal perspective
      • Mqz is data ops for data processing
      • Will we have a data ops center in the future like we have currently with NOCs?
      • The visual language is the key strength of the tool
      • This is the future of data
    • Q & A
      • Are screenshots available? Do you use Spark? [Naga]
        • Can't share due to proprietary concerns
      • How much data? [Naga]
        • Can't be specific, but it's a lot!
      • It's exciting to see others excited about the project. Are you using any custom integrations? [Willy]
        • Yes, custom integrations support streaming and ingestions across the platform
  • VMware use case [Antoni]

    • Demo of VDK
    • Our motivation
      • Verification problems
      • OL/Mqz was the solution
      • The common standard provided by OL is essential
    • Why Mqz?
      • It's helpful in debugging complex jobs, troubleshooting
      • It's key to understanding usage for maintenance – e.g., enabling removal of irrelevant datasets, jobs
      • The shared metadata is useful
    • Diagram of architecture
    • Code demo
    • Suggestions
      • Add visualization of parent/child relationships [note: see PR 1935]
      • Make output searchable by metadata (e.g., make it possible to find all late jobs)
    • Our stack
      • Postgres, Presto, Snowflake, Greenplum db, Trino
    • Q & A
      • How many integrations in use? [Gage]
        • 100 teams, 1000s of tables
      • Are you using the Python client? [Willy]
        • Yes
      • It's amazing to get this feedback [Willy]
      • The grouping of jobs is hard, but we're addressing this
      • Feel free to open issues and contribute
  • New feature: linking job runs to datasets [Peter]
    • Recently added to jobs: `created_by` available on dataset views
    • Dataset versions also now available on the version history tab (see the API sketch after these notes)
      • Allows for historical introspection in case of an issue
        • Allows for seeing if the code changed, for example
  • Open discussion
    • Is anyone using the Python client for OL? [Gage]
      • Based on today's discussion, the answer is yes
    • Projects, docs are coming [Willy]
      • You can also use the Airflow integration for insight into the Python client
    • Column-level lineage has been added to OL [Willy]
      • We worked with Microsoft on the spec
      • Look for this in the API in the next few months
      • Feedback on this appreciated
    • What's in the roadmap for multi-tenancy? How can this be used in Mqz? [Naga]
      • For every event, route it through Kafka – we're working with a company to help us document this a bit more [Willy]
      • Alternate approach: use a namespace to add metadata
      • Issue with this: access control (see the project roadmap for more info) 
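
For reference, a minimal sketch of reading a dataset's version history through the Marquez REST API, which is what the version history tab surfaces. The endpoint path follows the Marquez API docs; the host, namespace, dataset name, and response field names below are assumptions used to illustrate the idea.

```python
# A sketch of reading a dataset's version history from the Marquez REST API.
# Host, namespace, and dataset name are placeholders; response field names
# are assumptions.
import requests

MARQUEZ_URL = "http://localhost:5000"   # assumed local Marquez instance
NAMESPACE = "my-namespace"              # hypothetical namespace
DATASET = "public.orders_summary"       # hypothetical dataset name

resp = requests.get(
    f"{MARQUEZ_URL}/api/v1/namespaces/{NAMESPACE}/datasets/{DATASET}/versions"
)
resp.raise_for_status()

# Each entry describes one historical version of the dataset; the run that
# produced it ties the version back to a specific job run.
for version in resp.json().get("versions", []):
    print(version.get("version"), version.get("createdAt"))
```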

April 28, 2022

Attendees:

TSC:

  • Willy Lulciuc, Co-creator of Marquez
  • Michael Collado, Staff Software Engineer, Astronomer
  • Julien Le Dem, Chief Architect, Astronomer

And:

  • Ross Turk, Senior Director of Community, Astronomer
  • Minkyu Park, Senior Engineer, Astronomer
  • John Thomas, Support Engineer, Astronomer
  • Michael Robinson, Developer Relations Engineer, Astronomer
  • Gage Russell, Data Engineer, Q2
  • Paweł Leszczyński, Data Engineer, GetInData
  • Joshua Wankowski, Associate Data Engineer, Northwestern Mutual
  • Dillon Stadther

Agenda:

  • 0.22.0 preview [Willy]
  • lifecycleStateChange support [Pawel]
  • Updates to job renaming and symlinking [Michael C.]

Meeting:

Notes:

  • Announcements [Willy]:

  • 0.22.0 Preview [Willy]:

    • lifecycleStateChange support will offer visibility into dataset lifecycle changes, including the deletion of tables (see the facet sketch after these notes)
    • Pawel:
      • change motivated by desire for more information about datasets
      • approach started out with the Spark integration
      • still more information about lifecycle changes is possible/desirable
      • additional feature idea: notification console friendly to backend developers
    • Additional possibility: grayed-out nodes on the graph for deleted datasets, plus logging to show lifecycle history
    • Pawel: panel on website could display changes to dataset over X days
      • Agreed. Create an issue and we can build on that idea.
    • Helm chart addition
      • allows annotations, e.g. Prometheus metrics
    • Support for renaming and redirection
      • introducing job hierarchy
      • symlink will permit visibility into name changes to datasets
  • Updates to job renaming and symlinking [Michael C.]

    • stemmed from a desire to tie related jobs together, e.g., jobs called by DAGs, even in cases where identical code is part of different chains
    • challenge: linking old jobs to the fully qualified version
    • motivating factor: changes to job names result in junk nodes on the graph
    • there was no way to remove the old job names from the graph
    • but there is frequently a need to keep track of old job names
    • hence the idea of symlinking a job
    • there is currently no API to do this
    • updating must be done manually:
      • add the UUID of the new job to the db
      • from that point on, the job history will redirect to the new job (with a 301)
    • future: API will make this possible programmatically
    • Willy: is documentation needed for this?
      • Yes, I will post a change to the README
      • We want to do the same thing for datasets
  • Open discussion
    • Gage: is a Helm repo coming?
      • Willy: Minkyu has looked into this
      • Willy: we want to add the Helm chart to the new website
      • Willy: this is on our radar
    • New release coming soon!
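
For reference, a sketch of the kind of OpenLineage output-dataset fragment the lifecycleStateChange support consumes, recording for example that a table was dropped. The facet and field names follow the OpenLineage spec as I understand it; the dataset names are placeholders, and the facet envelope fields required by the spec are omitted for brevity.

```python
# A sketch of an OpenLineage output-dataset fragment carrying the
# lifecycleStateChange facet. Dataset names are placeholders; the facet
# envelope fields (_producer, _schemaURL) required by the spec are omitted.
output_dataset = {
    "namespace": "postgres://db.example.com",   # placeholder source
    "name": "public.orders_staging",            # placeholder table
    "facets": {
        "lifecycleStateChange": {
            # Other values defined by the spec include CREATE, ALTER,
            # OVERWRITE, RENAME, and TRUNCATE.
            "lifecycleStateChange": "DROP",
        }
    },
}
```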

March 31, 2022

Attendees:

TSC:

  • Willy Lulciuc, Co-creator of Marquez
  • Michael Collado, Staff Engineer, Astronomer
  • Julien Le Dem, Chief Architect, Astronomer
  • Peter Hicks, Senior Engineer, Astronomer

And:

  • Ross Turk, Sr. Director of Community, Astronomer
  • Minkyu Park, Senior Engineer, Astronomer
  • John Thomas, Support Engineer, Astronomer
  • Michael Robinson, Developer Relations Engineer, Astronomer
  • Howard Yoo, Staff Product Manager, Astronomer

Agenda:

  • Website update
  • Backlog and roadmap discussion
  • Open discussion

Meeting:

Slides

Notes:

Announcements [Michael R.]

  • Marquez stickers are now available: https://www.astronomer.io/datakin-swag
  • Willy and Julien gave a talk on OpenLineage, Airflow and Marquez at Data Council Austin on March 23
  • The project's GitHub star count stands at 983. Have you starred the project yet?
  • 1k stars is a requirement for graduation status from the LF AI Foundation. The project is nearing completion of all requirements, so a formal application will be possible soon.

Website [Ross]

  • The project now has a new website.
  • Appropriately, it's an open-source project; PRs are welcome.
  • Tech: Gatsby, GitHub Projects
  • Dev: run `yarn deploy` to work on it
  • Plans: blog page. Proposals for posts welcome – post them in Slack or open a PR if you prefer.

Backlog and roadmap [Willy]

  • Issue: currently, PRs are driven by a small team (e.g., Peter's view for dataset versions, Pawel's lifecycle PR)
  • How to get the broader community involved? Want people to have more input/control over the issues we take up.
  • Solution: GitHub's Roadmap feature. Milestones and releases visible there. Choose Marquez on the Projects tab.
  • Process: review issues on monthly basis, move to roadmap, then release.
  • Question from Howard about how to propose new features
  • Follow-up work: discussion of how to prioritize issues; documentation needed about how to label new issues (e.g., as "features")
  • Comment from Michael C.: it's possible to add new columns to the roadmap, in addition to new issues.

Open discussion

  • Michael C.: please note issue #1928: supporting job grouping and hierarchy.
    • Problem: the project does not track parent/child job relationships, despite this nomenclature being used in OpenLineage to describe related jobs.
    • Proposal: a `parent_job_id` column should be added to the `jobs` table and to the `runs` table, both being UUIDs.
  • Michael R.: please note that the meeting typically takes place on the 4th Thursday of each month.

February 24, 2022

Attendees:

TSC:

  • Willy Lulciuc, Co-creator of Marquez
  • Michael Collado, Staff Engineer, Datakin

And:

  • Minkyu Park, Senior Engineer, Datakin
  • Michael Robinson, Developer Relations Engineer, Datakin
  • Ross Turk, VP of Marketing, Datakin

Agenda:

  • Review of integrations to create runs and associate metadata with runs (replaced with OpenLineage)
  • Demo: How to collect OpenLineage events with the lineage API to send metadata to Marquez
  • Demo: OL Java client
  • Dataset lifecycle management
  • Open discussion

Meeting:

Slides

Notes:

  • Announcements [Willy]

    • Release date of 0.21.0 is now 2/28
    • Confusion in the community about which Java client to use is being addressed in OpenLineage PR #480
      • We hope to have this merged for the next OL release
  • Integrations and OL demo [Willy]

    • OL integration
      • Available at openlineage.io/integration/, where you can also find instructions for installing and configuring it
      • `requirements.txt` needs to include `airflow`
      • Set the OpenLineage URL to your local instance of Marquez
      • Marquez is moving towards using a task listener to pull metadata in real time 
      • For now use the OL Airflow DAG
      • You can still use the OL backend; there are limitations there, however
    • Spark integration
      • When running the `spark-submit` command, you need to provide configuration specifying the extra listener (thanks to Michael C. for his work on this); see the configuration sketch after these notes

      • Point the host to your deployment

      • See the OL website for more details (openlineage.io/integration/spark)
    • Upcoming: Flink and Kafka
    • Your feedback on these integrations appreciated
    • There are many connections you can use in your platform by switching over to OL to collect metadata
  • OL Java client demo [Willy]

  • Dataset lifecycle management [Willy]

    • Marquez can now capture changes to dataset names
    • Community voiced desire for this feature
    • Marquez now supports soft deletes of datasets
    • See PR #1847
    • Support of lifecycle now more concrete: can see the phases datasets go through
  • Open discussion

    • Julien and Willy will be speaking in-person at the Data Council conference in Austin next month (March 23-24)
    • Michael C. will be presenting virtually at the Subsurface LIVE conference (March 2-3); topic: Spark 
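
For reference, a minimal sketch of the Spark configuration described above, expressed with PySpark rather than `spark-submit` flags. The listener class and the `spark.openlineage.*` keys follow the OpenLineage Spark integration docs of the time and may have since changed; the package version, Marquez URL, and namespace are placeholders.

```python
# A sketch of wiring the OpenLineage Spark listener into a PySpark session.
# The spark.openlineage.* keys reflect the docs of the time and may have
# changed; the package version, URL, and namespace are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("openlineage_example")
    # Pull in the OpenLineage Spark agent (version is a placeholder).
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.6.1")
    # Register the extra listener that emits OpenLineage events.
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    # Point the integration at your Marquez (or other OpenLineage) deployment.
    .config("spark.openlineage.host", "http://localhost:5000")
    .config("spark.openlineage.namespace", "spark-jobs")
    .getOrCreate()
)
```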

January 27, 2022

Attendees:

TSC:

  • Willy Lulciuc, Co-creator of Marquez
  • Julien Le Dem, CTO of Datakin
  • Michael Collado, Staff Engineer, Datakin
  • Peter Hicks, Senior Engineer, Datakin
  • Kevin Mellott, Assistant Director of Data Engineering, Northwestern Mutual

And:

  • Ross Turk, VP of Marketing, Datakin
  • Minkyu Park, Senior Engineer, Datakin
  • John Thomas, Support Engineer, Datakin
  • Michael Robinson, Developer Relations Engineer, Datakin

Agenda:

  • Marquez recent releases overview [Willy] 
    • Marquez release 0.21.0 overview
      • Upgrade to Java 17
  • Migrating integrations to OpenLineage [Willy]
  • Cloud-based development instance of Marquez via Gitpod [Peter]
  • Open discussion

Meeting:

Slides

Notes:

  • 0.21.0 overview [Willy]

    • Features:
      • Bug fixes
      • Removal of excess code
      • Upgrade to Java 17
        • API image migrated
        • Eclipse Temurin integrated
        • All CI deployments updated to support Java 17
    • Discussion [Kevin, Willy, Michael C.]:
      • Support for the Java client on a lower Java version is possible
      • Proposed: schedule a separate meeting about this
  • Migrating integrations to OpenLineage [Willy]

    • Spark library in Marquez now deprecated
    • Use of OpenLineage Spark integration recommended going forward
      • review the docs about how to configure your instance
      • remember to add the underscore to `marquez_airflow`
    • OpenLineage integration allows task listener
      • workaround: import DAG from OpenLineage (see the DAG sketch after these notes)
    • See the changelog: environment variables for the Airflow instance have changed
  • Cloud-based development instance of Marquez [Peter]

    • Enabled by integration of Gitpod
    • Docker image in the cloud with Marquez and UI
    • Ideal for those not ready to install everything locally or who are having issues with their OS
    • Fast (30 seconds), eliminates risk
    • API also available
    • Can be made private or public
    • Big advantage: shareable within organizations via URL
    • Supports everything one could do locally in VS Code or similar IDE
    • Discussion [Willy, Peter, Kevin, Julien]:
      • common use case: potential users want to see metadata from their org and share the tool
      • potential side-effect: increase in Docker pulls
      • availability of metrics unknown
      • email address required
  • Open Discussion

    • Advantages of a possible move from CircleCI to GitHub Actions
      • CircleCI downsides: outages, billing issues [Willy]
      • Julien proposed: moving to GitHub Actions eventually after running both in parallel
      • Kevin asked to experiment with GitHub Actions and report back
    • Issue #1800: add support for table operations reported from OpenLineage
      • Formal solution needed [Willy]
      • Willy proposed: deploy in two modes and use flags (Julien agreed)
    • NodeID
      • An easy win: add a field that returns a nodeID [Willy]
      • Willy proposed: prioritize in next release
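
For reference, a minimal sketch of the DAG-import workaround mentioned above, assuming Airflow 2.x and the `openlineage-airflow` package of the time. The environment variable names follow the OpenLineage docs as I recall them; the DAG id and namespace are placeholders, and the changelog should be checked for current values.

```python
# A sketch of the DAG-import workaround: replace Airflow's DAG import with
# the OpenLineage-provided DAG so task metadata is emitted to Marquez.
# Assumes Airflow 2.x; the DAG id and namespace are placeholders.
from datetime import datetime

from airflow.operators.bash import BashOperator
from openlineage.airflow import DAG  # instead of: from airflow import DAG

# The integration reads its target from environment variables, e.g.:
#   OPENLINEAGE_URL=http://localhost:5000   (your Marquez instance)
#   OPENLINEAGE_NAMESPACE=my-namespace      (placeholder namespace)

with DAG(
    dag_id="example_openlineage_dag",   # placeholder DAG id
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
) as dag:
    BashOperator(task_id="hello", bash_command="echo hello")
```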

Marquez Workflow Group Calendar Overview

Effective March 22, 2019, group calendars are managed within LF AI Foundation Groups.io subgroups (mail lists), with each subgroup (mail list) having a unique group calendar. Meeting invites from these group calendars are sent to the applicable subgroup (mail list). To see the various group calendars, you must:

View Instructions on How to Subscribe to LF AI Group Calendars

For detailed information on LF AI meeting management processes, view this page: LF AI Foundation - Community Meetings and Calendars



Marquez Meetings List

  • Schedule: Day of Week (frequency) 00:00 AM/PM - 00:00 AM/PM (timezone)
  • Title: Meeting Title (Zoom Account Used)
  • Owner: Meeting Owner/Moderator
  • Subgroup (mail list): marquez-mail-list@lists.lfai.foundation
  • Purpose: Meeting Purpose
  • Dial In Link: Zoom Name: https://zoom.us/...
