Marquez Monthly Community Meeting
The Marquez Community Meeting occurs on the fourth Thursday of each month. Meetings are held on Zoom.
Next meeting: April 28, 2022
March 31, 2022
Attendees:
TSC:
- Willy Lulciuc, Co-creator of Marquez
- Michael Collado, Staff Engineer, Astronomer
- Julien Le Dem, Chief Architect, Astronomer
- Peter Hicks, Senior Engineer, Astronomer
And:
- Ross Turk, Sr. Director of Community, Astronomer
- Minkyu Park, Senior Engineer, Astronomer
- John Thomas, Support Engineer, Astronomer
- Michael Robinson, Developer Relations Engineer, Astronomer
- Howard Yoo, Staff Product Manager, Astronomer
Agenda:
- Website update
- Backlog and roadmap discussion
- Open discussion
Meeting:
- Zoom recording
- Passcode: !4gq*v01
Notes:
Announcements [Michael R.]
- Marquez stickers are now available: https://www.astronomer.io/datakin-swag
- Willy and Julien gave a talk on OpenLineage, Airflow and Marquez at Data Council Austin on March 23
- The project's GitHub star count stands at 983. Have you starred the project yet?
- Reaching 1k stars is a requirement for graduation status from the LF AI. The project is nearing completion of all requirements, so a formal application will be possible soon.
Website [Ross]
- The project now has a new website.
- Appropriately, it's an open-source project; PRs are welcome.
- Tech: Gatsby, GitHub Projects
- Dev: run yarn deploy to work on it
- Plans: blog page. Proposals for posts welcome. Post in Slack or open a PR.
Backlog and roadmap [Willy]
- Issue: currently, PRs are driven by a small team (e.g., Peter's view for dataset versions, Pawel's lifecycle PR)
- How to get the broader community involved? Want people to have more input/control over the issues we take up.
- Solution: GitHub's Roadmap feature. Milestones and releases are visible there. Choose Marquez on the Projects tab.
- Process: review issues on monthly basis, move to roadmap, then release.
- Question from Howard about how to propose new features
- Follow-up work: discussion of how to prioritize issues; documentation needed about how to label new issues (e.g., as "features")
- Comment from Michael C.: it's possible to add new columns to the roadmap, in addition to new issues.
Open discussion
- Michael C.: please note issue #1928: supporting job grouping and hierarchy.
- Problem: the project does not track parent/child job relationships, despite this nomenclature being used in OpenLineage to describe related jobs.
- Proposal: a parent_job_id column should be added to the jobs table and to the runs table, both being uuids.
- Michael R.: please note that the meeting typically takes place on the 4th Thursday of each month.
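A minimal sketch of the parent/child proposal from issue #1928, for illustration only. It uses an in-memory SQLite database as a stand-in for the project's real database, and every table and column name other than parent_job_id (uuid, name, job_uuid) is an assumption, as are the helper functions add_job and child_jobs. It is not the schema migration discussed in the issue.

```python
import sqlite3
import uuid

# In-memory SQLite stand-in; the real schema and database differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE jobs (
    uuid TEXT PRIMARY KEY,
    name TEXT NOT NULL,
    parent_job_id TEXT REFERENCES jobs(uuid)  -- proposed column
);
CREATE TABLE runs (
    uuid TEXT PRIMARY KEY,
    job_uuid TEXT NOT NULL REFERENCES jobs(uuid),
    parent_job_id TEXT REFERENCES jobs(uuid)  -- proposed column
);
""")


def add_job(conn, name, parent_job_id=None):
    """Insert a job (hypothetical helper) and return its generated uuid."""
    job_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO jobs (uuid, name, parent_job_id) VALUES (?, ?, ?)",
        (job_id, name, parent_job_id),
    )
    return job_id


def child_jobs(conn, parent_job_id):
    """Return the names of jobs whose parent is the given job, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM jobs WHERE parent_job_id = ? ORDER BY name",
        (parent_job_id,),
    )
    return [row[0] for row in rows]


# A parent job (e.g. a DAG) with two child jobs (e.g. its tasks).
dag_id = add_job(conn, "etl_dag")
add_job(conn, "etl_dag.extract", dag_id)
add_job(conn, "etl_dag.load", dag_id)

children = child_jobs(conn, dag_id)
print(children)  # ['etl_dag.extract', 'etl_dag.load']
```

With the relationship stored as a nullable self-reference, top-level jobs simply leave parent_job_id empty, and grouping becomes a single filtered query.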
February 24, 2022
Attendees:
TSC:
- Willy Lulciuc, Co-creator of Marquez
- Michael Collado, Staff Engineer, Datakin
And:
- Minkyu Park, Senior Engineer, Datakin
- Michael Robinson, Developer Relations Engineer, Datakin
- Ross Turk, VP of Marketing, Datakin
Agenda:
- Review of integrations to create runs and associate metadata with runs (replaced with OpenLineage)
- Demo: How to collect OpenLineage events with the lineage API to send metadata to Marquez
- Demo: OL Java client
- Dataset lifecycle management
- Open discussion
Meeting:
- Slides
- Zoom recording
- Passcode: Q89sjr.c
Notes:
Announcements [Willy]
- Release date of 0.21.0 is now 2/28
- Confusion in the community about which Java client to use is being addressed in OpenLineage PR #480
- We hope to have this merged for the next OL release
Integrations and OL demo [Willy]
- OL integration
- Available at openlineage.io/integration/, where you can also find instructions for installing and configuring it
- Requirements.txt needs to install airflow
- Set OpenLineage URL to local instance of Marquez
- Marquez is moving towards using a task listener to pull metadata in real time
- For now use the OL Airflow DAG
- You can still use the OL backend; there are limitations there, however
- Spark integration
- When running the spark-submit command, provide configuration that specifies the extra listener (thanks to Michael C. for his work on this)
- Point the host to your deployment
- See the OL website for more details (openlineage.io/integration/spark-spark)
- Upcoming: Flink and Kafka
- Your feedback on these integrations appreciated
- There are many connections you can use in your platform by switching over to OL to collect metadata
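The Spark notes above (extra listener, host pointed at your deployment) might translate into a spark-submit invocation along the following lines. This is a sketch under assumptions: the configuration keys spark.openlineage.host and spark.openlineage.namespace reflect the OpenLineage Spark integration as documented around the time of this meeting and may differ in other releases, the Marquez URL assumes a local instance on port 5000, and the application name, namespace, and version string are placeholders. Check the OpenLineage docs for the release you use.

```python
import shlex

# Listener class provided by the OpenLineage Spark integration.
OPENLINEAGE_LISTENER = "io.openlineage.spark.agent.OpenLineageSparkListener"


def spark_submit_args(app, marquez_url, namespace, ol_version):
    """Build a spark-submit argument list with the OpenLineage listener enabled."""
    return [
        "spark-submit",
        # Pull the integration jar; the version is supplied by the caller.
        "--packages", f"io.openlineage:openlineage-spark:{ol_version}",
        # Register the extra listener that collects lineage metadata.
        "--conf", f"spark.extraListeners={OPENLINEAGE_LISTENER}",
        # Point the host at your Marquez deployment (assumed config key).
        "--conf", f"spark.openlineage.host={marquez_url}",
        # Namespace under which jobs appear (assumed config key).
        "--conf", f"spark.openlineage.namespace={namespace}",
        app,
    ]


# Placeholder values: replace X.Y.Z with a real release of the integration.
args = spark_submit_args("my_job.py", "http://localhost:5000", "spark_demo", "X.Y.Z")
print(shlex.join(args))
```

Building the command as a list keeps each option separate, so it can be passed to subprocess.run without shell quoting issues.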
OL Java client demo [Willy]
- The Java client employs a workflow with interface
- Definition of run method required
- Instance of database required
- This example: simpleworkflow with database via newDatabase method
- Relies on a Job class
- In Marquez you can see the calls
- For the code see https://github.com/DatakinHQ/demo/tree/main/custom/java/simple
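The demo itself is in Java and lives in the repository linked above. As a language-neutral illustration of the kind of call a client sends to Marquez, here is a Python sketch that builds an OpenLineage run event and a POST request for the Marquez lineage endpoint (/api/v1/lineage). The local URL, the namespace, the job and dataset names, and the producer value are assumptions chosen for the example; the helpers run_event and lineage_request are hypothetical and are not part of any client library. The request is only constructed, not sent.

```python
import json
import uuid
from datetime import datetime, timezone
from urllib import request

# Assumption: a local Marquez instance listening on port 5000.
MARQUEZ_URL = "http://localhost:5000"
LINEAGE_ENDPOINT = f"{MARQUEZ_URL}/api/v1/lineage"


def run_event(event_type, run_id, namespace, job_name, inputs=(), outputs=()):
    """Build a minimal OpenLineage run event as a plain dict (hypothetical helper)."""
    return {
        "eventType": event_type,  # e.g. "START" or "COMPLETE"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": namespace, "name": n} for n in inputs],
        "outputs": [{"namespace": namespace, "name": n} for n in outputs],
        # Hypothetical producer URI identifying the emitting code.
        "producer": "https://example.com/simple-workflow",
    }


def lineage_request(event):
    """Wrap an event in a POST request for the lineage endpoint (not sent here)."""
    body = json.dumps(event).encode("utf-8")
    return request.Request(
        LINEAGE_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


run_id = str(uuid.uuid4())
start = run_event("START", run_id, "demo", "simple_workflow", inputs=["raw_table"])
complete = run_event("COMPLETE", run_id, "demo", "simple_workflow", outputs=["clean_table"])

req = lineage_request(start)
print(req.get_method(), req.full_url)
# To actually send it to a running instance: request.urlopen(req)
```

Reusing the same runId for the START and COMPLETE events is what lets the backend treat them as one run of the job.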
Dataset lifecycle management [Willy]
- Marquez can now capture changes to dataset names
- Community voiced desire for this feature
- Marquez now supports soft deletes of datasets
- See PR #1847
- Support of lifecycle now more concrete: can see the phases datasets go through
Open discussion
- Julien and Willy will be speaking in-person at the Data Council conference in Austin next month (March 23-24)
- Michael C. will be presenting virtually at the Subsurface LIVE conference (March 2-3); topic: Spark
January 27, 2022
Attendees:
TSC:
- Willy Lulciuc, Co-creator of Marquez
- Julien Le Dem, CTO of Datakin
- Michael Collado, Staff Engineer, Datakin
- Peter Hicks, Senior Engineer, Datakin
- Kevin Mellott, Assistant Director of Data Engineering, Northwestern Mutual
And:
- Ross Turk, VP of Marketing, Datakin
- Minkyu Park, Senior Engineer, Datakin
- John Thomas, Support Engineer, Datakin
- Michael Robinson, Developer Relations Engineer, Datakin
Agenda:
- Marquez recent releases overview [Willy]
- Marquez release 0.21.0 overview
- Upgrade to Java17
- Migrating integrations to OpenLineage [Willy]
- Cloud-based development instance of Marquez via Gitpod [Peter]
- Open discussion
Meeting:
- Slides
- Zoom recording
- Passcode: Ef2^5Wwg
Notes:
0.21.0 overview [Willy]
- Features:
- Bug fixes
- Removal of excess code
- Upgrade to Java17
- API image migrated
- Eclipse Temurin integrated
- All CI deployment updated to support Java17
- Discussion [Kevin, Willy, Michael C.]:
- Support for Java client possible in lower version
- Proposed: schedule separate meeting about this
Migrating integrations to OpenLineage [Willy]
- Spark library in Marquez now deprecated
- Use of OpenLineage Spark integration recommended going forward
- review the docs about how to configure your instance
- remember to add underscore to marquez_airflow
- OpenLineage integration allows task listener
- workaround: import DAG from OpenLineage
- See the changelog: environment variables for the Airflow instance have changed
Cloud-based development instance of Marquez [Peter]
- Enabled by integration of Gitpod
- Docker image in the cloud with Marquez and UI
- Ideal for those not ready to install everything locally or who are having issues with their OS
- Fast (30 seconds), eliminates risk
- API also available
- Can be made private or public
- Big advantage: shareable within organizations via URL
- Supports everything one could do locally in VS Code or similar IDE
- Discussion [Willy, Peter, Kevin, Julien]:
- common use case: potential users want to see metadata from their org and share the tool
- potential side-effect: increase in Docker pulls
- availability of metrics unknown
- email address required
Open Discussion
- Advantages of possible move from CircleCI to GitHub Actions
- CircleCI downsides: outages, billing issues [Willy]
- Julien proposed: moving to GitHub Actions eventually after running both in parallel
- Kevin asked to experiment with GitHub Actions and report back
- Issue #1800: add support for table operations reported from OpenLineage
- Formal solution needed [Willy]
- Willy proposed: deploy in two modes and use flags (Julien agreed)
- NodeID
- An easy win: add a field that returns a nodeID [Willy]
- Willy proposed: prioritize in next release
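A rough sketch of what the nodeID "easy win" could look like on an API response. The kind:namespace:name format follows the way node IDs appear in lineage graph queries as far as we know, but treat it as an assumption, along with the field name nodeId and the helper functions node_id and with_node_id, which are hypothetical.

```python
def node_id(kind, namespace, name):
    """Compose a node ID of the assumed form kind:namespace:name."""
    if kind not in ("dataset", "job"):
        raise ValueError(f"unsupported node kind: {kind!r}")
    return f"{kind}:{namespace}:{name}"


def with_node_id(kind, resource):
    """Return a copy of a dataset or job dict with an added nodeId field."""
    return {
        **resource,
        "nodeId": node_id(kind, resource["namespace"], resource["name"]),
    }


# Hypothetical dataset response before and after the extra field is added.
dataset = {"namespace": "demo", "name": "public.orders"}
enriched = with_node_id("dataset", dataset)
print(enriched["nodeId"])  # dataset:demo:public.orders
```

Returning the ID alongside the resource would spare clients from assembling it themselves before calling the lineage API.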
Marquez Workflow Group Calendar Overview
Effective March 22, 2019: Group calendars are managed within LF AI Foundation Groups.io subgroups (mail lists); with each sub-group (mail list) having a unique group calendar. Meeting invites from these group calendars are sent to the applicable sub-group (mail list). In order to see the various group calendars you must:
- Be logged into LF AI Foundation Groups.io
- Be subscribed to the sub-group (mail list) you're interested in
- Thereafter, you will see all the calendars for the sub-groups you subscribe to under your LF AI Foundation Group Calendar via Groups.io OR
- You can also view a specific group calendar via the Wiki (if the group has created a Wiki group calendar) whether you are a member of the sub-group (mail list) or not
View Instructions on How to Subscribe to LF AI Group Calendars
For detailed information on LF AI meeting management processes view this page: LF AI Foundation - Community Meetings and Calendars
Marquez Meetings List
Schedule | Title | Owner | Subgroup (mail list) | Purpose | Dial In Link |
---|---|---|---|---|---|
Day of Week (frequency) 00:00 AM/PM - 00:00 AM/PM (timezone) | Meeting Title (Zoom Account Used) | Meeting Owner/Moderator | marquez-mail-list@lists.lfai.foundation | Meeting Purpose | Zoom Name: https://zoom.us/... |