Summary
Collaborating on software projects is largely a solved problem, with a variety of hosted or self-managed platforms to choose from. For data science projects, collaboration is still an open question. There are a number of projects that aim to bring collaboration to data science, but each one addresses a different aspect of the problem. Dean Pleban and Guy Smoilovsky created DagsHub to give individuals and teams a place to store and version their code, data, and models. In this episode they explain how DagsHub is designed to make it easier to create and track machine learning experiments and to serve as a way to promote collaboration on open source data science projects.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Your host as usual is Tobias Macey and today I’m interviewing Dean Pleban and Guy Smoilovsky about DagsHub, a platform to track experiments, and version data, models & pipelines for your data science and machine learning projects.
Interview
- Introduction
- How did you first get introduced to Python?
- Can you start by describing what the DagsHub platform is and why you built it?
- There are a number of projects and platforms that aim to support collaboration among data scientists. What are the distinguishing features of DagsHub and how does it compare to the other options in the ecosystem?
- What are the biggest opportunities for improvement that you still see in the space of collaboration on data projects?
- What do you see as the biggest points of friction for building experiments and managing source data collaboratively?
- Can you describe how the DagsHub platform is implemented?
- How have the design and goals of the system changed or evolved since you first began working on it?
- How have your own understanding and practices of working on data science/ML projects changed?
- GitHub has a number of convenience features beyond just storing a git repository. What are the capabilities that you are focusing on to add value to the data science workflow within DagsHub?
- How are you approaching the bootstrapping problem of building a critical mass of users to be able to generate a beneficial network effect?
- Are there any conventions that make it easier or more familiar for newcomers to a given project? (e.g. code layout, data labeling/tagging formats, etc.)
- What are your recommendations for managing ownership/licensing of data assets in public projects?
- What are some of the most interesting, innovative, or unexpected ways that you have seen DagsHub used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while building DagsHub?
- When is DagsHub the wrong choice?
- What do you have planned for the future of the platform and business?
Keep In Touch
Follow us on Twitter or LinkedIn, join our Discord, sign up to DAGsHub
Picks
- Tobias
- The Remarkable Journey of Prince Jen by Lloyd Alexander
- Dean
- Quantum Computing Since Democritus by Scott Aaronson
- The Expanse TV Series
- Guy
- Try to consume only the very best of available content, not the things that are coming out right now.
- Applies to textbooks, TV shows, movies
- Less Wrong blog
- Slate Star Codex / Astral Codex Ten
- Avatar: The Last Airbender
- 3Blue1Brown YouTube Channel
- Haskell
- Clojure
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA