The Python Podcast.__init__ by Tobias Macey
Tobias Macey
Category: Technology
Listen to the latest episode:
Preamble
This is a cross-over episode from our new show The Machine Learning Podcast, the show about going from idea to production with machine learning.
Summary
The majority of machine learning projects that you read about or work on are built around batch processes. The model is trained, and then validated, and then deployed, with each step being a discrete and isolated task. Unfortunately, the real world is rarely static, leading to concept drift and model failures. River is a framework for building streaming machine learning projects that can constantly adapt to new information. In this episode Max Halford explains how the project works, why you might (or might not) want to consider streaming ML, and how to get started building with River.
Announcements
- Hello and welcome to the Machine Learning Podcast, the podcast about machine learning and how to bring it from idea to delivery.
- Building good ML models is hard, but testing them properly is even harder. At Deepchecks, they built an open-source testing framework that follows best practices, ensuring that your models behave as expected. Get started quickly using their built-in library of checks for testing and validating your model’s behavior and performance, and extend it to meet your specific needs as your model evolves. Accelerate your machine learning projects by building trust in your models and automating the testing that you used to do manually. Go to themachinelearningpodcast.com/deepchecks today to get started!
- Your host is Tobias Macey and today I’m interviewing Max Halford about River, a Python toolkit for streaming and online machine learning
Interview
- Introduction
- How did you get involved in machine learning?
- Can you describe what River is and the story behind it?
- What is "online" machine learning?
- What are the practical differences with batch ML?
- Why is batch learning so predominant?
- What are the cases where someone would want/need to use online or streaming ML?
- The prevailing pattern for batch ML model lifecycles is to train, deploy, monitor, repeat. What does the ongoing maintenance for a streaming ML model look like?
- Concept drift is typically due to a discrepancy between the data used to train a model and the actual data being observed. How does the use of online learning affect the incidence of drift?
- Can you describe how the River framework is implemented?
- How have the design and goals of the project changed since you started working on it?
- How do the internal representations of the model differ from batch learning to allow for incremental updates to the model state?
- In the documentation you note the use of Python dictionaries for state management and the flexibility offered by that choice. What are the benefits and potential pitfalls of that decision?
- Can you describe the process of using River to design, implement, and validate a streaming ML model?
- What are the operational requirements for deploying and serving the model once it has been developed?
- What are some of the challenges that users of River might run into if they are coming from a batch learning background?
- What are the most interesting, innovative, or unexpected ways that you have seen River used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on River?
- When is River the wrong choice?
- What do you have planned for the future of River?
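The questions above about online learning and dictionary-based state can be made concrete with a small sketch. The following is a plain-Python toy, not River's implementation: the class name `OnlineLogisticRegression`, the `make_stream` helper, the learning rate, and the synthetic data are all assumptions made for illustration. The `learn_one` and `predict_proba_one` method names mirror the one-sample-at-a-time style that River's documentation describes. It shows a model that predicts on each sample before learning from it, with features passed as dictionaries so that new feature names can appear mid-stream.

```python
import math
import random


class OnlineLogisticRegression:
    """Toy online binary classifier (hypothetical, for illustration only)."""

    def __init__(self, lr=0.1):
        self.lr = lr
        self.weights = {}  # feature name -> weight; grows as new names appear
        self.bias = 0.0

    def predict_proba_one(self, x):
        # Unknown features contribute nothing until they have been learned.
        z = self.bias + sum(self.weights.get(k, 0.0) * v for k, v in x.items())
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
        return 1.0 / (1.0 + math.exp(-z))

    def learn_one(self, x, y):
        # One stochastic gradient step on the log loss for a single sample.
        error = self.predict_proba_one(x) - y
        for k, v in x.items():
            self.weights[k] = self.weights.get(k, 0.0) - self.lr * error * v
        self.bias -= self.lr * error
        return self


def make_stream(n, seed=42):
    """Hypothetical data source: the label is 1 when feature a exceeds b."""
    rng = random.Random(seed)
    for _ in range(n):
        x = {"a": rng.uniform(-1, 1), "b": rng.uniform(-1, 1)}
        yield x, int(x["a"] > x["b"])


model = OnlineLogisticRegression()
correct = 0
total = 0
for x, y in make_stream(2000):
    # Predict first, then learn: every sample is a test case before it
    # becomes training data, so no separate hold-out set is needed.
    y_pred = int(model.predict_proba_one(x) >= 0.5)
    correct += int(y_pred == y)
    total += 1
    model.learn_one(x, y)

accuracy = correct / total
print(f"progressive accuracy: {accuracy:.3f}")
```

Because the state is just a dictionary of weights, a sample carrying a feature the model has never seen simply adds a key. The trade-off is that nothing enforces a fixed schema, so a misspelled feature name would be learned silently as a new feature.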
Contact Info
- @halford_max on Twitter
- MaxHalford on GitHub
Parting Question
- From your perspective, what is the biggest barrier to adoption of machine learning today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@themachinelearningpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- River
- scikit-multiflow
- Federated Machine Learning
- Hogwild! Google Paper
- Chip Huyen concept drift blog post
- Dan Crankshaw Berkeley Clipper MLOps
- Robustness Principle
- NY Taxi Dataset
- RiverTorch
- River Public Roadmap
- Beaver tool for deploying online models
- Prodigy ML human in the loop labeling
The intro and outro music is from Hitman’s Lovesong feat. Paola Graziano by The Freak Fandango Orchestra/CC BY-SA 3.0
Sponsored By:
- Linode: Do you want to try out some of the tools and applications that you heard about on Podcast.__init__? Do you have a side project that you want to share with the world? With Linode's managed Kubernetes platform it's now even easier to get started with the latest in cloud technologies. With the combined power of the leading container orchestrator and the speed and reliability of Linode's object storage, node balancers, block storage, and dedicated CPU or GPU instances, you've got everything you need to scale up. Go to [pythonpodcast.com/linode](https://www.pythonpodcast.com/linode) today and get a $100 credit to launch a new cluster, run a server, upload some data, or... And don't forget to thank them for being a long time supporter of Podcast.__init__!
Previous episodes
- 799 - Update Your Model's View Of The World In Real Time With Streaming Machine Learning Using River Sun, 11 Dec 2022
- 798 - Declarative Machine Learning For High Performance Deep Learning Models With Predibase Sun, 04 Dec 2022
- 797 - Build Better Machine Learning Models With Confidence By Adding Validation With Deepchecks Sun, 27 Nov 2022
- 796 - Build A Full Stack ML Powered App In An Afternoon With Baseten Mon, 21 Nov 2022
- 795 - Skip Straight To The Fun Part Of Your Project With PyScaffold Sun, 06 Nov 2022
- 794 - Threading The Needle Of Interesting And Informative While You Learn To Code Mon, 05 Oct 2020
- 793 - Solving Python Package Creation For End User Applications With PyOxidizer Mon, 28 Sep 2020
- 792 - Flexible Network Security Detection And Response With Grapl Mon, 21 Sep 2020
- 791 - Simplified Data Extraction And Analysis For Current Events With Newspaper Mon, 14 Sep 2020
- 790 - Digging Into Dagster: An Opinionated Open Source Framework For Data Orchestration Mon, 07 Sep 2020
- 789 - When, Why, and How To Use Web Scraping In A Nutshell Mon, 31 Aug 2020
- 788 - Working In The Code Mines: Mining Software Repositories With PyDriller Mon, 24 Aug 2020
- 787 - Building The Open Data Ecosystem For Music And More At Metabrainz Mon, 17 Aug 2020
- 786 - Growing Dask To Make Scaling Python Data Science Easier At Coiled Mon, 10 Aug 2020
- 785 - Supporting The Full Lifecycle Of Machine Learning Projects With Metaflow Mon, 03 Aug 2020
- 784 - Learning To Program By Building Tiny Python Projects Mon, 27 Jul 2020
- 783 - Idiomatic Functional Programming With DRY Python Mon, 20 Jul 2020
- 782 - The Past, Present, And Future Of The FLUFL: Barry Warsaw Shares His History With Python Mon, 13 Jul 2020
- 781 - Pure Python Configuration Management With PyInfra Mon, 06 Jul 2020
- 780 - Build Your Own Domain Specific Language in Python With textX Mon, 29 Jun 2020
- 779 - Adding Observability To Your Python Applications With OpenTelemetry Mon, 22 Jun 2020
- 778 - Build A Personal Knowledge Store With Topic Modeling In Contextualize Mon, 15 Jun 2020
- 777 - Open Source Product Analytics With PostHog Mon, 08 Jun 2020
- 776 - Extending The Life Of Python 2 Projects With Tauthon Mon, 01 Jun 2020
- 775 - Dependency Management Improvements In Pip's Resolver Mon, 25 May 2020
- 774 - Easy Data Validation For Your Python Projects With Pydantic Mon, 18 May 2020
- 773 - Managing Distributed Teams In The Age Of Remote Work Mon, 11 May 2020
- 772 - Maintainable Infrastructure As Code In Pure Python With Pulumi Mon, 04 May 2020
- 771 - Teaching Python Machine Learning Mon, 27 Apr 2020
- 770 - Build The Next Generation Of Python Web Applications With FastAPI Sun, 19 Apr 2020
- 769 - Distributed Computing In Python Made Easy With Ray Mon, 13 Apr 2020
- 768 - Building The Seq Language For Bioinformatics Mon, 06 Apr 2020
- 767 - An Open Source Toolchain For Natural Language Processing From Explosion AI Mon, 30 Mar 2020
- 766 - A Flexible Open Source ERP Framework To Run Your Business Mon, 23 Mar 2020
- 765 - Getting A Handle On Portable C Extensions With hpy Mon, 16 Mar 2020
- 764 - Open Source Machine Learning On Quantum Computers With Xanadu AI Mon, 09 Mar 2020
- 763 - The Advanced Python Task Scheduler Mon, 02 Mar 2020
- 762 - Reducing The Friction Of Embedded Software Development With PlatformIO Mon, 24 Feb 2020
- 761 - APIs, Sustainable Open Source and The Async Web With Tom Christie Tue, 18 Feb 2020
- 760 - Learning To Program Python By Building Video Games With Arcade Mon, 10 Feb 2020
- 759 - Build Your Own Personal Data Repository With Nostalgia Mon, 03 Feb 2020
- 758 - Simplifying Social Login For Your Web Applications Sun, 26 Jan 2020
- 757 - Building A Business On Building Data Driven Businesses Mon, 20 Jan 2020
- 756 - Using Deliberate Practice To Level Up Your Python Mon, 13 Jan 2020
- 755 - Checking Up On Python's Role in DevOps Sun, 05 Jan 2020
- 754 - Python's Built In IDE Isn't Just Sitting IDLE Mon, 23 Dec 2019
- 753 - Riding The Rising Tides Of Python Mon, 16 Dec 2019
- 752 - Debugging Python Projects With PySnooper Sun, 08 Dec 2019
- 751 - Making Complex Software Fun And Flexible With Plugin Oriented Programming Mon, 02 Dec 2019
- 750 - Faster And Safer Software Development With Feature Flags Tue, 26 Nov 2019