From ML experiments to production:
Versioning and Reproducibility with MLV-tools
Stéphanie Bracaloni and Sarah Diot-Girard
About Us
Sarah Diot-Girard
Data Scientist since 2012
Interested in DataOps and Ethics
@SgdJlbl
Stéphanie Bracaloni
Software Engineer since 2013
Automation and Code Quality
@sbracaloni
The story, all names, characters, and incidents portrayed in this production are fictional.
All similarities with existing past or future Data Science projects are purely coincidental.
@SDG
Obligatory DISCLAIMER
Monday morning, 9am
@SDG
It's Monday morning, 9am.
You're a happy DS. Friday night, you finally made a breakthrough in your research project.
You sent that to your boss.
Confusion Matrix
ROC Curve
@SDG
You sent the results to your boss.
She loves it. She wants it in production ASAP.
Monday morning, 10am
@SDG (@SBI => "It's me")
It's Monday morning, 10am.
You are asking your SE coworker for help. You showed her your POC and now she looks like she's gonna faint
and she wants to kill you AT THE SAME TIME!
@SDG
- git repo with format not really compatible with git versioning
- hardcoded stuff (path, user, ...)
- hardcoded hyperparameters
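As an illustration of the last two points, here is a minimal sketch of moving hardcoded paths and hyperparameters to command-line arguments. The argument names (--input-csv, --model-out, --learning-rate, --n-epochs) and their defaults are hypothetical and not taken from the POC in the story.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Read paths and hyperparameters from the command line instead of hardcoding them."""
    parser = argparse.ArgumentParser(description="Train a model (illustrative sketch)")
    # Paths become required arguments: nothing tied to one user's machine.
    parser.add_argument("--input-csv", type=Path, required=True, help="training data location")
    parser.add_argument("--model-out", type=Path, required=True, help="where to write the model")
    # Hyperparameters get explicit defaults that can be overridden per run.
    parser.add_argument("--learning-rate", type=float, default=0.01)
    parser.add_argument("--n-epochs", type=int, default=10)
    return parser.parse_args(argv)


# Example call with an explicit argument list (hypothetical file names).
args = parse_args(["--input-csv", "data/train.csv", "--model-out", "model.pkl"])
print(args.input_csv, args.learning_rate, args.n_epochs)
```

With this shape, the same script can run on a laptop, in CI, or in production by changing only the arguments.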
Two weeks later
@SDG
Two weeks later.
It's Monday morning, 9am.
You have worked hard to clean up the POC. It's a bit better but...
@SBI
- Python lib + collection of tests
- CI for automated test runs + code quality + PACKAGING
- No more NB, pipeline == set of scripts
- NO data in the repo
=> We have long known how to structure a CODE PRJ => automated tests and quality checks,
and GIT is perfect for versioning
@SDG => Yes, but you've raised a point here! It is not just a CODE PRJ
POC vs PROD
Data Scientist vs Software Engineer
@SDG
This is the story of
On source versioning, the SE and the DS are now happy
Not quite yet
@SDG still a young project
Suited to our use cases; eager to hear about your usage and your problems
Gladly welcoming contributions and feature requests
For the data scientists
Modular improvements
Fast iterations
Confidence in results
Cannot lose data
For the engineers
Full power of IDE
Functional tests
CI and production-ready
Fast iterations thanks to saved intermediate results
@SBI: IDE => Enjoy IDE features
Now plan how to package to go to prod
...
Current limitations
DVC is another CLI tool
Git and DVC checkout for switching branches
Under development
Checking if all scripts are up-to-date when pushing
Easily comparing metrics between experiments
Pipeline packaging (with Debian and Anaconda)
Dynamic handling of hyperparameters
@SBI
- What if you push a commit with a modified NB but no updated script... INCONSISTENCY
=> SMART diff tool => CI or git hook
- Packaging (already export outside DVC: static, interactive, conf) + packaging
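One way such a consistency check could work, as a rough sketch only and not the MLV-tools implementation: record a hash of the notebook's code cells in the generated script, then let a CI job or git hook recompute it and compare. The header marker and both function names below are hypothetical.

```python
import hashlib
import json

# Hypothetical header line written into the script when it is exported from the notebook.
MARKER = "# source-notebook-sha256: "


def notebook_code_hash(notebook_json: str) -> str:
    """Hash only the code cells, so outputs and markdown do not trigger false alarms."""
    notebook = json.loads(notebook_json)
    digest = hashlib.sha256()
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            source = cell.get("source", "")
            # The .ipynb format stores source as a string or as a list of lines.
            if isinstance(source, list):
                source = "".join(source)
            digest.update(source.encode("utf-8"))
            digest.update(b"\n")
    return digest.hexdigest()


def script_is_up_to_date(notebook_json: str, script_text: str) -> bool:
    """Return True if the hash recorded in the script matches the notebook's current code."""
    for line in script_text.splitlines():
        if line.startswith(MARKER):
            return line[len(MARKER):].strip() == notebook_code_hash(notebook_json)
    return False  # no marker found: treat the script as stale
```

A hook built on this would reject a push whenever script_is_up_to_date returns False for any notebook/script pair.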
Future work
DVC pipelines as powerful as scikit-learn pipelines
Cross-validation
Hyperparameter tuning
Future work
We want to hear about your use cases!
Contact Us!
sarah_diot-girard@ultimatesoftware.com
stephanie_bracaloni@ultimatesoftware.com