Best Practices

Before starting a machine learning project

Ask yourself

What is your scientific problem?
Can this scientific problem be transformed to machine learning problem?
- Keep the problem simple; if not, decompose it
Do you really have to use machine learning?

Continue asking…

What is the goal of your ML project?
Do you have enough high-quality data?
How do you measure the model performance?
- First design and implement metrics
Do you have good enough infrastructure?
Are there any risks related to privacy and ethics?
- Deon ethics checklist

During doing machine learning

Workflow or pipeline

Having a bad workflow is better than nothing
- Make one and then optimize it

Data

Be very patient with data engineering
Split data to training, validation and test sets
NEVER mix using data:
- training data only for training
- validation data only for validation (picking model)
- test data only for test (estimating generalization performance)
Use common-sense features
Borrow features from state-of-the-art models

Model

Set a baseline performance/model
- use state-of-the-art model
- human performance
- guess it with your experience
Keep your first model simple
Be patient with training
- It is an iterative cycle to improve your model

After training

Versioning

Version your data, code and everything
- using git and github
- MLFlow
- Weights & Biases

Re-train

Retrain the model when possible
- e.g. new data, new features

Thank you

Q&A