Best Practices

Before starting a machine learning project

Ask yourself

  • What is your scientific problem?
  • Can this scientific problem be transformed into a machine learning problem?
    • Keep the problem simple; if it is not simple, decompose it into smaller problems
  • Do you really have to use machine learning?

Continue asking…

  • What is the goal of your ML project?
  • Do you have enough high-quality data?
  • How will you measure model performance?
    • Design and implement the metrics first
  • Do you have good enough infrastructure?
  • Are there any risks related to privacy and ethics?
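
As a rough illustration of "design and implement the metrics first", the sketch below computes precision, recall, and F1 for a binary classification task using only the standard library. The task type and the function names (confusion_counts, precision_recall_f1) are assumptions made for this example; the right metric depends on your scientific problem.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1); each is 0.0 when undefined."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical labels, for illustration only.
    y_true = [1, 0, 1, 1, 0]
    y_pred = [1, 0, 0, 1, 1]
    print(precision_recall_f1(y_true, y_pred))
```

Writing the metric before any model exists means every later model, including the baseline, is scored by the same code.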

During a machine learning project

Workflow or pipeline

  • Even a rough workflow is better than no workflow
    • Make one and then optimize it
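
One possible way to "make one and then optimize it" is to start with a trivially simple chain of stage functions that runs end to end, then improve one stage at a time. The sketch below is only an illustration: run_pipeline and the stages load, clean, and featurize are hypothetical names, and the data is made up.

```python
def run_pipeline(data, stages):
    """Pass data through each stage in order and record which stages ran."""
    log = []
    for stage in stages:
        data = stage(data)
        log.append(stage.__name__)
    return data, log


def load(_):
    # Placeholder loader; a real project would read files or a database.
    return [1.0, 2.0, None, 4.0]


def clean(rows):
    # Drop missing values.
    return [r for r in rows if r is not None]


def featurize(rows):
    # Pair each value with its square as a toy feature.
    return [(r, r * r) for r in rows]


if __name__ == "__main__":
    result, log = run_pipeline(None, [load, clean, featurize])
    print(log)
    print(result)
```

Because each stage is an ordinary function, any single stage can later be replaced by a better one without touching the rest.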

Data

  • Be very patient with data engineering
  • Split the data into training, validation, and test sets
  • NEVER mix the data sets; each has exactly one purpose:
    • training data only for training
    • validation data only for validation (model selection)
    • test data only for testing (estimating generalization performance)
  • Use common-sense features
  • Borrow features from state-of-the-art models
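
The three-way split described above can be sketched with the standard library alone. This is a minimal example under stated assumptions: the function name split_dataset, the 70/15/15 default proportions, and the fixed seed are choices made for illustration, and a plain random split may not suit time series or grouped data.

```python
import random


def split_dataset(samples, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle once, then cut into disjoint train/validation/test lists."""
    if val_fraction < 0 or test_fraction < 0 or val_fraction + test_fraction >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")

    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible

    n_test = round(len(samples) * test_fraction)
    n_val = round(len(samples) * val_fraction)

    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]

    def pick(idx):
        return [samples[i] for i in idx]

    return pick(train_idx), pick(val_idx), pick(test_idx)


if __name__ == "__main__":
    train, val, test = split_dataset(list(range(100)))
    print(len(train), len(val), len(test))
```

Splitting once, up front, and keeping the test set untouched until the very end is what makes the final performance estimate trustworthy.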

Model

  • Set a baseline (a model or a performance level) to compare against:
    • a state-of-the-art model
    • human performance
    • an estimate based on your own experience
  • Keep your first model simple
  • Be patient with training
    • Improving your model is an iterative cycle
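
When no state-of-the-art model or human benchmark is at hand, one simple stand-in baseline for a classification task is to always predict the most frequent training label. The sketch below assumes a classification setting; majority_class_baseline and accuracy are hypothetical helper names, and the labels are made up.

```python
from collections import Counter


def majority_class_baseline(train_labels):
    """Return a predict function that always outputs the most common training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]

    def predict(inputs):
        return [most_common_label for _ in inputs]

    return predict


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


if __name__ == "__main__":
    # Hypothetical labels, for illustration only.
    predict = majority_class_baseline(["a", "a", "b"])
    y_val = ["a", "b", "a", "a"]
    print(accuracy(y_val, predict(range(len(y_val)))))
```

Any first model that cannot beat this kind of trivial baseline on the validation set is a sign that something in the data or the pipeline needs attention.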

After training

Versioning

Retrain

  • Retrain the model when possible
    • e.g. new data, new features

Thank you

Q&A