Mastering Data Preprocessing and Version Control for Robust Machine Learning: A Practical Guide

Jun 4, 2:00 – 5:00 PM

This session focuses on data version control within the standard data science workflow, which means carefully documenting everything we do in our in-silico laboratory. Grab your ticket today!

Community, Meetup, SMS API

About this event


You are starting to become an elder wizard! We've learned that the type of data limits the kinds of analysis we can do (S. S. Stevens, 1946. "On the Theory of Scales of Measurement," Science, Vol. 103, No. 2684). Since then, we've strengthened our data science and machine learning skills by diving into exploratory data analysis (EDA), cross-validation, feature engineering, and modeling. Our goal is to tame data and extract valuable insights. Remember, we live in the matrix, but sometimes we love vectors too, and so do our models. You can find advice from Elder Mage 🧙🏾🚧 here.

This time, we are going to focus on data version control within the standard data science workflow. This means carefully documenting everything we do in our in-silico laboratory (in silico: performed on a computer or via computer simulation). Proper workflow management is crucial, especially in production environments where issues can arise quickly. A good workflow makes it easier to trace what went wrong during machine learning training.


We will introduce the Makefile, a file read by the make tool that organizes a project's workflow into named steps. This session will cover all stages from development to production, including continuous integration (CI) and continuous deployment (CD), all orchestrated from one file. We will also introduce DVC (Data Version Control), a tool that tracks changes in data files, plots, machine learning models, and metrics, so that your workflows are reproducible (the same data and code give the same results) and shareable. This goes a long way toward solving the "it worked on my computer" problem.
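
One way to picture how a data versioning tool can tell that a file has changed is a content fingerprint. The sketch below is not DVC's implementation. It assumes a SHA-256 hash from Python's standard library, and the names file_fingerprint and has_changed are hypothetical helpers made up for illustration.

```python
import hashlib
from pathlib import Path


def file_fingerprint(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large data files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(path: Path, recorded: str) -> bool:
    """Return True when the file no longer matches the recorded fingerprint."""
    return file_fingerprint(path) != recorded
```

Storing the short fingerprint next to the code, rather than the data itself, is what lets a small text file stand in for a large dataset in version control.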

The problem we'll tackle is related to insurance. Using a dataset from Kaggle, a machine learning platform for learning, experimenting, and competing, we will build a predictive model to identify the factors that influence insurance costs. Our workflows will aim to run the entire data science process either in a single pass or in incremental steps, with documentation throughout. We are data storytellers, which is part of being a wizard! 🧙🏿📜🗣
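
Running a workflow in incremental steps usually means redoing a step only when its inputs have changed. The sketch below imitates the timestamp rule that make applies (rebuild a target when it is missing or older than any of its prerequisites). It is an illustration in Python rather than an actual Makefile, and is_stale and run_step are hypothetical names.

```python
from pathlib import Path
from typing import Callable


def is_stale(target: Path, dependencies: list[Path]) -> bool:
    """Return True when the target is missing or older than any dependency."""
    if not target.exists():
        return True
    target_time = target.stat().st_mtime
    return any(dep.stat().st_mtime > target_time for dep in dependencies)


def run_step(target: Path, dependencies: list[Path],
             action: Callable[[], None]) -> bool:
    """Run the action only if the target is stale; report whether it ran."""
    if is_stale(target, dependencies):
        action()
        return True
    return False
```

Chaining several such steps, each with its own target and dependencies, gives a pipeline that skips everything already up to date.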


Jargon simplified:

  • Exploratory Data Analysis (EDA): The process of analyzing data sets to summarize their main characteristics, often using visual methods.
  • Cross-Validation: A technique used to assess how the results of a statistical analysis will generalize to an independent data set.
  • Feature Engineering: Creating new features or transforming existing features to improve the performance of a machine-learning model. In other words, expressing your problem in a simpler way.
  • Continuous Integration (CI): A practice where developers integrate code into a shared repository frequently to identify problems early.
  • Continuous Deployment (CD): An extension of CI, where the code changes are automatically deployed to production after passing the CI pipeline.
  • Reproducibility: The ability to get the same results using the same data and code.
  • Model: A set of rules that represents our data and can be used to make predictions. Credit: Luis Serrano (2020). Grokking Machine Learning. Manning Publications.
  • Machine Learning: Computers making decisions based on previous data. It is common sense, except done by a computer. Credit: Luis Serrano (2020). Grokking Machine Learning. Manning Publications.
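
To make the cross-validation entry above concrete, here is a minimal sketch of k-fold splitting in plain Python. It assumes contiguous, unshuffled folds, and k_fold_indices is a hypothetical helper, not a function from any library. Each sample lands in the held-out fold exactly once, and the remaining samples form the training set for that round.

```python
def k_fold_indices(n_samples: int, k: int) -> list[tuple[list[int], list[int]]]:
    """Split range(n_samples) into k (train, test) index pairs."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # The first `extra` folds get one additional sample when sizes are uneven.
    base, extra = divmod(n_samples, k)
    folds = []
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds
```

In practice the rows are usually shuffled before splitting, and a model is trained and scored once per fold so the scores can be averaged.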

What you'll need:

  1. An Africa's Talking account with an application
  2. A Kaggle account and the Kaggle command line interface (CLI)
  3. A computer. Copilots and agents are welcome too.
  4. Quarto, installed by following the instructions on the Quarto website. Quarto is available for Windows, macOS, and Linux.
  5. A code editor or integrated development environment (IDE) of your choice.
  6. Anaconda, or Miniconda if you are short on disk space. Both manage virtual environments and packages for Python, R, and Julia. We will use Python 3.12.


Join us at this month's meetup to become an Elder Mage 🧙🏿 too. We will be closing registration early, so grab your friends and come learn to be dangerous.


Outline

  • Part 1: The Problem (40 minutes)
  • Break (5 minutes)
  • Part 2: Hands-on with DVC & Makefiles (90 minutes)
  • Break (5 minutes)
  • Part 3: Q&A and Further Exploration (40 minutes)

Gigs:

We would love to reach out to you so that you can build for our customers. Please fill out this form so that we have your details:

GIG/HACK DEVELOPER PORTFOLIO FORM

Join community channels:

Africa's Talking AI/ML Community:

Slack:

Please follow our Twitter handles too:

You can find our videos, recaps, and event interviews on our YouTube channels. Subscribe to get updates:

The Africa's Talking community helps developers learn the skills a modern-day African developer needs. We are language and framework agnostic, and all developers are welcome. This is where the Africa's Talking developer community meets to build, learn, and exchange knowledge.

We help software developers and businesses bring their ideas to life through easy-to-use APIs.

Would you like to partner with us? Kindly contact the Developer Experience Team.

Speaker

  • Mainye Ben

    Africa's Talking LTD

    Data Scientist & Maker

Featured Attendees

  • Wenslous Otema

    AI Center of Excellence

    Senior AI Engineer

  • Enos Ngetich

    MMUST

    Student

  • Danfold Mosongo

    CANDIT

    CEO-FOUNDER

  • Joseph Kitipai Mpaapa

    University of Nairobi

    Student

When

Tuesday, June 4, 2024
2:00 PM – 5:00 PM UTC

Agenda

Arrival and Introduction
The Problem
Break
Hands-on with DVC & Makefiles
Break
Q&A and Further Exploration
Closing Remarks and Networking

Host

  • Josphat Mwangi

    Africa's Talking

    Community Lead Nairobi

Organizers

  • Mainye Ben

    Africa's Talking

    Data Scientist

  • Sylvia Jebet Kipkemoi

    Dev Rel Associate, AT Women in Tech Co-Lead

  • Josphat Mwangi

    Co-Lead AT AI/ML Community

Global sponsor

Elarian