@def title = "Data Science Tutorials in Julia"
This website offers tutorials for MLJ.jl and related packages.
The code in each tutorial is tested to work reliably under these two conditions:
- You are running Julia 1.7.x, where "x" is any integer (to check, enter `VERSION` at the REPL).
- You have activated and instantiated the associated package environment.
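The first condition can also be checked programmatically. The following is a minimal sketch; `is_tested_version` is a hypothetical helper name introduced here for illustration, not something provided by Julia or MLJ.

```julia
# Hypothetical helper: true when `v` belongs to the 1.7.x series
# that the tutorials were tested against.
is_tested_version(v::VersionNumber) = v.major == 1 && v.minor == 7

if is_tested_version(VERSION)
    println("Julia $VERSION is in the tested 1.7.x series.")
else
    println("Julia $VERSION is outside the tested 1.7.x series.")
end
```

Only the major and minor components are compared, since any patch release "x" is acceptable.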
To make the tutorial-specific environment available to you, first download (and decompress) the "project folder" that is linked near the top of the tutorial. How you proceed next depends on your chosen mode of interaction:
At the Julia REPL (recommended for new Julia users): activate and instantiate the correct environment by entering this code at the `julia>` prompt:
using Pkg; Pkg.activate("Path/To/Project/Folder"); Pkg.instantiate()
Replace `"Path/To/Project/Folder"` with the actual path to the downloaded project folder. This can be just `"."` if Julia was launched from the command line with the project folder as the current directory.
This might take a few minutes for some tutorials, as packages may need to be installed and precompiled.
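A mistyped path can go unnoticed, because `Pkg.activate` will generally accept a folder that contains no project and treat it as a new, empty environment. One way to guard against this is to check for a `Project.toml` file first. This is only a sketch: `activate_tutorial` is a hypothetical helper name, and it assumes the downloaded project folder contains a `Project.toml` file.

```julia
using Pkg

# Hypothetical helper (not part of Pkg): activate and instantiate the
# environment in `project_dir`, but only if it looks like a Julia project.
function activate_tutorial(project_dir::AbstractString)
    project_file = joinpath(project_dir, "Project.toml")
    isfile(project_file) || error("No Project.toml found in $project_dir")
    Pkg.activate(project_dir)
    Pkg.instantiate()              # may install and precompile packages
    return Base.active_project()   # path of the now-active Project.toml
end

# Usage, with your own path substituted:
# activate_tutorial("Path/To/Project/Folder")
```

The returned path lets you confirm that the active environment is the one belonging to the tutorial.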
In a Jupyter notebook: the downloaded project folder contains a Jupyter notebook called `tutorial.ipynb`. See the IJulia documentation on how to launch it. Copy and execute the activation code given above in a new notebook cell before evaluating any other cells.
In your IDE (e.g., VS Code or Emacs), open the file called `tutorial.jl` in the downloaded project folder, and activate/instantiate by first running the code fragment given above.
Please report issues here. For beginners, the most common issues arise from an incorrect Julia version or an incorrect package environment, so be sure you have tried the instructions above before raising an issue.
If you need to use an earlier version of Julia, you can try deleting the `Manifest.toml` file contained in the project folder and, with the project environment activated, running `using Pkg; Pkg.instantiate()` to generate a new package environment. Note that the exact package versions will then differ from those used to test the tutorial and to generate the output shown on the tutorial web page.
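The manifest-regeneration steps can be sketched as follows. Both function names are hypothetical helpers introduced for illustration, and the sketch assumes the project folder still contains its `Project.toml` file.

```julia
using Pkg

# Hypothetical helper: delete Manifest.toml from `project_dir` if present.
# Returns true if a file was removed, false otherwise.
function remove_manifest(project_dir::AbstractString)
    manifest = joinpath(project_dir, "Manifest.toml")
    isfile(manifest) || return false
    rm(manifest)
    return true
end

# Hypothetical helper: rebuild the environment from Project.toml alone.
function rebuild_environment(project_dir::AbstractString)
    remove_manifest(project_dir)
    Pkg.activate(project_dir)
    Pkg.instantiate()   # resolves package versions and writes a new Manifest.toml
end

# Usage, with your own path substituted:
# rebuild_environment("Path/To/Project/Folder")
```

Because the versions are resolved afresh, results may differ from the output shown on the tutorial page.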
If you have some programming experience but are otherwise fairly new to data processing in Julia, you may appreciate the following few tutorials before moving on. In these we provide an introduction to some of the fundamental packages in the Julia data processing universe such as DataFrames, CSV and CategoricalArrays.
- How to load data
- Short intro to dataframes
- Dealing with categorical data
- Specifying data interpretation
If you are new to MLJ but are familiar with Julia and with Machine Learning, we recommend you start by going through the short Getting started examples in order:
- How to choose a model
- How to fit, predict and transform
- How to tune models
- How to ensemble models
- How to ensemble models (2)
- More on ensembles
- How to compose models
- How to build a learning network
- How to create models from learning networks
- An extended tutorial on stacking
Additionally, you can refer to the documentation for more detailed information.
This is a sequence of tutorials adapted from the labs associated with An Introduction to Statistical Learning, which were originally written in R. These tutorials may be useful if you want a gentle intro to MLJ and other relevant tools in the Julia environment. If you're fairly new to Julia and ML, this is probably where you should start.
Note: the adaptation is fairly liberal, adding content when it helps highlight specifics of MLJ and removing content when it seems unnecessary. Also note that some of the things used in the ISL labs are not (yet) supported by MLJ.
- Lab 2, a very short intro to Julia for data analysis
- Lab 3, linear regression and metrics
- Lab 4, classification with LDA, QDA, KNN and metrics
- Lab 5, k-fold cross-validation
- Lab 6b, Ridge and Lasso regression
- Lab 8, Tree-based models
- Lab 9, SVM (partial)
- Lab 10, PCA and clustering (partial)
These are examples that are meant to show how MLJ can be used from loading data to producing a model. They assume familiarity with Machine Learning and MLJ.
Note that these tutorials are not meant to teach you ML or Data Science. There may be better ways to analyse the data; the primary aim is to show a quick analysis so that you can get more familiar with using MLJ.
The examples can be followed in any order; the tags can guide you as to which tutorials you may want to look at first.
- Telco Churn (MLJ for Data Scientists in Two Hours), intermediate, classification, one-hot, ROC curves, confusion matrices, feature importance, feature selection, controlling iteration, tree booster, hyper-parameter optimization (tuning)
- AMES, simple, regression, one-hot, learning network, tuning, deterministic
- Wine, simple, classification, standardizer, PCA, knn, multinomial, pipeline
- Crabs XGB, simple, classification, xg-boost, tuning
- Horse, simple, classification, scientific type and autotype, missing values, imputation, one-hot, tuning
- King County Houses, simple, regression, scientific type, tuning, xg-boost
- Airfoil, simple, regression, random forest
- Boston LGBM, intermediate, regression, LightGBM
- Boston Flux, intermediate, regression, Flux, Neural Network
- Using GLM.jl, simple, regression
- Power Generation, simple, feature pre-processing, regression, temporal data
- Breast cancer, simple, model comparisons, binary classification