This repository contains code and data for the paper "Relational Data Embeddings for Feature Enrichment with Background Information".
1) The folder `KEN` contains the implementation of our approach, KEN, as described in the paper. It includes:
- `KEN/models/entity_embedding`: classes for the TransE, DistMult and MuRE knowledge-graph embedding models (based on the PyKEEN package).
- `KEN/models/numerical_embedding`: classes implementing our approach (a linear layer with ReLU activation, see `linear2.py`) and a binning approach to embed numerical values (`binning.py`); the linear idea is sketched after this list.
- `KEN/sampling/pseudo_type.py`: an adaptation of PyKEEN's PseudoTypedNegativeSampler, which replaces head entities with a random entity occurring in the same relation (corruption logic sketched below).
- `KEN/training/hpp_trainer.py`: a class to train embedding models with or without KEN, possibly over multiple hyperparameter configurations. It also measures the time and memory needed for training, and saves the results in a .parquet file (bookkeeping pattern sketched below).
- `KEN/baselines/dfs.py`: a class to perform Deep Feature Synthesis using the implementation from featuretools. It also measures the time/memory needed, and the number of generated features (see the DFS sketch below).
- `KEN/evualation/prediction_scores.py`: a set of functions to compute the cross-validation scores of embeddings / deep features on a target dataset (see the scoring sketch below).
- `KEN/dataloader/dataloader.py`: a class to load triples in the .npy format and convert them to a TriplesFactory object that can be used by PyKEEN.
- `KEN/dataloader/make_triples.py`: a function that takes tables/knowledge graphs as input and turns them into a set of triples saved in the .npy format (a round-trip sketch follows this list).
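
To make the numerical-embedding idea concrete, here is a minimal sketch of a linear layer with ReLU activation that maps scalar values to embedding vectors. The class name and interface are illustrative, not the actual API of `linear2.py`:

```python
import torch
import torch.nn as nn

class LinearNumericalEmbedding(nn.Module):
    """Embed a scalar value x as ReLU(W x + b).

    Illustrative sketch of the idea described in the paper, not the
    actual class from KEN/models/numerical_embedding/linear2.py.
    """

    def __init__(self, embedding_dim: int):
        super().__init__()
        # Map a single scalar to an embedding_dim-dimensional vector.
        self.linear = nn.Linear(1, embedding_dim)

    def forward(self, values: torch.Tensor) -> torch.Tensor:
        # values: shape (batch,) -> (batch, 1) -> (batch, embedding_dim)
        return torch.relu(self.linear(values.unsqueeze(-1)))

emb = LinearNumericalEmbedding(embedding_dim=8)
print(emb(torch.tensor([0.5, 2.0, -1.0])).shape)  # torch.Size([3, 8])
```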
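The pseudo-typed corruption in `pseudo_type.py` can be summarized as: replace the head of each positive triple with a random entity seen as a head of the same relation. A self-contained sketch of that logic, independent of PyKEEN (the function name is ours):

```python
import numpy as np

def corrupt_heads(triples: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """For each (head, relation, tail) triple, replace the head with a random
    entity appearing as a head of the same relation elsewhere in the graph.

    Illustrative sketch of the pseudo-typed corruption idea, not the
    PyKEEN-based implementation in KEN/sampling/pseudo_type.py.
    """
    # Candidate heads per relation.
    heads_by_relation = {
        r: triples[triples[:, 1] == r, 0] for r in np.unique(triples[:, 1])
    }
    negatives = triples.copy()
    for i, (_, r, _) in enumerate(triples):
        negatives[i, 0] = rng.choice(heads_by_relation[r])
    return negatives

triples = np.array([[0, 0, 5], [1, 0, 6], [2, 1, 7]])  # (head, relation, tail) ids
print(corrupt_heads(triples, np.random.default_rng(0)))
```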
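The time/memory bookkeeping done by `hpp_trainer.py` follows a common pattern: time the training call, record peak memory, and write the metadata to a .parquet file. A hedged sketch of that pattern only (the real trainer is more involved; `tracemalloc` tracks Python-level allocations, and writing .parquet requires pyarrow or fastparquet):

```python
import time
import tracemalloc
import pandas as pd

def train_with_metrics(train_fn, out_path: str, **hparams):
    """Run a training function, record wall-clock time and peak memory,
    and save the metadata to a .parquet file. Illustrative sketch only."""
    tracemalloc.start()
    t0 = time.perf_counter()
    train_fn(**hparams)                       # the actual training loop
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    row = {"time_s": elapsed, "peak_mem_mb": peak / 1e6, **hparams}
    pd.DataFrame([row]).to_parquet(out_path)

# Toy "training" function standing in for a real embedding model.
train_with_metrics(lambda lr: sum(range(10**6)), "metrics.parquet", lr=0.01)
```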
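Deep Feature Synthesis with featuretools, as used in `dfs.py`, aggregates child-table attributes up to a target table. A toy sketch assuming the featuretools 1.x API (the tables and column names are made up):

```python
import featuretools as ft
import pandas as pd

# Two toy tables: counties and their cities.
counties = pd.DataFrame({"county_id": [0, 1], "state": ["A", "B"]})
cities = pd.DataFrame({"city_id": [0, 1, 2],
                       "county_id": [0, 0, 1],
                       "population": [1000, 2500, 300]})

es = ft.EntitySet(id="toy")
es = es.add_dataframe(dataframe_name="counties", dataframe=counties, index="county_id")
es = es.add_dataframe(dataframe_name="cities", dataframe=cities, index="city_id")
es = es.add_relationship("counties", "county_id", "cities", "county_id")

# Aggregate city attributes up to the county level, e.g. MEAN(cities.population).
features, defs = ft.dfs(entityset=es, target_dataframe_name="counties")
print(features.columns.tolist())
```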
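Computing cross-validation scores of embeddings on a target dataset amounts to joining one embedding row per entity with the target and cross-validating a predictor. A minimal scikit-learn sketch with synthetic stand-in data (the estimator and scoring choice here are ours, not necessarily those used in the paper):

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Toy stand-ins: one embedding row per target entity, plus target values.
X = np.random.default_rng(0).normal(size=(100, 16))  # entity embeddings
y = 2.0 * X[:, 0] + 0.1                              # synthetic target

scores = cross_val_score(HistGradientBoostingRegressor(), X, y,
                         cv=5, scoring="r2")
print(scores.mean())
```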
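The two `KEN/dataloader` utilities form a round trip: turn tables into (head, relation, tail) triples saved as .npy, then load them back into a PyKEEN TriplesFactory. A minimal sketch of that round trip (the toy triples are illustrative):

```python
import numpy as np
from pykeen.triples import TriplesFactory

# Build a tiny set of labeled (head, relation, tail) triples and save it,
# standing in for the output of KEN/dataloader/make_triples.py.
rows = [("Paris", "capital_of", "France"), ("Berlin", "capital_of", "Germany")]
np.save("triples.npy", np.array(rows, dtype=str))

# Load the triples back and build a TriplesFactory, as dataloader.py does.
loaded = np.load("triples.npy")
factory = TriplesFactory.from_labeled_triples(loaded)
print(factory.num_entities, factory.num_relations)
```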
2) The folder `experiments` contains the datasets and the code to run our experiments.
- `experiments/model_training`: code to train embedding models (TransE, DistMult, MuRE, RDF2Vec), save them as checkpoints during training, and store metadata about the checkpoints (parameters, time/memory complexity) in a .parquet file.
- `experiments/deep_feature_synthesis`: code to perform Deep Feature Synthesis, save the generated features, and store metadata (time/memory complexity, number of features) in a .parquet file.
- `experiments/manual_feature_engineering`: code to manually build features and store them in .parquet files.
- `experiments/prediction_scores`: code to compute cross-validation scores for all methods under study, and store the results (scores, time complexity) in .parquet files.
- `experiments/attribute_reconstruction`: code to compute cross-validation scores when reconstructing numerical attributes of entities (e.g. county population) from their embeddings. The results are stored in a .parquet file.
- `experiments/embedding_visualization`: code to visualize in 2D the MuRE and MuRE + KEN embeddings trained on YAGO3 (a projection sketch follows this list).
- `experiments/results_visualization`: a set of functions to visualize the results of the experiments.
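
Projecting high-dimensional entity embeddings to 2D can be done with any standard dimensionality reduction; the actual visualization code may use a different method, but a PCA-based sketch looks like this (the embeddings below are random stand-ins for trained MuRE embeddings):

```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 200))  # stand-in for trained embeddings

# Project to 2D and plot one point per entity.
coords = PCA(n_components=2).fit_transform(embeddings)
plt.scatter(coords[:, 0], coords[:, 1], s=4)
plt.title("Entity embeddings projected to 2D")
plt.savefig("embeddings_2d.png")
```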
3) The datasets used in our experiments are available here in the form of a zip file. The unzipped `datasets` folder should be placed in `experiments`. For each dataset `xxx`, `experiments/datasets/xxx` contains:
- a file `target.parquet` that contains the entities of interest (e.g. counties, cities) and the target to predict.
- a folder `triplets` that contains the training triples in .npy format and their metadata.
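
Both files load with standard tools. A quick sketch; `xxx` stands for any dataset folder, and the triple file name inside `triplets` is a placeholder, since the exact name varies per dataset:

```python
import numpy as np
import pandas as pd

# Entities of interest and the prediction target.
target = pd.read_parquet("experiments/datasets/xxx/target.parquet")
print(target.head())

# Training triples stored as a .npy array ("triples.npy" is a placeholder name).
triples = np.load("experiments/datasets/xxx/triplets/triples.npy", allow_pickle=True)
print(triples.shape)
```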
- Install KEN using the `setup.py` file.
- Run the experiments, in this order: `model_training`, `deep_feature_synthesis`, then `prediction_scores` and `attribute_reconstruction`.
- To avoid re-running the experiments, we provide the result files used in the paper. You can visualize them with the functions from `experiments/results_visualization/results_visualization.py`.