Data pipeline to collect, analyse and store text messages from social media platforms.
Synopsis: a Dockerized Python application that collects text messages from X and Telegram, translates them using Azure AI Translator, classifies them using custom models, and saves them in Argilla, which is used to validate the classifications.
- Ensure the files pyproject.toml and poetry.lock are available at the root. The latter is for caching dependencies.
- ODBC Driver for SQL Server, if storing data in Microsoft Azure SQL Server
- Install Python Poetry
- Edit the config file to your needs using the template config-template.yaml in the config folder. Save it as config.yaml in the same folder.
- Run the command:
poetry run python -m telegram_pipeline --country <someCountry>
where <someCountry> is a country name in the YAML file.
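The setup steps above can be checked before the first run. The following is a minimal sketch, not part of the pipeline itself: the helper name `missing_setup_files` is hypothetical, and it assumes the config folder sits directly under the project root.

```python
from pathlib import Path

# Paths taken from the setup steps above; placing the config folder
# directly under the project root is an assumption.
REQUIRED_FILES = (
    "pyproject.toml",
    "poetry.lock",
    "config/config.yaml",
)


def missing_setup_files(root="."):
    """Return the required files that are absent under `root`, in order."""
    root_path = Path(root)
    return [name for name in REQUIRED_FILES if not (root_path / name).is_file()]


if __name__ == "__main__":
    missing = missing_setup_files()
    if missing:
        print("Missing before running the pipeline: " + ", ".join(missing))
    else:
        print("All setup files found.")
```

Run from the root directory, it lists whichever of the three files still need to be created.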
- Install Docker
- Build the docker image from the root directory
docker build -t rodekruis/social-media-listening .
- Run the dockerised pipeline in 2 ways:
- With default configurations:
docker run -it rodekruis/social-media-listening --country <someCountry>
- Or enter the docker image interactively for more run options (such as running for specific countries one by one):
docker run -it --entrypoint /bin/bash rodekruis/social-media-listening
Then run in the opened container:
poetry run python -m telegram_pipeline --country <someCountry>
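Running for several countries one by one can also be scripted from the host. This is a rough sketch under stated assumptions: the helper names `build_run_command` and `run_countries` are hypothetical, the image is assumed to be already built with the tag used above, and `-it` is dropped on the assumption that the script runs without a terminal attached.

```python
import subprocess

# Image tag from the build step above.
IMAGE = "rodekruis/social-media-listening"


def build_run_command(country):
    """Build the `docker run` command for one country.

    `-it` is omitted on the assumption that no terminal is attached.
    """
    return ["docker", "run", IMAGE, "--country", country]


def run_countries(countries):
    """Run the pipeline for each country in turn, stopping at the first failure."""
    for country in countries:
        subprocess.run(build_run_command(country), check=True)


# Example call with placeholder country names (they must exist in config.yaml):
# run_countries(["CountryA", "CountryB"])
```

Because `check=True` raises on a non-zero exit code, a failure for one country stops the loop instead of silently continuing to the next.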