YouTube Automated Collection Process
This document is intended to help you set up the YouTube data collection tool on your local or virtual machine. The tool uses YouTube's Data API to collect comments, videos, and channel information. It is built with Python and multi-threaded to increase throughput. An API key is used to authenticate each request to the API.
Before we continue, ensure you have the following set up on your system:
- An IDE, e.g., VS Code
- MySQL Workbench (optional)
- Anaconda or Miniconda (required for the conda commands below)

The crawler supports the following operations:
- Get videos based on a search keyword
- Get channels of videos based on search keywords
- Calculate daily engagement metrics starting from the initial crawl date
- Get related videos
- Crawl video and channel information from a list of given IDs
Clone the repository into your desired folder and change into it:
- git clone https://github.com/COSMOS-UALR/YouTubeDataCollection.git
- cd YouTubeDataCollection
Create the virtual environment from the environment.yml file and activate it:
- conda env create -f environment.yml
- conda activate daily_crawler
Adding a task/crawl job
A task/crawl job is a project that contains run information required by the crawler and is mapped to actual research projects being conducted. This information is stored under the `task_test` table in the `crawler_task` schema of the COSMOS-DB. Below is the list of parameters used to define the crawl job.
- task_id: Unique ID of the project
- task_name: Name of the project
- channels: List of channel IDs to crawl
- videos: List of video IDs to crawl
- keywords: List of YouTube-formatted search keywords used to find videos relating to those keywords
- videos_daily: List of videos for which to collect daily engagement and production metrics, e.g., number of likes and views
- get_related_videos: Boolean; whether to collect related videos from crawled videos
- get_comments: Boolean; whether to collect comments on videos
- channel_by_keyword: Boolean; whether to collect channel information for videos that match the search keywords
- crawled_from: Only collect content published after the specified date
To add a task to the DB, open `create_task.py` and fill in the above parameters, then run the script (e.g., `python create_task.py`).
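As a rough illustration (the exact variable names and structure expected by `create_task.py` may differ), a filled-in task could look like the sketch below; the IDs shown are placeholders, not real YouTube IDs:

```python
# Illustrative parameter values only; not the actual contents of create_task.py.
task = {
    "task_id": 1,
    "task_name": "example_project",
    "channels": ["UC_EXAMPLE_CHANNEL_ID"],      # channel IDs to crawl
    "videos": ["EXAMPLE_VIDEO_ID"],             # video IDs to crawl
    "keywords": ["example search phrase"],      # YouTube-formatted search keywords
    "videos_daily": ["EXAMPLE_VIDEO_ID"],       # videos to track daily engagement for
    "get_related_videos": True,
    "get_comments": True,
    "channel_by_keyword": False,
    "crawled_from": "2023-01-01",               # only content published after this date
}
```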
Confirm that the task was successfully added to the `crawler_task.task_test` table (e.g., in MySQL Workbench).
Checking Task Status
The crawler uses the `crawled_time` column in the `crawler_task.task_test` table on the COSMOS-DB to keep track of the last time a project was crawled. To check the progress of a crawl job, open the Log_<date>.txt file in the Logs folder for the date in question.
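For a quick look at today's progress, something like the short sketch below works; note that the date format used in the log file name is an assumption here and may differ:

```python
# Minimal sketch: print the last few lines of today's crawler log.
# The date format in the log file name is assumed and may differ.
from datetime import date
from pathlib import Path

log_file = Path("Logs") / f"Log_{date.today():%Y-%m-%d}.txt"
if log_file.exists():
    print("\n".join(log_file.read_text().splitlines()[-20:]))
else:
    print(f"No log found at {log_file}")
```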
Adding API Keys
To add an API key to the crawler, open the `API_Keys/API_Master_keys.txt` file and add each key on its own line.
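For example, the file contents would simply be one key per line (placeholders shown, not real keys):

```
FIRST_YOUTUBE_API_KEY
SECOND_YOUTUBE_API_KEY
THIRD_YOUTUBE_API_KEY
```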
Processing Crawl Jobs
Execute `run_crawler.py` (e.g., `python run_crawler.py`) to process all the tasks added to the `crawler_task.task_test` table. The data collection tool runs asynchronously. Depending on the parameters specified for a task when it was created, it will get the comments on a video, get new videos by search keywords, get all the videos on a channel, get metadata for specified videos, get channel information, and get channel and video daily engagement stats.
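For reference, the kind of per-video request the tool issues against the YouTube Data API looks roughly like the standalone snippet below, which uses the `google-api-python-client` package; the crawler's internal code may structure its calls differently, and the key and video ID are placeholders:

```python
# Standalone illustration of a YouTube Data API request for video metadata and
# engagement statistics. Placeholders only; the crawler's internals may differ.
from googleapiclient.discovery import build

API_KEY = "YOUR_API_KEY"        # one of the keys from API_Keys/API_Master_keys.txt
VIDEO_ID = "EXAMPLE_VIDEO_ID"   # placeholder video ID

youtube = build("youtube", "v3", developerKey=API_KEY)
response = youtube.videos().list(part="snippet,statistics", id=VIDEO_ID).execute()

for item in response.get("items", []):
    print(item["snippet"]["title"], item["statistics"].get("viewCount"))
```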
If a task is being executed for the first time, a new schema is created for that task/job in the COSMOS-DB, using the task name as the schema name, along with the tables required for that task.
The schema definition used to create each table can be found under the `Sql_template` folder.
The tables in the schema representing a task are then populated with the crawled data.
There are two common problems you may face with the crawler:
- Insufficient API keys to handle all the crawl jobs: some tasks may never start because the available API quota has been exhausted.
- Irrelevant data from YouTube search queries when the keywords passed are ambiguous or imprecise.