Twitter Automated Collection Process

Access

You can access and clone the Twitter crawler here:

https://github.com/COSMOS-UALR/TwitterCrawler

Highly recommended: a great point of reference before executing the crawler is the README file, which provides a high-level overview of all the functionality within the Twitter crawler.

Purpose

The Twitter crawler extracts data from Twitter using Twitter's API, which enables programmatic access to Twitter's data such as posts, users, and followers. This data is later dumped into our Cosmos DB or into user files (txt, csv) for different research purposes.

IMPORTANT

NEVER use an Excel file to store Twitter data. Excel natively converts fields containing only digits to a number field, which means a Twitter ID such as '123456879' will be permanently encoded in scientific notation. This leads to trailing zeros and other side effects.
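
If you do need to open exported data for inspection, a safer pattern is to force ID columns to be read as strings. Below is a minimal sketch using pandas; the file and column names are hypothetical examples.

import pandas as pd

# Read every column as a string so numeric-looking IDs are never coerced
# into floats or scientific notation; "tweets.csv" is a placeholder file.
df = pd.read_csv("tweets.csv", dtype=str)

# Or restrict the string dtype to the ID columns only (hypothetical names):
df = pd.read_csv("tweets.csv", dtype={"tweet_id": str, "user_id": str})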

Setup 1: Basic Setup

  1. Create a Twitter Developer account at https://developer.twitter.com
  2. Clone the repo.
  3. Install the requirements, preferably in a virtual environment at the root of the project.

Setup 2: Connecting The Server

To connect to a running Keys API server, you need to set the API_KEY_HOST environment variable like so:

# Env Vars
import os

os.environ["API_KEY_HOST"] = "144.167.34.18:9090"

This environment variable must be set before running the crawler; it must be the first thing the script sets. Your host (shown above) may change over time, so if you are unable to connect, check that VM to see whether the server is running in the task scheduler.
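
As an illustration of that ordering, a small launcher script might look like the sketch below. The use of runpy to run crawler.py is an assumption about how you drive the crawler; adjust it to your setup.

import os
import runpy

# Set the key-server host BEFORE any crawler code runs.
os.environ["API_KEY_HOST"] = "144.167.34.18:9090"  # host may change over time

# Only after the variable is set should crawler.py be executed
# (hypothetical wrapper; you can also place the two lines above at the
# very top of crawler.py itself).
runpy.run_path("crawler.py", run_name="__main__")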

Setup 3: Running The API Server

A vital part of the setup process is ensuring that your Keys API server is running correctly. Managing the Twitter API keys is very important: to collect Twitter data, we must make use of all the Twitter keys without exceeding their rate limits. The README file has an excellent explanation of this and its importance. You can also learn more about the Twitter developer quota here: Academic Research Quota

Twitter places quota limits on the amount of data researchers can access each month. This is something to bear in mind when running the Twitter crawler.

Before we can crawl any data, we need to start our Keys API server. This is an essential step: if it is missed, the crawler will not fetch any data. It can be done by simply running keys_api_server.py.
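
For example, the key server can be started from the repo root before any crawl job runs. The sketch below assumes keys_api_server.py sits at the project root and is launched as a child process; on the VM it may instead run as a scheduled task.

import subprocess
import sys

# Start the Keys API server; it must stay up for the whole crawl session.
server = subprocess.Popen([sys.executable, "keys_api_server.py"])

# ... run crawl jobs here ...

# Stop the server once crawling is finished (skip this if it runs as a
# long-lived service or scheduled task on the VM).
server.terminate()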

Running the script

Crawling Twitter data is organized around projects, or sources. These projects are based on the research requirements of different research teams.

Example: there may be ongoing research on posts about inflation rates and unemployment, so the corresponding crawling project will be based solely on that research.

Crawling projects are created by running create_source.py and specifying:

  1. Source_id
  2. Name of project
  3. Description
  4. Date range
  5. Search query

This information is stored in the database under the source table. Because Twitter data is historical, your specified date range can never be in the future. For accuracy, it is good practice to set your newest (most recent) date to the day before the current date.

E.g. if you want all data up to today, you should stop your data collection at the day before, i.e. yesterday.
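
Conceptually, the values you pass to create_source.py make up a source record like the sketch below. The field names and the dictionary form are illustrative assumptions, not the script's actual interface; check create_source.py for how it really accepts these values.

from datetime import date, timedelta

# End collection the day before today, per the guidance above.
yesterday = date.today() - timedelta(days=1)

# Hypothetical illustration of the information a crawl project carries.
source = {
    "source_id": 1234,                      # example source ID
    "name": "Inflation and unemployment",   # name of project
    "description": "Posts about inflation rates and unemployment",
    "start_date": "2023-01-01",             # example start of the date range
    "end_date": yesterday.isoformat(),      # never a date in the future
    "query": '"inflation rate" OR unemployment',
}
print(source)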

Your search query can be based on a set of params, namely search, user, and meta (a combination of both). Please refer to the README file for details on how to set up these queries.

Once you have created your project by running create_source.py, you can run crawler.py, which is the main file for orchestrating crawl jobs. Inside the main() function you can set which types of crawlers will run when the script runs.

crawler.py provides access to the Botometer class and the FriendsFollowers class for different ways of collecting bot-scores and a user's friends and followers (friends = the people the user is following).
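
As an illustration of the pattern, main() typically toggles which collection jobs run. The snippet below is only a hypothetical sketch: the stub classes and the run() method are placeholders so it runs standalone, and the real Botometer and FriendsFollowers interfaces in crawler.py may differ.

class Botometer:                  # stand-in for the repo's Botometer class
    def run(self):
        print("collecting bot-scores...")

class FriendsFollowers:           # stand-in for the repo's FriendsFollowers class
    def run(self):
        print("collecting friends (accounts a user follows) and followers...")

RUN_BOTOMETER = True              # toggle which crawlers run
RUN_FRIENDS_FOLLOWERS = False

def main():
    if RUN_BOTOMETER:
        Botometer().run()
    if RUN_FRIENDS_FOLLOWERS:
        FriendsFollowers().run()

if __name__ == "__main__":
    main()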

Specific use-cases and edge-cases (such as challenges with collecting data for the ASU teams)

Collecting data with the Twitter API requires precise syntax: like SQL, Python, or Java, the Twitter API has a specific query language that it understands. This means that when collecting data, you first need to translate whatever search you are specifying into Twitter API query language before giving it to the COSMOS Twitter crawler. Below are the main operators used in the Twitter API query language:

LOGIC | WHAT YOU WANT | TWITTER API TRANSLATION
AND | Specified by an empty space, e.g. "Tom and John" | Tom John
OR | Specified by an explicit OR, e.g. "Tom or John" | Tom OR John
QUOTATION | Used when referencing verbatim phrases, e.g. "His name is Peter Maloney, he is a SWE at Apple" | If you want exactly Peter Maloney, your search would be: His name is "Peter Maloney" he is SWE Apple. NB: the power of quotation marks is matching exact phrases, so for the most specific results you can search "His name is Peter Maloney, he is a SWE at Apple".

These logical operators help us cut down on extracting noisy data. For more in-depth explanations, see:

https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query
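
To make this concrete, here is a small example of translating a plain-language request into a Twitter API query string. The -is:retweet operator, which excludes retweets, is part of the standard search syntax documented at the link above.

# "Posts containing the exact phrase 'inflation rate' or the word
#  unemployment, excluding retweets" translated into Twitter API syntax.
query = '("inflation rate" OR unemployment) -is:retweet'

# AND  -> a space:         inflation unemployment
# OR   -> an explicit OR:  inflation OR unemployment
# "x"  -> an exact phrase: "inflation rate"
print(query)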

User Requirements

Understanding what users need for the research project is important when collecting any data. To limit the number of times we rerun a Twitter crawl for the same task, it is important to ask user-requirement questions. Here are a few to get started.

  1. Why are we collecting this data?
  2. What exactly (in specific terms) are we searching for? Is this a post search or a user search?
  3. Is this a user-based search or a key-phrase search? This will tell you which search param is needed for your crawling job.
  4. What time period are we crawling for?
  5. Is there anything specific we do not want to fetch? E.g. retweets.

Once you have a thorough understanding of your user's needs (the user being the team or researcher you are collecting data for), you can run the crawler.

Cosmos DB

The crawler will pull data into the Cosmos DB (the current connection) or any other database it is connected to. Once the data is in the database, you can run simple SQL queries to extract it. Below is a template for extracting the data.

Here, fields are the columns you want from the database, table is a table within your specified schema, and conditions are the filters you want to apply to your search. This is a great place to add dates and a source_id to narrow your search.

Here is a great tutorial on MySQL queries: https://www.tutorialspoint.com/mysql/mysql-where-clause.htm

SELECT field1, field2, ... fieldN FROM table WHERE condition1 AND [OR] condition2

Sometimes running MySQL queries can become slow, depending on the complexity of your search. A good workaround is connecting with SQLAlchemy in Python to execute the search. Below is an example of setting up a connection to the MySQL database; substitute your own connection credentials.
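
A minimal sketch of that connection, assuming a MySQL backend and the pymysql driver; the credentials, table, and column names are placeholders to substitute with your own.

import pandas as pd
from sqlalchemy import create_engine

# Placeholder credentials -- substitute your own connection details.
engine = create_engine("mysql+pymysql://USER:PASSWORD@HOST:3306/SCHEMA")

# Example extraction query; the table and column names are illustrative only.
query = """
    SELECT tweet_id, text, created_at
    FROM tweets
    WHERE source_id = 1234
      AND created_at BETWEEN '2023-01-01' AND '2023-12-31'
"""
df = pd.read_sql(query, engine)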

Here is a link to the SQLAlchemy docs for more info:

https://docs.sqlalchemy.org/en/14/dialects/mysql.html

Python's pandas and NumPy are great resources for extracting and manipulating data. Once you have filtered the data to your needs, you can export it directly to csv, json, txt, or Excel format. Text and JSON files are highly recommended as they preserve the original formats.
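
For example, once the filtered data is in a pandas DataFrame, exporting to the recommended formats is a one-liner; the tiny DataFrame and file names below are placeholders.

import pandas as pd

# Small illustrative DataFrame standing in for the filtered crawl results.
df = pd.DataFrame({"tweet_id": ["123456879"], "text": ["example post"]})

# JSON and CSV/TXT exports keep the ID strings exactly as written, unlike Excel.
df.to_json("tweets.json", orient="records", lines=True)
df.to_csv("tweets.csv", index=False)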
