COVID Stream Tweet IDs

Welcome to the COVID-19 Twitter Pandemic Archive. The archive is a catalog of datasets containing billions of Tweet IDs for COVID-19 related tweets and a set of data visualizations that display high-level monthly stats about the COVID-19 conversations on Twitter.

About the datasets

The datasets are being offered as-is for archiving and non-commercial research purposes and are free to download and reuse.

The tweets are collected via Twitter’s COVID-19 Streaming Endpoint (API) using a custom script developed by the Social Media Lab. According to Twitter, this new streaming endpoint has no data volume or throughput limitations, and offers a real-time, full-fidelity stream of public Tweets containing the full conversation about COVID-19. For more information about what tweets are included in this collection see Twitter’s filtering rules for this endpoint.

As per Twitter’s API Terms, each dataset only includes Tweet IDs (as opposed to the actual tweets and associated metadata). New datasets are uploaded to the web at the beginning of each month.

For each month, we prepare two data files:

one file with Tweet IDs for all COVID-19 related tweets that we collect via the API, and
a second file containing a subset of Tweet IDs for COVID-19 related tweets that also contain a vaccine-related word (i.e., words starting with vaccin*, vacin*, or vax*).
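
As an illustration of the vaccine-related filter described above, here is a minimal sketch in Python. It assumes that "words starting with vaccin*, vacin*, or vax*" means a case-insensitive match on those three stems at the start of a word; the pattern and the function name `has_vaccine_word` are hypothetical and are not taken from the Social Media Lab's collection script. Because the published files contain only Tweet IDs, a filter like this could only be applied to tweet text you have rehydrated yourself.

```python
import re

# Assumption: a "word starting with" one of the stems is a word boundary
# followed by the stem and any further word characters, ignoring case.
VACCINE_WORD = re.compile(r"\b(?:vaccin|vacin|vax)\w*", re.IGNORECASE)


def has_vaccine_word(text: str) -> bool:
    """Return True if the text contains a word beginning with one of the stems."""
    return VACCINE_WORD.search(text) is not None
```

Under this reading, "Vaccination", "#vaccine", and "vaxxed" match, while "antivax" does not, because the stem is not at the start of the word.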

Creating a random sample dataset from a massive dataset of Tweet IDs

Due to the large number of Tweet IDs (often 100M+) in each dataset in the archive, it is not always practical (or necessary) to recollect and study all of the tweets contained in the datasets. Instead, you can use our new Tweets Sampling Toolkit (available on GitHub) to create a random sample of Tweet IDs from one of the larger datasets available in the archive.
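
To show what drawing a random sample from a very large file of Tweet IDs can look like, here is a minimal sketch using reservoir sampling, which reads the IDs in a single pass and keeps only the sample in memory. This is an illustrative technique, not necessarily how the Tweets Sampling Toolkit works; the function name `reservoir_sample` and its parameters are hypothetical.

```python
import random
from typing import Iterable, List, Optional


def reservoir_sample(ids: Iterable[str], k: int, seed: Optional[int] = None) -> List[str]:
    """Return up to k IDs chosen uniformly at random from a stream of IDs."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    sample: List[str] = []
    for i, tweet_id in enumerate(ids):
        if i < k:
            # Fill the reservoir with the first k IDs.
            sample.append(tweet_id)
        else:
            # Keep the new ID with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = tweet_id
    return sample
```

To sample from a file with one Tweet ID per line, you could pass the stripped lines of the open file as `ids`. If the file holds fewer than `k` IDs, all of them are returned.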

Comparing two or more datasets

In addition to creating a random sample, the Tweets Sampling Toolkit can also perform set operations such as union, difference, and intersection to compare two or more datasets. For example, if you have previously collected your own dataset of COVID-19 related tweets using Twitter’s Standard Search or Streaming API, you could compare it with one of the datasets published in the COVID-19 Twitter Pandemic Archive. You can use the “union” function provided in the Tweets Sampling Toolkit to merge two or more datasets of Tweet IDs while excluding duplicates. Alternatively, you can use the “difference” function to identify and recollect only those tweets (based on their Tweet IDs) that are not part of your original dataset. Finally, you can use the “intersection” function to locate Tweet IDs that appear in two or more datasets.
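
The three operations described above correspond to standard set operations. Here is a minimal sketch using Python's built-in sets on two small made-up lists of Tweet IDs. The helper `load_ids` and the sample IDs are hypothetical, and this does not use the Tweets Sampling Toolkit's own interface.

```python
from typing import Iterable, Set


def load_ids(lines: Iterable[str]) -> Set[str]:
    """Build a set of Tweet IDs from lines of text, skipping blank lines."""
    return {line.strip() for line in lines if line.strip()}


# Made-up IDs standing in for an archive dataset and your own collection.
archive_ids = load_ids(["1001", "1002", "1003", "1004"])
own_ids = load_ids(["1003", "1004", "1005"])

merged = archive_ids | own_ids   # union: every ID from both, no duplicates
missing = archive_ids - own_ids  # difference: archive IDs you have not collected
shared = archive_ids & own_ids   # intersection: IDs present in both datasets
```

In-memory sets are convenient for moderate sizes, but a dataset with 100M+ IDs may not fit in memory this way, so treat this as an illustration of the logic rather than a recipe for the full archive files.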

Rehydrating/Recollecting Tweets

The process of recollecting tweets (with their corresponding metadata) based on unique identifiers (Tweet IDs) is called Rehydration. To rehydrate tweets from one of the datasets in the COVID-19 Twitter Pandemic Archive (or a newly created random sample of Tweet IDs), you can use third-party programs such as Hydrator, the Python library Twarc, or Communalytic Pro (dataset limit of 10M Tweet IDs).
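
Rehydration tools generally send Tweet IDs to the API in batches rather than one at a time. Here is a minimal sketch of that batching step. It assumes a batch size of 100, which is the per-request limit Twitter's tweet lookup endpoints have historically accepted; check the current limit before relying on it. The function name `batched` is hypothetical, and the actual API call is left to tools such as Hydrator or Twarc.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched(ids: Iterable[str], size: int = 100) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` Tweet IDs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(ids)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return  # the input is exhausted
        yield batch
```

Each yielded batch can then be handed to whichever rehydration client you use.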

Using the data visualization dashboard to detect possible data loss and viral content

As part of the COVID-19 Twitter Pandemic Archive release, we also introduced a new public-facing data visualization dashboard that displays high-level monthly and daily stats about the COVID-19 discourse on Twitter. Along with some general stats about each dataset, the dashboard shows the hourly volume of data ingested by the harvesters for each month in the form of a time series chart. Here are the instructions on how to use and interpret the time series chart for the data visualization dashboard.
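
If you want to reproduce a similar hourly view from a file of Tweet IDs, here is a minimal sketch. It relies on the fact that Tweet IDs in the Snowflake format embed their creation time in milliseconds, offset from a fixed Twitter epoch. This is not the dashboard's own implementation. The function names are hypothetical, and reading an empty hour as possible data loss is an assumption about how such a chart might be interpreted.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone
from typing import Iterable, List

# Offset used by Snowflake-format Tweet IDs (milliseconds since the Unix epoch).
TWITTER_EPOCH_MS = 1288834974657


def tweet_id_to_datetime(tweet_id: int) -> datetime:
    """Decode the UTC creation time embedded in a Snowflake Tweet ID."""
    ms = (tweet_id >> 22) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


def hourly_volume(tweet_ids: Iterable[int]) -> Counter:
    """Count Tweet IDs per UTC hour."""
    counts: Counter = Counter()
    for tweet_id in tweet_ids:
        created = tweet_id_to_datetime(tweet_id)
        counts[created.replace(minute=0, second=0, microsecond=0)] += 1
    return counts


def empty_hours(counts: Counter, start: datetime, n_hours: int) -> List[datetime]:
    """List the hours in [start, start + n_hours) that contain no tweets."""
    hours = (start + timedelta(hours=h) for h in range(n_hours))
    return [hour for hour in hours if counts.get(hour, 0) == 0]
```

Hours with no tweets at all are candidates for collection gaps, while hours far above the typical volume may point to viral content. Both are worth checking against the dashboard's time series chart.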

Downloads
