Methodology

By Adam Hooper and Allison Fenichel

To build this story, we tackled two problems. First, we organized millions of users’ profiles into a special format built for searching. Second, we looked for words that stood out.

We examined the 18,686,752 Twitter users who followed Trump or Clinton at the time of our analysis. Of those, 7,972,079 users wrote bios: 4,232,803 Clinton followers and 5,078,544 Trump followers. There’s an overlap of 1,339,268 users with bios who follow both Clinton and Trump.
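The figures above fit together by inclusion-exclusion: users who follow both candidates appear in both per-candidate counts, so they are subtracted once. A small check in Python, using only the numbers quoted above (the variable names are ours, for illustration):

```python
# Counts of followers with bios, as quoted in the text.
clinton_with_bios = 4_232_803
trump_with_bios = 5_078_544
both_with_bios = 1_339_268

# Users who follow both are counted in each candidate's total,
# so subtract the overlap once to get the number of distinct users.
distinct_with_bios = clinton_with_bios + trump_with_bios - both_with_bios

# The three regions of a Venn diagram for all bios.
clinton_only = clinton_with_bios - both_with_bios
trump_only = trump_with_bios - both_with_bios

print(distinct_with_bios)
print(clinton_only, both_with_bios, trump_only)
```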

Organizing the profile information took weeks:

We downloaded every follower ID using Twitter’s API. (A follower ID is like a username, except shorter.) From the follower IDs, we downloaded each public Twitter profile. We began on Oct. 10 and finished on Oct. 18. (We assume the vast majority of Twitter followers didn’t edit their bios during our download.)
We tokenized each user’s bio, turning each string of text into a sequence of tokens. (“Working full time” becomes “Working”, “full”, “time”.)
We stemmed each token so that different spellings of the same word could be counted together. (“Working” becomes “work”.)
We counted how many Clinton and/or Trump followers wrote each stemmed token (e.g., “work”). Then we did the same for each pair of stemmed tokens (e.g., “work full”), each trio of stemmed tokens (“work full time”), and so on, up to each sequence of 10 stemmed tokens. We called these sequences of stemmed tokens “groups.” We published these group counts in our Venn diagrams. (So, “working full time” and “work full time” are both counted in the “work full time” Venn diagram.)
To build our search interface, we collected each variant of each group (that is, each unique sequence of characters, including punctuation and spacing) as it appeared before we tokenized and stemmed it. The search interface therefore accepts un-stemmed words: if you search for “working full time,” you’ll get the same result as for “work full time.”
We discarded the variants that appeared in fewer than 100 bios, to preserve followers’ anonymity. This was case-insensitive: if some followers wrote “lgbt” and others wrote “LGBT”, we counted both sets of followers as “LGBT.” We discarded the groups that had no variants left.
For each group, we saved the number of Clinton followers, the number of Trump followers, the total number of variants (including the ones we didn’t keep), and the exact variants we kept. You can download our database.
Finally, we built the website’s search interface. It loads that database and lets you search for variants. From the variant it finds the group and it displays that group’s Venn diagram.
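The tokenizing, stemming, counting and filtering steps above can be sketched in Python as follows. This is a simplified illustration, not the code we published: `toy_stem` is a crude stand-in for a real stemmer, the function names and the `(bio, follows_clinton, follows_trump)` input shape are hypothetical, and variants are lowercased as a shortcut for the case-insensitive counting.

```python
import re
from collections import defaultdict

MAX_GROUP_LENGTH = 10   # longest sequence of stemmed tokens, per the text
MIN_VARIANT_BIOS = 100  # variants in fewer bios are discarded, per the text


def toy_stem(token):
    """Crude illustrative stemmer (an assumption, not a real algorithm)."""
    token = token.lower()
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def variants_in_bio(bio, max_len=MAX_GROUP_LENGTH):
    """Map each group found in one bio to the raw-text variants behind it."""
    matches = list(re.finditer(r"\w+", bio))  # tokens with their positions
    stems = [toy_stem(m.group()) for m in matches]
    found = defaultdict(set)
    for start in range(len(matches)):
        last = min(start + max_len, len(matches))
        for end in range(start + 1, last + 1):
            group = " ".join(stems[start:end])
            # The variant is the original text, punctuation and spacing included.
            variant = bio[matches[start].start():matches[end - 1].end()]
            found[group].add(variant.lower())
    return found


def build_database(followers, min_bios=MIN_VARIANT_BIOS):
    """followers: iterable of (bio, follows_clinton, follows_trump) tuples."""
    group_counts = defaultdict(lambda: {"clinton": 0, "trump": 0})
    variant_bios = defaultdict(lambda: defaultdict(int))
    for bio, follows_clinton, follows_trump in followers:
        for group, variants in variants_in_bio(bio).items():
            # Each bio counts at most once per group and once per variant.
            group_counts[group]["clinton"] += int(follows_clinton)
            group_counts[group]["trump"] += int(follows_trump)
            for variant in variants:
                variant_bios[group][variant] += 1

    database = {}
    for group, variants in variant_bios.items():
        kept = sorted(v for v, n in variants.items() if n >= min_bios)
        if kept:  # groups with no surviving variants are discarded
            database[group] = {
                "clinton": group_counts[group]["clinton"],
                "trump": group_counts[group]["trump"],
                "n_variants": len(variants),  # includes discarded variants
                "variants": kept,
            }
    return database


# Tiny made-up sample; the threshold is lowered so something survives.
sample = [
    ("Working full time", True, False),
    ("work full time!", False, True),
    ("I work, full-time", True, True),
]
db = build_database(sample, min_bios=1)
print(db["work full time"])
```

In this sketch, a search would look up the typed text among the kept variants and then display the counts stored for the matching group.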

To find words that stood out — the terms we highlighted in the text of our story — we used the economics concept of an index. For each group, we calculated:

(number of Clinton followers who use the term ÷ total number of Clinton followers) ÷ (number of Trump followers who use the term ÷ total number of Trump followers)

When this formula gives a high result, we say a term “over-indexes” among Clinton followers. We also examined terms that over-index among Trump followers.

A term can over-index among Clinton followers even when the number of Trump followers who use it is higher. That’s because the total number of Trump followers is greater than the total number of Clinton followers.
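To make this concrete, here is a minimal Python sketch of the index. The function name `clinton_index` and the term counts are invented for illustration, and using the counts of followers with bios as the totals is an assumption on our part.

```python
# Totals of followers with bios, from the figures earlier in this article.
# Treating these as the formula's denominators is an assumption.
CLINTON_TOTAL = 4_232_803
TRUMP_TOTAL = 5_078_544


def clinton_index(clinton_users, trump_users,
                  clinton_total=CLINTON_TOTAL, trump_total=TRUMP_TOTAL):
    """Share of Clinton followers using a term divided by the Trump share.

    A result above 1 means the term over-indexes among Clinton followers.
    Assumes trump_users is greater than zero.
    """
    clinton_rate = clinton_users / clinton_total
    trump_rate = trump_users / trump_total
    return clinton_rate / trump_rate


# Hypothetical term: more Trump followers use it in absolute numbers,
# yet it still over-indexes among Clinton followers.
example_index = clinton_index(clinton_users=50_000, trump_users=55_000)
print(example_index > 1)
```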

The sentence below each Venn diagram describes this over-indexing: The followers who over-index are (index - 1) × 100% more likely to use a term than the followers who under-index.
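The conversion from index to sentence can be sketched like this. The function name and the exact wording are hypothetical, and the sketch assumes that when Trump followers over-index, the ratio is inverted so that it is greater than 1 before the formula is applied.

```python
def describe_over_index(index):
    """Turn a positive Clinton-over-Trump index into a plain-language sentence."""
    if index == 1:
        return "Clinton and Trump followers are equally likely to use this term"
    if index > 1:
        leader, other, ratio = "Clinton", "Trump", index
    else:
        # Trump followers over-index: invert so the ratio is above 1.
        leader, other, ratio = "Trump", "Clinton", 1 / index
    percent = (ratio - 1) * 100
    return (f"{leader} followers are {percent:.0f}% more likely "
            f"to use this term than {other} followers")


print(describe_over_index(1.5))
print(describe_over_index(0.5))
```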

We have published all our source code:

tweep-followers downloads large numbers of Twitter followers’ profile information.
twittok builds a database of groups and their variants.
we-the-tweeple is this website.
