Is PHP's community toxic?

So this got me curious, and I went looking into the data this article is based upon, taking his data at face value for now.

And I found something quite interesting.
The chart, while it looks pretty and all, only covers about half of his data. It's very much not complete. In fact, if the bars were actually stacked in the way he seems to want them to be, here are some fun facts.

The full breadth of his data collection on “negative emotions” actually looks like this:


(Don’t ask me why verbose is a negative emotion.)

If I pare his data down to just the 4 'toxic words' he selected from his list, but include ALL of the data, I get a chart that looks something like this:


Apparently the groovy guys and gals really like their fecal matter.

But this still isn't really reflective of the 'community', because as I mentioned previously, someone using all 4 words in one comment makes the community look four times as toxic. So… I took his program and analyzed it.

What he does is take every comment written in a subreddit and compress it down into a SQLite database. He then runs the following query:

"SELECT \
(10000.0 * COUNT(*) / cached_subreddit_comment_counts.cnt) as result\
FROM comments, cached_subreddit_comment_counts\
WHERE comments.subreddit = ? \
    AND cached_subreddit_comment_counts.subreddit = ? \
    AND body like ?"
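
I haven't gone digging through his schema, but judging purely from that query, the database must contain at least something along these lines (column names taken from the query, types guessed by me):

CREATE TABLE comments (subreddit TEXT, body TEXT /* ...plus whatever else he stores per comment */);
CREATE TABLE cached_subreddit_comment_counts (subreddit TEXT, cnt INTEGER /* total comments stored for that subreddit */);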

The ?'s are bound to the subreddit identifier (twice), and the body pattern to %<chosen word>%. So there are some good numbers here - he counts each comment at most once per word (so "shit shit shit shit" doesn't get multi-counted).
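
For, say, r/php and the word "shit", what actually gets executed boils down to this (parameters substituted by hand here, purely for illustration):

SELECT (10000.0 * COUNT(*) / cached_subreddit_comment_counts.cnt) as result
FROM comments, cached_subreddit_comment_counts
WHERE comments.subreddit = 'php'
  AND cached_subreddit_comment_counts.subreddit = 'php'
  AND body like '%shit%'

In other words: "comments containing that word, per 10,000 comments in the subreddit".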

However, posts containing multiple words from the data pool do get counted more than once - a comment with both "shit" and "fuck" in it shows up under both words - so using a stacked bar in his chart is somewhat misleading. Even the 'sum' column of his data is not really valid: while factually accurate, saying 'there were 500 curses used in 10,000 posts' doesn't reflect a community, for the reason I've stated above about multi-counting.
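
To put a number on it: 500 word-hits in 10,000 posts could mean anything from 500 posts that each used one of the words down to 125 posts that each used all four, and neither the stacked bars nor the sum can tell those two communities apart.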

We can, however, infer the actual “number of posts containing a curse word” from his data, by tweaking his code slightly. So let’s stop trusting his data and do it ourselves; we can even do it with more recent data!

I added the following to the Python file.

def relative_word_group_count(c, subreddit, words):
    subwords = "%\" OR body like \"%".join(words)
    command =\
        "SELECT \
        (10000.0 * COUNT(*) / cached_subreddit_comment_counts.cnt) as result\
        FROM comments, cached_subreddit_comment_counts\
        WHERE comments.subreddit = ? \
            AND cached_subreddit_comment_counts.subreddit = ? \
            AND (body like \"%"+subwords+"%\")"

    c.execute(command, (subreddit, subreddit))
    res = c.fetchone()[0]
    if not res:
        return 0
    return int(res)
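
For the four-word group, the filter this builds expands to:

    AND (body like "%shit%" OR body like "%fuck%" OR body like "%hate%" OR body like "%crap%")

so a comment containing several of the words still only matches once, which is exactly the per-post count we want. (SQLite happens to accept those double-quoted values as string literals when they don't name a column; single quotes would be the proper SQL, but I've kept the quoting style of his original query.)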

def show_word_group_table(c, subreddits, words):
    print words
    result = ','.join(["subreddit"] + ["4big"] + ["all"])
    result += "\n"
    for subreddit in subreddits:
        print subreddit,
        result += subreddit + ","
        result += str(relative_word_group_count(c, subreddit, ["shit", "fuck", "hate", "crap"])) + ","
        result += str(relative_word_group_count(c, subreddit, words))
        result += "\n"
    return result

(and inside the count_word_mentions function)

write_str_to_file('analysis/words_group_all.csv', show_word_group_table(c, subreddits, negative_emotions))
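
If everything behaves, each row of analysis/words_group_all.csv should come out in the three-column shape the header promises - no actual numbers yet, for the reason below:

subreddit,4big,all
<subreddit>,<comments per 10,000 containing any of the four chosen 'toxic words'>,<comments per 10,000 containing any word on his full "negative emotions" list>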

Because this program pulls an entire YEAR's worth of data out of Reddit for all of the subreddits listed, it will take some time to run (the number of wget calls is absolutely insane). I will update this thread if/when it completes. (I'd put money on my PC crashing/rebooting before it comes close.)
