2 minute read

TL;DR: This blog post explains how to retrieve the edit and comment history of Stack Overflow threads using the SOTorrent dataset.

Information about the most recent database layout and dataset version can be found on the project page.

Creating the Edit History Table

In our MSR 2018 paper, we describe a close relationship between post edits and comments. To support a further investigation of this relationship, we show how to create a new table EditHistory, which aggregates all title and body edits of Stack Overflow posts, together with post comments. Using this table, one can easily retrieve the edit and comment history of individual threads (see below). The BigQuery versions of the following queries can be found here.

CREATE TABLE EditHistory AS
SELECT *
FROM (
	SELECT
	  pv.PostId as PostId,
	  pv.PostTypeId as PostTypeId,
	  PostHistoryId as EventId,
	  CASE
		WHEN pv.PostHistoryTypeId=2 THEN "InitialBody"
		ELSE "BodyEdit"
	  END as Event,
	  UserId,
	  pv.CreationDate AS CreationDate
	FROM PostVersion pv
	JOIN PostHistory ph
	ON pv.PostHistoryId = ph.Id
	UNION ALL
	SELECT
	  tv.PostId as PostId,
	  tv.PostTypeId as PostTypeId,
	  PostHistoryId as EventId,
	  CASE
		WHEN tv.PostHistoryTypeId=1 THEN "InitialTitle"
		ELSE "TitleEdit"
	  END as Event,
	  UserId,
	  tv.CreationDate as CreationDate
	FROM TitleVersion tv
	JOIN PostHistory ph
	ON tv.PostHistoryId = ph.Id
	UNION ALL
	SELECT
	  PostId,
	  PostTypeId,
	  c.Id as EventId,
	  "Comment" AS Event,
	  UserId,
	  c.CreationDate as CreationDate
	FROM Comments c
	JOIN Posts p
	ON c.PostId = p.Id
) AS EditHistory;
ALTER TABLE EditHistory ADD INDEX EditHistoryPostIdIndex (PostId);
ALTER TABLE EditHistory ADD INDEX EditHistoryEventIdIndex (EventId);

Retrieving the History of Individual Threads

We first create a helper table that allows to retrieve the edit history of a thread using the PostId of either a question or an answer:

CREATE TABLE Threads AS
SELECT
  Id as PostId,
  PostTypeId,
  CASE
    WHEN PostTypeId=1 THEN Id
    WHEN PostTypeId=2 THEN ParentId
  END as ParentId
FROM Posts
# only consider questions and answers
WHERE PostTypeId=1
  OR PostTypeId=2; 
ALTER TABLE Threads ADD INDEX ThreadsPostIdIndex (PostId);
ALTER TABLE Threads ADD INDEX ThreadsParentIdIndex (ParentId);

Then, we retrieve the edit history of the Stack Overflow thread mentioned in this earlier blog post with the following query:

SELECT *
FROM EditHistory
WHERE PostId IN (
	SELECT PostId
	FROM Threads
	WHERE ParentId = (
	  SELECT ParentID
	  FROM Threads
	  # this is an answer id, the question id 3758606 yields the same result
	  WHERE PostId=3758880
	)
)
ORDER BY CreationDate;

The following figure shows the result of the above query (truncated):

Result of exemplary query

Possible Extensions

To retrieve the edit history on the level of individual text or code blocks, one could either modify the table creation query or join the InitialBody and BodyEdit events with table PostBlockVersion, using the PostHistoryId of the events.

More Information

First insights into the evolution of Stack Overflow posts, and in particular the connection of edits and comments, can be found in our research paper about SOTorrent. The dataset is available as a database dump (including import scripts) on Zenodo and as a public Google BigQuery dataset. Visit the project page to get the most recent version.

Updated: