On this page, we present the data promised in Greenberg, Sayeed, and Demberg (2015, NAACL) and Greenberg, Demberg, and Sayeed (2015, CMCL) as well as Sayeed and Demberg (2014, CLIC-it).
Because the CMCL paper in which we promised this data just appeared online, we're putting up this preliminary page.
We will put up a better page in the future, and new datasets and code will appear here as we generate them. We will cite this website in future papers whenever we use the data for a published result or produce new data that we intend to share.
These are the large data sets that we derived from the ukWaC corpus and the BNC. Right now, we're providing them to you as indexed HDF5 files, which you can access and query using Pandas under Python. Later, we'll provide compressed text files.
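As a rough sketch of what access through Pandas might look like, the two helpers below list the tables stored in an HDF5 file and load one of them, optionally filtering rows on disk. The helper names, the file name, and the key in the usage note are hypothetical; the real keys are whatever the first helper reports for the file you downloaded. Reading HDF5 through Pandas also needs the PyTables package (`tables`), and the `where` filter only works if the file was written in Pandas' "table" format with queryable columns, which we are assuming here rather than asserting.

```python
import pandas as pd


def list_tables(path):
    """Return the keys (table names) stored in an HDF5 file."""
    # Open read-only so the downloaded file is never modified.
    with pd.HDFStore(path, mode="r") as store:
        return list(store.keys())


def query_table(path, key, where=None):
    """Load one table from the file.

    `where` is an optional row filter that is evaluated on disk.
    It only works for stores written in "table" format whose columns
    were made queryable (an assumption about these files).
    With where=None the whole table is read into memory.
    """
    return pd.read_hdf(path, key=key, where=where)
```

For example, `list_tables("some_file.h5")` shows which keys are available, and `query_table("some_file.h5", key)` loads one of them, where `some_file.h5` stands in for the file you actually downloaded. Because the data sets are large, filtering with `where` (if the file supports it) is likely to be much cheaper than loading a full table first.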
We give these freely for your research, but if you plan to publish anything with them, we request that you tell us about it using the contact information below. Also, any publications that use them to obtain a result should cite the following papers:
These files are the result of the Amazon Mechanical Turk process we described in Greenberg, Demberg, and Sayeed (2015, CMCL). The first column is the verb tested, the second the noun, the third the role, and the last one, the score on a scale of 1 to 7.
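The four-column layout described above could be read along these lines. This is only a sketch: we assume here that the columns are separated by whitespace and that there is no header row, so adjust `sep` if your copy uses another delimiter. The function names and the column labels `verb`, `noun`, `role`, and `score` are our own choices for illustration, not names taken from the files.

```python
import pandas as pd

# Column labels follow the order described in the text; the names are ours.
COLUMNS = ["verb", "noun", "role", "score"]


def load_ratings(source, sep=r"\s+"):
    """Read a ratings file (path or file-like object) into a DataFrame.

    Assumes whitespace-separated columns and no header row.
    """
    df = pd.read_csv(
        source,
        sep=sep,
        header=None,
        names=COLUMNS,
        # Keep words such as "null" or "nan" as ordinary strings
        # instead of letting pandas turn them into missing values.
        keep_default_na=False,
    )
    # Scores are on a scale of 1 to 7; fail loudly if anything falls outside.
    if not df["score"].between(1, 7).all():
        raise ValueError("found a score outside the 1-7 scale")
    return df


def mean_scores(df):
    """Average the score over repeated (verb, noun, role) triples."""
    return df.groupby(["verb", "noun", "role"], as_index=False)["score"].mean()
```

If a triple appears more than once in a file, `mean_scores(load_ratings(path))` collapses the repeats into one row per triple; if every triple is unique, it simply returns the rows sorted by verb, noun, and role.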
As above, we offer this data for your use, but we request that you contact us if you plan to publish anything using it and that you cite the following paper: