Project MapLemon is a corpus for stylometric demographic identification, comprising 54,000+ words from 346 participants. It was originally created to provide a baseline corpus for linguistic variation among North American English speakers. The corpus contains responses from 30 linguistic backgrounds, 40 US states, and 6+ Canadian provinces. Project MapLemon introduces a new method for collecting data on linguistic variants in natural, digital written text: participants are asked to give directions using a hand-drawn map and to write a recipe for lemonade. In addition to its novel collection methods, MapLemon contains responses from 212 transgender and non-binary people. Analysis of these responses has shown that transgender people write most similarly to their sex assigned at birth, then to their gender, and are dissimilar in their writing to other opposite-sex transgender people. Furthermore, the analysis suggests that non-binary people are their own gender category and cannot be classed with any other gender.

An example poster, presented at Text as Data '22, can be seen below.

MapLemon data is available on GitHub.

MapLemon is supported by the EViL Lab at Duquesne University, and the Provost's Digital Innovation Grant.

Recent slides that further explain how MapLemon works are also available.

