tidy_kerg

The German federal returning officer provides the election results in a csv file, alas, in a format which is not readily usable. To fix this I have written a script to turn the data into tidy format and uploaded it to my GitLab repository. I've done this a while ago but only know got around to writing about it.

Following the closure of polling stations Germany’s federal returning officer’(the Bundeswahlleiter) collates the polling results, announces the official result, and, important for social scientists and data journalists, provides them in machine readable format for download. Alas, the results are provided in a format defying conventions so that you cannot just read in the data and analyze or visulize it. The first row does not contain variable names but a heading and not every column is properly labelled. This why some data journalism folks refer to the file as a "csv"-file (as opposed to a proper csv-file).

The file is called kerg.csv, hence the name of this repository: tidy_kerg. This is how the first few lines and columns of the file look like:

# (c) Der Bundeswahlleiter, Wiesbaden 2017
#
# Wahl zum 19. Deutschen Bundestag (24. September 2017)
# Endgültige Ergebnisse der Erst- und Zweitstimmen sowie der Vorperiode nach Wahlkreisen
#
Nr;Gebiet;gehört zu;Wahlberechtigte;;;;Wähler;;;;Ungültige;;;;Gültige;;;;Christlich Demokratische Union Deutschlands;;;;Sozialdemokratische Partei Deutschlands;;;;DIE LINKE;;;;BÜNDNIS 90/DIE GRÜNEN;;;;Christlich-Soziale Union in Bayern e.V.;;;;Freie Demokratische Partei;;;;Alternative für Deutschland;;;;Piratenpartei Deutschland;;;;Nationaldemokratische Partei Deutschlands;;;;FREIE WÄHLER;;;;PARTEI MENSCH UMWELT TIERSCHUTZ;;;;Ökologisch-Demokratische Partei;;;;Partei für Arbeit, Rechtsstaat, Tierschutz, Elitenförderung und basisdemokratische Initiative;;;;Bayernpartei;;;;Ab jetzt...Demokratie durch Volksabstimmung;;;;Partei der Vernunft;;;;Marxistisch-Leninistische Partei Deutschlands;;;;Bürgerrechtsbewegung Solidarität;;;;Sozialistische Gleichheitspartei, Vierte Internationale;;;;DIE RECHTE;;;;Allianz Deutscher Demokraten;;;;Allianz für Menschenrechte, Tier- und Naturschutz;;;;bergpartei, die überpartei;;;;Bündnis Grundeinkommen;;;;DEMOKRATIE IN BEWEGUNG;;;;Deutsche Kommunistische Partei;;;;Deutsche Mitte;;;;Die Grauen – Für alle Generationen;;;;Die Urbane. Eine HipHop Partei;;;;Madgeburger Gartenpartei;;;;Menschliche Welt;;;;Partei der Humanisten;;;;Partei für Gesundheitsforschung;;;;V-Partei³ - Partei für Veränderung, Vegetarier und Veganer;;;;Bündnis C - Christen für Deutschland;;;;DIE EINHEIT;;;;Die Violetten;;;;Familien-Partei Deutschlands;;;;Feministische Partei DIE FRAUEN;;;;Mieterpartei;;;;Neue Liberale – Die Sozialliberalen;;;;UNABHÄNGIGE für bürgernahe Demokratie;;;;Übrige;;;;

It takes some cleaning up of the data before you can do actual analysis. Since it’s pretty wasteful if everybody takes this step by themselves I have made my code and resulting data publicly available and seems to have helped some people already is this Twitter thread indicates.

For making the data usable I have not just removed the leading rows and renamed columsn but also reshaped the data so that it conforms to the tidy data principles. Basically, the guiding principle is that every row should be an observation and every row a variable. Hadley Wickham has described tidy data in an article in the journal Journal of Statistical Software. In short, if you do not want to read the article, it is more or less what social scientist know as long format data (as opposed to wide).

To remove the first few rows, rename the columns and then turn the data into tidy format I use some packages from the tidyverse, mainly dplyr and tidyr to do this. In my dataset one observations is the result of a party for a vote (“Erststimme” or “Zweitstimme”) in a geographic unit in an election (2017 or 2013). Here’s a quick glimpse(sic!) at the data:

> glimpse(df)
Observations: 53,084
Variables: 15
$ nr         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11...
$ gebiet     <chr> "Flensburg – Schleswig", "Nordfri...
$ gehoertzu  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
$ land       <chr> "Schleswig-Holstein", "Schleswig-...
$ party      <chr> "Christlich Demokratische Union D...
$ partygroup <chr> "CDU/CSU", "CDU/CSU", "CDU/CSU", ...
$ vote       <chr> "Erststimme", "Erststimme", "Erst...
$ period     <int> 2017, 2017, 2017, 2017, 2017, 201...
$ votes      <int> 68120, 62256, 54812, 66625, 46560...
$ voteshare  <dbl> 40.0, 45.1, 41.9, 42.7, 30.7, 40....
$ voters     <int> 171914, 139194, 132017, 157354, 1...
$ turnout    <dbl> 75.2, 74.6, 74.7, 78.4, 74.9, 75....
$ eligible   <int> 228471, 186568, 176636, 200831, 2...
$ valid      <int> 170318, 137897, 130883, 156102, 1...
$ invalid    <int> 1596, 1297, 1134, 1252, 1594, 122...

You can then easily subset the data to obtain results of interest. Here, for illustrative purposes, I show the federal result (I provide further subsetting examples in the repository):

> df %>% filter(gebiet == 'Bundesgebiet',
+               period == 2017,
+               vote == 'Zweitstimme') %>%
+   group_by(partygroup) %>%
+   summarise(votes = sum(votes, na.rm = T),
+             valid = unique(valid)) %>%
+   mutate(voteshare = round(votes / valid * 100, 1))

# A tibble: 6 x 4
             partygroup   votes    valid voteshare
                  <chr>   <int>    <int>     <dbl>
1                   AfD 5878115 46515492      12.6
2 Bündnis 90/Die Grünen 4158400 46515492       8.9
3             Die Linke 4297270 46515492       9.2
4                   FDP 4999449 46515492      10.7
5              Sonstige 2325533 46515492       5.0
6                   SPD 9539381 46515492      20.5

I also cleaned the so called “Strukturdaten” (socioeconomic variables for the electoral districts and states) which are provided by the Bundeswahlleiter in a seperate file. My repository on GitLab contains the R code to download and format the data as well as the final product, the formatted data.

Did you like that post? You can suscribe to the RSS feed or follow @a_leininger on Twitter .
ipt>