Laura A. Janda nominated for SPARC Europe’s Open Data Champion


Laura A. Janda has been nominated for SPARC Europe’s Open Data Champion.

Laura is a professor of Russian Linguistics at UiT The Arctic University of Norway. In 2013, she took the initiative to develop The Tromsø Repository of Language and Linguistics (TROLLing), an international archive for sharing linguistic data and statistical code. TROLLing, launched in June 2014, was developed – and is run – through a collaboration between the linguistics community and the UiT University Library. Laura tirelessly promotes open access to research data by presenting the advantages of TROLLing at numerous conferences and workshops, and by urging editors of scientific journals and other stakeholders to make data sharing a part of their policy. For more information about Laura, see her homepage. For more information about TROLLing, see the TROLLing homepage, and the About section on this blog.

As part of the nomination for SPARC Europe’s Open Data Champion, we have interviewed Laura about open data in linguistics. The interview is available on YouTube, and you may also read a transcribed version here:

Open data in linguistics

An interview with
Laura A. Janda, Professor of Russian Linguistics, UiT The Arctic University of Norway


SPARC Europe, an international foundation that advocates change in scholarly communications and Open Science, is building a showcase of some of Europe’s Open Data champions to help encourage others in research and research communication to do the same. Laura, you are one of our foremost Open Data advocates, and we would very much appreciate hearing your story, why you are so engaged in the Open Data movement.

Laura: I am a linguist and a researcher, and in the last fifteen years or so, I have seen linguistics as a science really change a lot. We have gained access to huge quantities of data as well as to very sophisticated software for analysing statistical tendencies, and this has led to a theoretical sea change in our community. We have discovered that many things we thought were very simple questions, with just yes/no answers and clear categorisations, are in fact statistical tendencies with many more factors at play, and as such much more complicated.


What experiences made you realise the importance of sharing data?

Laura: Around 2007, I went to a conference and I realised that I needed to learn statistics. I went back to my university and took courses in statistics, and since then I have written a textbook for linguists who want to use statistics, and I have developed a course here at UiT for linguists who want to use statistical methods in their research. I realised that one of the hardest things about learning to use statistics is figuring out what model fits your data, and it really helps to see examples of what other people have done. If I could see an example and see that it is similar to what I have done, then I could more easily relate to it. When I started, I didn’t take my courses in the Linguistics Department, but in the Psychology Department, with psychology professors as my teachers. Psychologists have been working with statistics much longer than we have, so they are much further along on this learning curve.

Another experience that has pushed me in the direction of open data is that, for many years, I have been the associate editor of our journal Cognitive Linguistics. Actually, I recently did a survey of all the articles that have been published in the journal since it was founded in 1990, to the present. Our journal has always been data-friendly, and there has never been an issue published that didn’t have a statistical analysis of data. However, around 2008, around the time when I realised that I needed to change myself, we crossed the 50% line for the first time: over 50% of the articles published in our journal involve statistics, and we are probably never going back. I don’t think we will ever go to 100%, but we are now very much dominated by statistical analyses of data. Also, I found that it’s a problem as an editor and as a reviewer – I review for many other journals – if you can’t see the data. So, it’s very important to provide access to the data so that others can see how it was done and learn from it, or even try to replicate it. In this way, we support the scientific method and the integrity of our field overall. It’s also important for transparency, to avoid fraud. We haven’t had any big scandals in linguistics the way that we have seen for instance in medicine, but it’s always possible for people to fudge their data a little bit. This is harder to do if the data is all made open and public.


How are you involved with Open Data?

Laura: The above are some of the reasons why I got started, and then I felt it would really help if we had one-stop shopping for linguists to find the data and the code, and learn about it. We got the idea of launching a website that would house those kinds of open data resources, and we went to our library. To our great delight, they thought this was a wonderful project and were willing to spend months – and even years – on this, taking care of many of the professional and technical sides of the questions, which would have been very difficult or impossible for me to tackle on my own, or even with my colleagues here in linguistics. So, this was very much a partnership, and we were very lucky that we had excellent colleagues in our library to help us out with this project.

Working with TROLLing[1] has also changed my own working habits. When you do a bunch of theoretical and statistical studies, and a year or two after finishing one you want to go back and reuse some of that data, or take some inspiration from it, it is sometimes hard to find your own data, or even to understand how it was put together and make sense of all the fields in your files, if you haven’t annotated them well enough. Today, of course, I know exactly what all those fields mean, but will I know in a month, or in a year, in ten years? The nice thing about having a resource like TROLLing is that it really forces me, too, to upload all my data in a place where I can find it again, and I can show other people where to find it. Also, if I have gone through the exercise of annotating the data in a way that I hope makes it clear even for somebody who doesn’t know me and has no previous knowledge of my data, then, hopefully, it will be clear enough also for me when I go back to that data and look at it again. And it has become much better, and it’s way easier! Nowadays, it’s easier to go back to TROLLing to find my own data and code – and I know it’s always there, it’s safe – than to have to dig around in my own files.

I use my open data in teaching, too. I have a textbook that I use in my course, with some datasets and analyses for people to go through. But I have my own data, and there is something different about your own data, because you know it. I give my students a dataset for each type of statistical analysis they are supposed to learn. I give them my own dataset and my own code, and then we work through it. I can answer all their questions and really give them a full experience of what it’s like to work with your data and code. It’s kind of like a myth, I guess, I had to break free of in order to move into this new way of doing linguistics. Because it’s not like you can just collect data and shovel them over to some statistician; say the word “verb” and the shutters go down and he doesn’t understand. You have to analyse the data yourself, because the statistician will never understand it the way you do. Also, you have to have some idea of what the models are that you are going to use in the end, in order to collect the data that will be amenable to that kind of modelling in that kind of analysis.

One of my colleagues said, when we were making the instructional videos: “Laura, you have to make these instructional videos such that even your grandmother could upload data onto TROLLing”. I think we came pretty close to that. I think it’s pretty self-explanatory with the instructional videos. And I have always felt that research and teaching go hand in hand. I have never been involved in a research project that didn’t have some sort of teaching angle to it. And conversely, whenever I am teaching, I always try to think about what we still need to learn. And that is one of the great things about teaching: you see the students, you can see those gears turning in those heads, and you can see that they see it from a different perspective. They come up against a problem, they don’t understand why something works this way, and then you say: “We need a better explanation, we need to learn more about this phenomenon”. I learn constantly from the students, and that again can feed back into the teaching and research. It’s a continuous cycle.

So, the students are getting a simulated experience of hands-on working with the data. They get the data, they get the code, we go through it, we all sit there together, they all have their computers open, it’s like a hands-on experience of working directly with the data.


What do you consider to be Open Data concerns?

Laura: One thing has concerned me quite a bit recently. We have a challenge sometimes finding academic research positions for many of our graduates in linguistics. However, there are some corporations that are very interested in hiring statistically capable linguistics graduates. And these are mostly big corporations like Google, Amazon, Apple, Facebook and such. And these are the public ones, in the sense that everybody knows they exist. But they are doing a lot of clandestine research on you and me, using linguistics and big data, and everything that they do is kept undercover. That’s all company secrets. It’s spyware, let us put it that way. They are spying on us, they are using linguistics and data techniques in order to spy on us. And they are not alone. There are also various governmental organisations doing similar things, spyware operations. This is something that is pretty much unstoppable. It’s going to happen and we can’t prevent it. But the more that we put things out there ourselves and make things as public as possible, I think that is our only defence; that we have all these things in plain sight, and not let it all be shut behind the doors of spying operations and major corporations.


What inspires you and makes you optimistic about the future of Open Science?

Laura: I think that statistical studies and data studies in linguistics are here to stay. That’s definitely part of our future. I think that in the future, probably all linguistics programs will have courses in statistics for students, and that will be part of the expectations of submitting articles to journals. So, my hope for the future is that TROLLing will continue to be a clearinghouse for those materials, a place where people can upload their materials, share with each other and learn from each other. One never knows when one collects data what sort of structure in that data might have been overlooked, that somebody else could find. And that is one of the really exciting things about this time that we are living in, that suddenly we have access to so much data and a way to look for the structure, thanks to the sophisticated statistical software. I think we are living in very exciting times in that sense.

I want to mention the dissertation by Jaap Kamphuis, which was defended in Leiden. I had met the author at a couple of conferences before, so I knew approximately what he was working on and he knew something about what I was working on. Then I was asked to be an examiner at his dissertation defence. I got a copy and I was reading through it, and then I realised that he had taken the method that we had used, and he had gotten it from TROLLing, from our open data site. He had taken that method and used it on different data, and used it in a different way, and it was so exciting I practically cried! It was just a really exciting moment. This is the kind of thing that can happen, and this wouldn’t have happened if it wasn’t for TROLLing. He might have read my article, but then he probably would have said: “Well, I don’t know how to do this. How am I going to figure it out?”. But the thing is, he did this without having to call me or ask me or anything. He just went to TROLLing and downloaded it, and saw how it was done and said: “Yeah, I can do the same”, and did the same and wrote his dissertation.


What still needs to be done to get more people to share and open up their research data?

Laura: A big challenge is to educate people so that they understand that everybody gains, that nobody really loses anything. And that is also one of the things that we have safeguarded in TROLLing. We have instructions on how to cite the data, and once you put up your data in TROLLing, then everybody will always know that that was your data, because your name was on it first and we have the posting dates and all that information. You can’t lose anything, all you can gain is more perspectives from more researchers and maybe more interest in your research.


Finally, could you mention one important positive consequence of data sharing?

Laura: I mentioned that, in psychology, they have been doing statistical analyses for a long time, and in linguistics we have come to this rather late. But that means that we are in this formative period where we are discovering what are the methods that are going to work best for us. By sharing our data, and doing this in a very sort of open, public, community fashion, we can really decide what are best practices in our field, and help our whole field move forward by setting standards. I think that is really important.


Laura, on behalf of the open data community, thank you very much for your inspiring story!


Helene N. Andreassen, PhD
On behalf of the TROLLing team
University Library, UiT The Arctic University of Norway


YouTube:

[1] The Tromsø Repository of Language and Linguistics (TROLLing):