Listed here are some of the scientific publications I published during my doctoral studies.
Educational dashboards allow educators to gain insights into their students and their learning progress. It is essential to understand why students may drop out of university. In our educational dashboard, we used a combination of Venn, Sankey, and UpSet diagrams to perform an in-depth analysis by investigating the effects of individual courses and their combinations. We present our visualizations based on student data from a computer science course at a German university.
This paper describes our participation in the LAK 2019 data challenge on predicting student performance. Given a student's clickstream data in the form of actions from the eBook system BookRoll, we predict his or her score on the final test at the end of the course. We propose a method called Bag of Behaviors (BoB) that transforms a student's click data into a fixed-size vector by combining k-Means clustering with localized soft-assignment coding. Using a random forest regressor, we achieve results that are comparable to other aggregation approaches.
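The BoB encoding can be sketched as follows. This is a minimal illustration of localized soft-assignment coding over k-Means centroids, not the paper's actual implementation; the function name and the parameters `beta` and `knn` are assumptions for the sketch.

```python
import numpy as np

def bag_of_behaviors(actions, centroids, beta=1.0, knn=3):
    """Encode a variable-length sequence of per-action feature vectors
    as one fixed-size vector via localized soft-assignment coding.
    actions:   (n, d) array of per-action feature vectors
    centroids: (k, d) array of k-Means cluster centers
    Returns a (k,) vector obtained by average pooling."""
    codes = np.zeros(len(centroids))
    for a in actions:
        d2 = ((centroids - a) ** 2).sum(axis=1)   # squared distances to centers
        nearest = np.argsort(d2)[:knn]            # localize to the knn closest centers
        w = np.exp(-beta * d2[nearest])
        w /= w.sum()                              # normalized soft-assignment weights
        codes[nearest] += w
    return codes / len(actions)                   # average pooling over the sequence
```

The resulting vector has one entry per cluster regardless of how many clicks the student produced, so it can be fed directly into a standard regressor.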
This paper describes our participation in the student performance prediction task of the learning analytics workshop hosted at the ICCE 2018 conference. The task provides two datasets consisting of time series of student click behavior from an eBook reader. The goals are to predict each student's score and to predict whether a student passes the course. We transformed the time series of student eBook actions into different features for the regression and the classification task. Among the many feature subsets examined, those selected via t-test, f-regression, and random forest regression delivered comparatively better results. After extensive feature engineering, we tried a new approach based on k-Means that transforms the selected features into the cluster-distance space. We evaluated the original and resulting features with different classifiers and regressors. For both datasets and both problems (regression and binary classification), the feature sets created with the cluster-distance space transformation delivered better results.
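The cluster-distance space transformation amounts to mapping each sample to its vector of distances to the k-Means centroids. A minimal sketch (the function name is illustrative, and the centroids are assumed to come from a previously fitted k-Means model):

```python
import numpy as np

def to_cluster_distance_space(X, centroids):
    """Map each sample to its Euclidean distances to the k-Means
    centroids, turning a (n, d) feature matrix into an (n, k) one."""
    diffs = X[:, None, :] - centroids[None, :, :]   # (n, k, d) pairwise differences
    return np.sqrt((diffs ** 2).sum(axis=2))        # (n, k) distance matrix
```

The transformed features can then be used in place of (or alongside) the original ones for both the regression and the classification task.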
This paper describes our participation in the SemEval-2018 Task 12 Argument Reasoning Comprehension Task, which calls for systems that, given a reason and a claim, predict the correct warrant from two opposing options. We decided to use a deep learning architecture and combined 623 models with different hyperparameters into an ensemble. Our extensive analysis of our architecture and ensemble reveals that the decision to use an ensemble was suboptimal. Additionally, we benchmark a support vector machine as a baseline. Furthermore, we experimented with an alternative data split and achieved more stable results.
Cities and municipalities in Germany are increasingly using online participation projects to incorporate the opinions of their citizens into political decision-making processes. Citizens are able to voice their opinions, ideas, and comments in text form on online-based, forum-like platforms. The evaluation of these projects is conducted manually and can be very time-consuming if the participants have written thousands of text contributions. In cooperation with technical service providers and cities as part of the PhD program Online Participation, we identified a need for automated approaches that assist in the manual evaluation.
First, we focused on argument mining. On the basis of the project Tempelhofer Feld, we identified a suitable argument model for online participation projects, annotated text content from a part of the project with three annotators, and achieved a high inter-annotator agreement. Then, we worked on the two machine learning tasks of automatically identifying argumentative content and classifying argument components. In our approach, we evaluated a classical machine learning approach with feature engineering as well as deep learning techniques.
Afterwards, we focused on online participation projects with a high number of text contributions and the task of automatically creating a broad overview of the discussion topics. We started by creating a new lemmatizer for German based on Wiktionary. After a fundamental debate about which text content should be considered for a topic extraction method and how the extracted topics should be visualized, we applied different topic extraction methods to several online participation projects and discussed their results.
Finally, we used text content from citizens involved in the discussion and dealt with the task of automatically inferring demographic attributes in order to identify underrepresented population strata. We developed a multi-lingual author profiling approach for the PAN author profiling challenge in 2016 and achieved first place out of 22 participating teams for gender detection in English text.
The 17th Conference on Database Systems for Business, Technology, and Web (BTW2017) of the German Informatics Society (GI) took place in March 2017 at the University of Stuttgart in Germany. For the first time at a BTW conference, a Data Science Challenge was organized by the University of Stuttgart and the sponsor IBM. We challenged the participants to solve a data analysis task within one month and to present their results at the BTW. In this article, we give an overview of the organizational process surrounding the challenge and introduce the task that the participants had to solve. In the subsequent sections, the final four competitor groups describe their approaches and results.
In recent years, cities and municipalities have increasingly used online participation processes to involve their citizens in political decision-making. This contribution begins with a categorization of online participation processes in the political context in Germany and focuses on the participation project Tempelhofer Feld in Berlin. It describes the problems of manual evaluation and the need for machine-assisted analysis of text contributions from participation processes. The contribution addresses problems and possible solutions in three areas of analysis: argument mining, topic extraction, and emotion recognition. For argument mining, a suitable three-part argumentation model is discussed and applied to the online participation process Tempelhofer Feld of the city of Berlin. In addition, the use of word embeddings as features for a support vector machine for the automated classification of argument components is evaluated. The contribution then gives an insight into topic extraction, whose goal is to create a rough overview of the topics discussed in an online participation process, and discusses the results of two methods. Finally, the possible applications of automated emotion recognition in the context of online participation processes are discussed.
Many events, for instance in sports, politics, and entertainment, happen all over the globe all the time. It is difficult and time-consuming to notice all these events, even with the help of different news sites. We use tweets from Twitter to automatically extract information in order to understand hashtags of real-world events. In our paper, we focus on the topic identification of a hashtag, analyze the positive, neutral, and negative sentiments expressed by users, and further investigate the expressed emotions. We crawled English tweets from 24 hashtags and report initial investigation results.
We developed an approach to automatically predict the personality traits of Java developers based on their source code for the PR-SOCO challenge 2016. The challenge provides a data set consisting of source code with their associated developers’ personality traits (neuroticism, extraversion, openness, agreeableness, and conscientiousness). Our approach adapts features from the authorship identification domain and utilizes features that were specifically engineered for the PR-SOCO challenge. We experiment with two learning methods: linear regression and k-nearest neighbors regressor. The results are reported in terms of the Pearson product-moment correlation and root mean square error.
Given a set of sentences, a sentence orderer permutes the sentences so that the resulting text is linguistically coherent and semantically understandable. In this work, we focus on the binary and ternary tasks of ordering a pair of sentences with regard to their linguistic coherence. We propose a methodology to automatically collect and annotate sentence ordering corpora in the news domain for English and German documents. Furthermore, we introduce a data-driven end-to-end neural architecture that learns the order of a pair of sentences and also recognizes the cases where no ordering can be determined due to missing context.
Author masking is the task of paraphrasing a document so that its writing style no longer matches that of its original author. This task was introduced as part of the 2016 PAN Lab on Digital Text Forensics, for which a total of three research teams submitted their results. This work describes our methodology for evaluating the submitted obfuscation systems based on their safety, soundness, and sensibleness. For the first two dimensions, we introduce automatic evaluation measures; for sensibleness, we report our manual evaluation results.
Author profiling deals with the study of various profile dimensions of an author such as age and gender. This work describes our methodology proposed for the task of cross-genre author profiling at PAN 2016. We address gender and age prediction as a classification task and approach this problem by extracting stylistic and lexical features for training a logistic regression model. Furthermore, we report the effects of our cross-genre machine learning approach for the author profiling task. With our approach, we achieved the first place for gender detection in English and tied for second place in terms of joint accuracy. For Spanish, we tied for first place.
This paper focuses on the automated extraction of argument components from user content in the German online participation project Tempelhofer Feld. We adapt existing argumentation models into a new model for decision-oriented online participation. Our model consists of three categories: major positions, claims, and premises. We create a new German corpus for argument mining by annotating our dataset with our model. Afterwards, we focus on the two classification tasks of identifying argumentative sentences and predicting argument components in sentences. We achieve macro-averaged F1 measures of 69.77% and 68.5%, respectively.
This paper describes our participation in the SemEval-2016 Task 1: Semantic Textual Similarity (STS). We developed three methods for the English subtask (STS Core). The first method is unsupervised and uses WordNet and word2vec to measure a token-based overlap. In our second approach, we train a neural network on two features. The third method uses word2vec and LDA with regression splines.
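The unsupervised token-based overlap of the first method can be sketched roughly as a greedy alignment of tokens by embedding similarity. This is an illustrative simplification with a toy embedding table standing in for word2vec/WordNet, not the system's actual scoring; the function name is an assumption.

```python
import numpy as np

def token_overlap_similarity(tokens_a, tokens_b, emb):
    """Greedy token alignment: each token of sentence A is matched to
    its most similar token of sentence B by cosine similarity, and the
    best-match scores are averaged. `emb` maps token -> vector."""
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    scores = [max(cos(emb[a], emb[b]) for b in tokens_b if b in emb)
              for a in tokens_a if a in emb]
    return sum(scores) / len(scores) if scores else 0.0
```

Semantically close sentence pairs score higher than unrelated ones, which makes the measure usable without any labeled training data.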
Nowadays, many natural language processing pipelines are based on training data created by a few experts. This paper examines how the proliferation of the internet and its collaborative application possibilities can be practically used for NLP. For that purpose, we examine how the German version of Wiktionary can be used for a lemmatization task. We introduce IWNLP, an open-source parser for Wiktionary that reimplements several MediaWiki markup language templates for conjugated verbs and declined adjectives. The lemmatization task is evaluated on three German corpora, on which we compare our results with existing software for lemmatization. With Wiktionary as a resource, we obtain a high accuracy for the lemmatization of nouns and can even improve on the results of existing software.
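At its core, Wiktionary-based lemmatization is a lookup of (surface form, part of speech) pairs in a table extracted from the parsed dump. A minimal sketch, where the table layout and the fallback behavior are illustrative assumptions rather than IWNLP's actual data structures:

```python
def lemmatize(token, pos, lemma_table):
    """Look up the lemma for a (lowercased surface form, POS tag) pair
    in a table extracted from Wiktionary; fall back to the token
    itself when the form is unknown."""
    return lemma_table.get((token.lower(), pos), token)
```

In practice the table is built once by parsing the Wiktionary dump and can then serve lookups in constant time.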
In the automated analysis of text contributions from online platforms, statements are often divided into positive and negative ones. When analyzing text contributions from a municipal online participation process, it is useful to split the expressed opinions into communication modes in order to enable filtering by arguments and expressions of emotion for subsequent processing steps. This work presents two approaches for recognizing communication modes. The first method distinguishes different communication modes by means of word lists. The second method takes parts of speech into account and extracts further linguistic properties. To evaluate the approaches, a dataset of headlines from news articles of the website ZEIT ONLINE and the satire website Postillon is created. The approaches are applied to the recognition of the communication mode satire. The best result, an average F_1 of 75.5%, is achieved by the second approach with a support vector machine.
Nowadays, people can voice their opinions on a wide variety of topics on online discussion platforms. These opinions can be examined more closely in the form of an opinion formation analysis. This contribution investigates various aspects of automated discussion tracking. For this purpose, analysis criteria are defined and the presented approaches are applied to two German-language datasets.