Data Analysis
The development and application of data analysis and exploration methods are typically closely tied to the application domain in which the data are generated and used. In Heidelberg, this work takes place in a rich interdisciplinary setting, where research groups from many disciplines, ranging from the natural sciences, such as medicine, biology, and physics, to the social sciences and humanities, collaborate with members of our research groups. In these collaborations, the data analysis group focuses primarily on visual data analysis and text analysis.
Text Analysis and Exploration
It is estimated that 80% of the data in businesses exist in the form of text, primarily as PDF files, Word documents, Web pages, and plain text files. Such unstructured data are often accompanied by structured data, e.g., spreadsheets, database tables, or numerical data in some data management system, leading to intricate and latent linkages among heterogeneous forms of data. Extracting information that satisfies a given information need from these data lakes, and presenting the results to users in a comprehensive and understandable way, remains a difficult problem, given the complexity and heterogeneity of unstructured data. To address these challenges, we develop novel approaches that range from advanced information retrieval techniques to information extraction and linkage. These approaches support text analysis tasks such as dynamic topic detection, text clustering, and classification, and lead on to more advanced tasks such as text summarization and generation. The corresponding research and development activities, which are typically conducted in collaborative research projects, employ traditional machine learning techniques as well as more recent approaches from deep learning and natural language processing.
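As a minimal illustration of the kind of building block that underlies retrieval and text clustering, the following sketch computes TF-IDF vectors and cosine similarity for a handful of documents. It is a textbook formulation and not a description of our systems: the function names (tokenize, tfidf_vectors, cosine), the smoothed IDF variant, and the example documents are assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document.

    Uses raw term frequency and a smoothed inverse document frequency,
    idf(t) = log((1 + N) / (1 + df(t))) + 1 (an assumed, common variant).
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical example documents.
docs = [
    "Stock markets fell sharply",
    "Markets fell as stock prices dropped",
    "The team won the football match",
]
vectors = tfidf_vectors(docs)
sim_related = cosine(vectors[0], vectors[1])    # documents share terms
sim_unrelated = cosine(vectors[0], vectors[2])  # documents share no terms
```

Pairwise similarities of this kind can serve as input to a clustering or ranking step. Approaches based on learned text representations replace the sparse vectors with dense embeddings but keep the same overall structure.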
Data Management
Large-scale data analysis tasks typically rely on efficient and scalable data management infrastructures. In our research and development, we almost always use hybrid system infrastructures that combine traditional relational database management systems for structured data with systems that offer efficient access to unstructured data, in particular text data. These include systems for graph data (e.g., to manage information networks) and systems for managing and querying large text corpora. On top of these, we realize novel information retrieval and exploration approaches, for example, for the construction and visualization of dynamic information networks. A particular challenge is the real-time processing of text data streams, such as streams of social media postings or online news articles. In such settings, efficient NLP techniques are key for downstream text analysis tasks such as topic or trend detection.
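To make the idea of trend detection on a text stream concrete, the following sketch flags terms whose frequency in the newest batch of postings clearly exceeds their average frequency over a sliding window of earlier batches. It is a deliberately simple heuristic under stated assumptions, not our production method: the class name SlidingWindowTrends, its parameters (window_size, min_count, ratio), whitespace tokenization, and the example postings are all hypothetical.

```python
from collections import Counter, deque


class SlidingWindowTrends:
    """Flag terms that are unusually frequent compared to recent batches."""

    def __init__(self, window_size=3, min_count=2, ratio=2.0):
        # Term counts of the most recent batches; old batches drop out
        # automatically once the deque reaches its maximum length.
        self.window = deque(maxlen=window_size)
        self.min_count = min_count
        self.ratio = ratio

    def update(self, batch):
        """Consume one batch of postings and return a sorted list of trending terms."""
        current = Counter(
            token for post in batch for token in post.lower().split()
        )
        trending = []
        if self.window:  # no baseline is available for the very first batch
            for term, count in current.items():
                baseline = sum(c[term] for c in self.window) / len(self.window)
                # Add-one smoothing keeps unseen terms from scoring infinitely.
                score = count / (baseline + 1.0)
                if count >= self.min_count and score >= self.ratio:
                    trending.append(term)
        self.window.append(current)
        return sorted(trending)


# Hypothetical stream of three batches of short postings.
detector = SlidingWindowTrends()
first = detector.update(["rain today", "rain again"])
second = detector.update(["rain today", "quiet day"])
third = detector.update([
    "earthquake reported",
    "earthquake downtown",
    "earthquake felt",
    "earthquake rain",
])
```

Because each batch is reduced to a term counter before it enters the window, memory use depends on the vocabulary and the window length rather than on the number of postings. A realistic pipeline would replace the whitespace split with proper NLP preprocessing.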
Visual Data Analysis
Knowledge acquisition from large and complex data faces various challenges, necessitating integrated approaches. On the one hand, the data have to be transformed and reduced to their “essential structure” with respect to a (varying) research question. On the other hand, this structure has to be made available for reasoning. Since the human visual system is the perceptual channel with the highest bandwidth, visual representation and interactive exploration have become primary tools in data analysis. Visualization research addresses this data transformation and (re-)presentation, and is nowadays divided into three subdisciplines. Information visualization focuses on the analysis of discrete data, which typically involves discrete visual representations such as graphs. Scientific visualization, in contrast, focuses on continuous data and typically leads to continuous representations such as streamlines of a flow field. Finally, visual analytics focuses on interaction and human-in-the-loop aspects, which are particularly useful for the analysis of “big data”. We conduct basic research in all these areas, often in an interdisciplinary context, and always include a focus on the application of the resulting techniques and concepts. Recently, our research interest has been extending to the analysis of continuous data in higher dimensions, to the analysis of scientific computing techniques that are used to simulate data, and to the understanding of mathematical structures.
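As an example of a continuous representation, a streamline of a steady flow field can be traced by numerically integrating the field from a seed point. The sketch below uses the classical fourth-order Runge-Kutta scheme on an analytic 2D rotation field, whose streamlines are circles. The function names (rk4_step, trace_streamline, rotation), the step size, and the seed point are assumptions chosen for illustration; real data would require interpolation of a sampled field.

```python
import math


def rk4_step(field, point, h):
    """Advance a point by one classical Runge-Kutta step of size h."""
    x, y = point
    k1 = field(x, y)
    k2 = field(x + 0.5 * h * k1[0], y + 0.5 * h * k1[1])
    k3 = field(x + 0.5 * h * k2[0], y + 0.5 * h * k2[1])
    k4 = field(x + h * k3[0], y + h * k3[1])
    return (
        x + h / 6.0 * (k1[0] + 2.0 * k2[0] + 2.0 * k3[0] + k4[0]),
        y + h / 6.0 * (k1[1] + 2.0 * k2[1] + 2.0 * k3[1] + k4[1]),
    )


def trace_streamline(field, seed, h=0.01, steps=100):
    """Return the list of points on the streamline through the seed."""
    points = [seed]
    for _ in range(steps):
        points.append(rk4_step(field, points[-1], h))
    return points


def rotation(x, y):
    """Analytic test field v(x, y) = (-y, x): rigid rotation about the origin."""
    return (-y, x)


# Trace from (1, 0) over a total integration time of 100 * 0.01 = 1.
points = trace_streamline(rotation, (1.0, 0.0), h=0.01, steps=100)
# For the rotation field, every point should stay on the unit circle.
final_radius = math.hypot(*points[-1])
```

The resulting point list can be passed directly to a plotting library as a polyline. The rotation field is convenient for checking an integrator because the exact streamline is known.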
Research Group Leaders
Institute for Computer Science
Data Science, Text Analysis, Natural Language Processing, Network Analysis, Data Management
Interdisciplinary Center for Scientific Computing (IWR)
Geometry, Visualization, Digital Humanities
Interdisciplinary Center for Scientific Computing (IWR)
Visual Data Science, Visualization, Feature Extraction, Dynamical Systems