Commit ad4c429

major project clean up, structure refactoring, documentation updates
1 parent eecc49f commit ad4c429

25 files changed

Lines changed: 617 additions & 2239 deletions

docs/column_mapping.md

Lines changed: 161 additions & 0 deletions
@@ -0,0 +1,161 @@
1+
# HMC Community Survey 2021 - Column Mapping Guide
2+
3+
This document provides a comprehensive mapping of the columns in the `responses_cleaned_mapped_to_publish.csv` dataset to their corresponding survey questions from the HMC Community Survey 2021.
4+
5+
## Survey Overview
6+
7+
The HMC Community Survey 2021 was conducted to understand research data management practices among researchers in the Helmholtz Association. The survey used a **dynamic questioning approach**: follow-up questions were shown based on previous answers, which is why the number of columns varies from section to section.
8+
9+
## Dataset Summary
10+
11+
- **Actual columns in published CSV**: 263 columns
12+
- **Potential survey columns**: 305 columns (from survey design)
13+
- **Completed responses**: 631 responses
14+
- **Data file**: `responses_cleaned_mapped_to_publish.csv`
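
A quick way to confirm these figures against a local copy of the file is to count the header fields and data rows. The following is a minimal sketch using only the standard library; the helper name `dataset_shape` is hypothetical, and the in-memory sample stands in for the real CSV so that the snippet runs on its own.

```python
import csv
import io

# Figures quoted in the Dataset Summary above.
EXPECTED_COLUMNS = 263
EXPECTED_RESPONSES = 631


def dataset_shape(handle):
    """Return (n_rows, n_columns) for an open CSV file with a header row."""
    reader = csv.reader(handle)
    header = next(reader)
    n_rows = sum(1 for _ in reader)
    return n_rows, len(header)


# Tiny in-memory stand-in (hypothetical values) so the sketch is runnable.
sample = io.StringIO("id,PERBG2/_,RSDP4/_\n1,Energy,3\n2,Health,5\n")
shape = dataset_shape(sample)
print(shape)
```

For the real dataset, pass `open("responses_cleaned_mapped_to_publish.csv", newline="", encoding="utf-8")` instead (the encoding is an assumption) and compare the result with `(EXPECTED_RESPONSES, EXPECTED_COLUMNS)`.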
15+
16+
## Survey Question Groups and Column Mappings
17+
18+
### 1. **Personal Background (PERBG)** - 26 columns
19+
20+
Characterizes survey respondents by their institutional affiliation, research field, scientific discipline, career level, and research experience.
21+
22+
- `PERBG1/_` - Helmholtz center affiliation
23+
- `PERBG1/other` - Other center specification
24+
- `PERBG2/_` - Helmholtz research field
25+
- `PERBG3/_` - Primary research area
26+
- `PERBG3/other` - Other research area
27+
- `PERBG3AGRI/_`, `PERBG3AGRI/other` - Agricultural sciences
28+
- `PERBG3BIO/_`, `PERBG3BIO/other` - Biological sciences
29+
- `PERBG3CHEM/_`, `PERBG3CHEM/other` - Chemistry
30+
- `PERBG3GEO/_`, `PERBG3GEO/other` - Earth sciences
31+
- `PERBG3ING/_`, `PERBG3ING/other` - Engineering sciences
32+
- `PERBG3LIFE/_`, `PERBG3LIFE/other` - Life sciences
33+
- `PERBG3MATH/_`, `PERBG3MATH/other` - Mathematics
34+
- `PERBG3MED/_`, `PERBG3MED/other` - Medical sciences
35+
- `PERBG3PHYS/_`, `PERBG3PHYS/other` - Physics
36+
- `PERBG3PSYCH/_`, `PERBG3PSYCH/other` - Psychology
37+
- `PERBG4/_` - Years working in research
38+
- `PERBG6/_`, `PERBG6/other` - Career level
39+
- `PERBG7/_` - ORCID ID availability
40+
- `PERBG8/_` - Familiarity with FAIR data guidelines
41+
42+
### 2. **Research Data Properties (RSDP)** - 71 columns
43+
44+
Characterizes the research data generated or used by respondents, including data sources, methods, tools, and formats.
45+
46+
- `RSDP1/1A2`, `RSDP1/3A4` - Data origin (reused vs self-generated)
47+
- `RSDP1b/1` - Data origin (simulated vs experimental)
48+
- `RSDP1c/1` through `RSDP1c/11`, `RSDP1c/other` - Data generation methods (12 columns)
49+
- `RSDP2/1` through `RSDP2/6`, `RSDP2/other` - Data collection methods (7 columns)
50+
- `RSDP2b/1-1` through `RSDP2b/7-3` - Detailed data collection workflows (21 columns)
51+
- `RSDP3/1` through `RSDP3/15`, `RSDP3/other` - Data formats used (16 columns)
52+
- `RSDP4/_` - Data collection duration
53+
- `RSDP7/_` - Publication data volume estimation
54+
- `RSDP8/_` - Data processing time
55+
- `RSDP10/_` - Important software applications
56+
- `RSDP11/_` - Software application importance
57+
58+
### 3. **Research Data Management Practices (RDMPR)** - 86 columns
59+
60+
Focuses on research data storage routines, data annotation, documentation practices, and metadata handling.
61+
62+
- `RDMPR1/1` through `RDMPR1/3`, `RDMPR1/0`, `RDMPR1/other` - Data storage locations (5 columns)
63+
- `RDMPR3/1` through `RDMPR3/3`, `RDMPR3/other`, `RDMPR3/0` - Documentation methods (5 columns)
64+
- `RDMPR4/_` - Structured documentation (yes/no)
65+
- `RDMPR5/_` - International standards usage
66+
- `RDMPR6/1` through `RDMPR6/26`, `RDMPR6/other` - Metadata categories collected (27 columns)
67+
- `RDMPR7/2` through `RDMPR7/9`, `RDMPR7/other` - Digital metadata documentation (9 columns)
68+
- `RDMPR8/2` through `RDMPR8/10` - Automated metadata collection (9 columns)
69+
- `RDMPR9/2` through `RDMPR9/10` - Manual metadata collection (9 columns)
70+
- `RDMPR10/1` through `RDMPR10/3` - Structured documentation motivations (3 columns)
71+
- `RDMPR11/0` through `RDMPR11/9`, `RDMPR11/other` - Metadata collection obstacles (11 columns)
72+
- `RDMPR12/1` through `RDMPR12/6`, `RDMPR12/0`, `RDMPR12/other` - International standards used (8 columns)
73+
74+
### 4. **Data Publishing Practices (DTPUB)** - 62 columns
75+
76+
Addresses respondents' experience in making research data publicly available, including motivations and challenges.
77+
78+
- `DTPUB1b/1` through `DTPUB1b/3`, `DTPUB1b/other` - Data publishing methods (4 columns)
79+
- `DTPUB3/1` through `DTPUB3/7`, `DTPUB3/other` - Data publishing motivations (8 columns)
80+
- `DTPUB4a/0` through `DTPUB4a/7`, `DTPUB4a/other` - Data publishing obstacles (9 columns)
81+
- `DTPUB4b/0` through `DTPUB4b/7`, `DTPUB4b/other` - Barriers for non-publishers (9 columns)
82+
- `DTPUB5/1` through `DTPUB5/5` - Publishing percentage estimation (5 columns)
83+
- `DTPUB6/1` - Repository usage (1 column)
84+
- `DTPUB7/1`, `DTPUB7/21` through `DTPUB7/93`, `DTPUB7/other`, `DTPUB7/0` - Published metadata types (26 columns)
85+
86+
### 5. **Services and Support Needs (SERVC)** - 12 columns
87+
88+
Addresses respondents' perceived need for support in various topics of research data management and preferred service formats.
89+
90+
- `SERVC1/1` through `SERVC1/9`, `SERVC1/other`, `SERVC1/0` - Support needs areas (11 columns)
91+
- `SERVC2/1` through `SERVC2/6` - Service format preferences (6 columns)
92+
93+
### 6. **Technical/Administrative Columns** - 8 columns
94+
95+
System-generated fields for survey administration and analysis.
96+
97+
- `id` - Response identifier
98+
- `interviewtime/_` - Interview duration
99+
- `lastpage/_` - Last page reached in survey
100+
- `submitdate/_` - Submission timestamp
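
The per-section column counts in the headings above can be checked against the actual CSV header by grouping column names on their prefix. This is a sketch under the assumption that every survey column starts with one of the five group codes and that everything else is technical; `group_of` is a hypothetical helper, and the sample header is a small illustrative subset rather than the full file.

```python
from collections import Counter

# Group codes taken from the section headings in this guide.
GROUP_PREFIXES = ("PERBG", "RSDP", "RDMPR", "DTPUB", "SERVC")


def group_of(column):
    """Map a column name to its survey group, or TECHNICAL if none matches."""
    for prefix in GROUP_PREFIXES:
        if column.startswith(prefix):
            return prefix
    return "TECHNICAL"


# Illustrative subset of column names listed in this guide.
sample_header = [
    "id", "PERBG1/_", "PERBG3BIO/other", "RSDP1b/1",
    "RDMPR6/26", "DTPUB7/other", "SERVC2/6", "submitdate/_",
]
counts = Counter(group_of(name) for name in sample_header)
for group, n in sorted(counts.items()):
    print(group, n)
```

Running the same counter over the full header row gives one number per group that can be compared directly with the counts quoted in the section headings.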
101+
102+
## Survey Logic and Adaptive Questioning
103+
104+
The survey implemented **conditional logic** where:
105+
- Questions were dynamically adapted to respondents' expertise levels
106+
- Follow-up questions appeared based on previous answers
107+
- Different paths were available for different experience levels
108+
- Not all respondents saw all questions
109+
110+
As a result, the survey design defines 305 possible columns, while the published dataset contains only 263 columns after data cleaning and anonymization.
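
Because not all respondents saw all questions, many columns are only partly filled. The sketch below computes the share of respondents with a non-blank value per column. It assumes that unseen questions appear as empty cells in the CSV, which this guide does not state explicitly; the helper `answered_share`, the sample rows, and the dependency between `RDMPR4/_` and `RDMPR10/1` are hypothetical illustrations.

```python
import csv
import io


def answered_share(rows, column):
    """Fraction of respondents with a non-blank value in `column`."""
    if not rows:
        return 0.0
    answered = sum(1 for row in rows if (row.get(column) or "").strip())
    return answered / len(rows)


# Hypothetical rows: the follow-up column is blank where it was not shown.
sample = io.StringIO(
    "id,RDMPR4/_,RDMPR10/1\n"
    "1,Yes,Yes\n"
    "2,No,\n"
    "3,Yes,No\n"
    "4,No,\n"
)
rows = list(csv.DictReader(sample))
shares = {col: answered_share(rows, col) for col in ("RDMPR4/_", "RDMPR10/1")}
print(shares)
```

A low share on a follow-up column therefore does not necessarily mean that respondents skipped the question; it may simply not have been shown to them.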
111+
112+
## Key Survey Focus Areas
113+
114+
The survey particularly focused on understanding:
115+
116+
1. **Current practices** in research data management
117+
2. **Metadata handling** and documentation approaches
118+
3. **Data publishing behaviors** and motivations
119+
4. **Support needs** for FAIR data implementation
120+
5. **Barriers and obstacles** researchers face
121+
6. **Community-specific requirements** across six Helmholtz research fields
122+
123+
## Research Fields Covered
124+
125+
The survey covered all six Helmholtz research fields:
126+
- Aeronautics, Space, and Transport (AST)
127+
- Earth and Environment (E&E)
128+
- Energy
129+
- Health
130+
- Information
131+
- Matter
132+
133+
## Data Collection Details
134+
135+
- **Survey Period**: September to November 2021
136+
- **Total Responses**: 631 completed responses
137+
- **Implementation**: LimeSurvey platform
138+
- **Data Collection**: Fully anonymized
139+
- **Target Group**: Scientific staff across all Helmholtz research centers
140+
141+
## Data Processing and Column Reduction
142+
143+
The published dataset contains **263 columns** rather than the full 305 possible columns from the survey design. This reduction occurred during data processing for the following reasons:
144+
145+
1. **Anonymization**: Institutional affiliation data and other identifying information were removed
146+
2. **Privacy protection**: Software names used by fewer than 4 respondents were anonymized
147+
3. **Data cleaning**: Empty or unused columns may have been filtered out
148+
4. **Conditional questions**: Some survey paths may not have generated responses, resulting in unused columns
149+
150+
The report specifically mentions: "Before the data publication the following information was removed or anonymized from the survey data in order to prevent the identification of individuals: Any information – including that might reveal a respondent's institutional affiliation, Names of software that is used by less than 4 respondents, Any information about institutional repositories."
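
The threshold rule for software names can be illustrated with a short sketch. This is not the procedure used by the survey team, which is not described here; it only shows one way to replace values mentioned by fewer than 4 respondents. The function name `anonymize_rare` and the placeholder string are hypothetical.

```python
from collections import Counter

MIN_RESPONDENTS = 4          # threshold quoted from the report
PLACEHOLDER = "anonymized"   # hypothetical replacement label


def anonymize_rare(values, min_count=MIN_RESPONDENTS, placeholder=PLACEHOLDER):
    """Replace non-empty values that occur fewer than `min_count` times."""
    counts = Counter(v for v in values if v)
    return [
        v if not v or counts[v] >= min_count else placeholder
        for v in values
    ]


# Hypothetical answers: one name given 4 times, one given 3 times, one blank.
answers = ["Python"] * 4 + ["Origin"] * 3 + [""]
cleaned = anonymize_rare(answers)
print(cleaned)
```

Blank answers are left untouched so that missing values stay distinguishable from anonymized ones.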
151+
152+
## Usage Notes
153+
154+
- Column headers use a hierarchical `QUESTION/OPTION` naming convention, where the question code begins with its group prefix (e.g., `RSDP2b/1-1`)
155+
- Multiple choice questions have separate columns for each option
156+
- Rating scales and slider questions have numeric values
157+
- Free text responses were cleaned and categorized where applicable
158+
- The `/_` suffix typically indicates single-choice or numeric responses
159+
- Numbered suffixes (e.g., `/1`, `/2`) indicate multiple choice options
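
The suffix rules above can be turned into a small parser. The following is a sketch based only on the column names listed in this guide; the function name `parse_column` and the kind labels are hypothetical, and technical columns such as `interviewtime/_` will also be reported as `single` because they share the `/_` suffix.

```python
def parse_column(name):
    """Split a column name into (question, suffix, kind).

    kind is one of: "technical" (no slash), "single" (suffix "_"),
    "other_text" (suffix "other"), or "option" (any other suffix).
    """
    question, sep, suffix = name.partition("/")
    if not sep:
        kind = "technical"
    elif suffix == "_":
        kind = "single"
    elif suffix == "other":
        kind = "other_text"
    else:
        kind = "option"
    return question, suffix, kind


for column in ("PERBG1/_", "RSDP2b/1-1", "DTPUB7/other", "id"):
    print(column, parse_column(column))
```

Grouping on the first element of the returned tuple collects all option columns that belong to one multiple choice question.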
160+
161+
This mapping enables researchers and analysts to understand the structure and content of the survey data for further analysis and visualization.

docs/refactoring/README.md

Lines changed: 117 additions & 28 deletions
@@ -88,7 +88,7 @@ This documentation repository contains comprehensive information about the major
8888

8989
---
9090

91-
## 🏗️ New Architecture Summary
91+
## 🏗️ Current Architecture Summary
9292

9393
### Before Refactoring
9494
```
@@ -99,15 +99,31 @@ main.py (1,198 lines)
9999
└── Monolithic structure
100100
```
101101

102-
### After Refactoring
102+
### Current Refactored Structure
103103
```
104104
survey_dashboard/
105-
├── config.py (124 lines) # Configuration & Constants
106-
├── data_processor.py (396 lines) # Data Operations
107-
├── widgets.py (168 lines) # UI Widget Creation
108-
├── visualizations.py (352 lines) # Chart Management
109-
├── layout_manager.py (192 lines) # Layout & Templates
110-
└── main.py (113 lines) # Clean Orchestration
105+
├── core/ # Core Business Logic
106+
│ ├── config.py # Configuration & Constants + HMC Colors
107+
│ ├── data.py # Data Operations & Processing
108+
│ └── charts.py # Chart Creation & Management
109+
├── ui/ # User Interface Layer
110+
│ ├── widgets.py # UI Widget Factory
111+
│ ├── layout.py # Layout Management
112+
│ └── callbacks.py # Interactive Callbacks
113+
├── i18n/ # Internationalization
114+
│ └── text_display.py # Multilingual Text Content
115+
├── hmc_layout/ # HMC-Specific Styling
116+
│ ├── hmc_colordicts.py # Official HMC Color Palettes
117+
│ ├── hmc_custom_layout.py # Custom CSS Styling
118+
│ ├── assets/ # SVG Icons & Graphics
119+
│ └── static/ # Static Web Assets
120+
├── data/ # Data Storage & Configuration
121+
│ ├── hcs_clean_dictionaries.py # Survey Data Mappings
122+
│ ├── *.csv # Survey Dataset Files
123+
│ └── *.json # Additional Data Files
124+
├── app.py # Main Application Entry Point
125+
├── analysis.py # Statistical Analysis Functions
126+
└── plots.py # Core Plotting Functions
111127
```
112128

113129
---
@@ -145,41 +161,85 @@ survey_dashboard/
145161

146162
## 🛠️ Module Responsibilities
147163

148-
### `config.py` - Configuration Hub
164+
### Core Business Logic (`core/`)
165+
166+
#### `core/config.py` - Configuration Hub
149167
- Global constants and environment variables
150-
- Color schemes and styling configuration
168+
- **HMC color schemes** and styling configuration
151169
- File paths and data source management
152170
- Widget options and template configuration
153171

154-
### `data_processor.py` - Data Operations
172+
#### `core/data.py` - Data Operations Engine
155173
- CSV loading and preprocessing
156174
- Question mapping and translation
157175
- Data filtering and aggregation
158176
- Statistical calculations and transformations
159177

160-
### `widgets.py` - UI Factory
161-
- Interactive widget creation
178+
#### `core/charts.py` - Chart Creation Manager
179+
- Overview, exploration, and correlation chart creation
180+
- Word cloud generation and management
181+
- Chart type selection and configuration
182+
- Visualization data preparation
183+
184+
### User Interface Layer (`ui/`)
185+
186+
#### `ui/widgets.py` - UI Widget Factory
187+
- Interactive widget creation (selectors, filters, controls)
162188
- Widget configuration and organization
163189
- Control group management
164190
- Panel component generation
165191

166-
### `visualizations.py` - Chart Engine
167-
- Chart creation and management
168-
- Visualization updates and callbacks
169-
- Word cloud generation
170-
- Interactive plot handling
171-
172-
### `layout_manager.py` - Layout Controller
173-
- Dashboard layout assembly
174-
- Template integration and configuration
175-
- Responsive design implementation
192+
#### `ui/layout.py` - Layout Manager
193+
- Dashboard layout assembly and template integration
194+
- Accordion structure and responsive design
176195
- Section organization and styling
196+
- Template variable management
197+
198+
#### `ui/callbacks.py` - Interactive Callbacks
199+
- Widget event handling and chart updates
200+
- User interaction management
201+
- Dynamic content updates
202+
203+
### Internationalization (`i18n/`)
204+
205+
#### `i18n/text_display.py` - Multilingual Content
206+
- Translatable text content (English/German)
207+
- UI labels and descriptions
208+
- Question text and tooltip content
209+
210+
### HMC-Specific Styling (`hmc_layout/`)
211+
212+
#### `hmc_layout/hmc_colordicts.py` - Official Color Palettes
213+
- **Helmholtz research hub colors** (Information, Health, Matter, etc.)
214+
- **HMC brand color palettes** for charts and visualizations
215+
- Color utility functions and matplotlib integration
216+
217+
#### `hmc_layout/hmc_custom_layout.py` - Custom CSS Styling
218+
- Accordion and card styling
219+
- Responsive design CSS
220+
- Panel component customization
221+
222+
### Data Layer (`data/`)
223+
224+
#### `data/hcs_clean_dictionaries.py` - Survey Data Configuration
225+
- Survey question mappings and translations
226+
- Data type specifications and validation
227+
- Multiple choice question handling
177228

178-
### `main.py` - Application Orchestrator
229+
### Application Entry Points
230+
231+
#### `app.py` - Main Application Entry Point
179232
- Component initialization and dependency injection
180233
- Callback registration and event wiring
181-
- Application startup flow
182-
- High-level coordination
234+
- Application startup flow and coordination
235+
236+
#### `analysis.py` - Statistical Analysis Functions
237+
- Cross-tabulation and statistical calculations
238+
- Data aggregation and transformation utilities
239+
240+
#### `plots.py` - Core Plotting Functions
241+
- Bokeh-based chart creation utilities
242+
- Plot styling and configuration helpers
183243

184244
---
185245

@@ -250,23 +310,52 @@ docs/refactoring/
250310

251311
---
252312

313+
## 🎨 Recent Improvements & Features
314+
315+
### HMC Branding Integration (September 2024)
316+
- **Official HMC Color Palettes** - Integrated Helmholtz research hub colors
317+
- **Chart Color Consistency** - All visualizations use official HMC branding
318+
- **Research Field Colors** - Specific colors for Information, Health, Matter, Energy, etc.
319+
- **Graceful Fallbacks** - Colors work with or without optional matplotlib dependencies
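
A common pattern for this kind of graceful fallback is a guarded import. The sketch below is not the code in `hmc_layout/hmc_colordicts.py`; it only illustrates the idea, and the names `HAVE_MPL` and `as_plot_color` are hypothetical. `matplotlib.colors.to_rgba` is used when matplotlib is installed, and the plain hex string is returned otherwise.

```python
try:
    # Optional dependency: only needed for RGBA conversion.
    from matplotlib.colors import to_rgba
    HAVE_MPL = True
except ImportError:
    to_rgba = None
    HAVE_MPL = False


def as_plot_color(hex_color):
    """Return an RGBA tuple if matplotlib is available, else the hex string."""
    if HAVE_MPL:
        return to_rgba(hex_color)
    return hex_color


color = as_plot_color("#336699")  # arbitrary example value, not an HMC color
print(color)
```

Hex strings are accepted directly by Bokeh, so the fallback branch still yields usable chart colors.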
320+
321+
### File Organization Improvements
322+
- **Internationalization Structure** - Moved `text_display.py` to dedicated `i18n/` directory
323+
- **HMC Layout Consolidation** - All styling components in `hmc_layout/` directory
324+
- **Data Structure Cleanup** - Survey mappings properly organized in `data/` directory
325+
- **Import Path Fixes** - Updated all import statements for new structure
326+
327+
### Code Quality Enhancements
328+
- **Type Hints Added** - Improved static type checking with Pyright compatibility
329+
- **Error Handling** - Robust handling of optional dependencies
330+
- **Documentation Updates** - Comprehensive module documentation and examples
331+
332+
### Developer Experience
333+
- **Column Mapping Guide** - Detailed mapping of the 263 CSV columns to survey questions
334+
- **Data Verification** - Confirmed 631 responses match HMC report specifications
335+
- **Color Utility Functions** - Easy-to-use functions for getting HMC colors in charts
336+
337+
---
338+
253339
## ❓ Frequently Asked Questions
254340

255341
### Q: Does the refactored version work exactly the same?
256-
**A:** Yes! All functionality is preserved. Users see no difference, but developers get a much better codebase.
342+
**A:** Yes! All functionality is preserved. Users see no difference, but developers get a much better codebase with official HMC branding.
257343

258344
### Q: Do I need to change deployment scripts?
259345
**A:** No changes needed. The same Panel serve command works exactly as before.
260346

261347
### Q: Can I still modify the dashboard?
262-
**A:** Yes, but it's now much easier! Check the [Developer Migration Guide](developer-migration-guide.md) for details.
348+
**A:** Yes, but it's now much easier! Check the [Developer Migration Guide](developer-migration-guide.md) for details. Plus you now have official HMC colors available.
263349

264350
### Q: What about performance?
265351
**A:** No performance impact. The modular structure may even be slightly faster due to better organization.
266352

267353
### Q: How do I add new features?
268354
**A:** Much easier now! Each type of change goes to its specific module. See the [Module Architecture](module-architecture.md) guide.
269355

356+
### Q: How do I use the new HMC colors?
357+
**A:** Import from `hmc_layout.hmc_colordicts` - colors are automatically applied to charts, or use `get_hmc_colors(n)` for custom visualizations.
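
The behaviour of a helper like `get_hmc_colors(n)` might look roughly like the stand-in below. This is a sketch only: the real function lives in `hmc_layout.hmc_colordicts` and its exact signature and return type are not shown in this commit, the hex values here are placeholders rather than the official HMC palette, and cycling when `n` exceeds the palette length is an assumption.

```python
from itertools import cycle, islice

# Placeholder hex values, NOT the official HMC palette.
_PLACEHOLDER_PALETTE = ["#1f4e79", "#2e8b57", "#c0504d", "#f2a900"]


def get_hmc_colors_sketch(n):
    """Return n colors, cycling through the palette if n exceeds its length."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return list(islice(cycle(_PLACEHOLDER_PALETTE), n))


colors = get_hmc_colors_sketch(6)
print(colors)
```

The resulting list can be passed wherever a chart expects one color per category, for example as a Bokeh palette.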
358+
270359
---
271360

272361
## 📞 Support
