
Outlook

This thesis presents the architecture, code structure, and development environment of the Datadive platform. Although the current state is far from a complete data analysis platform, it lays a foundation for future development. This chapter reviews the platform’s current status and outlines the next steps needed to make it fully functional for researchers.

Challenges and Limitations

Exploring and implementing a viable architectural framework for this thesis project proved complex and time-consuming. While the initial plan aimed for a comprehensive demonstration, time constraints ultimately limited the scope of functionality. The architecture design underwent several iterations, and the exploratory prototypes built to assess the architecture’s feasibility consumed more time than expected. Initially, the idea was to implement an event-driven architecture that would execute code in plugins communicating with the main application through a message broker. However, this approach was abandoned in favor of using Jupyter components for code execution. Although the event-driven architecture would have provided more flexibility, the Jupyter components offered a more straightforward solution. This reduced the platform’s complexity by offloading code execution to a well-established open-source project. Additionally, users can utilize the JupyterLab interface of the Jupyter servers, allowing them to revert to writing code in a Jupyter notebook if the Datadive GUI does not meet their needs.

The first prototype of a Jupyter-based architecture used a single Jupyter server running in a Docker container for code execution. While this approach worked well in a prototype, it did not allow for isolating user code execution in a production environment. The final architecture employs individual Jupyter servers managed by JupyterHub running in Kubernetes to execute code in isolated environments, providing a secure and scalable solution for code execution.

This architecture lays a solid foundation for the Datadive platform, offering a clear structure for future development. The repository containing the codebase is organized to facilitate ongoing development, featuring clear documentation, automatic code analysis, and a prototype that verifies the proposed solutions.

The prototype enables users to create projects and notebooks, as well as execute code using Jupyter components through the HTTP API and a limited frontend.
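
For illustration, the following sketch shows the kind of interaction the prototype performs when creating a notebook through the Jupyter Server REST API (Contents API). The base URL and token handling are simplified assumptions; in the full architecture, JupyterHub provides the per-user server URL and credentials.

```ts
// Minimal sketch of creating a notebook via the Jupyter Server Contents API.
// JUPYTER_URL and JUPYTER_TOKEN are illustrative placeholders.
const JUPYTER_URL = "http://localhost:8888"; // hypothetical single-user server
const JUPYTER_TOKEN = process.env.JUPYTER_TOKEN ?? "";

export async function createNotebook(directory: string): Promise<{ path: string }> {
  const response = await fetch(`${JUPYTER_URL}/api/contents/${directory}`, {
    method: "POST",
    headers: {
      Authorization: `token ${JUPYTER_TOKEN}`,
      "Content-Type": "application/json",
    },
    // Ask the Contents API for a new untitled notebook in `directory`.
    body: JSON.stringify({ type: "notebook" }),
  });
  if (!response.ok) {
    throw new Error(`Failed to create notebook: ${response.status}`);
  }
  return response.json(); // contains the generated path, e.g. "Untitled.ipynb"
}
```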

Next Steps

The immediate next steps for the Datadive platform involve enhancing the functionality of both the frontend and backend applications. The data model created by the migrations of the @datadive/db package includes all the tables necessary for the platform’s core functionality, such as user authentication, project management, and notebook execution. The API definition in the @datadive/spec package will be extended to include all necessary endpoints for this core functionality, and these endpoints will be implemented in the @datadive/api package. Another immediate step is to refactor the @datadive/auth package to replace the deprecated Lucia Auth package with a custom authentication solution, using Lucia Auth as a learning resource [30].
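
As a rough illustration of the direction this refactoring could take, the following TypeScript sketch shows a session-based authentication helper in the style documented by the Lucia learning resource. The SessionStore interface, table layout, and 30-day lifetime are assumptions for illustration; the real implementation would live in @datadive/auth and persist sessions through the @datadive/db schema.

```ts
// Sketch of session creation and validation, assuming a hypothetical
// SessionStore backed by the database.
import { createHash, randomBytes } from "node:crypto";

export interface Session {
  id: string;          // SHA-256 hash of the token, never the raw token
  userId: string;
  expiresAt: Date;
}

export interface SessionStore {
  insert(session: Session): Promise<void>;
  find(id: string): Promise<Session | null>;
  delete(id: string): Promise<void>;
}

const SESSION_LIFETIME_MS = 1000 * 60 * 60 * 24 * 30; // assumed 30 days

// The raw token is handed to the client (e.g. as an HTTP-only cookie);
// only its hash is persisted, so a leaked database does not leak sessions.
export async function createSession(store: SessionStore, userId: string) {
  const token = randomBytes(32).toString("base64url");
  const session: Session = {
    id: createHash("sha256").update(token).digest("hex"),
    userId,
    expiresAt: new Date(Date.now() + SESSION_LIFETIME_MS),
  };
  await store.insert(session);
  return { token, session };
}

export async function validateSession(store: SessionStore, token: string) {
  const id = createHash("sha256").update(token).digest("hex");
  const session = await store.find(id);
  if (!session) return null;
  if (session.expiresAt.getTime() < Date.now()) {
    await store.delete(id);
    return null;
  }
  return session;
}
```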

Since the entire data model revolves around users “owning” projects that contain notebooks, it is essential to prioritize the implementation of endpoints for user management and authentication, as these endpoints form the foundation for all other functionalities. Once these endpoints are established, the next step will be to implement the endpoints for project management, which will allow users to create, update, and delete projects.
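
The following sketch illustrates what such project endpoints could look like. It assumes, purely for illustration, a Hono-style router with Zod validation and an in-memory store; the actual framework, schema, and persistence layer are defined by @datadive/api, @datadive/spec, and @datadive/db and may differ.

```ts
// Illustrative create/delete endpoints for project management.
import { Hono } from "hono";
import { randomUUID } from "node:crypto";
import { z } from "zod";

const projectInput = z.object({
  name: z.string().min(1),
  description: z.string().optional(),
});

// Hypothetical in-memory store standing in for the database layer.
const store = new Map<string, { id: string; ownerId: string; name: string; description?: string }>();

// The authenticated user id is assumed to be set by an auth middleware.
export const projects = new Hono<{ Variables: { userId: string } }>();

projects.post("/projects", async (c) => {
  const body = projectInput.parse(await c.req.json());
  const project = { id: randomUUID(), ownerId: c.get("userId"), ...body };
  store.set(project.id, project);
  return c.json(project, 201);
});

projects.delete("/projects/:id", async (c) => {
  const project = store.get(c.req.param("id"));
  if (!project || project.ownerId !== c.get("userId")) {
    return c.json({ error: "Not found" }, 404);
  }
  store.delete(project.id);
  return c.body(null, 204);
});
```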

Subsequently, the endpoints for notebook management will be refactored to enable users to create, update, and delete notebooks, as well as execute code within them. To manage notebook content, cell template endpoints will be implemented, allowing users to create, update, and delete cell templates. The final step to complete the core interactions of Datadive will be to implement the endpoints for managing notebook content, enabling users to utilize cell templates and required input data to create, update, delete, and reorder cells in notebooks.
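
To make these interactions concrete, the following TypeScript sketch outlines one possible shape for cell templates and notebook cells; the field names are illustrative assumptions and not part of the current data model defined by @datadive/db.

```ts
// Hypothetical records behind the cell-template and cell-management endpoints.
interface CellTemplateInput {
  name: string;                         // placeholder name used in the snippet
  label: string;                        // label shown in the generated form
  type: "number" | "string" | "column"; // kind of value the user must provide
  required: boolean;
}

interface CellTemplate {
  id: string;
  title: string;
  inputs: CellTemplateInput[];
  // Code snippet with placeholders, e.g. "df['{column}'].mean()".
  source: string;
}

interface NotebookCell {
  id: string;
  notebookId: string;
  templateId: string;
  // Values supplied by the user for the template's inputs.
  inputValues: Record<string, string | number>;
  // Position used to reorder cells within a notebook.
  position: number;
}
```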

Another crucial step to get the Datadive platform production-ready is implementing a custom authenticator for JupyterHub and defining a deployment strategy. The authenticator should regulate access to the Jupyter servers by authenticating users against the Datadive API, allowing only authenticated users to access the Jupyter servers [4]. This is crucial for the security of the platform, as it ensures that only authorized users can execute code on the platform and that user data is protected. The deployment strategy should include all the configuration necessary for a production Kubernetes cluster, such as setting up the JupyterHub Helm chart, configuring the JupyterHub authenticator, and several other settings required for a production deployment of JupyterHub. It is also likely that Datadive will need a custom Jupyter Server Docker image that allows installing additional packages and libraries or restricts access to certain parts of the JupyterLab interface, ensuring that users can only use features that won’t interfere with the Datadive platform [10].
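
On the Datadive side, such an authenticator needs an endpoint it can call to verify a user before JupyterHub spawns a server. The following TypeScript sketch illustrates one possible form of that endpoint, reusing a session-validation function like the one sketched earlier; the route path, header format, and response shape are assumptions rather than part of the current @datadive/spec definition.

```ts
// Illustrative verification endpoint for a custom JupyterHub authenticator.
import { Hono } from "hono";

type ValidateSession = (token: string) => Promise<{ userId: string } | null>;

export function createHubAuthRoute(validateSession: ValidateSession) {
  const app = new Hono();

  app.get("/internal/hub-auth", async (c) => {
    const token = c.req.header("Authorization")?.replace("Bearer ", "");
    if (!token) return c.json({ authenticated: false }, 401);

    const session = await validateSession(token);
    if (!session) return c.json({ authenticated: false }, 401);

    // JupyterHub only needs a stable username to map the user to their server.
    return c.json({ authenticated: true, username: session.userId });
  });

  return app;
}
```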

The next steps for the frontend are less clear because both the design and, more importantly, the user interactions require further exploration and more clearly defined requirements. The platform’s core interactions involve significant complexity. Designing a user-friendly “workspace” that allows users to manage notebook cells, execute code, view results, and switch seamlessly to an IDE-like JupyterLab interface is challenging. Implementing dynamically generated forms from server data for cell input is complex, especially if these forms need to include UX-enhancing features like client-side data validation or drafts to save progress. The next steps for the frontend should focus on exploring various designs and user interactions, defining clear requirements, and implementing the necessary components and pages to support the platform’s core interactions.
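
As a starting point for this exploration, the following sketch shows how form fields and simple client-side validation could be derived from a cell template’s input definitions; the types and validation rules are illustrative assumptions.

```ts
// Deriving form fields from cell-template inputs delivered by the server.
interface CellTemplateInput {
  name: string;
  label: string;
  type: "number" | "string" | "column";
  required: boolean;
}

interface FormField {
  name: string;
  label: string;
  inputType: "number" | "text" | "select";
  required: boolean;
}

export function buildForm(inputs: CellTemplateInput[]): FormField[] {
  return inputs.map((input) => ({
    name: input.name,
    label: input.label,
    // Column inputs are rendered as a select populated with the project's data columns.
    inputType: input.type === "number" ? "number" : input.type === "column" ? "select" : "text",
    required: input.required,
  }));
}

// Client-side validation before submitting the cell to the API.
export function validate(fields: FormField[], values: Record<string, unknown>): string[] {
  return fields
    .filter((f) => f.required && (values[f.name] === undefined || values[f.name] === ""))
    .map((f) => `${f.label} is required`);
}
```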


Beyond the Core Functionality

Once the core functionality of the Datadive platform is implemented, the next steps involve extending the platform with additional features. These features should be based on the requirements identified in an upcoming conceptual phase of the Datadive project. Defining these features is beyond the scope of this thesis. However, some potential features that are likely to be needed or have been discussed during the conceptual phase of this thesis are listed in the following paragraphs.

One necessary feature is extending the platform with additional data import and export functionality. The current data model allows files to be stored on the Jupyter Server of their respective project, with references stored in the database. This approach is straightforward and effective if the files are used only within the project’s notebooks. However, the Datadive platform will likely include features that require access to data outside the notebook interface. For example, users may want to preview or edit data before adding it to a project, share data between projects, or share data analysis results using public links. If the data is stored on a Jupyter Server, that server must always be running for the data to be accessible. Running servers is resource-intensive, and keeping all servers operational at all times is likely not scalable. Starting and stopping servers on demand also requires resources and time, so accessing data this way could lead to slow response times and is probably not scalable either. Storing data outside of Jupyter servers presents its own challenges, such as accessing files within the Jupyter Server for code execution. One solution could involve including a custom ContentsManager class in the Datadive Jupyter Server Docker image. With a custom ContentsManager, files could be stored in a shared location, such as an S3¹ bucket, and accessed by the Jupyter Server through the ContentsManager and by the Datadive backend via the S3 API. [9]
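
As an illustration of the backend side of this idea, the following sketch shows how the Datadive backend could hand out time-limited links to files in a shared S3 bucket using the AWS SDK for JavaScript; the bucket name and key layout are hypothetical.

```ts
// Sketch of serving project files from a shared S3 bucket via presigned URLs.
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

const s3 = new S3Client({ region: "eu-central-1" }); // region is an assumption

export async function createShareLink(projectId: string, fileName: string) {
  const command = new GetObjectCommand({
    Bucket: "datadive-project-files",          // hypothetical bucket
    Key: `projects/${projectId}/${fileName}`,  // hypothetical key layout
  });
  // The link expires after one hour, so shared results are not public forever.
  return getSignedUrl(s3, command, { expiresIn: 3600 });
}
```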

Another feature under discussion is an “assistant” that helps users select the appropriate statistical test for data analysis based on the data’s shape and meta-information extracted from related sources, such as Qualtrics. This feature could be developed by expanding the concept of cell templates beyond just a collection of inputs and a code snippet with placeholders. Datadive could differentiate between complex and simple cells, with complex cells linked to hardcoded functionality that accesses project data and provides feedback to users. If this functionality were implemented generically, it could pave the way for additional features that assist users in completing data analysis tasks.
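
One possible way to express this distinction in the type system is a discriminated union over the template kind, sketched below; the names are illustrative and not part of the current data model.

```ts
// Hypothetical split between simple and complex cell templates.
type SimpleCellTemplate = {
  kind: "simple";
  id: string;
  inputs: { name: string; required: boolean }[];
  source: string; // code snippet with placeholders
};

type ComplexCellTemplate = {
  kind: "complex";
  id: string;
  // Key referencing hardcoded backend functionality, e.g. an assistant that
  // inspects the project's data and recommends a statistical test.
  handler: "statistical-test-assistant";
};

type CellTemplateKind = SimpleCellTemplate | ComplexCellTemplate;
```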

Collaboration features are also needed, such as sharing projects, notebooks, and files not only with other users but also with external collaborators. Version control is another feature that is likely to be needed. The current data model does not include versioning, which is crucial for reproducibility. Implementing version control will require changes to the data model and the backend API, as well as to the frontend to support viewing and restoring previous versions of projects and notebooks.
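
As a rough sketch of the data-model change this would imply, notebooks could reference immutable version snapshots instead of storing their cells directly; the following types are illustrative only.

```ts
// Hypothetical versioned notebook records.
interface Notebook {
  id: string;
  projectId: string;
  currentVersionId: string; // points at the latest NotebookVersion
}

interface NotebookVersion {
  id: string;
  notebookId: string;
  createdAt: Date;
  createdBy: string;  // user who saved this version
  cells: unknown[];   // snapshot of the notebook's cells at save time
}
```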

Conclusion

The Datadive platform, though still in its early stages, establishes a foundational framework for developing a comprehensive data analysis tool. The project has made key architectural choices, such as leveraging Jupyter components and implementing a scalable infrastructure with JupyterHub and Kubernetes, which provide a solid basis for further development. While not all planned features could be implemented due to time constraints, the platform’s modular design and organized codebase support ongoing enhancements. The outlined steps for implementing core functionalities, including user authentication, project management, and notebook execution, are crucial for advancing the platform. Future work will focus on addressing data management challenges and exploring additional features to assist users in data analysis. Despite current limitations, this thesis lays the groundwork for future development efforts aimed at creating a functional research tool.

Footnotes

  1. Amazon S3 (Simple Storage Service) is a scalable object storage service offered by Amazon Web Services (AWS) designed for storing and retrieving any amount of data from anywhere on the web. It provides features such as durability, availability, and security for data storage, making it suitable for various applications including data backup, archiving, content distribution, and big data analytics. S3 organizes data into buckets, with each object identified by a unique key, enabling efficient data management and access.