High-Performance Library for Big Data Analysis and Visualization (2024-25)
In this project, a high-performance library will be developed to process arbitrarily large data sets (terabytes and beyond) while ensuring numerical stability (i.e., the result must be correct), minimizing file handling (i.e., one-pass algorithms), and exploiting parallelism to the fullest extent for maximum performance and "just-in-time" visualization. We will work with file formats optimized for this purpose and with data from the domain of side-channel analysis: time-series measurement data with, e.g., up to 5 billion points per trace (the waveform recorded by an oscilloscope). The next steps are to find a way to dynamically plot such a time-series recording (with point-and-click zooming, etc.), which may involve techniques such as down-sampling, computing the envelope over multiple traces, and computing the mean/variance; to offer on-the-fly, higher-resolution plotting of smaller time windows; to compute FFTs over selected time ranges; and to select "points of interest" for computing histograms or other statistical properties "vertically" across the same time points of multiple traces. Not all data will be nicely aligned, so as a further step you would implement alignment algorithms based either on automated pattern matching or on manually selected points that serve as a template.
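To make "one-pass" and "numerically stable" concrete, below is a minimal sketch (illustrative, not project code) of a streaming mean/variance computation. It reads each sample exactly once, chunk by chunk, and merges partial results using the pairwise-update formula of Chan et al.; the function name and chunking scheme are assumptions for the example.

```python
import numpy as np

def chunked_mean_var(chunks):
    """One-pass mean/variance over an iterable of 1-D numpy chunks.

    Merges per-chunk statistics with Chan et al.'s pairwise update,
    so the result stays numerically stable even when the data set
    is far larger than memory.
    """
    n = 0          # samples seen so far
    mean = 0.0     # running mean
    m2 = 0.0       # running sum of squared deviations
    for chunk in chunks:
        c = np.asarray(chunk, dtype=np.float64)
        cn = c.size
        if cn == 0:
            continue
        c_mean = c.mean()
        c_m2 = ((c - c_mean) ** 2).sum()
        delta = c_mean - mean
        total = n + cn
        mean += delta * cn / total
        m2 += c_m2 + delta * delta * n * cn / total
        n = total
    return mean, m2 / n  # population variance

# Example: stream a large trace in chunks instead of loading it whole
# (file name and chunk size are hypothetical):
# trace = np.memmap("trace.bin", dtype=np.float32, mode="r")
# step = 16_000_000
# mean, var = chunked_mean_var(trace[i:i + step] for i in range(0, len(trace), step))
```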
Objectives
The goal is a publishable project that offers high performance and fulfills the requirements of big data analysis and visualization. While some of the requirements are specific to the domain of side-channel analysis, the techniques and concepts learned will be beneficial for many other data-oriented tasks. Please have a look at https://github.com/vaexio/vaex/issues/653#issuecomment-605049785 to see what such a data visualizer can look like. We want to create a similar user experience while considering the specifics of the side-channel analysis domain.
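A common building block behind such visualizers is the min/max down-sampling mentioned above: rather than pushing billions of points to the screen, each pixel column shows only the minimum and maximum of its bucket, which preserves the visual envelope of the trace. A minimal numpy sketch, where the function name and bucket count are illustrative:

```python
import numpy as np

def minmax_downsample(trace, n_buckets):
    """Reduce a long 1-D trace to per-bucket (min, max) pairs for plotting.

    Truncates the tail so the trace divides evenly into n_buckets;
    a real implementation would handle the remainder and operate
    chunk-by-chunk on memory-mapped data.
    """
    bucket = len(trace) // n_buckets
    x = np.asarray(trace[: bucket * n_buckets]).reshape(n_buckets, bucket)
    return x.min(axis=1), x.max(axis=1)

# Plotting ~2000 (min, max) pairs preserves the visual envelope of a
# 5-billion-point trace while staying near typical screen resolution:
# lo, hi = minmax_downsample(trace, 2000)
```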
Motivations
Many developers rely on built-in library functions to compute statistical properties or to visualize data. But what do you do when the input data is larger than your computer's available memory? Can you still call mean(dataset) or plot(dataset)? Will the result still be numerically stable? How many times must the data be read before a result can be computed? What is the maximum data set size the library can plot/visualize before running out of memory? Surprisingly often, none of this is considered or documented in readily available libraries. In addition, some file formats are implemented in pure Python such that the Global Interpreter Lock (GIL) is held; consequently, accessing the same file from multiple threads is not always possible, and performance suffers. If you do not know what the GIL is, please go and look it up.
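As one illustration of how out-of-core processing and the GIL interact, the sketch below memory-maps a raw binary file (a hypothetical float32 layout) so the operating system pages data in on demand, and splits the reduction across threads. This works because numpy's reductions release the GIL while they run; the file name and helper function are assumptions for the example.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(path, dtype=np.float32, n_workers=8):
    """Out-of-core sum over a raw binary file using memory mapping.

    np.memmap lets the OS page data in on demand, so the file never
    has to fit in RAM; numpy reductions release the GIL while they
    run, so threads provide real parallelism here.
    """
    data = np.memmap(path, dtype=dtype, mode="r")
    step = max(1, (len(data) + n_workers - 1) // n_workers)
    slices = [data[i:i + step] for i in range(0, len(data), step)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Accumulate partial sums in float64 for numerical stability.
        partials = pool.map(lambda s: s.sum(dtype=np.float64), slices)
    return sum(partials)

# total = parallel_sum("traces.bin")  # hypothetical file
```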
Qualifications
Minimum Qualifications:
- Background in parallel programming concepts, in particular:
  - vectorization (SIMD)
  - parallel threading/processing
  - coroutines (async I/O)
- Basic knowledge of numpy and Jupyter
- Experience with developing and deploying Python projects
- Understanding of object-oriented software engineering patterns
- Linting, testing, packaging, etc. to build a professional software project
- Interest in security-related topics such as cryptography
Preferred Qualifications (in addition to the minimum):
- Strong background in numpy, Jupyter, ipywidgets
- Basic understanding of ECE-related math ("what does a Fourier transform do?", "time vs. frequency representation", "what is a filter", etc.)
Details
Project Partner: Vincent Immler
NDA/IPA: No Agreement Required
Number of Groups: 2
Project Status: Accepting Applicants
Website: https://github.com/vaexio/vaex/issues/653#issuecomment-605049785
Keywords:
Python, Research, Data Science, Big Data, Data Engineering, Jupyter