MetaNaviT: A Handy Tool for Resource Mapping & Extraction
Research and develop an AI-powered open-source tool designed to help a broad range of users organize, extract, and transform digital resources into customizable structured formats.
Objectives
Develop a “Resource Metadata Navigator/Transformer” (MetaNaviT) tool designed to index and manage resources from diverse sources such as local directories, AWS buckets, web content, and databases. MetaNaviT will leverage advanced AI/ML techniques and ETL (Extract, Transform, Load) processes to search, manipulate, and transform resources.
MetaNaviT will feature both a graphical user interface (metanavit-gui) and a command-line interface (metanavit-cli) for querying and interacting with indexed data. Key features and use cases include:
- Cross-Resource Deep Copy Extraction: Extract specific data objects, such as a class from a codebase, along with all dependencies and related resources from various sources (e.g., local directories, AWS buckets), transform them into a usable format (e.g., Markdown), and load them into the desired environment.
- Resource-Agnostic Metadata Search: Extract paths or references across different resource types (e.g., images in AWS S3 buckets, web content, or files in a database) based on specific criteria (e.g., images containing a puppy), transform the search results by correlating them with relevant metadata (e.g., where the image is referenced in source code), and load the final data set for further use.
- Comprehensive Resource Mapping: Generate detailed resource maps spanning multiple resource types, transforming raw data into coherent representations. These maps will reveal novel connections uncovered by AI/ML techniques, surpassing traditional methods like AST parsers, and ideally serving as a superior drop-in replacement for tools such as tree-sitter for generating code repository maps.
- User-Specific Adaptability: Automatically adapt to user-specific organizational methodologies, accommodating various directory tree layouts and high-level organizational strategies, whether implicit or explicit. This ensures alignment with the user's existing data structure, simplifying information management and retrieval based on personalized organizational approaches.
MetaNaviT will empower users to efficiently retrieve, transform, and load structured metadata from a wide array of sources, facilitating seamless exploration and management of data across disparate systems, all tailored to individual organizational styles.
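To make the Resource-Agnostic Metadata Search use case more concrete, the following is a minimal sketch in Python, not a proposed design. The `ResourceRecord` schema, the `search` and `to_markdown` helpers, and the sample URIs are all hypothetical. In a real system the tags and cross-references would presumably come from AI/ML models and crawlers rather than being hard-coded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceRecord:
    """One indexed resource, wherever it lives (hypothetical schema)."""
    uri: str                               # e.g. "file:///...", "s3://bucket/key", "https://..."
    kind: str                              # e.g. "image", "source", "document"
    tags: frozenset[str] = frozenset()     # labels a tagging model might assign
    referenced_by: tuple[str, ...] = ()    # URIs of resources that mention this one


def search(index: list[ResourceRecord], kind: str, tag: str) -> list[ResourceRecord]:
    """Extract: select records of a given kind that carry a given tag."""
    return [r for r in index if r.kind == kind and tag in r.tags]


def to_markdown(records: list[ResourceRecord]) -> str:
    """Transform: render each hit with the places that reference it."""
    lines = []
    for r in records:
        refs = ", ".join(r.referenced_by) or "none found"
        lines.append(f"- `{r.uri}` (referenced by: {refs})")
    return "\n".join(lines)


# Sample index mixing three source types (all values are made up).
index = [
    ResourceRecord("s3://photos/puppy.jpg", "image",
                   frozenset({"puppy"}), ("file:///app/home.html",)),
    ResourceRecord("file:///docs/readme.md", "document", frozenset({"docs"})),
    ResourceRecord("https://example.com/cat.png", "image", frozenset({"cat"})),
]

hits = search(index, kind="image", tag="puppy")
print(to_markdown(hits))
```

The point of the sketch is that the query never inspects where a resource is stored: the same record shape covers a local file, an S3 object, and a web URL, so correlation with metadata (here, `referenced_by`) can happen in one place.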
As MetaNaviT functions as an LLM assistant interface, it will offer all the standard features associated with LLMs (chat and instruct capabilities), enhanced with a deeper understanding of user resources beyond the current capabilities of naive RAG. It will also provide more powerful and integrated system-level functionalities than basic function calling.
NOTE: This open-source project will be developed on GitHub and will be language-agnostic. It will utilize the latest open-source models, such as Llama 3.1 405B and its fine-tunes, running on NVIDIA hardware on Linux.
Motivations
Currently, no high-quality open-source tool exists to assist users in organizing, extracting, and transforming their digital resources—despite these tasks being extremely common and often labor-intensive across various professions and use cases.
Some examples:
- AI Coding: In large codebases, it is often necessary to prune the resources included in an LLM context to only those relevant to the current task. This is currently done using tools like tree-sitter and other AST crawlers, but these solutions lack context awareness and reasoning capabilities. As a result, they struggle to discover novel connections that are harder to detect, such as files referenced in comments or implicitly referenced classes via dynamic loading and reflection.
- Structured Data Preparation: An ornithology researcher may have numerous unorganized files containing observational notes and photos. This tool could assist in organizing these filesystem resources and automatically producing a curated dataset, extracted as JSON.
- Personal Data Management & Retrieval: Many users have large collections of media files and documents that are haphazardly organized, making them tedious to maintain and browse. This tool can help users efficiently organize and retrieve their data.
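The gap described in the AI Coding example can be illustrated with a small Python sketch. It assumes a made-up source snippet (`SOURCE`) and two hypothetical helpers: `static_imports` walks the AST the way a conventional crawler would, while `comment_references` uses a crude regular expression to spot file paths mentioned in comments. The regex is only a stand-in for the context-aware reasoning this project aims for; it is not the project's method.

```python
import ast
import io
import re
import tokenize

# Made-up example file: one static import, one file mentioned only in a
# comment, and one module loaded dynamically by name.
SOURCE = '''\
import json

# Schema lives in config/schema.yaml; keep it in sync with this loader.
def load(name):
    module = __import__("plugins." + name)  # dynamic import
    return json.dumps({"loaded": module.__name__})
'''


def static_imports(source: str) -> set[str]:
    """Module names found by walking the AST for import statements."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module)
    return found


# Crude heuristic: anything that looks like a path with a known extension.
PATH_IN_COMMENT = re.compile(r"[\w./-]+\.(?:ya?ml|json|md|py|txt)\b")


def comment_references(source: str) -> set[str]:
    """File paths mentioned inside comments, which an AST walk never sees."""
    found = set()
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            found.update(PATH_IN_COMMENT.findall(tok.string))
    return found


print("AST imports:       ", static_imports(SOURCE))
print("Comment references:", comment_references(SOURCE))
```

In this snippet the AST walk reports only `json`. The schema file appears only in the comment scan, and the dynamically loaded `plugins.*` module is found by neither approach, which is the kind of connection the example above says current tools struggle to detect.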
The development of a tool that can 'grok' even highly complex resource trees—and serve as a useful assistant in organizing, extracting, and transforming resource data—would be an invaluable asset for a diverse range of users.
Qualifications
Minimum Qualifications:
- A strong interest in AI/ML, data science, and ETL processes, especially in the context of managing diverse digital resources.
- A willingness to learn and apply advanced AI/ML techniques to solve complex, cross-resource data management challenges.
- Prompt engineering skills, including the ability to construct prompt workflows that use function calling and RAG.
- Intermediate or better proficiency in programming CLI and GUI tools.
Details
Project Partner:
Galen St John
NDA/IPA: No Agreement Required
Number Groups: 1
Project Status: Accepting Applicants
Keywords: Artificial Intelligence (AI), Database, Machine Learning (ML), Data Engineering, New Product or Game