Awesome Little Red Dots

An open-science initiative to centralize and analyze literature on "Little Red Dots" (LRDs)

Quick Start

Project homepage: https://www.wenkeren.com/Awesome-Little-Red-Dots/

GitHub repository: https://github.com/WenkeRen/Awesome-Little-Red-Dots

NotebookLM notebook: https://notebooklm.google.com/notebook/717c4d8c-ba0f-496f-9e51-e603ee7ef10f

Fully Automated Literature Collection

The core of this idea is automation. I wrote a set of Python scripts and entrusted them to the ever‑reliable GitHub Actions for orchestration. The workflow is straightforward:

  • Daily wake‑up: Every morning at 6:00 (UTC), GitHub Actions wakes up my scripts on schedule.

  • Intelligence gathering: The scripts access NASA’s Astrophysics Data System (ADS) via its API, search with predefined keywords (for example, “Little Red Dot”), and capture every paper and preprint that appeared in the past 24 hours (see the sketch after this list).

  • Information structuring: The scripts parse the harvested records, extract the title, authors, abstract, journal, DOI, and other key fields, and normalize everything into standard BibTeX format.

  • Auto‑archiving: Finally, the updated bibliography is automatically committed back to our GitHub repository.
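
The actual scripts live in the repository; the sketch below is only a minimal illustration of the gathering and structuring steps. It assumes an ADS API token exported as ADS_TOKEN; the query string, field list, and function names are illustrative choices, not the repo's exact code.

```python
import os
import requests

ADS_API = "https://api.adsabs.harvard.edu/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['ADS_TOKEN']}"}

def fetch_recent_papers(keyword="Little Red Dot", days=1):
    """Query ADS for records that entered the database recently."""
    params = {
        # entdate filters on when a record was indexed by ADS.
        "q": f'abs:"{keyword}" entdate:[NOW-{days}DAYS TO NOW]',
        "fl": "bibcode,title,author,abstract,pub,doi,citation_count",
        "rows": 200,
        "sort": "date desc",
    }
    resp = requests.get(f"{ADS_API}/search/query", params=params,
                        headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return resp.json()["response"]["docs"]

def to_bibtex(bibcodes):
    """Let ADS render the harvested records as BibTeX entries."""
    resp = requests.post(f"{ADS_API}/export/bibtex",
                         json={"bibcode": bibcodes},
                         headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return resp.json()["export"]
```

Leaning on ADS's own BibTeX exporter avoids hand-assembling entries and keeps the formatting consistent across the whole bibliography.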

Tagging Design

Simply collecting papers isn’t enough—a long list can still feel overwhelming. So I introduced Alibaba Cloud’s large language model “Qwen” to assign intelligent tags that make it easier to retrieve subtopics you care about.

I first defined a taxonomy of LRD research, covering subfields from observational properties to theoretical modeling. Then, whenever a new paper enters the library, our automated pipeline performs the following steps:

  • Reading comprehension: Send the paper’s title and abstract to the AI as “reading material.”

  • Smart labeling: Based on its understanding of the text, the AI selects the 1–5 most relevant tags from my taxonomy. For example, a paper analyzing LRD spectra may be tagged with spectroscopy or agn-identification.

  • Data fusion: These AI‑generated tags are written into the paper’s BibTeX entry as part of its metadata.

Through this process, each paper is endowed with structured, filterable “identity information.”
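
As a minimal sketch of this loop, assume Qwen is reached through DashScope's OpenAI-compatible endpoint and that TAXONOMY stands in for the real tag tree; the prompt and model name here are illustrative, not the repo's exact configuration.

```python
import os
from openai import OpenAI

# Placeholder for the real taxonomy defined in the repo.
TAXONOMY = ["spectroscopy", "agn-identification", "high-redshift", "sed-modeling"]

client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

def tag_paper(title: str, abstract: str) -> list[str]:
    """Ask Qwen to pick the 1-5 most relevant tags for one paper."""
    prompt = (
        "You are labeling astronomy papers about Little Red Dots.\n"
        f"Allowed tags: {', '.join(TAXONOMY)}\n"
        "Reply with 1-5 tags, comma-separated, and nothing else.\n\n"
        f"Title: {title}\nAbstract: {abstract}"
    )
    resp = client.chat.completions.create(
        model="qwen-plus",
        messages=[{"role": "user", "content": prompt}],
    )
    tags = [t.strip() for t in resp.choices[0].message.content.split(",")]
    # Discard anything the model invented outside the taxonomy.
    return [t for t in tags if t in TAXONOMY]
```

Constraining the model to a fixed tag list keeps the labels filterable and prevents the vocabulary from drifting as new papers arrive.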

Front‑End Presentation

With the automated dataflow and intelligent tagging in place, the final step is to present everything in a friendly way to researchers interested in LRDs. To that end, I built a static website with Jekyll that sorts the literature and presents authors, titles, abstracts, citation counts, and relevant links.

On this website, you can:

  • Full-text search: Find any paper of interest by searching keywords across titles, authors, and abstracts.

  • Tag filtering: Click tags in the sidebar, such as high-redshift, and the site will instantly filter all LRD papers related to that topic.

  • One‑stop links: Each paper provides direct links to the official DOI page, ADS record, and arXiv preprint, making it easy to dive deeper. We also include a citation tag to reflect the level of discussion within the community.

  • Rich metadata: Beyond basic bibliographic information, the site also shows the paper’s abstract to help you quickly judge relevance.

Afterword

This is my small foray amid the wave of artificial intelligence: exploring how to leverage pretrained large language models to make my research more efficient while keeping accuracy front and center. Even so, I still hope that over the next two years I can learn to train neural networks that are truly fit for scientific use.

Research on “Little Red Dots” has been booming. Explanatory models keep emerging, and new papers roll in every month like a tidal wave, making them hard to keep up with. In late 2024 I came across the idea of retrieval-augmented generation (RAG). This workflow helps keep LLMs from hallucinating, and by feeding in reference literature it effectively mitigates their knowledge-staleness problem. Given that LRD papers didn’t really surge until 2023, the literature is new in content but still limited in quantity, which makes LRDs a perfect candidate for building a curated database with RAG. So in April this year I carved out some time and started this project.

Since this is a personal project with budget constraints, my initial plan was to build a local RAG service: parse PDF files locally, use a small model to re‑rank content based on user needs, and then hand off to a cloud LLM for final answers. To do this, I first tried setting up RagFlow locally with Docker, together with a free trial of one million tokens from Alibaba Cloud. But things didn’t go as smoothly as hoped. RagFlow was still very early (frankly, it still is). While it’s powerful at PDF analysis and chunking, my personal machine could barely handle building the knowledge graph, which severely weakened retrieval. My impression was that RagFlow encourages users to compose a bespoke knowledge‑processing pipeline using its built‑in Agent features—multiple AI models working on different parts for best results. But building Agents means repeatedly invoking different AIs within one Q&A flow, and without a mature Agent‑Flow marketplace, starting from scratch was beyond my ability at the time. As a result, the project stalled after the literature‑collection automation, and I published that repo first.

After I joined the Shanghai Astronomical Observatory in July, I got a Mac Studio with 48 GB of unified memory as my workstation. Thanks to Apple’s unified memory architecture, Macs are surprisingly good at local LLM inference, which rekindled my enthusiasm. I thought I could run a small local model in place of a provider’s API during development to keep costs down, then switch back to a stronger cloud model for production. Reality had other ideas. While 48 GB can run inference and simple interactions with many ~30B-parameter models, it is far from enough to compute at their maximum context length. For a RAG pipeline that relies heavily on long context windows, this was a showstopper, and a vivid reminder of how expensive RAG really is.

Luckily, there was a way out. Thank you, big‑tech Google, for offering NotebookLM for free. It’s almost everything I wanted: massive document chunking, huge context windows, perfect source attribution, and a reasonable degree of model/product customization. So we arrived at this project’s “final form”: use fully automated scripts to query ADS and consolidate bibliographic data into a BibTeX file; use Python scripts to download PDFs from arXiv; then manually upload them to NotebookLM to serve users.
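
As a minimal sketch of the download step, assume the arXiv IDs have already been parsed from each BibTeX entry (for instance from its eprint field); the function name and the pacing are my own choices, not the repo's exact script.

```python
import time
from pathlib import Path

import requests

def download_pdfs(arxiv_ids, out_dir="pdfs"):
    """Fetch each paper's PDF from arXiv, skipping ones already on disk."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for arxiv_id in arxiv_ids:
        # Old-style IDs such as astro-ph/0601001 contain a slash.
        target = out / f"{arxiv_id.replace('/', '_')}.pdf"
        if target.exists():
            continue  # don't re-download on later runs
        resp = requests.get(f"https://arxiv.org/pdf/{arxiv_id}", timeout=60)
        resp.raise_for_status()
        target.write_bytes(resp.content)
        time.sleep(3)  # be polite to arXiv's servers
```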

Rather than a service just for studying LRDs, I hope this can serve as a demonstration. If you want to curate another topic, simply clone my repo, change the ADS query keywords, download the corresponding PDFs, and upload them to NotebookLM. If you’d like to share your curated database with the community, I’d be happy to add a link to your project in my repo’s README. This project would not exist without the astronomy community’s long‑standing culture of openness and sharing—something I’ve always been proud of as an astronomer. I hope this small tool injects a bit of new energy into that open‑source spirit.

