From Entries to Insights: Building an AI-Powered Journal Assistant with RAG
I’m an avid journaler. For nearly 10 years, I’ve written at least a page in my journal every day. Even on late nights, or while camping, I brought my journal with me, and I made sure to jot down some quick thoughts. Each page is a reflection of the day’s events or the emotions and thoughts I experienced. After digitizing every entry, I’ve compiled a fairly detailed archive of my life over the last decade, capturing both my triumphs and also my challenges and fears.
Analyzing My Journal
As a Data Scientist, I’ve found analyzing my journal to be an incredibly rewarding project. I’ve used sentiment analysis to explore the emotional tones of my entries. Notably, the year after graduating from university and moving to a new city for work featured more negative entries than any other period (it was a hard year!). This transitional phase is challenging for many people, so it’s neat to see my journal quantitatively reflect that experience.
Besides that, I’ve tracked word frequencies to spot recurring themes and trends — I’ve been playing less basketball as the years go by unfortunately. I’ve experimented with clustering, NER (Named Entity Recognition), topic modeling, word embeddings, and other NLP (Natural Language Processing) techniques. I even tried linking each entry to biometric data from my Apple Watch, thinking maybe my sleep or exercise habits influenced my journal’s mood. I found no correlation in that model — there are a lot more variables I’d need to collect. I’m constantly thinking of other ways I could extract more insights from my journal.
To streamline manually reviewing past journal entries, I wrote an automated Python script I call my ‘Journal Digest,’ which emails me past journal entries from the same date across all previous years (like the ‘3 Years Ago’ memories feature in Google Photos). This has provided a fairly succinct snapshot of how my attitudes and fears have evolved over the years. While this tool has yielded valuable insights, sifting through these entries daily is still time-consuming. An AI assistant capable of processing multiple entries simultaneously to identify patterns would greatly improve the efficiency and depth of my analysis.
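As a rough illustration of the "same date in previous years" selection (not the actual Journal Digest script), here is a minimal sketch. It assumes the journal is a dictionary keyed by `datetime.date`; the function name and sample entries are hypothetical, and the emailing step is left out.

```python
from datetime import date

def entries_on_this_day(entries, today):
    """Return (date, text) pairs from earlier years that share today's month and day."""
    matches = [
        (entry_date, text)
        for entry_date, text in entries.items()
        if (entry_date.month, entry_date.day) == (today.month, today.day)
        and entry_date.year < today.year
    ]
    # Newest year first, so the most recent memory leads the digest
    return sorted(matches, reverse=True)

# Hypothetical sample data standing in for a digitized journal
journal = {
    date(2021, 3, 11): "Hiked with friends.",
    date(2022, 3, 11): "Long day at work.",
    date(2022, 3, 12): "Played basketball.",
}
digest = entries_on_this_day(journal, date(2024, 3, 11))
```

A scheduled job could run something like this once a day and pass `digest` to whatever sends the email.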
Enter RAG
With the development of Retrieval Augmented Generation (RAG) tools, which use documented knowledge to inform and customize responses from Large Language Models (LLMs), I realized my journal was an ideal candidate for creating this type of AI-Powered Journal Assistant. RAG is essentially just using an LLM like ChatGPT, but before you ask it a question, you augment the prompt to have all the relevant information that it needs, and then the LLM just has to read the context to answer the question.
An AI that is capable of reading through many entries added to a prompt could quickly synthesize information conveniently for me. Queries like
“What were some of my proudest accomplishments from the summer of 2022?”
could easily be answered without requiring me to read months of entries. By filtering the available documents (the retrieval step of RAG) and passing them to an LLM, iteratively in rounds if many documents survive the filter, I could achieve this effect.
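The prompt augmentation step described above can be sketched in a few lines. This is only an illustration under assumed names: `build_augmented_prompt`, the prompt wording, and the sample entries are hypothetical, and the retrieval step is assumed to have already produced the list of entries.

```python
def build_augmented_prompt(question, entries):
    """Place retrieved journal entries ahead of the question in a single prompt."""
    context = "\n\n".join(f"[{day}] {text}" for day, text in entries)
    return (
        "Answer the question using only the journal entries below.\n\n"
        f"Journal entries:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Hypothetical retrieved entries for a summer 2022 question
retrieved = [
    ("2022-06-18", "Summer hike to the ridge. Felt proud of finishing it."),
    ("2022-07-02", "Shipped the dashboard at work."),
]
prompt = build_augmented_prompt(
    "What were some of my proudest accomplishments from the summer of 2022?",
    retrieved,
)
```

The resulting string is what would be sent as the user message, so the LLM only has to read the supplied context to answer.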
OpenAI’s Custom GPTs
In November 2023, OpenAI released their commercial RAG feature called Custom GPTs. I eagerly built a private Custom GPT on OpenAI’s platform to read through my journal (maybe I’m too trusting of personal data on OpenAI, but that’s for another discussion). It was fascinating to see the LLM respond to my queries with information gathered from my journal. It can’t read all my journal entries at once, so it filters them down to a select number and sends those through in the prompt. It’s able to synthesize information and spot some trends and patterns.
But the performance can be haphazard. Sometimes it completely hallucinates a response; I usually need to directly instruct it to reference my uploaded journal. In analyzing its retrieval process, I see that it uses simple keyword search or date filtering to sift through the journal for relevant entries. The app works well with simple queries that have a short, definitive time frame. For instance, when I ask the following prompt I get the coherent and useful response shown below (with personal events and persons removed):
Journal GPT Prompt:
What were some of my biggest fears in March 2020? Please include specific quotes from meaningful and significant entries. You need to reference the journal file. Read all the entries from March 2020 and then answer.
Journal GPT Response:
Reflecting deeply on your journal entries from March 2020, here are specific quotes and the context of your biggest fears and concerns during that pivotal month:
1. Realization of the Global Crisis:
- On March 11th, you noted: “Wow. Okay, the coronavirus is for real. Tom Hanks and Rita Wilson got it…” This entry captures the moment the pandemic’s reality hit home for you, marked by the news of celebrities contracting the virus, symbolizing the widespread and indiscriminate nature of the threat.
2. Transition to Remote Work:
- By March 16th, the shift to working from home is mentioned: “Working from home! I got back into doing analytics…” This transition signifies the beginning of adapting to a new normal, underscoring concerns about productivity and the blurring lines between personal and professional life in isolation.
3. Concerns About Social Isolation:
- Your entry on March 15th speaks to isolation: “Today I was an introvert. Did some introverted things…” This reflects the onset of social distancing measures and their immediate impact on your social interactions and mental well-being.
4. Adapting to Changes and Uncertainty:
- On March 23rd, you described the overarching situation: “Another day under quarantine. COVID-19. It’s a bit surreal how the world has changed so fast.” This quote encapsulates the rapid changes in daily life and the uncertainty about the future, highlighting concerns about how long the pandemic would last and its long-term impacts.
5. Seeking Normalcy Amidst Crisis:
- By March 31st, reflecting on the continuous adaptation, you said: “Another day under Corona. Today I made a little bit of progress on my personal projects…” This illustrates an attempt to find a sense of normalcy and productivity amidst ongoing crisis and uncertainty, indicating your resilience in facing prolonged challenges.
These quotes and contexts from your journal entries show a journey through initial shock, adjustment to a new way of living and working, dealing with isolation, and striving for personal growth despite the circumstances. The entries reflect a period of significant change and adaptation, with a focus on maintaining mental health and seeking stability in the face of a global crisis.
Though a little verbose and generalized, this response and others have been interesting to read! However, after building custom RAG applications at work, I wanted to take a stab at building my own retrieval mechanism for a personally built AI Journal Assistant. If anything, it would help me explore the possibilities for retrieval and prompt augmentation.
Building My Own RAG Application
I decided to build my own RAG application to have greater control and transparency into the retrieval aspect of RAG. Filtering down a decade of potential journal entries is a unique challenge. As a Machine Learning Engineer at DHI Group (Dice.com) I’ve built custom RAG tools that use vector databases to store and enrich augmented prompts. Vector databases are optimized to store and retrieve documents very efficiently. Most RAG applications utilize vector databases for document retrieval. So I originally experimented with storing my journal in a locally hosted Qdrant database. Each entry was vectorized and stored for quick retrieval based on vector similarity. However, common queries for my journal assistant would be short questions like what I mentioned above:
“What were my biggest challenges in March 2020?”
These questions, once vectorized, don’t compare well to my entries. The queries yielded poor results when extracting the most ‘similar’ journal entries, which contain much more complicated ideas than a single question. I thought about storing each sentence in my journal as its own vector, so that the vector comparison might improve, but that sounded too complicated for right now, so I decided to store my journal as text-based entries (similar to Custom GPT).
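To make "retrieval based on vector similarity" concrete, here is a minimal sketch of ranking entries by cosine similarity, a common similarity score for this kind of lookup. The three-dimensional vectors are toy values chosen for illustration, not real embeddings, and the function names are hypothetical; a vector database would perform this ranking internally and far more efficiently.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_entries(query_vec, entry_vecs, top_k=2):
    """Return the ids of the top_k entries most similar to the query vector."""
    scored = [
        (cosine_similarity(query_vec, vec), entry_id)
        for entry_id, vec in entry_vecs.items()
    ]
    return [entry_id for _, entry_id in sorted(scored, reverse=True)[:top_k]]

# Toy vectors (assumption): real embeddings have hundreds of dimensions
query_vec = [1.0, 0.0, 0.0]
entry_vecs = {
    "entry_a": [0.9, 0.1, 0.0],
    "entry_b": [0.0, 1.0, 0.0],
    "entry_c": [0.5, 0.5, 0.0],
}
closest = rank_entries(query_vec, entry_vecs)
```

The score only measures how closely two vectors point in the same direction, which is one way to see why a short question and a long, multi-topic entry can land far apart even when the entry contains the answer.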
Two Types of Queries
I noticed that many of my queries fall into two categories: date-based queries and non-date-based queries. Questions like
“What were my biggest accomplishments of June 2022?”
present a clean date time frame to extract entries (like OpenAI’s Custom GPT did for March 2020). But questions like,
“When did I travel to Jordan with my friends Matt and AJJ?”
need to filter entries based on keywords because the query doesn’t contain a date. The commercial Custom GPT above performs really poorly at these keyword-based queries. This was the area I wanted to focus on the most with my personalized RAG app.
First Step — Date or Keyword Search?
To separate these queries into the two categories, the first step in my personal RAG app workflow sends the query through a function to identify if the question is date based, or non-date based:
import re

def contains_date(text):
    """Return True if the text mentions a four-digit year or a full month name."""
    # Regular expression to match a four-digit year (e.g., 2020)
    year_pattern = r'\b(19|20)\d{2}\b'
    # Regular expression to match month names (full names)
    month_pattern = (
        r'\b(January|February|March|April|May|June|'
        r'July|August|September|October|November|December)\b'
    )
    # Check if the text contains a year or a month name
    return bool(re.search(year_pattern, text) or re.search(month_pattern, text))
Date Queries
If the query contains a date, then I simply filter my journal to the specified time frame. I use a call to an LLM to give me the time frame as a parsable JSON, since some of my queries might be more complicated to understand:
def extract_date_range(openai_client, query):
    """Ask the LLM for the date range in the query, returned as a JSON string."""
    extract_prompt = """
    From the following query, identify and extract
    the date range mentioned in the query and
    format it as a json. Ensure to cover various
    time frames such as specific days, months,
    years, or combinations thereof. Output the
    extracted date range in a json structured
    format that includes the start and end dates. Ex.
    {
    "start_date": {"year": 2022, "month": "July", "day": 1},
    "end_date": {"year": 2022, "month": "July", "day": 31}
    }
    """
    # Keep the query and the answer cue on separate lines so they don't run together
    meta_query = f"{extract_prompt}\nQuery: {query}\nDate Range as json:"
    final_answer = openai_client.chat.completions.create(
        messages=[{"role": "user", "content": meta_query}],
        model="gpt-3.5-turbo",
    )
    return final_answer.choices[0].message.content
This is the first in a chain of LLM calls, similar to a LangChain or Agent approach. Once the time frame is extracted, I use it to filter down my entries, and then I group the filtered entries into batches to send to OpenAI.
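Turning the LLM’s JSON answer into an actual filter might look like the sketch below. It assumes the model returns exactly the structure shown in the prompt’s example and that the journal is a dictionary keyed by `datetime.date`; `parse_date_range`, `filter_by_date`, and the sample data are hypothetical names and values.

```python
import json
from datetime import date

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

def _to_date(part):
    # Convert {"year": 2022, "month": "July", "day": 1} into a date object
    month_number = MONTHS.index(part["month"]) + 1
    return date(part["year"], month_number, part["day"])

def parse_date_range(llm_output):
    """Parse the LLM's JSON string into (start, end) dates."""
    data = json.loads(llm_output)
    return _to_date(data["start_date"]), _to_date(data["end_date"])

def filter_by_date(entries, start, end):
    """Keep entries dated within [start, end], in chronological order."""
    return {day: text for day, text in sorted(entries.items()) if start <= day <= end}

# Sample output in the format requested by the prompt
sample_output = """
{
  "start_date": {"year": 2022, "month": "July", "day": 1},
  "end_date": {"year": 2022, "month": "July", "day": 31}
}
"""
journal = {
    date(2022, 6, 30): "Packed for the trip.",
    date(2022, 7, 4): "Fireworks with friends.",
    date(2022, 8, 1): "Back to work.",
}
start, end = parse_date_range(sample_output)
july_entries = filter_by_date(journal, start, end)
```

In practice the model can return malformed JSON or an unexpected month spelling, both of which raise `ValueError` here, so a real app would want to catch that and retry or fall back to keyword search.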
OpenAI’s GPT-3.5-turbo model has a token limit of 4096 tokens, so I can only send about 12–15 journal entries at a time. If my filtered list contains 100 entries, then I need to chunk them into several batches, send them to OpenAI one batch at a time, and get a summary for each batch. Finally, I make one more LLM call that summarizes the batch summaries to answer the original query.
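The batch-then-summarize loop described above can be sketched as follows. This is an illustration under assumptions, not the app’s actual code: the four-characters-per-token estimate is a rough heuristic, the 3000-token budget is an arbitrary default that leaves room for the prompt and the reply, and `summarize` is a hypothetical callable standing in for the LLM call.

```python
def estimate_tokens(text):
    # Rough heuristic (assumption): about 4 characters per token in English
    return len(text) // 4 + 1

def batch_entries(entries, max_tokens=3000):
    """Group entries into batches that fit an estimated token budget.

    An entry larger than the budget still gets a batch of its own.
    """
    batches, current, used = [], [], 0
    for text in entries:
        cost = estimate_tokens(text)
        if current and used + cost > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(text)
        used += cost
    if current:
        batches.append(current)
    return batches

def answer_in_rounds(question, entries, summarize, max_tokens=3000):
    """Summarize each batch, then summarize the batch summaries."""
    batch_summaries = [
        summarize(question, "\n\n".join(batch))
        for batch in batch_entries(entries, max_tokens)
    ]
    final = summarize(question, "\n\n".join(batch_summaries))
    return final, batch_summaries

# Stand-in for the LLM call so the sketch runs without network access
calls = []
def fake_summarize(question, text):
    calls.append(text)
    return f"summary {len(calls)}"

entries = ["x" * 400] * 5
final, partials = answer_in_rounds("What happened?", entries, fake_summarize, max_tokens=250)
```

Returning the per-batch summaries alongside the final answer makes it easy to display both, and the total number of LLM calls is always the number of batches plus one.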
For example, when I ask:
What were my biggest accomplishments in May 2022?
There are two batches of entries that each get a summary (again, with some sections removed for privacy and brevity):
May 1st — 17th, 2022: Your biggest accomplishments in May 2022 were graduating from your Master’s program and purchasing a Subaru Crosstrek. You successfully presented your final paper on the artwork GAN and completed the final exam for your deep learning presentation. Furthermore, you played a significant role in the data deployment at work.
May 18th — 31st, 2022: Some of your biggest accomplishments in May 2022 include graduating with a Master’s degree, being valued and recognized as a value member on your team at work, and celebrating your birthday with Liz and friends. Work-related accomplishments include conducting interviews for your team, addressing bugs and finishing projects successfully.
Followed by the final summarization:
Your biggest accomplishments in May 2022 were graduating from your Master’s program and purchasing a Subaru Crosstrek, as well as successfully presenting your final paper on the artwork GAN and completing the final exam for your deep learning presentation. Additionally, you played a significant role in the data deployment at work.
This example only has two batches to summarize at the end because I wanted to show a short example. I’ve run queries spanning 6–8 months that produce 13–15 batches. The performance can be inconsistent with large date intervals, but it performs better than the Custom GPT from OpenAI, which doesn’t have this chained, multi-call agent behavior implemented yet.
Also, in the interface for this AI Journal Assistant app, I display the batch summaries at the bottom of the page. If I have time or interest, I can read each summary. Or I can simply read the final summary. This personal approach gives me more transparency into the date-based queries.
Keyword Filtering
For questions without a time frame (the harder challenge), I need to rely on keyword based search. Before performing pure keyword search like the Custom GPT from OpenAI, I perform a call to an LLM to expand my keyword search to more potentially relevant terms.
def extract_keywords(openai_client, query):
    """Ask the LLM for search keywords, returned as a string holding a Python list."""
    extract_prompt = """
    From the following query, identify the keywords
    that would be useful to filter the journal
    entries to, and only return those keywords as
    a python list of strings. Identify useful
    word derivatives and abbreviations.
    Question: When did I read the book Invisible Women?
    Keyword(s): ["Invisible Women"]
    Ex. When was my graduation from UVA?
    Keyword(s): ["graduation", "graduate", "grad", "UVA", "University of Virginia"]
    """
    # Keep the question and the answer cue on separate lines so they don't run together
    meta_query = f"{extract_prompt}\nQuestion: {query}\nAnswer:"
    final_answer = openai_client.chat.completions.create(
        messages=[{"role": "user", "content": meta_query}],
        model="gpt-3.5-turbo",
    )
    return final_answer.choices[0].message.content
Once I’ve collected word derivatives and abbreviations, I filter my journal to entries that contain at least two of the keywords. This threshold is adjustable: I could set it to 3 or 4. If requiring 2 keywords yields too few entries, the app falls back to requiring just 1. These queries are normally looking for one specific journal entry, so we don’t need to cast a wide net over all entries; we just need the ones that could contain our answer.
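A minimal sketch of the threshold-with-fallback filter follows. Several details are assumptions rather than the app’s confirmed behavior: the LLM reply is parsed with `ast.literal_eval`, matching is case-insensitive substring matching, and "too few entries" is modeled by a hypothetical `min_results` parameter.

```python
import ast

def parse_keywords(llm_output):
    """Safely parse the LLM's reply, expected to be a Python list of strings."""
    keywords = ast.literal_eval(llm_output.strip())
    if not isinstance(keywords, list):
        raise ValueError("expected a list of keywords")
    return [str(keyword) for keyword in keywords]

def filter_by_keywords(entries, keywords, min_matches=2, min_results=1):
    """Keep entries containing at least min_matches keywords.

    Falls back to a single required keyword when too few entries qualify.
    Returns the selected entries and the threshold that was actually used.
    """
    lowered = [keyword.lower() for keyword in keywords]

    def hits(text):
        text = text.lower()
        return sum(1 for keyword in lowered if keyword in text)

    selected = [entry for entry in entries if hits(entry) >= min_matches]
    if len(selected) < min_results and min_matches > 1:
        return [entry for entry in entries if hits(entry) >= 1], 1
    return selected, min_matches

# Hypothetical entries and LLM reply
entries = [
    "Flew to Jordan with Matt today.",
    "Matt came over for dinner.",
    "Quiet day reading.",
]
keywords = parse_keywords('["Jordan", "Matt", "AJJ"]')
selected, threshold = filter_by_keywords(entries, keywords)
fallback, fallback_threshold = filter_by_keywords(entries, ["AJJ", "dinner"])
```

One caveat of substring matching is that derivatives such as "travel" and "traveled" both match the single word "traveled", so one word can count as two hits; matching on whole words or deduplicating derivatives would tighten the filter.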
To use the original non-date based query as an example:
When did I travel to Jordan with my friends Matt and AJJ?
The keyword expansion LLM call retrieves these keywords:
["travel", "traveled", "Jordan", "friends", "Matt", "AJJ"]
These keywords then filter my journal of over 3,500 entries down to 134 entries. Those 134 entries form 11 batches, from which the Journal Assistant correctly finds the date that I traveled with my friends: March 10th, 2019. Below is a screenshot from the Streamlit app of my Journal Assistant, which also shows the processing time and how many total LLM calls it took in the chain to generate the final response (13 in this case: 1 call to expand the keywords, 11 calls to summarize each batch, and 1 to summarize the batches):
Though my personalized Journal Assistant app performs better than the Custom GPT at these keyword-based searches, it still isn’t perfect. Depending on the query and the keywords obtained, the resulting answer doesn’t always make sense. But here I have control over how the keywords are generated, and I could even tailor them to specific queries if I wanted.
As a future improvement, I could create a custom dictionary of alternative names to expand my keyword search. For example, if I have a friend that I sometimes call ‘James’, sometimes ‘Jimmy’, and sometimes by the nickname ‘Jay’ in my journal, then I could create an alternative names dictionary that expands my keywords to all of those names whenever I ask a query about him.
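One way such an alternative names lookup could work is sketched below. Nothing here exists in the app yet: the alias groups, the function name, and the case-insensitive matching are all hypothetical choices for illustration.

```python
# Hypothetical alias groups: every name in a group refers to the same person
ALIAS_GROUPS = [
    ["James", "Jimmy", "Jay"],
]

def expand_with_aliases(keywords, alias_groups):
    """Replace each keyword found in an alias group with the whole group."""
    expanded = []
    for keyword in keywords:
        group = next(
            (g for g in alias_groups if keyword.lower() in (name.lower() for name in g)),
            [keyword],  # Keywords without aliases pass through unchanged
        )
        for name in group:
            if name not in expanded:  # Avoid duplicates, preserve order
                expanded.append(name)
    return expanded

expanded = expand_with_aliases(["Jimmy", "basketball"], ALIAS_GROUPS)
```

The expanded list could then be fed into the same keyword filter, so a question that mentions any one of the names would also match entries that use the others.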
Here is the process flow for the full workflow of my AI Journal Assistant App:
Innovation Ahead
It’s been really enriching to experiment with the retrieval aspect of RAG in this project. I might revisit it to further improve the performance and usability. Commercial RAG tools like Custom GPT and Amazon Q will also improve over time and become more accurate, with larger context windows and agent behaviors, which will be exciting to follow.
I’m also passionate about one day integrating my journal data with a therapist-minded AI (with privacy ensured) that could offer insights and ask probing questions to help me explore my past thoughts and feelings. It could even be a program that synthesizes my journal info for a human therapist to reference. Still, manually reviewing my journal and processing the information myself offers a lot of benefit.
Experiment Yourself
If you’d like to use this RAG Journal Assistant with your own journal, feel free to check out this GitHub repository, which has directions on how to kick off the Streamlit app with your own data. You could pull your journal data from apps like One Day or elsewhere and plug it into this app. Or feel free to experiment with the dummy data that’s in the repo. No need to upgrade to ChatGPT Plus to use the Custom GPT feature!
Thanks for reading and happy journaling!