Ultimate goal: build a knowledge base that belongs solely to me. This post covers only the first step, as a demo; many more tasks lie ahead.
(First, let's take a look at the final effect)
Step One - Export Telegram Channel Information
Before you start, you need to ensure that Python 3 is installed. You will also need the following:
- The Telethon and PySocks libraries, which you can install with:
pip install telethon PySocks
- Membership in the channel whose messages you want to retrieve.
- A valid Telegram account, so that you can obtain an API ID and API hash for the Telegram application at https://my.telegram.org. (Keep these credentials secure and do not expose them in public repositories or shared settings.)
- A proxy server (optional, if you are behind a firewall).
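One way to keep the API ID and API hash out of a repository is to read them from environment variables instead of hard-coding them. This is a minimal sketch of that idea. The variable names TELEGRAM_API_ID and TELEGRAM_API_HASH and the helper load_telegram_credentials are illustrative assumptions, not something Telegram or Telethon requires.

```python
import os


def load_telegram_credentials(env=None):
    """Return (api_id, api_hash) read from environment variables.

    TELEGRAM_API_ID and TELEGRAM_API_HASH are hypothetical names chosen
    for this sketch; use whatever names fit your setup.
    """
    env = os.environ if env is None else env
    try:
        api_id = int(env["TELEGRAM_API_ID"])  # the API ID is numeric
        api_hash = env["TELEGRAM_API_HASH"]
    except KeyError as exc:
        raise RuntimeError(f"Missing environment variable: {exc.args[0]}") from exc
    return api_id, api_hash
```

The returned pair can then be passed to TelegramClient in place of literal values, so the script itself contains no secrets.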
Code Implementation - Export Channel Information
Save the following Python script as telegram_to_csv.py:
import csv
import socks
from telethon import TelegramClient
from telethon.tl.functions.messages import GetHistoryRequest

# Placeholders: replace with your own values from https://my.telegram.org
SESSION_NAME = 'YOUR_SESSION_NAME'  # Any name for the local session file
API_ID = 0                          # Your numeric API ID
API_HASH = 'YOUR_API_HASH'          # Your API hash

# Set up TelegramClient and connect to the Telegram API
client = TelegramClient(
    SESSION_NAME,
    API_ID,
    API_HASH,
    proxy=(socks.SOCKS5, '', 1080)  # Optional: fill in your proxy host, or remove this argument
)


async def export_to_csv(filename, fieldnames, data):
    """
    Export data to a CSV file.
    filename -- Name of the export file
    fieldnames -- List of CSV header field names
    data -- List of dictionaries to export
    """
    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(data)


async def fetch_messages(channel_username):
    """
    Fetch all messages from the specified channel.
    channel_username -- Username of the target channel
    """
    channel_entity = await client.get_input_entity(channel_username)
    offset_id = 0  # Initial message ID offset
    all_messages = []  # List to store all messages
    while True:
        # Request message history
        history = await client(GetHistoryRequest(
            peer=channel_entity,
            offset_id=offset_id,
            offset_date=None,
            add_offset=0,
            limit=100,  # Number of messages to request at a time
            max_id=0,
            min_id=0,
            hash=0
        ))
        if not history.messages:  # End loop when there are no more messages
            break
        for message in history.messages:
            if getattr(message, 'message', None):  # Only process messages with text content
                # Serialize message to dictionary form
                message_dict = {
                    'id': message.id,
                    'date': message.date.strftime('%Y-%m-%d %H:%M:%S'),
                    'text': message.message
                }
                all_messages.append(message_dict)
        offset_id = history.messages[-1].id  # Next request continues below the oldest ID seen
        print(f"Fetched messages: {len(all_messages)}")
    return all_messages


async def main():
    """
    Main program: Fetch messages from the specified channel and save to a CSV file.
    """
    await client.start()  # Start the Telegram client
    print("Client Created")
    channel_username = 'niracler_channel'  # Username of the Telegram channel you want to scrape
    all_messages = await fetch_messages(channel_username)  # Fetch messages
    # Define CSV file headers and export
    headers = ['id', 'date', 'text']
    await export_to_csv('channel_messages.csv', headers, all_messages)


# When this script is run as the main program
if __name__ == '__main__':
    with client:
        client.loop.run_until_complete(main())
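The loop in fetch_messages pages backwards through the channel history: each request returns up to limit messages older than offset_id, and the ID of the last message in a batch becomes the next offset. The sketch below isolates that logic. fetch_page is a hypothetical in-memory stand-in for GetHistoryRequest, so it runs without a Telegram account; it assumes, as the script does, that messages arrive newest first.

```python
def fetch_page(all_ids, offset_id, limit):
    """Stand-in for GetHistoryRequest (hypothetical helper for this sketch).

    Returns up to `limit` message IDs, newest first, that are older than
    `offset_id`. An offset_id of 0 means "start from the newest message".
    """
    newest_first = sorted(all_ids, reverse=True)
    if offset_id:
        newest_first = [i for i in newest_first if i < offset_id]
    return newest_first[:limit]


def fetch_all(all_ids, limit=100):
    """Collect every ID by repeatedly moving the offset to the oldest ID seen."""
    offset_id = 0
    collected = []
    while True:
        page = fetch_page(all_ids, offset_id, limit)
        if not page:          # an empty page means the history is exhausted
            break
        collected.extend(page)
        offset_id = page[-1]  # oldest ID in this page; the next page starts below it
    return collected
```

With 250 messages and a limit of 100, this takes three non-empty requests plus one empty request that ends the loop.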
Run the Script telegram_to_csv.py
Run the script in the terminal:
python telegram_to_csv.py
The script saves every text message from the specified Telegram channel to a file named channel_messages.csv in the current directory, with one row per message containing its ID, date, and content.
(The results won't be posted here~~)
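Before moving on to the embedding step, it can help to check that the exported file reads back cleanly. The sketch below writes a tiny sample in the same three-column layout (id, date, text) and reads it back with csv.DictReader. The sample rows and the file name sample_messages.csv are made up for illustration.

```python
import csv
import os
import tempfile

FIELDNAMES = ["id", "date", "text"]


def write_rows(path, rows):
    """Write rows (dicts keyed by FIELDNAMES) with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)


def read_rows(path):
    """Read the file back; every value comes back as a string."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


# Made-up sample data, including a multi-line message to show that
# the csv module quotes embedded newlines and restores them on read.
sample = [
    {"id": 2, "date": "2024-01-02 08:00:00", "text": "second\nmessage"},
    {"id": 1, "date": "2024-01-01 08:00:00", "text": "first message"},
]
path = os.path.join(tempfile.mkdtemp(), "sample_messages.csv")
write_rows(path, sample)
rows = read_rows(path)
print(len(rows), rows[0]["text"].splitlines())
```

Note that the id column comes back as a string, so convert it explicitly if you need to compare or sort by message ID.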
Step Two - Use OpenAI's text-embedding-ada-002 Model for Text Embedding
- The openai and pandas libraries, which you can install with:
pip install openai pandas
- A valid OpenAI API key.
Code Implementation - Embedding
Save the following Python script as embedding_generator.py:
import pandas as pd
from openai import OpenAI

# Configure OpenAI client
client = OpenAI(api_key='YOUR_API_KEY')


def get_embedding(text, model="text-embedding-ada-002"):
    """Get the embedding vector for the text."""
    text = text.replace("\n", " ")  # Clean newline characters from text
    response = client.embeddings.create(input=[text], model=model)  # Request embedding vector
    return response.data[0].embedding  # Extract and return the embedding vector


def embedding_gen():
    """Generate embedding vector data for the exported channel messages."""
    df = pd.read_csv('channel_messages.csv')  # Read CSV file into DataFrame
    df['text_with_date'] = df['date'] + " " + df['text']  # Concatenate date and text
    # Embed only the first 100 rows, one API request per row.
    # Remove the [:100] slice to embed every message.
    df['ada_embedding'] = df[:100].text_with_date.apply(get_embedding)
    del df['text_with_date']  # Delete 'text_with_date' column
    # Rows beyond the slice have no embedding (NaN); drop them so the search script can parse every row
    df = df.dropna(subset=['ada_embedding'])
    df.to_csv('embedded_1k_reviews.csv', index=False)  # Save results to a new CSV file
    # Print the first few rows of the DataFrame for confirmation
    print(df.head())


# When the script is run directly
if __name__ == "__main__":
    embedding_gen()
Run the Script
python embedding_generator.py
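The script sends one request per message. The embeddings endpoint also accepts a list of inputs, so messages can be grouped to cut down on round trips. The sketch below shows only the grouping logic. embed_batch is a hypothetical callable you would supply, for example a thin wrapper around client.embeddings.create that returns one vector per input in the same order. The default batch size of 50 is an arbitrary choice for illustration, not a documented limit.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(texts, embed_batch, batch_size=50):
    """Embed `texts` by calling `embed_batch` once per chunk.

    `embed_batch` is a hypothetical callable: list[str] -> list[vector],
    returning one vector per input in the same order.
    """
    vectors = []
    for batch in chunked(texts, batch_size):
        cleaned = [t.replace("\n", " ") for t in batch]  # same cleanup as get_embedding
        result = embed_batch(cleaned)
        if len(result) != len(cleaned):
            raise ValueError("embed_batch must return one vector per input")
        vectors.extend(result)
    return vectors


# Fake embedder for demonstration: maps each text to a one-element
# vector holding its length. It stands in for the real API call.
def fake_embed(batch):
    return [[float(len(t))] for t in batch]


print(embed_in_batches(["a", "bb\ncc", "ddd"], fake_embed, batch_size=2))
```

Passing the embedder in as an argument keeps the batching logic testable without an API key or network access.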
Step Three - Perform Search
- The pandas, numpy, and tabulate libraries, which you can install with:
pip install pandas numpy tabulate
- The tabulate library is used to print the DataFrame in table format.
- The openai library from Step Two is also needed, because the query text is embedded with the same model.
Code Implementation - Search
Save the following Python script as embedding_search.py:
import ast
import sys
import pandas as pd
import numpy as np
from tabulate import tabulate
from openai import OpenAI

# Configure OpenAI client
client = OpenAI(api_key='YOUR_API_KEY')


def get_embedding(text, model="text-embedding-ada-002"):
    """Get the embedding vector for the text."""
    text = text.replace("\n", " ")  # Clean newline characters from text
    response = client.embeddings.create(input=[text], model=model)  # Request embedding vector
    return response.data[0].embedding  # Extract and return the embedding vector


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def embedding_search(query, df, model="text-embedding-ada-002"):
    """Rank the rows of df by similarity between their stored embeddings and the query."""
    query_embedding = get_embedding(query, model=model)  # Get embedding vector for the query text
    # The CSV stores each embedding as a string, so parse it back into a list before comparing
    df['similarity'] = df.ada_embedding.apply(lambda x: cosine_similarity(ast.literal_eval(x), query_embedding))
    df = df.sort_values(by='similarity', ascending=False)  # Sort by similarity in descending order
    df = df.drop(columns=['ada_embedding'])  # Remove embedding vector column
    return df


if __name__ == "__main__":
    df = pd.read_csv('embedded_1k_reviews.csv')  # Read CSV file into DataFrame
    query = " ".join(sys.argv[1:])  # Join all arguments so multi-word queries work without quotes
    df = embedding_search(query, df)  # Search for embedding vectors
    print(tabulate(df.head(10), headers='keys', tablefmt='psql'))  # Print the top 10 results
Run the Script - Results
$ python embedding_search.py Animal Crossing
| | date | text | similarities |
| 1041 | 2021-04-03 06:18:40 | Neil's Animal Crossing | 0.843896 |
| 836 | 2021-10-16 02:37:16 | Animal Crossing Direct Chinese Video | 0.826405 |
| | | https://www.youtube.com/watch?v=rI_jWfNd2dc | |
| 1208 | 2019-11-10 00:05:56 | Raising animals seems very interesting | 0.822377 |
| 489 | 2023-06-16 09:33:15 | Watching the life of a kitten reminds me of Sisyphus in mythology | 0.802677 |
| 369 | 2023-08-16 02:15:54 | Do house cats get bored and lonely? | 0.797062 |
| 13 | 2023-12-14 13:17:59 | Attended 🤗 | 0.796492 |
| 1177 | 2020-02-12 10:27:45 | The reason why people eat wild animals repeatedly is related to the deeply rooted concepts in traditional Chinese medicine | 0.796363 |
| | | Health preservation, dietary therapy, supplementation, medicinal cuisine, shape complementing shape, nourishing qi and blood... | |
| | | Pseudoscience is reviving, and if not curbed now, similar things will happen in the future. | |
| | | Science is the only way. | |
| 801 | 2021-11-07 13:46:21 | I didn't expect that this year's game of the year would still be Animal Crossing and Fire Emblem. | 0.796246 |
| | | Animal Crossing is because I didn't play enough before, Fire Emblem is because of a major event that made me want to replay it. | |
| 837 | 2021-10-16 02:37:16 | No way, is my game of the year going to be Animal Crossing again? | 0.795871 |
| 423 | 2023-07-29 14:11:22 | A profile picture that can be called spiritual pollution~~ | 0.794144 |
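The ranking above comes from cosine similarity: the dot product of two vectors divided by the product of their lengths. The sketch below applies the same formula as cosine_similarity in the script to three made-up two-dimensional vectors, written without NumPy so the arithmetic can be checked by hand. Real text-embedding-ada-002 vectors have 1536 dimensions, but the ranking step is the same. The helper top_k and the toy vectors are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, documents, k=2):
    """Return the k (label, score) pairs most similar to `query`."""
    scored = [(label, cosine_similarity(vec, query)) for label, vec in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors, invented for this example.
documents = {
    "same direction": [2.0, 0.0],
    "diagonal": [1.0, 1.0],
    "orthogonal": [0.0, 3.0],
}
ranking = top_k([1.0, 0.0], documents, k=3)
print(ranking)
```

Only direction matters here: [2.0, 0.0] is twice as long as the query but still scores 1, while the orthogonal vector scores 0 regardless of its length.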
The Long Road Ahead - Many More Tasks to Do
- Vector Database: The bot should search a vector database; loading a CSV file for every query is too inefficient. I am considering Cloudflare's Vectorize, but it requires a paid Cloudflare plan and I don't yet know whether it will meet my needs, so I wanted to run a simple experiment first to understand the process.
- Continuous Database Updates: Index not only my channel but also my articles and other relevant data sources, even some channels I follow, and keep the database updated through a Telegram bot.
- Prompt Engineering: When asking ChatGPT a question, retrieve relevant content from the vector database and include it in the prompt.
- Basic Knowledge: I can't wait until I have enough basic knowledge to do these things; I should learn while doing. I have completed the few steps I understand, and I will fill in the corresponding background later.
- Improve Quality: Low-quality content should be left out, and image-only posts should be reduced, since this text embedding model cannot embed images.
- Make it a CLI: This functionality already lives in the Nayako CLI, but that code is not organized yet and has no exception handling, so I released it as a demo first. Posting such a long block of code here is not ideal either.
- Embedding paragraphs from my blog with E5-large-v2 - This article was the original motivation, though I mostly borrowed the idea: I use OpenAI's API for embedding rather than a local model.
- Telethon Documentation - A Python wrapper for the Telegram API for personal accounts. Personal accounts use the MTProto protocol, so a library like this is close to essential.
- OpenAI Embeddings Use Cases - I followed this example to do the embedding.
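For the Prompt Engineering item in the task list above, the idea is to take the top search hits and place them in the prompt ahead of the question. This is a minimal sketch, assuming each hit is a dict with date and text keys like the rows of the search result. The function name build_prompt, the template wording, the character budget, and the sample notes are all illustrative assumptions rather than part of the scripts above.

```python
def build_prompt(question, hits, max_chars=2000):
    """Assemble a prompt containing retrieved notes followed by the question.

    `hits` is a list of dicts with 'date' and 'text' keys, most relevant
    first. Notes are added until `max_chars` of context is used up.
    """
    context_lines = []
    used = 0
    for hit in hits:
        line = f"[{hit['date']}] {hit['text']}"
        if used + len(line) > max_chars:
            break  # stop before exceeding the context budget
        context_lines.append(line)
        used += len(line)
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the notes below.\n\n"
        f"Notes:\n{context}\n\n"
        f"Question: {question}"
    )


# Made-up notes standing in for real search results.
hits = [
    {"date": "2021-11-07 13:46:21", "text": "Example note about a game."},
    {"date": "2021-10-16 02:37:16", "text": "Another example note."},
]
prompt = build_prompt("Which game did I play most?", hits)
print(prompt)
```

Because the hits arrive sorted by similarity, cutting off at the budget drops the least relevant notes first.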
This is also an article with no technical content, just a record of some things I learned.