
Export Telegram channel information and use text-embedding-ada-002 to embed channel text - v1

Ultimate goal: create a knowledge base that belongs solely to me. This is only the first step toward that goal, and merely a demo; there are many more tasks ahead.

(First, let's take a look at the final result)
[screenshot of the final search output]

Step One - Export Telegram Channel Information#

Before you start, you need to ensure that Python 3 is installed. You will also need the following:

  • Telethon and PySocks libraries: You can install them using pip install telethon PySocks.
  • Make sure you are a member of the channel from which you want to retrieve messages.
  • A valid Telegram account to obtain the API ID and API hash for the Telegram application, which you can get at https://my.telegram.org. (Keep your API key secure and do not expose it in public repositories or settings.)
  • A proxy server (optional, if you are behind a firewall).
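As a small aside on the security note above: one way to keep the credentials out of the script is to read them from environment variables. A minimal sketch, where TG_API_ID and TG_API_HASH are names I made up for illustration:

```python
import os

# Read Telegram credentials from the environment instead of hardcoding them.
# TG_API_ID and TG_API_HASH are illustrative names, not a Telethon convention.
api_id = int(os.environ.get("TG_API_ID", "0"))  # Telethon expects an integer API ID
api_hash = os.environ.get("TG_API_HASH", "")
```

The TelegramClient constructor can then take api_id and api_hash directly instead of literal strings.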

Code Implementation - Export Channel Information#

  1. Save the following Python script as telegram_to_csv.py:
import csv
import socks
from telethon import TelegramClient
from telethon.tl.functions.messages import GetHistoryRequest

# Set up TelegramClient and connect to the Telegram API.
# Replace api_id (an integer) and api_hash with your own credentials
# from https://my.telegram.org; the proxy argument is optional.
client = TelegramClient(
    'demo',                    # session name; a demo.session file is created on first login
    api_id=1234567,            # placeholder: your numeric API ID
    api_hash='your_api_hash',  # placeholder: your API hash
    proxy=(socks.SOCKS5, '127.0.0.1', 1080)  # remove this line if you don't need a proxy
)

async def export_to_csv(filename, fieldnames, data):
    """
    Export data to a CSV file.

    Parameters:
    filename -- Name of the export file
    fieldnames -- List of CSV header field names
    data -- List of dictionaries to export
    """
    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(data)

async def fetch_messages(channel_username):
    """
    Fetch all messages from the specified channel.

    Parameters:
    channel_username -- Username of the target channel
    """
    channel_entity = await client.get_input_entity(channel_username)
    offset_id = 0  # Initial message ID offset
    all_messages = []  # List to store all messages

    while True:
        # Request message history
        history = await client(GetHistoryRequest(
            peer=channel_entity,
            offset_id=offset_id,
            offset_date=None,
            add_offset=0,
            limit=100,  # Number of messages to request at a time
            max_id=0,
            min_id=0,
            hash=0
        ))
        if not history.messages:  # End loop when there are no more messages
            break

        for message in history.messages:
            if message.message:  # Only process messages with text content
                # Serialize message to dictionary form
                message_dict = {
                    'id': message.id,
                    'date': message.date.strftime('%Y-%m-%d %H:%M:%S'),
                    'text': message.message
                }
                all_messages.append(message_dict)
        offset_id = history.messages[-1].id
        print(f"Fetched messages: {len(all_messages)}")
    return all_messages

async def main():
    """
    Main program: Fetch messages from the specified channel and save to a CSV file.
    """
    await client.start()  # Start the Telegram client
    print("Client Created")

    channel_username = 'niracler_channel'  # Username of the Telegram channel you want to scrape
    all_messages = await fetch_messages(channel_username)  # Fetch messages

    # Define CSV file headers and export
    headers = ['id', 'date', 'text']
    await export_to_csv('channel_messages.csv', headers, all_messages)

# When this script is run as the main program
if __name__ == '__main__':
    client.loop.run_until_complete(main())

Run the Script telegram_to_csv.py#

Run the script in the terminal:

python telegram_to_csv.py

The script will run and save all text messages from the specified Telegram channel to a file named channel_messages.csv in the current directory. On the first run, Telethon will prompt for your phone number and a login code, then cache the authorized session in demo.session.

After completing the above steps, you will find the text messages from the channel in the channel_messages.csv file, including the message ID, date, and content.
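It's worth a quick sanity check that the export has the shape the next step expects. A small sketch; validate_rows is my own helper, demonstrated here on inline sample data rather than the real file:

```python
import csv
import io
from datetime import datetime

def validate_rows(rows):
    """Check each row has exactly the id/date/text columns and a parseable date."""
    for row in rows:
        assert set(row) == {'id', 'date', 'text'}, f"unexpected columns: {set(row)}"
        datetime.strptime(row['date'], '%Y-%m-%d %H:%M:%S')  # raises ValueError if malformed
    return True

# On the real file: validate_rows(csv.DictReader(open('channel_messages.csv', encoding='utf-8')))
sample = "id,date,text\n1041,2021-04-03 06:18:40,hello world\n"
print(validate_rows(csv.DictReader(io.StringIO(sample))))  # -> True
```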

(The results won't be posted here~~)
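As an aside, Telethon also exposes iter_messages, which does the offset bookkeeping of the GetHistoryRequest loop internally. A sketch of that variant; serialize mirrors the dict built in the script above, and the client is passed in as a parameter to keep the snippet self-contained:

```python
def serialize(message):
    """Build the same dict the script writes to CSV (assumes .id, .date, .message attributes)."""
    return {
        'id': message.id,
        'date': message.date.strftime('%Y-%m-%d %H:%M:%S'),
        'text': message.message,
    }

async def fetch_messages_simple(client, channel_username):
    """Pagination-free variant: Telethon's iter_messages yields messages newest-first."""
    return [serialize(m) async for m in client.iter_messages(channel_username) if m.message]
```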

Step Two - Use OpenAI's text-embedding-ada-002 Model for Text Embedding#

  • openai and pandas libraries: You can install them using pip install openai pandas.
  • A valid OpenAI API key.

Code Implementation - Embedding#

Save the following Python script as embedding_generator.py:

import pandas as pd
from openai import OpenAI

# Configure OpenAI client
client = OpenAI(api_key='YOUR_API_KEY')

def get_embedding(text, model="text-embedding-ada-002"):
    """
    Get the embedding vector for the text.
    """
    text = text.replace("\n", " ")  # Clean newline characters from text
    response = client.embeddings.create(input=[text], model=model)  # Request embedding vector
    return response.data[0].embedding  # Extract and return the embedding vector

def embedding_gen():
    """
    Generate embedding vector data for tutorial text.
    """
    df = pd.read_csv('channel_messages.csv')  # Read CSV file into DataFrame
    df['text_with_date'] = df['date'] + " " + df['text']  # Concatenate date and text
    df['ada_embedding'] = df[:100].text_with_date.apply(get_embedding)  # Embed only the first 100 rows as a demo; the rest are left as NaN

    del df['text_with_date']  # Delete 'text_with_date' column
    df.to_csv('embedded_1k_reviews.csv', index=False)  # Save results to a new CSV file
    
    # Print the first few rows of the DataFrame for confirmation
    print(df.head())

# When the script is run directly
if __name__ == "__main__":
    embedding_gen()

Run the Script#

python embedding_generator.py
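One note on cost and speed: the script makes one API call per message, but the embeddings endpoint also accepts a list of inputs per request. A batched sketch; chunked is my own helper, the batch size is an assumption, and rate-limit handling is left out:

```python
def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def get_embeddings_batched(client, texts, model="text-embedding-ada-002", batch_size=100):
    """Embed many texts with one API request per batch, in input order.

    `client` is an openai.OpenAI instance like the one configured in the script.
    """
    vectors = []
    for batch in chunked([t.replace("\n", " ") for t in texts], batch_size):
        response = client.embeddings.create(input=batch, model=model)
        vectors.extend(item.embedding for item in response.data)
    return vectors
```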

Code Implementation - Search#

  • pandas, numpy, and tabulate libraries: You can install them using pip install pandas numpy tabulate.
  • The tabulate library is used to print the DataFrame in table format.

Save the following Python script as embedding_search.py:

import ast
import sys
import pandas as pd
import numpy as np
from tabulate import tabulate
from openai import OpenAI

# Configure OpenAI client
client = OpenAI(api_key='YOUR_API_KEY')

def get_embedding(text, model="text-embedding-ada-002"):
    """
    Get the embedding vector for the text.
    """
    text = text.replace("\n", " ")  # Clean newline characters from text
    response = client.embeddings.create(input=[text], model=model)  # Request embedding vector
    return response.data[0].embedding  # Extract and return the embedding vector

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def embedding_search(query, df, model="text-embedding-ada-002"):
    """
    Use OpenAI API to search for embedding vectors.
    """
    query_embedding = get_embedding(query, model=model)  # Get embedding vector for the query text
    df['similarity'] = df.ada_embedding.apply(lambda x: cosine_similarity(ast.literal_eval(x), query_embedding))  # Calculate similarity
    df = df.sort_values(by='similarity', ascending=False)  # Sort by similarity in descending order
    df = df.drop(columns=['ada_embedding'])  # Remove embedding vector column
    return df

if __name__ == "__main__":
    df = pd.read_csv('embedded_1k_reviews.csv')  # Read CSV file into DataFrame
    df = df.dropna(subset=['ada_embedding'])  # Skip rows that step two left unembedded
    query = sys.argv[1]
    df = embedding_search(query, df)  # Search for embedding vectors
    print(tabulate(df.head(10), headers='keys', tablefmt='psql'))  # Print the top 10 results

Run the Script - Results#

$ python embedding_search.py "Animal Crossing"
+------+---------------------+--------------------------------------------------------------+----------------+
|      | date                | text                                                         |     similarity |
|------+---------------------+--------------------------------------------------------------+----------------|
| 1041 | 2021-04-03 06:18:40 | Neil's Animal Crossing                                       |       0.843896 |
|  836 | 2021-10-16 02:37:16 | Animal Crossing Direct Chinese Video                         |       0.826405 |
|      |                     | https://www.youtube.com/watch?v=rI_jWfNd2dc                  |                |
| 1208 | 2019-11-10 00:05:56 | Raising animals seems very interesting                       |       0.822377 |
|  489 | 2023-06-16 09:33:15 | Watching the life of a kitten reminds me of Sisyphus in mythology |       0.802677 |
|  369 | 2023-08-16 02:15:54 | Do house cats get bored and lonely?                          |       0.797062 |
|   13 | 2023-12-14 13:17:59 | Attended 🤗                                                  |       0.796492 |
| 1177 | 2020-02-12 10:27:45 | The reason why people eat wild animals repeatedly is related to the deeply rooted concepts in traditional Chinese medicine |       0.796363 |
|      |                     | Health preservation, dietary therapy, supplementation, medicinal cuisine, shape complementing shape, nourishing qi and blood... |                |
|      |                     | Pseudoscience is reviving, and if not curbed now, similar things will happen in the future. |                |
|      |                     | Science is the only way.                                     |                |
|  801 | 2021-11-07 13:46:21 | I didn't expect that this year's game of the year would still be Animal Crossing and Fire Emblem. |       0.796246 |
|      |                     | Animal Crossing is because I didn't play enough before, Fire Emblem is because of a major event that made me want to replay it. |                |
|  837 | 2021-10-16 02:37:16 | No way, is my game of the year going to be Animal Crossing again? |       0.795871 |
|  423 | 2023-07-29 14:11:22 | A profile picture that can be called spiritual pollution~~   |       0.794144 |
+------+---------------------+--------------------------------------------------------------+----------------+
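The similarity column is the cosine similarity defined above: the dot product divided by the product of the vector lengths, so vectors pointing the same way score 1.0, orthogonal ones 0.0, and opposite ones -1.0. A quick check with toy vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal    -> 0.0
print(cosine_similarity([1.0, 0.0], [-2.0, 0.0]))  # opposite      -> -1.0
```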

The Long Road Ahead - Many More Tasks to Do#

  • Vector Database: The bot should query a real vector database; re-reading a CSV file on every search is too inefficient. I'm considering Cloudflare Vectorize, but I wanted to run this simple experiment first to understand the process; after all, Vectorize requires Cloudflare's paid plan, and I don't yet know whether it will meet my needs.
  • Continuous Database Updates: Not only my channel but also my articles and other relevant data sources, even some channels I follow, and continuously update the database using a Telegram bot.
  • Prompt Engineering: When asking ChatGPT, I can find relevant content from this vector database and include it in the prompt to ask ChatGPT.
  • Basic Knowledge: I can't wait until I have enough basic knowledge to do these things; I should learn while doing. I have already completed a few steps that I understand, and I need to supplement the corresponding knowledge reserves later.
  • Improve Quality: Some low-quality content should not be included, and efforts should be made to reduce image-related content since images cannot be embedded.
  • Make it a CLI: This functionality is actually written into the Nayako CLI, but the code isn't organized yet and has no exception handling, so I released it as a demo first; pasting such a long stretch of code here isn't great either.

Postscript#

This is another article without much technical depth; it is just a record of some things I learned.
