How to Collect Telegram Chat Member Names for Data Analysis

I recently delved into the fascinating world of parsing chats in Telegram and was surprised by how many repetitive questions people ask, the low level of understanding among those who need parsing, and the rampant scams and abuses from those offering such services. After observing this, I decided to figure it out on my own.

In this article, I’ll try to explain in a way that’s easy to understand, even for those who aren’t familiar with coding, what can and cannot be done and how labor-intensive the whole process is. I won’t be providing ready-to-use source code, but there will be small examples to illustrate the points.

As you may know, Telegram has chats and channels that can accumulate a large number of users. Having a list of these users can sometimes be quite useful, for example, for sending out newsletters or invitations.

In the context of Telegram, the term “parsing” usually refers to extracting a list of users from a channel or chat. Less commonly, it can also mean retrieving a list of messages.

Channels

Let’s start with channels. A channel in Telegram is a type of resource where users can only read messages from the channel owner. They cannot post messages themselves, except in cases where the channel has a linked comment chat. In that scenario, subscribers have the ability to comment on the owner’s messages.

You can obtain a list of subscribers from a channel without an attached comment chat only if it’s your own channel and it has fewer than 200 subscribers. If even one of these conditions is not met, technically speaking, parsing is impossible, and no promises can change that. There might be new methods in the future, either legal or exploiting loopholes, but as of now, there are no working solutions.

If the chat with comments exists, you can scrape the users in the same way as with any other chat.

Regarding the list of messages in a channel, you can access it either programmatically through the Telegram API or manually by exporting the message list using the standard client.

Chats

Chats present a more intriguing challenge. Extracting a list of users manually using the standard client is nearly impossible unless you’re prepared to jot down all the information you need with a pen and notebook. This isn’t very practical, so it’s better to turn to Telegram’s native API, or, to make things easier, use a library like Telethon.

In Telethon, there’s a function called GetParticipantsRequest, which takes an entity (a chat or channel) as input and returns a list of users.

Let’s try feeding it a chat.

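from telethon.tl.functions.channels import GetParticipantsRequest
from telethon.tl.types import ChannelParticipantsSearch

async def test1(client):
    # client is an already connected and authorized TelegramClient
    chat_id = 'https://t.me/kakoy-to-chat'
    chat_entity = await client.get_entity(chat_id)
    # An empty search string matches every participant
    participants = await client(GetParticipantsRequest(
        chat_entity, ChannelParticipantsSearch(''), offset=0, limit=200, hash=0))
    for user in participants.users:
        print(user)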

Let’s see what can be achieved using this function:

User(id=306742xxx,
    is_self=False,
    contact=False,
    mutual_contact=False,
    deleted=False,
    bot=False,
    bot_chat_history=False,
    bot_nochats=False,
    verified=False,
    restricted=False,
    min=False,
    bot_inline_geo=False,
    support=False,
    scam=False,
    apply_min_photo=True,
    fake=False,
    access_hash=669983103xxxxx,
    first_name='??\u200d>?',
    last_name=None,
    username='prosto_user_name',
    phone=None,
    photo=UserProfilePhoto(photo_id=13174487829112xxxx,
    dc_id=2,
    has_video=False,
    stripped_thumb=b'\x01\x08\x08\x04\xe0\xaa\xe0\x8f\x9b\x8cQE\x14\x90\xcf'),
    status=UserStatusRecently(),
    bot_info_version=None,
    restriction_reason=[],
    bot_inline_placeholder=None,
    lang_code=None)

The most commonly required fields include id, username, first_name, last_name, and phone. Additionally, there are numerous attributes such as bot, verified, scam, fake, photo, status, and others.
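
Pulling only those fields out of each returned object could look roughly like the sketch below. The helper name user_to_row and the FIELDS tuple are illustrative choices of mine, not part of Telethon, and the sample object is a stand-in so that the snippet runs without a Telegram connection.

```python
from types import SimpleNamespace

# Fields the text names as most commonly required.
FIELDS = ("id", "username", "first_name", "last_name", "phone")

def user_to_row(user):
    """Return a dict holding only the commonly needed fields of a user object."""
    # getattr with a default keeps this safe for objects missing an attribute.
    return {field: getattr(user, field, None) for field in FIELDS}

# Stand-in for a Telethon User object (hypothetical values).
sample = SimpleNamespace(id=1, username="prosto_user_name", first_name="Anna",
                         last_name=None, phone=None, bot=False)
row = user_to_row(sample)
print(row)
```

Everything else (bot, verified, scam and so on) stays available on the original object if it is needed later.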

As you can see, the information varies greatly. Some Telegram parsing specialists manage to claim that they only obtained IDs, while usernames and phone numbers come at an extra cost. Clever, to say the least!

Phones will only appear on this list if the user hasn’t disabled the option to display their phone to everyone in the settings.

By the way, it’s sometimes suggested to also determine a user’s gender. Telegram does not provide or have such data. I’m only aware of two ways to obtain this information:

1. Analyze usernames and real names by checking them against a pre-existing database of names and draw conclusions where possible. For instance, if a username is something like Karina, Julia, or Alena, one might assume it belongs to a woman.
2. Download all messages from chats for each user, extract the verbs, and determine how often they end with the letter “a”. In Russian, past-tense verbs take this ending in the feminine form, so it is logical to assume that it occurs much more often in messages from women than from men.

It is clear that both methods provide no guarantees and only allow for the determination of gender with a certain level of probability. Additionally, they require extra effort.
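
The first method can be sketched as a plain dictionary lookup. The table here is deliberately tiny and hypothetical: the three female names come from the example above, "ivan" is my own addition for contrast, and a real name database would be far larger. The function name guess_gender is also just an illustration.

```python
# Hypothetical lookup table; a real one would hold thousands of names.
NAME_GENDER = {
    "karina": "female",
    "julia": "female",
    "alena": "female",
    "ivan": "male",  # illustrative addition, not from the article
}

def guess_gender(first_name, username=None):
    """Return 'female', 'male' or 'unknown' based on a name lookup."""
    for candidate in (first_name, username):
        if not candidate:
            continue
        key = candidate.strip().lower()
        if key in NAME_GENDER:
            return NAME_GENDER[key]
    return "unknown"

print(guess_gender("Karina"))
print(guess_gender(None, "xX_dark_Xx"))
```

Anything that is not an exact match falls through to "unknown", which in practice will be a large share of the users.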

When closely examining the output of GetParticipantsRequest, we can see that regardless of the number of chat participants or the limit parameter, it only returns a maximum of 200 users. This is sufficient when the group has fewer than 200 members. However, if there are more, additional effort will be needed.

My experiments with the offset parameter revealed that it’s used to specify an offset in the list of users. By default, this offset is set to zero, but if you implement a loop and increment the offset with each iteration, you can download 200 users at a time and parse almost indefinitely (or at least until you run out of users). For example, like this:

# Runs inside an async function, with chat_entity obtained as in the first example
offset = 0
limit = 200  # the server returns at most 200 users per request
while True:
    participants = await client(GetParticipantsRequest(
        chat_entity, ChannelParticipantsSearch(''), offset, limit, hash=0))
    if not participants.users:
        break
    # ...
    # Do something with the users in participants.users here
    # ...
    offset += len(participants.users)

However, it quickly becomes apparent that the GetParticipantsRequest function returns a maximum of 10,000 users. So far, we haven’t figured out how to increase this limit. Some believe it might be impossible.
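
The interplay of the two limits is easy to see in a simulation that needs no network. In this sketch make_fake_fetch is a hypothetical stand-in for the real request: it hands out at most 200 items per call and pretends that nothing exists beyond the first 10,000, which are the figures reported above. The collect_all loop is the same offset loop as in the fragment.

```python
PAGE_CAP = 200       # largest page per request, as reported in the text
TOTAL_CAP = 10_000   # overall ceiling, as reported in the text

def make_fake_fetch(total_members):
    """Build a stand-in for the participants request (no network involved)."""
    visible = min(total_members, TOTAL_CAP)

    def fetch(offset, limit):
        end = min(offset + min(limit, PAGE_CAP), visible)
        return list(range(offset, end))  # integers play the role of users

    return fetch

def collect_all(fetch, limit=200):
    """Advance the offset page by page until an empty page comes back."""
    users, offset = [], 0
    while True:
        page = fetch(offset, limit)
        if not page:
            break
        users.extend(page)
        offset += len(page)
    return users

small = collect_all(make_fake_fetch(450))
large = collect_all(make_fake_fetch(25_000))
print(len(small), len(large))
```

The small group is retrieved in full, while the large one stops at the ceiling no matter how long the loop keeps asking.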

The filter parameter allows you to specify criteria that the returned results must meet.

Here are the options:

ChannelParticipantsAdmins;
ChannelParticipantsBanned;
ChannelParticipantsBots;
ChannelParticipantsContacts;
ChannelParticipantsKicked;
ChannelParticipantsMentions;
ChannelParticipantsRecent;
ChannelParticipantsSearch.

At this point, you can start experimenting, like trying to get a list of all admins or seeing who is currently online (the status field of each returned user shows this when the user allows it). Parsing online users is actually a great idea. By doing this regularly, you can filter out inactive members who joined the group and then forgot about it.

The ChannelParticipantsSearch filter is the one to focus on, as it allows us to search for users by their username or part of it. Let’s try to set up a loop:
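
Picking out online users from a returned list could be sketched as follows. To keep the snippet runnable without Telethon, it defines two empty stand-in classes named after Telethon's status types and compares class names. With the library installed, an isinstance check against telethon.tl.types.UserStatusOnline would be the more usual form. The function name online_users is my own.

```python
from types import SimpleNamespace

# Stand-ins named after Telethon's status types so the example runs offline.
class UserStatusOnline:
    pass

class UserStatusRecently:
    pass

def online_users(users):
    """Keep users whose status object is of the type named UserStatusOnline."""
    return [u for u in users
            if type(getattr(u, "status", None)).__name__ == "UserStatusOnline"]

users = [
    SimpleNamespace(id=1, status=UserStatusOnline()),
    SimpleNamespace(id=2, status=UserStatusRecently()),
    SimpleNamespace(id=3, status=None),
]
online_ids = [u.id for u in online_users(users)]
print(online_ids)
```

Running such a check on a schedule and recording the ids seen each time gives the activity picture described above.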

    chat_id = 'https://t.me/stepnru'
    chat_entity = await client.get_entity(chat_id)
    # One search query per letter of the Latin alphabet
    keys = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
    for key in keys:
        participants = await client(GetParticipantsRequest(
            chat_entity, ChannelParticipantsSearch(key), offset=0, limit=200, hash=0))
        # count is the total number of matches, not the size of this page
        print(key + ": " + str(participants.count))

Let me explain: we went through the entire Latin alphabet, checking each letter to find users whose username contains it.

Let’s see what we’ve got:

A: 28068
B: 11188
C: 5721
D: 15950
E: 7522
F: 5280
G: 8812
H: 4002
I: 9233
J: 3642
K: 15177
L: 8264
M: 20343
N: 10546
O: 5903
P: 9001
Q: 1009
R: 9882
S: 22445
T: 9881
U: 2376
V: 12249
W: 2581
X: 1749
Y: 4324
Z: 4283

As you can see, sometimes the results list contains fewer than 10,000 entries, in which case we can retrieve the entire list. Other times, it contains more than 10,000 entries, and then we can only access the first 10,000. However, a test conducted on a group of 190,000 users allowed us to gather data on 140,000 of them, which is quite substantial!
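
A quick calculation on the counts above shows which letters hit the ceiling and how many rows the 26 queries can return at most. The result is only an upper bound on rows fetched: the same user can match several letters, so the number of unique users will be lower.

```python
CAP = 10_000  # per-query ceiling reported in the text

# Per-letter match counts copied from the output above.
counts = {
    "A": 28068, "B": 11188, "C": 5721, "D": 15950, "E": 7522, "F": 5280,
    "G": 8812, "H": 4002, "I": 9233, "J": 3642, "K": 15177, "L": 8264,
    "M": 20343, "N": 10546, "O": 5903, "P": 9001, "Q": 1009, "R": 9882,
    "S": 22445, "T": 9881, "U": 2376, "V": 12249, "W": 2581, "X": 1749,
    "Y": 4324, "Z": 4283,
}

# Letters whose result list is truncated at the ceiling.
over_cap = [letter for letter, n in counts.items() if n > CAP]
# Rows retrievable across all queries, before removing duplicates.
max_rows = sum(min(n, CAP) for n in counts.values())

print("letters over the cap:", over_cap)
print("rows retrievable before de-duplication:", max_rows)
```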

There are certainly other ways to experiment with filters and extract even more people from the chat. Consider it your homework assignment.
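
One possible direction, offered as an untested idea rather than a proven recipe: whenever a query reports more matches than can be fetched, repeat it with one more letter appended. The sketch below simulates this without a network. make_fake_search is a hypothetical stand-in that matches substrings of a username and truncates the result, which is how the article describes the search; the real server's matching rules may differ, and users whose names contain digits or non-Latin characters would not be reached this way.

```python
import string

def harvest(search, query, cap, found, alphabet=string.ascii_lowercase, max_len=3):
    """Collect users for `query`; if the result was truncated, retry with longer queries."""
    total, users = search(query)
    for user in users:
        found[user["id"]] = user  # a dict keyed by id removes duplicates
    if total > cap and len(query) < max_len:
        for letter in alphabet:
            harvest(search, query + letter, cap, found, alphabet, max_len)

def make_fake_search(members, cap):
    """Hypothetical stand-in for the search request: substring match, truncated at cap."""
    def search(query):
        hits = [m for m in members if query in m["username"]]
        return len(hits), hits[:cap]
    return search

# Nine fake members: aa, ab, ac, ba, ... cc; at most 2 results per query.
names = [a + b for a in "abc" for b in "abc"]
members = [{"id": i, "username": name} for i, name in enumerate(names)]
search = make_fake_search(members, cap=2)

one_pass = {}  # single letters only, no refinement
for letter in "abc":
    harvest(search, letter, cap=2, found=one_pass, alphabet="abc", max_len=1)

refined = {}   # truncated queries are split further
for letter in "abc":
    harvest(search, letter, cap=2, found=refined, alphabet="abc", max_len=3)

print(len(one_pass), len(refined))
```

The price is a much larger number of requests, so the run time grows accordingly.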

Please note: this method takes much longer, and parsing a group with tens of thousands of users can take tens of minutes.

I recommend saving results not in a text file, but in a database, such as SQLite:

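import sqlite3

def add_users_in_base(bd_name, users):
    sqlite_connection = sqlite3.connect(bd_name)
    cursor = sqlite_connection.cursor()
    # The column list is abbreviated here; the number of placeholders must match it
    sqlite_insert_query = "INSERT INTO users (id, deleted, bot, bot_chat_history .....  phone) VALUES (?,?,?,?,?,?,?,?) "
    for user in users:
        data_tuple = (
            user.id, user.deleted, user.bot, user.bot_chat_history, .... user.phone)
        try:
            cursor.execute(sqlite_insert_query, data_tuple)
        except sqlite3.Error:
            # For example, a repeated id rejected by a uniqueness constraint: skip the row
            pass
    # Commit once after the loop instead of after every row
    sqlite_connection.commit()
    cursor.close()
    sqlite_connection.close()

Duplicates are filtered out right at insertion: provided the id column is declared unique (for example, as the primary key), a repeated insert fails and is skipped. This makes it much easier to work with the data afterwards, whether you’re searching, sorting, or converting it.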
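
A self-contained variant of the same idea, reduced to five columns so that it runs as is. The table layout, the save_users name and the User stand-in are my assumptions, not the author's schema. Here the id column is the primary key and INSERT OR IGNORE lets SQLite drop repeated ids silently instead of raising an error.

```python
import sqlite3
from collections import namedtuple

# Stand-in for the user objects returned by the API (hypothetical subset of fields).
User = namedtuple("User", "id username first_name last_name phone")

def save_users(connection, users):
    """Insert users, silently skipping ids that are already stored."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS users ("
        "id INTEGER PRIMARY KEY, username TEXT, first_name TEXT, "
        "last_name TEXT, phone TEXT)")
    rows = [(u.id, u.username, u.first_name, u.last_name, u.phone) for u in users]
    with connection:  # commits on success, rolls back on error
        connection.executemany(
            "INSERT OR IGNORE INTO users VALUES (?,?,?,?,?)", rows)

conn = sqlite3.connect(":memory:")  # in-memory database for the demonstration
save_users(conn, [User(1, "anna", "Anna", None, None),
                  User(2, "boris", "Boris", None, None)])
# The second batch overlaps with the first one on id 2.
save_users(conn, [User(2, "boris", "Boris", None, None),
                  User(3, "vera", "Vera", None, None)])
total = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(total)
```

Passing a file name instead of ":memory:" keeps the collected data between runs, which matters when a long parsing session is interrupted.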

Conclusions

I’ve demonstrated how to extract information about 10,000 chat participants, and with the use of filters, you can handle even more. With some experimentation, you’ll be able to write scripts that collect the data you need in a format that’s convenient for you.

If you know any other interesting tips on this topic, don’t forget to share them in the comments!
