02-24-2022 07:00 AM
Hello,
I have a csv file with a list of strings in the first column and their corresponding groups in the second column.
There are many strings that appear multiple times in different groups.
I would like to draw a heatmap that shows the proportion of each group occurring in each of the N groups.
Thus, I will have the groups on both the x and y axes, and a diagonal equal to 1.
Each cell of the heatmap holds the proportion of strings of group i (x-axis) that also occur in group j (y-axis).
Thanks,
02-24-2022 11:01 AM
An example of your data would help. How many groups are there? Also, what should the output look like?
Maybe maps would help. What have you tried?
02-24-2022 11:38 AM
It looks a bit complicated.
I have 23 groups of strings
An example:
Jake 1
Alex 2
Jake 2
Mike 1
Olga 2
Chris 22
Olga 1
...
Then on the x axis and y axis I will have the groups from 1 to 23. Let's say we have 20 people in group 1 and 30 people in group 2.
The first item (group on the x-axis) is 1, so the proportion is 20/20 (because all the items in group 1 are obviously in group 1), but they may also exist in other groups.
The second item (group on the x-axis) is 2, so the proportion is 1/30 because only 1 item is in group 2 (Jake is included in groups 1 and 2). The question to ask is: how many items from group 1 are in group 2?
So this is to see how the people are partitioned among the different groups; sometimes one person belongs to multiple groups.
02-24-2022 12:14 PM
Your csv file bears no resemblance to your problem description. There are quite a few commas and there is no obvious delimiter (a plain comma gives quite a few columns, and if we use <",>, then e.g. the fifth line has a problem getting the group because the string is not in quotes).
Can you attach a cleaner file?
02-24-2022 01:29 PM
@altenbach wrote:
Your csv file bears no resemblance to your problem description. There are quite a few commas and there is no obvious delimiter (a plain comma gives quite a few columns, and if we use <",>, then e.g. the fifth line has a problem getting the group because the string is not in quotes).
yes ... also there appear to be 25 groups instead of 23 ...
02-24-2022 02:45 PM
Indeed, sorry, I attached the wrong file.
I attach it here with the 25 groups.
To simplify the problem of the heatmap, I would like to calculate for each pair of groups (i,j), the proportion of group i in group j and the proportion of group j in group i. The groups have different sizes.
If I manage to fill in this matrix, then the heatmap will be straightforward.
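For readers outside LabVIEW, the matrix described here can be sketched in a few lines of Python. This is only an illustration under one assumption that the thread has not settled: cell (i, j) is taken to be the number of names shared by groups i and j divided by the size of group i. The function name `overlap_matrix` is hypothetical, and the rows are the sample lines quoted earlier in the thread.

```python
from collections import defaultdict

# Sample rows from the example earlier in the thread: (name, group).
rows = [
    ("Jake", 1), ("Alex", 2), ("Jake", 2), ("Mike", 1),
    ("Olga", 2), ("Chris", 22), ("Olga", 1),
]

def overlap_matrix(rows):
    """Return (groups, matrix) with matrix[i][j] = |S_i & S_j| / |S_i|.

    ASSUMPTION: the denominator is the size of the row group i.
    Swap it for len(members[gj]) if the other reading is intended.
    """
    members = defaultdict(set)
    for name, group in rows:
        members[group].add(name)  # a set keeps each name once per group
    groups = sorted(members)
    matrix = [
        [len(members[gi] & members[gj]) / len(members[gi]) for gj in groups]
        for gi in groups
    ]
    return groups, matrix

groups, matrix = overlap_matrix(rows)
for g, row in zip(groups, matrix):
    print(g, [round(v, 2) for v in row])
```

Because the groups have different sizes, the matrix is generally not symmetric: cell (i, j) and cell (j, i) share a numerator but are divided by different group sizes. The diagonal is always 1.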
02-24-2022 03:20 PM
02-24-2022 03:24 PM
@ziedhosni wrote:
I would like to calculate for each pair of groups (i,j), the proportion of group i in group j and the proportion of group j in group i. The groups have different sizes.
"Proportion" of what? Are you talking about the number of unique elements in each group?
02-24-2022 03:56 PM
As already mentioned, I probably would use a map where the key is the group# (I32) and the value is a set of names (strings).
Here's a quick draft. (This works with your original file, ignoring any line that does not start with a <">.)
There are plenty of ways to compare set(i) and set(k) in the stack of loops on the right. Once your description is a bit less ambiguous, we can narrow it down. Modify as needed.
Note that sets only contain unique elements.
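The attached draft is a LabVIEW diagram, so it cannot be reproduced as text. As a rough text equivalent of the map-of-sets idea, here is a Python sketch. It assumes each usable line looks like `"name",group`, which is inferred from the <",> delimiter discussion and not confirmed against the actual file. The function name `parse_groups` and the sample lines are hypothetical.

```python
def parse_groups(lines):
    """Build {group number: set of names} from lines shaped like "name",group.

    ASSUMPTION: the name is quoted and the group number follows the last '",'.
    Lines that do not start with a double quote are ignored.
    """
    groups = {}
    for line in lines:
        line = line.strip()
        if not line.startswith('"'):
            continue  # header, blank, or unquoted line
        # Split on the LAST '",' so commas inside the name are preserved.
        name, sep, group_text = line[1:].rpartition('",')
        if not sep:
            continue  # no closing quote followed by a comma
        try:
            group = int(group_text)
        except ValueError:
            continue  # group field is not an integer
        groups.setdefault(group, set()).add(name)
    return groups

# Hypothetical sample lines, not taken from the attached file.
sample = [
    'name,group',      # skipped: does not start with a quote
    '"Jake",1',
    '"Doe, Jane",2',   # comma inside the quoted name
    '"Jake",1',        # duplicate: the set keeps one copy
    '"Jake",2',
    'Mike,1',          # skipped: name is not quoted
]
by_group = parse_groups(sample)
print(by_group)
```

Since the values are sets, a name listed twice in the same group is counted once, which matches the note above about unique elements.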
02-25-2022 10:53 AM
Let me give a real-life example. Imagine some students participating in the clubs of a university. Some of them join one club each, and some participate in several clubs.
The heatmap will measure the proportion of club i's members who are also in club j. The diagonal will obviously be 1.
If no member of the music club participates in the theatre club, then we have zero in that intersection (pair).
I keep forgetting the dataset.