Who's Who in Large Language Model Science? Mapping Science as a Graph | Kenelyze | Interactive Network Analytics & Visualization

Interactive version of the visualization: https://kenedict.com/networks/llm

Large Language Models and their applications are here to stay, with key players such as Microsoft, Google, Baidu and Alibaba all having announced ChatGPT-like functionality or products. Of course, there’s a lot of science behind these achievements. Let’s turn to graphs to find out more about the key players and their publications.

Data Collection & Preparation

OpenAlex’s excellent API was used to collect publications containing relevant terms in their title or abstract metadata. The terms used for the search were:

“large language model”
“generative language model”
“autoregressive language model”
“transformer language model”
“transformer-based language model”
“large transformer model”
“generative pre-trained transformer”

This is by no means an exhaustive list of relevant keywords, but it should give us enough relevant data to work with for the example presented here.
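As a rough sketch of what such a collection step could look like in Python: the snippet below builds one OpenAlex works query per search term and pages through the results with a cursor. The `title_and_abstract.search` filter and cursor paging are part of the public OpenAlex API, but whether this matches the exact queries used for this post is an assumption, and the helper names `build_query_url` and `fetch_all` are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

OPENALEX_WORKS = "https://api.openalex.org/works"

# The search terms listed above.
SEARCH_TERMS = [
    "large language model",
    "generative language model",
    "autoregressive language model",
    "transformer language model",
    "transformer-based language model",
    "large transformer model",
    "generative pre-trained transformer",
]


def build_query_url(term, cursor="*", per_page=200):
    """Build a works URL that searches titles and abstracts for a quoted phrase.

    Assumption: the title_and_abstract.search filter is used; the post does
    not say which OpenAlex filter was applied.
    """
    params = {
        "filter": f'title_and_abstract.search:"{term}"',
        "per-page": per_page,
        "cursor": cursor,
    }
    return f"{OPENALEX_WORKS}?{urlencode(params)}"


def fetch_all(term):
    """Yield every work matching one term, following OpenAlex cursor paging."""
    cursor = "*"
    while cursor:
        with urlopen(build_query_url(term, cursor)) as response:
            page = json.load(response)
        yield from page["results"]
        # next_cursor is null on the last page, which ends the loop.
        cursor = page["meta"].get("next_cursor")
```

Calling `fetch_all(term)` for each entry in `SEARCH_TERMS` would return overlapping results, since a paper can match several terms, so the combined output still needs deduplication.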

The output was filtered to include only records that contain information on the participating institutions. Duplicate records (for example, the same paper posted to multiple sources) were removed manually. Finally, the institution metadata was cleaned using OpenRefine.
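The filtering step can be approximated in code. The sketch below keeps only works that list at least one institution and drops repeats that share a normalised title. Note that the duplicates for this post were removed manually, so the title-based rule here is a stand-in assumption, not the method actually used. The field names (`authorships`, `institutions`, `display_name`) follow the OpenAlex work record; the function names are hypothetical.

```python
import re


def institution_names(work):
    """Collect the distinct institution names attached to a work's authors."""
    names = set()
    for authorship in work.get("authorships", []):
        for institution in authorship.get("institutions", []):
            name = institution.get("display_name")
            if name:
                names.add(name)
    return sorted(names)


def normalise_title(title):
    """Lowercase a title and collapse punctuation and whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", (title or "").lower()).split())


def filter_and_dedupe(works):
    """Keep works with institutions; drop later works with a repeated title.

    Assumption: two records with the same normalised title are the same paper.
    """
    seen_titles = set()
    kept = []
    for work in works:
        institutions = institution_names(work)
        if not institutions:
            continue  # no institution information, so not usable here
        key = normalise_title(work.get("title"))
        if key and key in seen_titles:
            continue  # likely the same paper posted on another source
        seen_titles.add(key)
        kept.append(
            {
                "id": work.get("id"),
                "title": work.get("title"),
                "institutions": institutions,
            }
        )
    return kept
```

A title match is a blunt rule: it can merge distinct papers with identical titles and miss duplicates whose titles differ slightly, which is one reason a manual pass is still worthwhile.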

Creating the graph in Kenelyze

To take a look at the institutions behind LLM-related science, let’s create a graph consisting of two node types: Publications and Institutions. We can do this in Kenelyze by picking the ‘Multiple Node Types’ network type when importing data, and then selecting the columns of interest in the dataset.
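A two-node-type graph like this boils down to a table with one row per publication-institution pair. The sketch below shows one way to produce such a table as CSV. The column names `Publication` and `Institution` and the shape of the input records are assumptions for illustration; the file layout Kenelyze expects on import is not described here.

```python
import csv
import io


def edge_rows(records):
    """Yield one (publication, institution) pair per link in the graph."""
    for record in records:
        for institution in record["institutions"]:
            yield (record["title"], institution)


def to_csv(records):
    """Serialise the publication-institution pairs as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Publication", "Institution"])  # hypothetical column names
    writer.writerows(edge_rows(records))
    return buffer.getvalue()


# Made-up example records, only to show the input shape.
example = [
    {"title": "Paper A", "institutions": ["Uni X", "Lab Y"]},
    {"title": "Paper B", "institutions": ["Uni X"]},
]
csv_text = to_csv(example)
```

Each row becomes one edge, and each distinct value in a column becomes one node of that column's type.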

After importing the data, Kenelyze automatically generates a layout and colors nodes by their communities.

Many graphs consist of multiple connected components, i.e. sub-groups of nodes in the network that can reach each other directly or indirectly. For this example, we’ll be looking at the largest connected component of papers and institutions. You can determine the components in your graph using the Components button in Kenelyze’s Metrics menu.
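To make the idea of a connected component concrete, here is a minimal sketch that finds components with a breadth-first search and picks the largest one. It is a generic illustration in plain Python and says nothing about how Kenelyze computes components internally. Nodes are tagged with their type so a publication and an institution with the same name stay distinct.

```python
from collections import defaultdict, deque


def connected_components(edges):
    """Return a list of node sets, one per connected component."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    components = []
    for start in neighbours:
        if start in seen:
            continue
        # Breadth-first search from an unvisited node collects its component.
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for other in neighbours[node]:
                if other not in seen:
                    seen.add(other)
                    component.add(other)
                    queue.append(other)
        components.append(component)
    return components


def largest_component(edges):
    """Return the component with the most nodes (empty set if no edges)."""
    return max(connected_components(edges), key=len, default=set())


# Made-up example: P1 and P2 share institution A; P3 stands apart with B.
example_edges = [
    (("pub", "P1"), ("inst", "A")),
    (("pub", "P2"), ("inst", "A")),
    (("pub", "P3"), ("inst", "B")),
]
main_part = largest_component(example_edges)
```

In the example, the largest component holds P1, P2 and A, because both papers reach each other indirectly through their shared institution.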

Here’s what the final network looks like, after adjusting some visualization settings to show more labels:

Exporting interactive visuals

To share this visual with others, you can export it from Kenelyze as a fully interactive, single HTML file using the Export Visual button in the menu bar.

The final visual can be explored here: https://kenedict.com/networks/llm

Of course, this is just one of many perspectives graphs can give on this specific dataset. In other types of analysis, we could examine collaboration networks by connecting institutions when they co-appear on papers, or apply text-based clustering to get a look at clusters of topics and themes.
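The collaboration-network idea can be sketched as a projection of the publication-institution graph: two institutions are linked whenever they co-appear on a paper, and the link weight counts how many papers they share. The input mapping and the function name below are hypothetical, and counting shared papers is only one of several possible weighting choices.

```python
from collections import Counter
from itertools import combinations


def collaboration_weights(paper_to_institutions):
    """Count, for each pair of institutions, the papers they co-appear on."""
    weights = Counter()
    for institutions in paper_to_institutions.values():
        # Sort and de-duplicate so each pair has one canonical ordering.
        for a, b in combinations(sorted(set(institutions)), 2):
            weights[(a, b)] += 1
    return weights


# Made-up example data, only to show the input shape.
example_papers = {
    "P1": ["A", "B", "C"],
    "P2": ["B", "A"],
    "P3": ["C"],
}
pair_weights = collaboration_weights(example_papers)
```

Papers with a single institution, such as P3 in the example, contribute no links, so institutions that never collaborate drop out of the projected network.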

Interested in doing this for other datasets, or want to find out more about how Kenelyze can support you in going from data to graph-powered insights? Let us know at info@kenelyze.com

On 10/02/2023 / Blog

