(C) PLOS One. This story was originally published by PLOS One and is unaltered.

Graphia: A platform for the graph-based visualisation and analysis of high dimensional data [1]

Tom C. Freeman, The Roslin Institute, Easter Bush Campus, The University of Edinburgh, Edinburgh, United Kingdom, Kajeka Limited, Roslin Innovation Centre, Sebastian Horsewell, Anirudh Patir

Date: 2022-08

Abstract

Graphia is an open-source platform created for the graph-based analysis of the huge amounts of quantitative and qualitative data currently being generated from the study of genomes, genes, proteins, metabolites and cells. Core to Graphia’s functionality is support for the calculation of correlation matrices from any tabular matrix of continuous or discrete values, whereupon the software is designed to rapidly visualise the often very large graphs that result in 2D or 3D space. Following graph construction, an extensive range of measurement algorithms, routines for graph transformation, and options for the visualisation of node and edge attributes are available for graph exploration and analysis. Combined, these provide a powerful solution for the interpretation of high-dimensional data from many sources, or data already in the form of a network or equivalent adjacency matrix. Several use cases of Graphia are described, to showcase its wide range of applications in the analysis of biological data. Graphia runs on all major desktop operating systems, is extensible through the deployment of plugins and is freely available to download from https://graphia.app/.

Author summary

Graphia is a new visual analytics platform specifically created for the network-based analysis of large and complex data, such as that generated in huge amounts by modern biological analyses.
It works in a data-agnostic, hypothesis-free manner to generate correlation networks from any table of numerical or discrete values, thereafter providing a means to rapidly visualise the often very large networks that result, in either 2D or 3D space. Following network construction, the tool offers an extensive range of analysis algorithms, routines for network transformation, and options for the visualisation of metadata. This provides a powerful analysis solution for the exploration and interpretation of high-dimensional data from any source, as well as any data already defined as a network. Several use cases of Graphia are described to showcase its wide range of applications in the analysis of biological data. Graphia is open source and free to all.

Citation: Freeman TC, Horsewell S, Patir A, Harling-Lee J, Regan T, Shih BB, et al. (2022) Graphia: A platform for the graph-based visualisation and analysis of high dimensional data. PLoS Comput Biol 18(7): e1010310. https://doi.org/10.1371/journal.pcbi.1010310

Editor: Eli Zunder, University of Virginia, UNITED STATES

Received: July 16, 2021; Accepted: June 16, 2022; Published: July 25, 2022

Copyright: © 2022 Freeman et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: Graphia is an open-source tool and as such all code is freely available. See: https://github.com/graphia-app/graphia.

Funding: Graphia was originally designed and built by Kajeka Ltd., a University of Edinburgh spin-out company (2015-2020). We would like to acknowledge all those who supported this venture, in particular grant funding from Scottish Enterprise (SMART/14/034 / 14/9168).
Although no specific academic grant funding was received for this project, during a period of Graphia’s development by Kajeka, TCF and JP were employed by the University and supported by the Roslin Institute’s Strategic Grant from the UK’s Biotechnology and Biological Sciences Research Council (BBSRC) [BBS/E/D/10002071]. More recently Janssen Immunology have funded further development of the tool (core funding). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

The study of interactions between entities is a cornerstone of modern analytics. In biology, efforts to map the ‘interactome’ (all the interactions between the components of a biological system) have been underway for some time, generated by a number of complementary approaches [1,2]. Networks of biological data may also be used to chart diverse phenomena such as the spread of disease, the interactions between drugs and their targets, and the evolutionary relationships between species. Many data from other sectors are also inherently graph-based in structure. Examples include interactions on social media platforms, customer/client relationships, communication and transport systems, computer networks and many other real-world systems.

Matrices of numerical data that do not inherently possess a network structure can also be analysed using graph-based approaches. Wherever it is possible to calculate the distance between entities, a graph can be constructed in which the entities are represented by nodes and high-confidence measures define the edges between them. In biology, such an approach is already widely used to analyse high dimensional data, in particular to construct and analyse gene coexpression networks [3,4], but the approach is applicable to any numerical or categorical data from any source.
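The idea of turning a numerical table into a graph can be sketched in a few lines. The example below is an illustration only, not Graphia's implementation: it assumes Pearson correlation as the similarity measure, a hypothetical three-row table, and an illustrative threshold of 0.7, and it keeps an edge between two rows whenever their correlation reaches that threshold. The function names `pearson` and `correlation_graph` are invented for this sketch.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant row carries no correlation signal
    return cov / sqrt(var_x * var_y)


def correlation_graph(rows, threshold=0.7):
    """Return edges (name_a, name_b, r) for every row pair with r >= threshold."""
    edges = []
    for (name_a, vals_a), (name_b, vals_b) in combinations(rows.items(), 2):
        r = pearson(vals_a, vals_b)
        if r >= threshold:
            edges.append((name_a, name_b, r))
    return edges


# Hypothetical toy table: rows are entities (e.g. genes), columns are samples.
table = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 6.2, 7.9],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
edges = correlation_graph(table, threshold=0.7)
```

With this toy table only geneA and geneB are joined, because geneC is anti-correlated with both. Raising or lowering the threshold trades graph size against edge confidence; the all-pairs loop is quadratic in the number of rows, so real tools parallelise or otherwise optimise this step.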
Given the explosion in the availability of data in recent years and the potential to visualise and analyse it using graph-based approaches, a variety of software tools to support these activities have been developed. In biology, Cytoscape [5] is perhaps the most widely used software for performing graph analytics. It has a large user base and supports many ‘apps’ (plugins) created by the community for the performance of specific graph-based analysis tasks. Other network visualisation and analysis tools include Gephi [6], Tulip [7], Bandage [8], Graphviz [9], Pajek [10], yEd (yFiles, Tübingen, Germany), BioLayout [4,11], Social Network Visualiser [12] and NodeXL [13]. We compare a number of the most commonly used network analysis tools in Table 1 and, for Gephi and Cytoscape, compare some of the key aspects of graph visualisation (S1 Fig). There is also a range of web-based software tools exclusively designed to visualise portions of data, often from a designated database, such as String [14], GeneMania [15] and Neo4J Bloom [16]. Some of these tools are focused on supporting a particular community, whilst others possess functionality tailored towards specific tasks or data types and include a mix of open-source projects and commercial tools. Others provide open-source code repositories for graph visualisation and analysis algorithms [17], or share repositories of graph data [18–20]. For a more comprehensive review of network analysis tools and resources, see [21].

Table 1. Summary of network visualisation tools commonly used for the analysis of biological data. https://doi.org/10.1371/journal.pcbi.1010310.t001

Despite the availability of a wide range of downloadable applications, web-resources and code libraries to support graph-based analyses, there is a pressing need for easy-to-use software that supports the rapid visualisation and analysis of very large networks.
To address this need, we developed Graphia, a general-purpose graph analysis tool that supports the integration, visualisation, analysis, and interpretation of a wide variety of data types. Here we provide an overview of Graphia’s core functionality for the analysis of graphs and describe a number of case studies in which it is applied to solve problems associated with the analysis of data derived from the biological sciences.

Methods

Design criteria

The following features were considered core to the design of Graphia:

Data and operating system agnostic. Import data from any source saved in standard file formats. The software should run on all major desktop operating systems and modern hardware configurations.
Fast and scalable. Support the rapid loading of data, fast computation of graph layout and analysis algorithms, and high quality data visualisations. Deliver smooth and responsive graphical rendering of millions of data points (nodes/edges) on standard desktop hardware.
Dynamic rendering. Visualise in real time changes to the graph structure associated with alterations in input parameters or additional data.
3D graph visualisation. Provide a navigable and immersive environment in which to explore and interpret large and complex graph topologies.
Correlation graphs as an essential function.
Rapidly convert any numerical or categorical data table into a correlation graph, supporting pattern finding and data mining.
Attribute handling and visualisation. Visualise attributes (metadata) associated with nodes and edges using colour, size and text to distinguish between attribute values.
Advanced analysis capabilities. Support a wide range of analytical algorithms and approaches that empower a user to explore, query and interpret data, such as the k-NN algorithm [22] for edge pruning, and the MCL [23] and Louvain [24] graph clustering algorithms.
Extensible. Provide extensible architecture through use of a plugin system to allow the core to be extended or adapted for specific application areas or data types.
User Interface. Provide a simple and intuitive user interface (UI) that is easy to navigate, featuring a graph display area supplemented with a table listing selected nodes and associated attributes and data values. The UI should provide easy access to menus providing functionality and display active transformations and visualisations (Fig 1).

Fig 1. Graphia user interface displaying a correlation graph. (A) Graph display area, showing correlation graph with a cluster selected (unselected nodes faded). (A1) Display context menu options (right click).
(B) Node (row) attribute display area. (B1) Table of selected nodes and their attributes, imported and calculated within the tool. (B2) Data plot area, in the case shown here a mean histogram of selected node values. (B3) Visualisation of column annotations. (B4) Data plot context menu options for changing plot (right click). (1a) Add Transform button, (1b) Active transforms, (2a) Add visualisation button, (2b) Active visualisations, (3) General toolbar, (4) Attribute parameter selection, (5) Display of graph metrics (number of nodes, edges, components), (6) Plot/table function toolbar. https://doi.org/10.1371/journal.pcbi.1010310.g001

Code architecture

Graphia is written in C++17 and is built upon Qt version 5, the cross-platform widget toolkit. For graphics, the industry standard OpenGL is used. The minimum driver support required is version 3.3 core profile, but more modern extensions will be used if they are available. Various open source libraries are employed, mostly for loading external data formats. These libraries and their associated licenses are enumerated in the About dialog of the application, accessible from the Help menu. Graphia is architected so that data loading and data-type-specific user interfaces are confined to plugins. These are modules independent of the core application and can be removed or added without affecting any base functionality. At the highest level the code is organised hierarchically into four separate directories:

app: the core application code
plugins: the existing bundled plugins
shared: code used by both the core and plugins (this includes interface headers)
thirdparty: any library code not authored locally

These are further divided into subdirectories dealing with specific areas of functionality. Graphia has been developed using standard object-orientated best practices. Continuous integration is employed to prevent portability build regressions, using recent versions of the compilers GCC, clang and MSVC.
In addition, static analysis tools such as clang-tidy and cppcheck are used to identify potential problems early. CMake is used as a build system, and is set up for Linux, Windows and macOS compilation.

Implementation

Data import. Graphia has been designed to import data encoded in a variety of standard and non-standard file formats. These include standard graph-based file formats such as BioPAX OWL ontology (.owl), JSON graph (.json), GraphML (.graphml), Graph Modelling Language (.gml), Cytoscape Exchange Files (.cx/.cx2) and MATLAB data file (.mat) formats, but also unstandardised formats such as edge lists, adjacency matrices, and tabular data prepared for correlation analyses. Using these file formats, a wide variety of data may be imported into Graphia, not only to define the nodes and edges of a graph, but also to supply user-defined attributes or metadata.

Fast and scalable. Existing graph visualisation tools either fail to render very large graphs effectively, or the ability to interact with a graph once rendered is limited and slow. Therefore, all aspects of Graphia’s functionality have been engineered to run quickly. Graphia can render graphs of millions of data points on relatively commonplace hardware, and interaction with them is fast and fluid. This has been achieved through the use of optimal coding practices and parallelisation of computationally intensive analysis routines, e.g. calculation of correlation matrices, graph layout, clustering.

Dynamic graph layout and rendering. Graph layout is an iterative process. Many programs only display the results of a layout algorithm after it has run a defined number of iterations. With Graphia, the layout is shown live, such that graphs ‘unfold’ in real time. However, the true power of dynamic graphs is realised when a transformation operation is performed or following the addition of new data. These changes are immediately reflected in the appearance of the graph.
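To make the contrast between "display after N iterations" and "display live" concrete, the following is a minimal sketch of an iterative force-directed layout written as a generator that hands back node positions after every iteration. It is not Graphia's layout algorithm: it assumes a basic 2D spring model (pairwise repulsion, attraction along edges, and a cap on movement per step), and the function name `layout_steps` and all parameters are hypothetical.

```python
import math
import random


def layout_steps(nodes, edges, iterations=50, k=1.0, step=0.05, max_move=0.1, seed=0):
    """Yield a {node: (x, y)} snapshot after each iteration of a simple spring layout."""
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1, 1), rng.uniform(-1, 1)] for n in nodes}
    for _ in range(iterations):
        force = {n: [0.0, 0.0] for n in nodes}
        # Repulsion between every pair of nodes (strength k^2 / distance).
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                dist = math.hypot(dx, dy) or 1e-9
                f = k * k / dist
                force[a][0] += f * dx / dist
                force[a][1] += f * dy / dist
                force[b][0] -= f * dx / dist
                force[b][1] -= f * dy / dist
        # Attraction along edges (strength distance^2 / k).
        for a, b in edges:
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            dist = math.hypot(dx, dy) or 1e-9
            f = dist * dist / k
            force[a][0] -= f * dx / dist
            force[a][1] -= f * dy / dist
            force[b][0] += f * dx / dist
            force[b][1] += f * dy / dist
        # Move each node along its net force, capped so the layout stays stable.
        for n in nodes:
            mx, my = step * force[n][0], step * force[n][1]
            length = math.hypot(mx, my)
            if length > max_move:
                mx, my = mx * max_move / length, my * max_move / length
            pos[n][0] += mx
            pos[n][1] += my
        yield {n: (p[0], p[1]) for n, p in pos.items()}


# Hypothetical four-node path graph; a renderer would redraw on every frame.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "d")]
frames = list(layout_steps(nodes, edges, iterations=50))
```

Because each iteration yields a snapshot, a caller can draw every frame as it arrives rather than waiting for the loop to finish, which is the behaviour the paragraph above describes. Adding or removing entries in `edges` between iterations would likewise show up in the next frame.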
As graphs change dynamically, there is a need to identify the graph components and map how they interact when construction parameters are adjusted. The ability to quickly identify and move between components is a unique feature of Graphia. Components are rendered in a concentric pattern, arranged large-to-small. Smaller components may be filtered away using a transform.

Most existing network analysis tools render graphs in 2D. The layout algorithm implemented in Graphia is innovative in that it applies current force-directed layout techniques, but in a dynamic fashion. Graphia renders graphs in 3D or 2D, making use of modern graphics hardware to display extremely large graphs efficiently with options for node/edge shading, relative node sizing and spacing (Fig 2A–2F). When graphs are relatively small or there is a need to share images by a conventional medium, e.g. a document, 2D graph visualisations have advantages. However, 2D visualisations are limiting when there is a need to display and interact with large graphs with complex topologies.

Fig 2. Different graph visualisation options. (A) 3D perspective view, smooth shading (the default), with visualisation of node categorical attribute (MCL cluster). (B) 3D orthographic view, flat shading (no perception of distance; all nodes are the same size unless sized by attribute value). (C) 3D perspective view, smooth shading. (D) 2D view, smooth shading. (E) 2D view, flat shading. (F) Compressed 2D layout, flat shading, showing node overlap view. Visualisation of (G) Betweenness centrality values, (H) Eccentricity values, (I) PageRank values. G-I are continuous (numerical) attributes, so a colour spectrum and size gradient is used for node display (2D, smooth shading). Betweenness and eccentricity are calculated for both nodes and edges, therefore visual encoding is applied to both. https://doi.org/10.1371/journal.pcbi.1010310.g002

Attribute-to-visual mapping.
Attributes are data values associated with nodes/edges. These can be user-defined or calculated by Graphia. For example, a node representing a person may be associated with knowledge of their gender, occupation, socioeconomic class, ethnicity, etc. (categorical attributes), as well as their height, age, weight, years in employment (numerical attributes). Colour can be used to represent categorical attributes, with nodes sharing the same attribute being assigned the same colour. In the case of numerical attributes, colour and size can be used to represent the value according to a spectrum, e.g. from small white nodes to big red nodes to represent low and high values, respectively. Both types of attribute may also be calculated from the graph itself, e.g. the assignment of nodes to clusters or calculation of node degree, PageRank values etc. (Fig 2G–2I). Visualising attribute values may help explain graph structure, for instance an area of a graph might be visibly associated with nodes of a given attribute. Attributes may also be used to analyse the statistical associations with graph topology, for example, graph clusters may be analysed for the enrichment of nodes with a given attribute. Attribute values may also be specific to only a single element, e.g. a unique node name.

Discussion

Data-driven research is now a foundation of modern biomedical and agricultural sciences, due to continued growth in the size and complexity of biological datasets. Network analysis provides a flexible toolbox combining visualisation with the algorithmic analysis of data structure, for testing a broad range of hypotheses and for hypothesis-free data exploration. Graphia is designed for the visualisation and analysis of large graphs. Originally, our interest in graph-based analysis was driven by our desire to analyse large correlation networks of transcriptomics data. The results of weighted gene coexpression network analysis (WGCNA) [3] are generally visualised as a tree diagram or heat-map.
The precursor of Graphia, BioLayout Express3D [4,11], was developed specifically to generate and display graphs of transcriptomic data and to support pathway modelling [4,38,39]. BioLayout has been used in the analysis of many large transcriptomic datasets from multiple species [40–44]. It has also been applied to datasets that were not envisaged at the time, for example the relationship between symptoms of altitude sickness [45], the honey bee microbiome [46], comparing morphometric measurements of dog brains [47] and even naming patterns in historical birth records [48]. The addition of new functionality to BioLayout was however constrained by inherent limitations in the code structure and programming language (Java). Graphia is an entirely new analytical platform developed using a modern UI framework (Qt) and programming language (C++). The correlation plugin reproduces and improves upon the functionality of BioLayout for the analysis of any high dimensional numerical matrix.

Data visualisation is core to the functionality of Graphia. Good visualisations make it easier for a user to recognise patterns, trends, and outlier groups within data. The next step in an analysis is determined by insights gained from the interaction with the visualisation, whether that be the discovery of errors in the input data, data effects due to technical reasons, or new and interesting discoveries. Graphia is designed to make best use of the latest accelerated graphics hardware, to make graph visualisations scalable but still responsive in real time. By default, graphs are rendered in 3D, where the visualisation and navigation of complex graph topologies is much enhanced; the additional dimension makes it possible to see the distance between nodes that might appear closely connected in 2D. Another core aspect of the visualisation of data is the concept of graphs being ‘dynamic’, changing in real time as nodes/edges are added or removed.
To achieve this, the layout algorithm runs continuously, unless manually paused. Dynamic transitions may become challenging when graph structure alters dramatically following a transformation, such as when a hub-node is deleted from a tree graph or one graph component fragments into many. If such a transformation is executed quickly, a user’s ‘mental map’ can be lost [49]. For this reason, Graphia includes the option to slow down the transition between one state and the next, and in addition orientates components ‘in flight’ prior to their reconnection. Indeed, the way in which Graphia handles graph components dynamically is unique.

The development of Graphia has been driven by the analytical challenges associated with data derived from the biological sciences, but it is designed as a general-purpose platform for the analysis of network data from any source. If the input data is in tabular form (continuous or discrete values), it can be used to build a graph. If data already exists in a graph format, Graphia provides a means to explore it. Graphia can load data from files or from remote web resources, and in theory it could also interface with a remote database. It is interesting to note the widespread adoption of graph databases. Not only do graph databases speed up and simplify querying of data stores, storage of data as a graph makes it easier to visualise and analyse. Whilst there are a growing number of web-based tools that support the querying and visualisation of graph databases, none possess the power of Graphia in rendering large portions of the data they store.

Here we offer a high-level view of the functionality of Graphia and some examples of its many uses within the biomedical sciences. We provide installers to allow it to run on all common desktop operating systems, and access to the source code to allow users to develop new functionality for their needs.

Supporting information

S1 Fig.
Comparison of Gephi, Cytoscape and Graphia in terms of large graph loading, layout and rendering performance. Three graphs were used for these tests: a correlation graph (panels a, d, g) generated from the GNF mouse gene expression atlas using Graphia (r = 0.7), then saved as a .gml file, and two graphs from an online repository https://chriswalshaw.co.uk/partition/#graphs - finan512 (panels b, e, h) and fe_pwt (panels c, f, i). These were selected to represent large graphs of different node/edge counts and structure. Each graph was opened using the three tools, and the time taken to load and lay out the graph shown in the panel was recorded. In each case, a force-directed layout algorithm similar to that employed by Graphia was used; for Gephi (v0.9.2) this was the Force Atlas 2 layout algorithm, and for Cytoscape (v3.9.0) the OpenCL Prefuse Layout algorithm. In the different views shown, we have not attempted to fully optimise graph layout in each case but show the layouts at 10x the standard number of iterations for Gephi and Cytoscape. While these layouts completed relatively quickly, they were far from ‘finished’ in establishing a stable layout. By contrast, we ran Graphia’s (v3.0) dynamic layout to a point where layout was near optimal, which took considerably longer. A notable difference between the network tools was in speed and fluidity of interaction with the graph visualisation, i.e. fps (frames per second) rates, with Graphia an estimated 3–4 times quicker than Gephi or Cytoscape, while rendering in 3D. The specifications of the computer used for these tests are: Intel Core i7-4930K, 16 GB memory, Nvidia GeForce RTX 2060 Super, Windows 10 Pro (10.0.19043). https://doi.org/10.1371/journal.pcbi.1010310.s001 (DOCX)

Acknowledgments

Graphia was originally designed and built by Kajeka Ltd., a University of Edinburgh spin-out company (2015–2020).
[1] Url: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010310

Published and (C) by PLOS One. Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.