8 Live Session
8.1 Schedule
- Introduction
- Recap Pre- Live Session Content
- Real World Data is Messy
- Data Wrangling Principles
- Merging Datasets
- Cleaning Species Names
- BREAK
- Planning a Wrangling Workflow
- Collaborative Task
- Share Collaborative Task Results
- BREAK
- What Does Collaboration Mean to You?
- Prepare for Challenge Task
- Conclusion
8.2 Introduction
Welcome to the One Health Vector-Borne Diseases Hub Online Training. My name is Chloё, and I work with the VBD Hub to develop training and workshops, like this session today. We are also joined by our lovely demonstrators, who will be available throughout the session to provide support and answer any questions you have.
VBD Hub is a non-profit, open-source project funded by UKRI and Defra, which aims to improve accessibility and information sharing. To do this, the project builds infrastructure and tools to allow researchers to combine knowledge and share data within the VBD research community and with policymakers.
Our focus today is Data Wrangling with Hub Search and ohvbd. By the end of this training, you should be able to:
- Use VBD Hub resources to access shared VBD datasets.
- Effectively wrangle real-world VBD datasets ready for further analysis.
- Build collaborative, professional connections within the VBD community.
In the Pre- Live Session content, you will have seen links to recap materials and cheat sheets. Feel free to use these if you need any reminders. If you need additional support, the Forum is a good first port of call where you can discuss queries with fellow participants. Our demonstrators will keep an eye on the chat during this call and can provide more support during tasks.
The written version of this content is now available on the VBD Hub website if you wish to follow along with this format. These written materials will be available for you to access in future, including the code examples. You are welcome to follow along with the walkthrough code in this Live Session, but there is no pressure, and you can have a go at the code yourself later.
If you have any technical difficulties or lose connection, try joining the meeting again when you can. If you need technical support, please contact support@vbdhub.org (note: this is only for technical support, not statistical support or questions on the course content).
We have breaks scheduled into this session, but if you need to step away for a few minutes at any point, feel free to do so quietly.
8.3 Recap Pre- Live Session Content
In the Pre- Live Session content, we covered:
- Navigating the VBD Hub website and where to find key resources.
- Using Hub Search and the ohvbd package to search and retrieve data.
- Discussing various topics with the VBD community on the VBD Hub Forum.
- Applying data wrangling principles to real-world datasets.
During the tasks, we tried identifying patterns and details in the datasets and proposing hypotheses from our visualisations:
- Hub Search: How many results did your search for abundance data on Ixodes ricinus return?
- Hub Search: What is the name, publication date, and source database of your selected dataset?
- Data wrangling: What data types did you see in the ohvbd dataset?
It is normal if you found some of this content challenging. It can be tricky when you are getting your head around new resources.
In the Pre-Live Session content, we used search_hub() in the ohvbd package to search and retrieve datasets. This is the base function for an ohvbd search, and is useful for data exploration. However, if you know what you are looking for, you can make this process more efficient by specifying the source database directly in the search:
search_hub("ixodes ricinus", db = "vt")
This approach is typically significantly faster as it avoids retrieving unnecessary GBIF metadata.
Tip: You can write ?search_hub or ?fetch_vt in the R console to access guide documents if you need extra help using ohvbd.
8.4 Real World Data is Messy
Real-world VBD data is rarely clean, especially if it has been collected for different purposes or recorded using different reporting standards.
Before making any changes to your datasets, it is important to understand how your dataset is formatted so you can apply the most appropriate wrangling and cleaning approaches. When working with VBD datasets, you may come across:
- Missing values
- Inconsistent data types
- Poorly named columns
- Duplicate records
- Data stored in inconvenient formats.
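Several of these issues can be spotted with a few base R checks before any changes are made. The sketch below is only an illustration: the data frame vbd_raw and its column names are hypothetical, not taken from a VBD Hub dataset.

```r
# Hypothetical example data showing several common problems
vbd_raw <- data.frame(
  Species.Name = c("Ixodes ricinus", "Ixodes ricinus", "Culex pipiens", NA),
  count = c("12", "12", "seven", "3"),  # numbers stored as text
  site = c("A", "A", "B", "B")
)

# Structure and data types of every column
str(vbd_raw)

# Number of missing values in each column
n_missing <- colSums(is.na(vbd_raw))

# Number of rows that exactly repeat an earlier row
n_duplicates <- sum(duplicated(vbd_raw))

# Text values that cannot be converted to numbers
as_number <- suppressWarnings(as.numeric(vbd_raw$count))
non_numeric <- vbd_raw$count[is.na(as_number)]
```

Here n_missing, n_duplicates, and non_numeric give a quick summary of what needs a decision, without altering the original data.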
Rather than viewing each of these issues separately, it can be useful to recognise common patterns. For instance:
- Inconsistent naming - affects merging
- Mixed data types - affects calculations
- Wide format - affects analysis and visualisations
When we think about data inconsistency patterns in this way, we shift our focus from problems to decisions. Each issue you identify in your dataset requires a decision:
- Should missing values be removed or retained?
- Should columns be renamed or merged?
- Should data be formatted from wide to long format?
When we stop asking “what is wrong with this dataset?” and start thinking “what do I need this dataset to do?”, we can make informed decisions and prioritise the data wrangling approaches most appropriate to our data and research question.
8.5 Data Wrangling Principles
Data wrangling is a broad topic, and there is no single “correct” way to wrangle data. The methods you choose to wrangle your data will depend on your research question, the type of data you are working with, and how it is formatted.
When we access VBD data using tools like Hub Search or the ohvbd package, we typically retrieve data from different sources. These sources might use slightly different formats, names, and details, and therefore need to be wrangled before we use them for analysis.
People typically think of data wrangling as a list of set steps to work through. Try to reframe data wrangling as a process of making your data fit for your research - the goal is to make the data usable and reliable for your own specific analysis.
Note: You might repeat some data wrangling principles across different datasets, but for each new dataset, try to consider how you want the data to look for the analysis you are planning to use.
There are numerous ways to wrangle your data, including filtering rows, converting data types, handling missing values, and standardising units.
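As a rough illustration of these four operations, here is a minimal dplyr sketch. The data frame trap_data, its column names, and the choice to drop missing counts are all assumptions made for this example, and your own data may call for different decisions.

```r
library(dplyr)

# Hypothetical trap data: counts stored as text, temperature in Fahrenheit
trap_data <- data.frame(
  species = c("culex pipiens", "aedes aegypti", "culex pipiens", "aedes aegypti"),
  count = c("5", "10", NA, "7"),
  temp_f = c(68, 77, 59, 86)
)

wrangled <- trap_data |>
  mutate(count = as.integer(count)) |>     # convert data type
  filter(!is.na(count)) |>                 # handle missing values (here: remove)
  filter(species == "aedes aegypti") |>    # filter rows of interest
  mutate(temp_c = (temp_f - 32) * 5 / 9)   # standardise units to Celsius

wrangled
```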
We cannot cover every data wrangling principle within a single training session. Today, we will focus on two methods commonly applied to VBD data:
- Merging data.
- Cleaning species names.
8.6 Merging Datasets
Often, research workflows incorporate more than one dataset as it is rare for a single dataset to contain all the information you need to answer your research question. For instance, you might have one dataset on species abundance data and another on environmental variables. If you want to analyse how the environment influences species abundance, you will likely want to combine these into a single dataset.
We call this merging or joining datasets.
For a merge to work effectively, both datasets must share at least one common column, usually with the same name in both. This is often referred to as a key. In VBD datasets, a common key might be a species name, a location, or a date.
Let’s imagine we have two datasets, dataset_a and dataset_b, which both contain a column called species. We can merge these two datasets using the left_join() function from the dplyr package, for example left_join(dataset_a, dataset_b, by = "species").
This function keeps all the rows from dataset_a and adds matching information from dataset_b where the species value is the same.
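A minimal sketch of this left join is shown here. The contents of dataset_a and dataset_b, including the vector_type column, are hypothetical and chosen only to show what happens to a species with no match.

```r
library(dplyr)

# Hypothetical datasets sharing the key column "species"
dataset_a <- data.frame(
  species = c("ixodes ricinus", "culex pipiens", "aedes aegypti"),
  abundance = c(12, 5, 10)
)
dataset_b <- data.frame(
  species = c("ixodes ricinus", "culex pipiens"),
  vector_type = c("tick", "mosquito")
)

# Keep every row of dataset_a and add vector_type where species matches
merged_data <- left_join(dataset_a, dataset_b, by = "species")

merged_data
```

In this sketch, aedes aegypti has no match in dataset_b, so its row is kept and vector_type is filled with NA.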
There are different types of merges or joins, each of which acts slightly differently:
- A left join keeps all the rows from the first dataset and adds matching values from the second. This is a safe choice when you do not want to lose data.
- An inner join, inner_join(), only keeps the rows that appear in both datasets. Useful when you are only interested in complete matches, but risks accidental data loss.
- A full join, full_join(), keeps all the rows from both datasets, filling in missing values where matches do not exist. This can be helpful in exploratory work but may require further downstream data cleaning.
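To see how the three joins differ, it can help to compare the number of rows each one returns on the same pair of datasets. This is a small sketch with hypothetical data in which only two species appear in both datasets.

```r
library(dplyr)

# Hypothetical datasets: only two species appear in both
dataset_a <- data.frame(
  species = c("aedes aegypti", "culex pipiens", "ixodes ricinus"),
  abundance = c(10, 5, 12)
)
dataset_b <- data.frame(
  species = c("culex pipiens", "ixodes ricinus", "anopheles gambiae"),
  habitat = c("wetland", "woodland", "rural")
)

# All rows of dataset_a are kept
left_rows <- nrow(left_join(dataset_a, dataset_b, by = "species"))

# Only species present in both datasets are kept
inner_rows <- nrow(inner_join(dataset_a, dataset_b, by = "species"))

# Every species from either dataset is kept
full_rows <- nrow(full_join(dataset_a, dataset_b, by = "species"))

c(left = left_rows, inner = inner_rows, full = full_rows)
```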
Choosing which merge to use depends on your research question and how you want to format rows that don’t align across datasets.
We can also merge by multiple columns when a single column is not enough to uniquely identify a match. For example, in VBD research, we might need to merge by species and location, using by = c("species", "location").
When we set multiple key columns, we ensure that matches only occur when both the species and the location align. This is particularly useful when working with ecological or epidemiological data, where the species might appear in multiple regions and should be accounted for with this in mind.
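The effect of using one key versus two can be sketched with two small hypothetical datasets, here called abundance and climate. The names and values are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical datasets where the same species occurs at two sites
abundance <- data.frame(
  species = c("aedes aegypti", "aedes aegypti", "culex pipiens"),
  location = c("site1", "site2", "site1"),
  abundance = c(10, 15, 20)
)
climate <- data.frame(
  species = c("aedes aegypti", "aedes aegypti", "culex pipiens"),
  location = c("site1", "site2", "site1"),
  temperature = c(25, 28, 22)
)

# Matching on both keys: each abundance row gets exactly one temperature
by_both <- left_join(abundance, climate, by = c("species", "location"))

# Matching on species only: every aedes aegypti row matches both sites.
# Recent dplyr versions may warn about a many-to-many relationship here.
by_species <- suppressWarnings(left_join(abundance, climate, by = "species"))

nrow(by_both)
nrow(by_species)
```

With species alone as the key, the result gains extra rows that pair abundance at one site with temperature from another, which is rarely what we want.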
Merging is usually straightforward, but it can become tricky when we assume the key represents the same thing in both datasets while the data actually contains inconsistencies. For example, if dataset_a formats species names as "Ixodes ricinus" and dataset_b formats species names as "ixodes_ricinus", R will not identify these species as a match for merging.
Frequent mistake: A successful merge does not always guarantee a correct merge. Even if your code runs without errors, the result may not be what you were aiming for. It is important to check whether the new, merged dataset makes sense. For instance, has the number of rows changed drastically? Are there missing values in the new columns? Do the matches look correct?
A useful quick check is to compare the number of rows in the original dataset and the new, merged dataset using nrow().
A significant increase in the number of rows might suggest duplicate matches, and a significant decrease indicates you might have lost data, which can happen with an inner join but not with a left join.
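A row-count check along these lines might look like the sketch below. The datasets are hypothetical, and the duplicated key in habitat is deliberate so that the check has something to find.

```r
library(dplyr)

# Hypothetical original dataset
original <- data.frame(
  species = c("aedes aegypti", "culex pipiens"),
  abundance = c(10, 5)
)

# Hypothetical second dataset with an accidental duplicate key
# and no entry for culex pipiens
habitat <- data.frame(
  species = c("aedes aegypti", "aedes aegypti"),
  habitat = c("urban", "peri-urban")
)

merged <- left_join(original, habitat, by = "species")

# Compare row counts before and after the merge
rows_before <- nrow(original)
rows_after <- nrow(merged)

# Count rows where nothing was matched
unmatched <- sum(is.na(merged$habitat))

c(before = rows_before, after = rows_after, unmatched = unmatched)
```

The row count grows from 2 to 3 because aedes aegypti matched twice, and one row has no habitat value at all. Both are signs that the key needs a closer look.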
8.7 Cleaning Species Names
Species names are one of the most common causes of inconsistency when merging datasets from multiple sources.
Small formatting differences can prevent datasets from merging correctly or cause inaccuracies in later analyses. Mismatched species names are usually caused by:
- Differences in capitalisation - "Ixodes ricinus" or "IXODES RICINUS".
- Using spaces or underscores - "ixodes ricinus" or "ixodes_ricinus".
- Extra text, such as “spp.” - "Ixodes ricinus spp.".
- Duplicate rows for the same species.
Although we know these all represent the same species, R will recognise each differently formatted name as a different value.
A good starting point when standardising species names is setting all text to lowercase, for example with tolower(). We can also make sure species names in our data don’t have any unwanted characters, such as underscores, or additional text, such as “spp.”, by replacing them with gsub(). Finally, we can remove any extra spaces, for example with trimws().
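These cleaning steps can be sketched on a plain character vector using base R only. The vector raw_names is a hypothetical example, and the exact patterns you need will depend on your own data.

```r
# Hypothetical species names with inconsistent formatting
raw_names <- c(
  "Ixodes ricinus",
  "IXODES RICINUS",
  "ixodes_ricinus",
  "Ixodes ricinus spp.",
  "  Ixodes  ricinus "
)

# Set all text to lowercase
clean_names <- tolower(raw_names)

# Replace underscores with spaces
clean_names <- gsub("_", " ", clean_names)

# Remove the additional text " spp."
clean_names <- gsub(" spp\\.", "", clean_names)

# Trim leading and trailing spaces, then collapse repeated spaces
clean_names <- gsub("\\s+", " ", trimws(clean_names))

unique(clean_names)
```

After these steps, all five values collapse to the single name "ixodes ricinus".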
If species names are not standardised across our datasets:
- Merges between datasets could fail.
- Duplicate species entries might be created.
- Analyses might produce inaccurate results.
8.7.1 Example
Let’s imagine we have used ohvbd to retrieve a dataset on mosquito abundance, mosquito_abundance_data, and another dataset on mosquito habitats, mosquito_habitat_data, in order to analyse patterns of abundance dependent on habitat type:
mosquito_abundance_data
#>              species location abundance
#> 1      Aedes aegypti    Site1        10
#> 2      Culex pipiens    Site1         5
#> 3      aedes_aegypti    Site2        12
#> 4 Culex pipiens spp.    Site2         7
mosquito_habitat_data
#>         species location habitat
#> 1 aedes aegypti    Site1   urban
#> 2 culex pipiens    Site1 wetland
#> 3 aedes aegypti    Site2   urban
#> 4 culex pipiens    Site2 wetland
We can try to merge these datasets in their raw format:
library(dplyr)
merged_mosquito_data <- left_join(
  mosquito_abundance_data,
  mosquito_habitat_data,
  by = c("species", "location")
)
Let’s double check what our data looks like using head() and nrow():
head(merged_mosquito_data)
#>              species location abundance habitat
#> 1      Aedes aegypti    Site1        10    <NA>
#> 2      Culex pipiens    Site1         5    <NA>
#> 3      aedes_aegypti    Site2        12    <NA>
#> 4 Culex pipiens spp.    Site2         7    <NA>
nrow(merged_mosquito_data)
#> [1] 4
Oh dear! We can see that our merge has run, but the rows have not matched correctly. The habitat column from the dataset we wanted to combine has been added, but it contains only missing values (<NA>) because something has gone wrong in the merge.
This is because the species names are formatted differently across the two datasets, so R has treated these as different values. Although the locations match, both the location and species need to match exactly for the merge to work.
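One way to confirm which rows failed to match is dplyr's anti_join(), which returns the rows of the first dataset that have no match in the second. This is an optional diagnostic sketch rather than part of the original walkthrough. The data frames are rebuilt here from the printed tables above so that the block runs on its own.

```r
library(dplyr)

# Rebuild the example datasets from the printed tables
mosquito_abundance_data <- data.frame(
  species = c("Aedes aegypti", "Culex pipiens", "aedes_aegypti", "Culex pipiens spp."),
  location = c("Site1", "Site1", "Site2", "Site2"),
  abundance = c(10, 5, 12, 7)
)
mosquito_habitat_data <- data.frame(
  species = c("aedes aegypti", "culex pipiens", "aedes aegypti", "culex pipiens"),
  location = c("Site1", "Site1", "Site2", "Site2"),
  habitat = c("urban", "wetland", "urban", "wetland")
)

# Rows of the abundance data with no matching species and location
unmatched_rows <- anti_join(
  mosquito_abundance_data,
  mosquito_habitat_data,
  by = c("species", "location")
)

# Species spellings that appear in one dataset but not the other
unmatched_names <- setdiff(
  unique(mosquito_abundance_data$species),
  unique(mosquito_habitat_data$species)
)

nrow(unmatched_rows)
unmatched_names
```

All four abundance rows are returned as unmatched, which points to the key columns rather than the join itself.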
Let’s have a look at our original data:
head(mosquito_abundance_data)
#>              species location abundance
#> 1      Aedes aegypti    Site1        10
#> 2      Culex pipiens    Site1         5
#> 3      aedes_aegypti    Site2        12
#> 4 Culex pipiens spp.    Site2         7
head(mosquito_habitat_data)
#>         species location habitat
#> 1 aedes aegypti    Site1   urban
#> 2 culex pipiens    Site1 wetland
#> 3 aedes aegypti    Site2   urban
#> 4 culex pipiens    Site2 wetland
We can see some inconsistencies with our species names, including:
- Difference in capitalisation
- Using underscores instead of spaces
- Additional text
Let’s make sure all our species names are lowercase and remove unwanted formatting:
clean_mosquito_abundance_data <- mosquito_abundance_data |>
  mutate(species = tolower(species)) |>
  mutate(species = gsub("_", " ", species)) |>
  mutate(species = gsub(" spp\\.", "", species))
clean_mosquito_habitat_data <- mosquito_habitat_data |>
  mutate(species = tolower(species)) |>
  mutate(species = gsub("_", " ", species)) |>
  mutate(species = gsub(" spp\\.", "", species))
Now that our species names are consistent, we can try merging again:
merged_mosquito_data <- left_join(
  clean_mosquito_abundance_data,
  clean_mosquito_habitat_data,
  by = c("species", "location")
)
head(merged_mosquito_data)
#>         species location abundance habitat
#> 1 aedes aegypti    Site1        10   urban
#> 2 culex pipiens    Site1         5 wetland
#> 3 aedes aegypti    Site2        12   urban
#> 4 culex pipiens    Site2         7 wetland
nrow(merged_mosquito_data)
#> [1] 4
We can see that now our species names are consistent, the merge has worked as expected. Each row has been correctly matched by both species and locations, so the habitat data has been correctly combined with the abundance data. The dataset is now in a usable format for further wrangling and analysis.
Frequent mistake: People often only clean one dataset before merging, but data for the key column needs to be consistent across both datasets for the merge to be successful.
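One way to reduce the risk of cleaning only one dataset is to wrap the cleaning steps in a small helper function and apply it to both. The helper name clean_species() and the two small data frames below are hypothetical, a sketch of the idea rather than a function from ohvbd or dplyr.

```r
library(dplyr)

# Hypothetical helper: apply the same cleaning steps to any species column
clean_species <- function(x) {
  x <- tolower(x)               # lowercase
  x <- gsub("_", " ", x)        # underscores to spaces
  x <- gsub(" spp\\.", "", x)   # drop " spp."
  trimws(x)                     # trim surrounding spaces
}

# Hypothetical datasets with differently formatted species names
abundance <- data.frame(
  species = c("Aedes aegypti", "Culex pipiens spp."),
  abundance = c(10, 7)
)
habitat <- data.frame(
  species = c("AEDES_AEGYPTI", "culex pipiens"),
  habitat = c("urban", "wetland")
)

# Clean the key column in BOTH datasets before joining
merged <- left_join(
  mutate(abundance, species = clean_species(species)),
  mutate(habitat, species = clean_species(species)),
  by = "species"
)

merged
```

Because the same function is applied to both key columns, any change to the cleaning rules automatically applies to both datasets.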
So far, we have used straightforward approaches such as converting all text to lowercase and removing consistent unwanted characters like “spp.”. These techniques are useful starting points, but species names in real VBD datasets often contain more complex inconsistencies.
More complex cases might involve:
- Abbreviated genus names - "I. ricinus".
- Additional descriptors contributing to varied additional text - "Ixodes ricinus aff." or "Ixodes ricinus cf.".
- Mixed formatting within the same column.
In these cases, you may need to think critically about what formatting should be retained and what should be removed, and cleaning the data may require more steps. For instance, imagine a dataset where the species column contained these values:
- "Ixodes ricinus"
- "Ixodes_ricinus"
- "Ixodes ricinus spp."
- "IXODES RICINUS"
- "I. ricinus"
If we apply our earlier name cleaning techniques, we might standardise most of these species name formats, but "I. ricinus" would remain as is, and R would process this as a different value.
In situations like this, where not all inconsistencies can be solved with simple text replacement, you will likely need to inspect unique values in your dataset:
unique(clean_data$species)
This allows you to see all the distinct values in the species column, so that you can identify inconsistent formats, such as "I. ricinus", and decide on a consistent naming format for your data.
Specific cases might need to be manually recoded:
clean_data <- clean_data |>
  mutate(species = ifelse(species == "i. ricinus", "ixodes ricinus", species))
As species columns are typically used as a key when merging in VBD research, ensuring species names are consistent across our datasets can allow us to merge multiple datasets more confidently and reliably.
Tip: Automating our workflow feels much easier, but it is important to balance this with manual reviewing to ensure you don’t miss specific cases like these. To support this balance, we can use a three-step approach:
- Apply broad cleaning techniques - ensure all text is lowercase, remove underscores, remove common additional text.
- Inspect the results using unique().
- Manually correct any remaining inconsistencies.
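The three steps can be sketched end to end as follows. The data frame raw_data and the corrections lookup are hypothetical. Using a named vector for manual corrections is one possible design, chosen here because it extends easily when more than one special case turns up.

```r
library(dplyr)

# Hypothetical dataset containing the species values discussed above
raw_data <- data.frame(
  species = c("Ixodes ricinus", "Ixodes_ricinus", "Ixodes ricinus spp.",
              "IXODES RICINUS", "I. ricinus")
)

# Step 1: apply broad cleaning techniques
clean_data <- raw_data |>
  mutate(
    species = tolower(species),
    species = gsub("_", " ", species),
    species = gsub(" spp\\.", "", species)
  )

# Step 2: inspect the results
remaining <- unique(clean_data$species)
remaining

# Step 3: manually correct what is left, using a lookup of
# "value found" = "value wanted" (hypothetical entries)
corrections <- c("i. ricinus" = "ixodes ricinus")

clean_data <- clean_data |>
  mutate(species = ifelse(
    species %in% names(corrections),
    unname(corrections[species]),
    species
  ))

unique(clean_data$species)
```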
8.7.2 (optional) Resolving species names using GBIF
When using large or messy datasets, an alternative approach you might choose is to match your species names against a recognised taxonomic database.
The rgbif package can be used to help standardise species names using the Global Biodiversity Information Facility (GBIF). We can use the name_backbone() function to try to match your species name input to a standardised species name in the GBIF backbone taxonomy, for example name_backbone(name = "Ixodes ricinus").
This approach can help to identify spelling or formatting errors and synonymous species names when simple cleaning techniques are insufficient for your dataset.
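A minimal, hedged sketch of such a lookup is given here. It assumes the rgbif package is installed and that an internet connection is available, and it falls back to NULL when either is missing. The exact columns returned depend on the rgbif version and the GBIF service, so the result is simply printed for inspection.

```r
# Requires the rgbif package and internet access; otherwise stays NULL
gbif_match <- NULL

if (requireNamespace("rgbif", quietly = TRUE)) {
  gbif_match <- tryCatch(
    rgbif::name_backbone(name = "Ixodes ricinus"),
    error = function(e) NULL  # e.g. no network connection
  )
}

# Inspect the match returned by GBIF (if any)
print(gbif_match)
```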
Note: In this training session, we will focus on using the cleaning techniques we discussed earlier, rather than using GBIF.
8.8 Planning a Wrangling Workflow
We have discussed how to reframe data wrangling as a process, rather than a rigid set of steps.
When you open a new dataset, it can be tempting to start making changes immediately. To resist this temptation, it helps to have a practical workflow in mind that still leaves room to adapt to your specific dataset and research questions.
A practical but flexible workflow:
- Inspect the data - understand the structure, variables, and data types.
- Identify any issues - look for inconsistencies, missing values, and formatting problems.
- Prioritise tasks - decide which issues are most important for your analysis.
- Apply cleaning steps - after you understand the data, use appropriate data wrangling approaches.
- Check results - ensure the changes you have made have worked in the way you expected.
Without a clear workflow, it is easy to lose track of the changes you have made, and introduce new errors, especially if you don’t check your results.
Approaching data wrangling in this way helps to ensure your work is reproducible, efficient, and aligned with your research aims.
Tip: To keep track of your workflow, add comments to your code explaining what changes you made and why:
# Convert data to all lowercase to ensure consistency across datasets.
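Putting the workflow and the commenting tip together, a script might be laid out roughly like this. The dataset raw_data and the priorities noted in the comments are hypothetical, and this is a sketch of the structure rather than a template that fits every dataset.

```r
library(dplyr)

# Hypothetical raw dataset
raw_data <- data.frame(
  species = c("Culex pipiens", "culex_pipiens", "Culex pipiens", NA),
  abundance = c("5", "7", "5", "3")
)

# 1. Inspect the data: structure, variables, and data types
str(raw_data)

# 2. Identify any issues: inconsistent names and missing values
unique(raw_data$species)
sum(is.na(raw_data$species))

# 3. Prioritise tasks: for this example we assume consistent species
#    names and numeric abundance matter most for the analysis

# 4. Apply cleaning steps
clean_data <- raw_data |>
  # Remove rows without a species name, as they cannot be merged by species
  filter(!is.na(species)) |>
  # Convert names to lowercase with spaces for consistency across datasets
  mutate(species = gsub("_", " ", tolower(species))) |>
  # Abundance was stored as text, so convert it to numeric for calculations
  mutate(abundance = as.numeric(abundance))

# 5. Check results: one species spelling, numeric abundance
unique(clean_data$species)
is.numeric(clean_data$abundance)
```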
8.9 Collaborative Task
Let’s have a go at applying what we have learnt so far by working together in breakout rooms. Each group will be given a short example of data wrangling code, along with a small dataset. The code contains errors or issues for you to work collaboratively to identify and fix.
Together, you will:
- Identify what the code is trying to do.
- Discuss how you might approach and debug any errors or problematic code.
- Edit and improve the code so that it runs without errors and produces effective results.
- Prepare to give a brief summary on why your group made those changes when we return to the main meeting room.
To work productively as a group, you might choose to delegate responsibilities, for example, one person might run the code, one might take notes, and one might guide the discussion. There will be a demonstrator in each group to support your work and answer any questions you might have.
The aim of this activity is not to produce perfect code or script, but to apply what you have learnt so far and think critically about data wrangling principles. Reflecting on your decisions is an important part of developing effective data wrangling skills.
After completing the task in your breakout room, we will join the main room again. Each group will be able to share the changes they made to wrangle their dataset, and why they made these decisions.
8.9.1 Example 1
# Dataset A
data_a <- data.frame(
  species = c("aedes aegypti", "culex pipiens", "anopheles gambiae"),
  abundance = c(10, 25, 5)
)
# Dataset B
data_b <- data.frame(
  Species = c("aedes aegypti", "culex pipiens", "anopheles gambiae"),
  trait = c("urban", "rural", "rural")
)
# Merge datasets
merged_data <- left_join(data_a, data_b, by = "Species")
8.9.3 Example 3
# Dataset A
data_a <- data.frame(
  species = c("aedes aegypti", "culex pipiens", "anopheles gambiae"),
  abundance = c(10, 25, 5)
)
# Dataset B
data_b <- data.frame(
  species = c("aedes aegypti", "culex pipiens"),
  trait = c("urban", "rural")
)
# Merge datasets
merged_data <- inner_join(data_a, data_b, by = "species")
8.9.4 Example 4
# Dataset A
data_a <- data.frame(
  species = c("aedes aegypti", "aedes aegypti", "culex pipiens"),
  location = c("site1", "site2", "site1"),
  abundance = c(10, 15, 20)
)
# Dataset B
data_b <- data.frame(
  species = c("aedes aegypti", "culex pipiens"),
  location = c("site1", "site1"),
  temperature = c(25, 22)
)
# Merge datasets
merged_data <- left_join(data_a, data_b, by = "species", "location")
8.11 What Does Collaboration Mean to You?
So far in this session, we have focused on applied data wrangling skills in the context of VBD datasets. These skills are important when we retrieve data from VBD Hub resources such as Hub Search and ohvbd.
However, in the Pre- Live Session content we also explored the VBD Hub Forum, and introduced the idea of collaboration.
Collaboration refers to working with others to support research and improve outcomes. It can take many forms, including:
- Sharing datasets
- Providing feedback on analysis
- Working together on joint papers or presentations
- Contributing to community discussions.
Different people want different things from collaborations. Some might be looking to share their data or resources, others might want support with their analysis, and others might be interested in developing long-term research networks.
With this in mind, let’s hear from you:
- Why do you want to collaborate?
- What do you think makes a collaboration successful?
- What challenges might you expect in collaborative work?
Good collaborations often involve clear communication, shared goals, and transparency in methods and data.
Collaborations can be challenging when expectations are unclear, record keeping is limited, or communication is poor.
The VBD Hub is built to support collaborations within the VBD community by providing resources to access shared datasets and a Forum for discussion and knowledge sharing.
How might you use the VBD Hub Forum to support your own research or collaborations?
Combining the ability to use resources such as Hub Search and ohvbd, applied data wrangling skills, and the value of collaboration will prepare you to set up effective workflows for your own independent and collaborative future research.
8.12 Preparing for the Challenge Task
The final session of this training will provide an opportunity for you to independently apply the skills and concepts discussed throughout the Pre- and Live Session content, including:
- Navigating the VBD Hub website and where to find key resources.
- Using Hub Search and the ohvbd package to search and retrieve data.
- Data wrangling techniques commonly applied to VBD datasets, including merging and fixing species names.
- Applying data wrangling principles to real-world datasets.
- Using the VBD Hub Forum and collaborating within the VBD community.
The Challenge Task will have multiple levels and is designed to encourage applied thinking. Feel free to work through the levels that apply to you, but we encourage you to try all levels to make the most of the training.
During the Challenge Task, we encourage you to experiment with different approaches and discuss potential difficulties with each other via the VBD Hub Forum. Our demonstrators and I will be monitoring the Forum if you need any additional support.
We encourage you to have a go at the task on your own, but a walkthrough version will be released after a few hours, should you need additional guidance.
8.13 Conclusion
Throughout this workshop, we have explored how to search, retrieve, and wrangle VBD data so that we have a better understanding of the datasets we are working with, ready for effective further analysis.
We began by retrieving datasets with Hub Search and ohvbd, and applying foundational data wrangling principles, such as using informative names and converting from wide to long format. We then introduced two data wrangling techniques commonly used in VBD research, merging and fixing species names, and worked together to apply these to real VBD datasets. Finally, we discussed effective collaborations, and how these can be supported by using the VBD Hub Forum.
Effective data wrangling and collaboration are valuable skills for researchers working with complex datasets. By carefully considering how data are formatted, we can ensure our work is reproducible and suitable for further analyses. Clear, consistent and reproducible datasets support positive collaborations by allowing everyone involved to understand the data used to answer the group’s research questions.