I have a dataset which has a bunch of date intervals (i.e. POSIXct format start dates and end dates).
In the example provided, let’s say it’s each period is associated to when someone was in school or out of school. I’m interested in plotting the data in ggplot2, each row is essentially data for one period. Currently all of the rows don’t have a factor variable, but I’ve put one in the example as it may make things easier to plot. It’s worth noting that in some cases the end date of one period and the beginning of the next overlap.
In the data, each row is a unique stint in school associated to a specific period. I’m interested in creating a sequence of weeks (from the first week to the last week in dataset) in the x axis and on the y axis I want just either a dot for each week to signify whether the person was in school (also identifying which stint) or out of school (even a gap perhaps would suffice). Thus perhaps an 8 level factor is needed in this case, one for each period, and a level for out of school (or perhaps no level is needed for when out of school)?
So in this case we could envisage having 7 rows of dots on the y axis, something (very loosely) like this (this example has many gap in lines, but I expect few or no gaps).
I envisaged the process to be something like: create a sequence from min(start_date) to max(end_date), join rows to this. Then somehow identify each period and create a factor variable for each period. Then plot the factor variable (e.g. period1, period2, period3) against the sequence of dates. I haven’t been able to do this though as it’s quite fiddly.
Looking at the lubridate package I was thinking that using interval() and %within% might be the solution but I wasn’t sure.
library(tidyverse)
library(lubridate)
start_dates = ymd_hms(c("2019-05-08 00:00:00",
"2020-01-17 00:00:00",
"2020-03-03 00:00:00",
"2020-05-28 00:00:00",
"2020-12-10 00:00:00",
"2021-05-07 00:00:00",
"2022-01-04 00:00:00"), tz = "UTC")
end_dates = ymd_hms(c( "2019-10-24 00:00:00",
"2020-03-03 00:00:00",
"2020-05-28 00:00:00",
"2020-12-10 00:00:00",
"2021-05-07 00:00:00",
"2022-01-04 00:00:00",
"2022-01-19 00:00:00"), tz = "UTC")
df = data.frame(studying = paste0("period",seq(1:7),sep = ""),start_dates,end_dates)
You can try
df %>%
ggplot() +
geom_segment(aes(x = start_dates, xend = end_dates, y =studying, yend = studying, color = studying), size=3) +
geom_segment(aes(x = start_dates, xend = start_dates, y =0, yend = studying))+
geom_segment(aes(x = end_dates, xend = end_dates, y =0, yend = studying))
Per wwek as you asked in the comments
df %>%
as_tibble() %>%
mutate(start = week(start_dates),
end = week(end_dates)) %>%
mutate(gr = start>end,
start_2 = ifelse(gr, 0, NA),
end_2 = ifelse(gr, end, NA),
end = ifelse(gr, 52, end)) %>%
select(-2:-3, -gr) %>%
pivot_longer(-1) %>%
filter(!is.na(value)) %>%
separate(col = name, into = c("name", "index"), sep = "_", fill = "right") %>%
mutate(index = ifelse(is.na(index), 1, index)) %>%
pivot_wider(names_from = "name", values_from = "value") %>%
ggplot(aes(y=studying , yend=studying , x=start, xend=end, color=studying)) +
geom_segment(size = 2)
To get overlaps you can use the valr
package. Since it is developed to find overlaps in DNA segments the data needs some transformation. Start end end are calculated using a cumsum week approach. Chrom is set to "1"
.
library(valr)
df %>%
as_tibble() %>%
mutate(start = week(start_dates) + (year(start_dates)-min(year(start_dates)))*52,
end = week(end_dates) + (year(end_dates)-min(year(end_dates)))*52,
chrom="1",
index=1:n()) %>%
valr::bed_intersect(., .) %>%
filter(studying.x != studying.y) %>%
# filter duplicated intervals out
mutate(index = paste(index.x, index.y) %>% str_split(., " ") %>% map(sort) %>% map_chr(toString)) %>%
filter(duplicated(index))
# A tibble: 5 x 15
studying.x start_dates.x end_dates.x start.x end.x chrom index.x studying.y start_dates.y end_dates.y start.y end.y index.y .overlap index
<chr> <dbl> <dbl> <dbl> <dbl> <chr> <int> <chr> <dbl> <dbl> <dbl> <dbl> <int> <int> <chr>
1 period3 1583193600 1590624000 61 74 1 3 period2 1579219200 1583193600 55 61 2 0 2, 3
2 period4 1590624000 1607558400 74 102 1 4 period3 1583193600 1590624000 61 74 3 0 3, 4
3 period5 1607558400 1620345600 102 123 1 5 period4 1590624000 1607558400 74 102 4 0 4, 5
4 period6 1620345600 1641254400 123 157 1 6 period5 1607558400 1620345600 102 123 5 0 5, 6
5 period7 1641254400 1642550400 157 159 1 7 period6 1620345600 1641254400 123 157 6 0 6, 7
Read more here: Source link