Plotting date intervals in ggplot2

I have a dataset which has a bunch of date intervals (i.e. POSIXct format start dates and end dates).

In the example provided, let’s say it’s each period is associated to when someone was in school or out of school. I’m interested in plotting the data in ggplot2, each row is essentially data for one period. Currently all of the rows don’t have a factor variable, but I’ve put one in the example as it may make things easier to plot. It’s worth noting that in some cases the end date of one period and the beginning of the next overlap.

In the data, each row is a unique stint in school associated to a specific period. I’m interested in creating a sequence of weeks (from the first week to the last week in dataset) in the x axis and on the y axis I want just either a dot for each week to signify whether the person was in school (also identifying which stint) or out of school (even a gap perhaps would suffice). Thus perhaps an 8 level factor is needed in this case, one for each period, and a level for out of school (or perhaps no level is needed for when out of school)?

So in this case we could envisage having 7 rows of dots on the y axis, something (very loosely) like this (this example has many gap in lines, but I expect few or no gaps).

enter image description here

I envisaged the process to be something like: create a sequence from min(start_date) to max(end_date), join rows to this. Then somehow identify each period and create a factor variable for each period. Then plot the factor variable (e.g. period1, period2, period3) against the sequence of dates. I haven’t been able to do this though as it’s quite fiddly.

Looking at the lubridate package I was thinking that using interval() and %within% might be the solution but I wasn’t sure.

library(tidyverse)
library(lubridate)
                              
start_dates = ymd_hms(c("2019-05-08 00:00:00",
                        "2020-01-17 00:00:00",
                        "2020-03-03 00:00:00",
                        "2020-05-28 00:00:00",
                        "2020-12-10 00:00:00",
                        "2021-05-07 00:00:00",
                        "2022-01-04 00:00:00"), tz = "UTC")
  
end_dates = ymd_hms(c( "2019-10-24 00:00:00",
                       "2020-03-03 00:00:00", 
                       "2020-05-28 00:00:00",
                       "2020-12-10 00:00:00",
                       "2021-05-07 00:00:00",
                       "2022-01-04 00:00:00",
                       "2022-01-19 00:00:00"), tz = "UTC") 

df = data.frame(studying = paste0("period",seq(1:7),sep = ""),start_dates,end_dates)
 

You can try

df %>% 
  ggplot() + 
   geom_segment(aes(x = start_dates, xend = end_dates, y =studying, yend = studying, color = studying), size=3) + 
  geom_segment(aes(x = start_dates, xend = start_dates, y =0, yend = studying))+
  geom_segment(aes(x = end_dates, xend = end_dates, y =0, yend = studying))

enter image description here

Per wwek as you asked in the comments

df %>% 
  as_tibble() %>%
  mutate(start = week(start_dates),
         end = week(end_dates)) %>% 
  mutate(gr = start>end, 
         start_2 = ifelse(gr, 0, NA),
         end_2 = ifelse(gr, end, NA),
         end = ifelse(gr, 52, end)) %>% 
  select(-2:-3, -gr) %>% 
  pivot_longer(-1) %>% 
  filter(!is.na(value)) %>% 
  separate(col = name, into = c("name", "index"), sep = "_", fill = "right") %>%  
  mutate(index = ifelse(is.na(index), 1, index)) %>% 
  pivot_wider(names_from = "name", values_from = "value") %>% 
  ggplot(aes(y=studying , yend=studying , x=start, xend=end, color=studying)) + 
   geom_segment(size = 2)

enter image description here

To get overlaps you can use the valr package. Since it is developed to find overlaps in DNA segments the data needs some transformation. Start end end are calculated using a cumsum week approach. Chrom is set to "1".

library(valr)
df %>% 
  as_tibble() %>%
  mutate(start = week(start_dates) + (year(start_dates)-min(year(start_dates)))*52,
         end = week(end_dates) + (year(end_dates)-min(year(end_dates)))*52,
         chrom="1", 
         index=1:n()) %>%  
  valr::bed_intersect(., .) %>% 
  filter(studying.x != studying.y) %>% 
  # filter duplicated intervals out
  mutate(index = paste(index.x, index.y) %>% str_split(., " ") %>% map(sort) %>% map_chr(toString)) %>% 
  filter(duplicated(index))

# A tibble: 5 x 15
  studying.x start_dates.x end_dates.x start.x end.x chrom index.x studying.y start_dates.y end_dates.y start.y end.y index.y .overlap index
  <chr>              <dbl>       <dbl>   <dbl> <dbl> <chr>   <int> <chr>              <dbl>       <dbl>   <dbl> <dbl>   <int>    <int> <chr>
1 period3       1583193600  1590624000      61    74 1           3 period2       1579219200  1583193600      55    61       2        0 2, 3 
2 period4       1590624000  1607558400      74   102 1           4 period3       1583193600  1590624000      61    74       3        0 3, 4 
3 period5       1607558400  1620345600     102   123 1           5 period4       1590624000  1607558400      74   102       4        0 4, 5 
4 period6       1620345600  1641254400     123   157 1           6 period5       1607558400  1620345600     102   123       5        0 5, 6 
5 period7       1641254400  1642550400     157   159 1           7 period6       1620345600  1641254400     123   157       6        0 6, 7 

Read more here: Source link