Any alternatives to the ifelse?

2

I am facing a problem that I do not know how to solve. I have 2 (very large) dataframes of DIFFERENT dimensions that contain meteorological station information on a daily basis for 15 years. The problem I have is that the same station has a different name in each df and I must unify everything in a single df. I will try to explain myself with a tiny example:

df1 # Este df tiene "escritas" todas las fechas posibles (2000-01-01 al 2015-31-12), pero realmente sólo registró Temp y humedad del 2000-01-01 al 2008-31-12

ID       Lat      Lon         Alt     Fecha          Temp    Humedad
C065    43.2112   -3.40713    677     2000-01-01     25       50
C065    43.2112   -3.40713    677     2000-02-01     12       55
C065    43.2112   -3.40713    677     2000-03-01     18       58 
C065    43.2112   -3.40713    677     2015-01-01     9999     9999 #Esto significa que no tiene registro
C054    43.0423   -2.65763    556     2000-01-01     12       35
C054    43.0423   -2.65763    556     2000-02-01     9        35
C054    43.0423   -2.65763    556     2000-03-01     10       38
C054    43.0423   -2.65763    556     2015-01-01     9999     9999
C035    42.9661   -2.86795    618     2000-12-13     2        25
C035    42.9661   -2.86795    618     2000-12-14     4        22
C035    42.9661   -2.86795    618     2000-12-14     8        27

df2 # Este df tiene "escritas" todas las fechas posibles (2000-01-01 al 2015-31-12), pero realmente sólo registró Temp y humedad del 2009-01-01 al 2015-31-12  

ID       Lat      Lon         Alt     Fecha          Temp    Humedad
GH65    43.2112   -3.40713    677     2000-01-01     9999    9999
GH65    43.2112   -3.40713    677     2009-01-01     35       60
GH65    43.2112   -3.40713    677     2009-02-01     32       55
GH65    43.2112   -3.40713    677     2015-01-01     28       58 
GH54    43.0423   -2.65763    556     2015-01-12     22       62
GH54    43.0423   -2.65763    556     2015-02-12     28       65
GH54    43.0423   -2.65763    556     2015-03-12     30       68

As it is observed, C065 and GH65 are the same station, since they are in the same coordinates. Now, what I have to do is to unify with a single name those stations that coincide in coordinates, either by calling them C065 or GH 65. I should also be able to make them complement, that is, from C065 I only take the real records (2000 to 2008) and of the GH65 also (2009-2015) It had occurred to me to do it using ifelse

df1$ID2 <- ifelse(df1$Lat == df2$Lat & df1$Lon == df2$Lon & df1$Alt == df2$Alt, df1$ID ,"NO")

but since the two df do not have the same number of records, it will not work. Any suggestions to solve it? I'll have to make a loop?

I hope I have understood, I know it is half complicated to explain the data I have and what I want to do.

Thanks in advance for any ideas.

Caro

    
asked by Caro 04.09.2018 в 10:51
source

2 answers

1

First, we reproduce your example:

df1 <- read.table(text= "ID       Lat      Lon         Alt     Fecha          Temp    Humedad
C065    43.2112   -3.40713    677     2000-01-01     25       50
C065    43.2112   -3.40713    677     2000-02-01     12       55
C065    43.2112   -3.40713    677     2000-03-01     18       58 
C065    43.2112   -3.40713    677     2015-01-01     9999     9999
C054    43.0423   -2.65763    556     2000-01-01     12       35
C054    43.0423   -2.65763    556     2000-02-01     9        35
C054    43.0423   -2.65763    556     2000-03-01     10       38
C054    43.0423   -2.65763    556     2015-01-01     9999     9999
C035    42.9661   -2.86795    618     2000-12-13     2        25
C035    42.9661   -2.86795    618     2000-12-14     4        22
C035    42.9661   -2.86795    618     2000-12-14     8        27", header = T, stringsAsFactors = F)

df2 <- read.table(text= "ID       Lat      Lon         Alt     Fecha          Temp    Humedad
GH65    43.2112   -3.40713    677     2000-01-01     9999    9999
GH65    43.2112   -3.40713    677     2009-01-01     35       60
GH65    43.2112   -3.40713    677     2009-02-01     32       55
GH65    43.2112   -3.40713    677     2015-01-01     28       58 
GH54    43.0423   -2.65763    556     2015-01-12     22       62
GH54    43.0423   -2.65763    556     2015-02-12     28       65
GH54    43.0423   -2.65763    556     2015-03-12     30       68", header = T, stringsAsFactors = F)

Then, using tidyverse the idea is:

  • Combine the two data.frame into one only
  • Build the new ID, for this we create a new df with Lat , Lon and Alt , the new ID will be the first one that appears in each group
  • Finally we add the new ID to the data

Solution:

library(tidyverse)
library(lubridate)

# unimos ambos df en uno nuevo
# y de paso convertimos Fecha a Date
df1 %>%
    union(df2) %>%
    mutate(Fecha = as.Date(Fecha)) %>%
    as.tibble() -> newdf

# Creamos un nuevo df para generar los grupos
# Generamos un nuevo id con el primer valor de cada grupo
newdf %>%
    group_by(Lat, Lon, Alt) %>%
    summarise(newid=first(ID)) -> grupos

# Genermos el df final con el nuevo id
newdf %>%
    left_join(grupos, by=c("Lat", "Lon", "Alt")) %>%
    select(newid, everything()) %>%
    arrange(newid) -> newdf

The new data.frame will have a new column newid that represents the ID unified, for example:

newdf %>%
    filter(newid=='GH65')

# A tibble: 8 x 8
  newid ID      Lat   Lon   Alt Fecha       Temp Humedad
  <chr> <chr> <dbl> <dbl> <int> <chr>      <int>   <int>
1 GH65  GH65   43.2 -3.41   677 2015-01-01    28      58
2 GH65  GH65   43.2 -3.41   677 2009-02-01    32      55
3 GH65  GH65   43.2 -3.41   677 2009-01-01    35      60
4 GH65  GH65   43.2 -3.41   677 2000-01-01  9999    9999
5 GH65  C065   43.2 -3.41   677 2015-01-01  9999    9999
6 GH65  C065   43.2 -3.41   677 2000-03-01    18      58
7 GH65  C065   43.2 -3.41   677 2000-02-01    12      55
8 GH65  C065   43.2 -3.41   677 2000-01-01    25      50

Filtering later by date is trivial using the functionality of tidyverse/dplyr and thanks to that we have previously converted Fecha to Date :

newdf %>%
    filter(year(Fecha) >= 2009 & year(Fecha) <= 2015 & ID == 'GH65')
    
answered by 07.09.2018 в 02:29
0

Using the sqldf package can be solved with the convenient SQL syntax to do JOIN.

library("sqldf")
dfout <- sqldf("SELECT * FROM df1 JOIN df2 ON df1.Lat=df2.Lat AND df1.Lon=df2.Lon AND df1.Alt=df2.Alt")

The problem is that you have several entries on the same date of the same station, so it will not do well the JOIN.

    
answered by 04.09.2018 в 17:01