Tidying data

Introduction

Now that we know how to import our raw data, the next step is to transform it into something manageable. For this section we will use our own dataset, brca.txt, which contains analytical information on different breast cancer cases, such as patient age, clinical subtype, molecular subtype, number of lymph nodes and radiotherapy, among others (you can see all the variables in the original file).

We will also use a new package, hablar, which is included in neither the tidyverse nor base R. It provides a very useful function for changing the data type assigned to a variable, a topic we cover later in this tutorial (for more information, see the package's reference manual). We install it and then load the tidyverse and hablar packages:

`install.packages("hablar")`

`library(tidyverse)`
`library(hablar)`

With the packages loaded and the file downloaded (it uses tabs as delimiters), we read it and assign it to a variable:

`datos <- read_tsv("brca.txt")`

We can print the dataset in the console simply by typing the name of the variable:

datos

# A tibble: 11 x 1,098
   attrib_name   TCGA.3C.AAAU   TCGA.3C.AALI   TCGA.3C.AALJ   TCGA.3C.AALK   TCGA.4H.AAAK  TCGA.5L.AAT0 
   <chr>         <chr>          <chr>          <chr>          <chr>          <chr>         <chr>        
 1 years_to_bir~ 55             50             62             52             50            42           
 2 Tumor_purity  0.7886         0.6974         0.7666         0.6869         0.649         0.6501       
 3 pathologic_s~ NA             2              2              1              3             2            
 4 histological~ infiltratingl~ infiltratingd~ infiltratingd~ infiltratingd~ infiltrating~ infiltrating~
 5 number_of_ly~ 4              1              1              0              4             0            
 6 gender        female         female         female         female         female        female       
 7 radiation_th~ no             yes            no             no             no            yes          
 8 race          white          blackorafrica~ blackorafrica~ blackorafrica~ white         white        
 9 ethnicity     nothispanicor~ nothispanicor~ nothispanicor~ nothispanicor~ nothispanico~ hispanicorla~
10 Median_overa~ 1              1              0              0              0             0            
11 overall_surv~ 4047           4005           1474           1448           348           1477         
# ... with 1,091 more variables: TCGA.5L.AAT1 <chr>, TCGA.5T.A9QA <chr>, TCGA.A1.A0SB <chr>,...

To view the data as a table, we use:

`View(datos)`

A spreadsheet-style viewer should open showing the data.

Data orientation

For a dataset to be considered tidy, it must satisfy at least the following three rules:

Each variable must have its own column
Each case must have its own row
Each value must have its own cell

The first thing we notice is that our dataset is oriented the wrong way: the variables are in the rows and the individual cases are in the columns.

This is easily fixed by transposing rows and columns. However, the tidyverse does not offer a dedicated function for transposing tibbles, so we use t(), which is included in base R:

`t(datos)`

             [,1]             [,2]           [,3]               [,4]                           
attrib_name  "years_to_birth" "Tumor_purity" "pathologic_stage" "histological_type"            
TCGA.3C.AAAU "55"             "0.7886"       NA                 "infiltratinglobularcarcinoma" 
TCGA.3C.AALI "50"             "0.6974"       "2"                "infiltratingductalcarcinoma"  
TCGA.3C.AALJ "62"             "0.7666"       "2"                "infiltratingductalcarcinoma"  
TCGA.3C.AALK "52"             "0.6869"       "1"                "infiltratingductalcarcinoma"  
TCGA.4H.AAAK "50"             "0.649"        "3"                "infiltratinglobularcarcinoma" 
TCGA.5L.AAT0 "42"             "0.6501"       "2"                "infiltratinglobularcarcinoma" 
TCGA.5L.AAT1 "63"             "0.5553"       "4"                "infiltratinglobularcarcinoma" 
TCGA.5T.A9QA "52"             "0.8368"       "2"                "other,specify"                
TCGA.A1.A0SB "70"             "0.9328"       "1"                "other,specify"                
TCGA.A1.A0SD "59"             "0.6906"       "2"                "infiltratingductalcarcinoma"  
TCGA.A1.A0SE "56"             "0.7979"       "1"                "mixedhistology(pleasespecify)"
TCGA.A1.A0SF "54"             "0.7237"       "2"                "infiltratingductalcarcinoma"  

...

However, t() does not return a tibble: as the output shows, the result is a standard R character matrix. This is also easily fixed with as_tibble(). We define a new variable to distinguish the original data (variable datos) from the correctly oriented data (variable brca), which we will work with from now on:

`brca <- as_tibble(t(datos), rownames = NA) # "rownames = NA" keeps the row names instead of dropping them`

Let's check what our data looks like now:

brca

# A tibble: 1,098 x 10
   V1       V2       V3        V4           V5        V6          V7       V8        V9         V10     
 * <chr>    <chr>    <chr>     <chr>        <chr>     <chr>       <chr>    <chr>     <chr>      <chr>   
 1 years_t~ Tumor_p~ patholog~ histologica~ number_o~ gender_and~ radiati~ ethnicity Median_ov~ overall~
 2 55       0.7886   NA        infiltratin~ 4         female/whi~ no       nothispa~ 1          4047    
 3 50       0.6974   2         infiltratin~ 1         female/bla~ yes      nothispa~ 1          4005    
 4 62       0.7666   2         infiltratin~ 1         female/bla~ no       nothispa~ 0          1474    
 5 52       0.6869   1         infiltratin~ 0         female/bla~ no       nothispa~ 0          1448    
 6 50       0.649    3         infiltratin~ 4         female/whi~ no       nothispa~ 0          348     
 7 42       0.6501   2         infiltratin~ 0         female/whi~ yes      hispanic~ 0          1477    
 8 63       0.5553   4         infiltratin~ 0         female/whi~ no       hispanic~ 0          1471    
 9 52       0.8368   2         other,speci~ NA        female/bla~ yes      nothispa~ 0          303     
10 70       0.9328   1         other,speci~ 0         female/whi~ NA       nothispa~ 0          259     
# ... with 1,088 more rows

We can see that the conversion has assigned default names to the transposed columns (V1, V2, V3...), leaving the original names in a separate row. We fix the names and drop the extra row as follows:

`colnames(brca) <- unlist(brca[1, ], use.names = FALSE) # Use the first row as the column names`
`brca <- brca[-1, ] # Drop the row that held the names`

brca

# A tibble: 1,097 x 10
   years_to_birth Tumor_purity pathologic_stage histological_type   number_of_lymph_~ gender_and_race   
   <chr>          <chr>        <chr>            <chr>               <chr>             <chr>             
 1 55             0.7886       NA               infiltratinglobula~ 4                 female/white      
 2 50             0.6974       2                infiltratingductal~ 1                 female/blackorafr~
 3 62             0.7666       2                infiltratingductal~ 1                 female/blackorafr~
 4 52             0.6869       1                infiltratingductal~ 0                 female/blackorafr~
 5 50             0.649        3                infiltratinglobula~ 4                 female/white      
 6 42             0.6501       2                infiltratinglobula~ 0                 female/white      
 7 63             0.5553       4                infiltratinglobula~ 0                 female/white      
 8 52             0.8368       2                other,specify       NA                female/blackorafr~
 9 70             0.9328       1                other,specify       0                 female/white      
10 59             0.6906       2                infiltratingductal~ 0                 female/white      
# ... with 1,087 more rows, and 4 more variables: radiation_therapy <chr>, ethnicity <chr>,
#   Median_overall_survival <chr>, overall_survival <chr>

The data finally looks like this:

`View(brca)`

Important:

We will use this modified brca data until the end of the section. Some exercises may not work if you have not followed the steps so far.
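As a compact recap of the reorientation, here is a minimal sketch on a small made-up table. The names toy, m, toy_tidy, case_id and the case identifiers are illustrative assumptions, not part of brca.txt. Unlike the steps above, it sets the column names on the matrix before converting, so no extra header row has to be removed afterwards.

```r
library(tibble)

# Hypothetical toy data in the same orientation as brca.txt:
# one row per attribute, one column per case
toy <- tibble(
  attrib_name = c("years_to_birth", "gender"),
  CASE.A = c("55", "female"),
  CASE.B = c("50", "female")
)

# Transpose only the case columns; t() returns a character matrix
m <- t(toy[-1])

# The attribute names become the column names
colnames(m) <- toy$attrib_name

# Back to a tibble, keeping the case identifiers in their own column
toy_tidy <- as_tibble(m, rownames = "case_id")
toy_tidy
```

Keeping the identifiers in a regular column (here the hypothetical case_id) is a design choice; `rownames = NA`, as used in this tutorial, keeps them as row names instead.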

Modifying data

Now that the data is correctly oriented, it is easier to inspect it and decide which modifications to make.

Note

We already covered basic data frame manipulation in the previous basic tutorial, which you can revisit if needed.

Combining columns

The tidyverse provides a tool for joining several variables into a single column when needed. For this we use the unite() function, as follows:

`unite(<DATA>, <NEW_COL>, <COLUMN1>, <COLUMN2>, ..., sep = "<SEPARATOR>")`

For example, suppose we need the patient's age and the tumor purity in a single column:

`ejemplo1 <- unite(brca, ejemplo1, years_to_birth, Tumor_purity, sep = " - ")`

`ejemplo1`

# A tibble: 1,097 x 9
   ejemplo1  pathologic_stage histological_type    number_of_lymph_~ gender_and_race    radiation_thera~
   <chr>     <chr>            <chr>                <chr>             <chr>              <chr>           
 1 55 - 0.7~ NA               infiltratinglobular~ 4                 female/white       no              
 2 50 - 0.6~ 2                infiltratingductalc~ 1                 female/blackorafr~ yes             
 3 62 - 0.7~ 2                infiltratingductalc~ 1                 female/blackorafr~ no              
 4 52 - 0.6~ 1                infiltratingductalc~ 0                 female/blackorafr~ no              
 5 50 - 0.6~ 3                infiltratinglobular~ 4                 female/white       no              
 6 42 - 0.6~ 2                infiltratinglobular~ 0                 female/white       yes             
 7 63 - 0.5~ 4                infiltratinglobular~ 0                 female/white       no              
 8 52 - 0.8~ 2                other,specify        NA                female/blackorafr~ yes             
 9 70 - 0.9~ 1                other,specify        0                 female/white       NA              
10 59 - 0.6~ 2                infiltratingductalc~ 0                 female/white       NA              
# ... with 1,087 more rows, and 3 more variables: ethnicity <chr>, Median_overall_survival <chr>,
#   overall_survival <chr>

If we do not want to remove the original columns, we add remove = FALSE to the call:

`ejemplo2 <- unite(brca, ejemplo2, years_to_birth, Tumor_purity, sep = " - ", remove = FALSE)`

`ejemplo2`

# A tibble: 1,097 x 11
   ejemplo2    years_to_birth Tumor_purity pathologic_stage histological_type        number_of_lymph_no~
   <chr>       <chr>          <chr>        <chr>            <chr>                    <chr>              
 1 55 - 0.7886 55             0.7886       NA               infiltratinglobularcarc~ 4                  
 2 50 - 0.6974 50             0.6974       2                infiltratingductalcarci~ 1                  
 3 62 - 0.7666 62             0.7666       2                infiltratingductalcarci~ 1                  
 4 52 - 0.6869 52             0.6869       1                infiltratingductalcarci~ 0                  
 5 50 - 0.649  50             0.649        3                infiltratinglobularcarc~ 4                  
 6 42 - 0.6501 42             0.6501       2                infiltratinglobularcarc~ 0                  
 7 63 - 0.5553 63             0.5553       4                infiltratinglobularcarc~ 0                  
 8 52 - 0.8368 52             0.8368       2                other,specify            NA                 
 9 70 - 0.9328 70             0.9328       1                other,specify            0                  
10 59 - 0.6906 59             0.6906       2                infiltratingductalcarci~ 0                  
# ... with 1,087 more rows, and 5 more variables: gender_and_race <chr>, radiation_therapy <chr>,
#   ethnicity <chr>, Median_overall_survival <chr>, overall_survival <chr>

Exercise

Try combining the variables pathologic_stage and radiation_therapy into a new column called treatment_urgency, using " & " as the separator and keeping the original columns.

Answer

ejercicio1 <- unite(brca, treatment_urgency,
   pathologic_stage,
   radiation_therapy,
   sep = " & ", remove = FALSE
)

`ejercicio1`

# A tibble: 1,097 x 11
   years_to_birth Tumor_purity treatment_urgency pathologic_stage histological_type    number_of_lymph_~
   <chr>          <chr>        <chr>             <chr>            <chr>                <chr>            
 1 55             0.7886       NA & no           NA               infiltratinglobular~ 4                
 2 50             0.6974       2 & yes           2                infiltratingductalc~ 1                
 3 62             0.7666       2 & no            2                infiltratingductalc~ 1                
 4 52             0.6869       1 & no            1                infiltratingductalc~ 0                
 5 50             0.649        3 & no            3                infiltratinglobular~ 4                
 6 42             0.6501       2 & yes           2                infiltratinglobular~ 0                
 7 63             0.5553       4 & no            4                infiltratinglobular~ 0                
 8 52             0.8368       2 & yes           2                other,specify        NA               
 9 70             0.9328       1 & NA            1                other,specify        0                
10 59             0.6906       2 & NA            2                infiltratingductalc~ 0                
# ... with 1,087 more rows, and 5 more variables: gender_and_race <chr>, radiation_therapy <chr>,
#   ethnicity <chr>, Median_overall_survival <chr>, overall_survival <chr>
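Notice in the output above that missing values are pasted into the new column as the text "NA". The sketch below uses a made-up two-row tibble (toy, combined, with_na and without_na are illustrative names) to show this behaviour together with the na.rm argument of unite(), which drops missing values before pasting. It assumes a reasonably recent tidyr version that supports na.rm.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data with one missing stage
toy <- tibble(stage = c("2", NA), therapy = c("yes", "no"))

# Default behaviour: NA is pasted as the literal string "NA"
with_na <- unite(toy, combined, stage, therapy, sep = " & ")
with_na$combined

# na.rm = TRUE removes missing values before uniting
without_na <- unite(toy, combined, stage, therapy, sep = " & ", na.rm = TRUE)
without_na$combined
```

Whether dropping the missing value is appropriate depends on the analysis; in some cases an explicit "NA" is the more honest representation.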

Splitting columns

In the same way, we can easily split a variable into several columns using separate():

`separate(<DATA>, <COLUMN_NAME>, into = c("<NEW_COL_1>", "<NEW_COL_2>"), sep = "<SEPARATOR>")`

This function is very useful because it can split a column on any character used as a separator. Note that sep is interpreted as a regular expression, so characters with a special meaning there, such as "." or "|", must be escaped. For example, in our dataset the variables "gender" and "race" are stored in a single column called gender_and_race. Let's split them:

`ejemplo3 <- separate(brca, gender_and_race, into = c("gender", "race"), sep = "/")`

`ejemplo3`

# A tibble: 1,097 x 11
   years_to_birth Tumor_purity pathologic_stage histological_type    number_of_lymph_~ gender race      
   <chr>          <chr>        <chr>            <chr>                <chr>             <chr>  <chr>     
 1 55             0.7886       NA               infiltratinglobular~ 4                 female white     
 2 50             0.6974       2                infiltratingductalc~ 1                 female blackoraf~
 3 62             0.7666       2                infiltratingductalc~ 1                 female blackoraf~
 4 52             0.6869       1                infiltratingductalc~ 0                 female blackoraf~
 5 50             0.649        3                infiltratinglobular~ 4                 female white     
 6 42             0.6501       2                infiltratinglobular~ 0                 female white     
 7 63             0.5553       4                infiltratinglobular~ 0                 female white     
 8 52             0.8368       2                other,specify        NA                female blackoraf~
 9 70             0.9328       1                other,specify        0                 female white     
10 59             0.6906       2                infiltratingductalc~ 0                 female white     
# ... with 1,087 more rows, and 4 more variables: radiation_therapy <chr>, ethnicity <chr>,
#   Median_overall_survival <chr>, overall_survival <chr>
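Because sep is treated as a regular expression, a separator such as "|" has to be escaped to be matched literally. A minimal sketch with made-up data (the tibble toy, its values and the name toy_split are illustrative assumptions):

```r
library(tidyr)
library(tibble)

# Hypothetical toy data using "|" instead of "/" as the separator
toy <- tibble(gender_and_race = c("female|white", "male|asian"))

# "|" means "or" in a regular expression, so it is escaped with "\\"
toy_split <- separate(
  toy, gender_and_race,
  into = c("gender", "race"),
  sep = "\\|"
)
toy_split
```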

Reordering columns

Variables are usually arranged so that the most important ones are furthest to the left. While working, we may decide that a particular variable is the most important one for our purposes and want it to occupy the first column.

This is very easy to do with the relocate() function, which by default moves the chosen column to the far left:

`relocate(<DATA>, <VARIABLE>)`

For example, looking at our tidy data we decide that histological_type is the most important variable for our study. We move it as follows:

`ejemplo4 <- relocate(brca, histological_type)`

`ejemplo4`

# A tibble: 1,097 x 10
   histological_type   years_to_birth Tumor_purity pathologic_stage number_of_lymph_~ gender_and_race   
   <chr>               <chr>          <chr>        <chr>            <chr>             <chr>             
 1 infiltratinglobula~ 55             0.7886       NA               4                 female/white      
 2 infiltratingductal~ 50             0.6974       2                1                 female/blackorafr~
 3 infiltratingductal~ 62             0.7666       2                1                 female/blackorafr~
 4 infiltratingductal~ 52             0.6869       1                0                 female/blackorafr~
 5 infiltratinglobula~ 50             0.649        3                4                 female/white      
 6 infiltratinglobula~ 42             0.6501       2                0                 female/white      
 7 infiltratinglobula~ 63             0.5553       4                0                 female/white      
 8 other,specify       52             0.8368       2                NA                female/blackorafr~
 9 other,specify       70             0.9328       1                0                 female/white      
10 infiltratingductal~ 59             0.6906       2                0                 female/white      
# ... with 1,087 more rows, and 4 more variables: radiation_therapy <chr>, ethnicity <chr>,
#   Median_overall_survival <chr>, overall_survival <chr>

Exercise

With this function we can place columns in any position we like, not only in the first one. Try moving the variable ethnicity to the 3rd column.

Hint 1

You can use one of these two arguments inside relocate(): .before = or .after =.

Hint 2

Try using the function in one of the following ways:

relocate(<DATA>, <VARIABLE>,
   .before = <REFERENCE_VARIABLE>
)

relocate(<DATA>, <VARIABLE>,
   .after = <REFERENCE_VARIABLE>
)

Answer

ejercicio2 <- relocate(brca, ethnicity,
   .before = pathologic_stage
)

This also works:

ejercicio2 <- relocate(brca, ethnicity,
   .after = Tumor_purity
)

In both cases we get:

`ejercicio2`

# A tibble: 1,097 x 10
   years_to_birth Tumor_purity ethnicity      pathologic_stage histological_type      number_of_lymph_n~
   <chr>          <chr>        <chr>          <chr>            <chr>                  <chr>             
 1 55             0.7886       nothispanicor~ NA               infiltratinglobularca~ 4                 
 2 50             0.6974       nothispanicor~ 2                infiltratingductalcar~ 1                 
 3 62             0.7666       nothispanicor~ 2                infiltratingductalcar~ 1                 
 4 52             0.6869       nothispanicor~ 1                infiltratingductalcar~ 0                 
 5 50             0.649        nothispanicor~ 3                infiltratinglobularca~ 4                 
 6 42             0.6501       hispanicorlat~ 2                infiltratinglobularca~ 0                 
 7 63             0.5553       hispanicorlat~ 4                infiltratinglobularca~ 0                 
 8 52             0.8368       nothispanicor~ 2                other,specify          NA                
 9 70             0.9328       nothispanicor~ 1                other,specify          0                 
10 59             0.6906       nothispanicor~ 2                infiltratingductalcar~ 0                 
# ... with 1,087 more rows, and 4 more variables: gender_and_race <chr>, radiation_therapy <chr>,
#   Median_overall_survival <chr>, overall_survival <chr>
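The three ways of calling relocate() seen in this section can be compared side by side on a tiny made-up tibble. The names toy, first, after and before are illustrative; the sketch only inspects the resulting column order.

```r
library(dplyr)

# Hypothetical toy data with four columns in alphabetical order
toy <- tibble(a = 1, b = 2, c = 3, d = 4)

# Default: the selected column moves to the far left
first <- names(relocate(toy, d))

# .after places the column right after the reference column
after <- names(relocate(toy, d, .after = a))

# Several columns can be moved at once; .before places them before the reference
before <- names(relocate(toy, c, d, .before = b))

first   # "d" "a" "b" "c"
after   # "a" "d" "b" "c"
before  # "a" "c" "d" "b"
```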

Splitting data by factor

In some cases we will need to split our dataset according to one of its factor-type variables. To do so, we use the split() function:

`split(<DATA>, <VARIABLES>)`

In our example, suppose we need to split our data into separate tibbles according to tumor type (variable histological_type). We do this by creating a variable as follows:

`tipo_tumor <- split(brca, brca$histological_type)`

This variable is assigned a list with one tibble per tumor type, each element named after its value of histological_type.

This lets us study one of the groups without taking the others into account. For example, let's look only at the data for carcinomas of medullary origin (medullarycarcinoma):

`View(tipo_tumor$medullarycarcinoma)`
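To make the structure of the resulting list concrete, here is a minimal base R sketch on made-up data. The data frame toy, its values and the name groups are illustrative assumptions; the same call works on a tibble.

```r
# Hypothetical toy data: three cases of two tumor types
toy <- data.frame(
  histological_type = c("ductal", "lobular", "ductal"),
  years_to_birth = c(50L, 62L, 47L)
)

# One list element per distinct value of the grouping variable
groups <- split(toy, toy$histological_type)

names(groups)   # the element names are the group values
groups$ductal   # only the rows whose histological_type is "ductal"
```

Each element can then be analysed on its own, for example with `nrow(groups$ductal)`.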

Changing data types

Importing the data, converting it to a tibble and modifying it has changed the types of its values. We can check this under the column names, where the type shown does not match the kind of data each column actually holds.

This can cause problems when performing statistical calculations, more complex modifications and plots, so it is advisable to convert each column to the appropriate data type.

Note

First we need to know what each of the data conversion functions means. The most common ones are shown below; inside convert() they are used without the angle brackets, for example int() or dbl():

FUNCTION	DATA TYPE
`<chr>`	Character
`<num>`	Numeric
`<int>`	Integer
`<lgl>`	Logical
`<fct>`	Factor
`<dte>`	Date
`<dtm>`	Date and time
`<dbl>`	Double (decimal numbers)

Let's go back to our data. Many of our variables are numeric, yet all of them have been assigned the character type. We can fix this with the convert() function from the hablar package:

`convert(<DATA>, <CONVERSION_FUNCTION>(<COLUMNS>))`

Several of the variables (years_to_birth, number_of_lymph_nodes, Median_overall_survival and overall_survival) are integers (<int>) but are classified as character (<chr>). This is easy to fix:

ejemplo4 <- convert(brca, int(
   years_to_birth,
   number_of_lymph_nodes,
   Median_overall_survival,
   overall_survival
   )
)

`ejemplo4`

# A tibble: 1,097 x 10
   years_to_birth Tumor_purity pathologic_stage histological_type   number_of_lymph_~ gender_and_race   
            <int> <chr>        <chr>            <chr>                           <int> <chr>             
 1             55 0.7886       NA               infiltratinglobula~                 4 female/white      
 2             50 0.6974       2                infiltratingductal~                 1 female/blackorafr~
 3             62 0.7666       2                infiltratingductal~                 1 female/blackorafr~
 4             52 0.6869       1                infiltratingductal~                 0 female/blackorafr~
 5             50 0.649        3                infiltratinglobula~                 4 female/white      
 6             42 0.6501       2                infiltratinglobula~                 0 female/white      
 7             63 0.5553       4                infiltratinglobula~                 0 female/white      
 8             52 0.8368       2                other,specify                      NA female/blackorafr~
 9             70 0.9328       1                other,specify                       0 female/white      
10             59 0.6906       2                infiltratingductal~                 0 female/white      
# ... with 1,087 more rows, and 4 more variables: radiation_therapy <chr>, ethnicity <chr>,
#   Median_overall_survival <int>, overall_survival <int>

We can apply different data types to different variables in the same call. For example, Tumor_purity is a decimal variable (<dbl>):

ejemplo5 <- convert(brca,
   int(
      years_to_birth,
      number_of_lymph_nodes,
      Median_overall_survival,
      overall_survival
   ),
   dbl(
      Tumor_purity
   )
)

`ejemplo5`

# A tibble: 1,097 x 10
   years_to_birth Tumor_purity pathologic_stage histological_type   number_of_lymph_~ gender_and_race   
            <int>        <dbl> <chr>            <chr>                           <int> <chr>             
 1             55        0.789 NA               infiltratinglobula~                 4 female/white      
 2             50        0.697 2                infiltratingductal~                 1 female/blackorafr~
 3             62        0.767 2                infiltratingductal~                 1 female/blackorafr~
 4             52        0.687 1                infiltratingductal~                 0 female/blackorafr~
 5             50        0.649 3                infiltratinglobula~                 4 female/white      
 6             42        0.650 2                infiltratinglobula~                 0 female/white      
 7             63        0.555 4                infiltratinglobula~                 0 female/white      
 8             52        0.837 2                other,specify                      NA female/blackorafr~
 9             70        0.933 1                other,specify                       0 female/white      
10             59        0.691 2                infiltratingductal~                 0 female/white      
# ... with 1,087 more rows, and 4 more variables: radiation_therapy <chr>, ethnicity <chr>,
#   Median_overall_survival <int>, overall_survival <int>

This saves us from many future problems caused by working with incorrectly typed data.
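For comparison, a similar conversion can be sketched without hablar, using mutate() and across() from dplyr. This is an alternative technique named here for illustration, not a description of how convert() works internally, and the tibble toy with its values is made up.

```r
library(dplyr)

# Hypothetical toy data where every column was read as character
toy <- tibble(
  years_to_birth = c("55", "50"),
  Tumor_purity = c("0.7886", "0.6974"),
  gender = c("female", "female")
)

toy_typed <- mutate(
  toy,
  across(years_to_birth, as.integer),  # whole numbers
  across(Tumor_purity, as.double)      # decimal numbers
)
toy_typed
```

Columns that are not mentioned, such as gender here, keep their original type.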

Exercises

Review exercises

To check that you have understood everything, try the proposed exercises. To find out how to do them, see the "Completing the exercises" section. Then run the following command:

`learnr::run_tutorial("ordenar", "tutoradvr")`

A window will open in your browser with the exercises to solve.

References