Querying GA4 Session Last Non-direct Traffic Source in BigQuery

Hello! I recently found a solution to one of GA4 BigQuery export’s limitations – the lack of session-level traffic source data. I will share how I recreated GA4 session traffic source dimensions using BigQuery export event data, resulting in a lookup table containing the last non-direct source of traffic for each unique session id.

The Basics: Traffic Source Data in BigQuery

GA4 export can be confusing, especially with the traffic_source parameters. Remember that traffic_source refers to the user-level first-touch traffic source, not the session’s traffic source. Individual events in BigQuery can contain event parameters like source, medium, and campaign. These are nested within the event_params field of the event.

Defining Session Traffic Source in GA4

In Universal Analytics, traffic source dimensions were straightforward. However, GA4 is different – a change in the source of traffic no longer triggers a new session. GA4 reports the session source and medium by tracking the first event’s traffic source as the source for the entire session, excluding automatically generated first_visit and session_start events.

Last Non-direct Attribution in GA4

GA4’s approach for last non-direct attribution, or Cross-channel last click, is slightly different. It retrieves data from the last event containing traffic source details, rather than looking at the first event of each preceding session.

Query Process

I split the process into two SQL queries:

  1. Create a table with the first and last traffic source per session.
  2. Identify the last non-direct source of traffic if the current session had no traffic source data.

This helps make the queries less expensive.

Querying the Session Traffic Source

The first SQL query goes through the events and returns the first and last traffic source per unique session id. The values will be null if the traffic source data doesn’t exist. Save the query’s results in a new table and reference it in the next query instead of rerunning the entire thing with all event data.

Identifying the Last Non-direct Traffic Source

The second query uses the results of the first query to fill in the last non-direct traffic source if the session had no traffic source data.

Now you can recreate the GA4 session traffic source dimensions using BigQuery export event data and obtain the last non-direct source of traffic for each unique session id. I’ve added an example of this code below for you to copy and adapt. Happy querying!

GA4 BigQuery Example SQL

Query the Session Traffic Source

The first SQL query iterates through the events, returning the first and last traffic source for each unique session ID. If traffic source data is unavailable, the values will be null.

To avoid rerunning the entire query with all event data, save the results in a new table and reference it in the subsequent query.

Here’s the optimized session traffic source query:

-- Extract event data for session traffic source details
WITH events AS (
  SELECT
    CAST(event_date AS DATE FORMAT 'YYYYMMDD') AS date,
    CONCAT(user_pseudo_id, (SELECT value.int_value FROM unnest(event_params) WHERE key = 'ga_session_id')) AS session_id,
    user_pseudo_id,
    (SELECT value.int_value FROM unnest(event_params) WHERE key = 'ga_session_id') AS session_start,
    (SELECT value.string_value FROM unnest(event_params) WHERE key = 'source') AS source,
    (SELECT value.string_value FROM unnest(event_params) WHERE key = 'medium') AS medium,
    (SELECT value.string_value FROM unnest(event_params) WHERE key = 'campaign') AS campaign,
    event_timestamp
  FROM
    `<project>.<dataset>.events_*`
  WHERE
    (_table_suffix >= '<start date>' AND _table_suffix <= '<end date>')
    AND event_name NOT IN ('session_start', 'first_visit')
)
SELECT
  date,
  session_id,
  user_pseudo_id,
  session_start,
  FIRST_VALUE(
    IF(
      COALESCE(source, medium, campaign) IS NOT NULL,
      (SELECT AS STRUCT source, medium, campaign),
      NULL
    )
  ) OVER(
    PARTITION BY session_id
    ORDER BY event_timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS session_first_traffic_source,
  LAST_VALUE(
    IF(
      COALESCE(source, medium, campaign) IS NOT NULL,
      (SELECT AS STRUCT source, medium, campaign),
      NULL
    ) IGNORE NULLS
  ) OVER(
    PARTITION BY session_id
    ORDER BY event_timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS session_last_traffic_source,
  ROW_NUMBER() OVER(
    PARTITION BY session_id
    ORDER BY event_timestamp
  ) AS session_event_number
FROM
  events
QUALIFY session_event_number = 1
-- Select only one row per session

This query extracts the necessary event data for session traffic source details, resulting in an optimized and efficient session traffic source query.

Querying the Last Non-direct Traffic Source

The second SQL query retrieves session-level data from the previously created table and fills in missing traffic sources with the last non-direct source. If session_first_traffic_source doesn’t contain any traffic source data, the query searches for the same user’s previous non-null value for session_last_traffic_source.

Here’s the optimized last non-direct traffic source query:

SELECT
  date,
  session_id,
  user_pseudo_id,
  session_start,
  session_first_traffic_source,
  IFNULL(
    session_first_traffic_source,
    LAST_VALUE(session_last_traffic_source IGNORE NULLS) OVER(
      PARTITION BY user_pseudo_id
      ORDER BY session_start RANGE BETWEEN 7776000 PRECEDING AND CURRENT ROW -- 90 day lookback
    )
  ) AS session_traffic_source_last_non_direct
FROM
  `<project>.<dataset>.<session traffic source table>`

This query has a 90-day lookback (7776000 seconds), which you can adjust by changing the number of days specified in seconds. The window order and range are based on session_start, corresponding to the ga_session_id parameter value. Ga_session_id is a Unix timestamp assigned at the beginning of a session.

Keep in mind that a larger lookback window can find traffic sources further back in time but will also increase the query’s cost.

Further Considerations

The query performed well on my test dataset, and the base logic seems accurate. However, there were a few common issues absent from the dataset I used.

Google Ads and GCLID Bug

GA4 has some challenges when it comes to tracking Google Ads traffic with auto-tagging enabled. Auto-tagging utilizes the GCLID parameter to link website visits to their respective campaigns. In theory, this eliminates the need for the standard utm tags.

Unfortunately, this doesn’t always function as expected with BigQuery exports. In BigQuery, sessions that begin with an event containing the GCLID parameter may either lack source and medium details or display incorrect information such as “google / organic” or “youtube.com / referral” for their source medium.

Leave a Reply

Your email address will not be published. Required fields are marked *