How to Use Data Science for Analyzing Endurance Athletes

Coaches must be able to analyze and understand an endless stream of data in order to draw conclusions; using data science can amplify your insights.

Coaches look at a lot of data, both inside and outside the TrainingPeaks platform, and must draw conclusions from this endless stream. That is not always easy.

Whether the data is directly measured from your athlete, a series of subjective answers to questions, or calculated from TrainingPeaks software (such as CTL), coaches must be familiar with techniques for analyzing and understanding this data. Let’s dive into why having a good grasp of statistics is important for coaches and point you toward resources that can help you become more data savvy.

Selecting a Statistical Metric or Technique
Statistics is a vast field, and finding an appropriate tool for the job can be challenging. Broadly speaking, there are two main branches of statistics. Descriptive statistics are used primarily to summarize and interpret the data you have. Inferential statistics are mainly used to draw conclusions that reach beyond that data, such as predictions about a wider population.

Coaches use both in conjunction with the TrainingPeaks platform, whether they’re aware of it or not. Coaches on TrainingPeaks use descriptive statistics to determine an athlete’s fitness and build a more complete picture of the individual, and inferential statistics to compare an athlete with others and predict race performances.

For coaches, one critical statistical feature to look for is a measure of central tendency. This value describes a dataset by locating its center, whether that dataset comes from a single ride, several sessions or a season of results. You may know the three common measures: mean, median and mode.

Mean: The mean is calculated by summing the data and dividing by the number of data points. This is the method used to calculate ‘average power’ in TrainingPeaks. Normalized power also relies on a mean, though not a simple one: the 30-second rolling averages of power are raised to the fourth power, averaged, and the fourth root of the result is taken.

Median: The median is calculated by arranging your data in order and taking the value in the middle (or the mean of the two middle values when the count is even). This might be used to assess the results of athletes who took part in a race. A coach might even use it for advertising purposes, for example by quoting the median improvement in FTP among their athletes over the course of a year.

Mode: The modal value of a dataset is the value that appears most often. This is useful in analyzing the success of an athlete’s racing season, for example. Their modal result would be the finishing position they achieved most often.
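The three measures can be tried out with Python’s built-in statistics module. The sketch below is only an illustration: the power samples and finishing positions are made-up values, not data from any real athlete.

```python
from statistics import mean, median, multimode

# Hypothetical power samples in watts from a short effort (illustrative only).
power_watts = [210, 215, 205, 220, 400, 212, 208]

# Hypothetical finishing positions across one athlete's season (illustrative only).
finishing_positions = [4, 2, 4, 7, 4, 2]

average_power = mean(power_watts)       # sum of samples / number of samples
median_power = median(power_watts)      # middle value once the samples are sorted
modal_positions = multimode(finishing_positions)  # most frequent value(s), as a list

print(f"Mean power: {average_power:.1f} W")
print(f"Median power: {median_power} W")
print(f"Modal finishing position(s): {modal_positions}")
```

Note how the single 400 W spike pulls the mean well above the median. When a dataset contains outliers like this, the median is often the more representative summary.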

Comparing an Individual to a Population
One part of coaching is analyzing your athlete’s ability in the context of potential competitors during a race. An example of where this might be used is in assessing the performance gains an athlete needs to achieve to fulfill a goal, such as an age group podium or qualifying for a championship.

A successful analysis requires three things from the coach: an accurate picture of where the athlete sits among their competitors, the direction and rate at which the athlete is moving, and any shift in the mean performance of the rest of the field. In recent years, for example, mean finishing times have become faster due to technological advancements in running shoes and athletes paying more attention to aerodynamics on the bike.

Analyzing your athlete’s data amongst the population is vital for seeing these patterns early and allowing you and your athlete to come up with methods to get ahead of — or catch up with — the rest of the field.
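As a rough sketch of what this comparison could look like in Python, the example below places one athlete within an age-group field, measures the gap to a podium spot, and checks how the field’s mean time has shifted between two seasons. The function names and all finishing times (in minutes) are hypothetical and chosen purely for illustration.

```python
from statistics import mean

def fraction_of_field_beaten(athlete_time, field_times):
    """Share of the field that finished slower than the athlete (0.0 to 1.0)."""
    slower = sum(1 for t in field_times if t > athlete_time)
    return slower / len(field_times)

def gap_to_position(athlete_time, field_times, position):
    """Time the athlete must gain to match a finishing position (1 = winner)."""
    target_time = sorted(field_times)[position - 1]
    return max(0, athlete_time - target_time)

# Hypothetical age-group finishing times in minutes for two seasons.
last_season_field = [185, 187, 190, 193, 195, 197, 200, 203, 207, 212]
this_season_field = [182, 185, 188, 190, 193, 195, 198, 201, 204, 210]
athlete_time = 192  # hypothetical athlete, not included in the field lists

beaten = fraction_of_field_beaten(athlete_time, this_season_field)
podium_gap = gap_to_position(athlete_time, this_season_field, position=3)
field_shift = mean(this_season_field) - mean(last_season_field)

print(f"Field beaten: {beaten:.0%}")
print(f"Gap to third place: {podium_gap} min")
print(f"Shift in field mean: {field_shift:+.1f} min")
```

A negative shift means the field as a whole got faster, so an athlete who merely repeats last season’s time would lose ground.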

The Denominator Matters
When doing any analysis comparing either one population to another or an individual to a population, the denominator, meaning the population you compare against, is of vital importance. The aim is to accurately see where your athlete fits within the group they’re competing against. This supports realistic goal setting, giving your athlete clear, achievable targets that keep them motivated during a training cycle.

Selecting the correct population can be as simple as comparing an athlete to their age group rather than professionals or comparing a triathlete against triathletes rather than pure cyclists when analyzing cycling performance in time trials. There are scenarios in which selecting the population to be compared against is less straightforward, but it should always be a task that’s considered carefully.
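To see how much the choice of population changes the answer, here is a minimal sketch that ranks the same bike split against two different groups. The population labels, the times (in minutes) and the helper names are all assumptions made up for this illustration.

```python
def percent_of_field_beaten(athlete_time, field_times):
    """Percentage of the given field that finished slower than the athlete."""
    slower = sum(1 for t in field_times if t > athlete_time)
    return 100 * slower / len(field_times)

# Hypothetical time-trial bike splits in minutes, tagged by population.
results = [
    ("pro", 52.0), ("pro", 53.5), ("pro", 54.0), ("pro", 55.5),
    ("age_group", 60.0), ("age_group", 63.0),
    ("age_group", 66.5), ("age_group", 70.0),
]

def times_for(population):
    """Select only the times that belong to one population."""
    return [time for label, time in results if label == population]

athlete_time = 62.0  # hypothetical age-group athlete

versus_pros = percent_of_field_beaten(athlete_time, times_for("pro"))
versus_age_group = percent_of_field_beaten(athlete_time, times_for("age_group"))

print(f"Against professionals: beats {versus_pros:.0f}% of the field")
print(f"Against own age group: beats {versus_age_group:.0f}% of the field")
```

The athlete’s time is identical in both comparisons. Only the denominator changes, and with it the conclusion a coach would draw.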

Performance Analysis Example
Let’s start by comparing an athlete’s swim performance against others in their age group. This is a good example because elapsed times alone in open water swimming can be meaningless, and GPS data is notoriously unreliable. If this athlete competes in five races of a similar, though slightly varying, standard during the year, a coach might want to see how they have progressed by looking at two statistics: first, their time as a percentage of the fastest swimmer’s time, and second, their position in the field as a proportion of the number of finishers.

Combining the two is often the best approach. In open water swimming, several factors can change race-day performance, so a single variable, such as time relative to the winner, is not always a fair or accurate reflection of performance.
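Both statistics are simple to compute from a race result. The sketch below shows one possible definition of each. The formulas are an assumption about how a coach might calculate them (in particular, counting the field beaten as everyone who finished behind the athlete), and the swim times and finishing position are hypothetical.

```python
def percent_of_winner_time(athlete_seconds, winner_seconds):
    """Athlete's time as a percentage of the fastest time (100 = equal to the winner)."""
    return 100 * athlete_seconds / winner_seconds

def percent_of_field_beaten(position, finishers):
    """Percentage of finishers who placed behind the athlete."""
    return 100 * (finishers - position) / finishers

# Hypothetical race: a 27:15 swim against a 25:00 winning swim, 9th of 100 finishers.
athlete_swim_seconds = 27 * 60 + 15
winner_swim_seconds = 25 * 60

time_metric = percent_of_winner_time(athlete_swim_seconds, winner_swim_seconds)
field_metric = percent_of_field_beaten(position=9, finishers=100)

print(f"Percentage of winner's time: {time_metric:.0f}%")
print(f"Percentage of field beaten: {field_metric:.0f}%")
```

Because both values are ratios, they can be compared across races with different distances, conditions and field sizes, which raw elapsed times cannot.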

Let’s say that over five races an athlete achieves the following performances, listed in date order:

Percentage of winner’s time = 109%, 108%, 101%, 112%, 104%

Percentage of field beaten = 91%, 92%, 93%, 98%, 98%

Using only one of these variables provides an incomplete picture of any race, let alone the season as a whole. For example, in race three, our hypothetical athlete was closest to the winner yet beat a smaller proportion of the field than in races four and five. By just taking our athlete’s percentage of the winner’s time, we see an athlete who peaked in race three. When the whole picture is revealed, however, our athlete has improved steadily during the year.
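The same reading of the season can be reproduced in a few lines of Python using the two series above. This is just a sketch of one way to inspect the numbers; the variable names are illustrative.

```python
# The five-race series from the worked example, in date order.
pct_of_winner_time = [109, 108, 101, 112, 104]
pct_of_field_beaten = [91, 92, 93, 98, 98]

# Race with the smallest time gap to the winner (races are numbered from 1).
best_race_by_time = 1 + min(
    range(len(pct_of_winner_time)), key=lambda i: pct_of_winner_time[i]
)

# True if the share of the field beaten never dropped from one race to the next.
field_beaten_never_declined = all(
    later >= earlier
    for earlier, later in zip(pct_of_field_beaten, pct_of_field_beaten[1:])
)

for race, (time_pct, field_pct) in enumerate(
    zip(pct_of_winner_time, pct_of_field_beaten), start=1
):
    print(f"Race {race}: {time_pct}% of winner's time, beat {field_pct}% of field")

print(f"Closest to the winner in race {best_race_by_time}")
print(f"Field beaten never declined: {field_beaten_never_declined}")
```

Looking only at the first series points to race three as the peak, while the second series shows a result that held or improved at every race.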

This is, of course, an oversimplified example to illustrate a point — the reality is that a thorough analysis of an athlete’s performance is time-consuming, and there are usually far more than two variables that go into deciding how an athlete is doing and what to do next. Understanding which variables to include and their impact on the assessment is a careful judgment a coach must make.

Resources to Learn Coding
Given where you’re reading this, it’s likely that you’re familiar with the analysis that TrainingPeaks can provide you as a coach. On occasion, you might wish to do something that TrainingPeaks doesn’t currently allow. You might also want to run an analysis like the worked example across a larger population or with more variables, in which case you will probably want the help of a programming language to find insights.
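As a hint of what that could look like, the sketch below reads race results for several athletes from CSV text and summarizes each athlete’s season. The column names, athlete labels and figures are all hypothetical, and the inline text stands in for a file you might export or download yourself.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical results table; in practice this would come from a CSV file.
RACE_RESULTS_CSV = """athlete,race,position,finishers
Athlete A,Race 1,9,100
Athlete A,Race 2,4,50
Athlete B,Race 1,30,100
Athlete B,Race 2,10,50
"""

def mean_field_beaten_by_athlete(csv_text):
    """Average percentage of the field beaten, per athlete, across all races."""
    per_athlete = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        position = int(row["position"])
        finishers = int(row["finishers"])
        per_athlete[row["athlete"]].append(100 * (finishers - position) / finishers)
    return {name: mean(values) for name, values in per_athlete.items()}

season_summary = mean_field_beaten_by_athlete(RACE_RESULTS_CSV)

for name, average in season_summary.items():
    print(f"{name}: beat {average:.1f}% of the field on average")
```

The same loop works unchanged whether the table holds four rows or four thousand, which is where code starts to pay off over manual calculation.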

Here is a list of resources to help you start some simple coding and statistics to further develop your skill set as a coach.

TrainingPeaks University

TPU provides several courses which allow coaches to become familiar with the analytics available through the platform.

Kaggle

Kaggle provides many beginner-friendly free data science courses across various technologies. They also have some free datasets for you to analyze and practice with — many of these are endurance sport based, such as time datasets from major marathons.

YouTube

There are countless technical resources on coding on the video platform, complete with example projects. It’s possible to go from zero to expert with just YouTube and practice. Some of my favorite options are Socratica, PyData and Real Python, though an exhaustive list is available here. The comments on a YouTube tutorial are generally a good marker of a video’s quality: if lots of people found the content helpful, you are more likely to as well. If commenters are questioning the process, proceed with caution.

Miscellaneous Resources

Away from those examples, there are several valuable resources I’ve picked up on over the years. “Towards Data Science” is always full of technical blogs. There’s also a great book called “Statistics Without Tears,” which aims to present complicated statistical ideas to non-mathematicians.

Finally, downloading the Anaconda software package onto your computer and just exploring Python, perhaps in conjunction with a YouTube tutorial, can be a great way to learn. Python is a ‘high-level’ programming language, meaning its code reads closer to written English than to the ones and zeros a machine works with. It’s a great first language to learn, as Python is one of the more popular tools used by data-centric professions, from astrophysicists to endurance training software developers.

Conclusion
While you don’t need to be the best data scientist to become a great triathlon coach, having a solid understanding of fundamental data analysis will allow you to bring another dimension to your coaching.

A skilled data scientist knows when the data isn’t showing anything. Recognizing that moment, and finding other ways to make the best possible decision with the information available, is much like balancing the art of coaching with sports science.

With the list of resources provided here and an understanding of the swimming example, you can step into the hybrid world of endurance coach and data analyst, and potentially join the ranks of triathlon coach Alan Couzens, HRV4Training founder Marco Altini and Lululemon innovation team member Dr. Sian Allen.
