
How I Removed 11TB of Media


We leverage a media content management platform called Kaltura at work for people to publish their recorded presentations. We had been ingesting everyone’s recorded meeting content for 2 years, indiscriminately, and had accumulated about 15 TB of mostly unused recordings.

We had two objectives: get our recordings under control, and keep removing stale content so it doesn’t build up again in the future. We also needed to retain all published content.
I thought this was going to be easy, and probably would be for someone who’s not as technologically challenged as I am, so I got started.

Objective one: Get recordings under control
At first, I wanted to leverage Python and the client libraries Kaltura had built to write my cleanup scripts. After struggling with my limited Python knowledge (and experiencing weird behavior with even the simplest queries), I decided to go back to my old standby, Bash.

After finally getting some simple queries working (I struggled making a session with the appropriate level of rights), I was ready to make some advancements.

The following block kicks off your session and creates a token you use when interacting with the API later. The token is just a string, stored in the variable KALTURA_SESSION.

#if you want to run the script, just enter your secret and partnerId below
#type=2 requests an admin session; format=1 asks for JSON, where the response
#body is just the session string in quotes, which sed strips off
KALTURA_SESSION=$(curl -s -X POST https://www.kaltura.com/api_v3/service/session/action/start \
    -d "secret=" \
    -d "type=2" \
    -d "partnerId=" \
    -d "expiry=86400" \
    -d "format=1" | sed 's@"@@g')
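Before going further, it’s worth sanity-checking that the call actually returned a session. This guard isn’t in the original scripts; the function name check_session is my own, and it assumes (as Kaltura’s JSON error responses suggest) that a failed start call returns an error body mentioning KalturaAPIException rather than a bare session string.

```shell
# Hypothetical guard, not part of the original scripts: a failed
# session/start returns an error object naming KalturaAPIException
# instead of a bare session string, so catch that (or an empty result).
check_session() {
    if [ -z "$1" ] || echo "$1" | grep -q "KalturaAPIException"; then
        echo "Failed to start a Kaltura session; check secret and partnerId" >&2
        return 1
    fi
}

# Usage: check_session "$KALTURA_SESSION" || exit 1
```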

I needed a way to only grab media content that was of a certain age. Kaltura’s API has built in filters, so I used them.
By default, Kaltura gives you back information for 30 entries at a time, so I decided to page by the i = i + 1 method (in this case, I used the unhelpful mnemonic “num”).
I also needed to tell the loop to stop at some point, so I defined a sentinel variable from a grep of the results: when I run the command that defines “res”, the string “msDuration” appears in the output only if there are media entries on that page.
Then, at the end of each loop, I write a bunch of text out to a text file that I parse through later. I know there are better ways to deal with large amounts of data, but I’m a glutton for plaintext work with commands like sed, grep, and awk (no awk in this script, though).

tmpdir="/tmp/kaltura-cleanup"   # working directory for the text files below
mkdir -p "$tmpdir"

curTime=$(date +%s)
time=$(($curTime - 15552000))   # 180 days ago, in epoch seconds
num="1"
pager="id"                      # any non-empty value, so the loop runs at least once

while [ -n "$pager" ];
do
    res=$(curl -s -X POST https://www.kaltura.com/api_v3/service/media/action/list \
        -d "ks=$KALTURA_SESSION" \
        -d "filter[createdAtLessThanOrEqual]=$time" \
        -d "filter[objectType]=KalturaMediaEntryFilter" \
        -d "pager[pageIndex]=$num" \
        -d "pager[objectType]=KalturaFilterPager")
    num=$(($num + 1))
    # msDuration only appears when the page contains media entries
    pager=$(echo "$res" | grep "msDuration")

    echo "$res" | sed 's|<|\n<|g' | grep -A20 "<id>" >> $tmpdir/raw.txt
done

Next, I pulled the two lines I needed from each media entry’s information (the media ID and the names of the categories the media is in).

while read line
do
    echo "$line" | grep "<id>" >> $tmpdir/idcat.txt
    echo "$line" | grep "<categories>" >> $tmpdir/idcat.txt
done < $tmpdir/raw.txt

Then I remove some unnecessary text, and with some sed magic I remove the media IDs of content we don’t want to lose (anything that’s published). In Bash, the test “-n” is true when a string is non-empty; “-z” is true when it is empty.
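A quick standalone illustration of those two tests (nothing Kaltura-specific here):

```shell
# -z is true for an empty string, -n for a non-empty one
var=""
[ -z "$var" ] && echo "var is empty"        # prints: var is empty
var="hello"
[ -n "$var" ] && echo "var is non-empty"    # prints: var is non-empty
```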

while read line
do
    id=$(echo "$line" | grep "<id>")
    if [[ -n $id ]]; then
        echo "$line" | sed 's/<id>//g' >> $tmpdir/flagthese.txt
    fi
    published=$(echo "$line" | grep -i "moodle")
    if [[ -n $published ]]; then
        # published entry: drop the ID we just wrote (BSD/macOS sed syntax)
        sed -i '' -e '$ d' $tmpdir/flagthese.txt
#        echo "deleting last line for entry $entry"
        pub=$(($pub + 1))
    else
        unpub=$(($unpub + 1))
    fi
    entry=$(($entry + 1))
done < $tmpdir/idcat.txt

At the end of this process, I have a nice text file with a list of media IDs that I can then throw through a deletion script.
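The deletion script itself isn’t shown in this post, but a sketch of it could look like the following: feed each ID from flagthese.txt to Kaltura’s media.delete action. The function name delete_media is my own, and the block assumes KALTURA_SESSION and tmpdir are already set as in the earlier scripts.

```shell
# Sketch of the deletion step (not shown in the original post).
# Reads one media entry ID per line and calls Kaltura's media.delete action.
delete_media() {
    while read entryId
    do
        curl -s -X POST https://www.kaltura.com/api_v3/service/media/action/delete \
            -d "ks=$KALTURA_SESSION" \
            -d "entryId=$entryId"
    done < "$1"
}

# Usage: delete_media "$tmpdir/flagthese.txt"
```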

Objective two: Allow this to run on a schedule to avoid a buildup in the future
Now that the scripts were written, I could just place the script on a server to run in a cron (or it could run from my Mac, too). I have two scripts doing this work, but you could certainly combine the deletion script into the script that creates a list of media to delete.
We wanted to delete content over 180 days old, so instead of hardcoding a date value that would need updating each run, you can do something like this to define the cutoff relative to the current time (15552000 is 180 days in seconds):

curTime=$(date +%s)
time=$(($curTime - 15552000))
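To actually put it on a schedule, a crontab entry along these lines would work. The script path and log location here are placeholders of my own, not from the original setup:

```shell
# Hypothetical crontab entry: run the cleanup weekly, Sundays at 02:00
# (install with `crontab -e`; adjust the paths to wherever the scripts live)
0 2 * * 0 /usr/local/bin/kaltura-cleanup.sh >> /var/log/kaltura-cleanup.log 2>&1
```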
