yingjie@memoir
Skip to content

2026-05-15

Primary aggregate metrics to satisfy: Issue response time, PR merge efficiency, contributor activity

  • Issue response time:

    • Initial idea: Time when repo admin replied - Time when issue was created
    • Realized I might be off track – what I'm looking at now is the issue's start time and close time, which doesn't align well with the goal… though closing counts as a response too. Better to combine it with PRs – essentially the time to solve the problem.
    • So: Time of repo admin's first reply - Time when issue was created
      • Can't get interaction metadata via gitlink-cli
      • Found that many data points seem problematic. The closed_on field is empty.
      • Seems I can only calculate using created_on and due_time
      • For some newer repos, created_at and closed_on in the issue details are usable
    • Current thought: Calculate average, median?
      • 50% completed within 1 day, 10% within 7 days, 10% >7 days…
      • Compute average time
  • PR merge efficiency: ?

    • What is efficiency? Efficiency p = work w / time t
    • Specifically: Number of merged PRs / time taken
    • Both PR detail and list info lack merge time…
    • Number of merged / number of opened
    • PR's merged_at is absent; maybe I can use issue_closed_at instead?
  • Contributor activity: Number of people who submitted PRs over a period?

    • Number of people who created PRs during a given period, determined by pr_full_time
users(id, user_name)
repos(id, repo_name)
issues(repo_id, creater_id, processor_id, create_time, close_time)
pulls(repo_id, creater_id, processor_id, create_time, close_time)

Thoughts

I discovered that gitlink-cli can output in table format, and I'm thinking whether I can store the data directly into a database. But I feel this might not effectively reduce complexity, because the data to be analyzed is limited; this would introduce other irrelevant data and increase overhead during ingestion.

I plan to first design the tables so that the target data goes into the corresponding fields, then just do the analysis. For example, for Issue response, the focus is on time – create a few time fields and store the relevant content from the JSON.

During the process, I've found that a lot of data is incomplete. What to do? Can't just give up. I think the reliable approach is to keep pushing forward, to explore the infrastructure, the methods, and the workflow. Even if the conditions aren't right now, they will be in the future.

Should I do queries? i.e., when does data cleaning happen? When extracting data, I can use queries to narrow the scope (discovered that neither the API docs nor CLI docs mention query parameters). Or I can define the scope during analysis.

At this point, I've basically determined the available data and the aggregate metrics that can be calculated. The next step is to design the database table structure.

Current tasks

  • Based on the aggregate metrics proposed officially, figure out what atomic data is needed
  • Build data analysis infrastructure
    • Collection scripts
    • Transformation
    • Store to SQLite

Conclusion

  • Issue response time
    • Get closed_on, created_at from issue details
    • Issue processing time: closed_on - created_at
  • PR merge efficiency
    • Get search_count, merged_issue_size, close_count from PR list
    • Merge rate: merged_issue_size / search_count
  • Contributor activity
    • Number of PRs submitted over a period

Prompt

prompt
Are my judgments accurate and my design feasible? I think only four tables – users, repos, issues, pulls – are needed to support the following tasks.

The approximate logic for the four tables is as follows:
users(id PK, user_name)

repos(id PK, repo_name, owner(users.id))

issues(id PK, repo_id(repos.id), number, creater_id(users.id), 
processor_id(users.id), create_time, close_time, status('close', 'open'))

pulls(id PK, repo_id(repos.id), number, creater_id(users.id), status('merged', 'close', 'open'), 
processor_id(users.id), create_time, close_time)

In the issues and pulls tables, `processor` means the person who finally completed the PR or Issue. In practice, the returned target field may not be called `processor`.

Task 1: Calculate issue resolution time. Use `gitlink-cli issue +view -owner OWNER -repo REPO -i ISSUE_NUMBER` to get issue details, which include `closed_on` and `created_at`. Compute the completion time of an issue by subtracting `created_at` from `closed_on`.
Task 2: Calculate PR merge rate. Use `gitlink-cli pr +list -owner OWNER -repo REPO` to get the target repo's PR list. The returned data includes `search_count`, `merged_issue_size`, `close_count`. Compute the PR merge rate as `merged_issue_size` divided by `search_count`.
Task 3: Calculate contributor activity by counting the number of PRs submitted over a period.

Note: For fetching PRs and Issues, you can use `--page` to control the page number and `--size` to control the number per page. The goal is to fetch all data from the repo.

Studying benchmark accounts

I found a blogger called "白日梦想家" (Daydreamer) with great Douyin stats, though I think I got to know him through "萝卜乔乔" (Radish Qiaoqiao). I've skimmed some analyses.

  • Some say he creates a down-to-earth persona, making friends with the audience.
  • Contrast? The title is "How to Splurge," but the videos show a very frugal – even homeless – daily life.
  • He speaks in a low voice, forcing the audience to focus. Radish Qiaoqiao's rapid-fire opening narration might serve a similar purpose.

Thoughts

I've already returned to China after my overseas studies. How to shoot? How to edit? I'm thinking of making a memoir – using accumulated footage, past VLOGs, and adding a contrast between my current thoughts and those from back then.